Concatenate Columns in Apache Spark Dataframe

Scala (Spark) concatenate Columns in a dataframe

According to the Spark documentation, the signature of the concat function is concat(col1, col2, ..., colN). Given that your list contains the column names (e.g. c1, c2, ..., cN), map converts each of them into a Column object using the col function. Finally, _* unpacks the (now Column) list items as the arguments of concat, much like Python's * operator does.
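For comparison, the PySpark analog of this pattern unpacks the list with Python's * operator directly (a minimal sketch; df and the column names c1, c2, c3 are placeholders):

from pyspark.sql.functions import concat, col

cols = ['c1', 'c2', 'c3']  # the list of column names
df.select(concat(*[col(c) for c in cols]).alias('joined'))
# 'joined' holds the values of c1, c2 and c3 glued together with no separator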

Concatenate Columns with specific number of registers using Scala

I think your solution is correct. Pass truncate = false to the show method (e.g. show(false), or show(100, false) as below) if you are checking the result visually in the spark-shell; otherwise the padded values are cut off.

Here is test DF data:

scala> df.show
+---------+--------+
|FirstName|LastName|
+---------+--------+
|     John|   Smith|
|    James|    Bond|
+---------+--------+

Now if we print the concatenation result and add its length as a separate column, we get exactly the sum of the padded lengths (50 + 50 = 100):

scala> df
.select(concat(rpad($"FirstName", 50, " "), rpad($"LastName", 50, " "))
.as("concat"))
.withColumn("length", length(col("concat")))
.show(100, false)
+----------------------------------------------------------------------------------------------------+------+
|concat |length|
+----------------------------------------------------------------------------------------------------+------+
|John Smith |100 |
|James Bond |100 |
+----------------------------------------------------------------------------------------------------+------+

Concatenate columns in dataframe and order them based on number

Use array_sort and array. If you want to remove duplicates, use dropDuplicates.

df.show()
+---+---+
|id1|id2|
+---+---+
|111|222|
|111|112|
|222|111|
+---+---+

import pyspark.sql.functions as F
df2 = df.withColumn('ids', F.array_sort(F.array('id1', 'id2')))
df2.show()
+---+---+----------+
|id1|id2| ids|
+---+---+----------+
|111|222|[111, 222]|
|111|112|[111, 112]|
|222|111|[111, 222]|
+---+---+----------+

df3 = df2.dropDuplicates(['ids'])
df3.show()
+---+---+----------+
|id1|id2| ids|
+---+---+----------+
|111|222|[111, 222]|
|111|112|[111, 112]|
+---+---+----------+

How can I concatenate two string columns into one column in Spark (Python)

Just use the concat function:

sqlContext.sql("SELECT playerID, concat(nameFirst, nameLast) as full_name FROM Master")
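If you prefer the DataFrame API over SQL, the equivalent uses pyspark.sql.functions.concat (a sketch; master_df stands for the Master table loaded as a DataFrame):

from pyspark.sql import functions as F

master_df.select('playerID', F.concat('nameFirst', 'nameLast').alias('full_name'))
# add F.lit(' ') between the two columns if you want a space in the full name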

Concatenate columns and select some columns in a PySpark data frame

All you have to do is get the list of column names that end with _serial:

serialCols = [x for x in df.columns if str(x).endswith('_serial')]

Then use the built-in concat_ws function to concatenate the column values in the select expression:

from pyspark.sql import functions as f

df.select(
    df['id'],
    f.concat_ws('', *serialCols).alias('serial_number'),
    df['name'],
    df['city']
).show(truncate=False)

Here I have used an empty string as the separator to concatenate the values.

So the above code should give you:

+---+-------------+----+----+
|id |serial_number|name|city|
+---+-------------+----+----+
|1 |123_abc |sam |NY |
|2 |abc_123 |Tim |NJ |
|3 | |Jim |SFO |
+---+-------------+----+----+

Edit: One could also use pyspark.sql.functions.concat() in place of concat_ws().
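Note the difference in null handling if you make that swap: concat returns null as soon as any input column is null, while concat_ws simply skips nulls (which is why row 3 above still gets an empty serial_number). A quick sketch on the same DataFrame:

df.select(
    f.concat_ws('', *serialCols).alias('with_concat_ws'),  # nulls are skipped
    f.concat(*serialCols).alias('with_concat')             # null if any _serial column is null
).show(truncate=False)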

Concatenate column names in a new column based on their values

concat_ws combined with a when condition does it: when(F.col(c) == value, c) yields the column name when the condition holds and null otherwise, and concat_ws simply skips the nulls, so only the matching column names end up in the result.

from pyspark.sql import functions as F

df = spark.createDataFrame(
    [(1, 0, 1),
     (2, 1, 2),
     (3, 0, 1),
     (4, 1, 1),
     (5, 2, 2),
     (6, 0, 1)],
    ['id', 'col1', 'col2']
)
cols = ['col1', 'col2']  # a list keeps the column order in the output deterministic
df = df.select(
    '*',
    F.concat_ws(', ', *[F.when(F.col(c) == 1, c) for c in cols]).alias('ones'),
    F.concat_ws(', ', *[F.when(F.col(c) == 2, c) for c in cols]).alias('twos'),
)
df.show()
# +---+----+----+----------+----------+
# | id|col1|col2| ones| twos|
# +---+----+----+----------+----------+
# | 1| 0| 1| col2| |
# | 2| 1| 2| col1| col2|
# | 3| 0| 1| col2| |
# | 4| 1| 1|col1, col2| |
# | 5| 2| 2| |col1, col2|
# | 6| 0| 1| col2| |
# +---+----+----+----------+----------+

How to concatenate multiple columns in PySpark with a separator?

It's pretty simple:

from pyspark.sql.functions import col, concat, lit

df = df.withColumn("column_join", concat(col("column_1"), lit("-"), col("column_2"), lit("-"), col("column_3")))

Use concat to concatenate all the columns with the - separator; the literal separator itself needs lit.

If it doesn't work directly, for example because a column is not a string, use cast to change the column type: col("column_1").cast("string"). A sketch follows below.
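For example, if column_3 happened to be numeric (an assumption here, just for illustration), the cast would look like this:

from pyspark.sql.functions import col, concat, lit

df = df.withColumn(
    "column_join",
    concat(col("column_1"), lit("-"), col("column_2"), lit("-"), col("column_3").cast("string"))
)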

UPDATE:

Or you can use a more dynamic approach with the built-in function concat_ws:

pyspark.sql.functions.concat_ws(sep, *cols)

Concatenates multiple input string columns together into a single string column, using the given separator.

>>> df = spark.createDataFrame([('abcd','123')], ['s', 'd'])
>>> df.select(concat_ws('-', df.s, df.d).alias('s')).collect()
[Row(s=u'abcd-123')]

Code:

from pyspark.sql.functions import col, concat_ws

concat_columns = ["column_1", "column_2", "column_3"]
df = df.withColumn("column_join", concat_ws("-", *[col(x) for x in concat_columns]))

How to create new string column in PySpark DataFrame based on values of other columns?

You can use the concat function or format_string like this:

from pyspark.sql import functions as F

df = df.withColumn(
    "New",
    F.format_string("Hey there %s %s!", "Name", "Surname")
)

df.show(truncate=False)
# +---+----+-------+-----------------------+
# |Id |Name|Surname|New |
# +---+----+-------+-----------------------+
# |1 |John|Johnson|Hey there John Johnson!|
# |2 |Anna|Maria |Hey there Anna Maria! |
# +---+----+-------+-----------------------+

If you prefer using concat:

F.concat(F.lit("Hey there "), F.col("Name"), F.lit(" "), F.col("Surname"), F.lit("!"))
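For completeness, that expression wrapped in withColumn (same DataFrame and imports as above) produces the same New column, with the caveat that concat returns null for any row where Name or Surname is null:

df = df.withColumn(
    "New",
    F.concat(F.lit("Hey there "), F.col("Name"), F.lit(" "), F.col("Surname"), F.lit("!"))
)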

