Concatenate Columns in Apache Spark Dataframe

Scala (Spark) concatenate Columns in a dataframe

According to the Spark documentation, the signature of the concat function is concat(col1, col2, ..., colN). Given that your list contains the column names (e.g. c1, c2, ..., cN), map converts each of them into a Column object using the col function. Finally, _* unpacks the (now Column) list items as the arguments of concat, much like Python's * operator does.
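For comparison, the PySpark analog of this pattern unpacks the list with Python's * operator directly (a minimal sketch; df and the column names c1, c2, c3 are placeholders):

from pyspark.sql.functions import concat, col

cols = ['c1', 'c2', 'c3']  # the list of column names
df.select(concat(*[col(c) for c in cols]).alias('joined'))
# 'joined' holds the values of c1, c2 and c3 glued together with no separator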

Concatenate Columns with specific number of registers using Scala

I think your solution is correct. Pass truncate = false to the show method (e.g. show(false), or show(100, false) as below) if you are checking the result visually in the spark-shell; otherwise the padded values are cut off.

Here is test DF data:

scala> df.show
+---------+--------+
|FirstName|LastName|
+---------+--------+
|     John|   Smith|
|    James|    Bond|
+---------+--------+

Now if we print the concatenation result and add its length as a separate column, we get exactly the sum of the padded lengths (50 + 50 = 100):

scala> df
.select(concat(rpad($"FirstName", 50, " "), rpad($"LastName", 50, " "))
.as("concat"))
.withColumn("length", length(col("concat")))
.show(100, false)
+----------------------------------------------------------------------------------------------------+------+
|concat |length|
+----------------------------------------------------------------------------------------------------+------+
|John Smith |100 |
|James Bond |100 |
+----------------------------------------------------------------------------------------------------+------+

Concatenate columns in dataframe and order them based on number

Use array_sort and array. If you want to remove duplicates, use dropDuplicates.

df.show()
+---+---+
|id1|id2|
+---+---+
|111|222|
|111|112|
|222|111|
+---+---+

import pyspark.sql.functions as F
df2 = df.withColumn('ids', F.array_sort(F.array('id1', 'id2')))
df2.show()
+---+---+----------+
|id1|id2| ids|
+---+---+----------+
|111|222|[111, 222]|
|111|112|[111, 112]|
|222|111|[111, 222]|
+---+---+----------+

df3 = df2.dropDuplicates(['ids'])
df3.show()
+---+---+----------+
|id1|id2| ids|
+---+---+----------+
|111|222|[111, 222]|
|111|112|[111, 112]|
+---+---+----------+

How can I concatenate two string columns into one column in Spark (Python)

Just use the concat function:

sqlContext.sql("SELECT playerID, concat(nameFirst, nameLast) as full_name FROM Master")
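If you prefer the DataFrame API over SQL, the equivalent uses pyspark.sql.functions.concat (a sketch; master_df stands for the Master table loaded as a DataFrame):

from pyspark.sql import functions as F

master_df.select('playerID', F.concat('nameFirst', 'nameLast').alias('full_name'))
# add F.lit(' ') between the two columns if you want a space in the full name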

Concatenate columns and select some columns in a PySpark data frame

All you have to do is get the list of column names that end with _serial:

serialCols = [x for x in df.columns if str(x).endswith('_serial')]

Then use the built-in concat_ws function to concatenate the column values in the select expression:

from pyspark.sql import functions as f

df.select(
    df['id'],
    f.concat_ws('', *serialCols).alias('serial_number'),
    df['name'],
    df['city']
).show(truncate=False)

Here I have used an empty string as the separator to concatenate the values.

So the above code should give you:

+---+-------------+----+----+
|id |serial_number|name|city|
+---+-------------+----+----+
|1 |123_abc |sam |NY |
|2 |abc_123 |Tim |NJ |
|3 | |Jim |SFO |
+---+-------------+----+----+

Edit: One could also use pyspark.sql.functions.concat() in place of concat_ws().
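Note the difference in null handling if you make that swap: concat returns null as soon as any input column is null, while concat_ws simply skips nulls (which is why row 3 above still gets an empty serial_number). A quick sketch on the same DataFrame:

df.select(
    f.concat_ws('', *serialCols).alias('with_concat_ws'),  # nulls are skipped
    f.concat(*serialCols).alias('with_concat')             # null if any _serial column is null
).show(truncate=False)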

Concatenate column names in a new column based on their values

concat_ws combined with a when condition does it: when(F.col(c) == value, c) yields the column name when the condition holds and null otherwise, and concat_ws simply skips the nulls, so only the matching column names end up in the result.

from pyspark.sql import functions as F

df = spark.createDataFrame(
    [(1, 0, 1),
     (2, 1, 2),
     (3, 0, 1),
     (4, 1, 1),
     (5, 2, 2),
     (6, 0, 1)],
    ['id', 'col1', 'col2']
)
cols = ['col1', 'col2']  # a list keeps the column order in the output deterministic
df = df.select(
    '*',
    F.concat_ws(', ', *[F.when(F.col(c) == 1, c) for c in cols]).alias('ones'),
    F.concat_ws(', ', *[F.when(F.col(c) == 2, c) for c in cols]).alias('twos'),
)
df.show()
# +---+----+----+----------+----------+
# | id|col1|col2| ones| twos|
# +---+----+----+----------+----------+
# | 1| 0| 1| col2| |
# | 2| 1| 2| col1| col2|
# | 3| 0| 1| col2| |
# | 4| 1| 1|col1, col2| |
# | 5| 2| 2| |col1, col2|
# | 6| 0| 1| col2| |
# +---+----+----+----------+----------+

How to concatenate multiple columns in PySpark with a separator?

It's pretty simple:

from pyspark.sql.functions import col, concat, lit

df = df.withColumn("column_join", concat(col("column_1"), lit("-"), col("column_2"), lit("-"), col("column_3")))

Use concat to concatenate all the columns with the - separator; the literal separator itself needs lit.

If it doesn't work directly, for example because a column is not a string, use cast to change the column type: col("column_1").cast("string"). A sketch follows below.
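For example, if column_3 happened to be numeric (an assumption here, just for illustration), the cast would look like this:

from pyspark.sql.functions import col, concat, lit

df = df.withColumn(
    "column_join",
    concat(col("column_1"), lit("-"), col("column_2"), lit("-"), col("column_3").cast("string"))
)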

UPDATE:

Or you can use a more dynamic approach with the built-in function concat_ws:

pyspark.sql.functions.concat_ws(sep, *cols)

Concatenates multiple input string columns together into a single string column, using the given separator.

>>> df = spark.createDataFrame([('abcd','123')], ['s', 'd'])
>>> df.select(concat_ws('-', df.s, df.d).alias('s')).collect()
[Row(s=u'abcd-123')]

Code:

from pyspark.sql.functions import col, concat_ws

concat_columns = ["column_1", "column_2", "column_3"]
df = df.withColumn("column_join", concat_ws("-", *[col(x) for x in concat_columns]))

How to create new string column in PySpark DataFrame based on values of other columns?

You can use the concat function or format_string like this:

from pyspark.sql import functions as F

df = df.withColumn(
    "New",
    F.format_string("Hey there %s %s!", "Name", "Surname")
)

df.show(truncate=False)
# +---+----+-------+-----------------------+
# |Id |Name|Surname|New |
# +---+----+-------+-----------------------+
# |1 |John|Johnson|Hey there John Johnson!|
# |2 |Anna|Maria |Hey there Anna Maria! |
# +---+----+-------+-----------------------+

If you prefer using concat:

F.concat(F.lit("Hey there "), F.col("Name"), F.lit(" "), F.col("Surname"), F.lit("!"))
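For completeness, that expression wrapped in withColumn (same DataFrame and imports as above) produces the same New column, with the caveat that concat returns null for any row where Name or Surname is null:

df = df.withColumn(
    "New",
    F.concat(F.lit("Hey there "), F.col("Name"), F.lit(" "), F.col("Surname"), F.lit("!"))
)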

