Scala (Spark) concatenate Columns in a dataframe
According to the Spark documentation, the signature of the concat function is concat(col1, col2, ..., colN). Given that your list contains the column names, i.e. c1, c2, ..., cN, map will convert each of them into a Column object; the conversion is done with the col function. Finally, _* unpacks the (converted to Column) list items into separate arguments for concat, similarly to how Python's * operator works.
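Since the answer compares _* to Python's * operator, here is a pure-Python analogy of that unpacking step. This is an illustration only, not Spark code; the concat function and the column names are stand-ins.

```python
# Pure-Python analogy of Scala's `concat(list.map(col): _*)`:
# `*` unpacks a list into separate positional arguments.
def concat(*parts):
    # Accepts any number of arguments, like Spark's concat(col1, ..., colN).
    return "".join(parts)

cols = ["c1", "c2", "c3"]    # hypothetical list of column names
result = concat(*cols)       # equivalent to calling concat("c1", "c2", "c3")
print(result)                # c1c2c3
```

The point is the same in both languages: the function takes a variable number of arguments, and the unpacking operator turns a single list into that argument sequence.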
Concatenate Columns with specific number of registers using Scala
I think your solution is correct. If you are using spark-shell to check the result visually, disable truncation in the show method (e.g. show(false)), otherwise the padded strings will be cut off. Here is the test DF data:
scala> df.show
+---------+--------+
|FirstName|LastName|
+---------+--------+
|     John|   Smith|
|    James|    Bond|
+---------+--------+
Now, if we print the concatenation result and add the length as a separate column, we get exactly the sum of the two padded widths (50 + 50 = 100):
scala> df
.select(concat(rpad($"FirstName", 50, " "), rpad($"LastName", 50, " "))
.as("concat"))
.withColumn("length", length(col("concat")))
.show(100, false)
+----------------------------------------------------------------------------------------------------+------+
|concat                                                                                              |length|
+----------------------------------------------------------------------------------------------------+------+
|John                                              Smith                                             |100   |
|James                                             Bond                                              |100   |
+----------------------------------------------------------------------------------------------------+------+
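To see why every row is exactly 100 characters, here is the same padding logic in plain Python (an illustration, not Spark code): str.ljust plays the role of rpad, and the sample values come from the DF above.

```python
# Plain-Python sketch of rpad($"FirstName", 50, " ") followed by concat.
first, last = "John", "Smith"                  # sample row from the DF above
combined = first.ljust(50) + last.ljust(50)    # right-pad each part to 50 chars
print(len(combined))                           # 100: always 50 + 50
```

Because each part is padded to a fixed width before concatenation, the total length is independent of the original name lengths.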
Concatenate columns in dataframe and order them based on number
Use array_sort and array. If you want to remove duplicates, use dropDuplicates.
df.show()
+---+---+
|id1|id2|
+---+---+
|111|222|
|111|112|
|222|111|
+---+---+
import pyspark.sql.functions as F
df2 = df.withColumn('ids', F.array_sort(F.array('id1', 'id2')))
df2.show()
+---+---+----------+
|id1|id2|       ids|
+---+---+----------+
|111|222|[111, 222]|
|111|112|[111, 112]|
|222|111|[111, 222]|
+---+---+----------+
df3 = df2.dropDuplicates(['ids'])
df3.show()
+---+---+----------+
|id1|id2|       ids|
+---+---+----------+
|111|222|[111, 222]|
|111|112|[111, 112]|
+---+---+----------+
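The array_sort + dropDuplicates trick boils down to deduplicating rows by their order-insensitive key. A plain-Python sketch of that idea (not Spark code; same data as the DF above):

```python
rows = [(111, 222), (111, 112), (222, 111)]   # same pairs as the DF above

seen, deduped = set(), []
for row in rows:
    key = tuple(sorted(row))    # order-insensitive key, like array_sort(array(...))
    if key not in seen:         # keep the first row per key, like dropDuplicates
        seen.add(key)
        deduped.append(row)

print(deduped)                  # [(111, 222), (111, 112)]
```

(222, 111) is dropped because its sorted key (111, 222) was already seen, which is exactly why df3 keeps only two rows.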
How can I concatenate two string columns into one column in Spark Python
Just use the concat function:
sqlContext.sql("SELECT playerID, concat(nameFirst, nameLast) as full_name FROM Master")
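Note that concat joins the strings with no separator. A plain-Python sketch of the per-row result (the name values are hypothetical):

```python
name_first, name_last = "Hank", "Aaron"   # hypothetical row values
full_name = name_first + name_last        # concat(nameFirst, nameLast): no separator
print(full_name)                          # HankAaron
```

If you want a space between the parts, add a literal in the SQL: concat(nameFirst, ' ', nameLast).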
concatenate columns and selecting some columns in Pyspark data frame
All you have to do is get the list of column names that end with _serial
serialCols = [x for x in df.columns if str(x).endswith('_serial')]
Then pass it to the built-in concat_ws function to concatenate the column values in a select expression:
from pyspark.sql import functions as f
df.select(
df['id'],
f.concat_ws('', *serialCols).alias('serial_number'),
df['name'],
df['city']
).show(truncate=False)
Here I have used an empty string as the separator, so the above code should give you:
+---+-------------+----+----+
|id |serial_number|name|city|
+---+-------------+----+----+
|1  |123_abc      |sam |NY  |
|2  |abc_123      |Tim |NJ  |
|3  |             |Jim |SFO |
+---+-------------+----+----+
Edit: One could also use pyspark.sql.functions.concat() in place of concat_ws().
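The practical difference between the two shows up with nulls: concat returns null if any input is null, whereas concat_ws skips null values, which is why row 3 above comes out as an empty string rather than null. A simplified plain-Python sketch of the concat_ws behaviour (not the real implementation):

```python
def concat_ws(sep, *vals):
    # Join only the non-null values, as Spark's concat_ws does.
    return sep.join(str(v) for v in vals if v is not None)

print(concat_ws('', '123', '_', 'abc'))   # 123_abc
print(concat_ws('', None, None))          # empty string: all nulls are skipped
```

With concat-like semantics, the second call would instead produce null.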
Concatenate column names in a new column based on their values
concat_ws with a smart when condition may do it.
from pyspark.sql import functions as F
df = spark.createDataFrame(
[(1, 0, 1),
(2, 1, 2),
(3, 0, 1),
(4, 1, 1),
(5, 2, 2),
(6, 0, 1)],
['id', 'col1', 'col2']
)
df = df.select(
'*',
    F.concat_ws(', ', *[F.when(F.col(c) == 1, c) for c in ['col1', 'col2']]).alias('ones'),
    F.concat_ws(', ', *[F.when(F.col(c) == 2, c) for c in ['col1', 'col2']]).alias('twos'),
)
df.show()
# +---+----+----+----------+----------+
# | id|col1|col2|      ones|      twos|
# +---+----+----+----------+----------+
# |  1|   0|   1|      col2|          |
# |  2|   1|   2|      col1|      col2|
# |  3|   0|   1|      col2|          |
# |  4|   1|   1|col1, col2|          |
# |  5|   2|   2|          |col1, col2|
# |  6|   0|   1|      col2|          |
# +---+----+----+----------+----------+
How to concatenate multiple columns in PySpark with a separator?
It's pretty simple:
from pyspark.sql.functions import col, concat, lit
df = df.withColumn("column_join", concat(col("column_1"), lit("-"), col("column_2"), lit("-"), col("column_3")))
Use concat to concatenate all the columns with the - separator, for which you will need to use lit.
If it doesn't work directly, you can use cast to change the column types to string: col("column_1").cast("string")
UPDATE:
Or you can use a more dynamic approach with the built-in function concat_ws:
pyspark.sql.functions.concat_ws(sep, *cols)
Concatenates multiple input string columns together into a single string column, using the given separator.
>>> df = spark.createDataFrame([('abcd','123')], ['s', 'd'])
>>> df.select(concat_ws('-', df.s, df.d).alias('s')).collect()
[Row(s=u'abcd-123')]
Code:
from pyspark.sql.functions import col, concat_ws
concat_columns = ["column_1", "column_2", "column_3"]
df = df.withColumn("column_join", concat_ws("-", *[col(x) for x in concat_columns]))
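For a row with mixed types, concat_ws effectively casts each value to string before joining with the separator. A plain-Python sketch of the per-row result (the values are hypothetical):

```python
values = ["abcd", 123, 4.5]                 # hypothetical mixed-type column values
joined = "-".join(str(v) for v in values)   # cast each to string, join with "-"
print(joined)                               # abcd-123-4.5
```

This is why the dynamic approach usually works without the explicit cast("string") mentioned above.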
How to create new string column in PySpark DataFrame based on values of other columns?
You can use the concat function or format_string like this:
from pyspark.sql import functions as F
df = df.withColumn(
"New",
F.format_string("Hey there %s %s!", "Name", "Surname")
)
df.show(truncate=False)
# +---+----+-------+-----------------------+
# |Id |Name|Surname|New                    |
# +---+----+-------+-----------------------+
# |1  |John|Johnson|Hey there John Johnson!|
# |2  |Anna|Maria  |Hey there Anna Maria!  |
# +---+----+-------+-----------------------+
If you prefer using concat:
F.concat(F.lit("Hey there "), F.col("Name"), F.lit(" "), F.col("Surname"), F.lit("!"))
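format_string follows printf-style semantics, so the per-row result matches Python's % formatting. A plain-Python sketch (the values are a sample row from the DF above):

```python
name, surname = "John", "Johnson"             # sample row from the DF above
new = "Hey there %s %s!" % (name, surname)    # same template as format_string
print(new)                                    # Hey there John Johnson!
```

The concat variant builds the same string piece by piece from literals and columns; format_string keeps the template in one place, which is often easier to read.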