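PySpark: How to concat two dataframes in PySpark
To place the columns of two single-row dataframes side by side, perform a crossJoin between them.
A crossJoin returns the Cartesian product of the rows, so it yields exactly one combined row here only because each dataframe has a single row.
For example: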
from pyspark.sql import Row
df1 = spark.createDataFrame([Row(NBB1 = 776)])
df1.show()
#Output
+----+
|NBB1|
+----+
| 776|
+----+
df2 = spark.createDataFrame([Row(NBB2 = 4867)])
df2.show()
#Output
+----+
|NBB2|
+----+
|4867|
+----+
df1.crossJoin(df2).show()
#Output
+----+----+
|NBB1|NBB2|
+----+----+
| 776|4867|
+----+----+
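How to merge several dataframes column-wise in PySpark?
Rename the value column in each dataframe so the names do not collide, then reduce the list of dataframes with a join on the shared key columns: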
df_1 = spark.createDataFrame([[1, '2018-10-10', 3]], ['id', 'date', 'value'])
df_2 = spark.createDataFrame([[1, '2018-10-10', 3], [2, '2018-10-10', 4]], ['id', 'date', 'value'])
df_3 = spark.createDataFrame([[1, '2018-10-10', 3], [2, '2018-10-10', 4]], ['id', 'date', 'value'])
from functools import reduce
# list of data frames / tables
dfs = [df_1, df_2, df_3]
# rename value column
dfs_renamed = [df.selectExpr('id', 'date', f'value as value_{i}') for i, df in enumerate(dfs)]
# reduce the list of data frames with inner join
reduce(lambda x, y: x.join(y, ['id', 'date'], how='inner'), dfs_renamed).show()
#Output
+---+----------+-------+-------+-------+
| id|      date|value_0|value_1|value_2|
+---+----------+-------+-------+-------+
|  1|2018-10-10|      3|      3|      3|
+---+----------+-------+-------+-------+
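PySpark: How to left merge dataframes
Pass 'left_outer' (or its alias 'left') as the join type. Every row of df1 is kept, and the df2 columns are null where no key matches:
df = df1.join(df2, df1.lkey == df2.rkey, 'left_outer')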