Spark DataFrame: distinguish columns with duplicated names
I would recommend that you change the column names before your join:
df1.select(col("a") as "df1_a", col("f") as "df1_f")
  .join(df2.select(col("a") as "df2_a", col("f") as "df2_f"), col("df1_a") === col("df2_a"))
The resulting DataFrame will have the schema (df1_a, df1_f, df2_a, df2_f).
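The rename-before-join idea can be previewed in plain Python, without a Spark session; `prefixed` is a hypothetical helper that just shows the old-to-new name pairs the `select` above produces:

```python
def prefixed(names, prefix):
    # Build (old, new) rename pairs, e.g. a -> df1_a, f -> df1_f
    return [(n, prefix + "_" + n) for n in names]

print(prefixed(["a", "f"], "df1"))  # [('a', 'df1_a'), ('f', 'df1_f')]
print(prefixed(["a", "f"], "df2"))  # [('a', 'df2_a'), ('f', 'df2_f')]
```

Because each side gets a distinct prefix, every name in the joined schema is unique and can be referenced without ambiguity.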
How to resolve duplicate column names while joining two dataframes in PySpark?
There is no shortcut here. PySpark expects the left and right DataFrames to have distinct sets of field names (with the exception of the join key).
One solution would be to prefix each field name with either "left_" or "right_", as follows:
# Obtain column lists
left_cols = df.columns
right_cols = df2.columns
# Prefix each dataframe's fields with "left_" or "right_"
df = df.selectExpr([col + ' as left_' + col for col in left_cols])
df2 = df2.selectExpr([col + ' as right_' + col for col in right_cols])
# Perform the join on the now-prefixed key column
df3 = df.join(df2, df.left_c_0 == df2.right_c_0)
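The expressions those comprehensions hand to `selectExpr` can be previewed without Spark; the column names below are hypothetical stand-ins for `df.columns` and `df2.columns`:

```python
# Hypothetical column lists standing in for df.columns / df2.columns
left_cols = ["c_0", "name", "value"]
right_cols = ["c_0", "score"]

# Same comprehensions as in the answer above
left_exprs = [col + ' as left_' + col for col in left_cols]
right_exprs = [col + ' as right_' + col for col in right_cols]

print(left_exprs)   # ['c_0 as left_c_0', 'name as left_name', 'value as left_value']
print(right_exprs)  # ['c_0 as right_c_0', 'score as right_score']
```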
Pyspark: Reference is ambiguous when joining dataframes on same column
You should rename the duplicate column before joining:
compare_num_avails_inv = (
    avails_ns.alias('avails_ns').join(
        alloc_ns.withColumnRenamed('RetailUnit', 'RetailUnitNs').alias('alloc_ns'),
        (F.col('avails_ns.BreakDateTime') == F.col('alloc_ns.AllocationDateTime'))
        & (F.col('avails_ns.RetailUnit') == F.col('alloc_ns.RetailUnitNs')),
        how='left'
    )
    .fillna({'allocs_sum': 0})
    .withColumn('diff', F.col('avails_sum') - F.col('allocs_sum'))
)
This way you don't need to drop the column in case you still need it afterwards.
How to merge duplicate columns in pyspark?
Edited to answer the OP's request to coalesce from a list.
Here's a reproducible example:
import pyspark.sql.functions as F

df = spark.createDataFrame([
    ("z", "a", None, None),
    ("b", None, "c", None),
    ("c", "b", None, None),
    ("d", None, None, "z"),
], ["a", "c", "c", "c"])
df.show()

# Fix duplicated column names by suffixing repeats with a running index
old_col = df.schema.names
running_list = []
new_col = []
i = 0
for column in old_col:
    if column in running_list:
        new_col.append(column + "_" + str(i))
        i += 1
    else:
        new_col.append(column)
        running_list.append(column)
print(new_col)
df1 = df.toDF(*new_col)

# Coalesce the (renamed) duplicate columns into a single column
a = ['c', 'c_0', 'c_1']
to_drop = ['c_0', 'c_1']
b = [df1[col] for col in a]
df_merged = df1.withColumn('c', F.coalesce(*b)).drop(*to_drop)
df_merged.show()
Output:
+---+----+----+----+
| a| c| c| c|
+---+----+----+----+
| z| a|null|null|
| b|null| c|null|
| c| b|null|null|
| d|null|null| z|
+---+----+----+----+
['a', 'c', 'c_0', 'c_1']
+---+---+
| a| c|
+---+---+
| z| a|
| b| c|
| c| b|
| d| z|
+---+---+
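The renaming loop above can be packaged as a small pure-Python helper (no Spark required), handy for checking what `toDF` will receive before running the job; `dedupe_columns` is a hypothetical name:

```python
def dedupe_columns(cols):
    """Suffix repeated names with a running index, as in the loop above."""
    seen = []
    out = []
    i = 0
    for c in cols:
        if c in seen:
            out.append(c + "_" + str(i))
            i += 1
        else:
            out.append(c)
            seen.append(c)
    return out

print(dedupe_columns(["a", "c", "c", "c"]))  # ['a', 'c', 'c_0', 'c_1']
```

Note that, as in the original loop, the index is shared across all duplicated names rather than counted per name; that matches the output shown above.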
How to rename duplicated columns after join?
If you are trying to rename the status column of the bb_df dataframe, you can do so while joining:
result_df = aa_df.join(bb_df.withColumnRenamed('status', 'user_status'),'id', 'left').join(cc_df, 'id', 'left')
PySpark dataframe: working with duplicated column names after self join
In cases where there are duplicated column names, use aliases on your DataFrames to avoid the ambiguity:
a = df1.alias('l').join(df2.alias('r'), on='a', how='full').select('l.f', 'r.f').collect()
print(a)
#[Row(f=3, f=3), Row(f=None, f=2), Row(f=2, f=None)]