Spark Dataframe Distinguish Columns with Duplicated Name

Spark Dataframe distinguish columns with duplicated name

I would recommend that you change the column names for your join.

df1.select(col("a") as "df1_a", col("f") as "df1_f")
  .join(df2.select(col("a") as "df2_a", col("f") as "df2_f"), col("df1_a") === col("df2_a"))

The resulting DataFrame will have schema

(df1_a, df1_f, df2_a, df2_f)

How to resolve duplicate column names while joining two dataframes in PySpark?

There is no shortcut here. PySpark expects the left and right DataFrames to have distinct sets of field names (with the exception of the join key).

One solution would be to prefix each field name with either "left_" or "right_", as follows:

# Obtain column lists
left_cols = df.columns
right_cols = df2.columns

# Prefix each DataFrame's fields with "left_" or "right_"
df = df.selectExpr([col + ' as left_' + col for col in left_cols])
df2 = df2.selectExpr([col + ' as right_' + col for col in right_cols])

# Perform the join. Note the key column 'c_0' has been prefixed too,
# so the join condition must use the new names
df3 = df.join(df2, df['left_c_0'] == df2['right_c_0'])
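The prefixing step itself is plain string manipulation and can be sketched independently of Spark. Here prefix_exprs is a hypothetical helper (not part of any PySpark API) that builds the "col as prefix_col" expressions that selectExpr consumes:

```python
def prefix_exprs(cols, prefix):
    # Build "col as prefix_col" SQL expressions for DataFrame.selectExpr
    return [f"{c} as {prefix}{c}" for c in cols]

# Columns of a hypothetical left-hand DataFrame
left_exprs = prefix_exprs(["c_0", "c_1", "c_2"], "left_")
print(left_exprs)  # ['c_0 as left_c_0', 'c_1 as left_c_1', 'c_2 as left_c_2']
```

You would then call df.selectExpr(prefix_exprs(df.columns, 'left_')) instead of writing the list comprehension inline.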

Pyspark: Reference is ambiguous when joining dataframes on same column

You should rename the duplicate column before the join. Note that withColumnRenamed takes the existing column name as a string, not a Column, and after the join it would match both duplicates, so the rename has to happen on alloc_ns first:

alloc_ns = alloc_ns.withColumnRenamed('RetailUnit', 'RetailUnitNs').alias('alloc_ns')

compare_num_avails_inv = (
    avails_ns.join(
        alloc_ns,
        (F.col('avails_ns.BreakDateTime') == F.col('alloc_ns.AllocationDateTime'))
        & (F.col('avails_ns.RetailUnit') == F.col('alloc_ns.RetailUnitNs')),
        how='left'
    )
    .fillna({'allocs_sum': 0})
    .withColumn('diff', F.col('avails_sum') - F.col('allocs_sum'))
)

This way you don't need to drop the column afterwards if you still need it.

How to merge duplicate columns in pyspark?

Edited to answer the OP's request to coalesce from a list. Here's a reproducible example:

import pyspark.sql.functions as F

df = spark.createDataFrame([
    ("z", "a", None, None),
    ("b", None, "c", None),
    ("c", "b", None, None),
    ("d", None, None, "z"),
], ["a", "c", "c", "c"])

df.show()

# fix duplicated column names
old_col = df.schema.names
running_list = []
new_col = []
i = 0
for column in old_col:
    if column in running_list:
        new_col.append(column + "_" + str(i))
        i = i + 1
    else:
        new_col.append(column)
        running_list.append(column)
print(new_col)

df1 = df.toDF(*new_col)

# coalesce the columns in the list into a single column
a = ['c', 'c_0', 'c_1']
to_drop = ['c_0', 'c_1']
b = [df1[col] for col in a]

df_merged = df1.withColumn('c', F.coalesce(*b)).drop(*to_drop)

df_merged.show()

Output:

+---+----+----+----+
| a| c| c| c|
+---+----+----+----+
| z| a|null|null|
| b|null| c|null|
| c| b|null|null|
| d|null|null| z|
+---+----+----+----+

['a', 'c', 'c_0', 'c_1']

+---+---+
| a| c|
+---+---+
| z| a|
| b| c|
| c| b|
| d| z|
+---+---+
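The two steps above, suffixing repeated names with a running counter and then taking the first non-null value per row, can be sketched in plain Python, with no Spark session required. Here dedupe_names and first_non_null are illustrative helpers, not PySpark APIs:

```python
def dedupe_names(names):
    # Mirror the renaming loop above:
    # ['a', 'c', 'c', 'c'] -> ['a', 'c', 'c_0', 'c_1']
    seen, out, i = set(), [], 0
    for name in names:
        if name in seen:
            out.append(f"{name}_{i}")
            i += 1
        else:
            out.append(name)
            seen.add(name)
    return out

def first_non_null(*values):
    # Plain-Python analogue of F.coalesce: first value that is not None
    return next((v for v in values if v is not None), None)

print(dedupe_names(["a", "c", "c", "c"]))  # ['a', 'c', 'c_0', 'c_1']
print(first_non_null(None, "c", None))     # c
```

F.coalesce applies the same first-non-null rule row by row across the renamed columns, which is why each row of the merged 'c' column above ends up with its single non-null value.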

How to rename duplicated columns after join?

If you are trying to rename the status column of the bb_df DataFrame, you can do so while joining:

result_df = aa_df.join(bb_df.withColumnRenamed('status', 'user_status'),'id', 'left').join(cc_df, 'id', 'left')

PySpark dataframe: working with duplicated column names after self join

In cases where there are duplicated column names, use aliases on your DataFrames to avoid the ambiguity:

a = df1.alias('l').join(df2.alias('r'), on='a', how='full').select('l.f', 'r.f').collect()
print(a)
#[Row(f=3, f=3), Row(f=None, f=2), Row(f=2, f=None)]

