How to Concatenate Two Dataframes Without Duplicates

How to concatenate two dataframes without duplicates?

The simplest way is to just do the concatenation, and then drop duplicates.

>>> df1
   A  B
0  1  2
1  3  1
>>> df2
   A  B
0  5  6
1  3  1
>>> pandas.concat([df1, df2]).drop_duplicates().reset_index(drop=True)
   A  B
0  1  2
1  3  1
2  5  6

The reset_index(drop=True) call fixes up the index after concat() and drop_duplicates(). Without it you will have an index of [0, 1, 0] instead of [0, 1, 2]. Duplicate index labels can cause problems for later operations on this dataframe (for example, .loc[0] would return two rows), so it is best to reset the index right away.
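To see the index behaviour for yourself, here is a small runnable sketch that rebuilds the two frames from the session above and compares the index before and after the reset. The variable names index_before and index_after are only for illustration.

```python
import pandas as pd

# Same data as df1 and df2 in the session above.
df1 = pd.DataFrame({"A": [1, 3], "B": [2, 1]})
df2 = pd.DataFrame({"A": [5, 3], "B": [6, 1]})

# Concatenate, then drop fully identical rows (the first occurrence is kept).
combined = pd.concat([df1, df2]).drop_duplicates()
index_before = combined.index.tolist()   # original labels survive the concat

# Renumber the rows from 0 and discard the old labels.
result = combined.reset_index(drop=True)
index_after = result.index.tolist()

print(index_before, index_after)
```

Passing ignore_index=True to concat() also renumbers the rows, but it does so before the duplicates are dropped, so gaps can still remain afterwards.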

Concatenate two dataframes and drop duplicates in Pandas

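Use DataFrame.drop_duplicates after the concat to keep only the last row for each type and date pair.

This solution works if the type and date pairs are unique within each of the two DataFrames. Otherwise, repeated pairs inside a single DataFrame would be collapsed as well.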

df = (pd.concat([df1, df2], ignore_index=True, sort=False)
        .drop_duplicates(['type', 'date'], keep='last'))
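As a rough illustration of what keep='last' does here, the sketch below uses two made-up frames. The value column and all of the sample data are assumptions, not part of the original question. Because df2 comes second in the concat, its row wins whenever a type and date pair appears in both frames.

```python
import pandas as pd

# Hypothetical sample data; only the 'type' and 'date' column names come from the answer.
df1 = pd.DataFrame({"type": ["a", "b"],
                    "date": ["2020-01-01", "2020-01-01"],
                    "value": [1, 2]})
df2 = pd.DataFrame({"type": ["b", "c"],
                    "date": ["2020-01-01", "2020-01-02"],
                    "value": [20, 30]})

# Rows are compared on ('type', 'date') only; the last occurrence is kept.
df = (pd.concat([df1, df2], ignore_index=True, sort=False)
        .drop_duplicates(["type", "date"], keep="last"))

print(df)
```

Note that the resulting index has a gap where the older row was dropped. Chain .reset_index(drop=True) if a clean 0..n-1 index is needed.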

Concat/merge/join two dataframes removing duplicate rows from the 2nd dataframe based on the index

I read the pandas documentation on concat, merge, and join, as well as various blogs.

This blog was very helpful: https://www.kite.com/blog/python/pandas-merge-join-concat/. In summary, it points to the use of concat since I am trying to append two dataframes vertically.

I tried multiple variations of concat, merge, and join, but finally settled on this approach: Pandas/Python: How to concatenate two dataframes without duplicates?

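import pandas as pd

def append_non_duplicates(a, b, col=None):
    """Append to a the rows of b whose key (the index, or the column at position col) is not in a."""
    for df in (a, b):
        if df is not None and not isinstance(df, pd.DataFrame):
            raise ValueError('a and b must be of type pandas.core.frame.DataFrame.')
    if a is None:
        return b
    if b is None:
        return a
    if col is not None:
        # Compare on the column at integer position col.
        aind = a.iloc[:, col].values
        bind = b.iloc[:, col].values
    else:
        # Compare on the index.
        aind = a.index.values
        bind = b.index.values
    # Boolean mask: True for the rows of b whose key does not appear in a.
    new_keys = set(bind) - set(aind)
    take_rows = [i in new_keys for i in bind]
    # DataFrame.append was removed in pandas 2.0; pd.concat replaces it.
    return pd.concat([a, b.iloc[take_rows, :]])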

Like Daniel, I too am surprised there isn't an easier way to do this in out-of-the-box pandas.
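For the index-based case, one shorter route that seems to give the same result is to filter the second dataframe with Index.isin before concatenating. This is only a sketch on made-up data (the frames a and b and the column x are hypothetical), not a drop-in replacement for the function above.

```python
import pandas as pd

# Hypothetical frames that share the index label 'r2'.
a = pd.DataFrame({"x": [1, 2]}, index=["r1", "r2"])
b = pd.DataFrame({"x": [20, 3]}, index=["r2", "r3"])

# Keep only the rows of b whose index label does not already exist in a.
new_rows = b[~b.index.isin(a.index)]

# Stack them under a; rows from a win on shared labels.
combined = pd.concat([a, new_rows])

print(combined)
```

To compare on a column instead of the index, the same idea applies with b[~b["key"].isin(a["key"])], where "key" stands for whichever column identifies a row.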

Combining two pandas dataframes without including duplicates?

I am not sure if there is a more elegant solution, but you could concatenate the dataframes first with the duplicates then drop them afterwards.

output = pd.concat([df1, df2]).drop_duplicates()

pd.concat: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.concat.html

drop_duplicates: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop_duplicates.html


