How to Concatenate Two Dataframes Without Duplicates

How to concatenate two dataframes without duplicates?

The simplest way is to just do the concatenation, and then drop duplicates.

>>> df1
   A  B
0  1  2
1  3  1
>>> df2
   A  B
0  5  6
1  3  1
>>> pandas.concat([df1,df2]).drop_duplicates().reset_index(drop=True)
   A  B
0  1  2
1  3  1
2  5  6

The reset_index(drop=True) call repairs the index after concat() and drop_duplicates(). Without it, the result keeps the original row labels and has an index of [0,1,0] instead of [0,1,2]. The repeated labels can cause problems in later operations on this dataframe that expect a unique index, so it is safest to reset it right away.
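To make the index issue concrete, here is a small sketch that rebuilds the two frames shown above and compares the index with and without the reset. The variable names combined, clean, and relabeled are only for illustration, and the ignore_index argument of drop_duplicates assumes pandas 1.0 or later.

```python
import pandas as pd

# Same data as the df1 and df2 shown in the session above.
df1 = pd.DataFrame({"A": [1, 3], "B": [2, 1]})
df2 = pd.DataFrame({"A": [5, 3], "B": [6, 1]})

# Without a reset, the surviving rows keep their original labels.
combined = pd.concat([df1, df2]).drop_duplicates()
print(combined.index.tolist())  # [0, 1, 0]

# reset_index(drop=True) replaces the labels with 0..n-1.
clean = combined.reset_index(drop=True)
print(clean.index.tolist())  # [0, 1, 2]

# Assumed alternative (pandas >= 1.0): let drop_duplicates relabel the rows.
relabeled = pd.concat([df1, df2]).drop_duplicates(ignore_index=True)
print(relabeled.index.tolist())  # [0, 1, 2]
```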

Concatenate two dataframes and drop duplicates in Pandas

Call DataFrame.drop_duplicates after concat to keep only the last row for each type and date pair.

This solution works if the type and date pairs are unique within each of the two DataFrames.

import pandas as pd

df = (pd.concat([df1, df2], ignore_index=True, sort=False)
        .drop_duplicates(['type','date'], keep='last'))
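As a rough illustration of what keep='last' does here, the sketch below uses made-up data: the value column and all of the row contents are hypothetical and only the type and date column names come from the answer. Because df2 comes second in the concat, its row wins whenever both frames contain the same type and date pair.

```python
import pandas as pd

# Hypothetical sample data; only the 'type' and 'date' column names
# come from the answer above.
df1 = pd.DataFrame({
    "type": ["a", "b"],
    "date": ["2020-01-01", "2020-01-01"],
    "value": [10, 20],
})
df2 = pd.DataFrame({
    "type": ["b", "c"],
    "date": ["2020-01-01", "2020-01-02"],
    "value": [99, 30],
})

# ('b', '2020-01-01') appears in both frames; keep='last' keeps df2's row.
df = (pd.concat([df1, df2], ignore_index=True, sort=False)
        .drop_duplicates(["type", "date"], keep="last"))

print(df["value"].tolist())  # [10, 99, 30]
```

Note that the index of df is no longer contiguous after the drop, so a reset_index(drop=True) may still be wanted afterwards.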

Concat/merge/join two dataframes removing duplicate rows from the 2nd dataframe based on the index

I read the pandas documentation on concat, merge, and join, as well as various blogs.

This blog was very helpful: https://www.kite.com/blog/python/pandas-merge-join-concat/. In summary, it points to the use of concat since I am trying to append two dataframes vertically.

I tried multiple variations of concat, merge, and join, but finally settled on the approach described in "Pandas/Python: How to concatenate two dataframes without duplicates?":

import pandas as pd

def append_non_duplicates(a, b, col=None):
    # Append to a the rows of b whose key is not already present in a.
    for df in (a, b):
        if df is not None and not isinstance(df, pd.DataFrame):
            raise ValueError('a and b must be of type pandas.DataFrame.')
    if a is None:
        return b
    if b is None:
        return a
    # The key is the column at integer position col, or the index if col is None.
    if col is not None:
        aind = a.iloc[:, col].values
        bind = b.iloc[:, col].values
    else:
        aind = a.index.values
        bind = b.index.values
    # Boolean mask selecting the rows of b whose key does not occur in a.
    known = set(aind)
    take_rows = [i not in known for i in bind]
    # DataFrame.append was removed in pandas 2.0; pd.concat replaces it.
    return pd.concat([a, b.iloc[take_rows, :]])

Like Daniel, I too am surprised there isn't an easier way to do this in out-of-the-box pandas.
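For the index-based case specifically, a shorter variant can be sketched with Index.isin. This is an alternative illustration rather than the function from the answer, and the frames a and b with their row labels are hypothetical sample data.

```python
import pandas as pd

# Hypothetical frames that share the row label 'r2'.
a = pd.DataFrame({"x": [1, 2]}, index=["r1", "r2"])
b = pd.DataFrame({"x": [20, 3]}, index=["r2", "r3"])

# Keep only the rows of b whose index label does not already exist in a.
new_rows = b[~b.index.isin(a.index)]

# Stack them under a; a's version of 'r2' is the one that survives.
result = pd.concat([a, new_rows])

print(result.index.tolist())  # ['r1', 'r2', 'r3']
print(result["x"].tolist())   # [1, 2, 3]
```

Unlike drop_duplicates, this compares only the index labels, so a row of b is skipped even when its column values differ from the row in a that carries the same label.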

Combining two pandas dataframes without including duplicates?

I am not sure if there is a more elegant solution, but you could concatenate the dataframes first with the duplicates then drop them afterwards.

output = pd.concat([df1, df2]).drop_duplicates()
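One thing worth knowing about this approach is that drop_duplicates does not distinguish where a duplicate came from, so rows that are already repeated inside one of the inputs are collapsed as well. The small sketch below uses made-up data to show this.

```python
import pandas as pd

# Hypothetical data: df1 already contains the row (1, 2) twice,
# and df2 contains it once more.
df1 = pd.DataFrame({"A": [1, 1], "B": [2, 2]})
df2 = pd.DataFrame({"A": [1, 5], "B": [2, 6]})

output = pd.concat([df1, df2]).drop_duplicates()

# Only one copy of (1, 2) remains, including the repeat inside df1.
print(output.values.tolist())  # [[1, 2], [5, 6]]
```

If repeats within a single frame need to be preserved, filtering the second frame against the first before concatenating is the safer route.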

pd.concat: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.concat.html

drop_duplicates: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop_duplicates.html
