How to concatenate two dataframes without duplicates?
The simplest way is to just do the concatenation, and then drop duplicates.
>>> df1
   A  B
0  1  2
1  3  1
>>> df2
   A  B
0  5  6
1  3  1
>>> pandas.concat([df1, df2]).drop_duplicates().reset_index(drop=True)
   A  B
0  1  2
1  3  1
2  5  6
The reset_index(drop=True) fixes up the index after the concat() and drop_duplicates(). Without it you will have an index of [0, 1, 0] instead of [0, 1, 2], which could cause problems for further operations on this dataframe down the road if it isn't reset right away.
Concatenate two dataframes and drop duplicates in Pandas
Add DataFrame.drop_duplicates to get the last row per type and date pair after concat. This solution works if the type and date pairs are unique within each DataFrame.
df = (pd.concat([df1, df2], ignore_index=True, sort=False)
        .drop_duplicates(['type', 'date'], keep='last'))
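As a runnable sketch of the keep='last' behavior (the type and date column names come from the answer; the sample values below are hypothetical):

```python
import pandas as pd

# Hypothetical sample data: the (a, 2021-01-01) pair appears in both frames.
df1 = pd.DataFrame({"type": ["a", "b"],
                    "date": ["2021-01-01", "2021-01-02"],
                    "value": [1, 2]})
df2 = pd.DataFrame({"type": ["a", "c"],
                    "date": ["2021-01-01", "2021-01-03"],
                    "value": [10, 3]})

# keep='last' keeps the df2 row when a (type, date) pair occurs in both frames.
df = (pd.concat([df1, df2], ignore_index=True, sort=False)
        .drop_duplicates(["type", "date"], keep="last"))
print(df)
```

The df2 row wins for the shared (a, 2021-01-01) pair because it comes later in the concatenated frame.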
Concat/merge/join two dataframes removing duplicate rows from the 2nd dataframe based on the index
I read the pandas documentation on concat, merge, and join, as well as various blogs.
This blog was very helpful: https://www.kite.com/blog/python/pandas-merge-join-concat/. In summary, it points to the use of concat since I am trying to append two dataframes vertically.
I tried multiple variations of concat, merge, and join, but finally settled on this approach: Pandas/Python: How to concatenate two dataframes without duplicates?
import pandas as pd

def append_non_duplicates(a, b, col=None):
    if ((a is not None and not isinstance(a, pd.DataFrame)) or
            (b is not None and not isinstance(b, pd.DataFrame))):
        raise ValueError('a and b must be of type pandas.core.frame.DataFrame.')
    if a is None:
        return b
    if b is None:
        return a
    if col is not None:
        aind = a.iloc[:, col].values
        bind = b.iloc[:, col].values
    else:
        aind = a.index.values
        bind = b.index.values
    # Keep only the rows of b whose key does not already appear in a.
    take_rows = list(set(bind) - set(aind))
    take_rows = [i in take_rows for i in bind]
    # DataFrame.append was removed in pandas 2.0; pd.concat does the same job.
    return pd.concat([a, b.iloc[take_rows, :]])
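The core idea of that function, taking only the rows of b whose index values do not already appear in a, can be exercised directly (the sample frames below are hypothetical, and pd.concat is used since DataFrame.append was removed in pandas 2.0):

```python
import pandas as pd

a = pd.DataFrame({"x": [1, 2]}, index=["r1", "r2"])
b = pd.DataFrame({"x": [20, 30]}, index=["r2", "r3"])

# Keep only rows of b whose index labels are not already in a, then stack.
new_rows = b.loc[~b.index.isin(a.index)]
combined = pd.concat([a, new_rows])
print(combined)
```

Note that, unlike concat-then-drop_duplicates, this approach keeps the row from a when both frames share a key, rather than comparing full row contents.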
Like Daniel, I too am surprised there isn't an easier way to do this in out-of-the-box pandas.
Combining two pandas dataframes without including duplicates?
I am not sure if there is a more elegant solution, but you could concatenate the dataframes first, duplicates included, and then drop the duplicates afterwards.
output = pd.concat([df1, df2]).drop_duplicates()
pd.concat: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.concat.html
drop_duplicates: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop_duplicates.html