Python Pandas - Find difference between two data frames
By using drop_duplicates
pd.concat([df1,df2]).drop_duplicates(keep=False)
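A minimal runnable sketch of the one-liner above, using made-up frames that contain no duplicate rows of their own. Note that the result is a symmetric difference: a row that existed only in df2 would also be kept.

```python
import pandas as pd

# Hypothetical example frames; neither contains internal duplicates.
df1 = pd.DataFrame({"A": [1, 2, 3], "B": [2, 3, 4]})
df2 = pd.DataFrame({"A": [1], "B": [2]})

# Stack both frames, then drop every row that occurs more than once.
# keep=False removes all copies of a duplicated row, not just the extras.
diff = pd.concat([df1, df2]).drop_duplicates(keep=False)
print(diff)
```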
Update:
The above method only works for data frames that contain no duplicate rows of their own. For example:
df1=pd.DataFrame({'A':[1,2,3,3],'B':[2,3,4,4]})
df2=pd.DataFrame({'A':[1],'B':[2]})
It produces the output below, which is wrong because the duplicated row (3, 4) in df1 is dropped as well:
Wrong output:
pd.concat([df1, df2]).drop_duplicates(keep=False)
Out[655]:
   A  B
1  2  3
Correct output:
Out[656]:
   A  B
1  2  3
2  3  4
3  3  4
How to achieve that?
Method 1: Using isin
with tuple
df1[~df1.apply(tuple,1).isin(df2.apply(tuple,1))]
Out[657]:
   A  B
1  2  3
2  3  4
3  3  4
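As a self-contained sketch of Method 1, the snippet below turns each row into a tuple and keeps the rows of df1 whose tuple does not occur in df2. The names keys1, keys2 and only_in_df1 are illustrative, not part of the original answer.

```python
import pandas as pd

df1 = pd.DataFrame({"A": [1, 2, 3, 3], "B": [2, 3, 4, 4]})
df2 = pd.DataFrame({"A": [1], "B": [2]})

# One hashable tuple per row, so whole rows can be compared with isin.
keys1 = df1.apply(tuple, axis=1)
keys2 = df2.apply(tuple, axis=1)

# Keep the rows of df1 that have no equal row in df2; duplicates in df1 survive.
only_in_df1 = df1[~keys1.isin(keys2)]
print(only_in_df1)
```

Because the filter is a boolean mask on df1, the original index labels of df1 are preserved.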
Method 2: merge
with indicator
df1.merge(df2,indicator = True, how='left').loc[lambda x : x['_merge']!='both']
Out[421]:
   A  B     _merge
1  2  3  left_only
2  3  4  left_only
3  3  4  left_only
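A runnable sketch of Method 2 that also removes the helper column afterwards. It assumes both frames share the same column names, so merge joins on all of them by default.

```python
import pandas as pd

df1 = pd.DataFrame({"A": [1, 2, 3, 3], "B": [2, 3, 4, 4]})
df2 = pd.DataFrame({"A": [1], "B": [2]})

# indicator=True adds a "_merge" column: "left_only", "right_only" or "both".
merged = df1.merge(df2, how="left", indicator=True)

# Rows that found no partner in df2, with the helper column dropped again.
only_in_df1 = merged[merged["_merge"] == "left_only"].drop(columns="_merge")
print(only_in_df1)
```

Unlike the isin approach, merge builds a fresh integer index, so the labels in the result are positions in the merged frame rather than the original labels of df1.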
Comparing two dataframes and getting the differences
This approach, df1 != df2, works only for dataframes with identical rows and columns. In fact, all dataframe axes are compared with the _indexed_same method, and an exception is raised if any difference is found, even in column or index order.
If I got you right, you want to find not the changes but the symmetric difference. For that, one approach is to concatenate the dataframes:
>>> df = pd.concat([df1, df2])
>>> df = df.reset_index(drop=True)
group by
>>> df_gpby = df.groupby(list(df.columns))
get index of unique records
>>> idx = [x[0] for x in df_gpby.groups.values() if len(x) == 1]
filter
>>> df.reindex(idx)
         Date   Fruit   Num   Color
9  2013-11-25  Orange   8.6  Orange
8  2013-11-25   Apple  22.1     Red
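The steps above can be tried end to end with the sketch below. The two frames are invented for illustration and differ in a single row. One caveat to keep in mind: groupby leaves out rows with NaN in any grouping column by default, so such rows would not show up in the result.

```python
import pandas as pd

# Hypothetical frames that differ only in the "Orange" row.
df1 = pd.DataFrame({"Fruit": ["Apple", "Banana", "Orange"], "Num": [22.1, 8.6, 7.6]})
df2 = pd.DataFrame({"Fruit": ["Apple", "Banana", "Orange"], "Num": [22.1, 8.6, 8.6]})

# Concatenate and renumber so every row has a unique label.
df = pd.concat([df1, df2]).reset_index(drop=True)

# Group identical rows; a group of size 1 is a row present in only one frame.
groups = df.groupby(list(df.columns)).groups
idx = sorted(labels[0] for labels in groups.values() if len(labels) == 1)

sym_diff = df.loc[idx]
print(sym_diff)
```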
Compare two DataFrames and output their differences side-by-side
The first part is similar to Constantine's answer: you can get a boolean mask of which rows differ*:
In [21]: ne = (df1 != df2).any(axis=1)
In [22]: ne
Out[22]:
0    False
1     True
2     True
dtype: bool
Then we can see which entries have changed:
In [23]: ne_stacked = (df1 != df2).stack()
In [24]: changed = ne_stacked[ne_stacked]
In [25]: changed.index.names = ['id', 'col']
In [26]: changed
Out[26]:
id  col
1   score         True
2   isEnrolled    True
    Comment       True
dtype: bool
Here the first level is the index label and the second is the column that has changed.
In [27]: difference_locations = np.where(df1 != df2)
In [28]: changed_from = df1.values[difference_locations]
In [29]: changed_to = df2.values[difference_locations]
In [30]: pd.DataFrame({'from': changed_from, 'to': changed_to}, index=changed.index)
Out[30]:
               from           to
id col
1  score       1.11         1.21
2  isEnrolled  True        False
   Comment     None  On vacation
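The walkthrough above can be run as one piece with the sketch below. The data is hypothetical and deliberately has no missing values, because NaN and None compare as unequal to themselves and would be reported as changes.

```python
import numpy as np
import pandas as pd

# Hypothetical data: same index and columns in both frames, no missing values.
df1 = pd.DataFrame(
    {
        "score": [1.11, 4.12, 2.50],
        "isEnrolled": [False, True, True],
        "Comment": ["a", "b", "c"],
    }
)
df2 = pd.DataFrame(
    {
        "score": [1.11, 1.21, 2.50],
        "isEnrolled": [False, True, False],
        "Comment": ["a", "b", "On vacation"],
    }
)

# Boolean frame marking every cell that differs.
mask = df1 != df2

# Long format: one (row label, column name) entry per changed cell.
stacked = mask.stack()
changed = stacked[stacked]
changed.index.names = ["id", "col"]

# Positions of the changed cells, in the same row-major order as the stack.
rows, cols = np.where(mask.to_numpy())
diff = pd.DataFrame(
    {"from": df1.to_numpy()[rows, cols], "to": df2.to_numpy()[rows, cols]},
    index=changed.index,
)
print(diff)
```

In pandas 1.1 and later, df1.compare(df2) gives a comparable side-by-side view for identically labelled frames, which may be simpler if that layout is acceptable.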
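* Note: it's important that df1 and df2 share the same index here. To overcome this ambiguity, you can ensure you only look at the shared labels using df1.index.intersection(df2.index) (the older spelling, df1.index & df2.index, no longer performs a set intersection in recent pandas versions), but I think I'll leave that as an exercise.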
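One possible way to do that exercise is sketched below, assuming both frames have the same columns and unique index labels. The frames and the names shared, left and right are made up for illustration.

```python
import pandas as pd

# Hypothetical frames whose indexes only partly overlap.
df1 = pd.DataFrame({"score": [1.11, 4.12, 2.0]}, index=[111, 112, 113])
df2 = pd.DataFrame({"score": [1.11, 1.21, 9.9]}, index=[111, 112, 114])

# Labels present in both frames.
shared = df1.index.intersection(df2.index)

# Restrict both frames to the shared labels so they are identically labelled.
left = df1.loc[shared]
right = df2.loc[shared]

# Row-wise mask of differences, now safe to compute with !=.
ne = (left != right).any(axis=1)
changed_ids = ne[ne].index.tolist()
print(changed_ids)
```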
How to find differences between two dataframes of different lengths?
You could define several helper functions to adjust the length and width of the two dataframes:
import numpy as np
import pandas as pd

def equalize_length(short, long):
    # Pad the shorter frame with rows of "nan" strings until both have equal length
    return pd.concat(
        [
            short,
            pd.DataFrame(
                {
                    col: ["nan"] * (long.shape[0] - short.shape[0])
                    for col in short.columns
                }
            ),
        ]
    ).reset_index(drop=True)

def equalize_width(short, long):
    # Add the columns that exist only in the wider frame as empty columns
    return pd.concat(
        [
            short,
            pd.DataFrame({col: [] for col in long.columns if col not in short.columns}),
        ],
        axis=1,
    ).reset_index(drop=True)

def equalize(df, other_df):
    # Make both frames the same shape, then replace missing values with "nan"
    if df.shape[0] <= other_df.shape[0]:
        df = equalize_length(df, other_df)
    else:
        other_df = equalize_length(other_df, df)
    if df.shape[1] <= other_df.shape[1]:
        df = equalize_width(df, other_df)
    else:
        other_df = equalize_width(other_df, df)
    df = df.fillna("nan")
    other_df = other_df.fillna("nan")
    return df, other_df
And then, in your code:
a, b = equalize(a, b)
comparevalues = a.values == b.values
rows, cols = np.where(~comparevalues)
# Use object dtype so the string markers can be stored in numeric columns
a = a.astype(object)
for row, col in zip(rows, cols):
    # Mark each differing cell as "old --> new"
    a.iloc[row, col] = " {} --> {} ".format(a.iloc[row, col], b.iloc[row, col])
print(a)  # with 'a' being shorter in length but longer in width than 'b'
# Output
   A          B                    C                            D
0  1          abcd --> dah         jamba                        OQEWINVSKD --> nan
1  2          efgh --> fupa        refresh --> dimez            DKVLNQIOEVM --> nan
2  3          ijkl                 portobello --> pocketfresh   asdlikvn --> nan
3  4          uhyee --> danju      performancehigh --> reverbb  asdkvnddvfvfkdd --> nan
4  5          uhuh                 jackalack                    nan
5  nan --> 6  nan --> freshhhhhhh  nan --> boombackimmatouchit  nan
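As an alternative to hand-written padding helpers, the frames can be aligned with reindex on the union of their row and column labels. This is a different technique from the helpers above, sketched here on made-up data; two cells that are both missing are treated as equal.

```python
import pandas as pd

# Hypothetical frames: 'a' is shorter but wider than 'b'.
a = pd.DataFrame({"A": [1, 2], "B": ["x", "y"], "C": [10, 20]})
b = pd.DataFrame({"A": [1, 2, 3], "B": ["x", "z", "w"]})

# Union of the labels on both axes.
rows = a.index.union(b.index)
cols = a.columns.union(b.columns, sort=False)

# Give both frames the same shape; cells that did not exist become NaN.
a2 = a.reindex(index=rows, columns=cols)
b2 = b.reindex(index=rows, columns=cols)

# A cell differs unless the values are equal or both are missing.
differs = ~((a2 == b2) | (a2.isna() & b2.isna()))
print(differs)
```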
Find difference between two pandas dataframes when both contains same rows but one dataframe contains it more than once
You can add a new column to catch the duplicates:
df1['merge'] = df1.groupby(['0','1','2']).cumcount()
df2['merge'] = df2.groupby(['0','1','2']).cumcount()
pd.concat([df1,df2]).drop_duplicates(keep=False)
Afterwards you can drop the added column again, for example with .drop(columns='merge').
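A self-contained sketch of this idea follows. The frames are invented, and the helper column is called occurrence here (an illustrative name) and is added with assign so the input frames are not modified.

```python
import pandas as pd

# Hypothetical frames: df1 contains the row ("a", 1, True) twice, df2 once.
df1 = pd.DataFrame({"0": ["a", "a", "b"], "1": [1, 1, 2], "2": [True, True, False]})
df2 = pd.DataFrame({"0": ["a", "b"], "1": [1, 2], "2": [True, False]})

keys = ["0", "1", "2"]

# Number repeated rows 0, 1, 2, ... so the n-th copy only matches the n-th copy.
left = df1.assign(occurrence=df1.groupby(keys).cumcount())
right = df2.assign(occurrence=df2.groupby(keys).cumcount())

# Drop every row that appears in both frames, then remove the helper column.
diff = (
    pd.concat([left, right])
    .drop_duplicates(keep=False)
    .drop(columns="occurrence")
)
print(diff)
```

Only the surplus copy of the repeated row remains, which is the difference the plain drop_duplicates(keep=False) call would have lost.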