Comparing Two Dataframes and Getting the Differences

Comparing two dataframes and getting the differences

This approach, df1 != df2, works only for dataframes with identical rows and columns. In fact, all dataframe axes are compared with the _indexed_same method, and an exception is raised if any difference is found, even in the order of columns/indices.

If I understood you correctly, you don't want to find changes, but the symmetric difference. One approach for that is to concatenate the dataframes:

>>> df = pd.concat([df1, df2])
>>> df = df.reset_index(drop=True)

group by

>>> df_gpby = df.groupby(list(df.columns))

get index of unique records

>>> idx = [x[0] for x in df_gpby.groups.values() if len(x) == 1]

filter

>>> df.reindex(idx)
         Date   Fruit   Num   Color
9  2013-11-25  Orange   8.6  Orange
8  2013-11-25   Apple  22.1     Red
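
As a self-contained sketch of the approach above (the sample frames here are made up for illustration; only the concat/groupby/reindex steps come from the answer):

import pandas as pd

# hypothetical sample data
df1 = pd.DataFrame({'Fruit': ['Apple', 'Orange', 'Banana'],
                    'Num':   [22.1, 8.6, 5.0]})
df2 = pd.DataFrame({'Fruit': ['Apple', 'Orange', 'Banana'],
                    'Num':   [20.0, 8.6, 5.0]})

# stack both frames and group by every column
df = pd.concat([df1, df2]).reset_index(drop=True)
df_gpby = df.groupby(list(df.columns))

# rows whose full value combination appears exactly once exist in only one frame
idx = [x[0] for x in df_gpby.groups.values() if len(x) == 1]
print(df.reindex(idx))

This prints the two Apple rows, one from each frame, which together form the symmetric difference.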

Compare two DataFrames and output their differences side-by-side

The first part is similar to Constantine's answer: you can get a boolean Series of which rows have changed*:

In [21]: ne = (df1 != df2).any(axis=1)

In [22]: ne
Out[22]:
0    False
1     True
2     True
dtype: bool

Then we can see which entries have changed:

In [23]: ne_stacked = (df1 != df2).stack()

In [24]: changed = ne_stacked[ne_stacked]

In [25]: changed.index.names = ['id', 'col']

In [26]: changed
Out[26]:
id  col
1   score         True
2   isEnrolled    True
    Comment       True
dtype: bool

Here the first index level is the row id and the second is the column that has changed.

In [27]: difference_locations = np.where(df1 != df2)

In [28]: changed_from = df1.values[difference_locations]

In [29]: changed_to = df2.values[difference_locations]

In [30]: pd.DataFrame({'from': changed_from, 'to': changed_to}, index=changed.index)
Out[30]:
                   from           to
id col
1  score           1.11         1.21
2  isEnrolled      True        False
   Comment         None  On vacation

* Note: it's important that df1 and df2 share the same index here. To overcome this ambiguity, you can ensure you only look at the shared labels using df1.index & df2.index, but I think I'll leave that as an exercise.
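
For reference, here is the same sequence bundled into one helper, a hypothetical wrapper that is not part of the original answer; it assumes df1 and df2 share the same index and columns:

import numpy as np
import pandas as pd

def diff_side_by_side(df1, df2):
    # which (row, column) cells differ; note that NaN != NaN also counts as a change
    ne_stacked = (df1 != df2).stack()
    changed = ne_stacked[ne_stacked]
    changed.index.names = ['id', 'col']

    # pull the old and new values at the differing positions
    difference_locations = np.where(df1 != df2)
    changed_from = df1.values[difference_locations]
    changed_to = df2.values[difference_locations]
    return pd.DataFrame({'from': changed_from, 'to': changed_to},
                        index=changed.index)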

Pandas better method to compare two dataframes and find entries that only exist in one

It looks like using 'outer' as the how argument was the solution:

z = pd.merge(ORIGINAL, NEW, on=cols, how='outer', indicator=True)
z = z[z._merge != 'both']  # Filter out records from both

The output looks like this (after showing only the columns I care about):

     Name  Site      _merge
  Charlie     A   left_only
     Doug     B  right_only
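
A runnable sketch of that merge; ORIGINAL, NEW and cols stand in for the asker's frames and key columns, and the sample data is made up to mirror the output above:

import pandas as pd

ORIGINAL = pd.DataFrame({'Name': ['Alice', 'Charlie'], 'Site': ['A', 'A']})
NEW = pd.DataFrame({'Name': ['Alice', 'Doug'], 'Site': ['A', 'B']})
cols = ['Name', 'Site']

# an outer merge keeps every row and labels its origin in the _merge column
z = pd.merge(ORIGINAL, NEW, on=cols, how='outer', indicator=True)
z = z[z['_merge'] != 'both']  # keep rows present in only one frame
print(z[['Name', 'Site', '_merge']])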

Python Pandas - Find difference between two data frames

By using drop_duplicates

pd.concat([df1,df2]).drop_duplicates(keep=False)

Update:

The above method only works for those data frames that don't already have duplicates themselves. For example:

df1 = pd.DataFrame({'A': [1, 2, 3, 3], 'B': [2, 3, 4, 4]})
df2 = pd.DataFrame({'A': [1], 'B': [2]})

It will output the result below, which is wrong.

Wrong output:

pd.concat([df1, df2]).drop_duplicates(keep=False)
Out[655]:
   A  B
1  2  3

Correct output:

Out[656]:
   A  B
1  2  3
2  3  4
3  3  4


How to achieve that?

Method 1: Using isin with tuple

df1[~df1.apply(tuple, axis=1).isin(df2.apply(tuple, axis=1))]
Out[657]:
   A  B
1  2  3
2  3  4
3  3  4

Method 2: merge with indicator

df1.merge(df2, indicator=True, how='left').loc[lambda x: x['_merge'] != 'both']
Out[421]:
   A  B     _merge
1  2  3  left_only
2  3  4  left_only
3  3  4  left_only
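
Note that both methods return only the rows of df1 that are missing from df2. If you also need the rows unique to df2, one option (a sketch, not from the original answer) is to apply the tuple/isin filter in both directions:

import pandas as pd

df1 = pd.DataFrame({'A': [1, 2, 3, 3], 'B': [2, 3, 4, 4]})
df2 = pd.DataFrame({'A': [1], 'B': [2]})

# rows of each frame whose full tuple of values never appears in the other frame
only_in_df1 = df1[~df1.apply(tuple, axis=1).isin(df2.apply(tuple, axis=1))]
only_in_df2 = df2[~df2.apply(tuple, axis=1).isin(df1.apply(tuple, axis=1))]
print(pd.concat([only_in_df1, only_in_df2]))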

Compare two DataFrames and get the differences between them as output

Assuming this input:

df1 = pd.DataFrame([['tom', 10], ['nick', 15], ['juli', 14]])
df2 = pd.DataFrame([['juli', 14], ['daniel', 15], ['tom', 10], ['tom', 10]])

You could use merge with the indicator option.

The rationale here is to create an additional column with an index per group to identify the duplicates.

cols = list(df1.columns)
(df1.assign(idx=df1.groupby(cols).cumcount())
    .merge(df2.assign(idx=df2.groupby(cols).cumcount()),
           on=list(df1.columns) + ['idx'],
           indicator=True,
           how='outer')
    .drop('idx', axis=1)
    .query('_merge != "both"')
    # .to_excel('output.xlsx')  # uncomment to export as xlsx
)

output:

        0   1      _merge
1    nick  15   left_only
3  daniel  15  right_only
4     tom  10  right_only
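
The same idea wrapped in a small helper (symmetric_diff_with_duplicates is a hypothetical name); cumcount() numbers repeated rows within each group, so duplicates only cancel out up to the number of copies present in each frame:

import pandas as pd

def symmetric_diff_with_duplicates(df1, df2):
    cols = list(df1.columns)
    return (df1.assign(idx=df1.groupby(cols).cumcount())
               .merge(df2.assign(idx=df2.groupby(cols).cumcount()),
                      on=cols + ['idx'], indicator=True, how='outer')
               .drop('idx', axis=1)
               .query('_merge != "both"'))

# e.g. with the df1/df2 defined above:
# symmetric_diff_with_duplicates(df1, df2)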

python pandas - compare two dataframes in multiple ways by custom ID

I am not sure if it is the fastest possible solution, but this problem seems to call for pd.merge. As you say, let's first deal with things that are in one dataframe but not the other:

def get_only_left(df1, df2):
    left_merge = pd.merge(df1, df2, on='ID', suffixes=('', '_other'), how='left')
    added_columns = [c + '_other' for c in df1.columns if c != 'ID']
    mask = left_merge.loc[:, added_columns].isna().all(axis=1)
    return left_merge[mask].drop(added_columns, axis=1)

pd.concat([get_only_left(prior_df, current_df), get_only_left(current_df, prior_df)])

This gives

     Date    ID  Value Category Subcategory
4  30-Nov  0005  500.0        D        D900
4  31-Dec  0006  600.0        D        D900

Then, let's deal with values that have actually changed.

columns = list(current_df.columns)
df = pd.merge(current_df, prior_df, on='ID', suffixes=('', '_prior'), how='inner')
mask = df['Value'] != df['Value_prior']
df[mask].loc[:, columns + ['Value_prior']]

This gives

     Date    ID  Value Category Subcategory  Value_prior
3  31-Dec  0004  400.0        E        E900        450.0

Then similarly:

mask = df['Category'] != df['Category_prior']
df[mask].loc[:, columns + ['Category_prior']]

gives

     Date    ID  Value Category Subcategory Category_prior
3  31-Dec  0004  400.0        E        E900              D

And finally

import numpy as np
mask = np.logical_and(df['Category'] == df['Category_prior'], df['Subcategory'] != df['Subcategory_prior'])
df[mask].loc[:, columns + ['Subcategory_prior']]

gives

     Date    ID  Value Category Subcategory Subcategory_prior
1  31-Dec  0002  200.0        B        B101              B120
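
A hypothetical generalisation of those column-by-column checks, looping over the compared columns; unlike the last step above, it flags every Subcategory change rather than only those where Category stayed the same:

import pandas as pd

def changed_by_column(current_df, prior_df, id_col='ID',
                      compare_cols=('Value', 'Category', 'Subcategory')):
    columns = list(current_df.columns)
    merged = pd.merge(current_df, prior_df, on=id_col,
                      suffixes=('', '_prior'), how='inner')
    changes = {}
    for col in compare_cols:
        # rows where the current value differs from the prior value
        mask = merged[col] != merged[col + '_prior']
        changes[col] = merged.loc[mask, columns + [col + '_prior']]
    return changes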

Comparing two pandas dataframes for differences

You also need to be careful to create a copy of the DataFrame, otherwise csvdata_old will be updated along with csvdata (since it points to the same object):

csvdata_old = csvdata.copy()

To check whether they are equal, you can use assert_frame_equal as in this answer:

from pandas.testing import assert_frame_equal  # pandas.util.testing is the old, deprecated path
assert_frame_equal(csvdata, csvdata_old)

You can wrap this in a function with something like:

try:
    assert_frame_equal(csvdata, csvdata_old)
    return True
except Exception:  # apparently AssertionError doesn't catch everything
    return False
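
For completeness, a self-contained sketch of that wrapper (dataframes_equal is a hypothetical name):

from pandas.testing import assert_frame_equal

def dataframes_equal(df_a, df_b):
    """Return True if the two frames compare equal, False otherwise."""
    try:
        assert_frame_equal(df_a, df_b)
        return True
    except Exception:  # the answer notes AssertionError alone may not catch everything
        return False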

There was discussion of a better way...

Diff of two Dataframes

Merge the 2 dfs using how='outer' and pass indicator=True; this will tell you whether each row is present in both, left only, or right only, and you can then filter the merged df afterwards:

In [22]:
merged = df1.merge(df2, indicator=True, how='outer')
merged[merged['_merge'] == 'right_only']

Out[22]:
  Buyer  Quantity      _merge
3  Carl         2  right_only
4  Mark         1  right_only
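
A runnable sketch of that merge; the Buyer/Quantity data below is made up to reproduce the output above:

import pandas as pd

df1 = pd.DataFrame({'Buyer': ['Carl', 'Mark', 'Carl'],
                    'Quantity': [18, 3, 5]})
df2 = pd.DataFrame({'Buyer': ['Carl', 'Mark', 'Carl', 'Carl', 'Mark'],
                    'Quantity': [18, 3, 5, 2, 1]})

# merging on all shared columns marks each row as both / left_only / right_only
merged = df1.merge(df2, indicator=True, how='outer')
print(merged[merged['_merge'] == 'right_only'])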

