Comparing Two Dataframes and Getting the Differences

Comparing two dataframes and getting the differences

This approach, df1 != df2, works only for dataframes with identical rows and columns. In fact, all dataframe axes are compared with the _indexed_same method, and an exception is raised if any difference is found, even in the order of columns/indices.

If I understood you correctly, you don't want to find changes, but the symmetric difference. One approach for that is to concatenate the dataframes:

>>> df = pd.concat([df1, df2])
>>> df = df.reset_index(drop=True)

group by

>>> df_gpby = df.groupby(list(df.columns))

get index of unique records

>>> idx = [x[0] for x in df_gpby.groups.values() if len(x) == 1]

filter

>>> df.reindex(idx)
         Date   Fruit   Num   Color
9  2013-11-25  Orange   8.6  Orange
8  2013-11-25   Apple  22.1     Red
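
As a self-contained sketch of the approach above (the sample frames here are made up for illustration; only the concat/groupby/reindex steps come from the answer):

import pandas as pd

# hypothetical sample data
df1 = pd.DataFrame({'Fruit': ['Apple', 'Orange', 'Banana'],
                    'Num':   [22.1, 8.6, 5.0]})
df2 = pd.DataFrame({'Fruit': ['Apple', 'Orange', 'Banana'],
                    'Num':   [20.0, 8.6, 5.0]})

# stack both frames and group by every column
df = pd.concat([df1, df2]).reset_index(drop=True)
df_gpby = df.groupby(list(df.columns))

# rows whose full value combination appears exactly once exist in only one frame
idx = [x[0] for x in df_gpby.groups.values() if len(x) == 1]
print(df.reindex(idx))

This prints the two Apple rows, one from each frame, which together form the symmetric difference.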

Compare two DataFrames and output their differences side-by-side

The first part is similar to Constantine's answer: you can get a boolean Series of which rows have changed*:

In [21]: ne = (df1 != df2).any(axis=1)

In [22]: ne
Out[22]:
0    False
1     True
2     True
dtype: bool

Then we can see which entries have changed:

In [23]: ne_stacked = (df1 != df2).stack()

In [24]: changed = ne_stacked[ne_stacked]

In [25]: changed.index.names = ['id', 'col']

In [26]: changed
Out[26]:
id  col
1   score         True
2   isEnrolled    True
    Comment       True
dtype: bool

Here the first index level is the row id and the second is the column that has changed.

In [27]: difference_locations = np.where(df1 != df2)

In [28]: changed_from = df1.values[difference_locations]

In [29]: changed_to = df2.values[difference_locations]

In [30]: pd.DataFrame({'from': changed_from, 'to': changed_to}, index=changed.index)
Out[30]:
                   from           to
id col
1  score           1.11         1.21
2  isEnrolled      True        False
   Comment         None  On vacation

* Note: it's important that df1 and df2 share the same index here. To overcome this ambiguity, you can ensure you only look at the shared labels using df1.index & df2.index, but I think I'll leave that as an exercise.
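
For reference, here is the same sequence bundled into one helper, a hypothetical wrapper that is not part of the original answer; it assumes df1 and df2 share the same index and columns:

import numpy as np
import pandas as pd

def diff_side_by_side(df1, df2):
    # which (row, column) cells differ; note that NaN != NaN also counts as a change
    ne_stacked = (df1 != df2).stack()
    changed = ne_stacked[ne_stacked]
    changed.index.names = ['id', 'col']

    # pull the old and new values at the differing positions
    difference_locations = np.where(df1 != df2)
    changed_from = df1.values[difference_locations]
    changed_to = df2.values[difference_locations]
    return pd.DataFrame({'from': changed_from, 'to': changed_to},
                        index=changed.index)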

Pandas better method to compare two dataframes and find entries that only exist in one

It looks like using 'outer' as the how argument was the solution:

z = pd.merge(ORIGINAL, NEW, on=cols, how='outer', indicator=True)
z = z[z._merge != 'both']  # Filter out records from both

The output looks like this (after showing only the columns I care about):

     Name  Site      _merge
  Charlie     A   left_only
     Doug     B  right_only
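
A runnable sketch of that merge; ORIGINAL, NEW and cols stand in for the asker's frames and key columns, and the sample data is made up to mirror the output above:

import pandas as pd

ORIGINAL = pd.DataFrame({'Name': ['Alice', 'Charlie'], 'Site': ['A', 'A']})
NEW = pd.DataFrame({'Name': ['Alice', 'Doug'], 'Site': ['A', 'B']})
cols = ['Name', 'Site']

# an outer merge keeps every row and labels its origin in the _merge column
z = pd.merge(ORIGINAL, NEW, on=cols, how='outer', indicator=True)
z = z[z['_merge'] != 'both']  # keep rows present in only one frame
print(z[['Name', 'Site', '_merge']])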

Python Pandas - Find difference between two data frames

By using drop_duplicates

pd.concat([df1,df2]).drop_duplicates(keep=False)

Update:

The above method only works for those data frames that don't already have duplicates themselves. For example:

df1 = pd.DataFrame({'A': [1, 2, 3, 3], 'B': [2, 3, 4, 4]})
df2 = pd.DataFrame({'A': [1], 'B': [2]})

It will output the result below, which is wrong.

Wrong output:

pd.concat([df1, df2]).drop_duplicates(keep=False)
Out[655]:
   A  B
1  2  3

Correct output:

Out[656]:
   A  B
1  2  3
2  3  4
3  3  4


How to achieve that?

Method 1: Using isin with tuple

df1[~df1.apply(tuple, axis=1).isin(df2.apply(tuple, axis=1))]
Out[657]:
   A  B
1  2  3
2  3  4
3  3  4

Method 2: merge with indicator

df1.merge(df2, indicator=True, how='left').loc[lambda x: x['_merge'] != 'both']
Out[421]:
   A  B     _merge
1  2  3  left_only
2  3  4  left_only
3  3  4  left_only
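
Note that both methods return only the rows of df1 that are missing from df2. If you also need the rows unique to df2, one option (a sketch, not from the original answer) is to apply the tuple/isin filter in both directions:

import pandas as pd

df1 = pd.DataFrame({'A': [1, 2, 3, 3], 'B': [2, 3, 4, 4]})
df2 = pd.DataFrame({'A': [1], 'B': [2]})

# rows of each frame whose full tuple of values never appears in the other frame
only_in_df1 = df1[~df1.apply(tuple, axis=1).isin(df2.apply(tuple, axis=1))]
only_in_df2 = df2[~df2.apply(tuple, axis=1).isin(df1.apply(tuple, axis=1))]
print(pd.concat([only_in_df1, only_in_df2]))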

Compare two DataFrames and get the differences between them as output

Assuming this input:

df1 = pd.DataFrame([['tom', 10], ['nick', 15], ['juli', 14]])
df2 = pd.DataFrame([['juli', 14], ['daniel', 15], ['tom', 10], ['tom', 10]])

You could use merge with the indicator option.

The rationale here is to create an additional column with an index per group to identify the duplicates.

cols = list(df1.columns)
(df1.assign(idx=df1.groupby(cols).cumcount())
    .merge(df2.assign(idx=df2.groupby(cols).cumcount()),
           on=list(df1.columns) + ['idx'],
           indicator=True,
           how='outer')
    .drop('idx', axis=1)
    .query('_merge != "both"')
    # .to_excel('output.xlsx')  # uncomment to export as xlsx
)

output:

        0   1      _merge
1    nick  15   left_only
3  daniel  15  right_only
4     tom  10  right_only
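
The same idea wrapped in a small helper (symmetric_diff_with_duplicates is a hypothetical name); cumcount() numbers repeated rows within each group, so duplicates only cancel out up to the number of copies present in each frame:

import pandas as pd

def symmetric_diff_with_duplicates(df1, df2):
    cols = list(df1.columns)
    return (df1.assign(idx=df1.groupby(cols).cumcount())
               .merge(df2.assign(idx=df2.groupby(cols).cumcount()),
                      on=cols + ['idx'], indicator=True, how='outer')
               .drop('idx', axis=1)
               .query('_merge != "both"'))

# e.g. with the df1/df2 defined above:
# symmetric_diff_with_duplicates(df1, df2)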

python pandas - compare two dataframes in multiple ways by custom ID

I am not sure if it is the fastest possible solution, but this problem seems to call for pd.merge. As you say, let's first deal with things that are in one dataframe but not the other:

def get_only_left(df1, df2):
    left_merge = pd.merge(df1, df2, on='ID', suffixes=('', '_other'), how='left')
    added_columns = [c + '_other' for c in df1.columns if c != 'ID']
    mask = left_merge.loc[:, added_columns].isna().all(axis=1)
    return left_merge[mask].drop(added_columns, axis=1)

pd.concat([get_only_left(prior_df, current_df), get_only_left(current_df, prior_df)])

This gives

     Date    ID  Value Category Subcategory
4  30-Nov  0005  500.0        D        D900
4  31-Dec  0006  600.0        D        D900

Then, let's deal with values that have actually changed.

columns = list(current_df.columns)
df = pd.merge(current_df, prior_df, on='ID', suffixes=('', '_prior'), how='inner')
mask = df['Value'] != df['Value_prior']
df[mask].loc[:, columns + ['Value_prior']]

This gives

     Date    ID  Value Category Subcategory  Value_prior
3  31-Dec  0004  400.0        E        E900        450.0

Then similarly:

mask = df['Category'] != df['Category_prior']
df[mask].loc[:, columns + ['Category_prior']]

gives

     Date    ID  Value Category Subcategory Category_prior
3  31-Dec  0004  400.0        E        E900              D

And finally

import numpy as np
mask = np.logical_and(df['Category'] == df['Category_prior'], df['Subcategory'] != df['Subcategory_prior'])
df[mask].loc[:, columns + ['Subcategory_prior']]

gives

     Date    ID  Value Category Subcategory Subcategory_prior
1  31-Dec  0002  200.0        B        B101              B120
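
A hypothetical generalisation of those column-by-column checks, looping over the compared columns; unlike the last step above, it flags every Subcategory change rather than only those where Category stayed the same:

import pandas as pd

def changed_by_column(current_df, prior_df, id_col='ID',
                      compare_cols=('Value', 'Category', 'Subcategory')):
    columns = list(current_df.columns)
    merged = pd.merge(current_df, prior_df, on=id_col,
                      suffixes=('', '_prior'), how='inner')
    changes = {}
    for col in compare_cols:
        # rows where the current value differs from the prior value
        mask = merged[col] != merged[col + '_prior']
        changes[col] = merged.loc[mask, columns + [col + '_prior']]
    return changes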

Comparing two pandas dataframes for differences

You also need to be careful to create a copy of the DataFrame, otherwise csvdata_old will be updated along with csvdata (since it points to the same object):

csvdata_old = csvdata.copy()

To check whether they are equal, you can use assert_frame_equal as in this answer:

from pandas.testing import assert_frame_equal  # pandas.util.testing is the old, deprecated path
assert_frame_equal(csvdata, csvdata_old)

You can wrap this in a function with something like:

try:
    assert_frame_equal(csvdata, csvdata_old)
    return True
except Exception:  # apparently AssertionError doesn't catch everything
    return False
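
For completeness, a self-contained sketch of that wrapper (dataframes_equal is a hypothetical name):

from pandas.testing import assert_frame_equal

def dataframes_equal(df_a, df_b):
    """Return True if the two frames compare equal, False otherwise."""
    try:
        assert_frame_equal(df_a, df_b)
        return True
    except Exception:  # the answer notes AssertionError alone may not catch everything
        return False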

There was discussion of a better way...

Diff of two Dataframes

Merge the 2 dfs using how='outer' and pass indicator=True; this will tell you whether each row is present in both, left only, or right only, and you can then filter the merged df afterwards:

In [22]:
merged = df1.merge(df2, indicator=True, how='outer')
merged[merged['_merge'] == 'right_only']

Out[22]:
  Buyer  Quantity      _merge
3  Carl         2  right_only
4  Mark         1  right_only
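
A runnable sketch of that merge; the Buyer/Quantity data below is made up to reproduce the output above:

import pandas as pd

df1 = pd.DataFrame({'Buyer': ['Carl', 'Mark', 'Carl'],
                    'Quantity': [18, 3, 5]})
df2 = pd.DataFrame({'Buyer': ['Carl', 'Mark', 'Carl', 'Carl', 'Mark'],
                    'Quantity': [18, 3, 5, 2, 1]})

# merging on all shared columns marks each row as both / left_only / right_only
merged = df1.merge(df2, indicator=True, how='outer')
print(merged[merged['_merge'] == 'right_only'])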

