Python Pandas - Find Difference Between Two Data Frames

Python Pandas - Find difference between two data frames

By using drop_duplicates

pd.concat([df1,df2]).drop_duplicates(keep=False)

Update :

The above method only works for those data frames that don't already have duplicates themselves. For example:

df1=pd.DataFrame({'A':[1,2,3,3],'B':[2,3,4,4]})
df2=pd.DataFrame({'A':[1],'B':[2]})

It will output like below , which is wrong

Wrong Output :

pd.concat([df1, df2]).drop_duplicates(keep=False)
Out[655]: 
   A  B
1  2  3

Correct Output

How to achieve that?

Method 1: Using isin with tuple

df1[~df1.apply(tuple,1).isin(df2.apply(tuple,1))]
Out[657]: 
   A  B
1  2  3
2  3  4
3  3  4

Method 2: merge with indicator

df1.merge(df2,indicator = True, how='left').loc[lambda x : x['_merge']!='both']
Out[421]: 
   A  B     _merge
1  2  3  left_only
2  3  4  left_only
3  3  4  left_only

Comparing two dataframes and getting the differences

This approach, df1 != df2, works only for dataframes with identical rows and columns. In fact, all dataframes axes are compared with _indexed_same method, and exception is raised if differences found, even in columns/indices order.

If I got you right, you want not to find changes, but symmetric difference. For that, one approach might be concatenate dataframes:

>>> df = pd.concat([df1, df2])
>>> df = df.reset_index(drop=True)

group by

>>> df_gpby = df.groupby(list(df.columns))

get index of unique records

>>> idx = [x[0] for x in df_gpby.groups.values() if len(x) == 1]

filter

>>> df.reindex(idx)
         Date   Fruit   Num   Color
9  2013-11-25  Orange   8.6  Orange
8  2013-11-25   Apple  22.1     Red

Comparing two dataframes and getting the differences

If I got you right, you want not to find changes, but symmetric difference. For that, one approach might be concatenate dataframes:

>>> df = pd.concat([df1, df2])
>>> df = df.reset_index(drop=True)

group by

>>> df_gpby = df.groupby(list(df.columns))

get index of unique records

>>> idx = [x[0] for x in df_gpby.groups.values() if len(x) == 1]

filter

>>> df.reindex(idx)
         Date   Fruit   Num   Color
9  2013-11-25  Orange   8.6  Orange
8  2013-11-25   Apple  22.1     Red

Compare two DataFrames and output their differences side-by-side

The first part is similar to Constantine, you can get the boolean of which rows are empty*:

In [21]: ne = (df1 != df2).any(1)

In [22]: ne
Out[22]:
0    False
1     True
2     True
dtype: bool

Then we can see which entries have changed:

In [23]: ne_stacked = (df1 != df2).stack()

In [24]: changed = ne_stacked[ne_stacked]

In [25]: changed.index.names = ['id', 'col']

In [26]: changed
Out[26]:
id  col
1   score         True
2   isEnrolled    True
    Comment       True
dtype: bool

Here the first entry is the index and the second the columns which has been changed.

In [27]: difference_locations = np.where(df1 != df2)

In [28]: changed_from = df1.values[difference_locations]

In [29]: changed_to = df2.values[difference_locations]

In [30]: pd.DataFrame({'from': changed_from, 'to': changed_to}, index=changed.index)
Out[30]:
               from           to
id col
1  score       1.11         1.21
2  isEnrolled  True        False
   Comment     None  On vacation

* Note: it's important that df1 and df2 share the same index here. To overcome this ambiguity, you can ensure you only look at the shared labels using df1.index & df2.index, but I think I'll leave that as an exercise.

How to find differences between two dataframes of different lengths?

You could define several helper functions to adjust the length and widths of the two dataframes:

def equalize_length(short, long):
    return pd.concat(
        [
            short,
            pd.DataFrame(
                {
                    col: ["nan"] * (long.shape[0] - short.shape[0])
                    for col in short.columns
                }
            ),
        ]
    ).reset_index(drop=True)

def equalize_width(short, long):
    return pd.concat(
        [
            short,
            pd.DataFrame({col: [] for col in long.columns if col not in short.columns}),
        ],
        axis=1,
    ).reset_index(drop=True)

def equalize(df, other_df):
    if df.shape[0] <= other_df.shape[0]:
        df = equalize_length(df, other_df)
    else:
        other_df = equalize_length(other_df, df)
    if df.shape[1] <= other_df.shape[1]:
        df = equalize_width(df, other_df)
    else:
        other_df = equalize_width(other_df, df)
    df = df.fillna("nan")
    other_df = other_df.fillna("nan")
    return df, other_df

And then, in your code:

a, b = equalize(a, b)

comparevalues = a.values == b.values

rows, cols = np.where(comparevalues == False)

for item in zip(rows, cols):
    a.iloc[item[0], item[1]] = " {} --> {} ".format(
        a.iloc[item[0], item[1]], b.iloc[item[0], item[1]]
    )

print(a)  # with 'a' being shorter in lenght but longer in width than 'b'
# Output
             A                      B                              C                          D
0            1          abcd --> dah                           jamba        OQEWINVSKD --> nan
1            2         efgh --> fupa              refresh --> dimez        DKVLNQIOEVM --> nan
2            3                   ijkl    portobello --> pocketfresh           asdlikvn --> nan
3            4       uhyee --> danju    performancehigh --> reverbb    asdkvnddvfvfkdd --> nan
4            5                   uhuh                      jackalack                        nan        
5   nan --> 6    nan --> freshhhhhhh    nan --> boombackimmatouchit                         nan

Find difference between two pandas dataframes when both contains same rows but one dataframe contains it more than once

You can add a new column to catch the duplicates:

df1['merge'] = df1.groupby(['0','1','2']).cumcount()

df2['merge'] = df2.groupby(['0','1','2']).cumcount()

pd.concat([df1,df2]).drop_duplicates(keep=False)

Afterwards you can drop the added column again

Python Pandas - Find Difference Between Two Data Frames