Python Pandas - Find Difference Between Two Data Frames


By using drop_duplicates

pd.concat([df1,df2]).drop_duplicates(keep=False)

Update :

The above method only works for those data frames that don't already have duplicates themselves. For example:

df1=pd.DataFrame({'A':[1,2,3,3],'B':[2,3,4,4]})
df2=pd.DataFrame({'A':[1],'B':[2]})

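It produces the output below, which is wrong: rows 2 and 3 of df1 are duplicates of each other, so keep=False drops them as well, even though they never appear in df2.

Wrong output:

pd.concat([df1, df2]).drop_duplicates(keep=False)
Out[655]:
   A  B
1  2  3

Correct output:

Out[656]:
   A  B
1  2  3
2  3  4
3  3  4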


How to achieve that?

Method 1: Using isin with tuple

df1[~df1.apply(tuple, axis=1).isin(df2.apply(tuple, axis=1))]
Out[657]:
   A  B
1  2  3
2  3  4
3  3  4

Method 2: merge with indicator

df1.merge(df2, indicator=True, how='left').loc[lambda x: x['_merge'] != 'both']
Out[421]:
   A  B     _merge
1  2  3  left_only
2  3  4  left_only
3  3  4  left_only
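
As a self-contained sketch of Method 2, the merge can be wrapped in a small helper. The function name rows_only_in_left is made up for illustration, and deduplicating the right-hand frame before merging is an added assumption, so that a row matched several times on the right is not multiplied in the result.

```python
import pandas as pd


def rows_only_in_left(left: pd.DataFrame, right: pd.DataFrame) -> pd.DataFrame:
    """Return the rows of `left` that have no match in `right` (a left anti-join)."""
    # With no `on=` argument, merge joins on all columns the two frames share.
    # Deduplicating `right` first is an assumption added for this sketch.
    merged = left.merge(right.drop_duplicates(), how="left", indicator=True)
    # Keep rows that were found only on the left, then drop the helper column.
    return merged.loc[merged["_merge"] == "left_only"].drop(columns="_merge")


df1 = pd.DataFrame({"A": [1, 2, 3, 3], "B": [2, 3, 4, 4]})
df2 = pd.DataFrame({"A": [1], "B": [2]})

result = rows_only_in_left(df1, df2)
print(result)
```

Unlike the concat/drop_duplicates approach, this keeps both copies of the duplicated row (3, 4) from df1.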

Comparing two dataframes and getting the differences

Comparing with df1 != df2 works only for dataframes with identical row and column labels. All dataframe axes are checked with the internal _indexed_same method, and an exception is raised if they differ, even if only the order of the columns or index differs.

If I understood you correctly, you want the symmetric difference rather than the changed values. For that, one approach is to concatenate the dataframes:

>>> df = pd.concat([df1, df2])
>>> df = df.reset_index(drop=True)

group by

>>> df_gpby = df.groupby(list(df.columns))

get index of unique records

>>> idx = [x[0] for x in df_gpby.groups.values() if len(x) == 1]

filter

>>> df.reindex(idx)
         Date   Fruit   Num   Color
9  2013-11-25  Orange   8.6  Orange
8  2013-11-25   Apple  22.1     Red
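
The steps above can be put together as one runnable sketch. The frames left and right and the function name symmetric_difference are hypothetical, since the original data is not shown here. Two caveats worth keeping in mind: groupby drops rows containing NaN by default, and a row that is duplicated inside a single frame is treated as if it were shared.

```python
import pandas as pd


def symmetric_difference(df1: pd.DataFrame, df2: pd.DataFrame) -> pd.DataFrame:
    """Return rows that occur exactly once in the concatenation of df1 and df2."""
    combined = pd.concat([df1, df2]).reset_index(drop=True)
    # Group identical rows together; each group maps to the index labels it contains.
    groups = combined.groupby(list(combined.columns)).groups
    # A group of size 1 is a row that appears in only one of the two frames.
    idx = sorted(labels[0] for labels in groups.values() if len(labels) == 1)
    return combined.loc[idx]


# Hypothetical sample data for illustration only.
left = pd.DataFrame({"Fruit": ["Apple", "Banana"], "Num": [22.1, 8.6]})
right = pd.DataFrame({"Fruit": ["Apple", "Orange"], "Num": [22.1, 8.6]})

diff = symmetric_difference(left, right)
print(diff)
```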

Compare two DataFrames and output their differences side-by-side

The first part is similar to Constantine's answer: you can get a boolean Series marking which rows differ*:

In [21]: ne = (df1 != df2).any(axis=1)

In [22]: ne
Out[22]:
0    False
1     True
2     True
dtype: bool

Then we can see which entries have changed:

In [23]: ne_stacked = (df1 != df2).stack()

In [24]: changed = ne_stacked[ne_stacked]

In [25]: changed.index.names = ['id', 'col']

In [26]: changed
Out[26]:
id  col
1   score         True
2   isEnrolled    True
    Comment       True
dtype: bool

Here the first level of the index is the row label and the second is the column that has changed. The old and new values at those locations can then be extracted:

In [27]: difference_locations = np.where(df1 != df2)

In [28]: changed_from = df1.values[difference_locations]

In [29]: changed_to = df2.values[difference_locations]

In [30]: pd.DataFrame({'from': changed_from, 'to': changed_to}, index=changed.index)
Out[30]:
                from           to
id col
1  score        1.11         1.21
2  isEnrolled   True        False
   Comment      None  On vacation

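* Note: it's important that df1 and df2 share the same index here. To overcome this ambiguity, you can restrict the comparison to the shared labels using df1.index.intersection(df2.index) (the older df1.index & df2.index spelling no longer performs a set intersection in recent pandas versions), but I think I'll leave that as an exercise.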
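
One possible way to handle frames whose indexes only partly overlap is sketched below. The function name changed_cells and the sample frames are hypothetical, the sketch assumes both frames have the same columns and unique index labels, and treating NaN on both sides as "unchanged" is an added design choice rather than part of the answer above.

```python
import numpy as np
import pandas as pd


def changed_cells(df1: pd.DataFrame, df2: pd.DataFrame) -> pd.DataFrame:
    """List cells that differ between df1 and df2 on their shared index labels."""
    shared = df1.index.intersection(df2.index)
    a = df1.loc[shared]
    b = df2.loc[shared, a.columns]  # same row and column order as `a`
    # NaN != NaN evaluates to True, so mask out cells that are missing on both sides.
    mask = (a != b) & ~(a.isna() & b.isna())
    rows, cols = np.where(mask.to_numpy())
    index = pd.MultiIndex.from_arrays(
        [a.index[rows], a.columns[cols]], names=["id", "col"]
    )
    return pd.DataFrame(
        {"from": a.to_numpy()[rows, cols], "to": b.to_numpy()[rows, cols]},
        index=index,
    )


# Hypothetical data: df2 lacks the row labelled 2.
df1 = pd.DataFrame(
    {"score": [1.11, 2.5, 3.0], "enrolled": [True, True, False]}, index=[0, 1, 2]
)
df2 = pd.DataFrame({"score": [1.21, 2.5], "enrolled": [True, False]}, index=[0, 1])

changes = changed_cells(df1, df2)
print(changes)
```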

How to find differences between two dataframes of different lengths?

You could define several helper functions to adjust the length and widths of the two dataframes:

def equalize_length(short, long):
    return pd.concat(
        [
            short,
            pd.DataFrame(
                {
                    col: ["nan"] * (long.shape[0] - short.shape[0])
                    for col in short.columns
                }
            ),
        ]
    ).reset_index(drop=True)

def equalize_width(short, long):
    return pd.concat(
        [
            short,
            pd.DataFrame({col: [] for col in long.columns if col not in short.columns}),
        ],
        axis=1,
    ).reset_index(drop=True)

def equalize(df, other_df):
    if df.shape[0] <= other_df.shape[0]:
        df = equalize_length(df, other_df)
    else:
        other_df = equalize_length(other_df, df)
    if df.shape[1] <= other_df.shape[1]:
        df = equalize_width(df, other_df)
    else:
        other_df = equalize_width(other_df, df)
    df = df.fillna("nan")
    other_df = other_df.fillna("nan")
    return df, other_df

And then, in your code:

a, b = equalize(a, b)

comparevalues = a.values == b.values

# positions where the two frames differ
rows, cols = np.where(~comparevalues)

for row, col in zip(rows, cols):
    a.iloc[row, col] = " {} --> {} ".format(a.iloc[row, col], b.iloc[row, col])

print(a)  # with 'a' being shorter in length but longer in width than 'b'
# Output
   A          B                    C                            D
0  1          abcd --> dah         jamba                        OQEWINVSKD --> nan
1  2          efgh --> fupa        refresh --> dimez            DKVLNQIOEVM --> nan
2  3          ijkl                 portobello --> pocketfresh   asdlikvn --> nan
3  4          uhyee --> danju      performancehigh --> reverbb  asdkvnddvfvfkdd --> nan
4  5          uhuh                 jackalack                    nan
5  nan --> 6  nan --> freshhhhhhh  nan --> boombackimmatouchit  nan
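
As an alternative to hand-written padding helpers, pandas can align two frames on the union of their labels with DataFrame.align. This is a different technique from the helpers above, shown only as a sketch: it pads with real NaN values rather than the string "nan", it matches rows by index label rather than by position, and the sample frames are hypothetical.

```python
import pandas as pd

# Hypothetical frames: `a` is shorter but wider than `b`.
a = pd.DataFrame({"A": [1, 2], "B": [10, 20]})
b = pd.DataFrame({"A": [1, 5, 6]})

# Outer alignment gives both frames the union of row and column labels,
# filling the positions that did not exist with NaN.
a2, b2 = a.align(b, join="outer")

# A cell counts as different unless the values are equal or both are missing.
mask = (a2 != b2) & ~(a2.isna() & b2.isna())
print(mask)
```

The resulting mask has the same shape as the aligned frames and can be passed to np.where in the same way as comparevalues above.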

Find difference between two pandas dataframes when both contain the same rows but one dataframe contains them more than once

You can add a helper column that numbers each occurrence of a row, so that repeated rows become distinguishable:

df1['merge'] = df1.groupby(['0','1','2']).cumcount()

df2['merge'] = df2.groupby(['0','1','2']).cumcount()

pd.concat([df1,df2]).drop_duplicates(keep=False)

Afterwards you can drop the added column again.
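
A self-contained sketch of this idea follows. The function name multiset_difference and the helper column name _occurrence are made up for illustration, and the sketch assumes both frames have the same columns. It uses assign so that the input frames are not modified.

```python
import pandas as pd


def multiset_difference(df1: pd.DataFrame, df2: pd.DataFrame) -> pd.DataFrame:
    """Return rows whose n-th occurrence exists in only one of the two frames."""
    cols = list(df1.columns)
    # cumcount numbers repeated rows 0, 1, 2, ... within each group of identical rows.
    a = df1.assign(_occurrence=df1.groupby(cols).cumcount())
    b = df2.assign(_occurrence=df2.groupby(cols).cumcount())
    # A row and its occurrence number now form a unique key inside each frame,
    # so keep=False removes only the pairs that appear in both frames.
    out = pd.concat([a, b]).drop_duplicates(keep=False)
    return out.drop(columns="_occurrence")


df1 = pd.DataFrame({"A": [1, 2, 3, 3], "B": [2, 3, 4, 4]})
df2 = pd.DataFrame({"A": [1, 3], "B": [2, 4]})

extra = multiset_difference(df1, df2)
print(extra)
```

Here df1 contains the row (3, 4) twice and df2 contains it once, so exactly one copy survives in the result.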


