Compare Two Dataframes and Output Their Differences Side-By-Side

Compare two DataFrames and output their differences side-by-side

The first part is similar to Constantine's answer: you can get a boolean Series marking which rows have changed*:

In [21]: ne = (df1 != df2).any(axis=1)

In [22]: ne
Out[22]:
0 False
1 True
2 True
dtype: bool

Then we can see which entries have changed:

In [23]: ne_stacked = (df1 != df2).stack()

In [24]: changed = ne_stacked[ne_stacked]

In [25]: changed.index.names = ['id', 'col']

In [26]: changed
Out[26]:
id  col
1   score         True
2   isEnrolled    True
    Comment       True
dtype: bool

Here the first index level is the row label and the second is the column that has changed.

In [27]: difference_locations = np.where(df1 != df2)

In [28]: changed_from = df1.values[difference_locations]

In [29]: changed_to = df2.values[difference_locations]

In [30]: pd.DataFrame({'from': changed_from, 'to': changed_to}, index=changed.index)
Out[30]:
               from           to
id col
1  score       1.11         1.21
2  isEnrolled  True        False
   Comment     None  On vacation

* Note: it's important that df1 and df2 share the same index here. If they might not, you can restrict the comparison to the shared labels using df1.index.intersection(df2.index), but I think I'll leave that as an exercise.
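For anyone who wants to try that exercise, here is a minimal sketch. The sample data is made up for illustration (and deliberately has no missing values, since NaN never compares equal to NaN); only the steps come from the answer above.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: the two frames overlap only on index labels 1 and 2.
df1 = pd.DataFrame(
    {"score": [1.0, 1.11, 4.12], "isEnrolled": [True, False, True]},
    index=[0, 1, 2],
)
df2 = pd.DataFrame(
    {"score": [1.21, 4.12, 3.0], "isEnrolled": [False, False, True]},
    index=[1, 2, 3],
)

# Keep only the row labels present in both frames, in the same order.
shared = df1.index.intersection(df2.index)
left, right = df1.loc[shared], df2.loc[shared]

# Same steps as above, applied to the aligned frames.
diff_mask = left != right
changed = diff_mask.stack()
changed = changed[changed]
changed.index.names = ["id", "col"]

rows, cols = np.where(diff_mask)
report = pd.DataFrame(
    {"from": left.values[rows, cols], "to": right.values[rows, cols]},
    index=changed.index,
)
print(report)
```

Rows that exist in only one frame (labels 0 and 3 here) are silently ignored by this sketch, so it answers "what changed" but not "what was added or removed".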

Comparing two dataframes and getting the differences

This approach, df1 != df2, works only for DataFrames with identical row and column labels. In fact, all axes of the two DataFrames are compared with the _indexed_same method, and an exception is raised if any difference is found, even in the order of the columns or index.
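A quick way to see this for yourself is sketched below. The helper name can_compare and the sample frames are my own, not part of pandas.

```python
import pandas as pd


def can_compare(a, b):
    """Return True if `a != b` can be evaluated without raising (hypothetical helper)."""
    try:
        a != b
    except ValueError:
        # pandas refuses to compare DataFrames whose labels are not identical.
        return False
    return True


df1 = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
extra_row = pd.DataFrame({"a": [1, 2, 5], "b": [3, 4, 6]})
reordered = df1[["b", "a"]]  # same data, different column order

print(can_compare(df1, df1.copy()))
print(can_compare(df1, extra_row))
print(can_compare(df1, reordered))
```

Only the first comparison should succeed; the other two differ in shape or in label order.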

If I understand you correctly, you want to find not the changes but the symmetric difference. For that, one approach is to concatenate the DataFrames:

>>> df = pd.concat([df1, df2])
>>> df = df.reset_index(drop=True)

group by

>>> df_gpby = df.groupby(list(df.columns))

get index of unique records

>>> idx = [x[0] for x in df_gpby.groups.values() if len(x) == 1]

filter

>>> df.reindex(idx)
         Date   Fruit   Num   Color
9  2013-11-25  Orange   8.6  Orange
8  2013-11-25   Apple  22.1     Red

Pandas better method to compare two dataframes and find entries that only exist in one

It looks like passing 'outer' as the how argument was the solution:

z = pd.merge(ORIGINAL, NEW, on=cols, how = 'outer', indicator=True)
z = z[z._merge != 'both'] # Filter out records from both

Output looks like this (after only showing the columns I care about)

Name     Site  _merge
Charlie  A     left_only
Doug     B     right_only
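Here is a self-contained sketch of the same idea. The frames are invented for illustration (only Charlie and Doug appear in the output above; Alice and Bob are assumptions), and cols is assumed to be the list of key columns.

```python
import pandas as pd

# Hypothetical data: Charlie exists only in `original`, Doug only in `new`.
original = pd.DataFrame(
    {"Name": ["Alice", "Bob", "Charlie"], "Site": ["A", "B", "A"]}
)
new = pd.DataFrame({"Name": ["Alice", "Bob", "Doug"], "Site": ["A", "B", "B"]})

cols = ["Name", "Site"]  # columns that identify a record

# indicator=True adds a `_merge` column: 'left_only', 'right_only' or 'both'.
merged = pd.merge(original, new, on=cols, how="outer", indicator=True)

# Keep the records that appear in exactly one of the two frames.
only_one_side = merged[merged["_merge"] != "both"]
print(only_one_side)
```

Filtering on 'left_only' or 'right_only' instead gives the rows removed from, or added to, the original.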

How to find differences between two dataframes of different lengths?

You could define several helper functions to adjust the length and width of the two dataframes:

import numpy as np
import pandas as pd


def equalize_length(short, long):
    # Pad the shorter frame with "nan" rows until it has as many rows as the longer one.
    return pd.concat(
        [
            short,
            pd.DataFrame(
                {
                    col: ["nan"] * (long.shape[0] - short.shape[0])
                    for col in short.columns
                }
            ),
        ]
    ).reset_index(drop=True)


def equalize_width(short, long):
    # Add the columns that exist only in the wider frame (filled with NaN).
    return pd.concat(
        [
            short,
            pd.DataFrame({col: [] for col in long.columns if col not in short.columns}),
        ],
        axis=1,
    ).reset_index(drop=True)


def equalize(df, other_df):
    # Make both frames the same shape, then replace missing values with "nan".
    if df.shape[0] <= other_df.shape[0]:
        df = equalize_length(df, other_df)
    else:
        other_df = equalize_length(other_df, df)
    if df.shape[1] <= other_df.shape[1]:
        df = equalize_width(df, other_df)
    else:
        other_df = equalize_width(other_df, df)
    df = df.fillna("nan")
    other_df = other_df.fillna("nan")
    return df, other_df

And then, in your code:

a, b = equalize(a, b)

comparevalues = a.values == b.values

# Positions (row, column) of the cells that differ
rows, cols = np.where(~comparevalues)

for item in zip(rows, cols):
    a.iloc[item[0], item[1]] = " {} --> {} ".format(
        a.iloc[item[0], item[1]], b.iloc[item[0], item[1]]
    )
print(a)  # with 'a' being shorter in length but longer in width than 'b'
# Output
   A          B                    C                            D
0  1          abcd --> dah         jamba                        OQEWINVSKD --> nan
1  2          efgh --> fupa        refresh --> dimez            DKVLNQIOEVM --> nan
2  3          ijkl                 portobello --> pocketfresh   asdlikvn --> nan
3  4          uhyee --> danju      performancehigh --> reverbb  asdkvnddvfvfkdd --> nan
4  5          uhuh                 jackalack                    nan
5  nan --> 6  nan --> freshhhhhhh  nan --> boombackimmatouchit  nan
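As an alternative to hand-written padding helpers, pandas can do the padding itself with DataFrame.align. This is a different technique from the answer above, sketched on made-up data; the frames and variable names are assumptions.

```python
import pandas as pd

# Hypothetical frames: `a` has an extra column, `b` has an extra row.
a = pd.DataFrame({"A": [1, 2], "B": ["x", "y"], "D": ["p", "q"]})
b = pd.DataFrame({"A": [1, 2, 3], "B": ["x", "z", "w"]})

# join="outer" reindexes both frames to the union of their row and column labels,
# filling the cells that did not exist with NaN.
a_aligned, b_aligned = a.align(b, join="outer")

# A cell differs when the values are unequal and they are not both missing.
differs = a_aligned.ne(b_aligned) & ~(a_aligned.isna() & b_aligned.isna())
print(differs)
```

Note that this matches rows by index label rather than by position, which may or may not be what you want.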

Diff of two Dataframes

Merge the two DataFrames with how='outer' and pass indicator=True. This tells you whether each row is present in both frames, only in the left one, or only in the right one; you can then filter the merged DataFrame:

In [22]:
merged = df1.merge(df2, indicator=True, how='outer')
merged[merged['_merge'] == 'right_only']

Out[22]:
  Buyer  Quantity      _merge
3  Carl         2  right_only
4  Mark         1  right_only

Compare two DataFrames and output a new DataFrame with the different index

Using drop_duplicates

import pandas as pd
dataA = {'Name':['Jony', 'Mike', 'Joanna'], 'Color':['Blue', 'Red', 'Green']}
dataB = {'Name':['Jony', 'Mike'], 'Color':['Blue', 'Red']}

dfA = pd.DataFrame(dataA)
dfB = pd.DataFrame(dataB)

df = pd.concat([dfA, dfB]).drop_duplicates(keep=False, ignore_index=True)
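To make the result concrete, here is the same snippet with its output checked, plus one caveat I believe is worth knowing: keep=False drops every row that occurs more than once, so a row duplicated inside a single frame disappears too. The dfA_dup frame is a hypothetical example of that.

```python
import pandas as pd

dfA = pd.DataFrame(
    {"Name": ["Jony", "Mike", "Joanna"], "Color": ["Blue", "Red", "Green"]}
)
dfB = pd.DataFrame({"Name": ["Jony", "Mike"], "Color": ["Blue", "Red"]})

# Rows present in both frames occur twice after concat and are all dropped.
diff = pd.concat([dfA, dfB]).drop_duplicates(keep=False, ignore_index=True)
print(diff)  # only the Joanna / Green row remains

# Caveat: if Joanna is listed twice in dfA, she is dropped as well,
# even though she is still missing from dfB.
dfA_dup = pd.concat([dfA, dfA.tail(1)])
missed = pd.concat([dfA_dup, dfB]).drop_duplicates(keep=False, ignore_index=True)
print(missed.empty)
```

If your frames can contain internal duplicates, a merge with indicator=True is probably the safer choice.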

How to compare 2 non-identical dataframes in python

Since your goal is just to compare differences, use DataFrame.compare instead of aggregating into strings.

However,

DataFrame.compare can only compare identically-labeled (i.e. same shape, identical row and column labels) DataFrames

So we just need to align the row/column indexes, either via merge or reindex.
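For reference, this is roughly what DataFrame.compare gives you once the labels already match. The two small frames are invented for illustration.

```python
import pandas as pd

# Hypothetical frames with identical row and column labels.
df1 = pd.DataFrame({"name": ["linda", "james"], "num": [3, 4]})
df2 = pd.DataFrame({"name": ["linda", "james"], "num": [3, 6]})

# By default only the rows and columns containing a difference are kept;
# each column gets a 'self' (df1) and an 'other' (df2) sub-column.
diff = df1.compare(df2)
print(diff)
```

Here only row 1 of the num column should show up, because everything else is equal.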



Align via merge

  1. Outer-merge the two dfs:

    merged = df1.merge(df2, how='outer', left_on='col_id', right_on='id')
    #    col_id  num  name_x  id  no  name_y
    # 0       1    3   linda   1   2  granpa
    # 1       2    4   james   2   6   linda
    # 2     NaN  NaN     NaN   3   7     sam
  2. Divide the merged frame into left/right frames and align their columns with set_axis:

    cols = df1.columns
    left = merged.iloc[:, :len(cols)].set_axis(cols, axis=1)
    #    col_id  num   name
    # 0       1    3  linda
    # 1       2    4  james
    # 2     NaN  NaN    NaN

    right = merged.iloc[:, len(cols):].set_axis(cols, axis=1)
    #    col_id  num    name
    # 0       1    2  granpa
    # 1       2    6   linda
    # 2       3    7     sam
  3. Compare the aligned left/right frames (use keep_equal=True to show equal cells):

    left.compare(right, keep_shape=True, keep_equal=True)
    #   col_id        num         name
    #     self other self other   self   other
    # 0      1     1    3     2  linda  granpa
    # 1      2     2    4     6  james   linda
    # 2    NaN     3  NaN     7    NaN     sam

    left.compare(right, keep_shape=True)
    #   col_id        num         name
    #     self other self other   self   other
    # 0    NaN   NaN    3     2  linda  granpa
    # 1    NaN   NaN    4     6  james   linda
    # 2    NaN     3  NaN     7    NaN     sam
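Putting the three steps together, a runnable sketch could look like this. The input frames are reconstructed from the outputs shown in the comments above, so treat them as an assumption about the original question's data.

```python
import pandas as pd

# Frames reconstructed from the commented outputs above (assumed, not verified).
df1 = pd.DataFrame({"col_id": [1, 2], "num": [3, 4], "name": ["linda", "james"]})
df2 = pd.DataFrame(
    {"id": [1, 2, 3], "no": [2, 6, 7], "name": ["granpa", "linda", "sam"]}
)

# 1. Outer-merge on the id columns so every id from either frame gets a row.
merged = df1.merge(df2, how="outer", left_on="col_id", right_on="id")

# 2. Split the merged frame back into two halves with identical labels.
cols = df1.columns
left = merged.iloc[:, : len(cols)].set_axis(cols, axis=1)
right = merged.iloc[:, len(cols) :].set_axis(cols, axis=1)

# 3. Compare; keep_shape/keep_equal keep every cell in the result.
diff = left.compare(right, keep_shape=True, keep_equal=True)
print(diff)
```

Without any dtype conversion the numeric columns in left will print as floats (1.0, 2.0, NaN), because of the missing row.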


Align via reindex

If you are 100% sure that one df is a subset of the other, then reindex the subsetted rows.

In your example, df1 is a subset of df2, so reindex df1:

(
    df1.assign(id=df1.col_id)          # copy col_id (we need original col_id after reindexing)
    .set_index('id')                   # set index to copied id
    .reindex(df2.id)                   # reindex against df2's id
    .reset_index(drop=True)            # remove copied id
    .set_axis(df2.columns, axis=1)     # align column names
    .compare(df2, keep_equal=True, keep_shape=True)
)

#   col_id        num         name
#     self other self other   self   other
# 0      1     1    3     2  linda  granpa
# 1      2     2    4     6  james   linda
# 2    NaN     3  NaN     7    NaN     sam


Nullable integers

Normally int cannot mix with nan, so pandas converts to float. To keep the int values as int (like the examples above):

  • Ideally we'd convert the int columns to nullable integers with astype('Int64') (capital I).
  • However, there is currently a comparison bug with Int64, so just use astype(object) for now.
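The dtype behaviour described above can be checked with a small sketch. The series is made up, and this only illustrates the dtypes; it does not exercise the comparison bug mentioned.

```python
import pandas as pd

ints = pd.Series([1, 2])

# Reindexing introduces a missing value, so the int64 column becomes float64.
as_float = ints.reindex([0, 1, 2])

# The nullable Int64 dtype keeps integers and stores the gap as <NA>.
as_nullable = as_float.astype("Int64")

# Converting to object before reindexing keeps the original Python ints next to NaN.
as_object = ints.astype(object).reindex([0, 1, 2])

print(as_float.dtype, as_nullable.dtype, as_object.dtype)
```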

Find common data between two dataframes within a specific date range

You can define a helper function that turns your dictionaries into dataframes and slices them to a given date range:

import pandas as pd


def format(dictionary, start, end):  # note: this name shadows Python's built-in format()
    """Helper function.

    Args:
        dictionary: dictionary to format.
        start: start date (DD/MM/YY).
        end: end date (DD/MM/YY).

    Returns:
        Dataframe.

    """
    return (
        pd.DataFrame(dictionary)
        # Parse the "date" strings into datetimes
        .pipe(lambda df_: df_.assign(date=pd.to_datetime(df_["date"], format="%d/%m/%y")))
        # Keep only the rows whose date lies within [start, end]
        .pipe(
            lambda df_: df_.loc[
                (df_["date"] >= pd.to_datetime(start, format="%d/%m/%y"))
                & (df_["date"] <= pd.to_datetime(end, format="%d/%m/%y")),
                :,
            ]
        ).reset_index(drop=True)
    )

Then, with the dictionaries you provided, here is how you can show the id_number values of df2 that are also in df1 for the desired date range:

df1 = format(data1, "05/09/22", "10/09/22")
df2 = format(data2, "05/09/22", "10/09/22")

print(df2[df2["id_number"].isin(df1["id_number"])]["id_number"])
# Output
0      AA576bdk89
1    GG6jabkhd589
2     BXV6jabd589
Name: id_number, dtype: object
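If you prefer something shorter, the range check can probably be written with Series.between, which is inclusive on both ends by default. This is a variation on the helper above, not the same code; the function name filter_by_date and the sample dictionaries are hypothetical.

```python
import pandas as pd


def filter_by_date(records, start, end, fmt="%d/%m/%y"):
    """Build a DataFrame from `records` and keep rows dated within [start, end]."""
    df = pd.DataFrame(records)
    dates = pd.to_datetime(df["date"], format=fmt)
    lo = pd.to_datetime(start, format=fmt)
    hi = pd.to_datetime(end, format=fmt)
    return df.assign(date=dates)[dates.between(lo, hi)].reset_index(drop=True)


# Hypothetical sample data (dates are DD/MM/YY).
data1 = {
    "id_number": ["AA1", "BB2", "CC3"],
    "date": ["04/09/22", "06/09/22", "10/09/22"],
}
data2 = {
    "id_number": ["BB2", "CC3", "DD4"],
    "date": ["05/09/22", "11/09/22", "07/09/22"],
}

df1 = filter_by_date(data1, "05/09/22", "10/09/22")
df2 = filter_by_date(data2, "05/09/22", "10/09/22")

# id_number values of df2 that also occur in df1 within the date range
common = df2.loc[df2["id_number"].isin(df1["id_number"]), "id_number"]
print(common.tolist())
```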

