Compare Two Dataframes and Output Their Differences Side-By-Side

Compare two DataFrames and output their differences side-by-side

The first part is similar to Constantine's answer: you can get a boolean Series marking which rows differ*:

In [21]: ne = (df1 != df2).any(axis=1)

In [22]: ne
Out[22]:
0    False
1     True
2     True
dtype: bool

Then we can see which entries have changed:

In [23]: ne_stacked = (df1 != df2).stack()

In [24]: changed = ne_stacked[ne_stacked]

In [25]: changed.index.names = ['id', 'col']

In [26]: changed
Out[26]:
id  col
1   score         True
2   isEnrolled    True
    Comment       True
dtype: bool

Here the first index level is the row label and the second is the column that changed. The old and new values at those positions can then be extracted:

In [27]: difference_locations = np.where(df1 != df2)

In [28]: changed_from = df1.values[difference_locations]

In [29]: changed_to = df2.values[difference_locations]

In [30]: pd.DataFrame({'from': changed_from, 'to': changed_to}, index=changed.index)
Out[30]:
               from           to
id col
1  score       1.11         1.21
2  isEnrolled  True        False
   Comment     None  On vacation
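For anyone who wants to run the steps above end to end, here is a minimal sketch. The sample frames are made up for illustration (they are not the data from the question) and deliberately contain no missing values, so the comparison stays unambiguous.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: both frames share the same index and columns.
df1 = pd.DataFrame(
    {"score": [1.11, 2.22, 3.33], "isEnrolled": [True, False, True]},
    index=pd.Index([0, 1, 2], name="id"),
)
df2 = df1.copy()
df2.loc[1, "score"] = 2.5
df2.loc[2, "isEnrolled"] = False

# Boolean mask of changed cells, then a stacked view of only the True entries.
mask = df1 != df2
changed = mask.stack()
changed = changed[changed].rename_axis(["id", "col"])

# Positions of the changed cells, in the same row-major order as the stack.
rows, cols = np.where(mask.to_numpy())
diff = pd.DataFrame(
    {"from": df1.to_numpy()[rows, cols], "to": df2.to_numpy()[rows, cols]},
    index=changed.index,
)
print(diff)
```

Both `stack()` and `np.where` walk the cells row by row, which is why the extracted values line up with `changed.index`.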

* Note: it's important that df1 and df2 share the same index here. To overcome this ambiguity, you can ensure you only look at the shared labels using df1.index.intersection(df2.index) (the older df1.index & df2.index spelling no longer performs a set intersection in pandas 2.0+), but I think I'll leave that as an exercise.
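One possible way to do that exercise, as a rough sketch: restrict both frames to the labels they have in common before comparing. The frames below are invented for the example, and `Index.intersection` is used for the shared labels.

```python
import pandas as pd

# Hypothetical frames whose indexes only partly overlap.
df1 = pd.DataFrame({"score": [1.0, 2.0, 3.0]}, index=[0, 1, 2])
df2 = pd.DataFrame({"score": [2.0, 3.5, 9.0]}, index=[1, 2, 3])

# Labels present in both frames.
shared = df1.index.intersection(df2.index)

# Compare only the shared rows; both slices now carry identical labels.
ne = (df1.loc[shared] != df2.loc[shared]).any(axis=1)
print(ne)
```

Rows that exist in only one frame (labels 0 and 3 here) are simply left out of the comparison, so they would need to be reported separately if they matter.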

Comparing two dataframes and getting the differences

This approach, df1 != df2, works only for dataframes with identical row and column labels. In fact, the axes of both dataframes are compared with the _indexed_same method, and an exception is raised if any difference is found, even in the order of the columns or index.
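A small sketch of that behaviour, using two made-up frames that hold the same data but with the index in a different order. The direct comparison raises a ValueError, while reordering one frame to match the other first (here with `reindex_like`) lets it go through.

```python
import pandas as pd

# Same labels and same values per label, but the index order differs.
a = pd.DataFrame({"x": [1, 2]}, index=[0, 1])
b = pd.DataFrame({"x": [2, 1]}, index=[1, 0])

try:
    a != b
    raised = False
except ValueError:
    # pandas refuses to compare frames that are not identically labeled.
    raised = True

# After aligning b to a's labels, the element-wise comparison works.
aligned_equal = bool((a == b.reindex_like(a)).all().all())
print(raised, aligned_equal)
```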

If I understood you correctly, you don't want to find changes but the symmetric difference. For that, one approach is to concatenate the dataframes:

>>> df = pd.concat([df1, df2])
>>> df = df.reset_index(drop=True)

Group by all columns:

>>> df_gpby = df.groupby(list(df.columns))

Get the index of the unique records:

>>> idx = [x[0] for x in df_gpby.groups.values() if len(x) == 1]

Filter:

>>> df.reindex(idx)
         Date   Fruit   Num   Color
9  2013-11-25  Orange   8.6  Orange
8  2013-11-25   Apple  22.1     Red
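Put together, the steps might look like the sketch below. The two small frames are hypothetical stand-ins for the question's data and contain no NaN values (groupby drops NaN keys by default, which would silently lose rows).

```python
import pandas as pd

# Hypothetical input frames with one row in common.
df1 = pd.DataFrame({"Fruit": ["Apple", "Banana"], "Num": [1, 2]})
df2 = pd.DataFrame({"Fruit": ["Apple", "Cherry"], "Num": [1, 3]})

# Stack the frames and give every row a unique integer label.
df = pd.concat([df1, df2]).reset_index(drop=True)

# Group identical rows together; groups of size 1 occur in only one frame.
groups = df.groupby(list(df.columns)).groups
idx = sorted(labels[0] for labels in groups.values() if len(labels) == 1)

result = df.reindex(idx)
print(result)
```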

Pandas better method to compare two dataframes and find entries that only exist in one

Looks like passing how='outer' together with indicator=True was the solution:

z = pd.merge(ORIGINAL, NEW, on=cols, how = 'outer', indicator=True)
z = z[z._merge != 'both'] # Filter out records from both

The output looks like this (showing only the columns I care about):

  Name       Site   _merge
  Charlie    A     left_only
  Doug       B     right_only
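Here is a self-contained sketch of the same idea. The frames `original` and `new` and their contents are invented for illustration; only `pd.merge` with `how='outer'` and `indicator=True` comes from the answer above.

```python
import pandas as pd

# Hypothetical stand-ins for ORIGINAL and NEW.
original = pd.DataFrame({"Name": ["Alice", "Charlie"], "Site": ["A", "A"]})
new = pd.DataFrame({"Name": ["Alice", "Doug"], "Site": ["A", "B"]})

cols = ["Name", "Site"]

# The outer merge keeps every row; _merge records where each row came from.
z = pd.merge(original, new, on=cols, how="outer", indicator=True)

# Keep only the rows that exist on exactly one side.
only_one_side = z[z["_merge"] != "both"]
print(only_one_side)
```

The `_merge` column is categorical with the values `left_only`, `right_only` and `both`, so it can be filtered with ordinary string comparisons.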

How to find differences between two dataframes of different lengths?

You could define several helper functions to adjust the lengths and widths of the two dataframes:

import numpy as np
import pandas as pd

def equalize_length(short, long):
    return pd.concat(
        [
            short,
            pd.DataFrame(
                {
                    col: ["nan"] * (long.shape[0] - short.shape[0])
                    for col in short.columns
                }
            ),
        ]
    ).reset_index(drop=True)

def equalize_width(short, long):
    return pd.concat(
        [
            short,
            pd.DataFrame({col: [] for col in long.columns if col not in short.columns}),
        ],
        axis=1,
    ).reset_index(drop=True)

def equalize(df, other_df):
    if df.shape[0] <= other_df.shape[0]:
        df = equalize_length(df, other_df)
    else:
        other_df = equalize_length(other_df, df)
    if df.shape[1] <= other_df.shape[1]:
        df = equalize_width(df, other_df)
    else:
        other_df = equalize_width(other_df, df)
    df = df.fillna("nan")
    other_df = other_df.fillna("nan")
    return df, other_df

And then, in your code:

a, b = equalize(a, b)

comparevalues = a.values == b.values

rows, cols = np.where(~comparevalues)

for item in zip(rows, cols):
    a.iloc[item[0], item[1]] = " {} --> {} ".format(
        a.iloc[item[0], item[1]], b.iloc[item[0], item[1]]
    )

print(a)  # with 'a' being shorter in length but longer in width than 'b'
# Output
             A                      B                              C                          D
0            1          abcd --> dah                           jamba        OQEWINVSKD --> nan
1            2         efgh --> fupa              refresh --> dimez        DKVLNQIOEVM --> nan
2            3                   ijkl    portobello --> pocketfresh           asdlikvn --> nan
3            4       uhyee --> danju    performancehigh --> reverbb    asdkvnddvfvfkdd --> nan
4            5                   uhuh                      jackalack                        nan
5   nan --> 6    nan --> freshhhhhhh    nan --> boombackimmatouchit                         nan
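As an alternative to hand-written padding helpers (this is a different technique, not the one shown above), pandas can bring two frames to a common shape itself with `DataFrame.align(join='outer')`. A rough sketch with made-up numeric frames, followed by `DataFrame.compare` to list the differing cells:

```python
import pandas as pd

# Hypothetical frames: 'a' is shorter but wider than 'b'.
a = pd.DataFrame({"A": [1, 2], "B": [10, 20]})
b = pd.DataFrame({"A": [1, 5, 3]})

# Reindex both frames to the union of their row and column labels;
# cells that did not exist before are filled with NaN.
a2, b2 = a.align(b, join="outer")

# The aligned frames are identically labeled, so compare() accepts them.
diff = a2.compare(b2)
print(diff)
```

Note that this aligns by label rather than by position, so it only matches the helper-based approach when both frames use comparable index and column labels.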

Diff of two Dataframes

Merge the two dfs with how='outer' and pass indicator=True. The resulting _merge column tells you whether each row is present in both frames, left only, or right only, and you can then filter the merged df on it:

In [22]:
merged = df1.merge(df2, indicator=True, how='outer')
merged[merged['_merge'] == 'right_only']

Out[22]:
  Buyer  Quantity      _merge
3  Carl         2  right_only
4  Mark         1  right_only

Compare two DataFrames and output a new DataFrame with the different index

Using drop_duplicates

import pandas as pd
dataA = {'Name':['Jony', 'Mike', 'Joanna'], 'Color':['Blue', 'Red', 'Green']}
dataB = {'Name':['Jony', 'Mike'], 'Color':['Blue', 'Red']}

dfA = pd.DataFrame(dataA)
dfB = pd.DataFrame(dataB)

df = pd.concat([dfA, dfB]).drop_duplicates(keep=False, ignore_index=True)
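For reference, here is what that produces, plus one caveat worth knowing. The frame `dfC` is a hypothetical addition to show the caveat: because keep=False drops every copy of a duplicated row, a row that is repeated inside a single frame disappears from the result even though the other frame never contained it.

```python
import pandas as pd

dfA = pd.DataFrame({"Name": ["Jony", "Mike", "Joanna"], "Color": ["Blue", "Red", "Green"]})
dfB = pd.DataFrame({"Name": ["Jony", "Mike"], "Color": ["Blue", "Red"]})

# Rows present in both frames cancel out; only Joanna remains.
diff = pd.concat([dfA, dfB]).drop_duplicates(keep=False, ignore_index=True)
print(diff)

# Caveat (hypothetical data): "Ann" is only in dfC, but appears twice there,
# so both copies are dropped and she is missing from the difference.
dfC = pd.DataFrame({"Name": ["Ann", "Ann"], "Color": ["Pink", "Pink"]})
missed = pd.concat([dfC, dfB]).drop_duplicates(keep=False, ignore_index=True)
print(missed)
```

If that matters for your data, one option is to call drop_duplicates() on each frame separately before concatenating.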

How to compare 2 non-identical dataframes in python

Since your goal is just to compare differences, use DataFrame.compare instead of aggregating into strings.

However,

DataFrame.compare can only compare identically-labeled (i.e. same shape, identical row and column labels) DataFrames

So we just need to align the row/column indexes, either via merge or reindex.
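To make the restriction concrete, here is a minimal sketch with two made-up frames. Comparing identically labeled frames works and returns only the differing cells, while comparing against a frame with fewer rows raises a ValueError.

```python
import pandas as pd

# Hypothetical frames with identical row and column labels.
left = pd.DataFrame({"num": [3, 4], "name": ["linda", "james"]})
right = pd.DataFrame({"num": [3, 6], "name": ["linda", "james"]})

# Only row 1 / column "num" differs, so that is all compare() returns.
diff = left.compare(right)
print(diff)

# A frame with different row labels cannot be compared directly.
try:
    left.compare(right.iloc[:1])
    raised = False
except ValueError:
    raised = True
print(raised)
```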

Align via `merge`

Outer-merge the two dfs:

merged = df1.merge(df2, how='outer', left_on='col_id', right_on='id')
#    col_id  num  name_x  id  no  name_y
# 0       1    3   linda   1   2  granpa
# 1       2    4   james   2   6   linda
# 2     NaN  NaN     NaN   3   7     sam

Divide the merged frame into left/right frames and align their columns with set_axis:

cols = df1.columns
left = merged.iloc[:, :len(cols)].set_axis(cols, axis=1)
#    col_id  num    name
# 0       1    3   linda
# 1       2    4   james
# 2     NaN  NaN     NaN

right = merged.iloc[:, len(cols):].set_axis(cols, axis=1)
#    col_id  num    name
# 0       1    2  granpa
# 1       2    6   linda
# 2       3    7     sam

Compare the aligned left/right frames (use keep_equal=True to show equal cells):

left.compare(right, keep_shape=True, keep_equal=True)
#        col_id         num          name
#    self other  self other   self  other
# 0     1     1     3     2  linda granpa
# 1     2     2     4     6  james  linda
# 2   NaN     3   NaN     7    NaN    sam

left.compare(right, keep_shape=True)
#        col_id         num          name
#    self other  self other   self  other
# 0   NaN   NaN     3     2  linda granpa
# 1   NaN   NaN     4     6  james  linda
# 2   NaN     3   NaN     7    NaN    sam

Align via `reindex`

If you are 100% sure that one df is a subset of the other, then reindex the subsetted rows.

In your example, df1 is a subset of df2, so reindex df1:

(
    df1.assign(id=df1.col_id)          # copy col_id (we need original col_id after reindexing)
    .set_index('id')                   # set index to copied id
    .reindex(df2.id)                   # reindex against df2's id
    .reset_index(drop=True)            # remove copied id
    .set_axis(df2.columns, axis=1)     # align column names
    .compare(df2, keep_equal=True, keep_shape=True)
)

#        col_id         num          name
#    self other  self other   self  other
# 0     1     1     3     2  linda granpa
# 1     2     2     4     6  james  linda
# 2   NaN     3   NaN     7    NaN    sam

Nullable integers

Normally int cannot be mixed with NaN, so pandas converts the column to float. To keep the int values as int (as in the examples above), ideally we'd convert the int columns to nullable integers with astype('Int64') (capital I).
However, at the time of writing there was a comparison bug with Int64, so astype(object) was used instead.
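A short sketch of the dtype behaviour described here, using a made-up Series. It only shows what happens to the dtype when a missing row is introduced by reindexing; it does not try to reproduce the comparison bug mentioned above.

```python
import pandas as pd

# Hypothetical integer column; label 3 does not exist yet.
s = pd.Series([3, 4], index=[1, 2])

# Plain int64 cannot hold a missing value, so the result becomes float64.
as_float = s.reindex([1, 2, 3])

# The nullable Int64 dtype keeps integers and marks the gap as <NA>.
as_nullable = s.astype("Int64").reindex([1, 2, 3])

# object dtype also keeps the integers, with NaN for the missing row.
as_object = s.astype(object).reindex([1, 2, 3])

print(as_float.dtype, as_nullable.dtype, as_object.dtype)
```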

Find common data between two dataframes within a specific date range

You can define a helper function that builds dataframes from your dictionaries and slices them to a given date range:

def format(dictionary, start, end):
    """Helper function.

    Args:
        dictionary: dictionary to format.
        start: start date (DD/MM/YY).
        end: end date (DD/MM/YY).

    Returns:
        Dataframe.

    """
    return (
        pd.DataFrame(dictionary)
        .pipe(lambda df_: df_.assign(date=pd.to_datetime(df_["date"], format="%d/%m/%y")))
        .pipe(
            lambda df_: df_.loc[
                (df_["date"] >= pd.to_datetime(start, format="%d/%m/%y"))
                & (df_["date"] <= pd.to_datetime(end, format="%d/%m/%y")),
                :,
            ]
        ).reset_index(drop=True)
    )

Then, with the dictionaries you provided, here is how you can show the "id_number" values of df2 that are in df1 for the desired date range:

df1 = format(data1, "05/09/22", "10/09/22")
df2 = format(data2, "05/09/22", "10/09/22")

print(df2[df2["id_number"].isin(df1["id_number"])]["id_number"])
# Output
0      AA576bdk89
1    GG6jabkhd589
2     BXV6jabd589
Name: id_number, dtype: object
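Since the original dictionaries are not reproduced here, the following is a runnable sketch of the same idea with invented data. The helper name `in_range` and the sample ids are assumptions for illustration; it uses `Series.between` for the inclusive date filter instead of two separate comparisons.

```python
import pandas as pd

# Hypothetical input dictionaries (dates in DD/MM/YY).
data1 = {"date": ["04/09/22", "06/09/22"], "id_number": ["X1", "X2"]}
data2 = {"date": ["07/09/22", "08/09/22"], "id_number": ["X2", "X3"]}

def in_range(data, start, end, fmt="%d/%m/%y"):
    """Build a dataframe and keep rows whose date lies in [start, end]."""
    df = pd.DataFrame(data)
    df["date"] = pd.to_datetime(df["date"], format=fmt)
    keep = df["date"].between(
        pd.to_datetime(start, format=fmt), pd.to_datetime(end, format=fmt)
    )
    return df.loc[keep].reset_index(drop=True)

df1 = in_range(data1, "05/09/22", "10/09/22")
df2 = in_range(data2, "05/09/22", "10/09/22")

# id_number values of df2 that also appear in df1 within the date range.
common = df2.loc[df2["id_number"].isin(df1["id_number"]), "id_number"].tolist()
print(common)
```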
