Compare two DataFrames and output their differences side-by-side
The first part is similar to Constantine's answer: you can get a boolean Series marking which rows have changed*:
In [21]: ne = (df1 != df2).any(axis=1)
In [22]: ne
Out[22]:
0 False
1 True
2 True
dtype: bool
Then we can see which entries have changed:
In [23]: ne_stacked = (df1 != df2).stack()
In [24]: changed = ne_stacked[ne_stacked]
In [25]: changed.index.names = ['id', 'col']
In [26]: changed
Out[26]:
id col
1 score True
2 isEnrolled True
Comment True
dtype: bool
Here the first level is the index label and the second is the column that has changed.
In [27]: difference_locations = np.where(df1 != df2)
In [28]: changed_from = df1.values[difference_locations]
In [29]: changed_to = df2.values[difference_locations]
In [30]: pd.DataFrame({'from': changed_from, 'to': changed_to}, index=changed.index)
Out[30]:
from to
id col
1 score 1.11 1.21
2 isEnrolled True False
Comment None On vacation
* Note: it's important that df1 and df2 share the same index here. To overcome this ambiguity, you can ensure you only look at the shared labels using df1.index.intersection(df2.index) (older pandas versions also accepted df1.index & df2.index), but I think I'll leave that as an exercise.
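For anyone who wants the exercise spelled out, here is a minimal sketch of restricting both frames to their shared labels before running the steps above. The sample frames and the names shared_rows, shared_cols, left and right are made up for illustration; they are not from the question.

```python
import numpy as np
import pandas as pd

# Hypothetical sample frames: the row labelled 3 exists only in df2.
df1 = pd.DataFrame({"score": [1.11, 2.22], "grade": ["A", "B"]}, index=[1, 2])
df2 = pd.DataFrame(
    {"score": [1.21, 2.22, 3.33], "grade": ["A", "C", "D"]}, index=[1, 2, 3]
)

# Keep only the row and column labels present in both frames.
shared_rows = df1.index.intersection(df2.index)
shared_cols = df1.columns.intersection(df2.columns)
left = df1.loc[shared_rows, shared_cols]
right = df2.loc[shared_rows, shared_cols]

# Same steps as above, now on identically labelled frames.
rows, cols = np.where(left != right)
changes = pd.DataFrame(
    {"from": left.values[rows, cols], "to": right.values[rows, cols]},
    index=pd.MultiIndex.from_arrays(
        [left.index[rows], left.columns[cols]], names=["id", "col"]
    ),
)
print(changes)
```

Rows that exist in only one frame are silently ignored by this approach, so it answers "what changed in the rows both frames have", not "which rows were added or removed".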
Comparing two dataframes and getting the differences
This approach, df1 != df2, works only for dataframes with identical rows and columns. In fact, all dataframe axes are compared with the _indexed_same method, and an exception is raised if any difference is found, even in the order of the columns or index labels.
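A small sketch of that behaviour, using two made-up frames that hold the same data but list their index labels in a different order. Sorting both frames first is one possible workaround, assuming the labels really are the same set.

```python
import pandas as pd

# Hypothetical frames: same labels and values, different row order.
df1 = pd.DataFrame({"a": [1, 2]}, index=[0, 1])
df2 = pd.DataFrame({"a": [2, 1]}, index=[1, 0])

try:
    df1 != df2
    raised = False
except ValueError as exc:
    # pandas refuses to compare frames whose labels are not identical.
    raised = True
    print(exc)

# After putting the labels in the same order, the comparison works.
ne = df1.sort_index() != df2.sort_index()
print(ne)
```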
If I understood you correctly, you don't want to find changes but the symmetric difference. For that, one approach is to concatenate the dataframes:
>>> df = pd.concat([df1, df2])
>>> df = df.reset_index(drop=True)
group by
>>> df_gpby = df.groupby(list(df.columns))
get index of unique records
>>> idx = [x[0] for x in df_gpby.groups.values() if len(x) == 1]
filter
>>> df.reindex(idx)
Date Fruit Num Color
9 2013-11-25 Orange 8.6 Orange
8 2013-11-25 Apple 22.1 Red
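Since the question's data is not reproduced here, this is a self-contained sketch of the same concat / group by / filter steps on two small made-up frames. One thing to keep in mind: groupby drops rows whose key contains NaN by default, so rows with missing values would not show up in the result.

```python
import pandas as pd

# Hypothetical frames: "Banana" is in both, "Apple" and "Orange" in one each.
df1 = pd.DataFrame({"Fruit": ["Apple", "Banana"], "Num": [22.1, 8.6]})
df2 = pd.DataFrame({"Fruit": ["Banana", "Orange"], "Num": [8.6, 8.6]})

df = pd.concat([df1, df2]).reset_index(drop=True)

# Group identical rows together; each group maps to the row labels it contains.
groups = df.groupby(list(df.columns)).groups

# A group with a single label is a row that appears in only one frame.
idx = sorted(labels[0] for labels in groups.values() if len(labels) == 1)

only_once = df.loc[idx]
print(only_once)
```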
Pandas better method to compare two dataframes and find entries that only exist in one
Looks like using 'outer' as the how was the solution:
z = pd.merge(ORIGINAL, NEW, on=cols, how = 'outer', indicator=True)
z = z[z._merge != 'both'] # Filter out records from both
Output looks like this (after only showing the columns I care about)
Name Site _merge
Charlie A left_only
Doug B right_only
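Here is a runnable sketch of the same idea. The two frames and the column list are invented to match the shape of the output above; the real ORIGINAL and NEW frames are not shown in the question.

```python
import pandas as pd

# Hypothetical stand-ins for ORIGINAL and NEW.
original = pd.DataFrame(
    {"Name": ["Alice", "Bob", "Charlie"], "Site": ["A", "B", "A"]}
)
new = pd.DataFrame({"Name": ["Alice", "Bob", "Doug"], "Site": ["A", "B", "B"]})
cols = ["Name", "Site"]

# indicator=True adds a _merge column: 'both', 'left_only' or 'right_only'.
z = pd.merge(original, new, on=cols, how="outer", indicator=True)
only_one = z[z["_merge"] != "both"]

# Split the result if you need to know which side a row came from.
removed = z.loc[z["_merge"] == "left_only", cols]
added = z.loc[z["_merge"] == "right_only", cols]
print(only_one)
```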
How to find differences between two dataframes of different lengths?
You could define several helper functions to adjust the length and widths of the two dataframes:
def equalize_length(short, long):
    return pd.concat(
        [
            short,
            pd.DataFrame(
                {
                    col: ["nan"] * (long.shape[0] - short.shape[0])
                    for col in short.columns
                }
            ),
        ]
    ).reset_index(drop=True)

def equalize_width(short, long):
    return pd.concat(
        [
            short,
            pd.DataFrame({col: [] for col in long.columns if col not in short.columns}),
        ],
        axis=1,
    ).reset_index(drop=True)

def equalize(df, other_df):
    if df.shape[0] <= other_df.shape[0]:
        df = equalize_length(df, other_df)
    else:
        other_df = equalize_length(other_df, df)
    if df.shape[1] <= other_df.shape[1]:
        df = equalize_width(df, other_df)
    else:
        other_df = equalize_width(other_df, df)
    df = df.fillna("nan")
    other_df = other_df.fillna("nan")
    return df, other_df
And then, in your code:
a, b = equalize(a, b)
comparevalues = a.values == b.values
rows, cols = np.where(~comparevalues)  # positions where the two frames differ
for row, col in zip(rows, cols):
    a.iloc[row, col] = " {} --> {} ".format(a.iloc[row, col], b.iloc[row, col])
print(a)  # with 'a' being shorter in length but longer in width than 'b'
# Output
A B C D
0 1 abcd --> dah jamba OQEWINVSKD --> nan
1 2 efgh --> fupa refresh --> dimez DKVLNQIOEVM --> nan
2 3 ijkl portobello --> pocketfresh asdlikvn --> nan
3 4 uhyee --> danju performancehigh --> reverbb asdkvnddvfvfkdd --> nan
4 5 uhuh jackalack nan
5 nan --> 6 nan --> freshhhhhhh nan --> boombackimmatouchit nan
Diff of two Dataframes
Merge the 2 dfs using method 'outer' and pass param indicator=True. This will tell you whether the rows are present in both/left only/right only; you can then filter the merged df afterwards:
In [22]:
merged = df1.merge(df2, indicator=True, how='outer')
merged[merged['_merge'] == 'right_only']
Out[22]:
Buyer Quantity _merge
3 Carl 2 right_only
4 Mark 1 right_only
Compare two DataFrames and output a new DataFrame with the different index
Using drop_duplicates
import pandas as pd
dataA = {'Name':['Jony', 'Mike', 'Joanna'], 'Color':['Blue', 'Red', 'Green']}
dataB = {'Name':['Jony', 'Mike'], 'Color':['Blue', 'Red']}
dfA = pd.DataFrame(dataA)
dfB = pd.DataFrame(dataB)
df = pd.concat([dfA, dfB]).drop_duplicates(keep=False, ignore_index=True)
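One caveat worth knowing: keep=False also drops rows that are repeated within a single frame, even if the other frame never contains them. A small sketch with a made-up variant of dfA where 'Joanna' appears twice, plus one possible workaround (de-duplicating each frame first):

```python
import pandas as pd

# Hypothetical variant of dfA with a row repeated inside the same frame.
dfA = pd.DataFrame(
    {
        "Name": ["Jony", "Mike", "Joanna", "Joanna"],
        "Color": ["Blue", "Red", "Green", "Green"],
    }
)
dfB = pd.DataFrame({"Name": ["Jony", "Mike"], "Color": ["Blue", "Red"]})

# Joanna is lost here because she is a duplicate within dfA itself.
naive = pd.concat([dfA, dfB]).drop_duplicates(keep=False, ignore_index=True)

# De-duplicate each frame on its own before looking for unmatched rows.
safe = pd.concat([dfA.drop_duplicates(), dfB.drop_duplicates()]).drop_duplicates(
    keep=False, ignore_index=True
)
print(safe)
```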
How to compare 2 non-identical dataframes in python
Since your goal is just to compare differences, use DataFrame.compare instead of aggregating into strings.
However, DataFrame.compare can only compare identically-labeled (i.e. same shape, identical row and column labels) DataFrames.
So we just need to align the row/column indexes, either via merge or reindex.
Align via merge
Outer-merge the two dfs:
merged = df1.merge(df2, how='outer', left_on='col_id', right_on='id')
# col_id num name_x id no name_y
# 0 1 3 linda 1 2 granpa
# 1 2 4 james 2 6 linda
# 2 NaN NaN NaN 3 7 sam
Divide the merged frame into left/right frames and align their columns with set_axis:
cols = df1.columns
left = merged.iloc[:, :len(cols)].set_axis(cols, axis=1)
# col_id num name
# 0 1 3 linda
# 1 2 4 james
# 2 NaN NaN NaN
right = merged.iloc[:, len(cols):].set_axis(cols, axis=1)
# col_id num name
# 0 1 2 granpa
# 1 2 6 linda
# 2 3 7 sam
Compare the aligned left/right frames (use keep_equal=True to show equal cells):
left.compare(right, keep_shape=True, keep_equal=True)
# col_id num name
# self other self other self other
# 0 1 1 3 2 linda granpa
# 1 2 2 4 6 james linda
# 2 NaN 3 NaN 7 NaN sam
left.compare(right, keep_shape=True)
# col_id num name
# self other self other self other
# 0 NaN NaN 3 2 linda granpa
# 1 NaN NaN 4 6 james linda
# 2 NaN 3 NaN 7 NaN sam
Align via reindex
If you are 100% sure that one df is a subset of the other, then reindex the subsetted rows.
In your example, df1 is a subset of df2, so reindex df1:
(
    df1.assign(id=df1.col_id)        # copy col_id (we need original col_id after reindexing)
    .set_index('id')                 # set index to copied id
    .reindex(df2.id)                 # reindex against df2's id
    .reset_index(drop=True)          # remove copied id
    .set_axis(df2.columns, axis=1)   # align column names
    .compare(df2, keep_equal=True, keep_shape=True)
)
# col_id num name
# self other self other self other
# 0 1 1 3 2 linda granpa
# 1 2 2 4 6 james linda
# 2 NaN 3 NaN 7 NaN sam
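If the two frames already share a key you can put in the index, DataFrame.align is another way to get identically-labeled frames before calling compare. This is a different technique from the merge/reindex steps above, shown as a rough sketch on made-up frames that already use the same column names:

```python
import pandas as pd

# Hypothetical frames keyed by id; df1 lacks the row with id 3.
df1 = pd.DataFrame(
    {"num": [3, 4], "name": ["linda", "james"]},
    index=pd.Index([1, 2], name="id"),
)
df2 = pd.DataFrame(
    {"num": [2, 6, 7], "name": ["granpa", "linda", "sam"]},
    index=pd.Index([1, 2, 3], name="id"),
)

# Outer-align rows and columns; labels missing on one side are filled with NaN.
left, right = df1.align(df2, join="outer")

# Both frames now carry identical labels, so compare() accepts them.
diff = left.compare(right)
print(diff)
```

Note that the 'num' column of left becomes float after alignment because of the inserted NaN.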
Nullable integers
Normally int cannot mix with nan, so pandas converts to float. To keep the int values as int (like the examples above):
- Ideally we'd convert the int columns to nullable integers with astype('Int64') (capital I).
- However, there is currently a comparison bug with Int64, so just use astype(object) for now.
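A quick sketch of the dtype behaviour described in the two points above, on a made-up Series. It only shows how the dtypes react to a missing value; whether the Int64 comparison bug still applies depends on your pandas version and is not checked here.

```python
import pandas as pd

s = pd.Series([3, 4], index=[0, 1])

# Reindexing introduces a missing value, so plain int64 is upcast to float64.
reindexed = s.reindex([0, 1, 2])
print(reindexed.dtype)

# The nullable Int64 dtype keeps integers and stores the gap as <NA>.
nullable = s.astype("Int64").reindex([0, 1, 2])
print(nullable.dtype)

# object dtype also keeps the existing values as ints, with NaN for the gap.
as_object = s.astype(object).reindex([0, 1, 2])
print(as_object.dtype)
```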
find common data between two dataframes on a specific range of date
You can define a helper function to make dataframes of your dictionaries and slice them on certain date range:
def format(dictionary, start, end):
    """Helper function.
    Args:
        dictionary: dictionary to format.
        start: start date (DD/MM/YY).
        end: end date (DD/MM/YY).
    Returns:
        Dataframe.
    """
    return (
        pd.DataFrame(dictionary)
        .pipe(lambda df_: df_.assign(date=pd.to_datetime(df_["date"], format="%d/%m/%y")))
        .pipe(
            lambda df_: df_.loc[
                (df_["date"] >= pd.to_datetime(start, format="%d/%m/%y"))
                & (df_["date"] <= pd.to_datetime(end, format="%d/%m/%y")),
                :,
            ]
        ).reset_index(drop=True)
    )
Then, with the dictionaries you provided, here is how you can show the "id_number" values of df2 that are also in df1 for the desired date range:
df1 = format(data1, "05/09/22", "10/09/22")
df2 = format(data2, "05/09/22", "10/09/22")
print(df2[df2["id_number"].isin(df1["id_number"])]["id_number"])
# Output
0 AA576bdk89
1 GG6jabkhd589
2 BXV6jabd589
Name: id_number, dtype: object
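The question's dictionaries are not reproduced here, so this is a self-contained sketch of the same idea on invented data. It uses Series.between for the date filter instead of the two explicit comparisons, and the helper name in_range is made up so it does not shadow the built-in format.

```python
import pandas as pd

# Hypothetical dictionaries standing in for data1 and data2.
data1 = {
    "date": ["04/09/22", "06/09/22", "08/09/22"],
    "id_number": ["X1", "AA1", "BB2"],
}
data2 = {
    "date": ["06/09/22", "07/09/22", "12/09/22"],
    "id_number": ["AA1", "CC3", "BB2"],
}

def in_range(data, start, end, fmt="%d/%m/%y"):
    """Build a frame and keep rows whose date lies in [start, end]."""
    df = pd.DataFrame(data)
    df["date"] = pd.to_datetime(df["date"], format=fmt)
    mask = df["date"].between(
        pd.to_datetime(start, format=fmt), pd.to_datetime(end, format=fmt)
    )
    return df[mask].reset_index(drop=True)

df1 = in_range(data1, "05/09/22", "10/09/22")
df2 = in_range(data2, "05/09/22", "10/09/22")

# id_number values of df2 that also occur in df1 within the date range.
common = df2.loc[df2["id_number"].isin(df1["id_number"]), "id_number"]
print(common)
```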