set difference for pandas
from pandas import DataFrame
df1 = DataFrame({'col1':[1,2,3], 'col2':[2,3,4]})
df2 = DataFrame({'col1':[4,2,5], 'col2':[6,3,5]})
print(df2[~df2.isin(df1).all(axis=1)])
print(df2[(df2!=df1)].dropna(how='all'))
print(df2[~(df2==df1)].dropna(how='all'))
Computing Set Difference in Pandas between two dataframes
Do a left join with indicator=True, which records the origin of each row; then filter on the indicator column:
df1.merge(df2, indicator=True, how="left")[lambda x: x['_merge'] == 'left_only'].drop(columns='_merge')
#State City Population
#0 NY Albany 856654
#2 SC Charleston 35323
#4 WV Charleston 34523
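The left join above can be wrapped in a small helper. This is only a sketch: the helper name rows_only_in_left is made up, and the Boston and Buffalo rows (including their populations) are invented so that the example runs; only rows 0, 2 and 4 come from the output shown above.

```python
import pandas as pd

# Rows 0, 2 and 4 follow the output above; rows 1 and 3 are hypothetical filler.
df1 = pd.DataFrame({
    "State": ["NY", "MA", "SC", "NY", "WV"],
    "City": ["Albany", "Boston", "Charleston", "Buffalo", "Charleston"],
    "Population": [856654, 623000, 35323, 258000, 34523],
})
df2 = pd.DataFrame({
    "State": ["MA", "NY"],
    "City": ["Boston", "Buffalo"],
    "Population": [623000, 258000],
})


def rows_only_in_left(left, right):
    """Return the rows of `left` that have no exact match in `right`."""
    # The join uses every column the two frames share.
    merged = left.merge(right, how="left", indicator=True)
    return merged[merged["_merge"] == "left_only"].drop(columns="_merge")


result = rows_only_in_left(df1, df2)
print(result)
```

Note that the merge builds a fresh integer index, so the labels in the result match df1 only when df1 has a default index and no row matches more than once.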
Set difference of two dataframes in Pandas
You can try hashing the rows and then checking which hashes are missing from the other dataframe.
Example:
df1['match'] = df1.apply(lambda x: hash(tuple(x)), axis=1)
df2['match'] = df2.apply(lambda x: hash(tuple(x)), axis=1)
df_diff = df1[~df1['match'].isin(df2['match'])]
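A variant of the same idea that does not add a helper column to either dataframe might look like the sketch below. It swaps the Python hash(tuple(x)) call for pd.util.hash_pandas_object, and the function name hash_difference is made up. It assumes both dataframes have the same columns, in the same order, with the same dtypes; as with any hash, collisions are possible in principle.

```python
import pandas as pd

df1 = pd.DataFrame({"col1": [1, 2, 3], "col2": [2, 3, 4]})
df2 = pd.DataFrame({"col1": [4, 2, 5], "col2": [6, 3, 5]})


def hash_difference(left, right):
    """Rows of `left` whose row hash does not occur among the rows of `right`."""
    # index=False hashes only the row values, not the index labels.
    left_hash = pd.util.hash_pandas_object(left, index=False)
    right_hash = pd.util.hash_pandas_object(right, index=False)
    return left[~left_hash.isin(right_hash)]


df_diff = hash_difference(df1, df2)
print(df_diff)
```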
Find the difference (set difference) between two dataframes in python
try:
df1[~df1.isin(df2)]
A,B,C,D
How to find the set difference between two Pandas DataFrames
The results are correct; however, setdiff1d is order dependent. It only checks for elements in the first input array that do not occur in the second array.
If you do not care which of the dataframes has the unique columns, you can use setxor1d. It returns "the unique values that are in only one (not both) of the input arrays"; see the documentation.
import numpy as np
colsA = ['a', 'b', 'c', 'd']
colsB = ['b','c']
c = np.setxor1d(colsA, colsB)
This returns an array containing 'a' and 'd'.
If you want to use setdiff1d, you need to check for differences both ways:
# columns in train.columns that are not in train_1.columns
c1 = np.setdiff1d(train.columns, train_1.columns)
# columns in train_1.columns that are not in train.columns
c2 = np.setdiff1d(train_1.columns, train.columns)
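To see the order dependence side by side, here is a small sketch. The extra column 'e' in cols_b is an addition for illustration only, so that the second direction is not empty.

```python
import numpy as np

cols_a = ["a", "b", "c", "d"]
cols_b = ["b", "c", "e"]  # 'e' is hypothetical, added so both directions differ

only_in_a = np.setdiff1d(cols_a, cols_b)  # in cols_a but not in cols_b
only_in_b = np.setdiff1d(cols_b, cols_a)  # in cols_b but not in cols_a
in_one_only = np.setxor1d(cols_a, cols_b)  # in exactly one of the two

print(only_in_a, only_in_b, in_one_only)
```

The two setdiff1d results together contain the same values as the single setxor1d result.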
Comparing two dataframes and getting the differences
This approach, df1 != df2, works only for dataframes with identical row and column labels. In fact, all dataframe axes are compared with the _indexed_same method, and an exception is raised if any difference is found, even in the order of the columns or indices.
If I understood you correctly, you want to find not the changes but the symmetric difference. One approach is to concatenate the dataframes:
>>> df = pd.concat([df1, df2])
>>> df = df.reset_index(drop=True)
group by
>>> df_gpby = df.groupby(list(df.columns))
get index of unique records
>>> idx = [x[0] for x in df_gpby.groups.values() if len(x) == 1]
filter
>>> df.reindex(idx)
Date Fruit Num Color
9 2013-11-25 Orange 8.6 Orange
8 2013-11-25 Apple 22.1 Red
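The concat, group by, and filter steps above can be put into one function. This is a sketch under a few assumptions: the function name symmetric_difference is made up, and the sample rows are hypothetical data shaped like the output above. Rows containing NaN are skipped by groupby by default, and a row that is duplicated inside a single dataframe is also treated as "not unique".

```python
import pandas as pd


def symmetric_difference(left, right):
    """Rows that occur exactly once across both dataframes."""
    combined = pd.concat([left, right]).reset_index(drop=True)
    # Group identical rows together; each group maps to its index labels.
    groups = combined.groupby(list(combined.columns)).groups
    keep = sorted(labels[0] for labels in groups.values() if len(labels) == 1)
    return combined.loc[keep]


# Hypothetical sample data modelled on the output shown above.
left = pd.DataFrame({
    "Date": ["2013-11-24", "2013-11-24"],
    "Fruit": ["Banana", "Orange"],
    "Num": [22.1, 8.6],
    "Color": ["Yellow", "Orange"],
})
extra = pd.DataFrame({
    "Date": ["2013-11-25", "2013-11-25"],
    "Fruit": ["Apple", "Orange"],
    "Num": [22.1, 8.6],
    "Color": ["Red", "Orange"],
})
right = pd.concat([left, extra], ignore_index=True)

result = symmetric_difference(left, right)
print(result)
```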
Pandas: set difference by group
Using set, groupby, apply, and shift.
- For efficiency:
  - Convert members to set type, because - is an unsupported operand for lists and will cause a TypeError.
  - Leave additions and deletions as set type.
Using apply
- With a dataframe of 60000 rows:
91.4 ms ± 2.77 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
# clean the members column
df.members = df.members.str.replace(' ', '').str.split(',').map(set)
# create del and add
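# group_keys=False keeps the original index so the result aligns with df
df['deletions'] = df.groupby('team', group_keys=False)['members'].apply(lambda x: x.shift() - x)
df['additions'] = df.groupby('team', group_keys=False)['members'].apply(lambda x: x - x.shift())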
# result
month team members additions deletions
0 A {Z, X, Y} NaN NaN
1 A {X, Y} {} {Z}
2 A {W, X, Y} {W} {}
0 B {D, E} NaN NaN
1 B {D, F, E} {F} {}
2 B {F} {} {D, E}
More efficiently: pandas.DataFrame.diff
- With a dataframe of 60000 rows:
60.7 ms ± 3.54 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
df['deletions'] = df.groupby('team')['members'].diff(periods=-1).shift()
df['additions'] = df.groupby('team')['members'].diff()
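If you want to check either version against something explicit, a plain loop over each group gives the same additions and deletions. This is a slow reference sketch, not a replacement for the timings above; the sample data is rebuilt from the result table, and the raw comma-separated strings are an assumption about how the members column looked before cleaning.

```python
import pandas as pd

df = pd.DataFrame({
    "month": [0, 1, 2, 0, 1, 2],
    "team": ["A", "A", "A", "B", "B", "B"],
    "members": ["X, Y, Z", "X, Y", "W, X, Y", "D, E", "D, E, F", "F"],
})
# Same cleaning step as above: split the string and convert to a set.
df["members"] = df["members"].str.replace(" ", "").str.split(",").map(set)

additions, deletions = {}, {}
for _, group in df.groupby("team"):
    previous = None
    for idx, current in group["members"].items():
        if previous is not None:
            additions[idx] = current - previous  # new in this month
            deletions[idx] = previous - current  # gone since last month
        previous = current

# The first row of each team has no predecessor and stays NaN after alignment.
df["additions"] = pd.Series(additions)
df["deletions"] = pd.Series(deletions)
print(df)
```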
Find difference between two data frames
By using drop_duplicates
pd.concat([df1,df2]).drop_duplicates(keep=False)
Update:
The above method only works for dataframes that do not already contain duplicate rows themselves. For example:
df1=pd.DataFrame({'A':[1,2,3,3],'B':[2,3,4,4]})
df2=pd.DataFrame({'A':[1],'B':[2]})
This produces the output below, which is wrong.
Wrong output:
pd.concat([df1, df2]).drop_duplicates(keep=False)
Out[655]:
A B
1 2 3
Correct Output
Out[656]:
A B
1 2 3
2 3 4
3 3 4
How to achieve that?
Method 1: Using isin with tuple
df1[~df1.apply(tuple, axis=1).isin(df2.apply(tuple, axis=1))]
Out[657]:
A B
1 2 3
2 3 4
3 3 4
Method 2: merge with indicator
df1.merge(df2,indicator = True, how='left').loc[lambda x : x['_merge']!='both']
Out[421]:
A B _merge
1 2 3 left_only
2 3 4 left_only
3 3 4 left_only
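Putting the three approaches next to each other on the example data makes the difference easy to check. This is just a sketch that reuses df1 and df2 from the update above; the variable names naive, by_tuple and by_merge are made up.

```python
import pandas as pd

df1 = pd.DataFrame({"A": [1, 2, 3, 3], "B": [2, 3, 4, 4]})
df2 = pd.DataFrame({"A": [1], "B": [2]})

# drop_duplicates(keep=False) also removes the repeated (3, 4) rows of df1.
naive = pd.concat([df1, df2]).drop_duplicates(keep=False)

# Method 1: compare whole rows as tuples.
in_df2 = df1.apply(tuple, axis=1).isin(df2.apply(tuple, axis=1))
by_tuple = df1[~in_df2]

# Method 2: left merge with an indicator column.
merged = df1.merge(df2, how="left", indicator=True)
by_merge = merged[merged["_merge"] == "left_only"].drop(columns="_merge")

print(naive, by_tuple, by_merge, sep="\n\n")
```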
How to compare pandas DataFrames using set difference
You can make use of merge with indicator=True:
u = df1.merge(df2, how='outer', indicator=True)
df3 = u.query('_merge == "left_only"').drop(columns='_merge')
df4 = u.query('_merge == "right_only"').drop(columns='_merge')
df3
col1 col2 col3 col4
1 2 2 1 1
3 0 0 4 1
df4
col1 col2 col3 col4
4 3 3 1 1
5 1 1 5 1
If the column names of df1 and df2 are different, ensure they're both made to be the same:
df1.columns = df2.columns
If the index also needs to be preserved, you can reset it before merging and set it again afterwards.
u, v = df1.reset_index(), df2.reset_index()
w = (u.merge(v, how='outer', on=df1.columns.tolist(), indicator=True)
      .fillna({'index_x': -1, 'index_y': -1})
      .astype({'index_x': int, 'index_y': int}))
w
index_x col1 col2 col3 col4 index_y _merge
0 0 1 1 1 1 0 both
1 1 2 2 1 1 -1 left_only
2 2 0 0 1 1 2 both
3 5 0 0 4 1 -1 left_only
4 -1 3 3 1 1 1 right_only
5 -1 1 1 5 1 3 right_only
Now,
df3 = (w.query('_merge == "left_only"')
        .set_index('index_x')
        .drop(columns=['_merge', 'index_y'])
        .rename_axis([None], axis=0))
df4 = (w.query('_merge == "right_only"')
        .set_index('index_y')
        .drop(columns=['_merge', 'index_x'])
        .rename_axis([None], axis=0))
df3
col1 col2 col3 col4
1 2 2 1 1
5 0 0 4 1
df4
col1 col2 col3 col4
1 3 3 1 1
3 1 1 5 1
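The index-preserving version can be collected into one function, roughly as below. This is a sketch: the name split_by_origin is made up, df1 and df2 are reconstructed from the merged table w shown above, and the cast back to int64 assumes both dataframes have an unnamed integer index.

```python
import pandas as pd

cols = ["col1", "col2", "col3", "col4"]
# Data reconstructed from the merged table above (index_x / index_y columns).
df1 = pd.DataFrame([[1, 1, 1, 1], [2, 2, 1, 1], [0, 0, 1, 1], [0, 0, 4, 1]],
                   columns=cols, index=[0, 1, 2, 5])
df2 = pd.DataFrame([[1, 1, 1, 1], [3, 3, 1, 1], [0, 0, 1, 1], [1, 1, 5, 1]],
                   columns=cols, index=[0, 1, 2, 3])


def split_by_origin(left, right):
    """Return (rows only in left, rows only in right), keeping original indexes."""
    u, v = left.reset_index(), right.reset_index()
    w = u.merge(v, how="outer", on=left.columns.tolist(), indicator=True)

    def one_side(flag, keep, drop):
        part = w[w["_merge"] == flag].drop(columns=["_merge", drop])
        # The kept index column became float because of NaN in the other rows.
        part = part.astype({keep: "int64"})
        return part.set_index(keep).rename_axis(None)

    return (one_side("left_only", "index_x", "index_y"),
            one_side("right_only", "index_y", "index_x"))


df3, df4 = split_by_origin(df1, df2)
print(df3)
print(df4)
```

The row order of the two results can differ between pandas versions, so sort by index if the order matters.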
pandas, access a series of lists as a set and take the set difference of 2 set series
Thanks to: https://www.geeksforgeeks.org/python-difference-two-lists/
def Diff(li1, li2):
    return list(set(li1) - set(li2)) + list(set(li2) - set(li1))
df['C'] = df.apply(lambda x: Diff(x['A'], x['B']), axis=1)
Output
A B C
0 [1, 2, 3] [1, 2] [3]
1 [4, 5, 6] [5, 6] [4]
2 [7, 8, 9] [7, 8, 9] []
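Diff above returns the elements that are in exactly one of the two lists, which is the symmetric difference. Python sets have the ^ operator for that, so a shorter sketch could look like this. The sample data is taken from the output above; sorted is added here only to make the order of each result list predictable.

```python
import pandas as pd

df = pd.DataFrame({
    "A": [[1, 2, 3], [4, 5, 6], [7, 8, 9]],
    "B": [[1, 2], [5, 6], [7, 8, 9]],
})

# set(a) ^ set(b) keeps the elements found in exactly one of the two lists.
df["C"] = [sorted(set(a) ^ set(b)) for a, b in zip(df["A"], df["B"])]
print(df)
```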