Set Difference for Pandas

set difference for pandas

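from pandas import DataFrame

df1 = DataFrame({'col1': [1, 2, 3], 'col2': [2, 3, 4]})
df2 = DataFrame({'col1': [4, 2, 5], 'col2': [6, 3, 5]})

# All three compare df2 with df1 position by position (same index and column labels).
# Rows of df2 that do not fully match the row of df1 at the same index:
print(df2[~df2.isin(df1).all(axis=1)])
# The same rows, with any cells that do match shown as NaN:
print(df2[(df2 != df1)].dropna(how='all'))
print(df2[~(df2 == df1)].dropna(how='all'))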

Computing Set Difference in Pandas between two dataframes

Do a left join with indicator=True, which adds a _merge column recording the origin of each row, then filter on that column:

df1.merge(df2, indicator=True, how="left")[lambda x: x['_merge'] == 'left_only'].drop(columns='_merge')

#   State        City  Population
# 0    NY      Albany      856654
# 2    SC  Charleston       35323
# 4    WV  Charleston       34523
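A minimal, self-contained sketch of the same left-join idea. The helper name rows_only_in_left and the Buffalo and Columbia rows are made up for illustration; only the three result rows come from the output above. Note that merge returns a fresh 0..n-1 index, so the labels shown only coincide with the original ones when df1 had a default index.

```python
import pandas as pd


def rows_only_in_left(left: pd.DataFrame, right: pd.DataFrame) -> pd.DataFrame:
    """Rows of `left` with no identical row in `right` (hypothetical helper)."""
    # Merge on all shared columns; _merge records where each row was found.
    merged = left.merge(right, how="left", indicator=True)
    return merged[merged["_merge"] == "left_only"].drop(columns="_merge")


# Illustrative data: rows 1 and 3 are invented so that something matches df2.
df1 = pd.DataFrame({
    "State": ["NY", "NY", "SC", "SC", "WV"],
    "City": ["Albany", "Buffalo", "Charleston", "Columbia", "Charleston"],
    "Population": [856654, 250000, 35323, 130000, 34523],
})
df2 = pd.DataFrame({
    "State": ["NY", "SC"],
    "City": ["Buffalo", "Columbia"],
    "Population": [250000, 130000],
})

result = rows_only_in_left(df1, df2)
print(result)
```

Passing columns= to drop instead of a positional axis keeps the call working on pandas 2.x.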

Set difference of two dataframes in Pandas

You can hash each row and then check which row hashes of df1 do not occur in df2.

Example:

df1['match'] = df1.apply(lambda x: hash(tuple(x)), axis=1)
df2['match'] = df2.apply(lambda x: hash(tuple(x)), axis=1)
df_diff = df1[~df1['match'].isin(df2['match'])]
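As a possible variation (not the code of the answer above), pandas ships a vectorized row hasher, pandas.util.hash_pandas_object, which avoids the Python-level apply and does not add a helper column. The sketch below reuses the small col1/col2 frames from the top of this page as sample data and assumes the compared columns have the same dtypes in both frames.

```python
import pandas as pd

df1 = pd.DataFrame({"col1": [1, 2, 3], "col2": [2, 3, 4]})
df2 = pd.DataFrame({"col1": [4, 2, 5], "col2": [6, 3, 5]})

# index=False hashes only the row values, so equal rows match whatever their index label.
hash1 = pd.util.hash_pandas_object(df1, index=False)
hash2 = pd.util.hash_pandas_object(df2, index=False)

# Keep the rows of df1 whose hash does not occur among the row hashes of df2.
df_diff = df1[~hash1.isin(hash2)]
print(df_diff)
```

As with any hash-based comparison, two different rows could in principle collide, so treat this as a fast filter rather than a proof of equality.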

Find the difference (set difference) between two dataframes in python

Try:

df1[~df1.isin(df2)]

A,B,C,D

How to find the set difference between two Pandas DataFrames

The results are correct; however, setdiff1d is order-dependent. It only returns the elements of the first input array that do not occur in the second array.

If you do not care which of the dataframes has the unique columns, you can use setxor1d. It returns "the unique values that are in only one (not both) of the input arrays"; see the documentation.

import numpy

colsA = ['a', 'b', 'c', 'd']
colsB = ['b','c']

c = numpy.setxor1d(colsA, colsB)

This returns an array containing 'a' and 'd'.


If you want to use setdiff1d, you need to check for differences both ways:

# columns in train.columns that are not in train_1.columns
c1 = numpy.setdiff1d(train.columns, train_1.columns)

# columns in train_1.columns that are not in train.columns
c2 = numpy.setdiff1d(train_1.columns, train.columns)
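A small sketch of how the two functions relate. The column lists are sample values chosen for illustration (cols_b has an extra 'e' so that both directions are non-empty).

```python
import numpy as np

cols_a = ["a", "b", "c", "d"]
cols_b = ["b", "c", "e"]

# One-directional differences: the argument order matters.
only_in_a = np.setdiff1d(cols_a, cols_b)
only_in_b = np.setdiff1d(cols_b, cols_a)

# Symmetric difference: values that occur in exactly one of the inputs.
either = np.setxor1d(cols_a, cols_b)

# Combining both one-directional results gives the same values as setxor1d.
both_ways = np.union1d(only_in_a, only_in_b)

print(only_in_a, only_in_b, either)
```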

Comparing two dataframes and getting the differences

This approach, df1 != df2, works only for dataframes with identical row and column labels. In fact, both axes of the dataframes are compared with the _indexed_same method, and an exception is raised if any difference is found, even in the order of the columns or index.
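A short sketch of that behaviour, using two made-up single-column frames (left and right are illustrative names) that hold the same labels in a different order. Reindexing one frame like the other first is one possible way to make the element-wise comparison legal.

```python
import pandas as pd

left = pd.DataFrame({"x": [1, 2]}, index=[0, 1])
right = pd.DataFrame({"x": [1, 2]}, index=[1, 0])  # same labels, different order

try:
    left != right
    raised = False
except ValueError as err:
    # pandas refuses to compare frames whose labels are not identical
    raised = True
    print(err)

# After aligning right to left's labels the comparison runs element by element.
aligned = left != right.reindex_like(left)
print(aligned)
```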

If I understood you correctly, you do not want to find changes but the symmetric difference. For that, one approach is to concatenate the dataframes:

>>> df = pd.concat([df1, df2])
>>> df = df.reset_index(drop=True)

Group by all columns:

>>> df_gpby = df.groupby(list(df.columns))

Get the index of the records that occur only once:

>>> idx = [x[0] for x in df_gpby.groups.values() if len(x) == 1]

Filter:

>>> df.reindex(idx)
         Date   Fruit   Num   Color
9  2013-11-25  Orange   8.6  Orange
8  2013-11-25   Apple  22.1     Red
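The steps above can be put together as a small self-contained sketch. The function name symmetric_difference and the sample rows are assumptions for illustration. Two caveats: groupby drops rows containing NaN in a key column by default, and a row duplicated inside one frame is treated as if it were shared by both.

```python
import pandas as pd


def symmetric_difference(a: pd.DataFrame, b: pd.DataFrame) -> pd.DataFrame:
    """Rows that occur exactly once in the concatenation of a and b (hypothetical helper)."""
    combined = pd.concat([a, b]).reset_index(drop=True)
    # Group identical rows together; each group maps to the index labels of its rows.
    groups = combined.groupby(list(combined.columns)).groups
    unique_idx = sorted(labels[0] for labels in groups.values() if len(labels) == 1)
    return combined.loc[unique_idx]


# Illustrative data: the Banana row is shared, the other two rows are not.
a = pd.DataFrame({"Fruit": ["Banana", "Orange"], "Num": [5.0, 8.6], "Color": ["Yellow", "Orange"]})
b = pd.DataFrame({"Fruit": ["Banana", "Apple"], "Num": [5.0, 22.1], "Color": ["Yellow", "Red"]})

result = symmetric_difference(a, b)
print(result)
```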

Pandas: set difference by group

Using set, groupby, apply, and shift.

  • For efficiency:

    • Convert members to set type first: the - operator is not supported for the original string (or list) values and raises a TypeError.
    • Leave additions and deletions as set type

Using apply

  • With a dataframe of 60000 rows:

    • 91.4 ms ± 2.77 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

# clean the members column
df.members = df.members.str.replace(' ', '').str.split(',').map(set)

# create del and add
# group_keys=False keeps the original row index so the result aligns with df (needed on pandas 2.x)
df['deletions'] = df.groupby('team', group_keys=False)['members'].apply(lambda x: x.shift() - x)
df['additions'] = df.groupby('team', group_keys=False)['members'].apply(lambda x: x - x.shift())

# result
month team    members additions deletions
    0    A  {Z, X, Y}       NaN       NaN
    1    A     {X, Y}        {}       {Z}
    2    A  {W, X, Y}       {W}        {}
    0    B     {D, E}       NaN       NaN
    1    B  {D, F, E}       {F}        {}
    2    B        {F}        {}    {D, E}

More Efficiently

  • pandas.DataFrame.diff
  • With a dataframe of 60000 rows:

    • 60.7 ms ± 3.54 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

df['deletions'] = df.groupby('team')['members'].diff(periods=-1).shift()
df['additions'] = df.groupby('team')['members'].diff()
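To check what either variant should produce, the same additions and deletions can be computed with a plain Python loop. This is only a cross-check sketch, not the vectorized approach from the answer: the input strings are reconstructed from the result table above, and the rows are assumed to be sorted by month within each team.

```python
import pandas as pd

# Sample data modelled on the result table; the raw member strings are an assumption.
df = pd.DataFrame({
    "month": [0, 1, 2, 0, 1, 2],
    "team": ["A", "A", "A", "B", "B", "B"],
    "members": ["X, Y, Z", "X, Y", "W, X, Y", "D, E", "D, E, F", "F"],
})

member_sets = [set(s.replace(" ", "").split(",")) for s in df["members"]]

additions, deletions = [], []
previous = {}  # last member set seen for each team
for team, current in zip(df["team"], member_sets):
    if team in previous:
        additions.append(current - previous[team])
        deletions.append(previous[team] - current)
    else:
        # first month of a team: nothing to compare against
        additions.append(None)
        deletions.append(None)
    previous[team] = current

df["additions"] = additions
df["deletions"] = deletions
print(df)
```

The first month of each team has no predecessor, so it gets None here where the pandas versions show NaN.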

Find difference between two data frames

By using drop_duplicates

pd.concat([df1,df2]).drop_duplicates(keep=False)

Update:

The above method only works for data frames that do not already contain duplicate rows themselves. For example:

df1=pd.DataFrame({'A':[1,2,3,3],'B':[2,3,4,4]})
df2=pd.DataFrame({'A':[1],'B':[2]})

It produces the output below, which is wrong because the duplicated row (3, 4) of df1 is dropped as well.

Wrong output:

pd.concat([df1, df2]).drop_duplicates(keep=False)
Out[655]:
   A  B
1  2  3

Correct output:

Out[656]:
   A  B
1  2  3
2  3  4
3  3  4


How to achieve that?

Method 1: Using isin with tuple

df1[~df1.apply(tuple, axis=1).isin(df2.apply(tuple, axis=1))]
Out[657]:
   A  B
1  2  3
2  3  4
3  3  4

Method 2: merge with indicator

df1.merge(df2,indicator = True, how='left').loc[lambda x : x['_merge']!='both']
Out[421]:
   A  B     _merge
1  2  3  left_only
2  3  4  left_only
3  3  4  left_only
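A runnable sketch that reproduces the pitfall and a row-tuple membership test in the spirit of Method 1. Building the tuples with itertuples and a Python set is a variation chosen for this sketch, not the exact code of the answer. Unlike the merge approach it keeps the original index of df1.

```python
import pandas as pd

df1 = pd.DataFrame({"A": [1, 2, 3, 3], "B": [2, 3, 4, 4]})
df2 = pd.DataFrame({"A": [1], "B": [2]})

# Pitfall: keep=False also removes the row (3, 4) that is duplicated inside df1.
naive = pd.concat([df1, df2]).drop_duplicates(keep=False)

# Membership test on whole rows: a row of df1 is kept if it never occurs in df2.
right_rows = set(df2.itertuples(index=False, name=None))
keep = [row not in right_rows for row in df1.itertuples(index=False, name=None)]
diff = df1[keep]

print(naive)
print(diff)
```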

How to compare pandas DataFrames using set difference

You can make use of merge with indicator=True:

u = df1.merge(df2, how='outer', indicator=True)
df3 = u.query('_merge == "left_only"').drop(columns='_merge')
df4 = u.query('_merge == "right_only"').drop(columns='_merge')

df3

   col1  col2  col3  col4
1     2     2     1     1
3     0     0     4     1

df4

   col1  col2  col3  col4
4     3     3     1     1
5     1     1     5     1

If the column names of df1 and df2 are different, ensure they're both made to be the same:

df1.columns = df2.columns

If the index also needs to be preserved, you can first reset it before merging, then you can set it after.

u, v = df1.reset_index(), df2.reset_index()
w = (u.merge(v, how='outer', on=df1.columns.tolist(), indicator=True)
      .fillna({'index_x': -1, 'index_y': -1})
      .astype({'index_x': int, 'index_y': int}))  # replaces downcast='infer', which is deprecated in recent pandas
w

   index_x  col1  col2  col3  col4  index_y      _merge
0        0     1     1     1     1        0        both
1        1     2     2     1     1       -1   left_only
2        2     0     0     1     1        2        both
3        5     0     0     4     1       -1   left_only
4       -1     3     3     1     1        1  right_only
5       -1     1     1     5     1        3  right_only

Now,

df3 = (w.query('_merge == "left_only"')
        .set_index('index_x')
        .drop(columns=['_merge', 'index_y'])
        .rename_axis(None))
df4 = (w.query('_merge == "right_only"')
        .set_index('index_y')
        .drop(columns=['_merge', 'index_x'])
        .rename_axis(None))

df3

   col1  col2  col3  col4
1     2     2     1     1
5     0     0     4     1

df4

   col1  col2  col3  col4
1     3     3     1     1
3     1     1     5     1

pandas, access a series of lists as a set and take the set difference of 2 set series

Thanks to: https://www.geeksforgeeks.org/python-difference-two-lists/

def Diff(li1, li2):
    # elements of li1 not in li2, followed by elements of li2 not in li1
    return list(set(li1) - set(li2)) + list(set(li2) - set(li1))

df['C'] = df.apply(lambda x: Diff(x['A'], x['B']), axis=1)

Output

           A          B    C
0  [1, 2, 3]     [1, 2]  [3]
1  [4, 5, 6]     [5, 6]  [4]
2  [7, 8, 9]  [7, 8, 9]   []
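Since Diff returns the elements found in exactly one of the two lists, the same column can also be sketched with the set operator ^ (symmetric difference). This is a variation, not the code from the linked page. Sorting gives a stable element order, which plain set iteration does not guarantee, and assumes the elements can be compared with each other.

```python
import pandas as pd

df = pd.DataFrame({
    "A": [[1, 2, 3], [4, 5, 6], [7, 8, 9]],
    "B": [[1, 2], [5, 6], [7, 8, 9]],
})

# set(a) ^ set(b) holds the elements that are in exactly one of the two lists.
df["C"] = [sorted(set(a) ^ set(b)) for a, b in zip(df["A"], df["B"])]
print(df)
```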

