Set Difference for Pandas

set difference for pandas

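from pandas import DataFrame

df1 = DataFrame({'col1': [1, 2, 3], 'col2': [2, 3, 4]})
df2 = DataFrame({'col1': [4, 2, 5], 'col2': [6, 3, 5]})

# All three compare values at matching index/column labels,
# so they assume df1 and df2 share the same index.
print(df2[~df2.isin(df1).all(axis=1)])
print(df2[(df2 != df1)].dropna(how='all'))
print(df2[~(df2 == df1)].dropna(how='all'))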
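Because these comparisons are label-aligned, a shared row that sits at a different index position is not detected. The sketch below reorders the rows of df2 purely for illustration and contrasts the aligned check with a row-wise membership test built on pd.MultiIndex.from_frame. Treat it as one possible approach, not as the method used in the answer above.

```python
import pandas as pd

df1 = pd.DataFrame({'col1': [1, 2, 3], 'col2': [2, 3, 4]})
# Same rows as before, but (2, 3) now sits at index 0 instead of index 1.
df2 = pd.DataFrame({'col1': [2, 4, 5], 'col2': [3, 6, 5]})

# Label-aligned check: (2, 3) is at index 1 in df1 and index 0 in df2,
# so it is not recognised as a shared row.
aligned = df2[~df2.isin(df1).all(axis=1)]

# Row-wise check: treat every row as a tuple and ignore the index.
rows1 = pd.MultiIndex.from_frame(df1)
rows2 = pd.MultiIndex.from_frame(df2)
rowwise = df2[~rows2.isin(rows1)]

print(aligned)
print(rowwise)
```

Here the aligned version keeps all three rows of df2, while the row-wise version drops (2, 3).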

Computing Set Difference in Pandas between two dataframes

Do a left join with indicator=True, which adds a _merge column recording the origin of each row, then filter on that column:

df1.merge(df2, indicator=True, how="left")[lambda x: x._merge == 'left_only'].drop(columns='_merge')

#  State        City  Population
#0    NY      Albany      856654
#2    SC  Charleston       35323
#4    WV  Charleston       34523

Set difference of two dataframes in Pandas

You can hash each row and then check which hashes from df1 are missing from df2.

Example:

df1['match'] = df1.apply(lambda x: hash(tuple(x)), axis=1)
df2['match'] = df2.apply(lambda x: hash(tuple(x)), axis=1)
df_diff = df1[~df1['match'].isin(df2['match'])]
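The snippet above calls Python's built-in hash once per row. As a hedged alternative (not what the answer uses), pandas provides pd.util.hash_pandas_object, which hashes all rows in one vectorized call. The two frames below are hypothetical sample data, and the hashes are only comparable when both frames have the same columns in the same order; differing dtypes may also yield different hashes.

```python
import pandas as pd

# Hypothetical sample data for illustration only.
df1 = pd.DataFrame({'a': [1, 2, 3], 'b': ['x', 'y', 'z']})
df2 = pd.DataFrame({'a': [2, 9], 'b': ['y', 'q']})

# One uint64 hash per row; index=False keeps the index out of the hash.
h1 = pd.util.hash_pandas_object(df1, index=False)
h2 = pd.util.hash_pandas_object(df2, index=False)

# Keep the rows of df1 whose hash does not occur in df2.
df_diff = df1[~h1.isin(h2)]
print(df_diff)
```

This also avoids adding a helper column to either dataframe.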

Find the difference (set difference) between two dataframes in python

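Try element-wise masking with isin (cells of df1 that equal the cell at the same labels in df2 become NaN):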

df1[~df1.isin(df2)]

A,B,C,D

How to find the set difference between two Pandas DataFrames

The results are correct; however, setdiff1d is order-dependent. It only returns elements of the first input array that do not occur in the second array.

If you do not care which of the dataframes has the unique columns, you can use setxor1d. It returns "the unique values that are in only one (not both) of the input arrays"; see the documentation.

import numpy as np

colsA = ['a', 'b', 'c', 'd']
colsB = ['b', 'c']

c = np.setxor1d(colsA, colsB)

This returns an array containing 'a' and 'd'.

If you want to use setdiff1d you need to check for differences both ways:

import numpy as np

# columns in train.columns that are not in train_1.columns
c1 = np.setdiff1d(train.columns, train_1.columns)

# columns in train_1.columns that are not in train.columns
c2 = np.setdiff1d(train_1.columns, train.columns)
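The same column comparison can also be written with the set-like methods of the pandas Index itself. The sketch below uses two empty hypothetical frames named train and train_1 with made-up column labels; it is an alternative to the NumPy calls above, not a replacement for them.

```python
import pandas as pd

# Hypothetical frames; only the column labels matter here.
train = pd.DataFrame(columns=['a', 'b', 'c', 'd'])
train_1 = pd.DataFrame(columns=['b', 'c', 'e'])

# One-directional differences (order of the operands matters).
only_in_train = train.columns.difference(train_1.columns)
only_in_train_1 = train_1.columns.difference(train.columns)

# Labels that appear in exactly one of the two frames.
in_exactly_one = train.columns.symmetric_difference(train_1.columns)

print(list(only_in_train), list(only_in_train_1), list(in_exactly_one))
```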

Comparing two dataframes and getting the differences

The df1 != df2 approach works only for dataframes with identically labeled rows and columns. All axes are compared with the _indexed_same method, and an exception is raised if any difference is found, even in the order of the columns or index labels.

If I understand correctly, you want the symmetric difference rather than the changes. One approach is to concatenate the dataframes:

>>> df = pd.concat([df1, df2])
>>> df = df.reset_index(drop=True)

group by

>>> df_gpby = df.groupby(list(df.columns))

get index of unique records

>>> idx = [x[0] for x in df_gpby.groups.values() if len(x) == 1]

filter

>>> df.reindex(idx)
         Date   Fruit   Num   Color
9  2013-11-25  Orange   8.6  Orange
8  2013-11-25   Apple  22.1     Red
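The steps above can be sketched as a single function. The function name and the sample frames are hypothetical, and the sketch assumes no row is duplicated within one dataframe and that no column contains NaN (groupby drops NaN keys by default).

```python
import pandas as pd

def symmetric_difference(df1, df2):
    """Rows that occur in exactly one of df1 and df2 (hypothetical helper)."""
    df = pd.concat([df1, df2]).reset_index(drop=True)
    # Map each distinct row to the index labels where it occurs.
    groups = df.groupby(list(df.columns)).groups
    # A row seen only once belongs to just one of the inputs.
    idx = [labels[0] for labels in groups.values() if len(labels) == 1]
    return df.reindex(sorted(idx))

# Hypothetical sample data.
df1 = pd.DataFrame({'Fruit': ['Apple', 'Orange', 'Banana'], 'Num': [22.1, 8.6, 1.0]})
df2 = pd.DataFrame({'Fruit': ['Apple', 'Pear'], 'Num': [22.1, 3.0]})

result = symmetric_difference(df1, df2)
print(result)
```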

Pandas: set difference by group

Using `set`, `groupby`, `apply`, and `shift`.

For efficiency:
- Convert members to set type first: the - operator is not supported between lists and raises a TypeError.
- Leave additions and deletions as set type.

Using `apply`

With a dataframe of 60000 rows:
- 91.4 ms ± 2.77 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

# clean the members column
df.members = df.members.str.replace(' ', '').str.split(',').map(set)

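# create del and add
# group_keys=False keeps the original row index so the result can be assigned back (needed in pandas >= 2.0)
df['deletions'] = df.groupby('team', group_keys=False)['members'].apply(lambda x: x.shift() - x)
df['additions'] = df.groupby('team', group_keys=False)['members'].apply(lambda x: x - x.shift())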

# result
 month team    members additions deletions
     0    A  {Z, X, Y}       NaN       NaN
     1    A     {X, Y}        {}       {Z}
     2    A  {W, X, Y}       {W}        {}
     0    B     {D, E}       NaN       NaN
     1    B  {D, F, E}       {F}        {}
     2    B        {F}        {}    {D, E}

More Efficiently

pandas.DataFrame.diff
With a dataframe of 60000 rows:
- 60.7 ms ± 3.54 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

df['deletions'] = df.groupby('team')['members'].diff(periods=-1).shift()
df['additions'] = df.groupby('team')['members'].diff()
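For readers who want to trace the logic step by step, here is a self-contained sketch that rebuilds the result table with an explicit loop over each row and its predecessor within the team. The input strings are reconstructed from the result table shown earlier, the helper name changes is hypothetical, and the first row of each team gets None rather than NaN. It is meant for checking the set arithmetic, not as a faster alternative.

```python
import pandas as pd

# Input reconstructed from the result table (members as comma-separated strings).
df = pd.DataFrame({
    'month': [0, 1, 2, 0, 1, 2],
    'team': ['A', 'A', 'A', 'B', 'B', 'B'],
    'members': ['X, Y, Z', 'X, Y', 'W, X, Y', 'D, E', 'D, E, F', 'F'],
})
df['members'] = df['members'].str.replace(' ', '').str.split(',').map(set)

# Previous month's members within the same team (NaN for the first month).
previous = df.groupby('team')['members'].shift()

def changes(current, prev):
    """Return (additions, deletions); hypothetical helper."""
    if not isinstance(prev, set):
        return None, None  # no previous month to compare against
    return current - prev, prev - current

pairs = [changes(cur, prev) for cur, prev in zip(df['members'], previous)]
df['additions'] = pd.Series([a for a, _ in pairs], index=df.index, dtype=object)
df['deletions'] = pd.Series([d for _, d in pairs], index=df.index, dtype=object)
print(df)
```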

Find difference between two data frames

By using drop_duplicates

pd.concat([df1,df2]).drop_duplicates(keep=False)

Update:

The above method only works for data frames that do not already contain duplicate rows themselves. For example:

df1=pd.DataFrame({'A':[1,2,3,3],'B':[2,3,4,4]})
df2=pd.DataFrame({'A':[1],'B':[2]})

With these inputs it produces the output below, which is wrong because the duplicated row (3, 4) of df1 is dropped as well.

Wrong Output:

pd.concat([df1, df2]).drop_duplicates(keep=False)
Out[655]: 
   A  B
1  2  3

Correct Output

How to achieve that?

Method 1: Using isin with tuple

df1[~df1.apply(tuple,1).isin(df2.apply(tuple,1))]
Out[657]: 
   A  B
1  2  3
2  3  4
3  3  4

Method 2: merge with indicator

df1.merge(df2,indicator = True, how='left').loc[lambda x : x['_merge']!='both']
Out[421]: 
   A  B     _merge
1  2  3  left_only
2  3  4  left_only
3  3  4  left_only
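Method 2 can be wrapped in a small helper that also drops the _merge column. The function name row_difference is hypothetical; the sketch reuses the df1 and df2 from the example above and assumes both frames share the same column names. Note that merge builds a fresh 0..n-1 index, so the original index of df1 is not preserved in general.

```python
import pandas as pd

def row_difference(left, right):
    """Rows of `left` with no exact match in `right`; duplicates in `left` are kept."""
    merged = left.merge(right, how='left', indicator=True)
    return merged.loc[merged['_merge'] == 'left_only'].drop(columns='_merge')

df1 = pd.DataFrame({'A': [1, 2, 3, 3], 'B': [2, 3, 4, 4]})
df2 = pd.DataFrame({'A': [1], 'B': [2]})

result = row_difference(df1, df2)
print(result)
```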

How to compare pandas DataFrames using set difference

You can make use of merge with indicator=True:

u = df1.merge(df2, how='outer', indicator=True)
df3 = u.query('_merge == "left_only"').drop(columns='_merge')
df4 = u.query('_merge == "right_only"').drop(columns='_merge')

df3

   col1  col2  col3  col4
1     2     2     1     1
3     0     0     4     1

df4

   col1  col2  col3  col4
4     3     3     1     1
5     1     1     5     1

If the column names of df1 and df2 are different, ensure they're both made to be the same:

df1.columns = df2.columns

If the index also needs to be preserved, reset it before merging and set it again afterwards.

u, v = df1.reset_index(), df2.reset_index()
w = (u.merge(v, how='outer', on=df1.columns.tolist(), indicator=True)
      .fillna({'index_x': -1, 'index_y': -1})
      .astype({'index_x': int, 'index_y': int}))
w

   index_x  col1  col2  col3  col4  index_y      _merge
0        0     1     1     1     1        0        both
1        1     2     2     1     1       -1   left_only
2        2     0     0     1     1        2        both
3        5     0     0     4     1       -1   left_only
4       -1     3     3     1     1        1  right_only
5       -1     1     1     5     1        3  right_only

Now,

df3 = (w.query('_merge == "left_only"')
        .set_index('index_x')
        .drop(columns=['_merge', 'index_y'])
        .rename_axis([None], axis=0))
df4 = (w.query('_merge == "right_only"')
        .set_index('index_y')
        .drop(columns=['_merge', 'index_x'])
        .rename_axis([None], axis=0))

df3

   col1  col2  col3  col4
1     2     2     1     1
5     0     0     4     1

df4

   col1  col2  col3  col4
1     3     3     1     1
3     1     1     5     1

pandas, access a series of lists as a set and take the set difference of 2 set series

Thanks to: https://www.geeksforgeeks.org/python-difference-two-lists/

def Diff(li1, li2):
    return list(set(li1) - set(li2)) + list(set(li2) - set(li1))

df['C'] = df.apply(lambda x: Diff(x['A'], x['B']), axis=1)

Output

           A          B    C
0  [1, 2, 3]     [1, 2]  [3]
1  [4, 5, 6]     [5, 6]  [4]
2  [7, 8, 9]  [7, 8, 9]   []
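Note that Diff returns the symmetric difference (elements in exactly one of the two lists), which happens to match the one-sided difference for this data. A minimal sketch of the same computation with Python's built-in ^ set operator follows; the frame is rebuilt from the output table above, and sorting is added here only to make the element order deterministic.

```python
import pandas as pd

# Columns A and B rebuilt from the output table above.
df = pd.DataFrame({
    'A': [[1, 2, 3], [4, 5, 6], [7, 8, 9]],
    'B': [[1, 2], [5, 6], [7, 8, 9]],
})

# set(a) ^ set(b) is the symmetric difference; use set(a) - set(b)
# instead if only the elements missing from B are wanted.
diffs = [sorted(set(a) ^ set(b)) for a, b in zip(df['A'], df['B'])]
df['C'] = pd.Series(diffs, index=df.index)
print(df)
```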
