Fastest way to compare row and previous row in pandas dataframe with millions of rows
I was thinking along the same lines as Andy, just with groupby added, and I think this is complementary to Andy's answer. Adding groupby simply has the effect of putting a NaN in the first row of each group whenever you do a diff or shift. (Note that this is not an attempt at an exact answer, just a sketch of some basic techniques.)
df['time_diff'] = df.groupby('User')['Time'].diff()
df['Col1_0'] = df['Col1'].apply(lambda x: x[0])
df['Col1_0_prev'] = df.groupby('User')['Col1_0'].shift()
User Time Col1 time_diff Col1_0 Col1_0_prev
0 1 6 [cat, dog, goat] NaN cat NaN
1 1 6 [cat, sheep] 0 cat cat
2 1 12 [sheep, goat] 6 sheep cat
3 2 3 [cat, lion] NaN cat NaN
4 2 5 [fish, goat, lemur] 2 fish cat
5 3 9 [cat, dog] NaN cat NaN
6 4 4 [dog, goat] NaN dog NaN
7 4 11 [cat] 7 cat dog
As a follow-up to Andy's point about storing objects, note that what I did here was extract the first element of the list column (and also add a shifted version). Done this way, you only pay for the expensive extraction once, and afterwards you can stick to standard pandas methods.
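The steps above can be run end-to-end on a tiny frame (the data here is made up for illustration):

```python
import pandas as pd

# Small made-up frame with a list-valued column, mirroring the shape above.
df = pd.DataFrame({'User': [1, 1, 2],
                   'Time': [6, 12, 3],
                   'Col1': [['cat', 'dog'], ['sheep'], ['cat', 'lion']]})

df['time_diff'] = df.groupby('User')['Time'].diff()       # NaN at each group's first row
df['Col1_0'] = df['Col1'].str[0]                          # extract first element once
df['Col1_0_prev'] = df.groupby('User')['Col1_0'].shift()  # previous row within group
print(df)
```

After the one-time extraction into Col1_0, everything else is a plain (and fast) groupby operation.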
Fastest way to compare rows of two pandas dataframes?
You can use merge with reset_index — the output gives the indexes of rows of B that are equal to rows of A, in custom index columns:
A = pd.DataFrame({'A':[1,0,1,1],
'B':[0,0,1,1],
'C':[1,0,1,1],
'D':[1,1,1,0],
'E':[1,1,0,1]})
print (A)
A B C D E
0 1 0 1 1 1
1 0 0 0 1 1
2 1 1 1 1 0
3 1 1 1 0 1
B = pd.DataFrame({'0':[1,0,1],
'1':[1,0,1],
'2':[1,0,0]})
print (B)
0 1 2
0 1 1 1
1 0 0 0
2 1 1 0
print (pd.merge(B.reset_index(),
A.reset_index(),
left_on=B.columns.tolist(),
right_on=A.columns[[0,1,2]].tolist(),
suffixes=('_B','_A')))
index_B 0 1 2 index_A A B C D E
0 0 1 1 1 2 1 1 1 1 0
1 0 1 1 1 3 1 1 1 0 1
2 1 0 0 0 1 0 0 0 1 1
print (pd.merge(B.reset_index(),
A.reset_index(),
left_on=B.columns.tolist(),
right_on=A.columns[[0,1,2]].tolist(),
suffixes=('_B','_A'))[['index_B','index_A']])
index_B index_A
0 0 2
1 0 3
2 1 1
faster way to compare rows in a data frame
Here is an Rcpp solution. However, if the result matrix gets too big (i.e., there are too many hits), this will throw an error. I run the loops twice: first to get the necessary size of the result matrix, and then to fill it. There is probably a better approach. Also, this will obviously only work with integers; if your matrix is numeric, you'll have to deal with floating-point precision.
library(Rcpp)
library(inline)
#C++ code:
body <- '
const IntegerMatrix M(as<IntegerMatrix>(MM));
const int m=M.ncol(), n=M.nrow();
long count1;
int count2;
count1 = 0;
for (int i=0; i<(n-1); i++)
{
for (int j=(i+1); j<n; j++)
{
count2 = 0;
for (int k=0; k<m; k++) {
if (M(i,k)==M(j,k)) count2++;
}
if (count2>3) count1++;
}
}
IntegerMatrix R(count1,3);
count1 = 0;
for (int i=0; i<(n-1); i++)
{
for (int j=(i+1); j<n; j++)
{
count2 = 0;
for (int k=0; k<m; k++) {
if (M(i,k)==M(j,k)) count2++;
}
if (count2>3) {
count1++;
R(count1-1,0) = i+1;
R(count1-1,1) = j+1;
R(count1-1,2) = count2;
}
}
}
return wrap(R);
'
fun <- cxxfunction(signature(MM = "matrix"),
body,plugin="Rcpp")
#with your data
fun(as.matrix(data))
# [,1] [,2] [,3]
# [1,] 1 2 4
# [2,] 1 4 5
# [3,] 2 4 4
#Benchmarks
set.seed(42)
mat1 <- matrix(sample(1:10,250*26,TRUE),ncol=26)
mat2 <- matrix(sample(1:10,2500*26,TRUE),ncol=26)
mat3 <- matrix(sample(1:10,10000*26,TRUE),ncol=26)
mat4 <- matrix(sample(1:10,25000*26,TRUE),ncol=26)
library(microbenchmark)
microbenchmark(
fun(mat1),
fun(mat2),
fun(mat3),
fun(mat4),
times=3
)
# Unit: milliseconds
# expr min lq median uq max neval
# fun(mat1) 2.675568 2.689586 2.703603 2.732487 2.761371 3
# fun(mat2) 272.600480 274.680815 276.761151 276.796217 276.831282 3
# fun(mat3) 4623.875203 4643.634249 4663.393296 4708.067638 4752.741979 3
# fun(mat4) 29041.878164 29047.151348 29052.424532 29235.839275 29419.254017 3
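For comparison, the same pairwise match count can be sketched with NumPy in Python — this is an illustration of the algorithm, not a port of the Rcpp code, and the sample matrix is invented:

```python
import numpy as np

def match_pairs(M, threshold=3):
    """Return (i, j, count) for row pairs sharing more than `threshold` values,
    using 1-based indices like the R output."""
    out = []
    for i in range(M.shape[0] - 1):
        # Vectorized comparison of row i against every later row.
        counts = (M[i] == M[i + 1:]).sum(axis=1)
        for j in np.flatnonzero(counts > threshold):
            out.append((i + 1, i + 2 + j, int(counts[j])))
    return out

M = np.array([[1, 2, 3, 4, 5],
              [1, 2, 3, 4, 6],
              [9, 8, 7, 6, 5],
              [1, 2, 3, 4, 7]])
print(match_pairs(M))  # -> [(1, 2, 4), (1, 4, 4), (2, 4, 4)]
```

Broadcasting removes the innermost loop, but the pairwise structure (and hence the quadratic cost) is the same as in the C++ version.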
Compare two rows on a loop for on Pandas
A solution with different output: the original columns are compared with DataFrame.diff, and values whose difference is less than 0 are set to 0 by DataFrame.mask:
df1 = df.mask(df.diff(axis=1).lt(0), 0)
print (df1)
A B C
0 6 0 0
1 8 9 14
2 10 12 0
3 1 0 4
4 3 0 9
If you instead use a loop with zip over pairs of adjacent column names, the output is different, because each comparison uses the already reassigned columns B, C, ...:
import numpy as np

for a, b in zip(df.columns, df.columns[1:]):
    df[b] = np.where(df[a] > df[b], 0, df[b])
print (df)
A B C
0 6 0 3
1 8 9 14
2 10 12 0
3 1 0 4
4 3 0 9
How to compare every row of dataframe to dataframe in R?
You can do this: in short, for each row of the dataframe, replicate it to create a new dataframe in which every row equals that row, and compare that dataframe element-wise with the original (whether the values are the same). rowSums of each such comparison gives you the vectors you want.
library(dplyr)   # for the %>% pipe
library(tibble)

# Create the desired output in a list
lst <-
  lapply(1:nrow(df), function(nr) {
    rowSums(replicate(nrow(df), df[nr, ], simplify = FALSE) %>%
              do.call("rbind", .) == df)})

# Create the desired dataframe
df %>% tibble(desired_column = I(lst))
In the tibble call on the last line, I() is used to store the list output as a column.
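The same per-row comparison can be sketched in Python with NumPy broadcasting (made-up data; the list result is attached as a column, mirroring the I(lst) step above):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 1], 'b': [3, 3, 3]})
M = df.to_numpy()

# For each row, count matching cells against every row (itself included),
# mirroring rowSums of the element-wise comparison in R.
lst = [(M == row).sum(axis=1).tolist() for row in M]

out = df.assign(desired_column=lst)
print(out)
```

Broadcasting `M == row` compares one row against the whole matrix at once, so there is no need to materialize the replicated dataframe as the R version does.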
Fastest way to compare all rows of a DataFrame
First, what code have you tried? That said, deleting duplicates is very easy in pandas. Example below:
import pandas as pd
import numpy as np
# Creating the Test DataFrame below -------------------------------
dfp = pd.DataFrame({'A' : [np.NaN,np.NaN,3,4,5,5,3,1,5,np.NaN],
'B' : [1,0,3,5,0,0,np.NaN,9,0,0],
'C' : ['AA1233445','A9875', 'rmacy','Idaho Rx','Ab123455','TV192837','RX','Ohio Drugs','RX12345','USA Pharma'],
'D' : [123456,123456,1234567,12345678,12345,12345,12345678,123456789,1234567,np.NaN],
'E' : ['Assign','Unassign','Assign','Ugly','Appreciate','Undo','Assign','Unicycle','Assign','Unicorn',]})
print(dfp)
#Output Below----------------
A B C D E
0 NaN 1.0 AA1233445 123456.0 Assign
1 NaN 0.0 A9875 123456.0 Unassign
2 3.0 3.0 rmacy 1234567.0 Assign
3 4.0 5.0 Idaho Rx 12345678.0 Ugly
4 5.0 0.0 Ab123455 12345.0 Appreciate
5 5.0 0.0 TV192837 12345.0 Undo
6 3.0 NaN RX 12345678.0 Assign
7 1.0 9.0 Ohio Drugs 123456789.0 Unicycle
8 5.0 0.0 RX12345 1234567.0 Assign
9 NaN 0.0 USA Pharma NaN Unicorn
# Select the records whose value in column A duplicates an earlier row;
# keep='first' marks the first occurrence as not duplicated.
df2 = dfp[dfp.duplicated(['A'], keep='first')]
#output
A B C D E
1 NaN 0.0 A9875 123456.0 Unassign
5 5.0 0.0 TV192837 12345.0 Undo
6 3.0 NaN RX 12345678.0 Assign
8 5.0 0.0 RX12345 1234567.0 Assign
9 NaN 0.0 USA Pharma NaN Unicorn
If you want a new dataframe with no dupes, checking across all columns, use the tilde. The ~ operator inverts the boolean mask (a logical NOT), so the rows flagged as duplicates are excluded; see the official pandas documentation for DataFrame.duplicated.
df2 = dfp[~dfp.duplicated(keep='first')]
Compare rows of two dataframes in pandas
If the data has the same columns but a different number of rows, this is one possible solution:
res = (pd.concat([df1,df2])
.drop_duplicates(keep=False)
.drop_duplicates(subset='id_account', keep='last')
)
Output:
id_account name cnpj create_date
0 10 Supermarket Carol 80502030 2022-05-30
3 40 Supermarket Magical 60304050 2022-05-30
5 60 Supermarket of Dreams 90804050 2022-05-30
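A minimal runnable sketch of the same two-step dedup, with invented frames (the real df1/df2 are not shown in the question):

```python
import pandas as pd

# Invented frames with the same columns but different numbers of rows.
df1 = pd.DataFrame({'id_account': [10, 20], 'name': ['Carol', 'Magical']})
df2 = pd.DataFrame({'id_account': [20, 30], 'name': ['Magical', 'Dreams']})

res = (pd.concat([df1, df2])
         .drop_duplicates(keep=False)                      # drop rows present in both frames
         .drop_duplicates(subset='id_account', keep='last'))
print(res)
```

keep=False removes every copy of a fully duplicated row, so rows common to both frames vanish; the second drop_duplicates then keeps one row per id_account.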
How to compare row by row in a dataframe
The idea is to use the fuzzy-matching library fuzzywuzzy to compute the ratio for all combinations of Names via a self merge (cross join) with DataFrame.merge, remove the rows with the same name in both columns with DataFrame.query, and add a new column with the string lengths via Series.str.len:
from fuzzywuzzy import fuzz
df1 = df.merge(df, on='ID').query('Name_x != Name_y')
df1['ratio'] = df1.apply(lambda x: fuzz.ratio(x['Name_x'], x['Name_y']), axis=1)
df1['len'] = df1['Name_x'].str.len()
print (df1)
Name_x ID Name_y ratio len
1 Abc 123 BCD 0 3
2 BCD 123 Abc 0 3
6 Pqr 789 PQR.com 20 3
7 PQR.com 789 Pqr 20 7
Then filter rows by threshold with boolean indexing. Next it is necessary to choose which value to keep; one possible solution is to take the longer text, so use DataFrameGroupBy.idxmax with DataFrame.loc and then DataFrame.set_index to get a Series:
N = 15
df2 = df1[df1['ratio'].gt(N)]
s = df2.loc[df2.groupby('ID')['len'].idxmax()].set_index('ID')['Name_x']
print (s)
ID
789 PQR.com
Name: Name_x, dtype: object
Finally, use Series.map by ID and replace the non-matched values with the originals using Series.fillna:
df['Name'] = df['ID'].map(s).fillna(df['Name'])
print (df)
Name ID
0 Abc 123
1 BCD 123
2 Def 345
3 PQR.com 789
4 PQR.com 789
EDIT: If there are multiple valid strings per ID, it is more complicated:
print (df)
Name ID
0 Air Ordnance 1578013421
1 Air-Ordnance.com 1578013421
2 Garreett 1578013421
3 Garrett 1578013421
First get fuzz.ratio as in the solution above:
from fuzzywuzzy import fuzz
df1 = df.merge(df, on='ID').query('Name_x != Name_y')
df1['ratio'] = df1.apply(lambda x: fuzz.ratio(x['Name_x'], x['Name_y']), axis=1)
print (df1)
Name_x ID Name_y ratio
1 Air Ordnance 1578013421 Air-Ordnance.com 79
2 Air Ordnance 1578013421 Garreett 30
3 Air Ordnance 1578013421 Garrett 32
4 Air-Ordnance.com 1578013421 Air Ordnance 79
6 Air-Ordnance.com 1578013421 Garreett 25
7 Air-Ordnance.com 1578013421 Garrett 26
8 Garreett 1578013421 Air Ordnance 30
9 Garreett 1578013421 Air-Ordnance.com 25
11 Garreett 1578013421 Garrett 93
12 Garrett 1578013421 Air Ordnance 32
13 Garrett 1578013421 Air-Ordnance.com 26
14 Garrett 1578013421 Garreett 93
Then filter by threshold:
N = 50
df2 = df1[df1['ratio'].gt(N)]
print (df2)
Name_x ID Name_y ratio
1 Air Ordnance 1578013421 Air-Ordnance.com 79
4 Air-Ordnance.com 1578013421 Air Ordnance 79
11 Garreett 1578013421 Garrett 93
14 Garrett 1578013421 Garreett 93
But for more precision it is necessary to specify which strings are valid in a list L, then filter by that list:
L = ['Air-Ordnance.com','Garrett']
df2 = df2.loc[df2['Name_x'].isin(L),['Name_x','Name_y','ID']].rename(columns={'Name_y':'Name'})
print (df2)
Name_x Name ID
4 Air-Ordnance.com Air Ordnance 1578013421
14 Garrett Garreett 1578013421
Finally, merge with a left join to the original and replace the missing values:
df = df.merge(df2, on=['Name','ID'], how='left')
df['Name'] = df.pop('Name_x').fillna(df['Name'])
print (df)
Name ID
0 Air-Ordnance.com 1578013421
1 Air-Ordnance.com 1578013421
2 Garrett 1578013421
3 Garrett 1578013421
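If fuzzywuzzy is not installed, the ratio step can be approximated with the standard library's difflib — a stand-in metric, not identical to fuzz.ratio in general (sample data reused from the edit above):

```python
import difflib
import pandas as pd

def ratio(a, b):
    # Scale difflib's 0-1 similarity to 0-100, like fuzz.ratio.
    return int(round(100 * difflib.SequenceMatcher(None, a, b).ratio()))

df = pd.DataFrame({'Name': ['Garreett', 'Garrett'],
                   'ID': [1578013421, 1578013421]})
df1 = df.merge(df, on='ID').query('Name_x != Name_y')
df1['ratio'] = df1.apply(lambda x: ratio(x['Name_x'], x['Name_y']), axis=1)
print(df1)
```

For this pair the two metrics happen to agree (93), but for strings with reordered tokens fuzzywuzzy's other scorers (e.g. token_sort_ratio) behave quite differently from difflib.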