Drop All Duplicate Rows Across Multiple Columns in Python Pandas


This is much easier in pandas now with drop_duplicates and the keep parameter.

import pandas as pd

df = pd.DataFrame({"A": ["foo", "foo", "foo", "bar"],
                   "B": [0, 1, 1, 1],
                   "C": ["A", "A", "B", "A"]})

# keep=False drops every row that is duplicated in columns A and C
df.drop_duplicates(subset=['A', 'C'], keep=False)
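With the sample frame above, both (foo, A) rows are dropped, so only the rows that are unique on the A and C pair remain:

     A  B  C
2  foo  1  B
3  bar  1  A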

Remove duplicates from dataframe based on two columns A, B, keeping the row with the max value in another column C

You can do it using groupby:

c_maxes = df.groupby(['A', 'B']).C.transform('max')
df = df.loc[df.C == c_maxes]

c_maxes is a Series of the maximum values of C in each group, but of the same length and with the same index as df. If you haven't used .transform before, printing c_maxes might be a good idea to see how it works.
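As a quick illustration (this small frame is made up, not from the original question):

import pandas as pd

df = pd.DataFrame({'A': [1, 1, 2, 2],
                   'B': ['x', 'x', 'y', 'y'],
                   'C': [3, 7, 5, 5]})

c_maxes = df.groupby(['A', 'B']).C.transform('max')
print(c_maxes.tolist())          # [7, 7, 5, 5] -- each group's max, aligned to df's index
print(df.loc[df.C == c_maxes])   # keeps rows 1, 2 and 3 (ties are all kept)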

Another approach using drop_duplicates would be

df.sort('C').drop_duplicates(subset=['A', 'B'], take_last=True)

I'm not sure which is more efficient, but I'd guess the first approach, since it doesn't involve sorting.

EDIT:
From pandas 0.18 onwards, the second solution would be

df.sort_values('C').drop_duplicates(subset=['A', 'B'], keep='last')

or, alternatively,

df.sort_values('C', ascending=False).drop_duplicates(subset=['A', 'B'])

In any case, the groupby solution seems to be significantly faster:

%timeit -n 10 df.loc[df.groupby(['A', 'B']).C.transform('max') == df.C]
10 loops, best of 3: 25.7 ms per loop

%timeit -n 10 df.sort_values('C').drop_duplicates(subset=['A', 'B'], keep='last')
10 loops, best of 3: 101 ms per loop

Pandas, drop duplicates across multiple columns only if None in other column

Use Series.notna chained by | (bitwise OR) with the inverted mask from DataFrame.duplicated, passing keep=False so that all duplicated rows across the columns are marked:

df = df[df['Col1'].notna() | ~df.duplicated(subset=['Col2', 'Col3', 'Col4'], keep=False)]
print(df)
  Col1 Col2  Col3  Col4
1   i1    a     2     1
2   i2    v     7     5
3   i3    b     1     3
4 None    c     2     2
5   i4    b     1     3
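A self-contained version of this (the input frame is reconstructed from the output above, so treat the exact values of the dropped row as an assumption):

import pandas as pd

# row 0 is assumed: its Col1 is None and it duplicates rows 3 and 5 on Col2..Col4,
# so keep=False marks all three as duplicates and only row 0 (the None one) is dropped
df = pd.DataFrame({'Col1': [None, 'i1', 'i2', 'i3', None, 'i4'],
                   'Col2': ['b', 'a', 'v', 'b', 'c', 'b'],
                   'Col3': [1, 2, 7, 1, 2, 1],
                   'Col4': [3, 1, 5, 3, 2, 3]})

df = df[df['Col1'].notna() | ~df.duplicated(subset=['Col2', 'Col3', 'Col4'], keep=False)]
print(df)   # rows 1..5, matching the output above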

Drop duplicate rows in dataframe based on multiple columns with list values

You can convert each of the columns with lists into str and then drop duplicates.

  • Step 1: Convert each column that has lists into a string type using
    astype(str).
  • Step 2: Use drop_duplicates with those string columns as the subset.
    Since you want all duplicates removed, set keep=False.
  • Step 3: Drop the temporary astype(str) columns, as you no longer need
    them.

The full code will be:

import pandas as pd

c = ['col1', 'col2', 'col3', 'col4']
d = [[52, ['kjd', 'pkh', 'sws'], ['aqs', 'zxc', 'asd'], ['plm', 'okn', 'ijb']],
     [47, ['qaz', 'wsx', 'edc'], ['aws', 'rfc', 'tgb'], ['rty', 'wer', 'dfg']],
     [85, ['kjd', 'pkh', 'sws'], ['aqs', 'zxc', 'asd'], ['plm', 'okn', 'ijb']],
     [27, ['asw', 'bxs', 'mdh'], ['wka', 'kdy', 'kaw'], ['pqm', 'lsc', 'yhb']]]

df = pd.DataFrame(d, columns=c)
print(df)

# temporary string versions of the list columns (lists are unhashable)
df['col2s'] = df['col2'].astype(str)
df['col3s'] = df['col3'].astype(str)
df['col4s'] = df['col4'].astype(str)

df.drop_duplicates(subset=['col2s', 'col3s', 'col4s'], keep=False, inplace=True)
df.drop(['col2s', 'col3s', 'col4s'], axis=1, inplace=True)
print(df)

The output of this will be:

Original DataFrame:

   col1             col2             col3             col4
0    52  [kjd, pkh, sws]  [aqs, zxc, asd]  [plm, okn, ijb]
1    47  [qaz, wsx, edc]  [aws, rfc, tgb]  [rty, wer, dfg]
2    85  [kjd, pkh, sws]  [aqs, zxc, asd]  [plm, okn, ijb]
3    27  [asw, bxs, mdh]  [wka, kdy, kaw]  [pqm, lsc, yhb]

DataFrame after dropping the duplicates:

   col1             col2             col3             col4
1    47  [qaz, wsx, edc]  [aws, rfc, tgb]  [rty, wer, dfg]
3    27  [asw, bxs, mdh]  [wka, kdy, kaw]  [pqm, lsc, yhb]
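A shorter variant of the same idea (a sketch, not part of the original answer): cast the list columns to str on the fly inside the mask, so no temporary columns are needed. It reuses c and d from above.

# same technique, without temporary columns: compare string versions directly
df2 = pd.DataFrame(d, columns=c)
df2 = df2[~df2[['col2', 'col3', 'col4']].astype(str).duplicated(keep=False)]
print(df2)   # same two rows as above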

Drop consecutive duplicates across multiple columns - Pandas

I want to drop rows where values in year and sale are the same. That means you can compare each row with the previous one on year and sale, and keep only the rows where at least one of the two changes (for numeric data, equivalently: take the difference and check that it is nonzero):

# if the data are numeric:
# s = df[['year', 'sale']].diff().ne(0).any(axis=1)

s = df[['year', 'sale']].ne(df[['year', 'sale']].shift()).any(axis=1)
df[s]

Output:

   month  year  sale
0      1  2012    55
1      4  2014    40
3     10  2013    84
4     12  2014    31
5     12  2014    32
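A runnable sketch of the above (the input is reconstructed from the output; the month of the dropped row is a guess):

import pandas as pd

# row 2 is assumed: it repeats row 1's year and sale, so it gets dropped
df = pd.DataFrame({'month': [1, 4, 7, 10, 12, 12],
                   'year': [2012, 2014, 2014, 2013, 2014, 2014],
                   'sale': [55, 40, 40, 84, 31, 32]})

s = df[['year', 'sale']].ne(df[['year', 'sale']].shift()).any(axis=1)
print(df[s])   # row 2 is dropped, matching the output above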

Remove duplicates when values are swapped in columns and give a count

IIUC, you could use a frozenset as grouper:

group = df[['Col1', 'Col2']].agg(frozenset, axis=1)

(df
 .groupby(group, as_index=False)   # you can also group by [group, 'Score']
 .agg(**{c: (c, 'first') for c in df},
      Duplicates=('Score', 'count'))
)

Output:

  Col1 Col2  Score  Duplicates
0    A    B    0.6           3
1    A    C    0.8           2
2    D    E    0.9           1
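A self-contained sketch (the input rows are hypothetical, chosen so they reproduce the output above):

import pandas as pd

# hypothetical data: (A, B) appears three times with the columns swapped, (A, C) twice
df = pd.DataFrame({'Col1': ['A', 'B', 'A', 'A', 'C', 'D'],
                   'Col2': ['B', 'A', 'B', 'C', 'A', 'E'],
                   'Score': [0.6, 0.7, 0.8, 0.8, 0.9, 0.9]})

# frozenset({'A', 'B'}) == frozenset({'B', 'A'}), so swapped pairs land in one group
group = df[['Col1', 'Col2']].agg(frozenset, axis=1)

out = (df
       .groupby(group, as_index=False)
       .agg(**{c: (c, 'first') for c in df},
            Duplicates=('Score', 'count'))
       )
print(out)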

