Drop All Duplicate Rows Across Multiple Columns in Python Pandas


This is much easier in pandas now with drop_duplicates and the keep parameter.

import pandas as pd

df = pd.DataFrame({"A": ["foo", "foo", "foo", "bar"],
                   "B": [0, 1, 1, 1],
                   "C": ["A", "A", "B", "A"]})

# keep=False drops every row that is duplicated in columns A and C
df.drop_duplicates(subset=['A', 'C'], keep=False)
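With the sample frame above, both (foo, A) rows are dropped, so only the rows that are unique on the A and C pair remain:

     A  B  C
2  foo  1  B
3  bar  1  A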

Remove duplicates from dataframe based on two columns A, B, keeping the row with the max value in another column C

You can do it using groupby:

c_maxes = df.groupby(['A', 'B']).C.transform('max')
df = df.loc[df.C == c_maxes]

c_maxes is a Series of the maximum values of C in each group, but of the same length and with the same index as df. If you haven't used .transform before, printing c_maxes might be a good idea to see how it works.
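As a quick illustration (this small frame is made up, not from the original question):

import pandas as pd

df = pd.DataFrame({'A': [1, 1, 2, 2],
                   'B': ['x', 'x', 'y', 'y'],
                   'C': [3, 7, 5, 5]})

c_maxes = df.groupby(['A', 'B']).C.transform('max')
print(c_maxes.tolist())          # [7, 7, 5, 5] -- each group's max, aligned to df's index
print(df.loc[df.C == c_maxes])   # keeps rows 1, 2 and 3 (ties are all kept)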

Another approach using drop_duplicates would be

df.sort('C').drop_duplicates(subset=['A', 'B'], take_last=True)

I'm not sure which is more efficient, but I'd guess the first approach, since it doesn't involve sorting.

EDIT:
From pandas 0.18 onwards, the second solution would be

df.sort_values('C').drop_duplicates(subset=['A', 'B'], keep='last')

or, alternatively,

df.sort_values('C', ascending=False).drop_duplicates(subset=['A', 'B'])

In any case, the groupby solution seems to be significantly faster:

%timeit -n 10 df.loc[df.groupby(['A', 'B']).C.transform('max') == df.C]
10 loops, best of 3: 25.7 ms per loop

%timeit -n 10 df.sort_values('C').drop_duplicates(subset=['A', 'B'], keep='last')
10 loops, best of 3: 101 ms per loop

Pandas, drop duplicates across multiple columns only if None in other column

Use Series.notna chained by | (bitwise OR) with the inverted mask from DataFrame.duplicated, passing keep=False so that all duplicated rows across the columns are marked:

df = df[df['Col1'].notna() | ~df.duplicated(subset=['Col2', 'Col3', 'Col4'], keep=False)]
print(df)
  Col1 Col2  Col3  Col4
1   i1    a     2     1
2   i2    v     7     5
3   i3    b     1     3
4 None    c     2     2
5   i4    b     1     3
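A self-contained version of this (the input frame is reconstructed from the output above, so treat the exact values of the dropped row as an assumption):

import pandas as pd

# row 0 is assumed: its Col1 is None and it duplicates rows 3 and 5 on Col2..Col4,
# so keep=False marks all three as duplicates and only row 0 (the None one) is dropped
df = pd.DataFrame({'Col1': [None, 'i1', 'i2', 'i3', None, 'i4'],
                   'Col2': ['b', 'a', 'v', 'b', 'c', 'b'],
                   'Col3': [1, 2, 7, 1, 2, 1],
                   'Col4': [3, 1, 5, 3, 2, 3]})

df = df[df['Col1'].notna() | ~df.duplicated(subset=['Col2', 'Col3', 'Col4'], keep=False)]
print(df)   # rows 1..5, matching the output above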

Drop duplicate rows in dataframe based on multiple columns with list values

You can convert each of the columns with lists into str and then drop duplicates.

  • Step 1: Convert each column that has lists into a string type using
    astype(str).
  • Step 2: Use drop_duplicates with those string columns as the subset.
    Since you want all duplicates removed, set keep=False.
  • Step 3: Drop the temporary astype(str) columns, as you no longer need
    them.

The full code will be:

import pandas as pd

c = ['col1', 'col2', 'col3', 'col4']
d = [[52, ['kjd', 'pkh', 'sws'], ['aqs', 'zxc', 'asd'], ['plm', 'okn', 'ijb']],
     [47, ['qaz', 'wsx', 'edc'], ['aws', 'rfc', 'tgb'], ['rty', 'wer', 'dfg']],
     [85, ['kjd', 'pkh', 'sws'], ['aqs', 'zxc', 'asd'], ['plm', 'okn', 'ijb']],
     [27, ['asw', 'bxs', 'mdh'], ['wka', 'kdy', 'kaw'], ['pqm', 'lsc', 'yhb']]]

df = pd.DataFrame(d, columns=c)
print(df)

# temporary string versions of the list columns (lists are unhashable)
df['col2s'] = df['col2'].astype(str)
df['col3s'] = df['col3'].astype(str)
df['col4s'] = df['col4'].astype(str)

df.drop_duplicates(subset=['col2s', 'col3s', 'col4s'], keep=False, inplace=True)
df.drop(['col2s', 'col3s', 'col4s'], axis=1, inplace=True)
print(df)

The output of this will be:

Original DataFrame:

   col1             col2             col3             col4
0    52  [kjd, pkh, sws]  [aqs, zxc, asd]  [plm, okn, ijb]
1    47  [qaz, wsx, edc]  [aws, rfc, tgb]  [rty, wer, dfg]
2    85  [kjd, pkh, sws]  [aqs, zxc, asd]  [plm, okn, ijb]
3    27  [asw, bxs, mdh]  [wka, kdy, kaw]  [pqm, lsc, yhb]

DataFrame after dropping the duplicates:

   col1             col2             col3             col4
1    47  [qaz, wsx, edc]  [aws, rfc, tgb]  [rty, wer, dfg]
3    27  [asw, bxs, mdh]  [wka, kdy, kaw]  [pqm, lsc, yhb]
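A shorter variant of the same idea (a sketch, not part of the original answer): cast the list columns to str on the fly inside the mask, so no temporary columns are needed. It reuses c and d from above.

# same technique, without temporary columns: compare string versions directly
df2 = pd.DataFrame(d, columns=c)
df2 = df2[~df2[['col2', 'col3', 'col4']].astype(str).duplicated(keep=False)]
print(df2)   # same two rows as above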

Drop consecutive duplicates across multiple columns - Pandas

I want to drop rows where values in year and sale are the same. That means you can compare each row with the previous one on year and sale, and keep only the rows where at least one of the two changes (for numeric data, equivalently: take the difference and check that it is nonzero):

# if the data are numeric:
# s = df[['year', 'sale']].diff().ne(0).any(axis=1)

s = df[['year', 'sale']].ne(df[['year', 'sale']].shift()).any(axis=1)
df[s]

Output:

   month  year  sale
0      1  2012    55
1      4  2014    40
3     10  2013    84
4     12  2014    31
5     12  2014    32
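A runnable sketch of the above (the input is reconstructed from the output; the month of the dropped row is a guess):

import pandas as pd

# row 2 is assumed: it repeats row 1's year and sale, so it gets dropped
df = pd.DataFrame({'month': [1, 4, 7, 10, 12, 12],
                   'year': [2012, 2014, 2014, 2013, 2014, 2014],
                   'sale': [55, 40, 40, 84, 31, 32]})

s = df[['year', 'sale']].ne(df[['year', 'sale']].shift()).any(axis=1)
print(df[s])   # row 2 is dropped, matching the output above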

Remove duplicates when values are swapped in columns and give a count

IIUC, you could use a frozenset as grouper:

group = df[['Col1', 'Col2']].agg(frozenset, axis=1)

(df
 .groupby(group, as_index=False)   # you can also group by [group, 'Score']
 .agg(**{c: (c, 'first') for c in df},
      Duplicates=('Score', 'count'))
)

Output:

  Col1 Col2  Score  Duplicates
0    A    B    0.6           3
1    A    C    0.8           2
2    D    E    0.9           1
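A self-contained sketch (the input rows are hypothetical, chosen so they reproduce the output above):

import pandas as pd

# hypothetical data: (A, B) appears three times with the columns swapped, (A, C) twice
df = pd.DataFrame({'Col1': ['A', 'B', 'A', 'A', 'C', 'D'],
                   'Col2': ['B', 'A', 'B', 'C', 'A', 'E'],
                   'Score': [0.6, 0.7, 0.8, 0.8, 0.9, 0.9]})

# frozenset({'A', 'B'}) == frozenset({'B', 'A'}), so swapped pairs land in one group
group = df[['Col1', 'Col2']].agg(frozenset, axis=1)

out = (df
       .groupby(group, as_index=False)
       .agg(**{c: (c, 'first') for c in df},
            Duplicates=('Score', 'count'))
       )
print(out)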

