Remove Duplicate Rows from Pandas Dataframe Where Only Some Columns Have the Same Value

Remove duplicate rows from Pandas dataframe where only some columns have the same value

Use drop_duplicates with parameter subset, for keeping only last duplicated rows add keep='last':

df1 = df.drop_duplicates(subset=['A','B'])
#same as
#df1 = df.drop_duplicates(subset=['A','B'], keep='first')
print (df1)
A B C
0 1 2 x
2 3 4 z
3 3 5 x

df2 = df.drop_duplicates(subset=['A','B'], keep='last')
print (df2)
A B C
1 1 2 y
2 3 4 z
3 3 5 x

how do I remove rows with duplicate values of columns in pandas data frame?

Using drop_duplicates with subset with list of columns to check for duplicates on and keep='first' to keep first of duplicates.

If dataframe is:

df = pd.DataFrame({'Column1': ["'cat'", "'toy'", "'cat'"],
'Column2': ["'bat'", "'flower'", "'bat'"],
'Column3': ["'xyz'", "'abc'", "'lmn'"]})
print(df)

Result:

  Column1   Column2 Column3
0 'cat' 'bat' 'xyz'
1 'toy' 'flower' 'abc'
2 'cat' 'bat' 'lmn'

Then:

result_df = df.drop_duplicates(subset=['Column1', 'Column2'], keep='first')
print(result_df)

Result:

  Column1   Column2 Column3
0 'cat' 'bat' 'xyz'
1 'toy' 'flower' 'abc'

Remove duplicates by columns A, keeping the row with the highest value in column B

This takes the last. Not the maximum though:

In [10]: df.drop_duplicates(subset='A', keep="last")
Out[10]:
A B
1 1 20
3 2 40
4 3 10

You can do also something like:

In [12]: df.groupby('A', group_keys=False).apply(lambda x: x.loc[x.B.idxmax()])
Out[12]:
A B
A
1 1 20
2 2 40
3 3 10

Drop all duplicate rows across multiple columns in Python Pandas

This is much easier in pandas now with drop_duplicates and the keep parameter.

import pandas as pd
df = pd.DataFrame({"A":["foo", "foo", "foo", "bar"], "B":[0,1,1,1], "C":["A","A","B","A"]})
df.drop_duplicates(subset=['A', 'C'], keep=False)

How do I remove any rows that are identical across all the columns except for one column?

IIUC, you can compute the duplicates per column using apply(pd.Series.duplicated), then count the True values per row and compare it to the wanted threshold:

ncols = df.shape[1]-1
ndups = df.drop(columns='Hugo_Symbol').apply(pd.Series.duplicated).sum(axis=1)
df2 = df[ndups.lt(ncols-1)]

output (using the simple provided example):

  Hugo_Symbol  TCGA-1  TCGA-2  TCGA-3
0 First 0.123 0.234 0.345
2 Third 0.456 0.678 0.789
3 Fourth 0.789 0.456 0.321

There is however one potential blind spot. Imagine this dataset:

A B C D
X X C D
A B X X

The first row won't be dropped as it comes first and has duplicates spread over several other rows (that might not be an issue in your case).

Remove duplicates from dataframe, based on two columns A,B, keeping row with max value in another column C

You can do it using group by:

c_maxes = df.groupby(['A', 'B']).C.transform(max)
df = df.loc[df.C == c_maxes]

c_maxes is a Series of the maximum values of C in each group but which is of the same length and with the same index as df. If you haven't used .transform then printing c_maxes might be a good idea to see how it works.

Another approach using drop_duplicates would be

df.sort('C').drop_duplicates(subset=['A', 'B'], take_last=True)

Not sure which is more efficient but I guess the first approach as it doesn't involve sorting.

EDIT:
From pandas 0.18 up the second solution would be

df.sort_values('C').drop_duplicates(subset=['A', 'B'], keep='last')

or, alternatively,

df.sort_values('C', ascending=False).drop_duplicates(subset=['A', 'B'])

In any case, the groupby solution seems to be significantly more performing:

%timeit -n 10 df.loc[df.groupby(['A', 'B']).C.max == df.C]
10 loops, best of 3: 25.7 ms per loop

%timeit -n 10 df.sort_values('C').drop_duplicates(subset=['A', 'B'], keep='last')
10 loops, best of 3: 101 ms per loop

remove duplicate rows based on specific criteria with pandas

First create a masking to separate duplicate and non-duplicate rows based on Id, then concatenate non-duplicate slice with duplicate slice without all row values equal to 0.

>>> duplicateMask = df.duplicated('Id', keep=False)
>>> pd.concat([df.loc[duplicateMask & df[['Sales', 'Rent', 'Rate']].ne(0).any(axis=1)],
df[~duplicateMask]])
Id Name Sales Rent Rate
0 40808 A2 0 43 340
1 17486 DV 491 0 346
4 27977 A-M 0 0 94
6 80210 M-1 0 0 -37
7 15545 M-2 0 0 -17
10 53549 A-M8 0 0 50
12 66666 MK 0 0 0


Related Topics



Leave a reply



Submit