Delete Duplicate Rows in Two Columns Simultaneously


If you want to use subset, you could try:

  subset(df, !duplicated(subset(df, select=c(allrl, RAW.PVAL))))
#  RAW.PVAL GR allrl  Bak
#1     0.05 fr   EN1  B12
#3     0.45 fr   EN2  B10
#4     0.35 fg   EN2 B066

But I think @thelatemail's approach would be better:

  df[!duplicated(df[c("RAW.PVAL","allrl")]),]

Remove duplicates from dataframe, based on two columns A,B, keeping row with max value in another column C

You can do it using group by:

c_maxes = df.groupby(['A', 'B']).C.transform(max)
df = df.loc[df.C == c_maxes]

c_maxes is a Series of the maximum values of C in each group but which is of the same length and with the same index as df. If you haven't used .transform then printing c_maxes might be a good idea to see how it works.
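To see what c_maxes looks like, here is a minimal sketch on a small made-up frame. The data values are hypothetical and chosen only for illustration; the column names A, B and C follow the question.

```python
import pandas as pd

# Hypothetical data: two (A, B) groups, values chosen only for illustration.
df = pd.DataFrame({
    "A": [1, 1, 2, 2],
    "B": ["x", "x", "y", "y"],
    "C": [10, 30, 5, 2],
})

# transform broadcasts each group's maximum back onto every row of that
# group, so the result has the same length and index as df.
c_maxes = df.groupby(["A", "B"])["C"].transform("max")
print(c_maxes.tolist())  # [30, 30, 5, 5]

# Keep only the rows whose C equals the maximum of their group.
result = df.loc[df["C"] == c_maxes]
print(result.index.tolist())  # [1, 2]
```

Because c_maxes is aligned with df row by row, the comparison df["C"] == c_maxes is a plain boolean mask and needs no merge or join.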

Another approach using drop_duplicates would be

df.sort('C').drop_duplicates(subset=['A', 'B'], take_last=True)

I am not sure which is more efficient, but I would guess the first approach, as it doesn't involve sorting.

EDIT:
From pandas 0.18 up the second solution would be

df.sort_values('C').drop_duplicates(subset=['A', 'B'], keep='last')

or, alternatively,

df.sort_values('C', ascending=False).drop_duplicates(subset=['A', 'B'])
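One difference between the two approaches is worth checking on your own data: when several rows in a group share the maximum C, the groupby version keeps all of them, while drop_duplicates keeps exactly one row per (A, B) pair. A small sketch with hypothetical data:

```python
import pandas as pd

# Hypothetical data with a tie: both rows of group (1, "x") have C == 30.
df = pd.DataFrame({
    "A": [1, 1, 2],
    "B": ["x", "x", "y"],
    "C": [30, 30, 5],
})

# groupby/transform keeps every row that reaches the group maximum.
by_groupby = df.loc[df["C"] == df.groupby(["A", "B"])["C"].transform("max")]

# sort_values + drop_duplicates keeps exactly one row per (A, B) pair.
by_sort = df.sort_values("C").drop_duplicates(subset=["A", "B"], keep="last")

print(len(by_groupby))  # 3
print(len(by_sort))     # 2
```

If ties cannot occur in C, both approaches select the same rows.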

In any case, the groupby solution seems to be significantly faster:

%timeit -n 10 df.loc[df.groupby(['A', 'B']).C.transform(max) == df.C]
10 loops, best of 3: 25.7 ms per loop

%timeit -n 10 df.sort_values('C').drop_duplicates(subset=['A', 'B'], keep='last')
10 loops, best of 3: 101 ms per loop

How to remove rows that appear the same in two columns simultaneously in a dataframe?

I believe you need to sort both columns row-wise with np.sort and then filter with DataFrame.duplicated, using an inverted mask:

df1 = pd.DataFrame(np.sort(DF1[['Id1', 'Id2']].to_numpy(), axis=1), index=DF1.index)

df = DF1[~df1.duplicated()]
print (df)
Id1 Id2
0 286 409
1 286 257
3 257 183

Detail: if you use numpy.sort with axis=1, it sorts within each row, so the first and third rows become the same:

print (np.sort(DF1[['Id1', 'Id2']].to_numpy(), axis=1))
[[286 409]
[257 286]
[286 409]
[183 257]]

Then use the DataFrame.duplicated function (it works on a DataFrame, hence the DataFrame constructor):

df1 = pd.DataFrame(np.sort(DF1[['Id1', 'Id2']].to_numpy(), axis=1), index=DF1.index)
print (df1)
0 1
0 286 409
1 257 286
2 286 409
3 183 257

The third row is a duplicate:

print (df1.duplicated())
0 False
1 False
2 True
3 False
dtype: bool

Finally, invert the mask to remove the duplicates; the output is filtered by boolean indexing:

print (DF1[~df1.duplicated()])
Id1 Id2
0 286 409
1 286 257
3 257 183
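The steps above can be put together as a self-contained sketch. DF1 is not shown in full in the answer, so the third row here is an assumption (the first row with its two values swapped), which is consistent with the sorted output shown above. The helper name drop_swapped_duplicates is hypothetical.

```python
import numpy as np
import pandas as pd

# Assumed input: the third row is taken to be the first row with its two
# values swapped, which is consistent with the sorted output shown above.
DF1 = pd.DataFrame({"Id1": [286, 286, 409, 257],
                    "Id2": [409, 257, 286, 183]})


def drop_swapped_duplicates(frame, cols):
    """Hypothetical helper: drop rows that repeat an earlier row in `cols`,
    treating (a, b) and (b, a) as the same pair."""
    # Sort the values within each row so that swapped pairs become identical.
    sorted_pairs = pd.DataFrame(np.sort(frame[cols].to_numpy(), axis=1),
                                index=frame.index)
    # duplicated() marks repeats; ~ inverts the mask to keep first occurrences.
    return frame[~sorted_pairs.duplicated()]


deduped = drop_swapped_duplicates(DF1, ["Id1", "Id2"])
print(deduped.index.tolist())  # [0, 1, 3]
```

Note that the rows returned keep their original column order; only the temporary frame used for the comparison is sorted.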

Remove duplicate rows where columns values have been swapped

I think there is a problem with using np.sort(df.values, axis=1). While it sorts each row independently (good), it does not respect which column the values come from (bad). In other words, these two hypothetical rows

forename_1   surname_1   area_1   forename_2   surname_2   area_2
george       neil        g        jim          bob         k
george       jim         k        neil         bob         g

would get sorted identically

In [377]: np.sort(np.array([['george', 'neil', 'g', 'jim', 'bob', 'k'],
   .....:                   ['george', 'jim', 'k', 'neil', 'bob', 'g']]), axis=1)
Out[377]:
array([['bob', 'g', 'george', 'jim', 'k', 'neil'],
       ['bob', 'g', 'george', 'jim', 'k', 'neil']],
      dtype='<U6')

even though their (forename, surname, area) triplets are different.

To handle this possibility, we could instead use jezrael's original stack/unstack approach, with a df.sort_values sandwiched in the middle:

import numpy as np
import pandas as pd
df = pd.DataFrame(
    {'area_1': ['g', 'k', 'k', 'k', 'q', 'w', 's'],
     'area_2': ['k', 'g', 'g', 'q', 'k', 'p', 'l'],
     'forename_1': ['george', 'george', 'jim', 'pete', 'dan', 'ben', 'charlie'],
     'forename_2': ['jim', 'neil', 'george', 'dan', 'pete', 'richard', 'graham'],
     'surname_1': ['neil', 'jim', 'bob', 'keith', 'joe', 'steve', 'david'],
     'surname_2': ['bob', 'bob', 'neil', 'joe', 'keith', 'ed', 'josh']})

def using_stack_sort_unstack(df):
    df = df.copy()
    df.columns = df.columns.str.split('_', expand=True)
    df2 = df.stack()
    df2 = df2.sort_values(by=['forename', 'surname', 'area'])
    colnum = (df2.groupby(level=0).cumcount()+1).astype(str)
    df2.index = pd.MultiIndex.from_arrays([df2.index.get_level_values(0), colnum])
    df2 = df2.unstack().drop_duplicates()
    df2.columns = df2.columns.map('_'.join)
    return df2

print(using_stack_sort_unstack(df))

yields

  area_1 area_2 forename_1 forename_2 surname_1 surname_2
0      g      k     george        jim      neil       bob
1      k      g     george       neil       jim       bob
3      q      k        dan       pete       joe     keith
5      w      p        ben    richard     steve        ed
6      s      l    charlie     graham     david      josh

The purpose of the stack/sort/unstack operations:

    df2 = df.stack()
    df2 = df2.sort_values(by=['forename', 'surname', 'area'])
    colnum = (df2.groupby(level=0).cumcount()+1).astype(str)
    df2.index = pd.MultiIndex.from_arrays([df2.index.get_level_values(0), colnum])
    df2 = df2.unstack().drop_duplicates()

is to sort the ('forename', 'surname', 'area') triplets in each row
individually. The sorting helps drop_duplicates identify (and drop) rows
which we want to consider identical.
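The same idea, sorting the triplets within each row so that swapped rows compare equal, can also be sketched without stack/unstack. This is an alternative illustration, not the approach used in the answer: it builds an order-insensitive key per row from the two (forename, surname, area) triplets and keeps the first row seen for each key. The helper name triplet_key is hypothetical, and unlike using_stack_sort_unstack this version returns the original rows unchanged rather than their sorted form.

```python
import pandas as pd

df = pd.DataFrame(
    {'area_1': ['g', 'k', 'k', 'k', 'q', 'w', 's'],
     'area_2': ['k', 'g', 'g', 'q', 'k', 'p', 'l'],
     'forename_1': ['george', 'george', 'jim', 'pete', 'dan', 'ben', 'charlie'],
     'forename_2': ['jim', 'neil', 'george', 'dan', 'pete', 'richard', 'graham'],
     'surname_1': ['neil', 'jim', 'bob', 'keith', 'joe', 'steve', 'david'],
     'surname_2': ['bob', 'bob', 'neil', 'joe', 'keith', 'ed', 'josh']})


def triplet_key(row):
    """Hypothetical helper: an order-insensitive key for one row."""
    first = (row.forename_1, row.surname_1, row.area_1)
    second = (row.forename_2, row.surname_2, row.area_2)
    # Sorting the two triplets makes (p1, p2) and (p2, p1) produce the same key.
    return tuple(sorted([first, second]))


seen = set()
keep = []  # boolean mask: True for the first row carrying each key
for row in df.itertuples(index=False):
    key = triplet_key(row)
    keep.append(key not in seen)
    seen.add(key)

deduped = df[keep]
print(deduped.index.tolist())  # [0, 1, 3, 5, 6]
```

Rows 0 and 1 are both kept because their triplets differ, even though the six individual values are the same.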


This shows the difference between using_stack_sort_unstack and using_npsort.
Notice that using_npsort(df) returns 4 rows while
using_stack_sort_unstack(df) returns 5 rows:

def using_npsort(df):
    df1 = pd.DataFrame(np.sort(df.values, axis=1), index=df.index).drop_duplicates()
    df2 = df.loc[df1.index]
    return df2
print(using_npsort(df))

#   area_1 area_2 forename_1 forename_2 surname_1 surname_2
# 0      g      k     george        jim      neil       bob
# 3      k      q       pete        dan     keith       joe
# 5      w      p        ben    richard     steve        ed
# 6      s      l    charlie     graham     david      josh

Remove Duplicates in Table Based on Multiple Column

To evaluate the two columns together, you have to pass BOTH column indexes as an array. This applies to removing duplicates on any range or table, and also when you want to filter on multiple members.

In your case, since you want the second and third columns in your table evaluated, you can easily rewrite your code as:

Sheets("A").ListObjects("Data").Range.RemoveDuplicates Columns:=Array(2,3), Header:=xlYes

Compare and Remove Duplicate Rows based off TWO columns VBA

Include the whole range and use Array to include both columns:

Range("A1:B6").RemoveDuplicates Columns:=Array(1,2), Header:=xlNo

