Delete duplicate rows in two columns simultaneously
If you want to use subset
, you could try:
subset(df, !duplicated(subset(df, select=c(allrl, RAW.PVAL))))
# RAW.PVAL GR allrl Bak
#1 0.05 fr EN1 B12
#3 0.45 fr EN2 B10
#4 0.35 fg EN2 B066
But, I think @thelatemail's approach would be better
df[!duplicated(df[c("RAW.PVAL","allrl")]),]
Remove duplicates from dataframe, based on two columns A,B, keeping row with max value in another column C
You can do it using group by:
c_maxes = df.groupby(['A', 'B']).C.transform(max)
df = df.loc[df.C == c_maxes]
c_maxes
is a Series
of the maximum values of C
in each group but which is of the same length and with the same index as df
. If you haven't used .transform
then printing c_maxes
might be a good idea to see how it works.
Another approach using drop_duplicates
would be
df.sort('C').drop_duplicates(subset=['A', 'B'], take_last=True)
Not sure which is more efficient but I guess the first approach as it doesn't involve sorting.
EDIT:
From pandas 0.18
up the second solution would be
df.sort_values('C').drop_duplicates(subset=['A', 'B'], keep='last')
or, alternatively,
df.sort_values('C', ascending=False).drop_duplicates(subset=['A', 'B'])
In any case, the groupby
solution seems to be significantly more performing:
%timeit -n 10 df.loc[df.groupby(['A', 'B']).C.max == df.C]
10 loops, best of 3: 25.7 ms per loop
%timeit -n 10 df.sort_values('C').drop_duplicates(subset=['A', 'B'], keep='last')
10 loops, best of 3: 101 ms per loop
how to remove rows that appear same in two columns simultaneously in dataframe?
I believe you need sorting both columns by np.sort
and filter by DataFrame.duplicated
with inverse mask:
df1 = pd.DataFrame(np.sort(DF1[['Id1', 'Id2']].to_numpy(), axis=1), index=DF1.index)
df = DF1[~df1.duplicated()]
print (df)
Id1 Id2
0 286 409
1 286 257
3 257 183
Detail : If use numpy.sort
with axis=1
it sorting per rows, so first and third 'row'
are same:
print (np.sort(DF1[['Id1', 'Id2']].to_numpy(), axis=1))
[[286 409]
[257 286]
[286 409]
[183 257]]
Then use DataFrame.duplicated
function (working with DataFrame, so used DataFrame constructor):
df1 = pd.DataFrame(np.sort(DF1[['Id1', 'Id2']].to_numpy(), axis=1), index=DF1.index)
print (df1)
0 1
0 286 409
1 257 286
2 286 409
3 183 257
Third value is duplicate:
print (df1.duplicated())
0 False
1 False
2 True
3 False
dtype: bool
Last is necessary invert mask for remove duplicates, output is filtered in boolean indexing
:
print (DF1[~df1.duplicated()])
Id1 Id2
0 286 409
1 286 257
3 257 183
Remove duplicate rows where columns values have been swapped
I think there is a problem with using np.sort(df.values, axis=1)
. While it sorts each row independently (good), it does not respect which column the values come from (bad). In other words, these two hypothetical rows
forename_1 surname_1 area_1 forename_2 surname_2 area_2
george neil g jim bob k
george jim k neil bob g
would get sorted identically
In [377]: np.sort(np.array([['george', 'neil', 'g', 'jim', 'bob', 'k'],
['george', 'jim', 'k', 'neil', 'bob', 'g']]), axis=1)
.....: Out[377]:
array([['bob', 'g', 'george', 'jim', 'k', 'neil'],
['bob', 'g', 'george', 'jim', 'k', 'neil']],
dtype='<U6')
even though their (forename, surname, area)
triplets are different.
To handle this possibility, we could instead use jezrael's original stack/unstack approach, with a df.sort_values
sandwiched in the middle:
import numpy as np
import pandas as pd
df = pd.DataFrame(
{'area_1': ['g', 'k', 'k', 'k', 'q', 'w', 's'],
'area_2': ['k', 'g', 'g', 'q', 'k', 'p', 'l'],
'forename_1': ['george', 'george', 'jim', 'pete', 'dan', 'ben', 'charlie'],
'forename_2': ['jim', 'neil', 'george', 'dan', 'pete', 'richard', 'graham'],
'surname_1': ['neil', 'jim', 'bob', 'keith', 'joe', 'steve', 'david'],
'surname_2': ['bob', 'bob', 'neil', 'joe', 'keith', 'ed', 'josh']})
def using_stack_sort_unstack(df):
df = df.copy()
df.columns = df.columns.str.split('_', expand=True)
df2 = df.stack()
df2 = df2.sort_values(by=['forename', 'surname', 'area'])
colnum = (df2.groupby(level=0).cumcount()+1).astype(str)
df2.index = pd.MultiIndex.from_arrays([df2.index.get_level_values(0), colnum])
df2 = df2.unstack().drop_duplicates()
df2.columns = df2.columns.map('_'.join)
return df2
print(using_stack_sort_unstack(df))
yields
area_1 area_2 forename_1 forename_2 surname_1 surname_2
0 g k george jim neil bob
1 k g george neil jim bob
3 q k dan pete joe keith
5 w p ben richard steve ed
6 s l charlie graham david josh
The purpose of the stack/sort/unstack operations:
df2 = df.stack()
df2 = df2.sort_values(by=['forename', 'surname', 'area'])
colnum = (df2.groupby(level=0).cumcount()+1).astype(str)
df2.index = pd.MultiIndex.from_arrays([df2.index.get_level_values(0), colnum])
df2 = df2.unstack().drop_duplicates()
is to sort the ('forename', 'surname', 'area')
triplets in each row
individually. The sorting helps drop_duplicates
identify (and drop) rows
which we want to consider identical.
This shows the difference between using_stack_sort_unstack
and using_npsort
.
Notice that using_npsort(df)
returns 4 rows whileusing_stack_sort_unstack(df)
returns 5 rows:
def using_npsort(df):
df1 = pd.DataFrame(np.sort(df.values, axis=1), index=df.index).drop_duplicates()
df2 = df.loc[df1.index]
return df2
print(using_npsort(df))
# area_1 area_2 forename_1 forename_2 surname_1 surname_2
# 0 g k george jim neil bob
# 3 k q pete dan keith joe
# 5 w p ben richard steve ed
# 6 s l charlie graham david josh
Remove Duplicates in Table Based on Multiple Column
To combine the two columns, you have to capture BOTH sets of the data as an array. This applies to removing duplicates on any data set range or table, as well as if you want to Filter on multiple members.
In your case since you want the second and third columns in your table evaluated, you can easily rewrite your code as:
Sheets("A").ListObjects("Data").Range.RemoveDuplicates Columns:=Array(2,3), Header:=xlYes
Compare and Remove Duplicate Rows based off TWO columns VBA
Include the whole range and use Array
to include both columns:
Range("A1:B6").RemoveDuplicates Columns:=Array(1,2), Header:=xlNo
Related Topics
How to Maintain Size of Ggplot with Long Labels
Dplyr Issues When Using Group_By(Multiple Variables)
Using Parallel's Parlapply: Unable to Access Variables Within Parallel Code
How to Disable "Save Workspace Image" Prompt in R
Grid of Multiple Ggplot2 Plots Which Have Been Made in a for Loop
Can Sweave Produce Many PDFs Automatically
Ggplot2: Adjust the Symbol Size in Legends
Replace All Values in a Matrix <0.1 with 0
Removing Multiple Columns from R Data.Table with Parameter for Columns to Remove
How to Set Fixed Continuous Colour Values in Ggplot2
How to Determine If You Have an Internet Connection in R
Creating Regular 15-Minute Time-Series from Irregular Time-Series
Add a Horizontal Line to Plot and Legend in Ggplot2
Finding Row Index Containing Maximum Value Using R
How to Convert Data Frame to Spatial Coordinates
Ggplot2 Does Not Appear to Work When Inside a Function R