Remove Duplicate Column Pairs, Sort Rows Based on 2 Columns

Remove duplicate column pairs, sort rows based on 2 columns

Instead of sorting the whole dataset, sort the values of 'Var1' and 'Var2' within each row, and then use duplicated to remove the duplicate rows:

testdata[1:2] <- t( apply(testdata[1:2], 1, sort) ) # sort the pair within each row
testdata[!duplicated(testdata[1:2]),]               # keep rows whose pair has not been seen yet
#  Var1 Var2       f       g
#1    1    4    blue    blue
#2    2    3   green   green
#5    5    5 orange2 orange1

Removing duplicates based on two columns in R

Using data.table v1.9.5 or later:

require(data.table) # v1.9.5+
df[!duplicated(rleidv(df, cols = c("X", "Y"))), ] # keep the first row of each consecutive run of (X, Y)

rleidv() is best understood with examples:

rleidv(c(1,1,1,2,2,3,1,1))
# [1] 1 1 1 2 2 3 4 4

A unique index is generated for each consecutive run of values.

And the same can be accomplished on a list(), data.frame(), or data.table(), restricted to a specific set of columns. For example:

df = data.frame(a = c(1,1,2,2,1), b = c(2,3,4,4,2))
rleidv(df) # computes on both columns 'a,b'
# [1] 1 2 3 3 4
rleidv(df, cols = "a") # only looks at 'a'
# [1] 1 1 2 2 3

The rest should be fairly obvious. We just check for duplicated() values, and return the non-duplicated ones.
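For readers coming from pandas, the same consecutive-run idea can be sketched as follows. This is only a rough equivalent of the data.table one-liner, not its official counterpart; the column names X and Y simply mirror the code above, and the frame itself is made up.

import pandas as pd

# Made-up frame with consecutive duplicate (X, Y) pairs.
df = pd.DataFrame({"X": [1, 1, 2, 2, 1], "Y": [2, 2, 4, 4, 2], "val": list("abcde")})

# Increment an id every time the (X, Y) pair changes from the previous row,
# which mimics what rleidv(df, cols = c("X", "Y")) computes.
run_id = (df[["X", "Y"]] != df[["X", "Y"]].shift()).any(axis=1).cumsum()

# duplicated() marks every row after the first within each run,
# so negating it keeps one row per consecutive run of (X, Y).
print(df[~run_id.duplicated()])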

Remove duplicates from dataframe, based on two columns A,B, keeping row with max value in another column C

You can do it using groupby:

c_maxes = df.groupby(['A', 'B']).C.transform(max)  # group-wise max of C, broadcast back to every row
df = df.loc[df.C == c_maxes]                       # keep the rows where C equals its group maximum

c_maxes is a Series of the maximum values of C in each group, but it has the same length and the same index as df. If you haven't used .transform before, printing c_maxes is a good way to see how it works.
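As an illustration only (the frame below is made up), printing c_maxes on a small example shows how .transform broadcasts each group's maximum back to every row of that group:

import pandas as pd

df = pd.DataFrame({'A': [1, 1, 2, 2],
                   'B': ['x', 'x', 'y', 'y'],
                   'C': [10, 30, 20, 20]})

c_maxes = df.groupby(['A', 'B']).C.transform(max)
print(c_maxes)
# 0    30
# 1    30
# 2    20
# 3    20
# Name: C, dtype: int64

print(df.loc[df.C == c_maxes])
#    A  B   C
# 1  1  x  30
# 2  2  y  20
# 3  2  y  20

Note that when a group's maximum is tied, as in group (2, 'y') above, this filter keeps all of the tied rows.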

Another approach using drop_duplicates would be

df.sort('C').drop_duplicates(subset=['A', 'B'], take_last=True)

I'm not sure which is more efficient, but I would guess the first approach, since it doesn't involve sorting.

EDIT:
From pandas 0.18 onwards, the second solution would be

df.sort_values('C').drop_duplicates(subset=['A', 'B'], keep='last')

or, alternatively,

df.sort_values('C', ascending=False).drop_duplicates(subset=['A', 'B'])

In any case, the groupby solution seems to perform significantly better:

%timeit -n 10 df.loc[df.groupby(['A', 'B']).C.transform(max) == df.C]
10 loops, best of 3: 25.7 ms per loop

%timeit -n 10 df.sort_values('C').drop_duplicates(subset=['A', 'B'], keep='last')
10 loops, best of 3: 101 ms per loop
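To reproduce a comparison like this outside IPython, the standard timeit module can be used. The data below is randomly generated and its size is arbitrary, so the absolute numbers will differ from the ones quoted above; this is only a sketch of the measurement, not the original benchmark.

import timeit
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 100000
df = pd.DataFrame({'A': rng.integers(0, 100, n),
                   'B': rng.integers(0, 100, n),
                   'C': rng.random(n)})

groupby_way = lambda: df.loc[df.groupby(['A', 'B']).C.transform(max) == df.C]
sort_way = lambda: df.sort_values('C').drop_duplicates(subset=['A', 'B'], keep='last')

print(timeit.timeit(groupby_way, number=10))  # total seconds for 10 runs
print(timeit.timeit(sort_way, number=10))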

How do I remove duplicate rows based on two Columns? (E.g. A Pair Value Set)

You probably want to keep the data as it is, because 'Cassandra likes Gabriel' and 'Gabriel likes Cassandra' are different actions. That said, I would suggest the following query:

WITH cte AS (
    SELECT hs.NAME   Highschooler,
           hs.grade  inGrade1,
           hs2.NAME  likes,
           hs2.grade inGrade2,
           ROW_NUMBER() OVER (PARTITION BY CASE WHEN l.id1 < l.id2 THEN l.id1 ELSE l.id2 END,
                                           CASE WHEN l.id1 < l.id2 THEN l.id2 ELSE l.id1 END
                              ORDER BY (SELECT NULL)) rn
    FROM highschooler hs
    JOIN likes l ON hs.id = l.id1
    JOIN highschooler hs2 ON hs2.id = l.id2
)
SELECT * FROM cte WHERE rn = 1

Here is a demonstration:

DECLARE @t TABLE ( id1 INT, id2 INT )

INSERT INTO @t
VALUES ( 1, 2 ),
       ( 2, 1 ),
       ( 1, 3 ),
       ( 5, 6 ),
       ( 6, 5 ),
       ( 7, 8 );

WITH cte AS (
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY CASE WHEN id1 < id2 THEN id1 ELSE id2 END,
                                           CASE WHEN id1 < id2 THEN id2 ELSE id1 END
                              ORDER BY (SELECT NULL)) rn
    FROM @t
)
SELECT * FROM cte WHERE rn = 1

Output:

id1  id2  rn
1    2    1
1    3    1
5    6    1
7    8    1

Remove duplicates based on the content of two columns not the order

Sort the two columns within each row with numpy.sort, so that (A, B) and (B, A) become the same pair, and then mark the duplicated pairs:

# if you want to select columns by column names
m = ~pd.DataFrame(np.sort(df[['First','Second']], axis=1)).duplicated()
# if you want to select columns by positions
#m = ~pd.DataFrame(np.sort(df.iloc[:,:2], axis=1)).duplicated()
print(m)

0     True
1    False
2     True
dtype: bool

df = df[m]
print(df)
  First Second  Value
0     A      B    0.5
2     A      C    0.2

R: remove rows from a data frame that contain a duplicate of either combination of 2 columns

We can sort by row using apply with MARGIN=1, transpose (t) the output, use duplicated to get the logical index of duplicate rows, negate (!) to get the rows that are not duplicated, and subset the dataset.

combo[!duplicated(t(apply(combo, 1, sort))),]
#  Var1 Var2
#2    B    A
#3    C    A
#6    C    B

