Remove duplicate column pairs, sort rows based on 2 columns
Instead of sorting the whole dataset, sort the values within each row across the 'Var1' and 'Var2' columns only, and then use duplicated
to remove the duplicate rows:
testdata[1:2] <- t( apply(testdata[1:2], 1, sort) )
testdata[!duplicated(testdata[1:2]),]
# Var1 Var2 f g
#1 1 4 blue blue
#2 2 3 green green
#5 5 5 orange2 orange1
Removing duplicates based on two columns in R
Using data.table v1.9.5 or later:
require(data.table) # v1.9.5+
df[!duplicated(rleidv(df, cols = c("X", "Y"))), ]
rleidv() is best understood with examples:
rleidv(c(1,1,1,2,2,3,1,1))
# [1] 1 1 1 2 2 3 4 4
A unique index is generated for each consecutive run of values.
The same can be done on a list(), data.frame(), or data.table(), and on a specific set of columns as well. For example:
df = data.frame(a = c(1,1,2,2,1), b = c(2,3,4,4,2))
rleidv(df) # computes on both columns 'a,b'
# [1] 1 2 3 3 4
rleidv(df, cols = "a") # only looks at 'a'
# [1] 1 1 2 2 3
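The rest follows directly: we check the run ids for duplicated() values and return the non-duplicated rows. Because rleidv() numbers consecutive runs, this drops only rows that repeat the pair in the row immediately before them; a pair that reappears later starts a new run and is kept.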
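For readers who think in Python rather than R, the run-id idea can be sketched in a few lines. This is only an illustration of the technique, not data.table's implementation; the helper name run_ids is hypothetical, and the sample rows mirror the a,b data frame above.

```python
from itertools import groupby


def run_ids(values):
    """Return a 1-based id for each consecutive run of equal values (hypothetical helper)."""
    ids = []
    for run_number, (_, run) in enumerate(groupby(values), start=1):
        # Every element of the same consecutive run gets the same id.
        ids.extend(run_number for _ in run)
    return ids


print(run_ids([1, 1, 1, 2, 2, 3, 1, 1]))  # [1, 1, 1, 2, 2, 3, 4, 4]

# Rows as (a, b) tuples, mirroring the data frame example above.
rows = [(1, 2), (1, 3), (2, 4), (2, 4), (1, 2)]
ids = run_ids(rows)

# Keep the first row of each run, i.e. drop consecutive repeats only.
seen = set()
kept = []
for row, run_id in zip(rows, ids):
    if run_id not in seen:
        seen.add(run_id)
        kept.append(row)

print(kept)  # [(1, 2), (1, 3), (2, 4), (1, 2)]
```

The last row (1, 2) survives even though the same pair appears first, because it belongs to a different run.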
Remove duplicates from dataframe, based on two columns A,B, keeping row with max value in another column C
You can do it using group by:
c_maxes = df.groupby(['A', 'B']).C.transform(max)
df = df.loc[df.C == c_maxes]
c_maxes is a Series of the maximum values of C in each group, but it has the same length and the same index as df. If you haven't used .transform before, printing c_maxes is a good way to see how it works.
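A small sketch of what the transform step produces, using made-up data (the values below are assumptions for illustration, not the asker's data). It also shows one consequence of the comparison: rows tied for the maximum are all kept.

```python
import pandas as pd

# Hypothetical sample data; columns A and B form the key, C is the value to maximise.
df = pd.DataFrame({
    "A": [1, 1, 2, 2, 2],
    "B": ["x", "x", "y", "y", "z"],
    "C": [10, 30, 5, 5, 7],
})

# transform returns one value per original row, aligned with df's index.
c_maxes = df.groupby(["A", "B"])["C"].transform("max")
print(c_maxes.tolist())  # [30, 30, 5, 5, 7]

# Keep the rows whose C equals the maximum of their (A, B) group.
result = df.loc[df["C"] == c_maxes]
print(result.index.tolist())  # [1, 2, 3, 4]
```

Rows 2 and 3 both hold the maximum of the (2, "y") group, so both remain; if exactly one row per pair is required, a further de-duplication on A and B is still needed.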
Another approach, using drop_duplicates, would be
df.sort('C').drop_duplicates(subset=['A', 'B'], take_last=True)
Not sure which is more efficient but I guess the first approach as it doesn't involve sorting.
EDIT: From pandas 0.18 up, the second solution would be
df.sort_values('C').drop_duplicates(subset=['A', 'B'], keep='last')
or, alternatively,
df.sort_values('C', ascending=False).drop_duplicates(subset=['A', 'B'])
In any case, the groupby solution seems to perform significantly better:
%timeit -n 10 df.loc[df.groupby(['A', 'B']).C.transform(max) == df.C]
10 loops, best of 3: 25.7 ms per loop
%timeit -n 10 df.sort_values('C').drop_duplicates(subset=['A', 'B'], keep='last')
10 loops, best of 3: 101 ms per loop
How do I remove duplicate rows based on two Columns? (E.g. A Pair Value Set)
You probably want to keep the underlying data, because "Cassandra likes Gabriel" and "Gabriel likes Cassandra" are different actions. Rather than deleting rows, I suggest filtering the mirrored pairs out in the query:
WITH cte AS (
    SELECT hs.NAME   AS Highschooler,
           hs.grade  AS inGrade1,
           hs2.NAME  AS likes,
           hs2.grade AS inGrade2,
           ROW_NUMBER() OVER (
               PARTITION BY CASE WHEN l.id1 < l.id2 THEN l.id1 ELSE l.id2 END,
                            CASE WHEN l.id1 < l.id2 THEN l.id2 ELSE l.id1 END
               ORDER BY (SELECT NULL)
           ) AS rn
    FROM highschooler hs
    JOIN likes l ON hs.id = l.id1
    JOIN highschooler hs2 ON hs2.id = l.id2
)
SELECT * FROM cte WHERE rn = 1
Here is a demonstration:
DECLARE @t TABLE ( id1 INT, id2 INT )
INSERT INTO @t
VALUES ( 1, 2 ),
       ( 2, 1 ),
       ( 1, 3 ),
       ( 5, 6 ),
       ( 6, 5 ),
       ( 7, 8 );
WITH cte AS (
    SELECT *,
           ROW_NUMBER() OVER (
               PARTITION BY CASE WHEN id1 < id2 THEN id1 ELSE id2 END,
                            CASE WHEN id1 < id2 THEN id2 ELSE id1 END
               ORDER BY (SELECT NULL)
           ) AS rn
    FROM @t
)
SELECT * FROM cte WHERE rn = 1
Output:
id1 id2 rn
1 2 1
1 3 1
5 6 1
7 8 1
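The partitioning key in the query is simply the pair written in a canonical order: smaller id first, larger id second. A minimal Python sketch of the same idea, using the demonstration rows, may make that easier to see. One difference is an assumption of this sketch: it keeps the first row seen for each pair, whereas ORDER BY (SELECT NULL) leaves the choice of row to the database.

```python
# The (id1, id2) rows from the demonstration table.
pairs = [(1, 2), (2, 1), (1, 3), (5, 6), (6, 5), (7, 8)]

seen = set()
kept = []
for id1, id2 in pairs:
    # Canonical key: the same for (1, 2) and (2, 1).
    key = (min(id1, id2), max(id1, id2))
    if key not in seen:
        seen.add(key)
        kept.append((id1, id2))

print(kept)  # [(1, 2), (1, 3), (5, 6), (7, 8)]
```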
Remove duplicates based on the content of two columns not the order
Use:
# to select the columns by name (index=df.index keeps the mask aligned with df)
m = ~pd.DataFrame(np.sort(df[['First','Second']], axis=1), index=df.index).duplicated()
# to select the columns by position
#m = ~pd.DataFrame(np.sort(df.iloc[:,:2], axis=1), index=df.index).duplicated()
print (m)
0 True
1 False
2 True
dtype: bool
df = df[m]
print (df)
First Second Value
0 A B 0.5
2 A C 0.2
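A self-contained version of the same approach, as a sketch. The input frame is reconstructed from the output above; the dropped middle row (B, A, 0.5) and the non-default index labels are assumptions made for illustration. Passing index=df.index keeps the mask aligned with df, since boolean indexing aligns on index labels.

```python
import numpy as np
import pandas as pd

# Hypothetical input; the index is deliberately not 0..n-1.
df = pd.DataFrame(
    {"First": ["A", "B", "A"], "Second": ["B", "A", "C"], "Value": [0.5, 0.5, 0.2]},
    index=[10, 11, 12],
)

# Sort each row's pair so that (B, A) and (A, B) become the same key.
sorted_pairs = pd.DataFrame(
    np.sort(df[["First", "Second"]].to_numpy(), axis=1),
    index=df.index,  # keep the mask aligned with df
)

# True for the first occurrence of each unordered pair.
mask = ~sorted_pairs.duplicated()
deduped = df[mask]

print(deduped.index.tolist())  # [10, 12]
```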
R: remove rows from a data frame that contain a duplicate of either combination of 2 columns
We can sort by row using apply with MARGIN=1, transpose (t) the output, use duplicated to get the logical index of duplicate rows, negate (!) to get the rows that are not duplicated, and subset the dataset.
combo[!duplicated(t(apply(combo, 1, sort))),]
# Var1 Var2
#2 B A
#3 C A
#6 C B