Remove Duplicate Column Combinations from a Data Frame in R

Remove duplicate combinations in R

df[!duplicated(t(apply(df[c("a", "b")], 1, sort))), ]
  a b c
1 1 4 A
2 2 3 B
3 1 5 C

Where:

df <- data.frame(
  a = c(1L, 2L, 1L, 4L, 5L, 3L, 3L), 
  b = c(4L, 3L, 5L, 1L, 1L, 2L, 2L), 
  c = c("A", "B", "C", "A", "C", "B", "E")
)
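
To see why the one-liner works, it can help to unpack it into steps. The following is a sketch on the same df; the intermediate names key, keep and result are introduced here only for illustration.

```r
df <- data.frame(
  a = c(1L, 2L, 1L, 4L, 5L, 3L, 3L),
  b = c(4L, 3L, 5L, 1L, 1L, 2L, 2L),
  c = c("A", "B", "C", "A", "C", "B", "E")
)

# Sort each row's (a, b) pair so that (4, 1) and (1, 4) become identical.
# apply() returns one sorted pair per column, so transpose back to one pair per row.
key <- t(apply(df[c("a", "b")], 1, sort))

# duplicated() on a matrix compares whole rows; negating it keeps first occurrences.
keep <- !duplicated(key)

result <- df[keep, ]
```

Only the first three rows survive, because rows 4 to 7 repeat an earlier pair once the order of a and b is ignored.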

Remove duplicate column combinations from a dataframe in R

duplicated() has a method for data.frames, which is designed for just this sort of task:

df <- data.frame(a = c(1:4, 1:4), 
                 b = c(4:1, 4:1), 
                 d = LETTERS[1:8])

df[!duplicated(df[c("a", "b")]),]
#   a b d
# 1 1 4 A
# 2 2 3 B
# 3 3 2 C
# 4 4 1 D
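
If the last occurrence of each (a, b) pair should be kept instead of the first, duplicated() accepts a fromLast argument. A small sketch on the same data (the name last_kept is only illustrative):

```r
df <- data.frame(a = c(1:4, 1:4),
                 b = c(4:1, 4:1),
                 d = LETTERS[1:8])

# fromLast = TRUE scans from the bottom, so the earlier repeats are flagged instead.
last_kept <- df[!duplicated(df[c("a", "b")], fromLast = TRUE), ]
last_kept
```

With this data, rows 5 to 8 are kept rather than rows 1 to 4.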

Can I remove duplicate combinations across two columns in R

I would suggest this base R approach:

#Data
df <- structure(list(firm1 = c("A", "A", "A", "B", "D", "G"), firm2 = c("B", 
"D", "G", "A", "A", "A")), row.names = c(NA, -6L), class = "data.frame")

The code:

df[!duplicated(lapply(strsplit(paste0(df$firm1,df$firm2),split = ''),sort)),]

Output:

  firm1 firm2
1     A     B
2     A     D
3     A     G
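
The paste0/strsplit approach splits the pasted string into single characters, so it compares letters rather than whole names and is best suited to one-character values like those above. For longer names, one possible alternative is an order-independent key built with pmin() and pmax(). This is a sketch, and the firm names below are made up for illustration.

```r
# Hypothetical data with multi-character firm names
firms <- data.frame(
  firm1 = c("Acme", "Acme", "Bolt", "Core"),
  firm2 = c("Bolt", "Core", "Acme", "Acme"),
  stringsAsFactors = FALSE
)

# pmin()/pmax() pick the alphabetically smaller/larger name in each row,
# so ("Bolt", "Acme") and ("Acme", "Bolt") map to the same key.
key <- data.frame(lo = pmin(firms$firm1, firms$firm2),
                  hi = pmax(firms$firm1, firms$firm2))

firms_unique <- firms[!duplicated(key), ]
```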

Removing duplicate all-way-combinations while retaining all columns

Here's a base solution, using the complete.cases function, and also creating a sorted feedID column:

# remove any rows with NA values
test <- test[complete.cases(test[,c('ID', 'feedID','feedID2')]),]
#remove any rows with feedID == feedID2
test <- test[!(test$feedID == test$feedID2),]
# add new feedID3 column
test$feedID3 <- apply(test, 1, function(x) paste(sort(c(x[2], x[3])), collapse = '-'))
# remove any duplicates, and remove last column
test[!duplicated(test[,c('feedID3', 'ID')]), -4]

   ID feedID feedID2
2 49V     A1      G2
6 52V     B1      D1
7 52V     D1      D2

data

Note that we have converted "NA" to NA, and we have also set stringsAsFactors = FALSE

test <- data.frame(ID= c("49V", "49V","49V", "49V", "49V", "52V", "52V", "52V"),
                   feedID = c("A1", "A1", "G2", "A1", "G2", "B1", "D1",  "D2" ),
                   feedID2 = c("A1", "G2", "A1", "G2", NA, "D1", "D2",  NA ),
                   stringsAsFactors = FALSE)

Remove duplicates based on two columns, keep one with a larger value on third column while keeping all columns intact

You can group by x2 and x3 and use slice() with which.max() to keep the row with the largest x4 in each group:

library(dplyr)

df %>% 
 group_by(x2, x3) %>% 
 slice(which.max(x4))

# A tibble: 3 x 4
# Groups:   x2, x3 [3]
  x1    x2    x3       x4
  <chr> <chr> <chr> <int>
1 X     A     B         4
2 Z     A     C         1
3 X     C     B         5
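
A similar result can be sketched in base R by sorting on x4 and then keeping the first row per (x2, x3) pair. The data below is hypothetical, since the original df is not shown, and the names ordered and top are illustrative only.

```r
# Hypothetical data; the original question's df is not reproduced here
df <- data.frame(
  x1 = c("X", "Y", "Z", "X", "Y"),
  x2 = c("A", "A", "A", "C", "C"),
  x3 = c("B", "B", "C", "B", "B"),
  x4 = c(4L, 2L, 1L, 5L, 3L),
  stringsAsFactors = FALSE
)

# Put the largest x4 values first, then keep the first row seen for each (x2, x3).
ordered <- df[order(df$x4, decreasing = TRUE), ]
top <- ordered[!duplicated(ordered[c("x2", "x3")]), ]
```

Unlike the grouped version, the rows come back ordered by x4 rather than by group.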

Remove duplicate combinations from cross join result in R

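After the cross join (here written as full_join with by = character(); dplyr 1.1.0 and later also provide cross_join() for this), filter with an inequality to drop reverse duplicates and self-pairs: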

df <- source_df %>% 
   full_join(source_df, by = character()) %>%
   filter(TableFrom < TableTo)
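
The same idea can be tried without dplyr. The sketch below assumes the source is just a vector of table names; the column names TableFrom and TableTo follow the snippet above, while the values are placeholders.

```r
# Placeholder table names
tables <- c("orders", "customers", "products")

# expand.grid() builds every ordered pair, including self-pairs and reversed pairs.
pairs <- expand.grid(TableFrom = tables, TableTo = tables,
                     stringsAsFactors = FALSE)

# Keeping only TableFrom < TableTo drops self-pairs and one of each reversed pair.
unique_pairs <- pairs[pairs$TableFrom < pairs$TableTo, ]
```

Three names give nine ordered pairs, of which three unordered pairs remain after the filter.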

R: Remove duplicates from a dataframe based on categories in a column

Here is a snippet that does what you asked:

df$Category <- factor(df$Category, levels = c("PT", "DI", "GT", "SY"))

df <- df[order(df$Category),]

df[!duplicated(df[,c('Name', 'Course')]),]

Output:

Name Course Category
Jason     ML       PT
Nancy     ML       PT
Jason     DS       DI
Nancy     DS       DI
John      DS       GT
James     ML       SY

The idea is to sort by the priority order first. duplicated() then flags every row after the first occurrence of each Name/Course pair, so negating it keeps only the highest-priority row for each pair, which is the result we want.
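
As a self-contained illustration of the sort-then-deduplicate idea, here is a sketch on a small made-up input. The data and the name kept are hypothetical; the factor levels are the ones used above.

```r
# Hypothetical input; the original question's data is not reproduced here
df <- data.frame(
  Name     = c("Jason", "Jason", "Nancy", "Nancy"),
  Course   = c("ML", "ML", "DS", "DS"),
  Category = c("SY", "PT", "GT", "DI"),
  stringsAsFactors = FALSE
)

# Factor levels encode the priority: PT first, SY last.
df$Category <- factor(df$Category, levels = c("PT", "DI", "GT", "SY"))

# order() on a factor sorts by level position, so high-priority rows come first;
# duplicated() then flags every later row with the same Name/Course.
df <- df[order(df$Category), ]
kept <- df[!duplicated(df[, c("Name", "Course")]), ]
```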

Remove duplicate rows ignoring column order in R

df <- data.frame(
    var1 = c("a", "b", "a", "c", "b", "c"), 
    var2 = c("b", "a", "c", "a", "c", "b"), 
    value = c(0.576, 0.576, 0.987, 0.987, 0.034, 0.034)
)

A one-liner base-r solution:

df_unique <- df[!duplicated(apply(df[,1:2], 1, function(row) paste(sort(row), collapse=""))),]

df_unique
  var1 var2 value
1    a    b 0.576
3    a    c 0.987
5    b    c 0.034

What it does: work across the first 2 columns row-wise (apply with MARGIN = 1), sort (alphabetically) the content, paste into a single string, remove all indices where the string has already occurred before (!duplicated).
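
One caveat worth checking: with collapse = "" the pasted key can collide when values are longer than one character. The sketch below uses made-up rows to show the collision and one possible fix, a separator that is assumed not to occur in the data.

```r
# Hypothetical rows whose values are longer than one character
rows <- data.frame(var1 = c("ab", "a"), var2 = c("c", "bc"),
                   stringsAsFactors = FALSE)

# With collapse = "", both rows produce the key "abc" and look like duplicates.
key_plain <- apply(rows, 1, function(row) paste(sort(row), collapse = ""))

# A separator that cannot occur in the data keeps the keys distinct.
key_sep <- apply(rows, 1, function(row) paste(sort(row), collapse = "|"))
```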

Another (probably better) approach, stepping back, is to take your original matrix and blank out the lower half, including the diagonal, using lower.tri. This way only one of each pair of combinations keeps a non-NA value:

mat <- matrix(c(0, 0.576, 0.987, 0.576, 0, 0.034, 0.987, 0.034, 0), 
              nrow=3, dimnames=list(letters[1:3], letters[1:3]))

mat[lower.tri(mat, diag = TRUE)] <- NA
mat
   a     b     c
a NA 0.576 0.987
b NA    NA 0.034
c NA    NA    NA
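
If a long table of pairs is needed rather than a matrix, the upper triangle can be read out directly. This is a sketch under the assumption that the matrix is symmetric; the names idx and pairs are illustrative.

```r
mat <- matrix(c(0, 0.576, 0.987, 0.576, 0, 0.034, 0.987, 0.034, 0),
              nrow = 3, dimnames = list(letters[1:3], letters[1:3]))

# Positions of the cells strictly above the diagonal, as (row, col) indices.
idx <- which(upper.tri(mat), arr.ind = TRUE)

# One row per unordered pair, with the value read from the matching cell.
pairs <- data.frame(
  var1  = rownames(mat)[idx[, "row"]],
  var2  = colnames(mat)[idx[, "col"]],
  value = mat[idx]
)
```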

Remove duplicates across columns

We can sort the elements in each row with apply, transpose the output, apply duplicated to get a logical vector, and use that to subset the rows (here df is a two-column character matrix):

df[!duplicated(t(apply(df[, 1:2], 1, sort))),]
#     [,1] [,2]
#[1,] "a"  "b" 
#[2,] "a"  "c" 
#[3,] "a"  "d" 
#[4,] "b"  "c" 
#[5,] "b"  "d" 
#[6,] "c"  "d"

Another option is pmin/pmax:

df[!duplicated(cbind(pmin(df[,1], df[,2]), pmax(df[,1], df[,2]))),]

data

df <- structure(c("a", "a", "a", "b", "b", "b", "c", "c", "c", "b", 
"c", "d", "a", "c", "d", "a", "b", "d"), .Dim = c(9L, 2L))
