Remove Duplicate Column Combinations from a Dataframe in R

Remove duplicate combinations in R

df[!duplicated(t(apply(df[c("a", "b")], 1, sort))), ]
a b c
1 1 4 A
2 2 3 B
3 1 5 C

Where:

df <- data.frame(
a = c(1L, 2L, 1L, 4L, 5L, 3L, 3L),
b = c(4L, 3L, 5L, 1L, 1L, 2L, 2L),
c = c("A", "B", "C", "A", "C", "B", "E")
)
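
To see why the one-liner works, it can help to unpack it into separate steps. The following is a minimal sketch using the same df; the intermediate names sorted_pairs, keep and result are introduced here only for illustration.

```r
df <- data.frame(
  a = c(1L, 2L, 1L, 4L, 5L, 3L, 3L),
  b = c(4L, 3L, 5L, 1L, 1L, 2L, 2L),
  c = c("A", "B", "C", "A", "C", "B", "E")
)

# Sort the values of a and b within each row, so (4, 1) and (1, 4) look identical.
# apply() returns the sorted rows as columns, hence the transpose.
sorted_pairs <- t(apply(df[c("a", "b")], 1, sort))

# duplicated() on a matrix compares whole rows; TRUE marks a repeated pair.
keep <- !duplicated(sorted_pairs)

# Keep only the first row seen for each unordered (a, b) pair.
result <- df[keep, ]
```

Column c plays no part in the comparison, so rows 4 to 7 are dropped even though their c values differ from the rows they duplicate.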

Remove duplicate column combinations from a dataframe in R

duplicated() has a method for data.frames, which is designed for just this sort of task:

df <- data.frame(a = c(1:4, 1:4), 
b = c(4:1, 4:1),
d = LETTERS[1:8])

df[!duplicated(df[c("a", "b")]),]
# a b d
# 1 1 4 A
# 2 2 3 B
# 3 3 2 C
# 4 4 1 D
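
By default duplicated() flags every occurrence after the first, so the first row of each combination is kept. If the last row should be kept instead, the fromLast argument of duplicated() can be used. A small sketch on the same data (the name last_rows is an assumption made for this example):

```r
df <- data.frame(a = c(1:4, 1:4),
                 b = c(4:1, 4:1),
                 d = LETTERS[1:8])

# fromLast = TRUE scans from the bottom, so the LAST row of each
# (a, b) combination is the one that is not flagged as a duplicate.
last_rows <- df[!duplicated(df[c("a", "b")], fromLast = TRUE), ]
```

Here last_rows contains rows 5 to 8 (d equal to E, F, G, H) rather than rows 1 to 4.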

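Can I remove duplicate combinations across two columns in R?

I would suggest this base R approach. Note that it pastes the two names together and splits the result into single characters, so it is only safe when every firm name is exactly one character long: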

#Data
df <- structure(list(firm1 = c("A", "A", "A", "B", "D", "G"), firm2 = c("B",
"D", "G", "A", "A", "A")), row.names = c(NA, -6L), class = "data.frame")

The code:

df[!duplicated(lapply(strsplit(paste0(df$firm1,df$firm2),split = ''),sort)),]

Output:

  firm1 firm2
1 A B
2 A D
3 A G
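
The character-splitting approach above relies on single-character names. For longer names, one possible alternative is to build a canonical (lo, hi) pair with pmin() and pmax(). The sketch below uses hypothetical data (the data frame firms and the names lo, hi and deduped are not from the original question):

```r
# Hypothetical data with multi-character firm names
firms <- data.frame(firm1 = c("AB", "C", "A", "BC"),
                    firm2 = c("C", "AB", "BC", "A"),
                    stringsAsFactors = FALSE)

# pmin()/pmax() return the alphabetically first and last name in each row,
# so every unordered pair gets a single canonical representation.
lo <- pmin(firms$firm1, firms$firm2)
hi <- pmax(firms$firm1, firms$firm2)

# Drop rows whose canonical pair has already been seen.
deduped <- firms[!duplicated(data.frame(lo, hi)), ]
```

With this data, rows 1 and 3 are kept. Splitting the pasted names into characters would instead treat all four rows as the same pair, because each one reduces to the letters A, B and C.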

Removing duplicate all-way-combinations while retaining all columns

Here's a base R solution that uses complete.cases() and builds a sorted key column, feedID3, from feedID and feedID2:

# remove any rows with NA values
test <- test[complete.cases(test[,c('ID', 'feedID','feedID2')]),]
#remove any rows with feedID == feedID2
test <- test[!(test$feedID == test$feedID2),]
# add new feedID3 column
test$feedID3 <- apply(test, 1, function(x) paste(sort(c(x[2], x[3])), collapse = '-'))
# remove any duplicates, and remove last column
test[!duplicated(test[,c('feedID3', 'ID')]), -4]

ID feedID feedID2
2 49V A1 G2
6 52V B1 D1
7 52V D1 D2

data

Note that the string "NA" has been converted to a real NA, and that stringsAsFactors = FALSE is set:

test <- data.frame(ID= c("49V", "49V","49V", "49V", "49V", "52V", "52V", "52V"),
feedID = c("A1", "A1", "G2", "A1", "G2", "B1", "D1", "D2" ),
feedID2 = c("A1", "G2", "A1", "G2", NA, "D1", "D2", NA ),
stringsAsFactors = FALSE)

Remove duplicates based on two columns, keep one with a larger value on third column while keeping all columns intact

You can group by x2 and x3 and use slice() to keep the row with the largest x4 in each group:

library(dplyr)

df %>%
group_by(x2, x3) %>%
slice(which.max(x4))

# A tibble: 3 x 4
# Groups: x2, x3 [3]
x1 x2 x3 x4
<chr> <chr> <chr> <int>
1 X A B 4
2 Z A C 1
3 X C B 5
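
If dplyr is not available, much the same idea can be sketched in base R: sort so that the largest x4 comes first within each (x2, x3) group, then keep the first row per group. The question's df is not shown here, so the data below is hypothetical, chosen only to be consistent with the output above; the names ord and best are also assumptions.

```r
# Hypothetical input data
df <- data.frame(
  x1 = c("X", "Y", "Z", "X", "Y"),
  x2 = c("A", "A", "A", "C", "C"),
  x3 = c("B", "B", "C", "B", "B"),
  x4 = c(4L, 2L, 1L, 5L, 3L),
  stringsAsFactors = FALSE
)

# Order by group, with the largest x4 first inside each group
ord <- df[order(df$x2, df$x3, -df$x4), ]

# The first row per (x2, x3) group is now the one with the largest x4
best <- ord[!duplicated(ord[c("x2", "x3")]), ]
```

All columns are retained, since only the row selection depends on x2, x3 and x4.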

Remove duplicate combinations from cross join result in R

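After the cross join, filter with a strict inequality so that only one ordering of each pair is kept (this also drops rows where both values are equal). The by = character() form shown here is deprecated as of dplyr 1.1.0 in favor of cross_join().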

df <- source_df %>% 
full_join(source_df, by = character()) %>%
filter(TableFrom < TableTo)
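
The same filter can be tried without dplyr. The sketch below is base R and rests on assumptions: source_df is hypothetical (a single Table column), and the names all_pairs and unique_pairs are introduced for illustration. merge() with by = NULL returns the Cartesian product of its two inputs.

```r
# Hypothetical input: one column of table names
source_df <- data.frame(Table = c("A", "B", "C"), stringsAsFactors = FALSE)

# Cross join: every TableFrom paired with every TableTo (3 x 3 = 9 rows)
all_pairs <- merge(
  data.frame(TableFrom = source_df$Table, stringsAsFactors = FALSE),
  data.frame(TableTo = source_df$Table, stringsAsFactors = FALSE),
  by = NULL
)

# Strict inequality keeps A-B but drops B-A and the self-pair A-A
unique_pairs <- all_pairs[all_pairs$TableFrom < all_pairs$TableTo, ]
```

For n distinct names this leaves n * (n - 1) / 2 rows, here 3 of the original 9.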

R: Remove duplicates from a dataframe based on categories in a column

Here is a snippet that does what you asked:

df$Category <- factor(df$Category, levels = c("PT", "DI", "GT", "SY"))

df <- df[order(df$Category),]

df[!duplicated(df[,c('Name', 'Course')]),]

Output:

Name Course Category
Jason ML PT
Nancy ML PT
Jason DS DI
Nancy DS DI
John DS GT
James ML SY

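The idea is to sort the rows by priority, which is encoded in the order of the factor levels (PT first, SY last). duplicated() then flags every row after the first for each Name/Course pair, so the row that survives is the one with the highest-priority Category.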
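
The input data for this answer is not shown, so here is a small self-contained sketch with hypothetical rows that illustrates the sort-then-deduplicate pattern (the name result is an assumption):

```r
# Hypothetical input: each Name/Course pair appears under two categories
df <- data.frame(
  Name     = c("Jason", "Jason", "Nancy", "Nancy"),
  Course   = c("ML", "ML", "DS", "DS"),
  Category = c("GT", "PT", "SY", "DI"),
  stringsAsFactors = FALSE
)

# The level order defines the priority: PT beats DI beats GT beats SY
df$Category <- factor(df$Category, levels = c("PT", "DI", "GT", "SY"))

# Sorting a factor follows its level order, not alphabetical order
df <- df[order(df$Category), ]

# Keep the first (highest-priority) row per Name/Course pair
result <- df[!duplicated(df[, c("Name", "Course")]), ]
```

In this example Jason/ML keeps its PT row and Nancy/DS keeps its DI row.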

Remove duplicate rows ignoring column order in R?

df <- data.frame(
var1 = c("a", "b", "a", "c", "b", "c"),
var2 = c("b", "a", "c", "a", "c", "b"),
value = c(0.576, 0.576, 0.987, 0.987, 0.034, 0.034)
)

A one-liner base-r solution:

df_unique <- df[!duplicated(apply(df[,1:2], 1, function(row) paste(sort(row), collapse=""))),]

df_unique
var1 var2 value
1 a b 0.576
3 a c 0.987
5 b c 0.034

What it does: work across the first 2 columns row-wise (apply with MARGIN = 1), sort (alphabetically) the content, paste into a single string, remove all indices where the string has already occurred before (!duplicated).

Another (probably better) approach, stepping back, is to take your original matrix and blank out the lower triangle and the diagonal using lower.tri. This way each combination appears only once, and only the upper half holds non-NA values:

mat <- matrix(c(0, 0.576, 0.987, 0.576, 0, 0.034, 0.987, 0.034, 0), 
nrow=3, dimnames=list(letters[1:3], letters[1:3]))

mat[lower.tri(mat, diag = TRUE)] <- NA
mat
a b c
a NA 0.576 0.987
b NA NA 0.034
c NA NA NA
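
If a long data frame is needed afterwards, the upper triangle can be read back out with which(..., arr.ind = TRUE). This is a sketch of one way to do it, not part of the original answer; the names idx and long are assumptions.

```r
mat <- matrix(c(0, 0.576, 0.987, 0.576, 0, 0.034, 0.987, 0.034, 0),
              nrow = 3, dimnames = list(letters[1:3], letters[1:3]))

# Row and column positions of the cells strictly above the diagonal
idx <- which(upper.tri(mat), arr.ind = TRUE)

# One row per unordered pair, with the pair's value
long <- data.frame(var1  = rownames(mat)[idx[, 1]],
                   var2  = colnames(mat)[idx[, 2]],
                   value = mat[idx],
                   stringsAsFactors = FALSE)
```

For this matrix, long holds the pairs a-b, a-c and b-c with values 0.576, 0.987 and 0.034.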

Remove duplicates across columns

We can sort the elements in each row with apply, transpose the output, apply duplicated to get a logical vector, and use that vector to subset the rows:

df[!duplicated(t(apply(df[, 1:2], 1, sort))),]
# [,1] [,2]
#[1,] "a" "b"
#[2,] "a" "c"
#[3,] "a" "d"
#[4,] "b" "c"
#[5,] "b" "d"
#[6,] "c" "d"

Another option is pmin/pmax, which builds the sorted pair directly without apply:

df[!duplicated(cbind(pmin(df[,1], df[,2]), pmax(df[,1], df[,2]))),]

data

df <- structure(c("a", "a", "a", "b", "b", "b", "c", "c", "c", "b", 
"c", "d", "a", "c", "d", "a", "b", "d"), .Dim = c(9L, 2L))

