Remove duplicate combinations in R
df[!duplicated(t(apply(df[c("a", "b")], 1, sort))), ]
a b c
1 1 4 A
2 2 3 B
3 1 5 C
Where:
df <- data.frame(
a = c(1L, 2L, 1L, 4L, 5L, 3L, 3L),
b = c(4L, 3L, 5L, 1L, 1L, 2L, 2L),
c = c("A", "B", "C", "A", "C", "B", "E")
)
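To see why the one-liner works, it can help to break it into steps. The sketch below reuses the df defined above; the intermediate names sorted_pairs, keep and result are only illustrative.

```r
df <- data.frame(
  a = c(1L, 2L, 1L, 4L, 5L, 3L, 3L),
  b = c(4L, 3L, 5L, 1L, 1L, 2L, 2L),
  c = c("A", "B", "C", "A", "C", "B", "E")
)

# Sort each (a, b) pair so that (4, 1) and (1, 4) become the same row.
# apply() returns one column per input row, hence the transpose.
sorted_pairs <- t(apply(df[c("a", "b")], 1, sort))

# duplicated() on a matrix compares whole rows
keep <- !duplicated(sorted_pairs)

# Subset the original data frame, so column c is retained
result <- df[keep, ]
```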
Remove duplicate column combinations from a dataframe in R
duplicated() has a method for data.frames, which is designed for just this sort of task:
df <- data.frame(a = c(1:4, 1:4),
b = c(4:1, 4:1),
d = LETTERS[1:8])
df[!duplicated(df[c("a", "b")]),]
# a b d
# 1 1 4 A
# 2 2 3 B
# 3 3 2 C
# 4 4 1 D
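One caveat worth spelling out: the data.frame method compares rows column by column, so it treats (1, 4) and (4, 1) as different combinations. A small sketch with made-up data (df2 is hypothetical, not from the question) also shows the fromLast argument, which keeps the last occurrence instead of the first:

```r
# Hypothetical data: rows 1 and 2 are the same pair in reverse order,
# row 3 is an exact repeat of row 1
df2 <- data.frame(a = c(1, 4, 1),
                  b = c(4, 1, 4),
                  d = c("x", "y", "z"))

# Position-wise comparison: only the exact repeat (row 3) is flagged
exact_dup <- duplicated(df2[c("a", "b")])

# Keep the last occurrence of each combination instead of the first
kept_last <- df2[!duplicated(df2[c("a", "b")], fromLast = TRUE), ]
```

If reversed pairs should count as duplicates too, the values need to be put into a common order first, as the other answers on this page do.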
Can I remove duplicate combinations across two columns in R
I would suggest this base R approach:
#Data
df <- structure(list(firm1 = c("A", "A", "A", "B", "D", "G"), firm2 = c("B",
"D", "G", "A", "A", "A")), row.names = c(NA, -6L), class = "data.frame")
The code:
df[!duplicated(lapply(strsplit(paste0(df$firm1,df$firm2),split = ''),sort)),]
Output:
firm1 firm2
1 A B
2 A D
3 A G
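Note that splitting the pasted string into single characters only works when every firm name is one character long. For longer names, a sketch using pmin()/pmax() avoids that assumption (the data frame firms and its values are hypothetical):

```r
# Hypothetical data with multi-character firm names
firms <- data.frame(firm1 = c("Acme", "Acme", "Bolt", "Core"),
                    firm2 = c("Bolt", "Core", "Acme", "Acme"),
                    stringsAsFactors = FALSE)

# pmin()/pmax() put each pair in alphabetical order, element by element
lo <- pmin(firms$firm1, firms$firm2)
hi <- pmax(firms$firm1, firms$firm2)

# Rows 3 and 4 are reversed copies of rows 1 and 2, so they are dropped
firms_unique <- firms[!duplicated(data.frame(lo, hi)), ]
```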
Removing duplicate all-way-combinations while retaining all columns
Here's a base solution, using the complete.cases function, and also creating a sorted feedID column:
# remove any rows with NA values
test <- test[complete.cases(test[,c('ID', 'feedID','feedID2')]),]
#remove any rows with feedID == feedID2
test <- test[!(test$feedID == test$feedID2),]
# add new feedID3 column
test$feedID3 <- apply(test, 1, function(x) paste(sort(c(x[2], x[3])), collapse = '-'))
# remove any duplicates, and remove last column
test[!duplicated(test[,c('feedID3', 'ID')]), -4]
ID feedID feedID2
2 49V A1 G2
6 52V B1 D1
7 52V D1 D2
data
Note that we have converted "NA" to NA, and we have also set stringsAsFactors = FALSE:
test <- data.frame(ID= c("49V", "49V","49V", "49V", "49V", "52V", "52V", "52V"),
feedID = c("A1", "A1", "G2", "A1", "G2", "B1", "D1", "D2" ),
feedID2 = c("A1", "G2", "A1", "G2", NA, "D1", "D2", NA ),
stringsAsFactors = FALSE)
Remove duplicates based on two columns, keep one with a larger value on third column while keeping all columns intact
You can group by x2 and x3 and use slice(), i.e.
library(dplyr)
df %>%
group_by(x2, x3) %>%
slice(which.max(x4))
# A tibble: 3 x 4
# Groups: x2, x3 [3]
x1 x2 x3 x4
<chr> <chr> <chr> <int>
1 X A B 4
2 Z A C 1
3 X C B 5
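For readers without dplyr, a rough base R equivalent is to sort by x4 in decreasing order and keep the first row of each (x2, x3) group. The input below is hypothetical (the question's data frame is not shown here); only the column names follow the answer above.

```r
# Hypothetical input; column names follow the answer above
scores <- data.frame(x1 = c("X", "Y", "Z", "X", "Y"),
                     x2 = c("A", "A", "A", "C", "C"),
                     x3 = c("B", "B", "C", "B", "B"),
                     x4 = c(4L, 2L, 1L, 5L, 3L),
                     stringsAsFactors = FALSE)

# Largest x4 first, so the first row seen per group is the maximum
ordered <- scores[order(scores$x4, decreasing = TRUE), ]
top <- ordered[!duplicated(ordered[c("x2", "x3")]), ]

# Restore the grouping order for readability
top <- top[order(top$x2, top$x3), ]
```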
Remove duplicate combinations from cross join result in R
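After the cross join (full_join() with by = character()), filter with a strict inequality so that only one ordering of each pair is kept; this also drops pairs of a table with itself: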
df <- source_df %>%
full_join(source_df, by = character()) %>%
filter(TableFrom < TableTo)
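The same idea can be sketched in base R. The table names and the single Table column below are assumptions for illustration, since the question's source_df is not shown; merge() with by = NULL produces the Cartesian product.

```r
# Hypothetical source with a single column of table names
source_df <- data.frame(Table = c("orders", "customers", "items"),
                        stringsAsFactors = FALSE)

from <- setNames(source_df, "TableFrom")
to   <- setNames(source_df, "TableTo")

# by = NULL gives every combination of rows (3 x 3 = 9 here)
all_pairs <- merge(from, to, by = NULL)

# Strict inequality keeps one ordering per pair and drops self-pairs
unique_pairs <- all_pairs[all_pairs$TableFrom < all_pairs$TableTo, ]
```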
R: Remove duplicates from a dataframe based on categories in a column
Here is a snippet that does what you asked:
df$Category <- factor(df$Category, levels = c("PT", "DI", "GT", "SY"))
df <- df[order(df$Category),]
df[!duplicated(df[,c('Name', 'Course')]),]
Output:
Name Course Category
Jason ML PT
Nancy ML PT
Jason DS DI
Nancy DS DI
John DS GT
James ML SY
The idea is to sort the rows by category priority first. duplicated() then flags every occurrence of a Name/Course pair after the first one, so dropping the flagged rows keeps the highest-priority row for each pair, which is what we want.
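A self-contained sketch of that idea, using made-up rows (the question's input data is not shown here, so courses and its values are hypothetical):

```r
# Hypothetical input: Jason/ML appears under two categories
courses <- data.frame(Name     = c("Jason", "Jason", "Nancy", "John"),
                      Course   = c("ML", "ML", "DS", "DS"),
                      Category = c("SY", "PT", "GT", "DI"),
                      stringsAsFactors = FALSE)

# Factor levels encode the priority: PT first, SY last
courses$Category <- factor(courses$Category,
                           levels = c("PT", "DI", "GT", "SY"))

# order() on a factor sorts by level position, not alphabetically
courses <- courses[order(courses$Category), ]

# The first row per Name/Course pair is now the highest-priority one
best <- courses[!duplicated(courses[c("Name", "Course")]), ]
```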
Remove duplicate rows ignoring column order? in R
df <- data.frame(
var1 = c("a", "b", "a", "c", "b", "c"),
var2 = c("b", "a", "c", "a", "c", "b"),
value = c(0.576, 0.576, 0.987, 0.987, 0.034, 0.034)
)
A one-liner base R solution:
df_unique <- df[!duplicated(apply(df[,1:2], 1, function(row) paste(sort(row), collapse=""))),]
df_unique
var1 var2 value
1 a b 0.576
3 a c 0.987
5 b c 0.034
What it does: work across the first 2 columns row-wise (apply with MARGIN = 1), sort (alphabetically) the content, paste into a single string, and remove all rows where the string has already occurred before (!duplicated).
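One thing to watch for: with collapse = "" two different pairs can produce the same key when the values have different lengths. A sketch with hypothetical values, using a separator that is assumed not to occur in the data (make_key is an illustrative helper, not part of the answer above):

```r
# Hypothetical pairs that collide when pasted without a separator
pairs_df <- data.frame(var1 = c("ab", "a"),
                       var2 = c("c",  "bc"),
                       stringsAsFactors = FALSE)

# Build one key per row from the sorted values
make_key <- function(d, sep) {
  apply(d, 1, function(row) paste(sort(row), collapse = sep))
}

key_plain <- make_key(pairs_df, "")   # both rows become "abc"
key_safe  <- make_key(pairs_df, "|")  # "|" is assumed absent from the data

any(duplicated(key_plain))  # TRUE: a false duplicate
any(duplicated(key_safe))   # FALSE
```

For single-letter values such as those in the example data, the original one-liner is unaffected.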
Another (probably better) approach, stepping back, is to take your original matrix and blank out the bottom half (and the diagonal) using lower.tri. This way only half of the combinations will have non-NA values:
mat <- matrix(c(0, 0.576, 0.987, 0.576, 0, 0.034, 0.987, 0.034, 0),
nrow=3, dimnames=list(letters[1:3], letters[1:3]))
mat[lower.tri(mat, diag = TRUE)] <- NA
mat
a b c
a NA 0.576 0.987
b NA NA 0.034
c NA NA NA
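If a long data frame is needed afterwards, the remaining non-NA cells can be read back out. A sketch, reusing mat from above (the names idx and long are illustrative):

```r
mat <- matrix(c(0, 0.576, 0.987, 0.576, 0, 0.034, 0.987, 0.034, 0),
              nrow = 3, dimnames = list(letters[1:3], letters[1:3]))
mat[lower.tri(mat, diag = TRUE)] <- NA

# Row/column positions of the cells that survived
idx <- which(!is.na(mat), arr.ind = TRUE)

# One row per remaining combination
long <- data.frame(var1  = rownames(mat)[idx[, "row"]],
                   var2  = colnames(mat)[idx[, "col"]],
                   value = mat[idx])
```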
Remove duplicates across columns
We can sort the elements in each row with apply, transpose the output (t), apply duplicated to return a logical vector, and use that for subsetting the rows:
df[!duplicated(t(apply(df[, 1:2], 1, sort))),]
# [,1] [,2]
#[1,] "a" "b"
#[2,] "a" "c"
#[3,] "a" "d"
#[4,] "b" "c"
#[5,] "b" "d"
#[6,] "c" "d"
Another option is pmin/pmax, which orders each pair element-wise without a row-wise apply:
df[!duplicated(cbind(pmin(df[,1], df[,2]), pmax(df[,1], df[,2]))),]
data
df <- structure(c("a", "a", "a", "b", "b", "b", "c", "c", "c", "b",
"c", "d", "a", "c", "d", "a", "b", "d"), .Dim = c(9L, 2L))