R Equivalent of Select Distinct on Two or More Fields/Variables

unique() works on data frames, so unique(df[c("var1", "var2")]) should be what you want.
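
As a minimal base R sketch of the idea (the data frame and the column names var1, var2 and other are hypothetical placeholders, not taken from the question), unique() on a column subset returns one row per combination, and duplicated() selects the same rows while keeping every column:

```r
# Hypothetical data; var1, var2 and other are placeholder column names.
df <- data.frame(var1 = c("a", "a", "b", "b", "a"),
                 var2 = c(1, 1, 2, 3, 1),
                 other = 1:5)

# One row per distinct (var1, var2) combination.
distinct_pairs <- unique(df[c("var1", "var2")])

# Same rows, selected with duplicated(), which keeps all columns.
first_rows <- df[!duplicated(df[c("var1", "var2")]), ]

# The number of distinct combinations is the number of rows.
n_pairs <- nrow(distinct_pairs)
```

Both results keep the original row names, so it is easy to see which row represents each combination.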

Another option is distinct() from the dplyr package:

df %>% distinct(var1, var2) # or distinct(df, var1, var2)

Note:

For older versions of dplyr (< 0.5.0, 2016-06-24), distinct() required an additional step:

df %>% select(var1, var2) %>% distinct

(or, written the older way without the pipe, distinct(select(df, var1, var2))).

Subset with unique cases, based on multiple columns

You can use the duplicated() function to find the unique combinations:

> df[!duplicated(df[1:3]),]
  v1 v2 v3  v4 v5
1  7  1  A 100 98
2  7  2  A  98 97
3  8  1  C  NA 80
6  9  3  C  75 75

To get only the duplicates, including the first occurrence of each, check in both directions:

> df[duplicated(df[1:3]) | duplicated(df[1:3], fromLast=TRUE),]
  v1 v2 v3 v4 v5
3  8  1  C NA 80
4  8  1  C 78 75
5  8  1  C 50 62
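
The input data frame is not shown in the answer. The sketch below uses a data frame reconstructed from the two printed outputs, so treat its contents as an assumption. It shows why the second direction is needed: duplicated() alone never flags the first occurrence of a combination.

```r
# Reconstructed from the printed outputs; the real input may have differed.
df <- data.frame(v1 = c(7, 7, 8, 8, 8, 9),
                 v2 = c(1, 2, 1, 1, 1, 3),
                 v3 = c("A", "A", "C", "C", "C", "C"),
                 v4 = c(100, 98, NA, 78, 50, 75),
                 v5 = c(98, 97, 80, 75, 62, 75))

key <- df[1:3]  # the columns that define a "case"

# First occurrence of every combination (rows 1, 2, 3, 6).
first_of_each <- df[!duplicated(key), ]

# Forward scan flags rows 4 and 5; backward scan flags rows 3 and 4.
# Their union is every row that belongs to a repeated combination.
all_dups <- df[duplicated(key) | duplicated(key, fromLast = TRUE), ]
```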

unique() for more than one variable

How about using unique() itself?

df <- data.frame(yad = c("BARBIE", "BARBIE", "BAKUGAN", "BAKUGAN"),
                 per = c("AYLIK", "AYLIK", "2 AYLIK", "2 AYLIK"),
                 hmm = 1:4)

df
#       yad     per hmm
# 1  BARBIE   AYLIK   1
# 2  BARBIE   AYLIK   2
# 3 BAKUGAN 2 AYLIK   3
# 4 BAKUGAN 2 AYLIK   4

unique(df[c("yad", "per")])
#       yad     per
# 1  BARBIE   AYLIK
# 3 BAKUGAN 2 AYLIK

R - Count unique/distinct values in two columns together per group

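You can select the two columns from cur_data() and unlist them to get a single vector per group, then use n_distinct() to count the number of unique values. (In dplyr 1.1.0 and later, cur_data() is deprecated in favour of pick().)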

library(dplyr)

df %>%
  group_by(ID) %>%
  mutate(Count = n_distinct(unlist(select(cur_data(),
                 Party, Party2013)), na.rm = TRUE)) %>%
  ungroup

#     ID  Wave Party Party2013 Count
#  <int> <int> <chr> <chr>     <int>
#1     1     1 A     A             2
#2     1     2 A     NA            2
#3     1     3 B     NA            2
#4     1     4 B     NA            2
#5     2     1 A     C             3
#6     2     2 B     NA            3
#7     2     3 B     NA            3
#8     2     4 B     NA            3

data

It is easier to help if you provide data in a reproducible format

df <- structure(list(ID = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L), Wave = c(1L, 
2L, 3L, 4L, 1L, 2L, 3L, 4L), Party = c("A", "A", "B", "B", "A",
"B", "B", "B"), Party2013 = c("A", NA, NA, NA, "C", NA, NA, NA
)), class = "data.frame", row.names = c(NA, -8L))
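
For a cross-check without dplyr, the same counts can be sketched in base R. This is an alternative to the answer's approach, not part of it, and the helper name count_parties is made up for illustration. It splits the data by ID, pools both party columns, drops NA and counts the unique values.

```r
df <- structure(list(ID = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L),
                     Wave = c(1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L),
                     Party = c("A", "A", "B", "B", "A", "B", "B", "B"),
                     Party2013 = c("A", NA, NA, NA, "C", NA, NA, NA)),
                class = "data.frame", row.names = c(NA, -8L))

# Hypothetical helper: distinct non-NA values across both party columns.
count_parties <- function(g) {
  pooled <- unlist(g[c("Party", "Party2013")], use.names = FALSE)
  length(unique(pooled[!is.na(pooled)]))
}

# One count per ID, then mapped back onto every row of that ID.
counts <- sapply(split(df, df$ID), count_parties)
df$Count <- unname(counts[as.character(df$ID)])
```

ID 1 pools to A and B (2 values) and ID 2 to A, B and C (3 values), matching the Count column above.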

filter distinct value based on two columns with inverse string values in `r`

We can split 'City.Pair' on '-', sort the two elements of each pair, paste them back together to give a vector of order-independent keys, flag the first occurrence of each key with !duplicated() ('i1'), and use that logical vector to subset the rows of 'data2'.

i1 <- !duplicated(apply(sapply(strsplit(as.character(data2$City.Pair), "-"),
                               sort), 2, paste, collapse = "-"))
data2[i1,]
#  City.Pair Origin.City Destination.City Total.Passengers Total.Revenue
#1   LIS-BRU      LISBON         BRUSSELS              100        100.66
#2   LIS-LHR      LISBON           LONDON             5000       5000.25
#3   LAD-LIS      LUANDA           LISBON              200        200.75
#5   FAO-MAN        FARO       MANCHESTER             4000        4000.1
#7   LIS-ORY      LISBON            PARIS             4000       4000.05

Or use separate() with pmin()/pmax(), pasting the two together so that both cities form the key:

library(dplyr)
library(tidyr)
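separate(data2, City.Pair, into = c("City", "City2"), remove = FALSE) %>%
  # paste() builds one key per row; passing pmax() as a second argument to
  # duplicated() would be read as 'incomparables', not as part of the key
  filter(!duplicated(paste(pmin(City, City2), pmax(City, City2)))) %>%
  select(-City, -City2)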
#  City.Pair Origin.City Destination.City Total.Passengers Total.Revenue
#1   LIS-BRU      LISBON         BRUSSELS              100        100.66
#2   LIS-LHR      LISBON           LONDON             5000       5000.25
#3   LAD-LIS      LUANDA           LISBON              200        200.75
#4   FAO-MAN        FARO       MANCHESTER             4000        4000.1
#5   LIS-ORY      LISBON            PARIS             4000       4000.05
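
Since data2 itself is not shown, here is a small base R sketch on a made-up data frame (routes and its values are hypothetical). It builds the order-independent key both ways, by sorting each split pair and by pasting pmin()/pmax(), and keeps the first row for each key.

```r
# Hypothetical routes; "BRU-LIS" is the reverse of "LIS-BRU", and so on.
routes <- data.frame(City.Pair = c("LIS-BRU", "BRU-LIS", "LIS-LHR",
                                   "LHR-LIS", "FAO-MAN"),
                     stringsAsFactors = FALSE)

parts <- strsplit(routes$City.Pair, "-", fixed = TRUE)

# Key 1: sort the two codes within each pair, then paste them back.
key_sorted <- vapply(parts, function(p) paste(sort(p), collapse = "-"),
                     character(1))

# Key 2: the element-wise smaller code first, the larger second.
first  <- vapply(parts, `[`, character(1), 1)
second <- vapply(parts, `[`, character(1), 2)
key_minmax <- paste(pmin(first, second), pmax(first, second), sep = "-")

# Keep the first row seen for each unordered pair.
deduped <- routes[!duplicated(key_sorted), , drop = FALSE]
```

Both keys agree here, so either can be passed to duplicated(); the important part is that the key contains both cities.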

Select groups with more than one distinct value

There are several possibilities; here's my favorite:

library(data.table)
setDT(df)[, if(+var(number)) .SD, by = from]
#    from number
# 1:    2      1
# 2:    2      2

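Basically, for each group we check whether there is any variance; if there is, we return the group's rows. Note that var() returns NA for a group with a single row, in which case if() raises an error, so this form assumes every group has at least two rows.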


With base R, I would go with

df[as.logical(with(df, ave(number, from, FUN = var))), ]
#   from number
# 3    2      1
# 4    2      2

Edit: for non-numeric data you could try the uniqueN() function, which was new in the development version of data.table when this was written (or use length(unique(number)) > 1 instead):

setDT(df)[, if(uniqueN(number) > 1) .SD, by = from]
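
The length(unique(...)) variant can also be sketched in base R. The data frame below is an assumption (the question's input is not shown; only rows 3 and 4 are known from the output above), with an extra single-row group added to show that this form does not produce NA the way var() does.

```r
# Assumed input: group 1 is constant, group 2 varies, group 3 has one row.
df <- data.frame(from = c(1, 1, 2, 2, 3),
                 number = c(1, 1, 1, 2, 5))

# For each row, the number of distinct 'number' values in its group.
# ave() runs over row indices, so 'number' may be of any type.
n_per_group <- ave(seq_len(nrow(df)), df$from,
                   FUN = function(i) length(unique(df$number[i])))

# Keep only rows whose group has more than one distinct value.
varying <- df[n_per_group > 1, ]
```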

