Filter Based on Number of Distinct Values Per Group

We can group by 'names' and keep the groups where the number of distinct 'sex' values is greater than 1:

library(dplyr)
df %>%
  group_by(names) %>%
  filter(n_distinct(sex) > 1)

Another option is to group by 'names' and keep only the groups containing both 'M' and 'F':

df %>%
  group_by(names) %>%
  filter(all(c("M", "F") %in% sex))
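As a quick check, both filters can be run on a small hypothetical `df` (the column names `names` and `sex` are assumed from the question):

```r
library(dplyr)

# hypothetical example data
df <- data.frame(
  names = c("ana", "ana", "ben", "ben", "cal"),
  sex   = c("F", "M", "M", "M", "F")
)

df %>%
  group_by(names) %>%
  filter(n_distinct(sex) > 1)
# keeps only the 'ana' rows, the one name with more than one distinct sex
```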

Filter in SQL on distinct values after grouping

If all you need is the column col1, you can group by col1 and put the condition in the HAVING clause:

SELECT col1
FROM tablename
GROUP BY col1
HAVING COUNT(DISTINCT col2) = 1;

If you want all the rows from the table, use the above query as a subquery with the IN operator:

SELECT *
FROM tablename
WHERE col1 IN (
    SELECT col1
    FROM tablename
    GROUP BY col1
    HAVING COUNT(DISTINCT col2) = 1
);

Select groups based on number of unique / distinct values

You can build a row selector for `sample` with `ave` in many different ways.

sample[ave(sample$Value, sample$Group, FUN = function(x) length(unique(x))) == 1, ]

or (note that this sum-based test can give false positives when deviations cancel out, e.g. for a group with values `c(0, 1, -1)`)

sample[ave(sample$Value, sample$Group, FUN = function(x) sum(x - x[1])) == 0, ]

or

sample[ave(sample$Value, sample$Group, FUN = function(x) diff(range(x))) == 0, ]
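To illustrate, here is a hypothetical `sample` data frame; each selector keeps only the group whose values are constant:

```r
# hypothetical example data
sample <- data.frame(
  Group = c("A", "A", "B", "B"),
  Value = c(1, 1, 2, 3)
)

# ave() replicates the per-group statistic across each group's rows,
# so the comparison yields one logical per row
sample[ave(sample$Value, sample$Group, FUN = function(x) length(unique(x))) == 1, ]
# returns only the 'A' rows, whose Value is constant within the group
```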

Filter list of distinct values from one column of grouped data in the same order as it shows

I couldn't recreate your data exactly, so my output differs, but here are a few quick methods that should achieve your desired outcome.

A few edits to your original code give us a data frame with the proper cities in the proper order:

library(dplyr)

set.seed(42)

id <- seq_len(10)
city <- sample(c('Miami', 'Seattle', 'Houston', 'Toronto', 'Tokyo', 'Mumbai', 'Austin'), 10, replace = TRUE)
state <- sample(c('ON', 'WA', 'TX', 'MA'), 10, replace = TRUE)
rent <- sample(800:1900, 10)

data <- data.frame(id, city, state, rent)

data %>%
  group_by(id, city, state) %>%
  summarise(total_rent = sum(rent)) %>%
  group_by(city) %>%
  slice_max(total_rent, n = 1) %>%
  arrange(desc(total_rent)) %>%
  ungroup()
#> # A tibble: 5 x 4
#>      id city    state total_rent
#>   <int> <chr>   <chr>      <int>
#> 1     1 Miami   MA          1698
#> 2     6 Toronto WA          1659
#> 3     5 Seattle ON          1420
#> 4     2 Tokyo   TX          1400
#> 5    10 Austin  TX          1098

For just the values, the pull() / unique() combo is quite nice:

data %>%
  group_by(id, city, state) %>%
  summarise(total_rent = sum(rent)) %>%
  arrange(desc(total_rent)) %>%
  pull(city) %>%
  unique()
#> [1] "Miami"   "Toronto" "Seattle" "Tokyo"   "Austin"

Another possible solution is to turn the cities into a factor, in order, after you've arranged them. This is achieved with library(forcats):

library(forcats)
library(magrittr)

data %>%
  group_by(id, city, state) %>%
  summarise(total_rent = sum(rent)) %>%
  arrange(desc(total_rent)) %>%
  ungroup() %>%
  mutate(city = fct_inorder(city)) %$%
  levels(city)
#> [1] "Miami"   "Toronto" "Seattle" "Tokyo"   "Austin"

Created on 2021-03-04 by the reprex package (v0.3.0)

How to chain group_by, filter, distinct, count in data.table?

The distinct in dplyr corresponds to unique in data.table with the by option:

unique(setDT(test_df)[!is.na(date)], by = c("id", "date"))[, .N, by = id][N > 1]
#      id N
# 1: 5678 2

The steps are as follows:

  1. Convert to data.table (setDT)
  2. Remove the rows with NA in 'date' (!is.na(date))
  3. Get the unique rows by the 'id' and 'date' columns
  4. Group by 'id' to get the count (.N)
  5. Finally, filter the rows where the count is greater than 1
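For comparison, a rough dplyr equivalent of the same chain (assuming the same hypothetical `test_df` with 'id' and 'date' columns) might look like:

```r
library(dplyr)

test_df %>%
  filter(!is.na(date)) %>%   # drop rows with NA dates
  distinct(id, date) %>%     # unique id/date combinations
  count(id) %>%              # number of distinct dates per id
  filter(n > 1)              # keep ids appearing more than once
```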

R dplyr - Filter unique row in each group with dplyr

dat %>%
  mutate(rn = row_number()) %>%
  arrange(flag) %>%
  group_by(id, col2, col3) %>%
  slice(1) %>%
  ungroup() %>%
  arrange(rn) %>%
  select(-rn)
# # A tibble: 4 x 5
#      id col2  col3   flag   val
#   <int> <chr> <chr> <int> <int>
# 1     1 a     q        NA    NA
# 2     1 a     w         1    NA
# 3     1 b     r        NA    NA
# 4     2 c     q         1     5

If your data instead contains strings with empty strings (it's not clear in the question), then:

dat %>%
  # this is just to transform the number-based 'flag'/'val' to strings; you don't need this
  mutate(across(c(flag, val), ~ if_else(is.na(.), "", as.character(.)))) %>%
  # pick up here
  mutate(rn = row_number()) %>%
  arrange(!nzchar(flag)) %>% # this is the only difference from above
  group_by(id, col2, col3) %>%
  slice(1) %>%
  ungroup() %>%
  arrange(rn) %>%
  select(-rn)
# # A tibble: 4 x 5
#      id col2  col3  flag  val
#   <int> <chr> <chr> <chr> <chr>
# 1     1 a     q     ""    ""
# 2     1 a     w     "1"   ""
# 3     1 b     r     ""    ""
# 4     2 c     q     "1"   "5"

The use of rn is merely to ensure that the original row order is preserved across the filtering. If order is not an issue (perhaps it's inferred some other way), then you can remove the first mutate as well as the trailing arrange(rn) %>% select(-rn).


Data

dat <- structure(list(
  id = c(1L, 1L, 1L, 2L, 2L, 2L),
  col2 = c("a", "a", "b", "c", "c", "c"),
  col3 = c("q", "w", "r", "q", "q", "q"),
  flag = c(NA, 1L, NA, 1L, NA, 1L),
  val = c(NA, NA, NA, 5L, NA, 6L)
), class = "data.frame", row.names = c(NA, -6L))

Filter column by count of distinct values

You can add another column in the summarise to count the number of records per group and then filter based on it:

my_tibble %>%
  group_by(A) %>%
  summarise(percentage = mean(B), n = n()) %>%
  filter(percentage > 0, n > 1)

# A tibble: 2 x 3
#   A     percentage     n
#   <chr>      <dbl> <int>
# 1 a           0.75     4
# 2 b           0.50     2

SQL Filter rows based on multiple distinct values of a column

You shouldn't GROUP BY the description if you are doing a COUNT(DISTINCT ...) on it (then it will always be just 1). Try something like this:

SELECT P2.PLU, P2.Description
FROM @YourTable P2
WHERE P2.PLU IN (
    SELECT P.PLU
    FROM @YourTable P
    GROUP BY P.PLU
    HAVING COUNT(DISTINCT P.DESCRIPTION) > 1
);

