R count distinct elements based on two columns by group
Here is a data.table option. The sep = "_" keeps pairs such as (1, 11) and (11, 1) from collapsing into the same key:
data[, N := uniqueN(paste(customer_id, account_id, sep = "_")), by = start_date]
# customer_id account_id start_date N
# 1: 1 11 2017-01-01 4
# 2: 1 11 2017-01-01 4
# 3: 1 11 2017-01-01 4
# 4: 2 11 2017-01-01 4
# 5: 3 55 2017-01-01 4
# 6: 3 88 2017-01-01 4
# 7: 4 22 2018-02-02 3
# 8: 5 38 2018-02-02 3
# 9: 5 38 2018-02-02 3
#10: 6 13 2018-02-02 3
Or
data[, N := uniqueN(.SD, by = c("customer_id", "account_id")), by = start_date]
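As a self-contained sketch of the .SD approach, the block below rebuilds the table from the printed output above (the numeric column types and the Date class are assumptions) and returns one row per date instead of adding a column. It also illustrates why a pasted string key needs a separator.

```r
library(data.table)

# Rebuilt from the printed output; column types are assumed
data <- data.table(
  customer_id = c(1, 1, 1, 2, 3, 3, 4, 5, 5, 6),
  account_id  = c(11, 11, 11, 11, 55, 88, 22, 38, 38, 13),
  start_date  = as.Date(rep(c("2017-01-01", "2018-02-02"), times = c(6, 4)))
)

# Distinct (customer_id, account_id) pairs per start_date, one row per group
res <- data[, .(N = uniqueN(.SD, by = c("customer_id", "account_id"))),
            by = start_date]

# Without a separator, different pairs can produce the same string key
collide <- paste0(1, 11) == paste0(11, 1)                      # both are "111"
safe    <- paste(1, 11, sep = "_") == paste(11, 1, sep = "_")  # "1_11" vs "11_1"
```

Passing the key columns through by = avoids building strings at all, which is usually the safer choice when the columns may contain the separator character.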
R - Count unique/distinct values in two columns together per group
You can subset the data from cur_data() and unlist it to get a vector. Use n_distinct to count the number of unique values. In dplyr 1.1.0 and later, cur_data() is deprecated in favour of pick().
library(dplyr)
df %>%
group_by(ID) %>%
mutate(Count = n_distinct(unlist(select(cur_data(),
Party, Party2013)), na.rm = TRUE)) %>%
ungroup()
# ID Wave Party Party2013 Count
# <int> <int> <chr> <chr> <int>
#1 1 1 A A 2
#2 1 2 A NA 2
#3 1 3 B NA 2
#4 1 4 B NA 2
#5 2 1 A C 3
#6 2 2 B NA 3
#7 2 3 B NA 3
#8 2 4 B NA 3
Data
It is easier to help if you provide data in a reproducible format:
df <- structure(list(ID = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L), Wave = c(1L,
2L, 3L, 4L, 1L, 2L, 3L, 4L), Party = c("A", "A", "B", "B", "A",
"B", "B", "B"), Party2013 = c("A", NA, NA, NA, "C", NA, NA, NA
)), class = "data.frame", row.names = c(NA, -8L))
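When both columns are known by name and share a type, concatenating them with c() may be a shorter alternative to selecting them from cur_data(). This is a minimal sketch on the same data, not the answer's original method; it assumes Party and Party2013 are both character columns.

```r
library(dplyr)

# Same data as above, written out with data.frame() for readability
df <- data.frame(
  ID = rep(1:2, each = 4),
  Wave = rep(1:4, times = 2),
  Party = c("A", "A", "B", "B", "A", "B", "B", "B"),
  Party2013 = c("A", NA, NA, NA, "C", NA, NA, NA)
)

res <- df %>%
  group_by(ID) %>%
  # Pool both columns into one vector per group, then count non-NA distinct values
  mutate(Count = n_distinct(c(Party, Party2013), na.rm = TRUE)) %>%
  ungroup()
```

Here res$Count is 2 for every row of ID 1 and 3 for every row of ID 2, matching the output shown earlier.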
Count unique values over two columns per group
In summarise(), you could use across() to select multiple columns, unlist them into a vector, and count the number of unique values by group.
library(dplyr)
df %>%
group_by(gvkey, Year) %>%
summarise(n_unique = n_distinct(unlist(across(SICS1:SICS2)))) %>%
ungroup()
# # A tibble: 4 × 3
# gvkey Year n_unique
# <int> <int> <int>
# 1 1209 2017 3
# 2 1209 2018 6
# 3 1503 2017 3
# 4 1503 2018 3
Another way is to stack SICS1 and SICS2 together first, and then count the number of unique values.
df %>%
tidyr::pivot_longer(SICS1:SICS2) %>%
group_by(gvkey, Year) %>%
summarise(n_unique = n_distinct(value)) %>%
ungroup()
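The question's data is not reproduced here, so the following sketch uses a small hypothetical data frame (the values are invented for illustration) to check that stacking the columns and pooling them directly give the same counts.

```r
library(dplyr)
library(tidyr)

# Hypothetical data; only the column names come from the question
df <- data.frame(
  gvkey = c(1209, 1209, 1209, 1503, 1503),
  Year  = c(2017, 2017, 2018, 2017, 2017),
  SICS1 = c("A", "B", "A", "C", "C"),
  SICS2 = c("B", "C", "A", "D", "C")
)

# Stack SICS1 and SICS2 into one 'value' column, then count per group
long_counts <- df %>%
  pivot_longer(SICS1:SICS2) %>%
  group_by(gvkey, Year) %>%
  summarise(n_unique = n_distinct(value), .groups = "drop")

# Pool the two columns inside summarise() without reshaping
direct_counts <- df %>%
  group_by(gvkey, Year) %>%
  summarise(n_unique = n_distinct(c(SICS1, SICS2)), .groups = "drop")
```

With this data both results give 3, 1 and 2 unique values for the three (gvkey, Year) groups.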
Group by and count unique values in several columns in R
Here's an approach using dplyr::across, which is a handy way to calculate across multiple columns:
my_data <- data.frame(
city = c(rep("A", 3), rep("B", 3)),
col1 = 1:6,
col2 = 0,
col3 = c(1:3, 4, 4, 4),
col4 = 1:2
)
library(dplyr)
my_data %>%
group_by(city) %>%
summarize(across(col1:col4, n_distinct))
# A tibble: 2 x 5
# city col1 col2 col3 col4
# * <chr> <int> <int> <int> <int>
# 1 A 3 1 3 2
# 2 B 3 1 1 2
Grouping on two columns and counting distinct values using R
You are probably running the plyr package and the dplyr package at the same time. Both contain a function named summarise, and whichever package was attached last masks the other. If it is not plyr, then it is probably another package that contains a function named summarise.
Run ?summarise to inspect the available summarise functions on your system.
Make sure you use summarise() from the dplyr package:
library( dplyr )
d %>%
dplyr::group_by(A,B)%>%
dplyr::summarise(UNIQUE_COUNT = n_distinct(C)) # <-- dplyr
# # A tibble: 4 x 3
# # Groups: A [?]
# A B UNIQUE_COUNT
# <fct> <fct> <int>
# 1 A R1 2
# 2 A R2 1
# 3 B R1 2
# 4 B R2 1
d %>%
dplyr::group_by(A,B)%>%
plyr::summarise(UNIQUE_COUNT = n_distinct(C)) # <-- plyr
# UNIQUE_COUNT
# 1 4
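One way to check which package an unqualified summarise resolves to is to look at the function's environment. The sketch below assumes only dplyr is attached; the data frame d is hypothetical, chosen so that it reproduces the counts printed above.

```r
library(dplyr)

# Hypothetical data that reproduces the counts shown above
d <- data.frame(
  A = c("A", "A", "A", "B", "B", "B"),
  B = c("R1", "R1", "R2", "R1", "R1", "R2"),
  C = c("x", "y", "x", "z", "w", "z")
)

# Name of the package whose summarise() is found first on the search path
owner <- environmentName(environment(summarise))

# Fully qualified calls are unaffected by masking
res <- d %>%
  dplyr::group_by(A, B) %>%
  dplyr::summarise(UNIQUE_COUNT = dplyr::n_distinct(C), .groups = "drop")

# A summarise() that ignores the grouping returns the overall count instead
overall <- dplyr::n_distinct(d$C)
```

If owner is anything other than "dplyr", prefixing the call with dplyr:: as above sidesteps the conflict.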
R group by | count distinct values grouping by another column
One way:
test_df |>
distinct() |>
count(post_pagename)
# post_pagename n
# <fct> <int>
# 1 A 3
# 2 B 2
# 3 C 1
# 4 D 1
Or another:
test_df |>
group_by(post_pagename) |>
summarise(distinct_visit_ids = n_distinct(visit_id))
# A tibble: 4 x 2
# post_pagename distinct_visit_ids
# <fct> <int>
#1 A 3
#2 B 2
#3 C 1
#4 D 1
Note: D has one visit, so it must be counted.
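The two approaches agree only when the data frame holds nothing but post_pagename and visit_id, because distinct() with no arguments compares whole rows. The sketch below uses hypothetical data (including an invented hit_time column) to show the difference.

```r
library(dplyr)

# Hypothetical data; hit_time is an extra column added for illustration
test_df <- data.frame(
  post_pagename = c("A", "A", "A", "A", "B", "B", "C", "D"),
  visit_id      = c(1, 1, 2, 3, 1, 2, 1, 1),
  hit_time      = 1:8
)

# distinct() on all columns keeps all 8 rows here, because hit_time differs
by_rows <- test_df |>
  distinct() |>
  count(post_pagename)

# Restricting distinct() to the two relevant columns counts distinct visits
by_pairs <- test_df |>
  distinct(post_pagename, visit_id) |>
  count(post_pagename)

# n_distinct() per group is unaffected by the extra column
by_group <- test_df |>
  group_by(post_pagename) |>
  summarise(distinct_visit_ids = n_distinct(visit_id))
```

Naming the columns in distinct() makes the first approach robust to extra columns.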
Count number of unique values in two columns by group
df <- data.frame(Webpage = c(111, 111, 111, 111, 222, 222),
Dept = c(101, 101, 101, 102, 102, 103),
Emp_Id = c(1, 1, 2, 3, 4, 4))
library(dplyr)
df %>%
group_by(Webpage) %>%
summarise(n = n_distinct(Dept, Emp_Id))
#> # A tibble: 2 x 2
#> Webpage n
#> <dbl> <int>
#> 1 111 3
#> 2 222 2
library(data.table)
setDT(df)[, list(n = uniqueN(paste(Dept, Emp_Id, sep = "_"))), by = Webpage]
#> Webpage n
#> 1: 111 3
#> 2: 222 2
Created on 2021-03-30 by the reprex package (v1.0.0)
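For comparison, the same count can be sketched in base R without any packages: split the two key columns by Webpage and count the unique rows in each piece. This is an alternative technique, not part of the original answer.

```r
df <- data.frame(Webpage = c(111, 111, 111, 111, 222, 222),
                 Dept    = c(101, 101, 101, 102, 102, 103),
                 Emp_Id  = c(1, 1, 2, 3, 4, 4))

# One data frame of (Dept, Emp_Id) per Webpage, then unique rows in each
n_pairs <- sapply(split(df[c("Dept", "Emp_Id")], df$Webpage),
                  function(x) nrow(unique(x)))
```

The result is a named integer vector with one entry per Webpage value.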
Counting unique values based on two columns with repeated rows, R data frame
You can use rle for runs and table for tabulation:
table(rle(df$column2)$values)
# A B
# 2 1
See ?rle and ?table for details.
Or, if you want to take advantage of column1 (which is derived from column2):
table(unique(df)$column2)
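The question's data is not shown here, so the sketch below uses a hypothetical data frame in which column1 numbers the consecutive runs of column2. Under that assumption both approaches give the same table.

```r
# Hypothetical data: column1 is a run id derived from column2
df <- data.frame(
  column1 = c(1, 1, 2, 2, 3),
  column2 = c("A", "A", "B", "B", "A"),
  stringsAsFactors = FALSE  # rle() needs an atomic vector, not a factor
)

# rle() collapses consecutive repeats; table() then counts runs per value
run_counts <- table(rle(df$column2)$values)

# unique() drops the repeated (column1, column2) rows before tabulating
row_counts <- table(unique(df)$column2)
```

Note that rle() only merges adjacent duplicates, so the rows need to be in run order for the first approach to count correctly.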
dplyr count unique values in two columns without reshaping long
An alternative is c_across(), available since dplyr 1.0.0:
library(dplyr)
d %>%
group_by(Group) %>%
mutate(n = n_distinct(c_across(everything())))
# # A tibble: 6 x 4
# # Groups: Group [3]
# Group node1 node2 n
# <chr> <chr> <chr> <int>
# 1 A a w 2
# 2 B b r 3
# 3 B b t 3
# 4 C c z 4
# 5 C c u 4
# 6 C c i 4
Note: everything() in c_across() excludes grouping variables, i.e. Group, so n_distinct() actually takes c(node1, node2) as input. To specify variables explicitly, you can also use:
c_across(node1:node2)
c_across(starts_with('node'))
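To make the example reproducible, the sketch below rebuilds d from the printed output (character columns are an assumption) and uses the explicit node1:node2 selection.

```r
library(dplyr)

# Rebuilt from the printed tibble above
d <- data.frame(
  Group = c("A", "B", "B", "C", "C", "C"),
  node1 = c("a", "b", "b", "c", "c", "c"),
  node2 = c("w", "r", "t", "z", "u", "i")
)

res <- d %>%
  group_by(Group) %>%
  # c_across() pools node1 and node2 within each group into one vector
  mutate(n = n_distinct(c_across(node1:node2))) %>%
  ungroup()
```

The n column should come out as 2 for group A, 3 for group B and 4 for group C, as in the output above.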