Keeping Zero Count Combinations When Aggregating with Data.Table

Keeping zero count combinations when aggregating with data.table

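The most straightforward approach is to supply every category combination explicitly as a data.table in i=, and to set by=.EACHI so that j is evaluated once for each row of i. Combinations that never occur in dt still appear in the result, with N equal to 0: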

setkey(dt, sex, fruit)
dt[CJ(sex, fruit, unique = TRUE), .N, by = .EACHI]
#    sex  fruit N
# 1:   F  apple 2
# 2:   F orange 0
# 3:   F tomato 2
# 4:   H  apple 3
# 5:   H orange 1
# 6:   H tomato 1
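
The answer does not show how dt was built, so the following is only a sketch with hypothetical data invented to reproduce the counts above. It also uses an ad hoc join through on= instead of setkey(), which is an alternative to the keyed join shown, not the original author's code:

```r
library(data.table)

# Hypothetical data, invented so that the counts match the output above
dt <- data.table(
  sex   = c("F", "F", "F", "F", "H", "H", "H", "H", "H"),
  fruit = c("apple", "apple", "tomato", "tomato",
            "apple", "apple", "apple", "orange", "tomato")
)

# Every sex/fruit combination, sorted and de-duplicated by CJ()
all_combos <- CJ(sex = dt$sex, fruit = dt$fruit, unique = TRUE)

# Ad hoc join with on=; j (.N) is evaluated once per row of all_combos
counts <- dt[all_combos, .N, on = .(sex, fruit), by = .EACHI]
print(counts)
```

Using on= leaves dt unsorted and unkeyed, which may be preferable when the table is reused elsewhere.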

Complete with all combinations after counting on data.table

Here is one possible way to solve your problem. The argument with=FALSE tells data.table to select columns using the standard data.frame rules, so a character vector of column names can be used in j. In the example below, I assume that the columns used to compute all combinations are passed to myfun as a character vector.
Keep in mind that no column in your dataset should be named gcases. by=.EACHI (or keyby=.EACHI) evaluates j once for each row of i.

myfun <- function(d, g) {
  # get levels (for factors) and unique values for other types
  fn <- function(x) if (is.factor(x)) levels(x) else unique(x)
  gcases <- lapply(setDT(d, key = g)[, g, with = FALSE], fn)

  # count based on all combinations
  d[do.call(CJ, gcases), .N, keyby = .EACHI]
}
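
A short usage sketch may help show why factor levels are used instead of observed values. The data below are hypothetical, and the function is repeated only so that the block runs on its own:

```r
library(data.table)

# Same function as above, repeated to keep the sketch self-contained
myfun <- function(d, g) {
  fn <- function(x) if (is.factor(x)) levels(x) else unique(x)
  gcases <- lapply(setDT(d, key = g)[, g, with = FALSE], fn)
  d[do.call(CJ, gcases), .N, keyby = .EACHI]
}

# Hypothetical data: the factor level "c" never occurs in column a
d <- data.frame(
  a = factor(c("a", "a", "b"), levels = c("a", "b", "c")),
  b = c(1L, 2L, 1L)
)

res <- myfun(d, c("a", "b"))
print(res)
```

Because levels(x) is used for factors, the unused level "c" contributes rows with N equal to 0, in addition to the unobserved pair of "b" with 2.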

`data.table` how to get `keyby` to include all combinations of factors?

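Here is a tidyr/dplyr approach. Count rows per group with n(), then let complete() add the missing combinations with a count of 0:

library(dplyr)
library(tidyr)

dt1 %>%
  group_by(a, b) %>%
  summarise(c = n()) %>%   # n() counts rows in each group
  ungroup() %>%
  complete(a, b, fill = list(c = 0))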

R data table unique record count based on all combination of a given list of values from 2 columns

In base R, you can do:

data.frame(table(dt))

Var1 Var2 Freq
1 Col1Value1 Col2Value1 1
2 Col1Value2 Col2Value1 1
3 Col1Value3 Col2Value1 1
4 Col1Value1 Col2Value2 1
5 Col1Value2 Col2Value2 0
6 Col1Value3 Col2Value2 1
7 Col1Value1 Col2Value3 1
8 Col1Value2 Col2Value3 1
9 Col1Value3 Col2Value3 1
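
As a minimal sketch of the same idea on a smaller input, assuming a hypothetical two-column data frame in which one pair of values never occurs:

```r
# Hypothetical data: the pair (Col1Value2, Col2Value2) is absent
dt <- data.frame(
  Col1 = c("Col1Value1", "Col1Value2", "Col1Value1"),
  Col2 = c("Col2Value1", "Col2Value1", "Col2Value2")
)

# table() cross-tabulates every combination, including empty cells;
# as.data.frame() turns the contingency table into long format
counts <- as.data.frame(table(dt))
print(counts)
```

The absent pair shows up as a row with Freq equal to 0, since table() fills every cell of the cross-tabulation.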

Populating a count matrix with permutations of R data.table rows

Here's a data.table solution that seems to be efficient. We do a self join to create the combinations and then count them. Then, similar to what @coldspeed did with NumPy, we update a zero matrix at the locations given by the counts.

# a self join
tmp <- dt[dt,
          .(V1, id = x.V3, id2 = V3),
          on = .(V1, V3 < V3),
          nomatch = 0L,
          allow.cartesian = TRUE
          ][, .N, by = .(id, id2)]

## Create a zero matrix and update by locations
m <- array(0L, rep(max(dt$V3), 2L))
m[cbind(tmp$id, tmp$id2)] <- tmp$N
m + t(m)

# [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
# [1,] 0 2 0 0 1 0 0 1
# [2,] 2 0 0 0 1 0 0 1
# [3,] 0 0 0 0 0 0 0 0
# [4,] 0 0 0 0 1 0 1 0
# [5,] 1 1 0 1 0 0 1 1
# [6,] 0 0 0 0 0 0 0 0
# [7,] 0 0 0 1 1 0 0 0
# [8,] 1 1 0 0 1 0 0 0
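
The matrix update step relies on base R indexing with a two-column matrix of (row, column) positions. A minimal sketch, using a small hypothetical tmp in place of the join result:

```r
# Hypothetical pair counts, standing in for the result of the self join
tmp <- data.frame(
  id  = c(1L, 1L, 2L),
  id2 = c(2L, 3L, 3L),
  N   = c(2L, 1L, 1L)
)

# Zero matrix, then assign counts at the (id, id2) positions
m <- array(0L, c(3L, 3L))
m[cbind(tmp$id, tmp$id2)] <- tmp$N

# Only the upper triangle was filled; adding the transpose mirrors it
full <- m + t(m)
print(full)
```

Since each pair is stored once with id < id2, adding t(m) produces a symmetric matrix without double counting.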

Alternatively, we could create tmp using data.table::CJ, but that could be less memory efficient (thanks to @Frank for the tip), because it creates all possible combinations first and only then filters them, e.g.

tmp <- dt[, CJ(V3, V3)[V1 < V2], by = .(g = V1)][, .N, by = .(V1, V2)]

## Then, as previously
m <- array(0L, rep(max(dt$V3), 2L))
m[cbind(tmp$V1, tmp$V2)] <- tmp$N
m + t(m)

# [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
# [1,] 0 2 0 0 1 0 0 1
# [2,] 2 0 0 0 1 0 0 1
# [3,] 0 0 0 0 0 0 0 0
# [4,] 0 0 0 0 1 0 1 0
# [5,] 1 1 0 1 0 0 1 1
# [6,] 0 0 0 0 0 0 0 0
# [7,] 0 0 0 1 1 0 0 0
# [8,] 1 1 0 0 1 0 0 0

For R data.table, how to use uniqueN() in order count unique/distinct values in multiple columns?

To answer your question, yes, you can just add both columns to the by argument:

dt[, .(distinct_groups = uniqueN(order_no)), by = c("Name", "Overlimit")]
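
A small sketch of that call on hypothetical data (the column names follow the answer, the values are invented):

```r
library(data.table)

# Hypothetical orders; order_no 1 appears twice within the same group
dt <- data.table(
  Name      = c("A", "A", "A", "B", "B"),
  Overlimit = c("Y", "Y", "N", "N", "N"),
  order_no  = c(1L, 1L, 2L, 3L, 4L)
)

# One row per Name/Overlimit pair, counting distinct order numbers
res <- dt[, .(distinct_groups = uniqueN(order_no)), by = c("Name", "Overlimit")]
print(res)
```

Note that this only returns combinations that occur in the data; pairs that never appear are not listed with a count of 0.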

