How to group data.table by multiple columns?
Use by = list(adShown, url) or the quoted character form by = c("adShown", "url"); the unquoted by = c(adShown, url) does not work, because it concatenates the two columns into a single vector.
Example:
set.seed(007)
DF <- data.frame(X=1:20, Y=sample(c(0,1), 20, TRUE), Z=sample(0:5, 20, TRUE))
library(data.table)
DT <- data.table(DF)
DT[, Mean:=mean(X), by=list(Y, Z)]
X Y Z Mean
1: 1 1 3 1.000000
2: 2 0 1 9.333333
3: 3 0 5 7.400000
4: 4 0 5 7.400000
5: 5 0 5 7.400000
6: 6 1 0 6.000000
7: 7 0 3 7.000000
8: 8 1 2 12.500000
9: 9 0 5 7.400000
10: 10 0 2 15.000000
11: 11 0 4 14.500000
12: 12 0 1 9.333333
13: 13 1 1 13.000000
14: 14 0 1 9.333333
15: 15 0 2 15.000000
16: 16 0 5 7.400000
17: 17 1 2 12.500000
18: 18 0 4 14.500000
19: 19 1 5 19.000000
20: 20 0 2 15.000000
data.table summary by group for multiple columns
dt <- dt[, .(y = mean(y), z = mean(z)), by=.(a)]
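The one-liner above has no data attached; here is a minimal runnable sketch with hypothetical toy data (the names a, y, z mirror the snippet):

```r
library(data.table)

# hypothetical toy data: two groups in 'a', two value columns
dt <- data.table(a = c(1, 1, 2, 2),
                 y = c(10, 20, 30, 40),
                 z = c(1, 3, 5, 7))

# one summary row per group, several columns aggregated at once
res <- dt[, .(y = mean(y), z = mean(z)), by = .(a)]
# a = 1 -> y 15, z 2; a = 2 -> y 35, z 6
```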
Apply multiple functions to multiple columns in data.table by group
First you need to change your function. data.table expects consistent types and median
can return integer or double values depending on input.
my.summary <- function(x) list(mean = mean(x), median = as.numeric(median(x)))
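The type instability of median() is easy to demonstrate:

```r
# median() returns an integer or a double depending on its input, which
# violates data.table's requirement of one consistent type per result column
typeof(median(1:3))              # "integer": odd-length integer input
typeof(median(1:4))              # "double": even length forces interpolation
typeof(as.numeric(median(1:3)))  # "double": coercion stabilizes the type
```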
Then you need to ensure that only the first level of the nested list is unlisted. The result of the unlist
call still needs to be a list (remember, a data.table is a list of column vectors).
DT[, unlist(lapply(.SD, my.summary), recursive = FALSE), by = c, .SDcols = c("a", "b")]  # here 'c' is the name of the grouping column, not base::c
# c a.mean a.median b.mean b.median
#1: 1 1.5 1.5 2.5 2.5
#2: 2 4.0 4.0 5.0 5.0
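A self-contained sketch, with hypothetical toy data chosen to be consistent with the printed result (value columns a and b, grouping column c; the quoted form by = "c" is used to avoid confusion with base::c):

```r
library(data.table)

DT <- data.table(a = 1:5, b = 2:6, c = c(1, 1, 2, 2, 2))

my.summary <- function(x) list(mean = mean(x), median = as.numeric(median(x)))

# recursive = FALSE unlists only the outer level, so each inner list element
# becomes its own column: a.mean, a.median, b.mean, b.median
res <- DT[, unlist(lapply(.SD, my.summary), recursive = FALSE),
          by = "c", .SDcols = c("a", "b")]
```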
Multiple functions on multiple columns by group, and create informative column names
If I understand correctly, this question consists of two parts:
- How to group and aggregate with multiple functions over a list of columns and generate new column names automatically.
- How to pass the names of the functions as a character vector.
For part 1, this is nearly a duplicate of Apply multiple functions to multiple columns in data.table, but with the additional requirement that the results should be grouped using by =.
Therefore, eddi's answer has to be modified by adding the parameter recursive = FALSE in the call to unlist():
my.summary = function(x) list(N = length(x), mean = mean(x), median = median(x))
dt[, unlist(lapply(.SD, my.summary), recursive = FALSE),
.SDcols = ColChoice, by = category]
category c1.N c1.mean c1.median c4.N c4.mean c4.median
1: f 3974 9999.987 9999.989 3974 9.994220 9.974125
2: w 4033 10000.008 9999.991 4033 10.004261 9.986771
3: n 4025 9999.981 10000.000 4025 10.003686 9.998259
4: x 3975 10000.035 10000.019 3975 10.010448 9.995268
5: k 3957 10000.019 10000.017 3957 9.991886 10.007873
6: j 4027 10000.026 10000.023 4027 10.015663 9.998103
...
For part 2, we need to create my.summary() from a character vector of function names. This can be achieved by "programming on the language", i.e., by assembling an expression as a character string and finally parsing and evaluating it:
library(magrittr)  # for %>%
FunChoice <- c("length", "mean", "sum")
my.summary <-
  sapply(FunChoice, function(f) paste0(f, "(x)")) %>%
  paste(collapse = ", ") %>%
  sprintf("function(x) setNames(list(%s), FunChoice)", .) %>%
  parse(text = .) %>%
  eval()
my.summary
function(x) setNames(list(length(x), mean(x), sum(x)), FunChoice)
<environment: 0xe376640>
Alternatively, we can loop over the categories and rbind() the results afterwards:
library(magrittr) # used only to improve readability
lapply(dt[, unique(category)],
function(x) dt[category == x,
c(.(category = x), unlist(lapply(.SD, my.summary))),
.SDcols = ColChoice]) %>%
rbindlist()
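Since dt, ColChoice, and my.summary are defined only later in the benchmark section, here is a self-contained toy version of the loop-and-rbind approach (data and column names are hypothetical):

```r
library(data.table)
library(magrittr)

dt <- data.table(category = c("a", "a", "b"),
                 c1 = c(1, 3, 5),
                 c4 = c(2, 4, 6))
ColChoice  <- c("c1", "c4")
my.summary <- function(x) list(length = length(x), mean = mean(x), sum = sum(x))

# one subset per category, summarized, then stacked into a single table
res <- lapply(dt[, unique(category)],
              function(x) dt[category == x,
                             c(.(category = x), unlist(lapply(.SD, my.summary))),
                             .SDcols = ColChoice]) %>%
  rbindlist()
```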
Benchmark
So far, four data.table solutions and one dplyr solution have been posted. At least one of the answers claims to be "superfast". So, I wanted to verify by a benchmark with varying numbers of rows:
library(data.table)
library(magrittr)
bm <- bench::press(
n = 10L^(2:6),
{
set.seed(12212018)
dt <- data.table(
index = 1:n,
category = sample(letters[1:25], n, replace = TRUE),
c1 = rnorm(n, 10000),
c2 = rnorm(n, 1000),
c3 = rnorm(n, 100),
c4 = rnorm(n, 10)
)
# use set() instead of <<- for appending additional columns
for (i in 5:100) set(dt, , paste0("c", i), rnorm(n, 1000))
tables()
ColChoice <- c("c1", "c4")
FunChoice <- c("length", "mean", "sum")
my.summary <- function(x) list(length = length(x), mean = mean(x), sum = sum(x))
bench::mark(
unlist = {
dt[, unlist(lapply(.SD, my.summary), recursive = FALSE),
.SDcols = ColChoice, by = category]
},
loop_category = {
lapply(dt[, unique(category)],
function(x) dt[category == x,
c(.(category = x), unlist(lapply(.SD, my.summary))),
.SDcols = ColChoice]) %>%
rbindlist()
},
dcast = {
dcast(dt, category ~ 1, fun = list(length, mean, sum), value.var = ColChoice)
},
loop_col = {
lapply(ColChoice, function(col)
dt[, setNames(lapply(FunChoice, function(f) get(f)(get(col))),
paste0(col, "_", FunChoice)),
by=category]
) %>%
Reduce(function(x, y) merge(x, y, by="category"), .)
},
dplyr = {
dt %>%
dplyr::group_by(category) %>%
dplyr::summarise_at(dplyr::vars(ColChoice), .funs = setNames(FunChoice, FunChoice))
},
check = function(x, y)
all.equal(setDT(x)[order(category)],
setDT(y)[order(category)] %>%
setnames(stringr::str_replace(names(.), "_", ".")),
ignore.col.order = TRUE,
check.attributes = FALSE
)
)
}
)
The results are easier to compare when plotted:
library(ggplot2)
autoplot(bm)
Please note the logarithmic time scale.
For this test case, the unlist approach is always the fastest method, followed by dcast. dplyr is catching up for larger problem sizes n. Both lapply/loop approaches are less performant. In particular, Parfait's approach of looping over the columns and merging subresults afterwards seems to be rather sensitive to problem size n.
Edit: 2nd benchmark
As suggested by jangorecki, I have repeated the benchmark with many more rows and with a varying number of groups. Due to memory limitations, the largest problem size is 10 M rows times 102 columns, which takes 7.7 GBytes of memory. So, the first part of the benchmark code is modified to
bm <- bench::press(
n_grp = 10^(1:3),
n_row = 10L^seq(3, 7, by = 2),
{
set.seed(12212018)
dt <- data.table(
index = 1:n_row,
category = sample(n_grp, n_row, replace = TRUE),
c1 = rnorm(n_row),
c2 = rnorm(n_row),
c3 = rnorm(n_row),
c4 = rnorm(n_row, 10)
)
for (i in 5:100) set(dt, , paste0("c", i), rnorm(n_row, 1000))
tables()
...
As expected by jangorecki, some solutions are more sensitive to the number of groups than others. In particular, the performance of loop_category degrades much more strongly with the number of groups, while dcast seems to be less affected. For fewer groups, the unlist approach is always faster than dcast, while for many groups dcast is faster. However, for larger problem sizes unlist seems to be ahead of dcast.
Edit 2019-03-12: Computing on the language, 3rd benchmark
Inspired by this follow-up question, I have added a computing-on-the-language approach where the whole expression is created as a character string, parsed, and evaluated.
The expression is created by
library(magrittr)
ColChoice <- c("c1", "c4")
FunChoice <- c("length", "mean", "sum")
my.expression <- CJ(ColChoice, FunChoice, sorted = FALSE)[
, sprintf("%s.%s = %s(%s)", V1, V2, V2, V1)] %>%
paste(collapse = ", ") %>%
sprintf("dt[, .(%s), by = category]", .) %>%
parse(text = .)
my.expression
expression(dt[, .(c1.length = length(c1), c1.mean = mean(c1), c1.sum = sum(c1),
c4.length = length(c4), c4.mean = mean(c4), c4.sum = sum(c4)), by = category])
This is then evaluated by
eval(my.expression)
which yields
category c1.length c1.mean c1.sum c4.length c4.mean c4.sum
1: f 3974 9999.987 39739947 3974 9.994220 39717.03
2: w 4033 10000.008 40330032 4033 10.004261 40347.19
3: n 4025 9999.981 40249924 4025 10.003686 40264.84
4: x 3975 10000.035 39750141 3975 10.010448 39791.53
5: k 3957 10000.019 39570074 3957 9.991886 39537.89
6: j 4027 10000.026 40270106 4027 10.015663 40333.07
...
I have modified the code of the 2nd benchmark to include this approach, but had to reduce the additional columns from 100 to 25 in order to cope with the memory limitations of a much smaller PC. The chart shows that the "eval" approach is almost always the fastest or second fastest:
Subset groups in a data.table using conditions on two columns
With data.table:
DT[,.SD[any(x == 3 & y == 6)], by=group]
group id y x
<char> <int> <num> <num>
1: B 5 6 3
2: B 6 8 3
3: B 7 14 3
4: B 8 19 2
5: C 9 10 2
6: C 10 9 3
7: C 11 6 3
8: C 12 8 3
Another possibly faster option:
DT[, if (any(x == 3 & y == 6)) .SD, by=group]
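Both forms keep every row of each qualifying group; a minimal sketch with hypothetical data:

```r
library(data.table)

DT <- data.table(group = c("A", "A", "B", "B"),
                 x     = c(1, 2, 3, 2),
                 y     = c(6, 7, 6, 9))

# .SD[any(...)] subsets each group's rows (empty subsets are dropped) ...
r1 <- DT[, .SD[any(x == 3 & y == 6)], by = group]
# ... while if (any(...)) .SD returns NULL for non-matching groups,
# skipping the empty subset entirely, which is why it is often faster
r2 <- DT[, if (any(x == 3 & y == 6)) .SD, by = group]

all.equal(r1, r2)  # both contain only the two rows of group "B"
```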
apply .GRP to multiple columns in data.table R to group each column separately
One way would be using match and unique:
library(data.table)
cols <- paste0('col', 1:3)
DT[, paste0('new_', cols) := lapply(.SD, function(x)
match(x, unique(x))), .SDcols = cols]
DT
# col1 col2 col3 new_col1 new_col2 new_col3
# 1: A DD FFF1 1 1 1
# 2: B GG HHH1 2 2 2
# 3: B RR CCC1 2 3 3
# 4: D HH AAA1 3 4 4
# 5: B SS FFF1 2 5 1
# 6: A AA RRR1 1 6 5
# 7: A CC GGG1 1 7 6
# 8: B RR DDD1 2 3 7
# 9: R EE FFF1 4 8 1
#10: T DD JJJ1 5 1 8
#11: E HH VVV1 6 4 9
#12: E BB CCC1 6 9 3
#13: H CC AAA1 7 7 4
#14: T AA XXX1 5 6 10
#15: Y QQ GGG1 8 10 6
#16: F EE HHH1 9 8 2
#17: F YY AAA1 9 11 4
#18: F MM RRR1 9 12 5
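An alternative sketch (my variation, not from the answer above): since plain by = groups in order of first appearance, .GRP produces the same first-appearance numbering as match(x, unique(x)), at the cost of one grouping pass per column:

```r
library(data.table)

# hypothetical toy data
DT <- data.table(col1 = c("A", "B", "B", "D"),
                 col2 = c("DD", "GG", "RR", "HH"))
cols <- c("col1", "col2")

# number each column's values by first appearance using .GRP
for (col in cols) {
  DT[, paste0("new_", col) := .GRP, by = col]
}
# new_col1: 1 2 2 3; new_col2: 1 2 3 4
```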
How to group data.table by several columns consecutively
With development version 1.10.5, data.table has gained grouping-sets aggregation functions which calculate aggregates at various levels of grouping, producing multiple (sub-)totals.
library(data.table)
# data.table 1.10.5 IN DEVELOPMENT built 2018-01-31 02:23:45 UTC
grp_vars <- setdiff(names(DF), "X")
groupingsets(setDT(DF), mean(X), by = grp_vars, sets = as.list(grp_vars))
Y Z V1
1: 1 NA 24960.98
2: 0 NA 25039.96
3: NA 5 24652.44
4: NA 0 25006.61
5: NA 2 25223.83
6: NA 3 24959.26
7: NA 1 25095.58
8: NA 4 25068.84
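The DF behind this output is much larger than the one shown at the top; here is a self-contained small-scale sketch reusing the toy DF construction from the first example (so the aggregate values will differ from those printed above):

```r
library(data.table)

set.seed(007)
DF <- data.frame(X = 1:20,
                 Y = sample(c(0, 1), 20, TRUE),
                 Z = sample(0:5, 20, TRUE))

grp_vars <- setdiff(names(DF), "X")  # "Y", "Z"

# one aggregation per grouping variable, stacked into one result;
# NA marks the variable that is not part of the current grouping set
res <- groupingsets(setDT(DF), mean(X), by = grp_vars, sets = as.list(grp_vars))
```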
Benchmark
# create data
n_rows = 1e6L
n_vars = 5
n_grps = 1e2L
set.seed(007)
DT <- data.table(rn = seq_len(n_rows))
for (i in seq_len(n_vars)) set(DT, , paste0("X", i), i*rnorm(n_rows))
for (i in seq_len(n_grps)) set(DT, , paste0("Z", i), sample(0:i, n_rows, TRUE))
grps <- grep("^Z", names(DT), value = TRUE)
vars <- grep("^X", names(DT), value = TRUE)
# run benchmark
bm <- microbenchmark::microbenchmark(
gs = {
groupingsets(DT, lapply(.SD, mean), by = grps, sets = as.list(grps), .SDcols = vars)
},
lapply1 = {
rbindlist(lapply(grps, function(K) DT[, lapply(.SD, mean), by = K, .SDcols = vars]),
fill = TRUE)
},
lapply2 = {
out <- lapply(grps, function(K) DT[, lapply(.SD, mean), by = get(K), .SDcols = vars])
do.call(rbind, out)
},
times = 3L
)
print(bm)
Even with 1 M rows and 100 grouping vars, there is no remarkable difference in the run times (groupingsets() seems to be a little slower than the two other approaches):
Unit: seconds
expr min lq mean median uq max neval
gs 3.602689 3.606646 3.608343 3.610603 3.611169 3.611735 3
lapply1 3.524957 3.546060 3.561130 3.567163 3.579217 3.591270 3
lapply2 3.562424 3.569284 3.577199 3.576144 3.584586 3.593027 3