Apply Multiple Functions to Multiple Columns in Data.Table

Apply multiple functions to multiple columns in data.table

I'd normally do this:

my.summary = function(x) list(mean = mean(x), median = median(x))

DT[, unlist(lapply(.SD, my.summary)), .SDcols = c('a', 'b')]
#a.mean a.median   b.mean b.median 
#     3        3        4        4

Apply multiple functions to multiple columns in data.table by group

First you need to change your function. data.table expects consistent types and median can return integer or double values depending on input.

my.summary <- function(x) list(mean = mean(x), median = as.numeric(median(x)))

Then you need to ensure that only the first level of the nested list is unlisted. The result of the unlist call still needs to be a list (remember, a data.table is a list of column vectors).

DT[, unlist(lapply(.SD, my.summary), recursive = FALSE), by = c, .SDcols = c("a", "b")]
#   c a.mean a.median b.mean b.median
#1: 1    1.5      1.5    2.5      2.5
#2: 2    4.0      4.0    5.0      5.0

Multiple functions on multiple columns by group, and create informative column names

If I understand correctly, this question consists of two parts:

How to group and aggregate with multiple functions over a list of columns and generate new column names automatically.
How to pass the names of the functions as a character vector.

For part 1, this is nearly a duplicate of Apply multiple functions to multiple columns in data.table but with the additional requirement that the results should be grouped using by =.

Therefore, eddi's answer has to be modified by adding the parameter recursive = FALSE in the call to unlist():

my.summary = function(x) list(N = length(x), mean = mean(x), median = median(x))
dt[, unlist(lapply(.SD, my.summary), recursive = FALSE), 
   .SDcols = ColChoice, by = category]

    category c1.N   c1.mean c1.median c4.N   c4.mean c4.median
 1:        f 3974  9999.987  9999.989 3974  9.994220  9.974125
 2:        w 4033 10000.008  9999.991 4033 10.004261  9.986771
 3:        n 4025  9999.981 10000.000 4025 10.003686  9.998259
 4:        x 3975 10000.035 10000.019 3975 10.010448  9.995268
 5:        k 3957 10000.019 10000.017 3957  9.991886 10.007873
 6:        j 4027 10000.026 10000.023 4027 10.015663  9.998103
...

For part 2, we need to create my.summary() from a character vector of function names. This can be achieved by "programming on the language", i.e, by assembling an expression as character string and finally parsing and evaluating it:

my.summary <- 
  sapply(FunChoice, function(f) paste0(f, "(x)")) %>% 
  paste(collapse = ", ") %>% 
  sprintf("function(x) setNames(list(%s), FunChoice)", .) %>% 
  parse(text = .) %>% 
  eval()

my.summary

function(x) setNames(list(length(x), mean(x), sum(x)), FunChoice)
<environment: 0xe376640>

Alternatively, we can loop over the categories and rbind() the results afterwards:

library(magrittr)   # used only to improve readability
lapply(dt[, unique(category)],
       function(x) dt[category == x, 
                      c(.(category = x), unlist(lapply(.SD, my.summary))), 
                      .SDcols = ColChoice]) %>% 
  rbindlist()

Benchmark

So far, 4 data.table and one dplyr solutions have been posted. At least one of the answers claims to be "superfast". So, I wanted to verify by a benchmark with varying number of rows:

library(data.table)
library(magrittr)
bm <- bench::press(
  n = 10L^(2:6),
  {
    set.seed(12212018)
    dt <- data.table(
      index = 1:n,
      category = sample(letters[1:25], n, replace = T),
      c1 = rnorm(n, 10000),
      c2 = rnorm(n, 1000),
      c3 = rnorm(n, 100),
      c4 = rnorm(n, 10)
    )
    # use set() instead of <<- for appending additional columns
    for (i in 5:100) set(dt, , paste0("c", i), rnorm(n, 1000))
    tables()

    ColChoice <- c("c1", "c4")
    FunChoice <- c("length", "mean", "sum")
    my.summary <- function(x) list(length = length(x), mean = mean(x), sum = sum(x))
    
    bench::mark(
      unlist = {
        dt[, unlist(lapply(.SD, my.summary), recursive = FALSE),
           .SDcols = ColChoice, by = category]
      },
      loop_category = {
        lapply(dt[, unique(category)],
               function(x) dt[category == x, 
                              c(.(category = x), unlist(lapply(.SD, my.summary))), 
                              .SDcols = ColChoice]) %>% 
          rbindlist()
        },
      dcast = {
        dcast(dt, category ~ 1, fun = list(length, mean, sum), value.var = ColChoice)
        },
      loop_col = {
        lapply(ColChoice, function(col)
          dt[, setNames(lapply(FunChoice, function(f) get(f)(get(col))), 
                        paste0(col, "_", FunChoice)), 
             by=category]
        ) %>% 
          Reduce(function(x, y) merge(x, y, by="category"), .)
      },
      dplyr = {
        dt %>% 
          dplyr::group_by(category) %>% 
          dplyr::summarise_at(dplyr::vars(ColChoice), .funs = setNames(FunChoice, FunChoice))
      },
      check = function(x, y) 
        all.equal(setDT(x)[order(category)], 
                  setDT(y)[order(category)] %>%  
                    setnames(stringr::str_replace(names(.), "_", ".")),
                  ignore.col.order = TRUE,
                  check.attributes = FALSE
                  )
    )  
  }
)

The results are easier to compare when plotted:

library(ggplot2)
autoplot(bm)

Sample Image

Please, note the logarithmic time scale.

For this test case, the unlist approach is always the fastest method, followed by dcast. dplyr is catching up for larger problem sizes n. Both lapply/loop approaches are less performant. In particular, Parfait's approach to loop over the columns and merge subresults afterwards seems to be rather sensitive to problem sizes n.

Edit: 2nd benchmark

As suggested by jangorecki, I have repeated the benchmark with much more rows and also with a varying number of groups.
Due to memory limitations, the largest problem size is 10 M rows times 102 columns which takes 7.7 GBytes of memory.

So, the first part of the benchmark code is modified to

bm <- bench::press(
  n_grp = 10^(1:3),
  n_row = 10L^seq(3, 7, by = 2),
  {
    set.seed(12212018)
    dt <- data.table(
      index = 1:n_row,
      category = sample(n_grp, n_row, replace = TRUE),
      c1 = rnorm(n_row),
      c2 = rnorm(n_row),
      c3 = rnorm(n_row),
      c4 = rnorm(n_row, 10)
    )
    for (i in 5:100) set(dt, , paste0("c", i), rnorm(n_row, 1000))
    tables()
    ...

Sample Image

As expected by jangorecki, some solutions are more sensitive to the number of groups than others. In particular, performance of loop_category is degrading much stronger with the number of groups while dcast seems to be less affected. For fewer groups, the unlist approach is always faster than dcast while for many groups dcast is faster. However, for larger problem sizes unlist seems to be ahead of dcast.

Edit 2019-03-12: Computing on the language, 3rd benchmark

Inspired by this follow-up question, I have have added a computing on the language approach where the whole expression is created as character string, parsed and evaluated.

The expression is created by

library(magrittr)
ColChoice <- c("c1", "c4")
FunChoice <- c("length", "mean", "sum")
my.expression <- CJ(ColChoice, FunChoice, sorted = FALSE)[
  , sprintf("%s.%s = %s(%s)", V1, V2, V2, V1)] %>% 
  paste(collapse = ", ") %>% 
  sprintf("dt[, .(%s), by = category]", .) %>% 
  parse(text = .)
my.expression

expression(dt[, .(c1.length = length(c1), c1.mean = mean(c1), c1.sum = sum(c1), 
                  c4.length = length(c4), c4.mean = mean(c4), c4.sum = sum(c4)), by = category])

This is then evaluated by

eval(my.expression)

which yields

    category c1.length   c1.mean   c1.sum c4.length   c4.mean   c4.sum
 1:        f      3974  9999.987 39739947      3974  9.994220 39717.03
 2:        w      4033 10000.008 40330032      4033 10.004261 40347.19
 3:        n      4025  9999.981 40249924      4025 10.003686 40264.84
 4:        x      3975 10000.035 39750141      3975 10.010448 39791.53
 5:        k      3957 10000.019 39570074      3957  9.991886 39537.89
 6:        j      4027 10000.026 40270106      4027 10.015663 40333.07
 ...

I have modified the code of the 2nd benchmark to include this approach but had to reduce the additional columns from 100 to 25 in order to cope with the memory limitations of a much smaller PC. The chart shows that the "eval" approach is almost always the fastest or second:

Sample Image

Apply multiple functions to multiple columns of data table in R

With another lapply:

f_lapply = function(DT, col_by, col_func, col_new, func){
  temp = DT[, mget(unique(c(col_by,col_func)))]
  
  ##loop to remove
  # 1:length(func)){
  #   temp[ ,eval(col_new[i]) := do.call(func[[i]], lapply(col_func, function(x) get(x))), by = mget(col_by)]
  # }
  ##

  temp[ ,(col_new) := lapply(func, function(y) do.call(y, lapply(col_func, function(x) get(x)))), by = mget(col_by)]
  
  temp = unique(temp[,mget(c(col_new, col_by))])
  return(temp)
}

DT = data.table(iris)

col_by = c("Species")
col_func = c("Petal.Length","Petal.Width")
col_new = c("PL.mean","PL.max")
func = list(function(x,y) return(mean(x[y == max(y)])), function(x,y) return(max(x[y == max(y)])))

f_lapply(DT, col_by, col_func,col_new,func)
#>     PL.mean PL.max    Species
#>       <num>  <num>     <fctr>
#> 1: 1.600000    1.6     setosa
#> 2: 4.800000    4.8 versicolor
#> 3: 5.933333    6.1  virginica

About 25% more efficient on this example:

microbenchmark::microbenchmark(f_lapply(DT, col_by, col_func,col_new,func),
                               f(DT, col_by, col_func,col_new,func))
Unit: milliseconds
                                          expr    min     lq     mean  median      uq     max neval
 f_lapply(DT, col_by, col_func, col_new, func) 2.1065 2.2694 2.791411 2.41730 2.58115 13.4727   100
        f(DT, col_by, col_func, col_new, func) 2.8480 3.0575 3.672728 3.19515 3.46075 17.1140   100

Apply a function to every specified column in a data.table and update by reference

This seems to work:

dt[ , (cols) := lapply(.SD, "*", -1), .SDcols = cols]

The result is

    a  b d
1: -1 -1 1
2: -2 -2 2
3: -3 -3 3

There are a few tricks here:

Because there are parentheses in (cols) :=, the result is assigned to the columns specified in cols, instead of to some new variable named "cols".
.SDcols tells the call that we're only looking at those columns, and allows us to use .SD, the Subset of the Data associated with those columns.
lapply(.SD, ...) operates on .SD, which is a list of columns (like all data.frames and data.tables). lapply returns a list, so in the end j looks like cols := list(...).

EDIT: Here's another way that is probably faster, as @Arun mentioned:

for (j in cols) set(dt, j = j, value = -dt[[j]])

How to apply multiple functions to multiple columns within by?

There is an option in unlist to avoid unlisting recursively - the recursive parameter (By default, the recursive = TRUE)

DT[,unlist(lapply(.SD,my.sum.fun), 
      recursive = FALSE),.SDcols=c("mpg","hp"),by=list(cyl)]
#   cyl mpg.mean mpg.median   mpg.sd   hp.mean hp.median    hp.sd
#1:   6 19.74286       19.7 1.453567 122.28571     110.0 24.26049
#2:   4 26.66364       26.0 4.509828  82.63636      91.0 20.93453
#3:   8 15.10000       15.2 2.560048 209.21429     192.5 50.97689

Generating multiple new columns in data.table using multiple functions applied to multiple columns

The reason your solution with my.summary is not working is that unlist is recursive by default,
so it ends up packing all values from all nested lists in a single vector,
and data.table ends up recycling values silently.
Taking into account Jaap's comment,
you can write:

my.summary = function(x) list(sum(x<=7), sum(x>7 & x<=31))

dt[, c("scliq.s", "symgr.s", "scliq.d", "symgr.d") := unlist(lapply(.SD, my.summary), recursive = FALSE),
   .SDcols = c("time1", "time2"), by = p]

For the means, I can think of 2 options,
the first one uses .SD and by,
which can be slow at times:

dt[, c("mean1", "mean2") := .(.SD[time1 <= 7, mean(closeness1)], 
                              .SD[time2 > 7 & time2 <= 31, mean(closeness2)]),
   by = p,
   .SDcols = time1:closeness2]

The other option is to calculate the means in a sub-table and then join back:

dt[dt[time1 <= 7, .(ans = mean(closeness1)), by = p], mean1 := ans, on = "p"]
dt[dt[time2 > 7 & time2 <= 31, .(ans = mean(closeness2)), by = p], mean2 := ans, on = "p"]

Depending on your actual data,
one might be faster than the other,
so you should time them.

Apply different functions to different columns programmatically in data.table R

An option is to use mapply:

DT[, mapply(function(f,x) as.list(f(x)), aggregate_functions, .SD), id, 
    .SDcols=col_selection]

Need to careful on the ordering of col_selection and aggregate_functions so that the right function is applied to the right column.

output:

   id v1 v2
1:  a  2 15
2:  b  3 30
3:  c  4 45

Edit from the OP:

Just to complete this brilliant solution.
This solution works very well and if we replace col_selection with names(aggregate_functions) there is no issue with the ordering. Plus it automatically discards all the columns that are not in the list:

library(data.table)
DT = data.table(id = rep(letters[1:3], each=3),
                v1 = rep(c(2, 3, 4), each=3),
                v2 = rep(c(5, 10, 15), each=3),
                id2 = c(rep(c('cc', 'dd'), 4), 'dd')
                )
aggregate_functions = list(v1=mean, v2=sum)
DT[, mapply(function(f,x) as.list(f(x)), aggregate_functions, .SD), id, 
   .SDcols=names(aggregate_functions)]
#>    id v1 v2
#> 1:  a  2 15
#> 2:  b  3 30
#> 3:  c  4 45

It is also possible to use multiple variables to aggregate by, by passing a list:

DT[, mapply(function(f,x) as.list(f(x)), aggregate_functions, .SD), list(id, id2), 
   .SDcols=names(aggregate_functions)]
#>    id id2 v1 v2
#> 1:  a  cc  2 10
#> 2:  a  dd  2  5
#> 3:  b  dd  3 20
#> 4:  b  cc  3 10
#> 5:  c  cc  4 15
#> 6:  c  dd  4 30

^{Created on 2019-11-27 by the reprex package (v0.3.0)}

Apply a function to multiple columns of a SparkDataFrame, at once

You can extract the column names as a list using SparkR::colnames function, and then use base::lapply on that list. Note that the function argument inside lapply has to use the columns as a Spark column object (SparkR::column). Example below:

df <- data.frame(v1 = c(1:3), v2 = c(3:5), v3 = c(8:10))
sdf <- SparkR::createDataFrame(df)
cols <- SparkR::colnames(sdf)
modify_cols <- c("v2", "v3")
spark_cols_new <- lapply(cols, function(x) { 
    if (!x %in% modify_cols){
      SparkR::column(x)
    } else {
      SparkR::alias(SparkR::column(x) * SparkR::lit(100), x)
    }
})
sdf_new <- SparkR::select(sdf, spark_cols_new)

Note that, if you intend to use a constant then it can be provided directly instead of using SparkR::lit function as well, but it is a safer choice.

What is the most elegant way to apply a function to multiple pairs of columns in a data.table or data.frame?

1) gv Using gv in the collapse package we could do this:

library(collapse)

DT[, (result.cols) := gv(.SD, one.cols) - gv(.SD, two.cols)]

2) gvr We can alternately use the regex variant of gv to eliminate one.cols and two.cols:

library(collapse)

result.cols <- sub(1, 3, gvr(DT, "1$", "names"))
DT[, (result.cols) := gvr(.SD, "1$") - gvr(.SD, "2$")]

3) across Using dplyr we can use across eliminating result.cols as well.

library(dplyr)

DT %>%
  mutate(across(ends_with("1"), .names="{sub(1,3,.col)}") - across(ends_with("2")))

4) data.table If we write it like this it is straight forward in data.table:

DT[, result.cols] <- DT[, ..one.cols] - DT[, ..two.cols]

DT[, (result.cols) := .SD[, one.cols, with=FALSE] - .SD[, two.cols, with=FALSE]]

Apply Multiple Functions to Multiple Columns in Data.Table

Apply multiple functions to multiple columns in data.table

Apply multiple functions to multiple columns in data.table by group

Multiple functions on multiple columns by group, and create informative column names

Benchmark

Edit: 2nd benchmark

Edit 2019-03-12: Computing on the language, 3rd benchmark

Apply multiple functions to multiple columns of data table in R

Apply a function to every specified column in a data.table and update by reference

How to apply multiple functions to multiple columns within by?

Generating multiple new columns in data.table using multiple functions applied to multiple columns

Apply different functions to different columns programmatically in data.table R

Apply a function to multiple columns of a SparkDataFrame, at once

What is the most elegant way to apply a function to multiple pairs of columns in a data.table or data.frame?

Related Topics

Leave a reply