Elegantly Assigning Multiple Columns in Data.Table with Lapply()

Elegantly assigning multiple columns in data.table with lapply()

Yes, you're right in question here :

I understand it is more efficient to loop over a vector of column names using := to assign:

for (col in paste0("V", 20:100)) dt[, col := sqrt(dt[[col]]), with = FALSE]

Aside: note that the new way of doing that is :

for (col in paste0("V", 20:100))
  dt[ , (col) := sqrt(dt[[col]])]

because the with = FALSE wasn't easy to read whether it referred to the LHS or the RHS of :=. End aside.

As you know, that's efficient because that does each column one by one, so working memory is only needed for one column at a time. That can make a difference between it working and it failing with the dreaded out-of-memory error.

The problem with lapply on the RHS of := is that the RHS (the lapply) is evaluated first; i.e., the result for the 80 columns is created. That's 80 column's worth of new memory which has to be allocated and populated. So you need 80 column's worth of free RAM for that operation to succeed. That RAM usage dominates vs the subsequently instant operation of assigning (plonking) those 80 new columns into the data.table's column pointer slots.

As @Frank pointed to, if you have a lot of columns (say 10,000 or more) then the small overhead of dispatching to the [.data.table method starts to add up). To eliminate that overhead that there is data.table::set which under ?set is described as a "loopable" :=. I use a for loop for this type of operation. It's the fastest way and is fairly easy to write and read.

for (col in paste0("V", 20:100))
  set(dt, j = col, value = sqrt(dt[[col]]))

Although with just 80 columns, it's unlikely to matter. (Note it may be more common to loop set over a large number of rows than a large number of columns.) However, looped set doesn't solve the problem of the repeated reference to the dt symbol name that you mentioned in the question :

I don't like this because I don't like reference the data.table in a j expression.

Agreed. So the best I can do is revert to your looping of := but use get instead.

for (col in paste0("V", 20:100))
  dt[, (col) := sqrt(get(col))]

However, I fear that using get in j carry an overhead. Benchmarking made in #1380. Also, perhaps it is confusing to use get() on the RHS but not on the LHS. To address that we could sugar the LHS and allow get() as well, #1381 :

for (col in paste0("V", 20:100))
  dt[, get(col) := sqrt(get(col))]

Also, maybe value of set could be run within scope of DT, #1382.

for (col in paste0("V", 20:100))
  set(dt, j = col, value = sqrt(get(col))

data.table: create multiple columns with lapply and .SD

Issue is the with output of scale which is a matrix

dim(scale(dt$A))
#[1] 100   1

so, we need to change it to a vector by removing the dim attributes. Either as.vector or c would do it

dt[ , ( cols_to_define ) := lapply( .SD, function(x) 
          c(scale(x)) ), by = id, .SDcols = cols_to_use ]

When there is no by the matrix dim attributes gets dropped while keeping the other attributes.

What is the most elegant way to apply a function to multiple pairs of columns in a data.table or data.frame?

1) gv Using gv in the collapse package we could do this:

library(collapse)

DT[, (result.cols) := gv(.SD, one.cols) - gv(.SD, two.cols)]

2) gvr We can alternately use the regex variant of gv to eliminate one.cols and two.cols:

library(collapse)

result.cols <- sub(1, 3, gvr(DT, "1$", "names"))
DT[, (result.cols) := gvr(.SD, "1$") - gvr(.SD, "2$")]

3) across Using dplyr we can use across eliminating result.cols as well.

library(dplyr)

DT %>%
  mutate(across(ends_with("1"), .names="{sub(1,3,.col)}") - across(ends_with("2")))

4) data.table If we write it like this it is straight forward in data.table:

DT[, result.cols] <- DT[, ..one.cols] - DT[, ..two.cols]

DT[, (result.cols) := .SD[, one.cols, with=FALSE] - .SD[, two.cols, with=FALSE]]

Update multiple data.table columns elegantly

How about

dt[, (names(dt)) := lapply(.SD, function(x) x/mean(x))]

If you need to specify certain columns, you could use

dt[, 1:40 := lapply(.SD, function(x) x/mean(x)), .SDcols = 1:40]

cols <- names(dt)[c(1,5,10)]
dt[, (cols) := lapply(.SD, function(x) x/mean(x)), .SDcols = cols]

Assignment with multiple lapplys in data.table?

out <- dt[, Map(function(x, nm) if (nm %in% just_first) x[1] else list(x),
                .SD, names(.SD)),
           by = ID, .SDcols = c(use_all, just_first)]
out
#        ID               A               B      C      D
#     <int>          <list>          <list> <char> <char>
#  1:     1       f,b,w,x,g       u,s,y,x,r      f      q
#  2:     5     f,e,l,t,n,j     v,p,i,w,x,b      f      t
#  3:     9         t,h,m,j         p,z,m,n      o      q
#  4:    10 c,b,q,e,n,b,... v,i,w,j,a,s,...      b      a
#  5:     4 v,j,a,i,i,x,... q,y,h,e,p,n,...      j      b
#  6:     2 u,g,k,e,w,u,... l,f,z,f,k,p,...      w      h
#  7:     8     f,c,e,r,h,y     u,k,y,q,e,v      i      e
#  8:     7             z,d             k,q      a      m
#  9:     3           d,p,d           a,j,q      n      f
# 10:     6             v,r             y,o      z      t

# results <- data.table(...) # first of your two `results`
all.equal(out, results[,c(1,4,5,2,3)]) # column-order is different
# [1] TRUE

Reproducible data:

set.seed(42)
dt <- data.table( 
    ID=sample(1:10, 50, replace=TRUE),
    A=letters[sample(1:26, 50, replace=TRUE)],
    B=letters[sample(1:26, 50, replace=TRUE)],
    C=letters[sample(1:26, 50, replace=TRUE)],
    D=letters[sample(1:26, 50, replace=TRUE)]
  )
head(dt, 3)
#       ID      A      B      C      D
#    <int> <char> <char> <char> <char>
# 1:     1      f      u      f      q
# 2:     5      f      v      f      t
# 3:     1      b      s      t      a

Apply a function to every specified column in a data.table and update by reference

This seems to work:

dt[ , (cols) := lapply(.SD, "*", -1), .SDcols = cols]

The result is

    a  b d
1: -1 -1 1
2: -2 -2 2
3: -3 -3 3

There are a few tricks here:

Because there are parentheses in (cols) :=, the result is assigned to the columns specified in cols, instead of to some new variable named "cols".
.SDcols tells the call that we're only looking at those columns, and allows us to use .SD, the Subset of the Data associated with those columns.
lapply(.SD, ...) operates on .SD, which is a list of columns (like all data.frames and data.tables). lapply returns a list, so in the end j looks like cols := list(...).

EDIT: Here's another way that is probably faster, as @Arun mentioned:

for (j in cols) set(dt, j = j, value = -dt[[j]])

set multiple columns in R `data.table` with a named list and `:=`

too long in comment. Not pretty:

dt[, {
    a <- my_fun(widths, heights)   
    for (x in names(a))
        set(dt, j=x, value=a[[x]])
}]

Or you can pass dt into the function if it was created by you?

data.table assignment by reference using lapply and also returning the rest of the columns

Try

x[,  c("a", "b") := lapply(.SD, overwriteNA), .SDcols = c("a", "b")]

Edit:

Per OPs additional request.

myCols <- c("a", "b")  
x[, (myCols) := lapply(.SD, overwriteNA), .SDcols = myCols]

Elegantly Assigning Multiple Columns in Data.Table with Lapply()