Use of lapply .SD in data.table R

Just to illustrate the comments above with an example, let's take

set.seed(10238)
# A and B are the "id" variables within which the
# "data" variables C and D vary meaningfully
DT = data.table(
  A = rep(1:3, each = 5L),
  B = rep(1:5, 3L),
  C = sample(15L),
  D = sample(15L)
)
DT
#     A B  C  D
#  1: 1 1 14 11
#  2: 1 2  3  8
#  3: 1 3 15  1
#  4: 1 4  1 14
#  5: 1 5  5  9
#  6: 2 1  7 13
#  7: 2 2  2 12
#  8: 2 3  8  6
#  9: 2 4  9 15
# 10: 2 5  4  3
# 11: 3 1  6  5
# 12: 3 2 12 10
# 13: 3 3 10  4
# 14: 3 4 13  7
# 15: 3 5 11  2

Compare the following:

# Sum all columns
DT[ , lapply(.SD, sum)]
#     A  B   C   D
# 1: 30 45 120 120

# Sum all columns EXCEPT A, grouping BY A
DT[ , lapply(.SD, sum), by = A]
#    A  B  C  D
# 1: 1 15 38 43
# 2: 2 15 30 49
# 3: 3 15 52 28

# Sum all columns EXCEPT A
DT[ , lapply(.SD, sum), .SDcols = !"A"]
#     B   C   D
# 1: 45 120 120

# Sum all columns EXCEPT A, grouping BY B
DT[ , lapply(.SD, sum), by = B, .SDcols = !"A"]
#    B  C  D
# 1: 1 27 29
# 2: 2 17 30
# 3: 3 33 11
# 4: 4 23 36
# 5: 5 20 14

A few notes:

  • You said "does the below snippet... change all the columns in DT..."

The answer is no, and this is very important for data.table. The object returned is a new data.table, and all of the columns in DT are exactly as they were before running the code.
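
A quick way to convince yourself of this, using the DT defined above:

ans = DT[ , lapply(.SD, sum)]
dim(DT)           # DT is untouched: still 15 rows and 4 columns
# [1] 15  4
sapply(DT, class) # and the columns are still integer
#         A         B         C         D
# "integer" "integer" "integer" "integer"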

  • You mentioned wanting to change the column types

Referring to the point above again, note that your code (DT[ , lapply(.SD, as.factor)]) returns a new data.table and does not change DT at all. One (inefficient) way to do this, the way it is typically done with data.frames in base R, is to overwrite the old data.table with the new one you've returned, i.e., DT = DT[ , lapply(.SD, as.factor)].

This is wasteful because it involves creating copies of DT, which can be an efficiency killer when DT is large. The correct data.table approach is to update the columns by reference using `:=`, e.g., DT[ , names(DT) := lapply(.SD, as.factor)], which creates no copies of your data. See data.table's reference semantics vignette for more on this.
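
A minimal sketch using the DT defined above (working on a copy so DT itself stays intact for the other examples):

DT2 = copy(DT)
DT2[ , names(DT2) := lapply(.SD, as.factor)]  # update all columns in place, no copies
sapply(DT2, class)
#        A        B        C        D
# "factor" "factor" "factor" "factor"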

  • You mentioned comparing the efficiency of lapply(.SD, sum) to that of colSums. sum is internally optimized in data.table; you can see that this is true by adding the verbose = TRUE argument within [].
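
For example (the exact wording of the verbose output varies by data.table version, but it reports which internal optimizations were applied to j):

DT[ , lapply(.SD, sum), by = A, verbose = TRUE]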

To see this in action, let's beef up your DT a bit and run a benchmark:

library(data.table)
set.seed(12039)
nn = 1e7; kk = seq(100L)
DT = setDT(replicate(26L, sample(kk, nn, TRUE), simplify=FALSE))
DT[ , LETTERS[1:2] := .(sample(100L, nn, TRUE), sample(100L, nn, TRUE))]

library(microbenchmark)
microbenchmark(
  times = 100L,
  colsums = colSums(DT[ , !c("A", "B")]),
  lapplys = DT[ , lapply(.SD, sum), .SDcols = !c("A", "B")]
)
# Unit: milliseconds
#     expr       min        lq      mean    median        uq       max neval
#  colsums 1624.2622 2020.9064 2028.9546 2034.3191 2049.9902 2140.8962   100
#  lapplys  246.5824  250.3753  252.9603  252.1586  254.8297  266.1771   100

What does .SD stand for in data.table in R

.SD stands for something like "Subset of Data.table". There's no significance to the initial ".", except that it makes it even more unlikely that there will be a clash with a user-defined column name.

If this is your data.table:

DT = data.table(x=rep(c("a","b","c"),each=2), y=c(1,3), v=1:6)
setkey(DT, y)
DT
#    x y v
# 1: a 1 1
# 2: b 1 3
# 3: c 1 5
# 4: a 3 2
# 5: b 3 4
# 6: c 3 6

Doing this may help you see what .SD is:

DT[ , .SD[ , paste(x, v, sep="", collapse="_")], by=y]
#    y       V1
# 1: 1 a1_b3_c5
# 2: 3 a2_b4_c6

Basically, the by=y statement breaks the original data.table into these two sub-data.tables

DT[ , print(.SD), by=y]
# <1st sub-data.table, called '.SD' while it's being operated on>
#    x v
# 1: a 1
# 2: b 3
# 3: c 5
# <2nd sub-data.table, ALSO called '.SD' while it's being operated on>
#    x v
# 1: a 2
# 2: b 4
# 3: c 6
# <final output, since print() doesn't return anything>
# Empty data.table (0 rows) of 1 col: y

and operates on them in turn.

While it is operating on either one, it lets you refer to the current sub-data.table by using the nick-name/handle/symbol .SD. That's very handy, as you can access and operate on the columns just as if you were sitting at the command line working with a single data.table called .SD ... except that here, data.table will carry out those operations on every single sub-data.table defined by combinations of the key, "pasting" them back together and returning the results in a single data.table!
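
For instance, since .SD behaves like an ordinary data.table inside each group, grabbing the first row of every sub-data.table is just a matter of indexing it (a small illustration using the DT above):

DT[ , .SD[1], by = y]
#    y x v
# 1: 1 a 1
# 2: 3 a 2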

data.table: create multiple columns with lapply and .SD

The issue is with the output of scale(), which is a matrix:

dim(scale(dt$A))
#[1] 100 1

so we need to change it to a vector by removing the dim attribute. Either as.vector() or c() will do it:

dt[ , (cols_to_define) := lapply(.SD, function(x) c(scale(x))),
    by = id, .SDcols = cols_to_use]

When there is no by, the matrix's dim attribute gets dropped while the other attributes are kept.
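
For a self-contained illustration (with made-up data and hypothetical column names mirroring the ones used above):

library(data.table)
set.seed(1)
dt <- data.table(id = rep(1:2, each = 50), A = rnorm(100), B = rnorm(100))
cols_to_use    <- c("A", "B")
cols_to_define <- c("A_scaled", "B_scaled")

# scale() per group, with c() stripping the dim attribute from its matrix output
dt[ , (cols_to_define) := lapply(.SD, function(x) c(scale(x))),
    by = id, .SDcols = cols_to_use]
str(dt$A_scaled)   # a plain numeric vector -- no dim attribute left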

Using .SD column names in lapply() with data tables

The question/answer Access lapply index names inside FUN provided me with inspiration for a solution:

DT[, lapply(seq_along(names(.SD)),
            function(y, n, i) if (n[[i]] == 'carrot') {y[[i]] * 2} else {y[[i]] / 2},
            y = .SD,
            n = names(.SD)),
   .SDcols = names(DT)]
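
The question's DT is not shown; with a small made-up table that has a carrot column you can check the behaviour (carrot is doubled, every other column is halved):

library(data.table)
DT <- data.table(carrot = c(2, 4), potato = c(10, 20))
DT[, lapply(seq_along(names(.SD)),
            function(y, n, i) if (n[[i]] == 'carrot') {y[[i]] * 2} else {y[[i]] / 2},
            y = .SD,
            n = names(.SD)),
   .SDcols = names(DT)]
#    V1 V2
# 1:  4  5
# 2:  8 10

The result columns come back as V1, V2 because the list returned by lapply() is unnamed; wrap the lapply() in setNames(..., names(.SD)) if you want to keep the original column names.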

r data.table lapply with multiple SDcols

If we have multiple columns, then use Map to loop over each of the corresponding sets of columns and apply the function.

library(data.table)
x <- 50
DT[, (new_columns) := Map(function(u, y)
                            newcols_1.5_months(u, DT[['agedays']], x, y),
                          .SD[, new_columns, with = FALSE],
                          .SD[, new_temp_cols, with = FALSE])]

Perhaps the function can be

newcols_1.5_months <- function(u, agedays, x, y) {
  ifelse(agedays == x, u, y)
}

data

DT <- data.table(
  agedays = c(25, 50, 53, 22, 37, 50, 12, 45, 50, 15, 33, 50),
  temp_haz_1.5 = c(1.2, 1.5, 1.7, 2.0, 4.5, 6.7, 6.8, 6.7, 4.5, 6.6, 8.9, 6.7),
  temp_waz_1.5 = c(3.2, 1.8, 6.7, 2.8, 3.5, 7.7, 9.8, 1.7, 6.9, 3.8, 0.9, 4.7),
  haz_1.5_1 = c(1.2, 2.5, 4.7, 7.0, 4.7, 6.8, 6.3, 2.7, 5.5, 8.6, 3.9, 6.7),
  waz_1.5_1 = c(6.2, 2.5, 5.7, 7.0, 2.5, 7.7, 8.8, 9.7, 2.5, 4.6, 5.9, 6.7)
)
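new_columns and new_temp_cols come from the question and are not defined in the answer; definitions consistent with the column names above (an assumption) would be:

new_columns   <- c("haz_1.5_1", "waz_1.5_1")
new_temp_cols <- c("temp_haz_1.5", "temp_waz_1.5")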

Apply a function to every specified column in a data.table and update by reference

This seems to work:

dt[ , (cols) := lapply(.SD, "*", -1), .SDcols = cols]

The result is

    a  b d
1: -1 -1 1
2: -2 -2 2
3: -3 -3 3
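
For context, a minimal setup that reproduces this result (assuming dt has columns a, b and d, and that cols selects a and b):

library(data.table)
dt <- data.table(a = 1:3, b = 1:3, d = 1:3)
cols <- c("a", "b")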

There are a few tricks here:

  • Because there are parentheses in (cols) :=, the result is assigned to the columns specified in cols, instead of to some new variable named "cols".
  • .SDcols tells the call that we're only looking at those columns, and allows us to use .SD, the Subset of the Data associated with those columns.
  • lapply(.SD, ...) operates on .SD, which is a list of columns (like all data.frames and data.tables). lapply returns a list, so in the end j looks like cols := list(...).

EDIT: Here's another way that is probably faster, as @Arun mentioned:

for (j in cols) set(dt, j = j, value = -dt[[j]])

R data.table creating a custom function using lapply to create and reassign multiple variables

OP's code can be turned into an anonymous function which is applied to the selected columns:

library(data.table)
columns <- c("haz_1.5", "waz_1.5")
new_cols <- paste0(columns, "_1")
x <- 61

DT[, (new_cols) := lapply(.SD, function(v) {
  temp <- fifelse(flag, v, NA_real_)
  temp <- nafill(temp, "locf")
  fifelse(agedays == x, temp, NA_real_)
}), .SDcols = columns, by = pid][]
    pid  flag agedays haz_1.5 waz_1.5 haz_1.5_1 waz_1.5_1
 1:   1  TRUE       1       1       1        NA        NA
 2:   1  TRUE      61       1      NA         1         1
 3:   2 FALSE      61       1      NA        NA        NA
 4:   3  TRUE      51       2      NA        NA        NA
 5:   3  TRUE      61      NA      NA         2        NA
 6:   4 FALSE      23       1       2        NA        NA
 7:   4  TRUE      61       3       2         3         2
 8:   5  TRUE       1       2       3        NA        NA
 9:   5  TRUE      32       3       4        NA        NA
10:   5  TRUE      61       4       4         4         4

This is the same result we would get by manually repeating OP's code for the two columns (note that the temp column has to be removed before it is re-used for the second column, because the subsetted assignments by reference only overwrite part of it).

DT[(flag), temp := haz_1.5]
DT[, temp := zoo::na.locf(temp, na.rm = FALSE), by = pid]
DT[agedays == 61, haz_1.5_1 := temp]
DT[, temp := NULL]
DT[(flag), temp := waz_1.5]
DT[, temp := zoo::na.locf(temp, na.rm = FALSE), by = pid]
DT[agedays == 61, waz_1.5_1 := temp]
DT[, temp := NULL][]
    pid  flag agedays haz_1.5 waz_1.5 haz_1.5_1 waz_1.5_1
 1:   1  TRUE       1       1       1        NA        NA
 2:   1  TRUE      61       1      NA         1         1
 3:   2 FALSE      61       1      NA        NA        NA
 4:   3  TRUE      51       2      NA        NA        NA
 5:   3  TRUE      61      NA      NA         2        NA
 6:   4 FALSE      23       1       2        NA        NA
 7:   4  TRUE      61       3       2         3         2
 8:   5  TRUE       1       2       3        NA        NA
 9:   5  TRUE      32       3       4        NA        NA
10:   5  TRUE      61       4       4         4         4

Some explanations

  • There is one important difference between OP's "single column" code and this approach: the anonymous function is called once for each group of pid, whereas in OP's code the first and last assignments work on the ungrouped (full) vectors (which might be somewhat more efficient). However, those assignments do not depend on pid, so the result is the same.
  • Instead of zoo::na.locf(), data.table's nafill() function is used (new with data.table v1.12.4, on CRAN 03 Oct 2019)
  • DT[(flag), ...] is equivalent to DT[flag == TRUE, ...]
  • When fifelse() is used instead of a subsetted assignment by reference, the no argument must be NA for the results to match. Thus, DT[, temp := fifelse(flag, haz_1.5, NA_real_)][] is equivalent to DT[(flag), temp := haz_1.5][]
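
Reproducible data (the question's DT is not shown; this definition is reconstructed from the printed output above):

library(data.table)
DT <- data.table(
  pid     = c(1L, 1L, 2L, 3L, 3L, 4L, 4L, 5L, 5L, 5L),
  flag    = c(TRUE, TRUE, FALSE, TRUE, TRUE, FALSE, TRUE, TRUE, TRUE, TRUE),
  agedays = c(1, 61, 61, 51, 61, 23, 61, 1, 32, 61),
  haz_1.5 = c(1, 1, 1, 2, NA, 1, 3, 2, 3, 4),
  waz_1.5 = c(1, NA, NA, NA, NA, 2, 2, 3, 4, 4)
)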

Assignment with multiple lapplys in data.table?


out <- dt[, Map(function(x, nm) if (nm %in% just_first) x[1] else list(x),
                .SD, names(.SD)),
          by = ID, .SDcols = c(use_all, just_first)]
out
#        ID               A               B      C      D
#     <int>          <list>          <list> <char> <char>
#  1:     1       f,b,w,x,g       u,s,y,x,r      f      q
#  2:     5     f,e,l,t,n,j     v,p,i,w,x,b      f      t
#  3:     9         t,h,m,j         p,z,m,n      o      q
#  4:    10 c,b,q,e,n,b,... v,i,w,j,a,s,...      b      a
#  5:     4 v,j,a,i,i,x,... q,y,h,e,p,n,...      j      b
#  6:     2 u,g,k,e,w,u,... l,f,z,f,k,p,...      w      h
#  7:     8     f,c,e,r,h,y     u,k,y,q,e,v      i      e
#  8:     7             z,d             k,q      a      m
#  9:     3           d,p,d           a,j,q      n      f
# 10:     6             v,r             y,o      z      t

# results <- data.table(...) # first of your two `results`
all.equal(out, results[,c(1,4,5,2,3)]) # column-order is different
# [1] TRUE

Reproducible data:

set.seed(42)
dt <- data.table(
  ID = sample(1:10, 50, replace = TRUE),
  A = letters[sample(1:26, 50, replace = TRUE)],
  B = letters[sample(1:26, 50, replace = TRUE)],
  C = letters[sample(1:26, 50, replace = TRUE)],
  D = letters[sample(1:26, 50, replace = TRUE)]
)
head(dt, 3)
#       ID      A      B      C      D
#    <int> <char> <char> <char> <char>
# 1:     1      f      u      f      q
# 2:     5      f      v      f      t
# 3:     1      b      s      t      a
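
use_all and just_first are taken from the question; judging from the output above (A and B are kept in full as list columns, C and D are reduced to their first value), they would be defined as (an assumption):

use_all    <- c("A", "B")
just_first <- c("C", "D")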

