Use of lapply .SD in data.table R
Just to illustrate the comments above with an example, let's take
set.seed(10238)
# A and B are the "id" variables within which the
# "data" variables C and D vary meaningfully
DT = data.table(
A = rep(1:3, each = 5L),
B = rep(1:5, 3L),
C = sample(15L),
D = sample(15L)
)
DT
# A B C D
# 1: 1 1 14 11
# 2: 1 2 3 8
# 3: 1 3 15 1
# 4: 1 4 1 14
# 5: 1 5 5 9
# 6: 2 1 7 13
# 7: 2 2 2 12
# 8: 2 3 8 6
# 9: 2 4 9 15
# 10: 2 5 4 3
# 11: 3 1 6 5
# 12: 3 2 12 10
# 13: 3 3 10 4
# 14: 3 4 13 7
# 15: 3 5 11 2
Compare the following:
#Sum all columns
DT[ , lapply(.SD, sum)]
# A B C D
# 1: 30 45 120 120
#Sum all columns EXCEPT A, grouping BY A
DT[ , lapply(.SD, sum), by = A]
# A B C D
# 1: 1 15 38 43
# 2: 2 15 30 49
# 3: 3 15 52 28
#Sum all columns EXCEPT A
DT[ , lapply(.SD, sum), .SDcols = !"A"]
# B C D
# 1: 45 120 120
#Sum all columns EXCEPT A, grouping BY B
DT[ , lapply(.SD, sum), by = B, .SDcols = !"A"]
# B C D
# 1: 1 27 29
# 2: 2 17 30
# 3: 3 33 11
# 4: 4 23 36
# 5: 5 20 14
A few notes:
- You said "does the below snippet... change all the columns in
DT
..."
The answer is no, and this is very important for data.table
. The object returned is a new data.table
, and all of the columns in DT
are exactly as they were before running the code.
- You mentioned wanting to change the column types
Referring to the point above again, note that your code (DT[ , lapply(.SD, as.factor)]
) returns a new data.table
and does not change DT
at all. One (incorrect) way to do this, which is done with data.frame
s in base
, is to overwrite the old data.table
with the new data.table
you've returned, i.e., DT = DT[ , lapply(.SD, as.factor)]
.
This is wasteful because it involves creating copies of DT
which can be an efficiency killer when DT
is large. The correct data.table
approach to this problem is to update the columns by reference using`:=`
, e.g., DT[ , names(DT) := lapply(.SD, as.factor)]
, which creates no copies of your data. See data.table
's reference semantics vignette for more on this.
- You mentioned comparing efficiency of
lapply(.SD, sum)
to that ofcolSums
.sum
is internally optimized indata.table
(you can note this is true from the output of adding theverbose = TRUE
argument within[]
); to see this in action, let's beef up yourDT
a bit and run a benchmark:
Results:
library(data.table)
set.seed(12039)
nn = 1e7; kk = seq(100L)
DT = setDT(replicate(26L, sample(kk, nn, TRUE), simplify=FALSE))
DT[ , LETTERS[1:2] := .(sample(100L, nn, TRUE), sample(100L, nn, TRUE))]
library(microbenchmark)
microbenchmark(
times = 100L,
colsums = colSums(DT[ , !c("A", "B")]),
lapplys = DT[ , lapply(.SD, sum), .SDcols = !c("A", "B")]
)
# Unit: milliseconds
# expr min lq mean median uq max neval
# colsums 1624.2622 2020.9064 2028.9546 2034.3191 2049.9902 2140.8962 100
# lapplys 246.5824 250.3753 252.9603 252.1586 254.8297 266.1771 100
What does .SD stand for in data.table in R
.SD
stands for something like "S
ubset of D
ata.table". There's no significance to the initial "."
, except that it makes it even more unlikely that there will be a clash with a user-defined column name.
If this is your data.table:
DT = data.table(x=rep(c("a","b","c"),each=2), y=c(1,3), v=1:6)
setkey(DT, y)
DT
# x y v
# 1: a 1 1
# 2: b 1 3
# 3: c 1 5
# 4: a 3 2
# 5: b 3 4
# 6: c 3 6
Doing this may help you see what .SD
is:
DT[ , .SD[ , paste(x, v, sep="", collapse="_")], by=y]
# y V1
# 1: 1 a1_b3_c5
# 2: 3 a2_b4_c6
Basically, the by=y
statement breaks the original data.table into these two sub-data.tables
DT[ , print(.SD), by=y]
# <1st sub-data.table, called '.SD' while it's being operated on>
# x v
# 1: a 1
# 2: b 3
# 3: c 5
# <2nd sub-data.table, ALSO called '.SD' while it's being operated on>
# x v
# 1: a 2
# 2: b 4
# 3: c 6
# <final output, since print() doesn't return anything>
# Empty data.table (0 rows) of 1 col: y
and operates on them in turn.
While it is operating on either one, it lets you refer to the current sub-data.table
by using the nick-name/handle/symbol .SD
. That's very handy, as you can access and operate on the columns just as if you were sitting at the command line working with a single data.table called .SD
... except that here, data.table
will carry out those operations on every single sub-data.table
defined by combinations of the key, "pasting" them back together and returning the results in a single data.table
!
data.table: create multiple columns with lapply and .SD
Issue is the with output of scale
which is a matrix
dim(scale(dt$A))
#[1] 100 1
so, we need to change it to a vector
by removing the dim
attributes. Either as.vector
or c
would do it
dt[ , ( cols_to_define ) := lapply( .SD, function(x)
c(scale(x)) ), by = id, .SDcols = cols_to_use ]
When there is no by
the matrix
dim
attributes gets dropped while keeping the other attributes.
Using .SD column names in lapply() with data tables
The question/answer Access lapply index names inside FUN provided me with inspiration for a solution:
DT[, lapply(seq_along(names(.SD)),
function(y, n, i) if(n[[i]] == 'carrot') {y[[i]] * 2} else {y[[i]] / 2},
y = .SD,
n = names(.SD)),
.SDcols = names(DT)]
r data.table lapply with multiple SDcols
If we have multiple columns, then use Map
to loop over each of the corresponding sets of columns and apply the function.
library(data.table)
x <- 50
DT[, (new_columns) := Map(function(u, y)
newcols_1.5_months(u, DT[['agedays']], x, y),
.SD[, new_columns, with = FALSE],
.SD[, new_temp_cols, with = FALSE]) ]
Perhaps the function can be
newcols_1.5_months<- function(u, agedays, x, y){
ifelse(agedays==x, u, y)
}
data
DT <- data.table(agedays = c(25,50,53,22,37,50,12,45,50,15,33,50),
temp_haz_1.5 = c(1.2,1.5,1.7,2.0,4.5,6.7,6.8,6.7,4.5,6.6,8.9,6.7),
temp_waz_1.5 = c(3.2,1.8,6.7,2.8,3.5,7.7,9.8,1.7,6.9,3.8,0.9,4.7),
haz_1.5_1 = c(1.2,2.5,4.7,7.0,4.7,6.8,6.3,2.7,5.5,8.6,3.9,6.7),
waz_1.5_1 =c(6.2,2.5,5.7,7.0,2.5,7.7,8.8,9.7,2.5,4.6,5.9,6.7))
Apply a function to every specified column in a data.table and update by reference
This seems to work:
dt[ , (cols) := lapply(.SD, "*", -1), .SDcols = cols]
The result is
a b d
1: -1 -1 1
2: -2 -2 2
3: -3 -3 3
There are a few tricks here:
- Because there are parentheses in
(cols) :=
, the result is assigned to the columns specified incols
, instead of to some new variable named "cols". .SDcols
tells the call that we're only looking at those columns, and allows us to use.SD
, theS
ubset of theD
ata associated with those columns.lapply(.SD, ...)
operates on.SD
, which is a list of columns (like all data.frames and data.tables).lapply
returns a list, so in the endj
looks likecols := list(...)
.
EDIT: Here's another way that is probably faster, as @Arun mentioned:
for (j in cols) set(dt, j = j, value = -dt[[j]])
R data.table creating a custom function using lapply to create and reassign multiple variables
OP's code can be turned into an anonymous function which is applied to the selected columns
:
library(data.table)
columns <- c("haz_1.5", "waz_1.5")
new_cols <- paste0(columns, "_1")
x <- 61
DT[, (new_cols) := lapply(.SD, function(v) {
temp <- fifelse(flag, v, NA_real_)
temp <- nafill(temp, "locf")
fifelse(agedays == x, temp, NA_real_)
}), .SDcols = columns, by = pid][]
pid flag agedays haz_1.5 waz_1.5 haz_1.5_1 waz_1.5_1
1: 1 TRUE 1 1 1 NA NA
2: 1 TRUE 61 1 NA 1 1
3: 2 FALSE 61 1 NA NA NA
4: 3 TRUE 51 2 NA NA NA
5: 3 TRUE 61 NA NA 2 NA
6: 4 FALSE 23 1 2 NA NA
7: 4 TRUE 61 3 2 3 2
8: 5 TRUE 1 2 3 NA NA
9: 5 TRUE 32 3 4 NA NA
10: 5 TRUE 61 4 4 4 4
This is the same result we would get when we manually repeat OP's code for the two columns (note that it is required to clear the temp
column before assigning by reference parts of it.)
DT[(flag), temp := haz_1.5]
DT[, temp := zoo::na.locf(temp, na.rm = FALSE), by = pid]
DT[agedays == 61, haz_1.5_1 := temp]
DT[, temp := NULL]
DT[(flag), temp := waz_1.5]
DT[, temp := zoo::na.locf(temp, na.rm = FALSE), by = pid]
DT[agedays == 61, waz_1.5_1 := temp]
DT[, temp := NULL][]
pid flag agedays haz_1.5 waz_1.5 haz_1.5_1 waz_1.5_1
1: 1 TRUE 1 1 1 NA NA
2: 1 TRUE 61 1 NA 1 1
3: 2 FALSE 61 1 NA NA NA
4: 3 TRUE 51 2 NA NA NA
5: 3 TRUE 61 NA NA 2 NA
6: 4 FALSE 23 1 2 NA NA
7: 4 TRUE 61 3 2 3 2
8: 5 TRUE 1 2 3 NA NA
9: 5 TRUE 32 3 4 NA NA
10: 5 TRUE 61 4 4 4 4
Some explanations
- There is one important difference between OP's "single column" code and this approach: The anonymous function is called for each item in the grouping variable
pid
. In OP's code, the first and last assignments are working on the ungrouped (full) vectors (which might be somewhat more efficient, perhaps). However, the result of those assignments is independent ofpid
and the result is the same. - Instead of
zoo::na.locf()
, data.table'snafill()
function is used (new with data.table v1.12.4, on CRAN 03 Oct 2019) DT[(flag), ...]
is equivalent toDT[flag == TRUE, ...]
- When
fifelse()
is used instead of subsetted assign by reference, theno
parameter must beNA
to be compliant. Thus,DT[, temp := fifelse(flag, haz_1.5, NA_real_)][]
is equivalent toDT[(flag), temp := haz_1.5][]
Assignment with multiple lapplys in data.table?
out <- dt[, Map(function(x, nm) if (nm %in% just_first) x[1] else list(x),
.SD, names(.SD)),
by = ID, .SDcols = c(use_all, just_first)]
out
# ID A B C D
# <int> <list> <list> <char> <char>
# 1: 1 f,b,w,x,g u,s,y,x,r f q
# 2: 5 f,e,l,t,n,j v,p,i,w,x,b f t
# 3: 9 t,h,m,j p,z,m,n o q
# 4: 10 c,b,q,e,n,b,... v,i,w,j,a,s,... b a
# 5: 4 v,j,a,i,i,x,... q,y,h,e,p,n,... j b
# 6: 2 u,g,k,e,w,u,... l,f,z,f,k,p,... w h
# 7: 8 f,c,e,r,h,y u,k,y,q,e,v i e
# 8: 7 z,d k,q a m
# 9: 3 d,p,d a,j,q n f
# 10: 6 v,r y,o z t
# results <- data.table(...) # first of your two `results`
all.equal(out, results[,c(1,4,5,2,3)]) # column-order is different
# [1] TRUE
Reproducible data:
set.seed(42)
dt <- data.table(
ID=sample(1:10, 50, replace=TRUE),
A=letters[sample(1:26, 50, replace=TRUE)],
B=letters[sample(1:26, 50, replace=TRUE)],
C=letters[sample(1:26, 50, replace=TRUE)],
D=letters[sample(1:26, 50, replace=TRUE)]
)
head(dt, 3)
# ID A B C D
# <int> <char> <char> <char> <char>
# 1: 1 f u f q
# 2: 5 f v f t
# 3: 1 b s t a
Related Topics
Generate Paired Stacked Bar Charts in Ggplot (Using Position_Dodge Only on Some Variables)
Solution. How to Install_Github When There Is a Proxy
Why Is the Terminology of Labels and Levels in Factors So Weird
Subset Based on Variable Column Name
Creating a Unique Sequence of Dates
R: += (Plus Equals) and ++ (Plus Plus) Equivalent from C++/C#/Java, etc.
Add Line Break to Axis Labels and Ticks in Ggplot
Writing Multiple Data Frames into .CSV Files Using R
Automatically Create Formulas for All Possible Linear Models
Display Weighted Mean by Group in the Data.Frame
Issue When Importing Dataset: 'Error in Scan(...): Line 1 Did Not Have 145 Elements'
Delete Columns/Rows with More Than X% Missing
Displaying a PDF from a Local Drive in Shiny
How to Access the Help/Documentation .Rd Source Files in R
Cowplot Made Ggplot2 Theme Disappear/How to See Current Ggplot2 Theme, and Restore the Default