What Does .SD Stand For in data.table in R

What does .SD stand for in data.table in R

.SD stands for something like "Subset of Data.table". There's no significance to the initial ".", except that it makes it even more unlikely that there will be a clash with a user-defined column name.

If this is your data.table:

library(data.table)

DT = data.table(x=rep(c("a","b","c"),each=2), y=c(1,3), v=1:6)
setkey(DT, y)
DT
#    x y v
# 1: a 1 1
# 2: b 1 3
# 3: c 1 5
# 4: a 3 2
# 5: b 3 4
# 6: c 3 6

Doing this may help you see what .SD is:

DT[ , .SD[ , paste(x, v, sep="", collapse="_")], by=y]
#    y       V1
# 1: 1 a1_b3_c5
# 2: 3 a2_b4_c6

Basically, the by=y statement breaks the original data.table into these two sub-data.tables

DT[ , print(.SD), by=y]
# <1st sub-data.table, called '.SD' while it's being operated on>
#    x v
# 1: a 1
# 2: b 3
# 3: c 5
# <2nd sub-data.table, ALSO called '.SD' while it's being operated on>
#    x v
# 1: a 2
# 2: b 4
# 3: c 6
# <final output, since print() doesn't return anything>
# Empty data.table (0 rows) of 1 col: y

and operates on them in turn.

While it is operating on either one, it lets you refer to the current sub-data.table by using the nickname/handle/symbol .SD. That's very handy: you can access and operate on the columns just as if you were sitting at the command line working with a single data.table called .SD, except that here data.table carries out those operations on every sub-data.table defined by the combinations of the key, "pastes" them back together, and returns the results in a single data.table!
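
For instance, here is a small sketch (using the DT above) of the kind of thing that makes .SD handy: indexing into it to pull out the row with the largest v within each group:

DT[ , .SD[which.max(v)], by=y]
#    y x v
# 1: 1 c 5
# 2: 3 c 6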

Row mean and standard deviation using data.table

The issue lies in sd(), which doesn't work row-wise. Base R has rowMeans() and rowSums(), but no row-wise counterpart for sd(), so apply(.SD, 1, sd) is used for that part.

x[ , c("meanY", "sdY", "nY") := .(
       rowMeans(.SD, na.rm = TRUE),
       apply(.SD, 1, sd, na.rm = TRUE),
       rowSums(!is.na(.SD))),
   .SDcols = 2:10]
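
To see it in action, here is a minimal sketch on made-up data (the real x in the question has its measurement columns in positions 2:10; here there are just three, in positions 2:4):

library(data.table)
x <- data.table(id = 1:3,
                y1 = c(1, 2, 3),
                y2 = c(4, NA, 6),
                y3 = c(7, 8, 9))
x[ , c("meanY", "sdY", "nY") := .(
       rowMeans(.SD, na.rm = TRUE),
       apply(.SD, 1, sd, na.rm = TRUE),
       rowSums(!is.na(.SD))),
   .SDcols = 2:4]
x
#    id y1 y2 y3 meanY      sdY nY
# 1:  1  1  4  7     4 3.000000  3
# 2:  2  2 NA  8     5 4.242641  2
# 3:  3  3  6  9     6 3.000000  3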

How to create means and s.d. columns with data.table (based on multiple conditions)

As far as I can tell, you want the sum of con per id for rows where ds is between -365 and 0, and then the mean, median, and standard deviation of those sums. This can be done by chaining two expressions.

DT[ds >= -365 & ds < 0, sum(con), by = "id"][, .(mean(V1), median(V1), sd(V1))]

The first expression subsets your data and sums con per id; the chained second expression then takes the mean, median, and standard deviation of those sums.
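
For illustration, a hypothetical data set with the question's column names (id, ds, con) shows the chain end to end:

library(data.table)
DT <- data.table(id  = rep(1:3, each = 4),
                 ds  = rep(c(-400L, -200L, -100L, 10L), 3),
                 con = 1:12)
DT[ds >= -365 & ds < 0, sum(con), by = "id"][ , .(mean(V1), median(V1), sd(V1))]
#    V1 V2 V3
# 1: 13 13  8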

Use of lapply .SD in data.table R

Just to illustrate the comments above with an example, let's take

set.seed(10238)
# A and B are the "id" variables within which the
# "data" variables C and D vary meaningfully
DT = data.table(
  A = rep(1:3, each = 5L),
  B = rep(1:5, 3L),
  C = sample(15L),
  D = sample(15L)
)
DT
#     A B  C  D
#  1: 1 1 14 11
#  2: 1 2  3  8
#  3: 1 3 15  1
#  4: 1 4  1 14
#  5: 1 5  5  9
#  6: 2 1  7 13
#  7: 2 2  2 12
#  8: 2 3  8  6
#  9: 2 4  9 15
# 10: 2 5  4  3
# 11: 3 1  6  5
# 12: 3 2 12 10
# 13: 3 3 10  4
# 14: 3 4 13  7
# 15: 3 5 11  2

Compare the following:

# Sum all columns
DT[ , lapply(.SD, sum)]
#     A  B   C   D
# 1: 30 45 120 120

# Sum all columns EXCEPT A, grouping BY A
DT[ , lapply(.SD, sum), by = A]
#    A  B  C  D
# 1: 1 15 38 43
# 2: 2 15 30 49
# 3: 3 15 52 28

# Sum all columns EXCEPT A
DT[ , lapply(.SD, sum), .SDcols = !"A"]
#     B   C   D
# 1: 45 120 120

# Sum all columns EXCEPT A, grouping BY B
DT[ , lapply(.SD, sum), by = B, .SDcols = !"A"]
#    B  C  D
# 1: 1 27 29
# 2: 2 17 30
# 3: 3 33 11
# 4: 4 23 36
# 5: 5 20 14

A few notes:

  • You said "does the below snippet... change all the columns in DT..."

The answer is no, and this is very important for data.table. The object returned is a new data.table, and all of the columns in DT are exactly as they were before running the code.
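
A quick sketch of that point, reusing the DT above:

res = DT[ , lapply(.SD, sum), by = A]  # 'res' is a brand-new data.table
identical(dim(DT), c(15L, 4L))         # DT still has its original 15 rows and 4 columns
# [1] TRUE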

  • You mentioned wanting to change the column types

Referring to the point above again, note that your code (DT[ , lapply(.SD, as.factor)]) returns a new data.table and does not change DT at all. One (incorrect) way to do this, as is commonly done with data.frames in base R, is to overwrite the old data.table with the new one you've returned, i.e., DT = DT[ , lapply(.SD, as.factor)].

This is wasteful because it involves creating a copy of DT, which can be an efficiency killer when DT is large. The correct data.table approach is to update the columns by reference using `:=`, e.g., DT[ , names(DT) := lapply(.SD, as.factor)], which creates no copies of your data. See data.table's reference semantics vignette for more on this.
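
To make the by-reference update concrete, here is a small self-contained sketch:

library(data.table)
DT = data.table(a = 1:3, b = c("x", "y", "z"))
sapply(DT, class)
#         a           b
# "integer" "character"

# Convert every column to factor in place; no copy of DT is made
DT[ , names(DT) := lapply(.SD, as.factor)]
sapply(DT, class)
#        a        b
# "factor" "factor"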

  • You mentioned comparing the efficiency of lapply(.SD, sum) to that of colSums. sum is internally optimized in data.table (you can confirm this by adding the verbose = TRUE argument within []); to see this in action, let's beef up your DT a bit and run a benchmark:

Setup and results:

library(data.table)
set.seed(12039)
nn = 1e7; kk = seq(100L)
DT = setDT(replicate(26L, sample(kk, nn, TRUE), simplify = FALSE))
DT[ , LETTERS[1:2] := .(sample(100L, nn, TRUE), sample(100L, nn, TRUE))]

library(microbenchmark)
microbenchmark(
  times = 100L,
  colsums = colSums(DT[ , !c("A", "B")]),
  lapplys = DT[ , lapply(.SD, sum), .SDcols = !c("A", "B")]
)
# Unit: milliseconds
#     expr       min        lq      mean    median        uq       max neval
#  colsums 1624.2622 2020.9064 2028.9546 2034.3191 2049.9902 2140.8962   100
#  lapplys  246.5824  250.3753  252.9603  252.1586  254.8297  266.1771   100

What is the namespace environment for .SD and other similar special variables of data.table?

Unfortunately it seems that .SD, while accessible, is not filled in at the point that myFunA is run. .N is usable, though, as in myFunB below, so the particular example in the question can be modified along those lines to work.

The problem seems to be that data.table optimizes the calculation to avoid filling in .SD if it thinks .SD is not used. If we do reference .SD in j, it is filled in and things can be made to work, as in myFunC below.

library(data.table)
irisDT <- as.data.table(iris)

myFunA <- function(envir = parent.frame()) with(envir, .SD)
irisDT[, myFunA(), by = Species]
## Empty data.table (0 rows and 1 cols): Species

myFunB <- function(envir = parent.frame()) with(envir, .N)
irisDT[, myFunB(), by = Species]
##       Species V1
## 1:     setosa 50
## 2: versicolor 50
## 3:  virginica 50

myFunC <- function(envir = parent.frame()) with(envir, nrow(.SD))
irisDT[, {.SD; myFunC()}, by = Species]
##       Species V1
## 1:     setosa 50
## 2: versicolor 50
## 3:  virginica 50

Apply mean and sd to every column of data table

You are pretty close. Just wrap the lapply() call in unlist() and you are there.

b <- a[ , unlist(lapply(.SD, test))]
b
#    Accuracy.mean      Accuracy.sd Specificity.mean   Specificity.sd Sensitivity.mean   Sensitivity.sd
#          5.50000          3.02765         16.50000          3.02765        105.50000          3.02765
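
If you would rather get a one-row data.table back than a named vector, a common idiom is to wrap the unlisted result in as.list():

a[ , as.list(unlist(lapply(.SD, test)))]
#    Accuracy.mean Accuracy.sd Specificity.mean Specificity.sd Sensitivity.mean Sensitivity.sd
# 1:           5.5     3.02765             16.5        3.02765            105.5        3.02765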

It's also pretty straightforward with dplyr (funs() is deprecated in recent dplyr, so pass a named list of functions instead):

library(dplyr)
summarise_all(a, list(mean = mean, sd = sd))
#   Accuracy_mean Specificity_mean Sensitivity_mean Accuracy_sd Specificity_sd Sensitivity_sd
# 1           5.5             16.5            105.5     3.02765        3.02765        3.02765

Function created by OP

test <- function(x) {
  c(mean = mean(x), sd = sd(x))
}

Sample Data

library(data.table)
a <- data.table(Accuracy = 1:10, Specificity = 12:21, Sensitivity = 101:110)

data.table performance using .SD with by

Here's an approach that is faster than the .I approach for this particular example. Note that it also changes the row order, which may not be desirable for you.

DT[order(col_1), .SD[1L], by = ids, .SDcols = cols[1:3]]

As @Ian Campbell mentions, this is a known GitHub issue. The good news is that there are some optimizations, one of which is .SD[1L]: the subsetting is done entirely in C, which makes it very fast.
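
For context, here is a hypothetical sketch of data in the shape the benchmark assumes (the names ids, col_1, and the cols vector come from the original question; the sizes here are made up):

library(data.table)
set.seed(1)
n <- 1e6
DT <- data.table(ids = sample(1e4L, n, TRUE))
cols <- paste0("col_", 1:10)
DT[ , (cols) := replicate(10L, rnorm(n), simplify = FALSE)]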

Here are the benchmarks, which include @sindri_baldur's solution but leave out your original .SD attempt - I didn't want to wait 3 minutes :).

# A tibble: 3 x 13
  expression      min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time
  <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm>
1 .I            4.54s    4.54s    0.220       30MB    0.880     1     4      4.54s
2 self_join    11.32s   11.32s    0.0883    76.3MB    0         1     0     11.32s
3 use_order     3.55s    3.55s    0.282     58.3MB    0         1     0      3.55s

## show that it's equal but re-ordered:
all.equal(
  DT[DT[ , .(row_id = .I[which.min(col_1)]), by = ids]$row_id,
     c("ids", cols[1:3]), with = FALSE][order(col_1)],
  DT[order(col_1), .SD[1L], by = ids, .SDcols = cols[1:3]]
)
## [1] TRUE

