Can .SD be viewed from a browser within [.data.table()?

Updated in light of Matthew Dowle's comments:

It turns out that .SD is, internally, the environment within which all j expressions are evaluated, including those which don't explicitly reference .SD at all. Filling it with all of DT's columns for each subset of DT is not cheap, timewise, so [.data.table() won't do so unless it really needs to.

Instead, making great use of R's lazy-evaluation of arguments, it previews the unevaluated j expression, and only adds to .SD columns that are referenced therein. If .SD itself is mentioned, it adds all of DT's columns.

So, to view .SD, just include some reference to it in the j-expression. Here is one of many expressions that will work:

library(data.table)
DT = data.table(x=rep(c("a","b","c"),each=3), y=c(1,3,6), v=1:9)

## This works
DT[, if(nrow(.SD)) browser(), by=x]
# Called from: `[.data.table`(DT, , if (nrow(.SD)) browser(), by = x)
Browse[1]> .SD
#    y v
# 1: 1 1
# 2: 3 2
# 3: 6 3

And here are a couple more:

DT[,{.SD; browser()}, by=x]
DT[,{browser(); .SD}, by=x] ## Notice that order doesn't matter

To see for yourself that .SD only loads the columns needed by the j-expression, run each of these in turn (type .SD once you're in the browser environment, and Q to leave it and return to the normal command line):

DT[, {.N * y ; browser()}, by=x]
DT[, {v^2 ; browser()}, by=x]
DT[, {y*v ; browser()}, by=x]
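
For instance, in the first of those calls only y is referenced in the j expression, so .SD should contain just that column; roughly what you'd expect to see for the first group (x == "a"), given the DT defined above:

Browse[1]> .SD
#    y
# 1: 1
# 2: 3
# 3: 6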

.SD columns in data.table in R

To make all of .SD's columns available, you just need to reference it somewhere in your j expression.
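
The x used here isn't defined in this excerpt; a minimal sketch consistent with the output printed below (the values beyond the first group are only a guess) could be:

library(data.table)
x <- data.table(a = c(1, 6, 3, 5), b = c(10, 5, 4, 2), id = c(1, 1, 2, 2))

With a table like that in place, try this: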

x[,{.SD; browser(); a+1},by=id]
# Called from: `[.data.table`(x, , {
# .SD
# browser()
# a + 1
# }, by = id)
Browse[1]> .SD
#    a  b
# 1: 1 10
# 2: 6  5

This works because, as explained in the answer above:

[.data.table() [...] previews the unevaluated j expression, and only adds to .SD columns that are referenced therein. If .SD itself is mentioned, it adds all of DT's columns.


Alternatively, if you don't want to incur the expense of loading .SD's columns for each by-group calculation, you can always inspect the currently loaded subset of x by calling x[.I,]. (.I is a variable which stores the row locations in x of the current group):

x[,{browser(); a+1},by=id]
# Called from: `[.data.table`(x, , {
# browser()
# a + 1
# }, by = id)
Browse[1]> x[.I,]
#    a  b id
# 1: 1 10  1
# 2: 6  5  1

data.table grouping column is length 1 in j

Thanks to all for the candidates.
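All of the candidates below assume mt is a data.table copy of mtcars (its definition isn't shown in this excerpt), presumably something like:

library(data.table)
mt <- as.data.table(mtcars)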

mt[, .(model = .(lm(mpg ~ cyl + disp, data = mt[.I]))), by = .(cyl)]
mt[, .(model = .(lm(mpg ~ cyl + disp))), by =.(cylgroup=cyl)]
mt[, .(model = .(lm(mpg ~ cyl + disp, .SD))), by=cyl, .SDcols=names(mt)]
mt[, .(model = .(lm(mpg ~ cyl + disp, .SD))), by=cyl, .SDcols=TRUE]
mt[, .(model = .(lm(mpg ~ cyl + disp, data = cbind(.SD, as.data.table(.BY))))), by = "cyl"]

Performance (with this small model) shows some small differences:

library(microbenchmark)
microbenchmark(
  c1 = mt[, .(model = .(lm(mpg ~ cyl + disp, data = mt[.I]))), by = .(cyl)],
  c2 = mt[, .(model = .(lm(mpg ~ cyl + disp))), by = .(cylgroup=cyl)],
  c3 = mt[, .(model = .(lm(mpg ~ cyl + disp, .SD))), by=cyl, .SDcols=names(mt)],
  c4 = mt[, .(model = .(lm(mpg ~ cyl + disp, .SD))), by=cyl, .SDcols=TRUE],
  c5 = mt[, .(model = .(lm(mpg ~ cyl + disp, data = cbind(.SD, as.data.table(.BY))))), by = "cyl"]
)
# Unit: milliseconds
# expr    min      lq     mean  median      uq     max neval
#   c1 3.7328 4.21745 4.584591 4.43485 4.57465  9.8924   100
#   c2 2.6740 3.11295 3.244856 3.21655 3.28975  5.6725   100
#   c3 2.8219 3.30150 3.618646 3.46560 3.81250  6.8010   100
#   c4 2.9084 3.27070 3.620761 3.44120 3.86935  6.3447   100
#   c5 5.6156 6.37405 6.832622 6.54625 7.03130 13.8931   100

With larger data:

mtbigger <- rbindlist(replicate(1000, mtcars, simplify=FALSE))
microbenchmark(
  c1 = mtbigger[, .(model = .(lm(mpg ~ cyl + disp, data = mtbigger[.I]))), by = .(cyl)],
  c2 = mtbigger[, .(model = .(lm(mpg ~ cyl + disp))), by = .(cylgroup=cyl)],
  c3 = mtbigger[, .(model = .(lm(mpg ~ cyl + disp, .SD))), by=cyl, .SDcols=names(mtbigger)],
  c4 = mtbigger[, .(model = .(lm(mpg ~ cyl + disp, .SD))), by=cyl, .SDcols=TRUE],
  c5 = mtbigger[, .(model = .(lm(mpg ~ cyl + disp, data = cbind(.SD, as.data.table(.BY))))), by = "cyl"]
)
# Unit: milliseconds
# expr     min       lq     mean  median       uq      max neval
#   c1 27.1635 30.54040 33.98210 32.2859 34.71505  76.5064   100
#   c2 23.9612 25.83105 28.97927 27.5059 30.02720  67.9793   100
#   c3 25.7880 28.27205 31.38212 30.2445 32.79030 105.4742   100
#   c4 25.6469 27.84185 30.52403 29.8286 32.60805  37.8675   100
#   c5 29.2477 32.32465 35.67090 35.0291 37.90410  68.5017   100

(I'm guessing the relative performance scales similarly. A better adjudication might include much wider data.)

Going by median runtime alone, the fastest (by a very small margin) is:

mtbigger[, .(model = .(lm(mpg ~ cyl + disp))), by =.(cylgroup=cyl)]

Different aggregation rules with data.table in R

You can use that general syntax, with just a few changes: (1) you're creating a new data.table (with columns whose length doesn't equal nrow(df)), so you don't need the := or the part before it; (2) you can use mget to get a list of columns to lapply over from a character vector; and (3) use c to concatenate lists together, rather than list, which would create sublists.
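
(df and its weight column aren't defined in this excerpt; to make the code below runnable, a toy table with the right column names might be built along these lines. The printed results further down come from the original question's data, so they won't match these random values.)

library(data.table)
set.seed(42)
n  <- 20
df <- data.table(
  id1 = sample(c("a", "b"), n, replace = TRUE),
  id2 = sample(c("c", "d"), n, replace = TRUE),
  var_sum1  = rnorm(n),             # columns to be summed
  var_sum2  = rnorm(n, mean = 10),
  var_mean1 = rnorm(n, mean = 10),  # columns to be averaged
  var_mean2 = rnorm(n, mean = 15),
  var_weighted_mean = rnorm(n),     # column to be weight-averaged
  weight = runif(n)                 # weights passed to weighted.mean
)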

ids <- c("id1","id2")
summing = c("var_sum1","var_sum2")
averaging = c("var_mean1","var_mean2")
wght_average = c("var_weighted_mean")

df[ , c(lapply(mget(summing), sum),
        lapply(mget(averaging), mean),
        lapply(mget(wght_average), weighted.mean, weight)),
    by = c(ids)]

#    id1 id2   var_sum1  var_sum2 var_mean1 var_mean2 var_weighted_mean
# 1:   a   c -0.4091754 19.469144 10.181026  15.29206        0.06766247
# 2:   a   d -0.9797636  4.884255  8.856079  15.36002        1.43762082
# 3:   b   c -3.0569705 15.284160 10.021045  14.94577       -0.72186913
# 4:   b   d -0.4616429 10.076022  8.442672  15.09100        0.13813689

A possible tidyverse solution is to store the rules in a tibble:

library(tidyverse)

ids = c("id1","id2")
do_over <-
  list(
    summing = c("var_sum1","var_sum2"),
    averaging = c("var_mean1","var_mean2"),
    wght_average = c("var_weighted_mean"))
do_what <-
  list(
    summing = sum,
    averaging = mean,
    wght_average = ~weighted.mean(., weight))

todo <- tibble(do_over, do_what)

todo
# # A tibble: 3 x 2
#   do_over      do_what
#   <named list> <named list>
# 1 <chr [2]>    <fn>
# 2 <chr [2]>    <fn>
# 3 <chr [1]>    <formula>

Then pmap over the tibble to get your output:

pmap_dfc(todo, ~
  df %>%
    group_by_at(ids) %>%
    summarise_at(.x, .y))

# # A tibble: 3 x 11
# # Groups: id1 [2]
#   id1   id2   var_sum1 var_sum2 id11  id21  var_mean1 var_mean2 id12  id22  var_weighted_mean
#   <fct> <fct>    <dbl>    <dbl> <fct> <fct>     <dbl>     <dbl> <fct> <fct>             <dbl>
# 1 a     c        0.152     4.90 a     c          9.04      15.1 a     c                 0.294
# 2 a     d         2.74     16.0 a     d          10.0      14.8 a     d                -0.486
# 3 b     c       -0.112     23.6 b     c          10.2      14.5 b     c                 0.421
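
Because pmap_dfc binds the three per-rule summaries together column-wise, the grouping columns show up repeatedly (renamed id11, id21, id12, id22 to keep names unique). If that's unwanted, one way to drop the duplicates (a sketch, using the column names exactly as printed above) is:

out <- pmap_dfc(todo, ~
  df %>%
    group_by_at(ids) %>%
    summarise_at(.x, .y))
out %>% select(-c(id11, id21, id12, id22))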

R data.table looping over columns to conditionally replace row values

Option 1 using :=:

dt[, (paste0("y", 50:70)) := lapply(.SD, function(x) {x[x<0] <- 0; x}), .SDcols=paste0("y", 50:70)]

Option 2 using set:

for (j in paste0("y", 50:70)) {
  set(dt, dt[, which(get(j) < 0)], j, 0)
}

Data:

library(data.table)
dt <- data.table(id=c(1:1000), x=rnorm(1:1000,60,20))
for (i in 50:70) {
  dt[, paste0("y", i) := i - x]
}
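
Either option replaces the negative values in y50..y70 in place. A quick sanity check (a sketch; the exact numbers depend on the random data above) is that no column's minimum is negative afterwards:

sapply(dt[, paste0("y", 50:70), with = FALSE], min)
# every column's minimum should now be >= 0 (typically exactly 0,
# since some values started out negative)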

