Can .SD be viewed from a browser within [.data.table()?

Updated in light of Matthew Dowle's comments:

It turns out that .SD is, internally, the environment within which all j expressions are evaluated, including those which don't explicitly reference .SD at all. Filling it with all of DT's columns for each subset of DT is not cheap, timewise, so [.data.table() won't do so unless it really needs to.

Instead, making great use of R's lazy-evaluation of arguments, it previews the unevaluated j expression, and only adds to .SD columns that are referenced therein. If .SD itself is mentioned, it adds all of DT's columns.

So, to view .SD, just include some reference to it in the j-expression. Here is one of many expressions that will work:

library(data.table)
DT = data.table(x=rep(c("a","b","c"),each=3), y=c(1,3,6), v=1:9)

## This works
DT[, if(nrow(.SD)) browser(), by=x]
# Called from: `[.data.table`(DT, , if (nrow(.SD)) browser(), by = x)
Browse[1]> .SD
#    y v
# 1: 1 1
# 2: 3 2
# 3: 6 3

And here are a couple more:

DT[,{.SD; browser()}, by=x]
DT[,{browser(); .SD}, by=x] ## Notice that order doesn't matter

To see for yourself that .SD only loads the columns needed by the j-expression, run each of these in turn (type .SD once you're in the browser environment, and Q to leave it and return to the normal command line):

DT[, {.N * y ; browser()}, by=x]
DT[, {v^2 ; browser()}, by=x]
DT[, {y*v ; browser()}, by=x]
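
For instance, in the first of those calls only y is referenced in the j expression, so .SD should contain just that column; roughly what you'd expect to see for the first group (x == "a"), given the DT defined above:

Browse[1]> .SD
#    y
# 1: 1
# 2: 3
# 3: 6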

.SD columns in data.table in R

To make all of .SD's columns available, you just need to reference it somewhere in your j expression.
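
The x used here isn't defined in this excerpt; a minimal sketch consistent with the output printed below (the values beyond the first group are only a guess) could be:

library(data.table)
x <- data.table(a = c(1, 6, 3, 5), b = c(10, 5, 4, 2), id = c(1, 1, 2, 2))

With a table like that in place, try this: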

x[,{.SD; browser(); a+1},by=id]
# Called from: `[.data.table`(x, , {
# .SD
# browser()
# a + 1
# }, by = id)
Browse[1]> .SD
#    a  b
# 1: 1 10
# 2: 6  5

This works because, as explained in the answer above:

[.data.table() [...] previews the unevaluated j expression, and only adds to .SD columns that are referenced therein. If .SD itself is mentioned, it adds all of DT's columns.


Alternatively, if you don't want to incur the expense of loading .SD's columns for each by-group calculation, you can always inspect the currently loaded subset of x by calling x[.I,]. (.I is a variable which stores the row locations in x of the current group):

x[,{browser(); a+1},by=id]
# Called from: `[.data.table`(x, , {
# browser()
# a + 1
# }, by = id)
Browse[1]> x[.I,]
#    a  b id
# 1: 1 10  1
# 2: 6  5  1

data.table grouping column is length 1 in j

Thanks to all for the candidates.
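All of the candidates below assume mt is a data.table copy of mtcars (its definition isn't shown in this excerpt), presumably something like:

library(data.table)
mt <- as.data.table(mtcars)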

mt[, .(model = .(lm(mpg ~ cyl + disp, data = mt[.I]))), by = .(cyl)]
mt[, .(model = .(lm(mpg ~ cyl + disp))), by =.(cylgroup=cyl)]
mt[, .(model = .(lm(mpg ~ cyl + disp, .SD))), by=cyl, .SDcols=names(mt)]
mt[, .(model = .(lm(mpg ~ cyl + disp, .SD))), by=cyl, .SDcols=TRUE]
mt[, .(model = .(lm(mpg ~ cyl + disp, data = cbind(.SD, as.data.table(.BY))))), by = "cyl"]

Performance (with this small model) shows some small differences:

library(microbenchmark)
microbenchmark(
  c1 = mt[, .(model = .(lm(mpg ~ cyl + disp, data = mt[.I]))), by = .(cyl)],
  c2 = mt[, .(model = .(lm(mpg ~ cyl + disp))), by = .(cylgroup=cyl)],
  c3 = mt[, .(model = .(lm(mpg ~ cyl + disp, .SD))), by=cyl, .SDcols=names(mt)],
  c4 = mt[, .(model = .(lm(mpg ~ cyl + disp, .SD))), by=cyl, .SDcols=TRUE],
  c5 = mt[, .(model = .(lm(mpg ~ cyl + disp, data = cbind(.SD, as.data.table(.BY))))), by = "cyl"]
)
# Unit: milliseconds
# expr    min      lq     mean  median      uq     max neval
#   c1 3.7328 4.21745 4.584591 4.43485 4.57465  9.8924   100
#   c2 2.6740 3.11295 3.244856 3.21655 3.28975  5.6725   100
#   c3 2.8219 3.30150 3.618646 3.46560 3.81250  6.8010   100
#   c4 2.9084 3.27070 3.620761 3.44120 3.86935  6.3447   100
#   c5 5.6156 6.37405 6.832622 6.54625 7.03130 13.8931   100

With larger data:

mtbigger <- rbindlist(replicate(1000, mtcars, simplify=FALSE))
microbenchmark(
  c1 = mtbigger[, .(model = .(lm(mpg ~ cyl + disp, data = mtbigger[.I]))), by = .(cyl)],
  c2 = mtbigger[, .(model = .(lm(mpg ~ cyl + disp))), by = .(cylgroup=cyl)],
  c3 = mtbigger[, .(model = .(lm(mpg ~ cyl + disp, .SD))), by=cyl, .SDcols=names(mtbigger)],
  c4 = mtbigger[, .(model = .(lm(mpg ~ cyl + disp, .SD))), by=cyl, .SDcols=TRUE],
  c5 = mtbigger[, .(model = .(lm(mpg ~ cyl + disp, data = cbind(.SD, as.data.table(.BY))))), by = "cyl"]
)
# Unit: milliseconds
# expr     min       lq     mean  median       uq      max neval
#   c1 27.1635 30.54040 33.98210 32.2859 34.71505  76.5064   100
#   c2 23.9612 25.83105 28.97927 27.5059 30.02720  67.9793   100
#   c3 25.7880 28.27205 31.38212 30.2445 32.79030 105.4742   100
#   c4 25.6469 27.84185 30.52403 29.8286 32.60805  37.8675   100
#   c5 29.2477 32.32465 35.67090 35.0291 37.90410  68.5017   100

(I'm guessing the relative performance scales similarly. A better adjudication might include much wider data.)

Going by median runtime alone, the fastest (by a very small margin) is:

mtbigger[, .(model = .(lm(mpg ~ cyl + disp))), by =.(cylgroup=cyl)]

Different aggregation rules with data.table in R

You can use that general syntax, with just a few changes: (1) you're creating a new data.table (with columns whose length doesn't equal nrow(df)), so you don't need the := or the part before it; (2) you can use mget to get a list of columns to lapply over from a character vector; and (3) use c to concatenate lists together, rather than list, which would create sublists.
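
(df and its weight column aren't defined in this excerpt; to make the code below runnable, a toy table with the right column names might be built along these lines. The printed results further down come from the original question's data, so they won't match these random values.)

library(data.table)
set.seed(42)
n  <- 20
df <- data.table(
  id1 = sample(c("a", "b"), n, replace = TRUE),
  id2 = sample(c("c", "d"), n, replace = TRUE),
  var_sum1  = rnorm(n),             # columns to be summed
  var_sum2  = rnorm(n, mean = 10),
  var_mean1 = rnorm(n, mean = 10),  # columns to be averaged
  var_mean2 = rnorm(n, mean = 15),
  var_weighted_mean = rnorm(n),     # column to be weight-averaged
  weight = runif(n)                 # weights passed to weighted.mean
)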

ids <- c("id1","id2")
summing = c("var_sum1","var_sum2")
averaging = c("var_mean1","var_mean2")
wght_average = c("var_weighted_mean")

df[ , c(lapply(mget(summing), sum),
        lapply(mget(averaging), mean),
        lapply(mget(wght_average), weighted.mean, weight)),
    by = c(ids)]

#    id1 id2   var_sum1  var_sum2 var_mean1 var_mean2 var_weighted_mean
# 1:   a   c -0.4091754 19.469144 10.181026  15.29206        0.06766247
# 2:   a   d -0.9797636  4.884255  8.856079  15.36002        1.43762082
# 3:   b   c -3.0569705 15.284160 10.021045  14.94577       -0.72186913
# 4:   b   d -0.4616429 10.076022  8.442672  15.09100        0.13813689

A possible tidyverse solution is to store the rules in a tibble:

library(tidyverse)

ids = c("id1","id2")
do_over <-
  list(
    summing = c("var_sum1","var_sum2"),
    averaging = c("var_mean1","var_mean2"),
    wght_average = c("var_weighted_mean"))
do_what <-
  list(
    summing = sum,
    averaging = mean,
    wght_average = ~weighted.mean(., weight))

todo <- tibble(do_over, do_what)

todo
# # A tibble: 3 x 2
#   do_over      do_what
#   <named list> <named list>
# 1 <chr [2]>    <fn>
# 2 <chr [2]>    <fn>
# 3 <chr [1]>    <formula>

Then pmap over the tibble to get your output:

pmap_dfc(todo, ~
  df %>%
    group_by_at(ids) %>%
    summarise_at(.x, .y))

# # A tibble: 3 x 11
# # Groups: id1 [2]
#   id1   id2   var_sum1 var_sum2 id11  id21  var_mean1 var_mean2 id12  id22  var_weighted_mean
#   <fct> <fct>    <dbl>    <dbl> <fct> <fct>     <dbl>     <dbl> <fct> <fct>             <dbl>
# 1 a     c        0.152     4.90 a     c          9.04      15.1 a     c                 0.294
# 2 a     d         2.74     16.0 a     d          10.0      14.8 a     d                -0.486
# 3 b     c       -0.112     23.6 b     c          10.2      14.5 b     c                 0.421
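
Because pmap_dfc binds the three per-rule summaries together column-wise, the grouping columns show up repeatedly (renamed id11, id21, id12, id22 to keep names unique). If that's unwanted, one way to drop the duplicates (a sketch, using the column names exactly as printed above) is:

out <- pmap_dfc(todo, ~
  df %>%
    group_by_at(ids) %>%
    summarise_at(.x, .y))
out %>% select(-c(id11, id21, id12, id22))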

R data.table looping over columns to conditionally replace row values

Option 1 using :=:

dt[, (paste0("y", 50:70)) := lapply(.SD, function(x) {x[x<0] <- 0; x}), .SDcols=paste0("y", 50:70)]

Option 2 using set:

for (j in paste0("y", 50:70)) {
  set(dt, dt[, which(get(j) < 0)], j, 0)
}

Data:

library(data.table)
dt <- data.table(id=c(1:1000), x=rnorm(1:1000,60,20))
for (i in 50:70) {
  dt[, paste0("y", i) := i - x]
}
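
Either option replaces the negative values in y50..y70 in place. A quick sanity check (a sketch; the exact numbers depend on the random data above) is that no column's minimum is negative afterwards:

sapply(dt[, paste0("y", 50:70), with = FALSE], min)
# every column's minimum should now be >= 0 (typically exactly 0,
# since some values started out negative)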

