About GForce in data.table 1.9.2

It's nothing to do with na.rm. What you show worked fine before as well. However, I can see why you might have thought that. Here is the rest of the same NEWS item :

Examples where GForce applies now :
    DT[,sum(x,na.rm=),by=...]                       # yes
    DT[,list(sum(x,na.rm=),mean(y,na.rm=)),by=...]  # yes
    DT[,lapply(.SD,sum,na.rm=),by=...]              # yes
    DT[,list(sum(x),min(y)),by=...]                 # no. gmin not yet available
GForce is a level 2 optimization. To turn it off: options(datatable.optimize=1)
Reminder: to see the optimizations and other info, set verbose=TRUE.

You don't need to do anything to benefit, it's an automatic optimization.

Here's an example on 500 million rows and 4 columns (13GB). First create and illustrate the data :

$ R
R version 3.0.2 (2013-09-25) -- "Frisbee Sailing"
Copyright (C) 2013 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)

> require(data.table)
Loading required package: data.table
data.table 1.9.2 For help type: help("data.table")

> DT = data.table( grp = sample(1e6,5e8,replace=TRUE),
                   a = rnorm(1e6),
                   b = rnorm(1e6),
                   c = rnorm(1e6))
> tables()
     NAME        NROW    MB COLS      KEY
[1,] DT   500,000,000 13352 grp,a,b,c
Total: 13,352MB
> print(DT)
                 grp          a            b          c
         1:   695059 -1.4055192  1.587540028  1.7104991
         2:   915263 -0.8239298 -0.513575696 -0.3429516
         3:   139937 -0.2202024  0.971816721  1.0597421
         4:   651525  1.0026858 -1.157824780  0.3100616
         5:   438180  1.1074729 -2.513939427  0.8357155
        ---
499999996:   705823 -1.4773420  0.004369457 -0.2867529
499999997:   716694 -0.6826147 -0.357086020 -0.4044164
499999998:   217509  0.4939808 -0.012797093 -1.1084564
499999999:   501760  1.7081212 -1.772721799 -0.7119432
500000000:   765653 -1.1141456 -1.569578263  0.4947304

Now time with GForce optimization on (the default). Notice there is no setkey first; this is what's known as "cold by" or "ad hoc by", which is common practice when you want to group in lots of different ways.

> system.time(ans1 <- DT[, lapply(.SD,sum), by=grp])
   user  system elapsed
 47.520   5.651  53.173
> system.time(ans1 <- DT[, lapply(.SD,sum), by=grp])
   user  system elapsed
 47.372   5.676  53.049    # immediate repeat to confirm timing

Now turn off GForce optimization (as per NEWS item) to see the difference it makes :

> options(datatable.optimize=1)

> system.time(ans2 <- DT[, lapply(.SD,sum), by=grp])
   user  system elapsed
 97.274   3.383 100.659
> system.time(ans2 <- DT[, lapply(.SD,sum), by=grp])
   user  system elapsed
 97.199   3.423 100.624    # immediate repeat to confirm timing

Finally, confirm the results are the same :

> identical(ans1,ans2)
[1] TRUE
> print(ans1)
            grp          a          b          c
      1: 695059  16.791281  13.269647 -10.663118
      2: 915263  43.312584 -33.587933   4.490842
      3: 139937   3.967393 -10.386636  -3.766019
      4: 651525  -4.152362   9.339594   7.740136
      5: 438180   4.725874  26.328877   9.063309
     ---
 999996: 372601  -2.087248 -19.936420  21.172860
 999997:  13912  18.414226  -1.744378  -7.951381
 999998: 150074  -4.031619   8.433173 -22.041731
 999999: 385718  11.527876   6.807802   7.405016
1000000: 906246 -13.857315 -23.702011   6.605254

Notice that data.table retains the order of the groups according to when they first appeared. To order the grouped result, use keyby= instead of by=.
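
For example, a minimal sketch on a small toy table (the data is made up just to illustrate the difference):

small <- data.table(grp = c("b", "a", "b", "a"), x = 1:4)
small[, sum(x), by = grp]      # groups appear in order of first appearance: b, then a
small[, sum(x), keyby = grp]   # result is sorted (and keyed) by grp: a, then b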

To turn GForce optimization back on (default is Inf to benefit from all optimizations) :

> options(datatable.optimize=Inf)

Aside : if you're not familiar with the lapply(.SD,...) syntax, it's just a way to apply a function through columns by group. For example, these two lines are equivalent :

DT[, lapply(.SD,sum), by=grp]              # (1)
DT[, list(sum(a),sum(b),sum(c)), by=grp]   # (2) exactly the same

The first form (1) becomes more useful as the number of columns grows, especially in combination with .SDcols to control which subset of columns the function is applied to.
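
As a small hedged sketch of .SDcols (the column names here are invented for illustration):

DT2 <- data.table(grp = rep(1:2, each = 3), a = 1:6, b = 6:1, note = letters[1:6])

# sum only the numeric columns a and b; 'note' is excluded from .SD
DT2[, lapply(.SD, sum), by = grp, .SDcols = c("a", "b")]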

The NEWS item was just trying to convey that it doesn't matter which of these syntaxes is used, or whether you pass na.rm or not; GForce optimization will still be applied. It's saying that you can mix sum() and mean() in one call (which syntax (2) allows), but as soon as you do something else (like min()), GForce won't kick in, since min isn't covered yet; only mean and sum have GForce optimizations currently. You can use verbose=TRUE to see whether GForce is being applied.
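
As a rough way to check this yourself (the exact wording of the verbose output differs between data.table versions, so the comments below are indicative only):

DT3 <- data.table(grp = rep(1:3, each = 4), x = rnorm(12), y = rnorm(12))

DT3[, list(sum(x), mean(y)), by = grp, verbose = TRUE]   # verbose output should report GForce being used
DT3[, list(sum(x), min(y)),  by = grp, verbose = TRUE]   # in 1.9.2 min() had no GForce counterpart, so this fell back to the normal path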

Details of the machine used for this timing :

$ lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 8
On-line CPU(s) list: 0-7
Thread(s) per core: 8
Core(s) per socket: 1
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 62
Stepping: 4
CPU MHz: 2494.022
BogoMIPS: 4988.04
Hypervisor vendor: Xen
Virtualization type: full
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 25600K
NUMA node0 CPU(s): 0-7

Performance of Grouped Operations with data.table

I think your suspicion is correct.
As to why, we could make an educated guess.

EDIT: the information below ignores what data.table does with GForce optimizations;
the functions GForce supports probably avoid copies of the data, similar to your C++ code, but,
as Frank mentioned,
:= didn't try to detect GForce-optimizable expressions before 1.14.3,
and addrs below is certainly not optimized by GForce.

The C++ code you wrote doesn't modify anything in its input data;
it only has to allocate out.
As you correctly said,
data.table aims to be more flexible,
and must support any valid R code the users provide.
That makes me think that,
for grouped operations,
it would have to allocate new R vectors for each subset of z according to the group indices,
effectively copying some of the input multiple times.

I wanted to try to validate my assumption,
so I wrote some C++ code to look at memory addresses:

library(data.table)
library(Rcpp)   # for cppFunction()

set.seed(31L)
DT <- data.table(w = sample(1:3, size = 20, replace = TRUE),
                 x = sample(1:3, size = 20, replace = TRUE),
                 y = sample(1:3, size = 20, replace = TRUE),
                 z = runif(n = 20, min = 0, max = 1))

setkey(DT, w, x, y)

cppFunction(plugins = "cpp11", includes = "#include <sstream>", '
  StringVector addrs(NumericVector z, IntegerVector indices, IntegerVector group_ids) {
    StringVector ans(indices.size());
    for (auto i = 0; i < indices.size(); ++i) {
      std::ostringstream oss;
      oss << "Group" << group_ids[i] << " " << &z[indices[i]];
      ans[i] = oss.str();
    }
    return ans;
  }
')

idx <- DT[, .(indices = .I[1L] - 1L, group_ids = .GRP), keyby = list(w, x, y)]

DT[, list(addrs(z, idx$indices, idx$group_ids))]
# V1
# 1: Group1 0x55b7733361f8
# 2: Group2 0x55b773336200
# 3: Group3 0x55b773336210
# 4: Group4 0x55b773336220
# 5: Group5 0x55b773336228
# 6: Group6 0x55b773336230
# 7: Group7 0x55b773336238
# 8: Group8 0x55b773336248
# 9: Group9 0x55b773336250
# 10: Group10 0x55b773336258
# 11: Group11 0x55b773336260
# 12: Group12 0x55b773336270
# 13: Group13 0x55b773336280
# 14: Group14 0x55b773336288
# 15: Group15 0x55b773336290

As expected here,
if we look at z as a whole,
no copy takes place and the addresses of the different elements are close to each other,
for example 0x55b773336200 = 0x55b7733361f8 + 0x8.
You can execute the last line from the previous code multiple times and it will always show the same addresses.

What I partially didn't expect is this:

DT[, list(addrs(z, 0L, .GRP)), keyby = list(w, x, y)]
# w x y V1
# 1: 1 1 2 Group1 0x55b771b03440
# 2: 1 1 3 Group2 0x55b771b03440
# 3: 1 2 1 Group3 0x55b771b03440
# 4: 1 2 2 Group4 0x55b771b03440
# 5: 1 3 2 Group5 0x55b771b03440
# 6: 1 3 3 Group6 0x55b771b03440
# 7: 2 1 1 Group7 0x55b771b03440
# 8: 2 2 1 Group8 0x55b771b03440
# 9: 2 2 2 Group9 0x55b771b03440
# 10: 2 2 3 Group10 0x55b771b03440
# 11: 2 3 1 Group11 0x55b771b03440
# 12: 3 2 1 Group12 0x55b771b03440
# 13: 3 2 2 Group13 0x55b771b03440
# 14: 3 2 3 Group14 0x55b771b03440
# 15: 3 3 3 Group15 0x55b771b03440

On the one hand,
the address in memory did change,
so something was copied,
and if you run this multiple times you will see different addresses each time.
However, it seems like data.table somehow reuses a buffer with the same address,
maybe allocating one array with the length of the biggest group-subset and copying the different group values to it?
I wonder how they manage that.
Or maybe my code is wrong ¯\_(ツ)_/¯
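
A rough cross-check of the same idea from R alone, using the address() helper that data.table exports (this reuses the DT defined above and is only a sketch, not proof of how the grouping engine works internally):

DT[, .(addr = address(z), n = .N), keyby = .(w, x, y)]
# if a single buffer is reused for the group subsets of z, the same address
# should appear for every group, matching what the Rcpp output above shows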

EDIT: I added the following as the first line of addrs
(double escape because of R parsing)

Rcout << "input length = " << z.size() << "\\n";

and if I run the last DT[...] code from above,
it does print different lengths even though the address is the same.

Using data.table package in R to sum over columns - getting GForce sum(gsum) error

The error says that you cannot sum a character, so I'd say that colA is a character. You can use str(DT) to see the types of the variables in your data.
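
If str() confirms that colA is character, a minimal fix (assuming the column really holds numbers stored as text) is to convert it before summing:

str(DT)                          # check the column types
DT[, colA := as.numeric(colA)]   # convert the character column to numeric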

I created a similar dataset and used the code you provided and it worked for me:

library(data.table)
DT = data.table("Date" = c('01/23/15', '01/24/15', '02/23/15', '02/24/15'),
                "colA" = c(2323, 1212, 1234, 2345),
                "colB" = c(2323, 1112, 1134, 2245),
                "colC" = c(2323, 1012, 1434, 2445),
                "month" = c('january', 'january', 'february', 'february'),
                "year" = c(2015, 2015, 2015, 2015)
)

setkey(DT, month, year)

DT[, lapply(.SD, sum, na.rm=TRUE), by=.(month, year), .SDcols = 2:(length(colnames(DT))-2)]
      month year colA colB colC
1: february 2015 3579 3379 3879
2:  january 2015 3535 3435 3335

rollmean with grouped data.table returns a logical

We can use NA_real_ instead of NA in fill, because a plain NA is of type logical by default.

dt3[x == 'c', class(rollmean(y, k = 7, fill = NA, align = 'right'))] 
#[1] "logical"

With NA_real_ in fill, it would work fine

dt3[,.(ma=rollmean(y, k = 7, fill=NA_real_,align="right")), by = .(x)]
# x ma
# 1: a NA
# 2: a NA
# 3: a NA
# 4: a NA
# 5: a NA
# 6: a NA
# 7: a 0.19653855
# 8: a -0.05506344
# 9: a -0.17022022
#10: a -0.28731762
#11: b NA
#12: b NA
#13: b NA
#14: b NA
#15: b NA
#16: b NA
#17: b 0.02117906
#18: b -0.07079598
#19: b -0.05393943
#20: b 0.04511924
#21: c NA
x ma

In the other groups NAs are created as well, but there they get coerced to numeric NA because those groups also contain non-NA (numeric) elements.
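
As an aside, newer data.table versions provide frollmean(), which always returns a numeric result, so this coercion issue does not come up; a minimal sketch, assuming dt3 has the same x and y columns as above:

# frollmean() pads with numeric NA by default, so no logical column is produced
dt3[, .(ma = frollmean(y, n = 7, align = "right")), by = x]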

What does .SD stand for in data.table in R

.SD stands for something like "Subset of Data.table". There's no significance to the initial ".", except that it makes it even more unlikely that there will be a clash with a user-defined column name.

If this is your data.table:

DT = data.table(x=rep(c("a","b","c"),each=2), y=c(1,3), v=1:6)
setkey(DT, y)
DT
# x y v
# 1: a 1 1
# 2: b 1 3
# 3: c 1 5
# 4: a 3 2
# 5: b 3 4
# 6: c 3 6

Doing this may help you see what .SD is:

DT[ , .SD[ , paste(x, v, sep="", collapse="_")], by=y]
# y V1
# 1: 1 a1_b3_c5
# 2: 3 a2_b4_c6

Basically, the by=y statement breaks the original data.table into these two sub-data.tables

DT[ , print(.SD), by=y]
# <1st sub-data.table, called '.SD' while it's being operated on>
# x v
# 1: a 1
# 2: b 3
# 3: c 5
# <2nd sub-data.table, ALSO called '.SD' while it's being operated on>
# x v
# 1: a 2
# 2: b 4
# 3: c 6
# <final output, since print() doesn't return anything>
# Empty data.table (0 rows) of 1 col: y

and operates on them in turn.

While it is operating on either one, it lets you refer to the current sub-data.table by using the nick-name/handle/symbol .SD. That's very handy, as you can access and operate on the columns just as if you were sitting at the command line working with a single data.table called .SD ... except that here, data.table will carry out those operations on every single sub-data.table defined by combinations of the key, "pasting" them back together and returning the results in a single data.table!
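
For example, a typical use of .SD is to pick one row per group; with the DT defined above:

DT[, .SD[which.max(v)], by=y]   # the row with the largest v within each value of y
#    y x v
# 1: 1 c 5
# 2: 3 c 6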

data.table: subset by observations in last 24 hours, per group

A simple solution is to create a cutoff time for each group (assuming time has already been converted to a date-time, e.g. POSIXct):

dt[, cutoff_time := max(time) - 24*60*60, by = group]
dt[time > cutoff_time]
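
An equivalent variant that returns the same rows without adding a helper column (likely a bit slower, since it materialises .SD for every group) would be:

dt[, .SD[time > max(time) - 24*60*60], by = group]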

EDIT:

The comment on "GForce optimized max" made me curious, so I created some bigger fake data in order to compare speed. Note that integer plays nicely with both max and >=:

require(data.table)
require(microbenchmark)

N = 100000
N_g = 100

all_times = seq(from = as.POSIXct('2016-01-01 10:00:00'),
                to = as.POSIXct('2016-06-30 10:00:00'),
                by = 60)

all_times_int = as.integer(all_times)

idx = sample(seq.int(length(all_times)), N, replace = TRUE)

dt = data.table(group = sample(seq.int(N_g), N, replace = TRUE),
                time = all_times[idx],
                time_int = all_times_int[idx])

f1a = function (x) {
  x[, cutoff_time := max(time) - 24*60*60, by = group]
  x[time >= cutoff_time, list(group, time)]
}

f1b = function (x) {
  x[, cutoff_time := max(time_int) - 24*60*60, by = group]
  x[time_int >= cutoff_time, list(group, time)]
}

f2 = function (x) {
  thresh_dt = x[, .(time = max(time)), by=group][, time := time - 24*60*60]
  thresh_dt[x, on=c("group", "time"), roll=TRUE, nomatch=0][, list(group, time)]
}

microbenchmark(f1a(dt),
               f1b(dt),
               f2(dt))

Unit: milliseconds
    expr       min        lq      mean   median        uq       max neval
 f1a(dt)  9.842106 10.593243 11.593148 11.62311 12.478853 14.335338   100
 f1b(dt)  3.391178  3.763598  4.403264  4.00142  5.018182  8.335717   100
  f2(dt) 14.422669 15.701397 17.090674 16.56990 17.695653 52.926897   100

identical(f1a(dt), f1b(dt))  # TRUE
identical(f1a(dt), f2(dt))   # TRUE

EDIT 2:
And one more with N = 1,000,000 and N_g = 10,000 groups:

> microbenchmark(f1a(dt),
+ f1b(dt),
+ f2(dt),
+ times = 10)
Unit: milliseconds
    expr       min       lq      mean    median        uq      max neval
 f1a(dt) 634.91473 647.5662 670.74597 663.28238 694.29595 728.2481    10
 f1b(dt)  64.61488  67.3692  76.68925  68.42335  72.36862 113.1407    10
  f2(dt) 205.67688 208.6491 229.65610 213.59476 249.16703 278.7713    10

> microbenchmark(f1a(dt),
+ f1b(dt),
+ f2(dt),
+ times = 10)
Unit: milliseconds
    expr       min        lq     mean    median        uq       max neval
 f1a(dt) 620.11090 624.33587 645.0220 642.13648 657.74347 697.27674    10
 f1b(dt)  64.80214  67.43851  67.9140  67.99647  68.63552  69.74466    10
  f2(dt) 198.39200 199.56088 209.6908 204.60183 216.23255 241.76792    10

> microbenchmark(f1a(dt),
+ f1b(dt),
+ f2(dt),
+ times = 10)
Unit: milliseconds
    expr      min        lq      mean    median        uq      max neval
 f1a(dt) 619.2903 645.22617 656.58883 660.99508 664.82678 682.7618    10
 f1b(dt)  63.2454  67.31781  72.10255  68.19679  71.91441 106.7493    10
  f2(dt) 195.9335 210.06171 222.19868 215.75979 241.74100 245.9022    10


Related Topics



Leave a reply



Submit