Row-By-Row Operations and Updates in Data.Table

Row operations in data.table using `by = .I`

UPDATE:

As of data.table version 1.14.3, by = .I has been implemented to work as the OP expected for row-wise grouping. Note that using by = .I creates a new column in the data.table called I that holds the row numbers. The row-number column can then be kept or deleted according to preference.
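
For example, a minimal sketch on data.table 1.14.3 or later (the version noted above); per the note above, this also adds a row-number column I that can be dropped afterwards:

dt <- data.table(V0 = LETTERS[c(1,1,2,2,3)], V1 = 1:5, V2 = 3:7, V3 = 5:1)
dt[, sdd := rowSums(.SD[, 2:4]), by = .I]   # row-wise grouping; also adds column I
dt[, I := NULL]                             # drop the row-number column if not wanted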

The following parts of this answer record an earlier version that pertains to older versions of data.table. I keep it here for reference in case someone still uses legacy versions.


Note: section (3) of this answer was updated in April 2019, due to many changes in data.table over time that rendered the original version obsolete. Also, use of the argument with= has been removed from all instances of data.table, as it has since been deprecated.

1) Well, one reason not to use it, at least for the rowSums example, is performance, together with the creation of an unnecessary column. Compare option f2 below, which is almost 4x faster and does not need the rowpos column (note that the original question used rowSums as the example function, to which this part of the answer responds; the OP edited the question afterwards to use a different function, for which part 3 of this answer is more relevant):

dt <- data.table(V0 = LETTERS[c(1,1,2,2,3)], V1 = 1:5, V2 = 3:7, V3 = 5:1)
f1 <- function(dt) {
  dt[, rowpos := .I]
  dt[, sdd := rowSums(.SD[, 2:4]), by = rowpos]
}
f2 <- function(dt) dt[, sdd := rowSums(.SD), .SDcols = 2:4]

library(microbenchmark)
microbenchmark(f1(dt),f2(dt))
# Unit: milliseconds
#    expr      min       lq     mean   median       uq      max neval cld
#  f1(dt) 3.669049 3.732434 4.013946 3.793352 3.972714 5.834608   100   b
#  f2(dt) 1.052702 1.085857 1.154132 1.105301 1.138658 2.825464   100  a

2) On your second question: although dt[, sdd := sum(.SD[, 2:4]), by = .I] does not work, dt[, sdd := sum(.SD[, 2:4]), by = 1:NROW(dt)] works perfectly. Given that, according to ?data.table, ".I is an integer vector equal to seq_len(nrow(x))", one might expect these to be equivalent. The difference, however, is that .I is intended for use in j, not in by. NB the value of .I is calculated internally in data.table, so it is not available beforehand to be passed in as a parameter value, as in by=.I.
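
For example (a minimal sketch, reusing dt from part 1 on a legacy version where by = .I is not implemented):

dt[, sdd := sum(.SD[, 2:4]), by = 1:NROW(dt)]   # works: one group per row
# dt[, sdd := sum(.SD[, 2:4]), by = .I]         # does not group by row on legacy versions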

It might also be expected that by = .I should just throw an error. But this does not occur, because loading the data.table package creates an object .I in the data.table namespace that is accessible from the global environment, and whose value is NULL. You can test this by typing .I at the command prompt. (Note, the same applies to .SD, .EACHI, .N, .GRP, and .BY)

.I
# Error: object '.I' not found
library(data.table)
.I
# NULL
data.table::.I
# NULL

The upshot of this is that the behaviour of by = .I is equivalent to by = NULL.
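
A quick check of this equivalence (on legacy versions where by = .I is not implemented, using the dt from part 1):

identical(dt[, sum(V1), by = .I], dt[, sum(V1), by = NULL])
# [1] TRUE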

3) We have already seen in part 1 that, in the case of rowSums (which already loops row-wise efficiently), there are much faster ways than creating the rowpos column. But what about looping when we don't have a fast row-wise function?

Benchmarking the by = rowpos and by = 1:NROW(dt) versions against a for loop with set() is informative here. We find that looping over set() in a for loop is slower than either of the methods that use data.table's by argument for looping. However, there is negligible difference in timing between the by loop that creates an additional column and the one that uses seq_len(NROW(dt)). Absent any performance difference, it seems that f.nrow is probably preferable, but only on the basis of being more concise and not creating an unnecessary column:

dt <- data.table(V0 = rep(LETTERS[c(1,1,2,2,3)], 1e3), V1 = 1:5, V2 = 3:7, V3 = 5:1)

f.rowpos <- function() {
  dt[, rowpos := .I]
  dt[, sdd := sum(.SD[, 2:4]), by = rowpos]
}

f.nrow <- function() {
  dt[, sdd := sum(.SD[, 2:4]), by = seq_len(NROW(dt))]
}

f.forset <- function() {
  for (i in seq_len(NROW(dt))) set(dt, i, 'sdd', sum(dt[i, 2:4]))
}

microbenchmark(f.rowpos(),f.nrow(), f.forset(), times = 5)
# Unit: milliseconds
#        expr       min        lq      mean    median        uq       max neval
#  f.rowpos()  559.1115  575.3162  580.2853  578.6865  588.5532  599.7591     5
#    f.nrow()  558.4327  582.4434  584.6893  587.1732  588.6689  606.7282     5
#  f.forset() 1172.6560 1178.8399 1298.4842 1255.4375 1292.7393 1592.7486     5

So, in conclusion, even in situations where there is no optimised function such as rowSums that already operates by row, there are alternatives to using a rowpos column that, although not faster, do not require creating a redundant column.

Row operations in data.table

A few things:

  1. dt[, genesum:=lapply(.SD,sum), by=gene] and dt[, genesum:=apply(dt[ ,-1],1, sum)] are quite different.

    • dt[, genesum:=lapply(.SD,sum), by=gene] loops over the columns of the .SD data.table and sums them

    • dt[, genesum:=apply(dt[, -1], 1, sum)] loops over the rows (i.e. apply(x, 1, FUN) applies FUN to every row of x); a small example contrasting the two appears at the end of this answer.

  2. I think you can get what you want by calling rowSums, like so:

    dt[, genesum := rowSums(dt[, -1])]

Is that what you're after?
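
To make the contrast concrete, here is a small hypothetical table (the column names are made up) showing the two behaviours side by side:

library(data.table)
dt <- data.table(gene = c("g1", "g1", "g2"), s1 = 1:3, s2 = 4:6)

dt[, rowSums(.SD), .SDcols = c("s1", "s2")]   # row-wise: one sum per row
# [1] 5 7 9

dt[, lapply(.SD, sum), by = gene]             # column-wise sums within each gene group
#    gene s1 s2
# 1:   g1  3  9
# 2:   g2  3  6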

sequentially update rows in data.table

I'd write a simple Rcpp function instead of spending time trying to find a vectorized R solution:

library(Rcpp)
sourceCpp(code = "
#include <Rcpp.h>
// [[Rcpp::export]]
Rcpp::IntegerVector myfun(const Rcpp::IntegerVector x, const Rcpp::IntegerVector y) {
  // copy the input so the original column is not modified in place
  Rcpp::IntegerVector res = Rcpp::clone(x);
  res(0) = x(0) + y(0);
  for (int i = 1; i < x.length(); i++) {
    if (x(i) >= res(i-1)) res(i) += y(i);
    else res(i) = res(i-1) + y(i);
  }
  return res;
}
")
tempData[, enddt1 := myfun(startdt, daysupp)]
#    drugName    startdt daysupp      enddt     enddt1
# 1: Aspirine 2012-01-01      30 2012-01-31 2012-01-31
# 2: Aspirine 2012-01-20      30 2012-03-01 2012-03-01
# 3: Aspirine 2012-02-15      10 2012-03-11 2012-03-11
# 4: Aspirine 2012-03-10      20 2012-03-31 2012-03-31

data.table: Perform efficient row-wise operation on large data.table with columns as input

This is difficult. strsplit will not be very memory efficient for this 100-million-row dataset: each row requires two lists to be made from strsplit. My suggestion is to use a function and skip the by = 1:.N step.

exposed = function(before, after) {
  out = vector(length = length(before))
  for (i in seq_along(before)) {
    bef = before[i]
    aft = after[i]
    # with fixed = TRUE the split pattern is the literal "+" (not the regex "[+]")
    if (bef == "NONE" || aft == "NONE")
      out[i] = FALSE
    else
      out[i] = any(!unlist(strsplit(aft, "+", fixed = TRUE), use.names = FALSE) %chin%
                     unlist(strsplit(bef, "+", fixed = TRUE), use.names = FALSE))
  }
  return(out)
}

DT[, TI3 := exposed(exposure.before.index, exposure)]

> DT[, .(exposure.before.index, exposure, TI, TI3)]
   exposure.before.index          exposure    TI   TI3
1:                 drugA             drugA FALSE FALSE
2:                 drugA       drugA+drugB  TRUE  TRUE
3:                 drugA drugA+drugB+drugC  TRUE  TRUE
4:                 drugB             drugB FALSE FALSE
5:                 drugB             drugC  TRUE  TRUE
6:                  NONE              NONE FALSE FALSE
7:                  NONE              NONE FALSE FALSE

Note there are a few optimizations here:

  1. Using %chin% instead of %in%: it is a data.table utility function that is faster than %in% on character vectors.
  2. Using strsplit(..., fixed = TRUE): this isn't a regular expression we are using, so the fixed match is likely the biggest performance boost (a tiny illustration follows this list).
  3. Using unlist(..., use.names = FALSE) to skip constructing names we don't need.
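
A tiny illustration of points 1 and 2 (a quick sketch on a single string; %chin% is exported by data.table):

x <- "drugA+drugB+drugC"
strsplit(x, "+", fixed = TRUE)[[1]]                 # literal "+" split, no regex engine
# [1] "drugA" "drugB" "drugC"
c("drugB", "drugD") %chin% strsplit(x, "+", fixed = TRUE)[[1]]
# [1]  TRUE FALSE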

The next step would be to turn the function into an Rcpp function, which is not done here. Strings are more complicated than numbers in Rcpp (at least for me).
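
For anyone who does want to go down that route, below is a rough sketch of what such an Rcpp function could look like; the name exposed_cpp and the manual '+'-splitting via std::getline are my own assumptions, not part of the original answer:

library(Rcpp)
sourceCpp(code = "
#include <Rcpp.h>
#include <sstream>
#include <set>
// [[Rcpp::export]]
Rcpp::LogicalVector exposed_cpp(Rcpp::CharacterVector before, Rcpp::CharacterVector after) {
  R_xlen_t n = before.size();
  Rcpp::LogicalVector out(n);
  for (R_xlen_t i = 0; i < n; i++) {
    std::string bef = Rcpp::as<std::string>(before[i]);
    std::string aft = Rcpp::as<std::string>(after[i]);
    if (bef == \"NONE\" || aft == \"NONE\") { out[i] = false; continue; }
    // collect the drugs present before the index date
    std::set<std::string> seen;
    std::string tok;
    std::istringstream ss_bef(bef);
    while (std::getline(ss_bef, tok, '+')) seen.insert(tok);
    // TRUE if any drug in 'after' was not present in 'before'
    bool any_new = false;
    std::istringstream ss_aft(aft);
    while (std::getline(ss_aft, tok, '+')) {
      if (seen.count(tok) == 0) { any_new = true; break; }
    }
    out[i] = any_new;
  }
  return out;
}
")

DT[, TI4 := exposed_cpp(exposure.before.index, exposure)]   # hypothetical usage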

Here's the performance of this function. For the 7-row example, it is 4 times faster. But as we increase the number of rows, the speed difference becomes less significant:

## 7 rows
Unit: microseconds
   expr      min       lq     mean   median       uq       max
 use_fx  375.801  395.251  662.582  409.751  431.351 21345.701
     OP 1889.901 2021.601 2211.858 2096.101 2285.201  4042.801

## 700,000 rows
Unit: seconds
   expr       min        lq      mean    median        uq       max
 use_fx  4.409595  4.409595  4.409595  4.409595  4.409595  4.409595
     OP 12.592520 12.592520 12.592520 12.592520 12.592520 12.592520

## 7,000,000 rows
Unit: seconds
   expr       min        lq      mean    median        uq       max
 use_fx  43.90979  43.90979  43.90979  43.90979  43.90979  43.90979
     OP 130.16418 130.16418 130.16418 130.16418 130.16418 130.16418

## code used:
DT_big = DT[rep(seq_len(.N), 1e5)]
microbenchmark(
  use_fx = DT_big[, TI3 := exposed(exposure.before.index, exposure)],
  OP = {
    DT_big[, CNT := 1:.N]
    DT_big[!(exposure.before.index != "NONE" & exposure == "NONE"),
           TI := (any(!unlist(strsplit(exposure, "[+]")) %in%
                        unlist(strsplit(exposure.before.index, "[+]")))),
           by = "CNT"]
    DT_big[is.na(TI), TI := FALSE]
  },
  times = 1L
)

If you are interested in Rcpp, this may be helpful:

https://wckdouglas.github.io/2015/05/string-manipulation

Use by = each row for data table

The following data.table code worked for me.

 dt[, average := rowMeans(.SD)]

As pointed out by @jangorecki, it is possible to construct your own function to run by row as long as you remember that each row is a list object:

# my function, must unlist the argument
myMean <- function(i, ...) mean(unlist(i), ...)

Using by = seq_len(nrow(dt)):

dt[, averageNew := myMean(.SD), by = seq_len(nrow(dt))]

Using by = row.names(dt):

dt[, averageOther := myMean(.SD), by = row.names(dt)]
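
As a quick sanity check, the three approaches should produce matching results (assuming dt contains only numeric columns, as above):

dt[, all.equal(average, averageNew)]
# [1] TRUE
dt[, all.equal(average, averageOther)]
# [1] TRUE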

R data.table: optimize speed of row operations by (different) groups

In the OP's code, we don't need the == subset once we set the key, i.e. the first setkey is enough. Then join on 'DATE' while doing the subtraction of PRICE and i.PRICE:

setkeyv(DT, cols=c("FRUIT", "DATE"))
DT[DT[.(chosen_fruit)], RESULTS := PRICE - i.PRICE, on = .(DATE)]

Another option is to group by 'DATE' and subtract from 'PRICE' the corresponding PRICE where 'FRUIT' is 'GRAPE':

library(data.table)
DT[, RESULTS := PRICE - PRICE[FRUIT == 'GRAPE'], DATE]

Output:

DT
DATE FRUIT PRICE RESULTS
1: 2020-03-01 BANANA 30.000 29.500
2: 2020-03-02 BANANA 30.060 29.430
3: 2020-03-03 BANANA 30.120 29.360
4: 2020-03-04 BANANA 30.180 29.290
5: 2020-03-05 BANANA 30.240 29.220
6: 2020-03-06 BANANA 30.300 29.150
7: 2020-03-07 BANANA 30.360 29.080
8: 2020-03-08 BANANA 30.420 29.010
9: 2020-03-01 ORANGE 5.000 4.500
10: 2020-03-02 ORANGE 5.035 4.405
11: 2020-03-03 ORANGE 5.070 4.310
12: 2020-03-04 ORANGE 5.105 4.215
13: 2020-03-05 ORANGE 5.140 4.120
14: 2020-03-06 ORANGE 5.175 4.025
15: 2020-03-07 ORANGE 5.210 3.930
16: 2020-03-08 ORANGE 5.245 3.835
17: 2020-03-01 APPLE 12.000 11.500
18: 2020-03-02 APPLE 12.600 11.970
19: 2020-03-03 APPLE 13.200 12.440
20: 2020-03-04 APPLE 13.800 12.910
21: 2020-03-05 APPLE 14.400 13.380
22: 2020-03-06 APPLE 15.000 13.850
23: 2020-03-07 APPLE 15.600 14.320
24: 2020-03-08 APPLE 16.200 14.790
25: 2020-03-01 LEMON 10.000 9.500
26: 2020-03-02 LEMON 10.010 9.380
27: 2020-03-03 LEMON 10.020 9.260
28: 2020-03-04 LEMON 10.030 9.140
29: 2020-03-05 LEMON 10.040 9.020
30: 2020-03-06 LEMON 10.050 8.900
31: 2020-03-07 LEMON 10.060 8.780
32: 2020-03-08 LEMON 10.070 8.660
33: 2020-03-01 GRAPE 0.500 0.000
34: 2020-03-02 GRAPE 0.630 0.000
35: 2020-03-03 GRAPE 0.760 0.000
36: 2020-03-04 GRAPE 0.890 0.000
37: 2020-03-05 GRAPE 1.020 0.000
38: 2020-03-06 GRAPE 1.150 0.000
39: 2020-03-07 GRAPE 1.280 0.000
40: 2020-03-08 GRAPE 1.410 0.000

Another option is to dcast to 'wide' format and then do the subtraction:

dt_wide <- dcast(DT, DATE ~ FRUIT, value.var = 'PRICE')
nm1 <- names(dt_wide)[-1]
dt_wide[, (nm1) := lapply(.SD, function(x) x - GRAPE), .SDcols = nm1]
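
If the long format is needed afterwards, the wide result can be melted back (a sketch; the variable and value names below are assumptions):

result_long <- melt(dt_wide, id.vars = "DATE", variable.name = "FRUIT", value.name = "RESULTS")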

Benchmarks

Tested on a slightly bigger dataset by changing the sample_size in constructing the input data

sample_size <- 1000000
dim(DT)
#[1] 5000000 3

system.time(DT[DT[.(chosen_fruit)], RESULTS := PRICE - i.PRICE, on = .(DATE)])
# user system elapsed
# 0.287 0.039 0.326

system.time({ DT[DT[FRUIT == chosen_fruit], RESULTS := PRICE - i.PRICE, on = .(DATE)] })
# user system elapsed
# 0.294 0.006 0.300

system.time({
  setkey(DT, DATE)
  DT[DT[FRUIT == chosen_fruit], RESULTS := PRICE - i.PRICE]
  setkey(DT, FRUIT)
})
#   user  system elapsed
#  0.431   0.045   0.476

system.time(DT[, RESULTS := PRICE - PRICE[FRUIT == 'GRAPE'], DATE])
# user system elapsed
# 6.660 0.039 6.665

system.time({
  dt_wide <- dcast(DT, DATE ~ FRUIT, value.var = 'PRICE')
  nm1 <- names(dt_wide)[-1]
  dt_wide[, (nm1) := lapply(.SD, function(x) x - GRAPE), .SDcols = nm1]
})
#   user  system elapsed
#  0.868   0.060   0.926

Preferred performant procedure for R data.table row-wise operations?

I think you can use matrix multiplication and other vectorization techniques to simplify your code, which lets you avoid running the function logpost row by row.


Below is a vectorized version of logpost, i.e., logpost2

logpost2 <- function(d, dd, mub = 1, taub = 10, a = 0.5, z = 0.7) {
  bmat <- as.matrix(dd[, .(b1, b2, b3)])
  xmat <- cbind(1, as.matrix(d[, .(x1, x2)]))
  phi <- dd$phi
  phi_log <- log(phi)
  lp <- -(a + nrow(d) + 1) * phi_log -
    (1 / (2 * phi^2)) * colSums((d$y - tcrossprod(xmat, bmat))^2) -
    (1 / (2 * taub^2)) * rowSums((bmat - mub)^2) - (z / phi)
  lp
}

and you will see

> start <- Sys.time()

> grid[, lp := logpost2(d, .SD)]

> difftime(Sys.time(), start)
Time difference of 0.1966231 secs

and

> head(grid)
     b1 b2   b3 phi id        lp
1: 0.00  1 -1.5 0.4  1 -398.7618
2: 0.05  1 -1.5 0.4  2 -380.3674
3: 0.10  1 -1.5 0.4  3 -363.5356
4: 0.15  1 -1.5 0.4  4 -348.2663
5: 0.20  1 -1.5 0.4  5 -334.5595
6: 0.25  1 -1.5 0.4  6 -322.4152

data.table | faster row-wise recursive update within group

Great question!

Starting from a fresh R session, here is the demo data with 5 million rows, your function from the question, and the timing on my laptop, with some comments inline.

require(data.table)   # v1.10.0
n_smpl = 1e6
ni = 5
id = rep(1:n_smpl, each = ni)
smpl = data.table(id)
smpl[, time := 1:.N, by = id]
a_init = 1; b_init = 1
smpl[, ':=' (a = a_init, b = b_init)]
smpl[, xb := (1:.N)*id, by = id]

myfun = function (xb, a, b) {

  z = NULL
  # initializes a new length-0 variable

  for (t in 1:length(xb)) {

    if (t >= 2) { a[t] = b[t-1] + xb[t] }
    # if() on every iteration. t==1 could be done before loop

    z[t] = rnorm(1, mean = a[t])
    # z vector is grown by 1 item, each time

    b[t] = a[t] + z[t]
    # assigns to all of b vector when only really b[t-1] is
    # needed on the next iteration
  }
  return(z)
}

set.seed(1); system.time(smpl[, z := myfun(xb, a, b), by = id][])
user system elapsed
19.216 0.004 19.212

smpl
              id time a b      xb            z
      1:       1    1 1 1       1 3.735462e-01
      2:       1    2 1 1       2 3.557190e+00
      3:       1    3 1 1       3 9.095107e+00
      4:       1    4 1 1       4 2.462112e+01
      5:       1    5 1 1       5 5.297647e+01
     ---
4999996: 1000000    1 1 1 1000000 1.618913e+00
4999997: 1000000    2 1 1 2000000 2.000000e+06
4999998: 1000000    3 1 1 3000000 7.000003e+06
4999999: 1000000    4 1 1 4000000 1.800001e+07
5000000: 1000000    5 1 1 5000000 4.100001e+07

So 19.2s is the time to beat. In all these timings, I've run the command 3 times locally to make sure it's a stable timing. The timing variance is insignificant in this task so I'll just report one timing to keep the answer quicker to read.

Tackling the comments inline above in myfun():

myfun2 = function (xb, a, b) {

  z = numeric(length(xb))
  # allocate size up front rather than growing

  z[1] = rnorm(1, mean = a[1])
  prevb = a[1] + z[1]
  t = 2L
  while (t <= length(xb)) {
    at = prevb + xb[t]
    z[t] = rnorm(1, mean = at)
    prevb = at + z[t]
    t = t + 1L
  }
  return(z)
}
set.seed(1); system.time(smpl[, z2 := myfun2(xb, a, b), by = id][])
user system elapsed
13.212 0.036 13.245
smpl[,identical(z,z2)]
[1] TRUE

That was quite good (19.2s down to 13.2s) but it's still a for loop at R level. On first glance it can't be vectorized because the rnorm() call depends on the previous value. In fact, it probably can be vectorized by using the property that m+sd*rnorm(mean=0,sd=1) == rnorm(mean=m, sd=sd) and calling vectorized rnorm(n=5e6) once rather than 5e6 times. But there'd probably be a cumsum() involved to deal with the groups. So let's not go there as that would probably make the code harder to read and would be specific to this precise problem.
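
Just to verify the rnorm() identity used in that reasoning (without attempting the full vectorization): R generates rnorm(n, mean = m) as m plus standard normal draws, so the two forms line up exactly under the same seed.

set.seed(1); a <- rnorm(3, mean = 5)
set.seed(1); b <- 5 + rnorm(3)
identical(a, b)
# [1] TRUE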

So let's try Rcpp which looks very similar to the style you wrote and is more widely applicable :

require(Rcpp)   # v0.12.8
cppFunction(
'NumericVector myfun3(IntegerVector xb, NumericVector a, NumericVector b) {
  NumericVector z = NumericVector(xb.length());
  z[0] = R::rnorm(/*mean=*/ a[0], /*sd=*/ 1);
  double prevb = a[0] + z[0];
  int t = 1;
  while (t < xb.length()) {
    double at = prevb + xb[t];
    z[t] = R::rnorm(at, 1);
    prevb = at + z[t];
    t++;
  }
  return z;
}')

set.seed(1); system.time(smpl[, z3 := myfun3(xb, a, b), by = id][])
user system elapsed
1.800 0.020 1.819
smpl[,identical(z,z3)]
[1] TRUE

Much better: 19.2s down to 1.8s. But every call to the function calls the first line (NumericVector()) which allocates a new vector as long as the number of rows in the group. That's then filled in and returned which is copied to the final column in the correct place for that group (by :=), only to be released. That allocation and management of all those 1 million small temporary vectors (one for each group) is all a bit convoluted.

Why don't we do the whole column in one go? You've already written it in a for loop style and there's nothing wrong with that. Let's tweak the C function to accept the id column too and add the if for when it reaches a new group.

cppFunction(
'NumericVector myfun4(IntegerVector id, IntegerVector xb, NumericVector a, NumericVector b) {

  // ** id must be pre-grouped, such as via setkey(DT,id) **

  NumericVector z = NumericVector(id.length());
  int previd = id[0]-1;   // initialize to anything different than id[0]
  double prevb = 0;       // declared outside the loop so it carries over between rows
  for (int i=0; i<id.length(); i++) {
    if (id[i]!=previd) {
      // first row of new group
      z[i] = R::rnorm(a[i], 1);
      prevb = a[i]+z[i];
      previd = id[i];
    } else {
      // 2nd row of group onwards
      double at = prevb + xb[i];
      z[i] = R::rnorm(at, 1);
      prevb = at + z[i];
    }
  }
  return z;
}')

system.time(setkey(smpl,id)) # ensure grouped by id
user system elapsed
0.028 0.004 0.033
set.seed(1); system.time(smpl[, z4 := myfun4(id, xb, a, b)][])
user system elapsed
0.232 0.004 0.237
smpl[,identical(z,z4)]
[1] TRUE

That's better: 19.2s down to 0.27s.



Related Topics



Leave a reply



Submit