Row operations in data.table
A few things:
`dt[, genesum := lapply(.SD, sum), by = gene]` and `dt[, genesum := apply(dt[, -1], 1, sum)]` are quite different.
`dt[, genesum := lapply(.SD, sum), by = gene]` loops over the columns of the `.SD` data.table and sums them, whereas `dt[, genesum := apply(dt[, -1], 1, sum)]` loops over the rows (i.e. `apply(x, 1, function)` applies `function` to every row in `x`).
I think you can get what you want by calling `rowSums`, like so:
dt[, genesum := rowSums(dt[, -1])]
Is that what you're after?
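To make the column-wise versus row-wise distinction concrete, here is a minimal sketch. The toy table and the column names `s1` and `s2` are made up for illustration and are not from the question.

```r
library(data.table)

# Hypothetical toy data: one id column and two numeric sample columns
dt <- data.table(gene = c("a", "b", "c"), s1 = 1:3, s2 = 4:6)
num_cols <- c("s1", "s2")

# Column-wise: lapply over .SD returns one sum per column
col_sums <- dt[, lapply(.SD, sum), .SDcols = num_cols]

# Row-wise: rowSums over .SD returns one sum per row
dt[, genesum := rowSums(.SD), .SDcols = num_cols]
```

Using `.SDcols` rather than `dt[, -1]` keeps the selection explicit, so the result does not change if `genesum` already exists from an earlier run.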
Row operations in data.table using `by = .I`
UPDATE:
Since data.table version 1.14.3, `by = .I` has been implemented to work as the OP expected, for row-wise grouping. Note that using `by = .I` will create a new column in the data.table called `I` that holds the row numbers. The row number column can then be kept or deleted according to preference.
The rest of this answer records an earlier version that pertains to older versions of data.table. I keep it here for reference in case someone still uses legacy versions.
Note: section (3) of this answer was updated in April 2019, because many changes in data.table over time had rendered the original version obsolete. Also, use of the argument `with=` has been removed from all data.table calls, as it has since been deprecated.
1) Well, one reason not to use it, at least for the `rowSums` example, is performance, along with the creation of an unnecessary column. Compare with option f2 below, which is about 3.5x faster and does not need the rowpos column. (Note that the original question used `rowSums` as the example function, to which this part of the answer responds. The OP edited the question afterwards to use a different function, for which part 3 of this answer is more relevant.)
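library(data.table)
library(microbenchmark)
dt <- data.table(V0 = LETTERS[c(1,1,2,2,3)], V1 = 1:5, V2 = 3:7, V3 = 5:1)
# f1: add a row-index column, then group by it (one group per row)
f1 <- function(dt) {
  dt[, rowpos := .I]
  dt[, sdd := rowSums(.SD[, 2:4]), by = rowpos]
}
# f2: a single vectorised rowSums call over the selected columns
f2 <- function(dt) dt[, sdd := rowSums(.SD), .SDcols = 2:4]
microbenchmark(f1(dt), f2(dt))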
# Unit: milliseconds
# expr min lq mean median uq max neval cld
# f1(dt) 3.669049 3.732434 4.013946 3.793352 3.972714 5.834608 100 b
# f2(dt) 1.052702 1.085857 1.154132 1.105301 1.138658 2.825464 100 a
2) On your second question, although `dt[, sdd := sum(.SD[, 2:4]), by = .I]` does not work, `dt[, sdd := sum(.SD[, 2:4]), by = 1:NROW(dt)]` works perfectly. Given that, according to `?data.table`, ".I is an integer vector equal to seq_len(nrow(x))", one might expect these to be equivalent. The difference, however, is that `.I` is for use in `j`, not in `by`. NB the value of `.I` is calculated internally in data.table, so it is not available beforehand to be passed in as a parameter value, as in `by = .I`.
It might also be expected that `by = .I` should just throw an error. But this does not occur, because loading the data.table package creates an object `.I` in the data.table namespace that is accessible from the global environment, and whose value is `NULL`. You can test this by typing `.I` at the command prompt. (Note that the same applies to `.SD`, `.EACHI`, `.N`, `.GRP`, and `.BY`.)
.I
# Error: object '.I' not found
library(data.table)
.I
# NULL
data.table::.I
# NULL
The upshot of this is that the behaviour of `by = .I` is equivalent to `by = NULL`.
3) We have already seen in part 1 that in the case of `rowSums`, which already loops row-wise efficiently, there are much faster ways than creating the rowpos column. But what about looping when we don't have a fast row-wise function?
Benchmarking the `by = rowpos` and `by = seq_len(NROW(dt))` versions against a `for` loop with `set()` is informative here. We find that looping over `set` in a `for` loop is slower than either of the methods that use data.table's `by` argument for looping (roughly twice as slow in the timings below). However, there is negligible difference in timing between the `by` loop that creates an additional column and the one that uses `seq_len(NROW(dt))`. Absent any performance difference, it seems that `f.nrow` is probably preferable, but only on the basis of being more concise and not creating an unnecessary column.
dt <- data.table(V0 = rep(LETTERS[c(1,1,2,2,3)], 1e3), V1=1:5, V2=3:7, V3=5:1)
f.rowpos <- function() {
dt[, rowpos := .I]
dt[, sdd := sum(.SD[, 2:4]), by = rowpos ]
}
f.nrow <- function() {
dt[, sdd := sum(.SD[, 2:4]), by = seq_len(NROW(dt)) ]
}
f.forset<- function() {
for (i in seq_len(NROW(dt))) set(dt, i, 'sdd', sum(dt[i, 2:4]))
}
microbenchmark(f.rowpos(),f.nrow(), f.forset(), times = 5)
# Unit: milliseconds
# expr min lq mean median uq max neval
# f.rowpos() 559.1115 575.3162 580.2853 578.6865 588.5532 599.7591 5
# f.nrow() 558.4327 582.4434 584.6893 587.1732 588.6689 606.7282 5
# f.forset() 1172.6560 1178.8399 1298.4842 1255.4375 1292.7393 1592.7486 5
So, in conclusion, even in situations where there is not an optimised function such as `rowSums` that already operates by row, there are alternatives to using a rowpos column that, although not faster, don't require creation of a redundant column.
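As a small illustration of the `by = seq_len(nrow(dt))` pattern with a function that has no ready-made row-wise counterpart, here is a sketch. The helper `row_range` and the column name `rng` are hypothetical names chosen for this example.

```r
library(data.table)

dt <- data.table(V0 = LETTERS[c(1, 1, 2, 2, 3)], V1 = 1:5, V2 = 3:7, V3 = 5:1)

# Hypothetical per-row function: spread between the largest and smallest value
row_range <- function(v) max(v) - min(v)

# One group per row; .SD holds that single row restricted to .SDcols
dt[, rng := row_range(unlist(.SD)),
   by = seq_len(nrow(dt)),
   .SDcols = c("V1", "V2", "V3")]
```

No helper column is added to `dt` here; only `rng` is created.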
rowwise operations with data.table using the .I notation
This can be accomplished using `pmax` and `pmin`:
library(data.table)
dt[, c("max","min"):=list(pmax(x,y), pmin(x,y))]
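For a self-contained run of the line above, here is a minimal sketch with made-up values for `x` and `y` (the question's data is not reproduced here).

```r
library(data.table)

# Hypothetical input: two numeric columns to compare row by row
dt <- data.table(x = c(1, 5, 3), y = c(4, 2, 3))

# pmax/pmin are vectorised: element i of the result compares x[i] with y[i]
dt[, c("max", "min") := list(pmax(x, y), pmin(x, y))]
```

Because `pmax` and `pmin` work on whole columns at once, no grouping by row is needed.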
Preferred performant procedure for R data.table row-wise operations?
I think you can use matrix multiplication and other vectorization techniques to simplify your code, which lets you avoid running the function `logpost` in a row-wise manner.
Below is a vectorized version of `logpost`, i.e., `logpost2`:
logpost2 <- function(d, dd, mub = 1, taub = 10, a = 0.5, z = 0.7) {
bmat <- as.matrix(dd[, .(b1, b2, b3)])
xmat <- cbind(1, as.matrix(d[, .(x1, x2)]))
phi <- dd$phi
phi_log <- log(phi)
lp <- -(a + nrow(d) + 1) * phi_log -
(1 / (2 * phi^2)) * colSums((d$y - tcrossprod(xmat, bmat))^2) -
(1 / (2 * taub^2)) * rowSums((bmat - mub)^2) - (z / phi)
lp
}
and you will see
> start <- Sys.time()
> grid[, lp := logpost2(d, .SD)]
> difftime(Sys.time(), start)
Time difference of 0.1966231 secs
and
> head(grid)
b1 b2 b3 phi id lp
1: 0.00 1 -1.5 0.4 1 -398.7618
2: 0.05 1 -1.5 0.4 2 -380.3674
3: 0.10 1 -1.5 0.4 3 -363.5356
4: 0.15 1 -1.5 0.4 4 -348.2663
5: 0.20 1 -1.5 0.4 5 -334.5595
6: 0.25 1 -1.5 0.4 6 -322.4152
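The key step in `logpost2` is replacing a per-row loop with `colSums((y - tcrossprod(X, B))^2)`. The sketch below checks that identity on small made-up inputs; `X`, `y` and `B` are hypothetical stand-ins for the design matrix, the response and the grid of coefficients, not the question's data.

```r
set.seed(1)

# Hypothetical inputs: 5 observations, intercept plus 2 predictors
X <- cbind(1, matrix(rnorm(10), ncol = 2))
y <- rnorm(5)

# Each row of B is one candidate coefficient vector (one grid row)
B <- rbind(c(0.00, 1, -1.5),
           c(0.05, 1, -1.5),
           c(0.10, 1, -1.5))

# Row-wise version: residual sum of squares for one candidate at a time
rss_loop <- vapply(seq_len(nrow(B)),
                   function(i) sum((y - X %*% B[i, ])^2),
                   numeric(1))

# Vectorised version: column i of tcrossprod(X, B) holds the fitted
# values for candidate i, so colSums gives all sums of squares at once
rss_vec <- colSums((y - tcrossprod(X, B))^2)
```

Subtracting the matrix from `y` works because `y` is recycled down each column, and each column has exactly `length(y)` rows.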
data.table: Perform efficient row-wise operation on large data.table with columns as input
This is difficult. `strsplit` will not be very memory efficient for this 100 million row dataset: each row requires two lists to be made from `strsplit`. My suggestion is to use a function and skip the `by = 1:.N` step.
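exposed = function(before, after) {
  out = logical(length(before))
  for (i in seq_along(before)) {
    bef = before[i]
    aft = after[i]
    if (bef == "NONE" || aft == "NONE") {
      out[i] = FALSE
    } else {
      # fixed = TRUE matches the literal "+", so the split string must not be the regex "[+]"
      aft_parts = unlist(strsplit(aft, "+", fixed = TRUE), use.names = FALSE)
      bef_parts = unlist(strsplit(bef, "+", fixed = TRUE), use.names = FALSE)
      # TRUE if any drug after the index was not present before it
      out[i] = any(!aft_parts %chin% bef_parts)
    }
  }
  return(out)
}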
DT[, TI3 := exposed(exposure.before.index, exposure)]
> DT[, .(exposure.before.index, exposure, TI, TI3)]
exposure.before.index exposure TI TI3
1: drugA drugA FALSE FALSE
2: drugA drugA+drugB TRUE TRUE
3: drugA drugA+drugB+drugC TRUE TRUE
4: drugB drugB FALSE FALSE
5: drugB drugC TRUE TRUE
6: NONE NONE FALSE FALSE
7: NONE NONE FALSE FALSE
Note there are a few optimizations here:
- Using `%chin%` instead of `%in%`. `%chin%` is a data.table utility function that is faster on character vectors than `%in%`.
- Using `strsplit(..., fixed = TRUE)` to optimize: this isn't a regular expression we are using. Likely the biggest performance boost.
- Using `unlist(..., use.names = FALSE)`.
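One detail worth checking when switching `strsplit` to `fixed = TRUE` is that the split string is then taken literally. The sketch below uses a made-up exposure string to show the difference, along with `%chin%` on the resulting pieces.

```r
library(data.table)

s <- "drugA+drugB"

# Literal split on "+": no regex engine involved
parts_fixed <- strsplit(s, "+", fixed = TRUE)[[1]]

# Regex split: "[+]" is a character class matching a plus sign
parts_regex <- strsplit(s, "[+]")[[1]]

# With fixed = TRUE, "[+]" is searched for literally and never found,
# so the string comes back unsplit
parts_unsplit <- strsplit(s, "[+]", fixed = TRUE)[[1]]

# %chin% is data.table's character-only counterpart of %in%
seen <- c("drugA", "drugC") %chin% parts_fixed
```

So the regex form `"[+]"` and the literal form `"+"` with `fixed = TRUE` give the same pieces, but mixing the two does not.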
The next step would be to turn the function into an Rcpp function, which is not done here. Strings are more complicated than numbers in Rcpp (at least for me).
Here's the performance of this function. For the 7 row example, this is about 5 times faster by median. But as we increase the rows, the speed difference narrows to roughly 3 times:
## 7 rows
Unit: microseconds
expr min lq mean median uq max
use_fx 375.801 395.251 662.582 409.751 431.351 21345.701
OP 1889.901 2021.601 2211.858 2096.101 2285.201 4042.801
## 700,000 rows
Unit: seconds
expr min lq mean median uq max
use_fx 4.409595 4.409595 4.409595 4.409595 4.409595 4.409595
OP 12.592520 12.592520 12.592520 12.592520 12.592520 12.592520
## 7,000,000 rows
Unit: seconds
expr min lq mean median uq max
use_fx 43.90979 43.90979 43.90979 43.90979 43.90979 43.90979
OP 130.16418 130.16418 130.16418 130.16418 130.16418 130.16418
## code used:
DT_big = DT[rep(seq_len(.N), 1e5)]
microbenchmark(
use_fx = DT_big[, TI3 := exposed(exposure.before.index, exposure)],
OP = {
DT_big[,CNT:=1:.N]
DT_big[!(exposure.before.index!="NONE" & exposure=="NONE"),TI:=(any(!unlist(strsplit(exposure, "[+]")) %in% unlist(strsplit(exposure.before.index, "[+]")))),by="CNT"]
DT_big[is.na(TI),TI:=FALSE]
}
, times = 1L
)
If you are interested in Rcpp, this may be helpful:
https://wckdouglas.github.io/2015/05/string-manipulation
Perform row-wise operation in datatable with multiple elements
You can do this:
DT[, .(.BY$x+DT[group==.BY$group,y]), by=.(x,group)]
This returns N rows per x, where N is the size of x's group. We leverage the special symbol `.BY`, which is available in `j` when using `by`. Basically, `.BY` is a named list containing the values of the grouping variables. Here, I'm adding the value of x (`.BY$x`) to the vector of `y` values from the subset of `DT` where the `group` is equal to the current group value (`.BY$group`).
Output:
x group V1
<int> <char> <int>
1: 68 A 144
2: 68 A 107
3: 68 A 92
4: 68 A 121
5: 68 A 160
---
4996: 4 B 25
4997: 4 B 66
4998: 4 B 83
4999: 4 B 27
5000: 4 B 68
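To see the `.BY` approach on something small enough to check by hand, here is a sketch with a made-up three-row table. The values are hypothetical, and each (x, group) pair is unique so that every row forms its own group.

```r
library(data.table)

# Hypothetical data: two rows in group A, one row in group B
DT <- data.table(x = c(1, 2, 10),
                 group = c("A", "A", "B"),
                 y = c(100, 200, 300))

# For each (x, group) pair, add that x to every y in the same group
res <- DT[, .(.BY$x + DT[group == .BY$group, y]), by = .(x, group)]
```

Group A has two `y` values, so each of its rows expands to two result rows; group B has one, so its row yields a single result row.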
You can also accomplish this via a join:
DT[, !c("y")][DT[, .(y, group)], on = .(group), allow.cartesian = TRUE][, total := x + y][order(ind)]
Output:
ind x group y total
<int> <int> <char> <int> <int>
1: 1 68 A 76 144
2: 1 68 A 39 107
3: 1 68 A 24 92
4: 1 68 A 53 121
5: 1 68 A 92 160
---
4996: 100 4 B 21 25
4997: 100 4 B 62 66
4998: 100 4 B 79 83
4999: 100 4 B 23 27
5000: 100 4 B 64 68
R data.table: optimize speed of row operations by (different) groups
In the OP's code, we don't need the `==` once we set the key, i.e. the first `setkey` is enough. We can then join `on` 'DATE' while subtracting `i.PRICE` from `PRICE`:
setkeyv(DT, cols=c("FRUIT", "DATE"))
DT[DT[.(chosen_fruit)], RESULTS := PRICE - i.PRICE, on = .(DATE)]
Another option is to group by 'DATE' and subtract, from each 'PRICE', the PRICE where 'FRUIT' is 'GRAPE':
library(data.table)
DT[, RESULTS := PRICE - PRICE[FRUIT == 'GRAPE'], DATE]
-output
DT
DATE FRUIT PRICE RESULTS
1: 2020-03-01 BANANA 30.000 29.500
2: 2020-03-02 BANANA 30.060 29.430
3: 2020-03-03 BANANA 30.120 29.360
4: 2020-03-04 BANANA 30.180 29.290
5: 2020-03-05 BANANA 30.240 29.220
6: 2020-03-06 BANANA 30.300 29.150
7: 2020-03-07 BANANA 30.360 29.080
8: 2020-03-08 BANANA 30.420 29.010
9: 2020-03-01 ORANGE 5.000 4.500
10: 2020-03-02 ORANGE 5.035 4.405
11: 2020-03-03 ORANGE 5.070 4.310
12: 2020-03-04 ORANGE 5.105 4.215
13: 2020-03-05 ORANGE 5.140 4.120
14: 2020-03-06 ORANGE 5.175 4.025
15: 2020-03-07 ORANGE 5.210 3.930
16: 2020-03-08 ORANGE 5.245 3.835
17: 2020-03-01 APPLE 12.000 11.500
18: 2020-03-02 APPLE 12.600 11.970
19: 2020-03-03 APPLE 13.200 12.440
20: 2020-03-04 APPLE 13.800 12.910
21: 2020-03-05 APPLE 14.400 13.380
22: 2020-03-06 APPLE 15.000 13.850
23: 2020-03-07 APPLE 15.600 14.320
24: 2020-03-08 APPLE 16.200 14.790
25: 2020-03-01 LEMON 10.000 9.500
26: 2020-03-02 LEMON 10.010 9.380
27: 2020-03-03 LEMON 10.020 9.260
28: 2020-03-04 LEMON 10.030 9.140
29: 2020-03-05 LEMON 10.040 9.020
30: 2020-03-06 LEMON 10.050 8.900
31: 2020-03-07 LEMON 10.060 8.780
32: 2020-03-08 LEMON 10.070 8.660
33: 2020-03-01 GRAPE 0.500 0.000
34: 2020-03-02 GRAPE 0.630 0.000
35: 2020-03-03 GRAPE 0.760 0.000
36: 2020-03-04 GRAPE 0.890 0.000
37: 2020-03-05 GRAPE 1.020 0.000
38: 2020-03-06 GRAPE 1.150 0.000
39: 2020-03-07 GRAPE 1.280 0.000
40: 2020-03-08 GRAPE 1.410 0.000
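As a quick check that the join and the group-by approach agree, here is a minimal sketch on a made-up four-row table (two dates, two fruits). It uses `copy()` so that each approach updates its own table.

```r
library(data.table)

# Hypothetical prices: two fruits on two dates
DT <- data.table(
  DATE  = as.Date(c("2020-03-01", "2020-03-02", "2020-03-01", "2020-03-02")),
  FRUIT = c("APPLE", "APPLE", "GRAPE", "GRAPE"),
  PRICE = c(12, 12.6, 0.5, 0.63)
)

# Group by DATE and subtract that date's GRAPE price
by_group <- copy(DT)[, RESULTS := PRICE - PRICE[FRUIT == "GRAPE"], by = DATE]

# Update join: i.PRICE is the GRAPE price matched on DATE
by_join <- copy(DT)[DT[FRUIT == "GRAPE"], RESULTS := PRICE - i.PRICE, on = .(DATE)]
```

Both versions assume exactly one GRAPE row per date; with missing or duplicated GRAPE rows they would no longer behave the same way.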
Another option is to `dcast` to 'wide' format and then do the subtraction:
dt_wide <- dcast(DT, DATE ~ FRUIT, value.var = 'PRICE')
nm1 <- names(dt_wide)[-1]
dt_wide[, (nm1) := lapply(.SD, function(x) x - GRAPE), .SDcols = nm1]
Benchmarks
Tested on a slightly bigger dataset by changing the `sample_size` in constructing the input data:
sample_size <- 1000000
dim(DT)
#[1] 5000000 3
system.time(DT[DT[.(chosen_fruit)], RESULTS := PRICE - i.PRICE, on = .(DATE)])
# user system elapsed
# 0.287 0.039 0.326
system.time({ DT[DT[FRUIT == chosen_fruit], RESULTS := PRICE - i.PRICE, on = .(DATE)] })
# user system elapsed
# 0.294 0.006 0.300
system.time({
setkey(DT, DATE)
DT[DT[FRUIT == chosen_fruit], RESULTS := PRICE - i.PRICE]
setkey(DT, FRUIT)
})
# user system elapsed
# 0.431 0.045 0.476
system.time(DT[, RESULTS := PRICE - PRICE[FRUIT == 'GRAPE'], DATE])
# user system elapsed
# 6.660 0.039 6.665
system.time({
dt_wide <- dcast(DT, DATE ~ FRUIT, value.var = 'PRICE')
nm1 <- names(dt_wide)[-1]
dt_wide[, (nm1) := lapply(.SD, function(x) x - GRAPE), .SDcols = nm1]
})
# user system elapsed
# 0.868 0.060 0.926
Row operations on selected columns based on substring in data.table
We `melt` the dataset into 'long' format by making use of the `measure` argument, get the difference between the `sum` of 'v' and the `mean` of 'f' grouped by 'id', join `on` the 'id' column with the original dataset, and assign (`:=`) 'V1' as the 'Y' variable:
dt[melt(dt, measure = patterns("^v", "^f"), value.name = c("v", "f"))[
, sum(v) - mean(f), id], Y :=V1, on = .(id)]
dt
# id v1 v2 f1 f2 Y
#1: a 1 1 11 11 -9
#2: b 2 2 12 12 -8
#3: c 3 3 13 13 -7
#4: d 4 4 14 14 -6
#5: e 5 5 15 15 -5
Another option is to use `Reduce` after creating indices of the 'v' and 'f' columns:
nmv <- which(startsWith(names(dt), "v"))
nmf <- which(startsWith(names(dt), "f"))
l1 <- length(nmf)  # number of 'f' columns, used to turn their sum into a mean
dt[, Y := Reduce(`+`, .SD[, nmv, with = FALSE]) - (Reduce(`+`, .SD[, nmf, with = FALSE]) / l1)]
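The same quantity can also be sketched with `rowSums` and `rowMeans`, which may read more directly. This is an alternative to the two approaches above rather than a replacement; the input is reconstructed from the printed output, and the names `vcols` and `fcols` are introduced here for illustration.

```r
library(data.table)

# Input reconstructed from the output shown earlier
dt <- data.table(id = letters[1:5], v1 = 1:5, v2 = 1:5, f1 = 11:15, f2 = 11:15)

# Select columns by name prefix
vcols <- grep("^v", names(dt), value = TRUE)
fcols <- grep("^f", names(dt), value = TRUE)

# Row-wise sum of the 'v' columns minus row-wise mean of the 'f' columns
sum_v  <- rowSums(dt[, vcols, with = FALSE])
mean_f <- rowMeans(dt[, fcols, with = FALSE])
dt[, Y := sum_v - mean_f]
```

`rowMeans` divides by the number of 'f' columns itself, so no separate column count has to be kept in step with the selection.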