Sub-Assign by Reference on Vector in R

In recent R versions (3.1 to 3.1.2 onwards, or so), sub-assignment to a vector does not copy it, provided nothing else refers to that vector. You will not see that by running the OP's code though, for the following reason. Because x is reused and assigned into another object, R is not told whether x was copied at that point, so it has to assume that it was not and that x is now shared, and because of that it copies x on the next modification. (In this particular case I think it would be good to change data.table::data.table so that it notifies R that a copy has been made, but that's a separate issue; data.frame suffers from the same issue.) If you change the order of the commands a bit, you'd see no difference:

library(data.table)

N <- 5e7
x <- sample(letters, N, TRUE)
upd_i <- sample(N, 1L, FALSE)
# no copy here:
system.time(x[upd_i] <- NA_character_)
# user system elapsed
# 0 0 0
X <- data.table(x = x)
system.time(X[upd_i, x := NA_character_])
# user system elapsed
# 0 0 0

# but now R will copy:
system.time(x[upd_i] <- NA_character_)
# user system elapsed
# 0.28 0.08 0.36
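
The same copy-on-modify behaviour can be seen on a small scale in base R. This is only a minimal sketch with made-up variable names; tracemem() reports copies only when R was built with memory profiling, so that part is guarded, and whether the first assignment really happens in place depends on the R version and how the code is run.

```r
# Illustrative sketch: sub-assignment copies only when the vector is shared.
x <- c(1, 2, 3)
x[2] <- 20          # x is the only reference, so R may modify it in place

y <- x              # x and y now refer to the same memory
x[3] <- 30          # x is shared, so R copies it before modifying; y is untouched

print(x)
print(y)

# Optional: let R report the copy (needs a build with memory profiling).
if (capabilities("profmem")) {
  z <- c(1, 2, 3)
  tracemem(z)
  w <- z
  z[1] <- 100       # z is shared with w, so a tracemem line should be printed here
  untracemem(z)
}
```

Whatever happens to the memory, the values are always safe: y keeps its old contents after x is modified.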

(old answer, mostly left as a curiosity)

You actually can use the data.table := operator to modify your vector in place (I think you need R version 3.1+ to avoid the copy in list):

modify.vector = function (v, idx, value) setDT(list(v))[idx, V1 := value]

v = 1:5
address(v)
#[1] "000000002CC7AC48"

modify.vector(v, 4, 10)
v
#[1] 1 2 3 10 5

address(v)
#[1] "000000002CC7AC48"

Sub-assign rows by reference using data.table

data.table has the format DT[i,j,by] with i meaning location / where, j meaning select / update / compute / assign and by meaning group by.

So the mistake that you are making here is the following:

In your assignment, the DT1[col1==vec, ...] part is equivalent to the following index:

DT1$col1 == vec

This compares the elements of the col1 column of DT1 with vec, position by position. Since vec has only 3 elements, it is recycled to the length of col1, and with the specific values in your vec and col1, the 5th and 6th elements turn out to be TRUE after recycling.
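
To make the recycling concrete, here is a small base R sketch. The values of col1 and vec are hypothetical (they are not the OP's data), picked so that == and %in% give different answers.

```r
# Hypothetical values, chosen so that recycling matters.
col1 <- c(5, 4, 1, 2, 5, 6)
vec  <- c(4, 5, 6)

# `==` recycles vec to 4 5 6 4 5 6 and compares position by position.
by_position <- col1 == vec

# `%in%` asks, for each element of col1, whether it occurs anywhere in vec.
by_membership <- col1 %in% vec

print(by_position)    # only positions 5 and 6 are TRUE
print(by_membership)  # positions 1, 2, 5 and 6 are TRUE
```

With these values the position-wise comparison misses the 5 and the 4 at the start of col1, which is the kind of surprise described above.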

The correct way to do what you want to do is:

Method 1: (preferred)

DT1[vec, col3 := FALSE]

Method 2: (equivalent to data.frame, but not preferred for data.table)

DT1$col3[vec] <- FALSE

or, the following will also work:

DT1[vec]$col3 <- FALSE

Method 3: Here is another possibility (although slower than the first method):

DT1[col1 %in% vec, col3 := FALSE]

Hope this helps!!

Rcpp pass by reference vs. by value

The key is the 'proxy model': your xa really is the same memory location as your original object, so you end up changing your original.

If you don't want that, you should do one thing: (deep) copy using the clone() method, or maybe explicit creation of a new object into which the altered object gets written. Method two does not do that, you simply use two differently named variables which are both "pointers" (in the proxy model sense) to the original variable.

An additional complication, though, is in implicit cast and copy when you pass an int vector (from R) to a NumericVector type: that creates a copy, and then the original no longer gets altered.

Here is a more explicit example, similar to one I use in the tutorials or workshops:

library(inline)
f1 <- cxxfunction(signature(a="numeric"), plugin="Rcpp", body='
  Rcpp::NumericVector xa(a);
  int n = xa.size();
  for(int i=0; i < n; i++) {
    if(xa[i]<0) xa[i] = 0;
  }
  return xa;
')

f2 <- cxxfunction(signature(a="numeric"), plugin="Rcpp", body='
  Rcpp::NumericVector xa(a);
  int n = xa.size();
  Rcpp::NumericVector xr(a); // still points to a
  for(int i=0; i < n; i++) {
    if(xr[i]<0) xr[i] = 0;
  }
  return xr;
')

p <- seq(-2,2)
print(class(p))
print(cbind(f1(p), p))
print(cbind(f2(p), p))
p <- as.numeric(seq(-2,2))
print(class(p))
print(cbind(f1(p), p))
print(cbind(f2(p), p))

and this is what I see:

edd@max:~/svn/rcpp/pkg$ r /tmp/ari.r
Loading required package: methods
[1] "integer"
p
[1,] 0 -2
[2,] 0 -1
[3,] 0 0
[4,] 1 1
[5,] 2 2
p
[1,] 0 -2
[2,] 0 -1
[3,] 0 0
[4,] 1 1
[5,] 2 2
[1] "numeric"
p
[1,] 0 0
[2,] 0 0
[3,] 0 0
[4,] 1 1
[5,] 2 2
p
[1,] 0 0
[2,] 0 0
[3,] 0 0
[4,] 1 1
[5,] 2 2
edd@max:~/svn/rcpp/pkg$

So it really matters whether you pass int-to-float or float-to-float.
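
For completeness, here is a sketch of the clone() route mentioned above. It uses Rcpp::cppFunction instead of inline, and the function name clamp_copy is made up for illustration; treat it as one possible way to write it, not the only one.

```r
library(Rcpp)

# Hypothetical helper: clamp negatives to zero on a deep copy of the input.
cppFunction('
Rcpp::NumericVector clamp_copy(Rcpp::NumericVector a) {
  Rcpp::NumericVector out = Rcpp::clone(a);  // deep copy, a is left untouched
  int n = out.size();
  for (int i = 0; i < n; i++) {
    if (out[i] < 0) out[i] = 0;
  }
  return out;
}')

p <- as.numeric(seq(-2, 2))   # already numeric, so no implicit cast-and-copy
q <- clamp_copy(p)
print(cbind(q, p))            # q is clamped, p keeps its negative values
```

Because the work is done on the clone, the result no longer depends on whether the caller passed an integer or a numeric vector.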

matching sub-vectors in R

A recent post uncovered this solution by Jonathan Carroll. I doubt a faster solution exists in R.

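v_match <- function(needle, haystack, nomatch = 0L) {
  # candidate start positions: wherever the first needle element occurs
  # (note: nomatch is currently unused)
  sieved <- which(haystack == needle[1L])
  # seq_len() gives an empty loop for a length-1 needle
  for (i in seq_len(length(needle) - 1L)) {
    # which() drops the NA comparisons produced when sieved + i runs past the end
    sieved <- sieved[which(haystack[sieved + i] == needle[i + 1L])]
  }
  sieved
}

v_contains <- function(needle, haystack) {
  sieved <- which(haystack == needle[1L])
  for (i in seq_len(length(needle) - 1L)) {
    sieved <- sieved[which(haystack[sieved + i] == needle[i + 1L])]
  }
  # any surviving start position means the needle occurs in the haystack
  length(sieved) > 0L
}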

Tests and benchmarks:

library(testthat)
x=c(1,3,4)
y=c(4,1,3,4,5)
z=c(3,1)

expect_true(v_contains(x,y)) # return TRUE x is contained in y
expect_false(v_contains(z,y)) # FALSE the values of z are in y, but not in order
expect_equal(v_match(x,y), 2) # returns 2 because x appears in y starting at position 2

x <- c(5, 1, 3)
yes <- c(sample(5:1e6), c(5, 1, 3))
no <- c(sample(5:1e6), c(4, 1, 3))
expect_true(v_contains(x, yes))
expect_false(v_contains(x, no))
expect_equal(v_match(x, yes), 1e6 - 3)

v_contains_roll <- function(x, y) {
any(zoo::rollapply(y, length(x), identical, x))
}
v_contains_stri <- function(x, y) {
stringr::str_detect(paste(y, collapse = "_"),
paste(x, collapse = "_"))
}

options(digits = 2)
options(scipen = 99)
library(microbenchmark)
gc(0, 1, 1)
#> used (Mb) gc trigger (Mb) max used (Mb)
#> Ncells 527502 28 1180915 63 527502 28
#> Vcells 3010073 23 8388608 64 3010073 23
microbenchmark(v_contains(x, yes),
v_contains(x, no),
v_contains_stri(x, yes),
v_contains_stri(x, no),
v_contains_roll(x, yes),
v_contains_roll(x, no),
times = 2L,
control = list(order = "block"))
#> Unit: milliseconds
#>                     expr    min     lq   mean median     uq    max neval cld
#>       v_contains(x, yes)    3.8    3.8    3.8    3.8    3.9    3.9     2 a
#>        v_contains(x, no)    3.7    3.7    3.7    3.7    3.8    3.8     2 a
#>  v_contains_stri(x, yes) 1658.4 1658.4 1676.7 1676.7 1695.0 1695.0     2  b
#>   v_contains_stri(x, no) 1632.3 1632.3 1770.0 1770.0 1907.8 1907.8     2  b
#>  v_contains_roll(x, yes) 5447.4 5447.4 5666.1 5666.1 5884.7 5884.7     2   c
#>   v_contains_roll(x, no) 5458.8 5458.8 5521.7 5521.7 5584.6 5584.6     2   c

Created on 2018-08-18 by the reprex package (v0.2.0).

understanding optimisation messages on assignment by reference in a data.table

Update: The expression,

DT[, c(..., lapply(.SD, .), ...), by=.]

has been optimised internally in commit #1242 of v1.9.3 (FR #2722). Here's the entry from NEWS:

o Complex j-expressions of the form DT[, c(..., lapply(.SD, fun)), by=grp] are now optimised, as long as .SD is only present in the form lapply(.SD, fun).

For ex: DT[, c(.I, lapply(.SD, sum), mean(x), lapply(.SD, log)), by=grp]
is optimised to: DT[, list(.I, x=sum(x), y=sum(y), ..., mean(x), log(x), log(y), ...), by=grp]

But DT[, c(.SD, lapply(.SD, sum)), by=grp] for example isn't optimised yet.
This partially resolves FR #2722. Thanks to Sam Steingold for filing the FR.


Where it says NAMED vector it means that in the internal R sense at C level; i.e., whether an object has been assigned a symbol and is called something, not whether an atomic vector has a "names" attribute or not. The NAMED value in the SEXP structure takes value 0, 1 or 2. R uses that to know whether it needs to copy-on-subassign or not. See section 1.1.2 of R-ints.

What would be better is if optimization of j in data.table could handle :

DT[, c(lapply(.SD,sum),.N), by=a]

That works but may be slow. Currently only the simpler form is optimized :

DT[, lapply(.SD,sum), by=a]

To answer the main question, yes the following :

Direct plonk of unnamed RHS, no copy.

is desirable compared to :

RHS for item 1 has been duplicated. Either NAMED vector or recycled list RHS.

Another way to achieve this is :

dt.out[, count := dt[, .N, by=a]$N]

I'm not quite sure why [["N"]] returns a NAM(2) compared to $N which doesn't.

Logical manipulation of lists in R

For instance this would add 1 to those values.

y[x < 0] <- y[x < 0] + 1

Assuming you want to keep all of the Y elements:

y <- ifelse(x < 0, 0, y)
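
A tiny worked example of both forms, with made-up values for x and y (they are plain vectors here, nothing list-specific):

```r
# Made-up data for illustration.
x <- c(-2, 3, -1, 4)
y <- c(10, 20, 30, 40)

# Logical sub-assignment: only the positions where x < 0 are touched.
y_plus <- y
y_plus[x < 0] <- y_plus[x < 0] + 1

# ifelse(): builds a new vector, 0 where x < 0 and the old y elsewhere.
y_zero <- ifelse(x < 0, 0, y)

print(y_plus)
print(y_zero)
```

Both keep the full length of y; they differ only in what is written at the positions where x is negative.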

Understanding exactly when a data.table is a reference to (vs a copy of) another data.table

Yes, it's subassignment in R using <- (or = or ->) that makes a copy of the whole object. You can trace that using tracemem(DT) and .Internal(inspect(DT)), as below. The data.table features := and set() assign by reference to whatever object they are passed. So if that object was previously copied (by a subassigning <- or an explicit copy(DT)) then it's the copy that gets modified by reference.

DT <- data.table(a = c(1, 2), b = c(11, 12)) 
newDT <- DT

.Internal(inspect(DT))
# @0000000003B7E2A0 19 VECSXP g0c7 [OBJ,NAM(2),ATT] (len=2, tl=100)
# @00000000040C2288 14 REALSXP g0c2 [NAM(2)] (len=2, tl=0) 1,2
# @00000000040C2250 14 REALSXP g0c2 [NAM(2)] (len=2, tl=0) 11,12
# ATTRIB: # ..snip..

.Internal(inspect(newDT)) # precisely the same object at this point
# @0000000003B7E2A0 19 VECSXP g0c7 [OBJ,NAM(2),ATT] (len=2, tl=100)
# @00000000040C2288 14 REALSXP g0c2 [NAM(2)] (len=2, tl=0) 1,2
# @00000000040C2250 14 REALSXP g0c2 [NAM(2)] (len=2, tl=0) 11,12
# ATTRIB: # ..snip..

tracemem(newDT)
# [1] "<0x0000000003b7e2a0>"

newDT$b[2] <- 200
# tracemem[0000000003B7E2A0 -> 00000000040ED948]:
# tracemem[00000000040ED948 -> 00000000040ED830]: .Call copy $<-.data.table $<-

.Internal(inspect(DT))
# @0000000003B7E2A0 19 VECSXP g0c7 [OBJ,NAM(2),TR,ATT] (len=2, tl=100)
# @00000000040C2288 14 REALSXP g0c2 [NAM(2)] (len=2, tl=0) 1,2
# @00000000040C2250 14 REALSXP g0c2 [NAM(2)] (len=2, tl=0) 11,12
# ATTRIB: # ..snip..

.Internal(inspect(newDT))
# @0000000003D97A58 19 VECSXP g0c7 [OBJ,NAM(2),ATT] (len=2, tl=100)
# @00000000040ED7F8 14 REALSXP g0c2 [NAM(2)] (len=2, tl=0) 1,2
# @00000000040ED8D8 14 REALSXP g0c2 [NAM(2)] (len=2, tl=0) 11,200
# ATTRIB: # ..snip..

Notice how even the a vector was copied (different hex value indicates new copy of vector), even though a wasn't changed. Even the whole of b was copied, rather than just changing the elements that need to be changed. That's important to avoid for large data, and why := and set() were introduced to data.table.

Now, with our copied newDT we can modify it by reference :

newDT
# a b
# [1,] 1 11
# [2,] 2 200

newDT[2, b := 400]
# a b # See FAQ 2.21 for why this prints newDT
# [1,] 1 11
# [2,] 2 400

.Internal(inspect(newDT))
# @0000000003D97A58 19 VECSXP g0c7 [OBJ,NAM(2),ATT] (len=2, tl=100)
# @00000000040ED7F8 14 REALSXP g0c2 [NAM(2)] (len=2, tl=0) 1,2
# @00000000040ED8D8 14 REALSXP g0c2 [NAM(2)] (len=2, tl=0) 11,400
# ATTRIB: # ..snip ..

Notice that all 3 hex values (the vector of column pointers, and each of the 2 columns) remain unchanged. So it was truly modified by reference with no copies at all.

Or, we can modify the original DT by reference :

DT[2, b := 600]
# a b
# [1,] 1 11
# [2,] 2 600

.Internal(inspect(DT))
# @0000000003B7E2A0 19 VECSXP g0c7 [OBJ,NAM(2),ATT] (len=2, tl=100)
# @00000000040C2288 14 REALSXP g0c2 [NAM(2)] (len=2, tl=0) 1,2
# @00000000040C2250 14 REALSXP g0c2 [NAM(2)] (len=2, tl=0) 11,600
# ATTRIB: # ..snip..

Those hex values are the same as the original values we saw for DT above. Type example(copy) for more examples using tracemem and comparison to data.frame.

Btw, if you tracemem(DT) then DT[2,b:=600] you'll see one copy reported. That is a copy of the first 10 rows that the print method does. When wrapped with invisible() or when called within a function or script, the print method isn't called.

All this applies inside functions too; i.e., := and set() do not copy on write, even within functions. If you need to modify a local copy, then call x=copy(x) at the start of the function. But, remember data.table is for large data (as well as faster programming advantages for small data). We deliberately don't want to copy large objects (ever). As a result we don't need to allow for the usual 3* working memory factor rule of thumb. We try to only need working memory as large as one column (i.e. a working memory factor of 1/ncol rather than 3).
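
A short sketch of the difference inside functions. The two helper functions are hypothetical names made up for this illustration; only copy() and := come from data.table.

```r
library(data.table)

# Hypothetical helper: := inside a function changes the caller's data.table.
add_flag_by_ref <- function(dt) {
  dt[, flag := TRUE]
  invisible(dt)
}

# Hypothetical helper: copy() first, so the caller's object is left alone.
add_flag_on_copy <- function(dt) {
  dt <- copy(dt)
  dt[, flag := TRUE]
  dt
}

DT <- data.table(a = 1:3)

res <- add_flag_on_copy(DT)
cols_before <- copy(names(DT))   # copy(), since names(DT) can itself change by reference
print(cols_before)               # DT still has only column a

add_flag_by_ref(DT)
print(names(DT))                 # DT now has the flag column as well
```

The copy() call is the price for the local modification; without it there is no copy at all, which is the whole point for large data.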

Select / assign to data.table when variable names are stored in a character vector

Two ways to programmatically select variable(s):

  1. with = FALSE:

     DT = data.table(col1 = 1:3)
     colname = "col1"
     DT[, colname, with = FALSE]
     # col1
     # 1: 1
     # 2: 2
     # 3: 3

  2. 'dot dot' (..) prefix:

     DT[, ..colname]
     # col1
     # 1: 1
     # 2: 2
     # 3: 3

For further description of the 'dot dot' (..) notation, see New Features in 1.10.2 (it is currently not described in help text).

To assign to variable(s), wrap the LHS of := in parentheses:

DT[, (colname) := 4:6]    
# col1
# 1: 4
# 2: 5
# 3: 6

The latter is known as a column plonk, because you replace the whole column vector by reference. If a subset i was present, it would subassign by reference. The parens around (colname) are a shorthand introduced in v1.9.4 on CRAN Oct 2014. Here is the news item:

Using with = FALSE with := is now deprecated in all cases, given that wrapping
the LHS of := with parentheses has been preferred for some time.

colVar = "col1"
DT[, (colVar) := 1]                             # please change to this
DT[, c("col1", "col2") := 1] # no change
DT[, 2:4 := 1] # no change
DT[, c("col1","col2") := list(sum(a), mean(b))] # no change
DT[, `:=`(...), by = ...] # no change

See also Details section in ?`:=`:

DT[i, (colnamevector) := value]
# [...] The parens are enough to stop the LHS being a symbol

And to answer further question in comment, here's one way (as usual there are many ways) :

DT[, (colname) := cumsum(get(colname))]
# col1
# 1: 4
# 2: 9
# 3: 15

or, you might find it easier to read, write and debug just to eval a paste, similar to constructing a dynamic SQL statement to send to a server :

expr = paste0("DT[,",colname,":=cumsum(",colname,")]")
expr
# [1] "DT[,col1:=cumsum(col1)]"

eval(parse(text=expr))
# col1
# 1: 4
# 2: 13
# 3: 28

If you do that a lot, you can define a helper function EVAL :

EVAL = function(...)eval(parse(text=paste0(...)),envir=parent.frame(2))

EVAL("DT[,",colname,":=cumsum(",colname,")]")
# col1
# 1: 4
# 2: 17
# 3: 45

Now that data.table 1.8.2 automatically optimizes j for efficiency, it may be preferable to use the eval method. The get() in j prevents some optimizations, for example.

Or, there is set(). A low overhead, functional form of :=, which would be fine here. See ?set.

set(DT, j = colname, value = cumsum(DT[[colname]]))
DT
# col1
# 1: 4
# 2: 21
# 3: 66
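
When several column names are stored in the character vector, the same two approaches extend naturally. This is a sketch on a fresh, made-up table (the column names and the vector cols are assumptions for the example), not a continuation of the DT above.

```r
library(data.table)

DT <- data.table(col1 = 1:3, col2 = 4:6, other = c("x", "y", "z"))
cols <- c("col1", "col2")   # hypothetical column names held in a variable

# set() assigns by reference, one column per iteration.
for (nm in cols) {
  set(DT, j = nm, value = cumsum(DT[[nm]]))
}

# The same update with := and a parenthesised LHS.
DT2 <- data.table(col1 = 1:3, col2 = 4:6, other = c("x", "y", "z"))
DT2[, (cols) := lapply(.SD, cumsum), .SDcols = cols]

print(DT)
print(DT2)
```

The set() loop has the lowest overhead per call, while the := form keeps everything in one expression; both leave the other column untouched.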

