When should I use the := operator in data.table?

Here is an example showing 10 minutes reduced to 1 second (from the NEWS on the data.table homepage). Using := is like subassigning to a data.frame, but it doesn't copy the entire table each time.

library(data.table)

m = matrix(1, nrow = 100000, ncol = 100)
DF = as.data.frame(m)
DT = as.data.table(m)

system.time(for (i in 1:1000) DF[i, 1] <- i)
#    user  system elapsed
# 287.062 302.627 591.984

system.time(for (i in 1:1000) DT[i, V1 := i])
#    user  system elapsed
#   1.148   0.000   1.158   (511 times faster)

Putting the := in j like that allows more idioms:

DT["a",done:=TRUE]   # binary search for group 'a' and set a flag
DT[,newcol:=42] # add a new column by reference (no copy of existing data)
DT[,col:=NULL] # remove a column by reference

and :

DT[,newcol:=sum(v),by=group]  # like a fast transform() by group

I can't think of any reason to avoid :=, other than inside a for loop. Since := appears inside DT[...], it comes with the small overhead of the [.data.table method; e.g., S3 dispatch and checking for the presence and type of arguments such as i, by, nomatch, etc. For use inside for loops, there is a low-overhead, direct version of := called set(). See ?set for more details and examples. The disadvantages of set() are that i must be row numbers (no binary search) and that it can't be combined with by. By imposing those restrictions, set() reduces the overhead dramatically.

system.time(for (i in 1:1000) set(DT, i, "V1", i))
#    user  system elapsed
#   0.016   0.000   0.018

Adding a column in data.table with = vs :=

= is for aggregating/summarising: the result has as many rows as there are unique values of by.

:= is for adding a column: the result has the same number of rows as the original.

For example (note the trailing [] in the second call, which prints the otherwise invisible result of :=):

library(data.table)
dt <- data.table(I = 1:3, x = 11:13, y = c("A", "A", "B"))
dt[, .(mx = mean(x)), by = "y"]
#>    y   mx
#> 1: A 11.5
#> 2: B 13.0
dt[, mx := mean(x), by = "y"][]
#>    I  x y   mx
#> 1: 1 11 A 11.5
#> 2: 2 12 A 11.5
#> 3: 3 13 B 13.0

Created on 2018-06-16 by the reprex package (v0.2.0).

Understanding exactly when a data.table is a reference to (vs a copy of) another data.table

Yes, it's subassignment in R using <- (or = or ->) that makes a copy of the whole object. You can trace that using tracemem(DT) and .Internal(inspect(DT)), as below. The data.table features := and set() assign by reference to whatever object they are passed. So if that object was previously copied (by a subassignment with <-, or an explicit copy(DT)), then it's the copy that gets modified by reference.

DT <- data.table(a = c(1, 2), b = c(11, 12)) 
newDT <- DT

.Internal(inspect(DT))
# @0000000003B7E2A0 19 VECSXP g0c7 [OBJ,NAM(2),ATT] (len=2, tl=100)
# @00000000040C2288 14 REALSXP g0c2 [NAM(2)] (len=2, tl=0) 1,2
# @00000000040C2250 14 REALSXP g0c2 [NAM(2)] (len=2, tl=0) 11,12
# ATTRIB: # ..snip..

.Internal(inspect(newDT)) # precisely the same object at this point
# @0000000003B7E2A0 19 VECSXP g0c7 [OBJ,NAM(2),ATT] (len=2, tl=100)
# @00000000040C2288 14 REALSXP g0c2 [NAM(2)] (len=2, tl=0) 1,2
# @00000000040C2250 14 REALSXP g0c2 [NAM(2)] (len=2, tl=0) 11,12
# ATTRIB: # ..snip..

tracemem(newDT)
# [1] "<0x0000000003b7e2a0"

newDT$b[2] <- 200
# tracemem[0000000003B7E2A0 -> 00000000040ED948]:
# tracemem[00000000040ED948 -> 00000000040ED830]: .Call copy $<-.data.table $<-

.Internal(inspect(DT))
# @0000000003B7E2A0 19 VECSXP g0c7 [OBJ,NAM(2),TR,ATT] (len=2, tl=100)
# @00000000040C2288 14 REALSXP g0c2 [NAM(2)] (len=2, tl=0) 1,2
# @00000000040C2250 14 REALSXP g0c2 [NAM(2)] (len=2, tl=0) 11,12
# ATTRIB: # ..snip..

.Internal(inspect(newDT))
# @0000000003D97A58 19 VECSXP g0c7 [OBJ,NAM(2),ATT] (len=2, tl=100)
# @00000000040ED7F8 14 REALSXP g0c2 [NAM(2)] (len=2, tl=0) 1,2
# @00000000040ED8D8 14 REALSXP g0c2 [NAM(2)] (len=2, tl=0) 11,200
# ATTRIB: # ..snip..

Notice how even the a vector was copied (a different hex value indicates a new copy of the vector), even though a wasn't changed. Even the whole of b was copied, rather than just the elements that needed to change. That's important to avoid for large data, and it's why := and set() were introduced to data.table.

Now, with our copied newDT, we can modify it by reference:

newDT
#      a   b
# [1,] 1  11
# [2,] 2 200

newDT[2, b := 400]   # see FAQ 2.21 for why this prints newDT
#      a   b
# [1,] 1  11
# [2,] 2 400

.Internal(inspect(newDT))
# @0000000003D97A58 19 VECSXP g0c7 [OBJ,NAM(2),ATT] (len=2, tl=100)
# @00000000040ED7F8 14 REALSXP g0c2 [NAM(2)] (len=2, tl=0) 1,2
# @00000000040ED8D8 14 REALSXP g0c2 [NAM(2)] (len=2, tl=0) 11,400
# ATTRIB: # ..snip ..

Notice that all 3 hex values (the vector of column pointers, and each of the 2 columns) remain unchanged. So it was truly modified by reference, with no copies at all.

Or, we can modify the original DT by reference:

DT[2, b := 600]
#      a   b
# [1,] 1  11
# [2,] 2 600

.Internal(inspect(DT))
# @0000000003B7E2A0 19 VECSXP g0c7 [OBJ,NAM(2),ATT] (len=2, tl=100)
# @00000000040C2288 14 REALSXP g0c2 [NAM(2)] (len=2, tl=0) 1,2
# @00000000040C2250 14 REALSXP g0c2 [NAM(2)] (len=2, tl=0) 11,600
# ATTRIB: # ..snip..

Those hex values are the same as the original values we saw for DT above. Type example(copy) for more examples using tracemem and comparison to data.frame.

Btw, if you tracemem(DT) and then DT[2, b := 600], you'll see one copy reported. That is a copy of the first 10 rows that the print method makes. When the call is wrapped in invisible(), or made within a function or script, the print method isn't invoked.
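A minimal sketch of that (tracemem output is session-specific, and this assumes the print-copy behaviour described above):

DT <- data.table(a = c(1, 2), b = c(11, 12))
tracemem(DT)
DT[2, b := 600]              # one copy reported: the print method copying rows to display
invisible(DT[2, b := 600])   # nothing printed, so no copy reported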

All this applies inside functions too; i.e., := and set() do not copy on write, even within functions. If you need to modify a local copy, call x = copy(x) at the start of the function. But remember that data.table is for large data (as well as for faster programming on small data). We deliberately don't want to copy large objects, ever. As a result, we don't need to allow for the usual rule of thumb of 3 times the data size in working memory. We try to need working memory only as large as one column (i.e., a working-memory factor of 1/ncol rather than 3).
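For instance, a minimal sketch of the difference (the function names f_ref and f_copy are just for illustration):

f_ref <- function(d) {
  d[, flag := TRUE]   # := follows the reference: the caller's table is modified
  d
}

f_copy <- function(d) {
  d <- copy(d)        # take a local copy first
  d[, flag := TRUE]   # only the copy gains the column
  d
}

DT <- data.table(x = 1:3)
f_copy(DT)
"flag" %in% names(DT)   # FALSE: original untouched
f_ref(DT)
"flag" %in% names(DT)   # TRUE: original modified by reference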

Using %in% operator on data.table column

With data.table/tibble/data_frame etc., [, columnindex] for a single column still returns a data.table/tibble/data_frame. You need to use either $ or [[ to get a vector, since %in% works on vectors:

as.data.table(mtcars)[[2]] %in% 4
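For contrast, a short sketch of the single-column forms (cyl is column 2 of mtcars; dt is just an illustrative name):

dt <- as.data.table(mtcars)
dt[, 2]              # still a one-column data.table, so %in% won't behave as intended
dt$cyl %in% 4        # logical vector: this is what you want
dt[["cyl"]] %in% 4   # same, selecting the column by name with [[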

How to embed an "and" operator inside a data.table function?

Perhaps something like this, where cumsum(abs(v1) + abs(v2)) == 0 flags the rows for which both values have been zero so far (an accumulated v1 == 0 & v2 == 0):

f <- function(v1, v2, s) {
  # flag rows where both v1 and v2 have been zero in every row so far
  s[cumsum(abs(v1) + abs(v2)) == 0] <- "END"
  s
}

# within each ID, scan from the last Period backwards
setDT(data1)[order(-Period), State1 := f(Values_1, Values_2, State), by = ID]

Output:

    ID Period Values_1 Values_2 State State1
1: 1 1 5 5 X0 X0
2: 1 2 0 2 X1 X1
3: 1 3 0 0 X2 X2
4: 1 4 0 12 X1 X1
5: 2 1 1 2 X0 X0
6: 2 2 -1 0 X2 X2
7: 2 3 0 1 X0 X0
8: 2 4 0 0 X0 END
9: 3 1 0 0 X2 X2
10: 3 2 0 0 X1 X1
11: 3 3 0 0 X9 X9
12: 3 4 0 2 X3 X3
13: 4 1 1 4 X2 X2
14: 4 2 2 5 X1 X1
15: 4 3 3 6 X9 X9
16: 4 4 0 0 X3 END

Does data.table's `:=` operator really operate by reference?

You (or perhaps the column) just got plonked ;) The plonk behaviour is rather thoroughly described in the help text (?`:=`):

Unlike <- for data.frame, the (potentially large) LHS is not coerced to match the type of the (often small) RHS. Instead the RHS is coerced to match the type of the LHS, if necessary. Where this involves double precision values being coerced to an integer column, a warning is given (whether or not fractional data is truncated). The motivation for this is efficiency. It is best to get the column types correct up front and stick to them. Changing a column type is possible but deliberately harder: provide a whole column as the RHS. This RHS is then plonked into that column slot and we call this plonk syntax, or replace column syntax if you prefer. By needing to construct a full length vector of a new type, you as the user are more aware of what is happening, and it's clearer to readers of your code that you really do intend to change the column type.

However, the relationship between plonking and memory is currently not explicitly addressed in the docs (but see below). Hence questions like yours, and others on GitHub (:= does not update by reference existing column if i is missing; := doesn't always assign in-place).

There are a lot of interesting points in the GitHub posts, but rather than reiterate them here, please just go there and enjoy! One quote from Matt Dowle, though, which I believe nicely justifies the plonk behaviour:

Instead of 5 column allocations, there's just one now for the a+a expression (the RHS, which gets created anyway) which is then plonked into the column slot by reference i.e. address(DT) doesn't change but address(DT$a) will change. That's correct behaviour, and most efficient, to save copying the whole RHS into the existing column (which is only possible if they're the same type anyway). Since the RHS is as long as the number of rows, it is just plonked in.

(Disclaimer: things may have changed in both data.table and R since that post, but I think the main message is still valid.)
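As a small demonstration of that quote, using data.table's address() helper (exact addresses are session-specific):

DT <- data.table(a = 1:3)
address(DT)        # address of the data.table itself
address(DT$a)      # address of column a's vector
DT[, a := a + a]   # full-length RHS: the a+a result is plonked into the column slot
address(DT)        # unchanged
address(DT$a)      # changed: the slot now points at the new vector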


Regarding documentation, there is an open PR (update and clarify := docs), where a more explicit description of plonk and memory is suggested:

When a column is plonked, the original column is not updated by reference, because that needs to update every single element of that column.


Have I been plonked? Yes! For me it wasn't memory but column classes that caused some head scratching, and I ended up here: Why is data.table casting column classes when I assign all columns by reference. After reading your question, I returned to that post and realized that the very nice answer by Matt addresses not only class but also memory. I think it's worth repeating here (my comment in []):

if length(RHS) == nrow(DT) then the RHS (and whatever its type) is plonked into that column slot. Even if those lengths are 1. If length(RHS) < nrow(DT), the memory for the column (and its type) is kept in place [implicitly memory not kept in place when length(RHS) == nrow(DT), I assume], but the RHS is coerced and recycled to replace the (subset of) items in that column.

If I need to change a column's type in a large table I write:

DT[, col := as.numeric(col)]

here as.numeric allocates a new vector, coerces "col" into that new memory, which is then plonked into the column slot. It's as efficient as it can be. The reason that's a plonk is because length(RHS) == nrow(DT).
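To see both branches of that rule side by side, here's a minimal sketch (the exact coercion warning text varies by data.table version):

DT <- data.table(a = 1:5)       # integer column
old <- address(DT$a)
DT[1:2, a := 2.5]               # length(RHS) < nrow(DT): RHS coerced to integer, with a warning
identical(old, address(DT$a))   # TRUE: the column's memory was kept in place
DT[, a := as.numeric(a)]        # length(RHS) == nrow(DT): a new double vector is plonked in
identical(old, address(DT$a))   # FALSE: the column slot now holds new memory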

Why has data.table defined := rather than overloading <-?

I don't think there was any technical reason it had to be :=, for the following reason: := is only used inside [...], so it is always quoted. [...] goes through the expression tree to see if := is in it.

That means it's not really acting as an operator and it's not really overloaded; so they could have picked pretty much any operator they wanted. I guess maybe it looked better? Or less confusing because it's clearly not <-?

(Note that if := were used outside of [...] it could not be <-, because you can't actually overload <-: it doesn't evaluate its left-hand argument, so it doesn't know what the type is.)
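You can see the quoting for yourself in base R, since := parses as an ordinary infix operator even though base R attaches no meaning to it (a minimal sketch):

e <- quote(DT[i, a := b + 1])
e[[4]]                                  # a := b + 1 -- the unevaluated j expression
identical(e[[4]][[1]], as.name(":="))   # TRUE: [.data.table can spot := in the tree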

:= (pass by reference) operator in the data.table package modifies another data.table object simultaneously

This piece of documentation in data.table should help; see ?data.table::copy:

No value is returned. The data.table is modified by reference. If you require a copy, take a copy first (using DT2=copy(DT)). copy() may also sometimes be useful before := is used to subassign to a column by reference.
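A minimal sketch of the difference (the names DT2 and DT3 are arbitrary):

DT  <- data.table(a = 1:2)
DT2 <- copy(DT)    # a real copy: independent of DT
DT3 <- DT          # just a second name bound to the same table
DT3[, b := 2L]     # := follows the reference, so DT gains b too
names(DT)          # "a" "b"
names(DT2)         # "a"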


