When should I use the := operator in data.table?
Here is an example showing 10 minutes reduced to 1 second (from NEWS on homepage). It's like subassigning to a data.frame
but doesn't copy the entire table each time.
library(data.table)
m = matrix(1,nrow=100000,ncol=100)
DF = as.data.frame(m)
DT = as.data.table(m)
system.time(for (i in 1:1000) DF[i,1] <- i)
user system elapsed
287.062 302.627 591.984
system.time(for (i in 1:1000) DT[i,V1:=i])
user system elapsed
1.148 0.000 1.158 ( 511 times faster )
Putting the := in j like that allows more idioms:
DT["a",done:=TRUE] # binary search for group 'a' and set a flag
DT[,newcol:=42] # add a new column by reference (no copy of existing data)
DT[,col:=NULL] # remove a column by reference
and :
DT[,newcol:=sum(v),by=group] # like a fast transform() by group
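The idioms above can be tried on a small table. This is only a sketch: the table and its column names (group, v, done, newcol, total) are made up for illustration, and the key is set on group so that DT["a", ...] can use binary search.

```r
library(data.table)

# Hypothetical table; keyed on 'group' so DT["a", ...] can binary-search the key
DT <- data.table(group = c("a", "a", "b"), v = 1:3, key = "group")

DT["a", done := TRUE]               # flag only the rows of group 'a'
DT[, newcol := 42]                  # add a column by reference
DT[, newcol := NULL]                # remove that column again by reference
DT[, total := sum(v), by = group]   # per-group sum, repeated on every row of the group

print(DT)
```

Rows outside group 'a' are left as NA in done, because := with i only touches the matched rows.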
I can't think of any reason to avoid :=, other than inside a for loop. Since := appears inside DT[...], it comes with the small overhead of the [.data.table method; e.g., S3 dispatch and checking for the presence and type of arguments such as i, by, nomatch etc. So for use inside for loops, there is a low-overhead, direct version of := called set. See ?set for more details and examples. The disadvantages of set include that i must be row numbers (no binary search) and you can't combine it with by. By making those restrictions, set can reduce the overhead dramatically.
system.time(for (i in 1:1000) set(DT,i,"V1",i))
user system elapsed
0.016 0.000 0.018
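A minimal sketch of set() with named arguments, assuming a hypothetical one-column table. It shows that i takes row numbers, j takes a column name (or number), and that i may also be a vector, which can remove the need for the loop altogether.

```r
library(data.table)

# Hypothetical table with a single double column
DT <- data.table(V1 = numeric(1000))

# Loop form: one cheap call per row, no [.data.table dispatch
for (i in 1:10) {
  set(DT, i = i, j = "V1", value = as.numeric(i))
}

# Vectorised form: several rows in one call
set(DT, i = 11:20, j = "V1", value = as.numeric(11:20))

head(DT, 20)
```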
Adding a column in data.table with = vs :=
= for aggregating/summarising: the result has the same number of rows as the number of unique values in by
:= for adding a column: the result has the same number of rows as the original
For example:
library(data.table)
dt <- data.table(I = 1:3, x = 11:13, y = c("A", "A", "B"))
dt[, .(mx = mean(x)), by = "y"]
#> y mx
#> 1: A 11.5
#> 2: B 13.0
dt[, mx := mean(x), by = "y"][]
#> I x y mx
#> 1: 1 11 A 11.5
#> 2: 2 12 A 11.5
#> 3: 3 13 B 13.0
Created on 2018-06-16 by the reprex package (v0.2.0).
Understanding exactly when a data.table is a reference to (vs a copy of) another data.table
Yes, it's subassignment in R using <- (or = or ->) that makes a copy of the whole object. You can trace that using tracemem(DT) and .Internal(inspect(DT)), as below. The data.table features := and set() assign by reference to whatever object they are passed. So if that object was previously copied (by a subassigning <- or an explicit copy(DT)) then it's the copy that gets modified by reference.
DT <- data.table(a = c(1, 2), b = c(11, 12))
newDT <- DT
.Internal(inspect(DT))
# @0000000003B7E2A0 19 VECSXP g0c7 [OBJ,NAM(2),ATT] (len=2, tl=100)
# @00000000040C2288 14 REALSXP g0c2 [NAM(2)] (len=2, tl=0) 1,2
# @00000000040C2250 14 REALSXP g0c2 [NAM(2)] (len=2, tl=0) 11,12
# ATTRIB: # ..snip..
.Internal(inspect(newDT)) # precisely the same object at this point
# @0000000003B7E2A0 19 VECSXP g0c7 [OBJ,NAM(2),ATT] (len=2, tl=100)
# @00000000040C2288 14 REALSXP g0c2 [NAM(2)] (len=2, tl=0) 1,2
# @00000000040C2250 14 REALSXP g0c2 [NAM(2)] (len=2, tl=0) 11,12
# ATTRIB: # ..snip..
tracemem(newDT)
# [1] "<0x0000000003b7e2a0"
newDT$b[2] <- 200
# tracemem[0000000003B7E2A0 -> 00000000040ED948]:
# tracemem[00000000040ED948 -> 00000000040ED830]: .Call copy $<-.data.table $<-
.Internal(inspect(DT))
# @0000000003B7E2A0 19 VECSXP g0c7 [OBJ,NAM(2),TR,ATT] (len=2, tl=100)
# @00000000040C2288 14 REALSXP g0c2 [NAM(2)] (len=2, tl=0) 1,2
# @00000000040C2250 14 REALSXP g0c2 [NAM(2)] (len=2, tl=0) 11,12
# ATTRIB: # ..snip..
.Internal(inspect(newDT))
# @0000000003D97A58 19 VECSXP g0c7 [OBJ,NAM(2),ATT] (len=2, tl=100)
# @00000000040ED7F8 14 REALSXP g0c2 [NAM(2)] (len=2, tl=0) 1,2
# @00000000040ED8D8 14 REALSXP g0c2 [NAM(2)] (len=2, tl=0) 11,200
# ATTRIB: # ..snip..
Notice how even the a vector was copied (a different hex value indicates a new copy of the vector), even though a wasn't changed. Even the whole of b was copied, rather than just changing the elements that need to be changed. That's important to avoid for large data, and why := and set() were introduced to data.table.
Now, with our copied newDT we can modify it by reference:
newDT
# a b
# [1,] 1 11
# [2,] 2 200
newDT[2, b := 400]
# a b # See FAQ 2.21 for why this prints newDT
# [1,] 1 11
# [2,] 2 400
.Internal(inspect(newDT))
# @0000000003D97A58 19 VECSXP g0c7 [OBJ,NAM(2),ATT] (len=2, tl=100)
# @00000000040ED7F8 14 REALSXP g0c2 [NAM(2)] (len=2, tl=0) 1,2
# @00000000040ED8D8 14 REALSXP g0c2 [NAM(2)] (len=2, tl=0) 11,400
# ATTRIB: # ..snip ..
Notice that all 3 hex values (the vector of column pointers, and each of the 2 columns) remain unchanged. So it was truly modified by reference with no copies at all.
Or, we can modify the original DT by reference:
DT[2, b := 600]
# a b
# [1,] 1 11
# [2,] 2 600
.Internal(inspect(DT))
# @0000000003B7E2A0 19 VECSXP g0c7 [OBJ,NAM(2),ATT] (len=2, tl=100)
# @00000000040C2288 14 REALSXP g0c2 [NAM(2)] (len=2, tl=0) 1,2
# @00000000040C2250 14 REALSXP g0c2 [NAM(2)] (len=2, tl=0) 11,600
# ATTRIB: # ..snip..
Those hex values are the same as the original values we saw for DT above. Type example(copy) for more examples using tracemem and comparison to data.frame.
Btw, if you tracemem(DT) then DT[2,b:=600] you'll see one copy reported. That is a copy of the first 10 rows that the print method does. When wrapped with invisible() or when called within a function or script, the print method isn't called.
All this applies inside functions too; i.e., := and set() do not copy on write, even within functions. If you need to modify a local copy, then call x=copy(x) at the start of the function. But, remember data.table is for large data (as well as faster programming advantages for small data). We deliberately don't want to copy large objects (ever). As a result we don't need to allow for the usual 3* working memory factor rule of thumb. We try to only need working memory as large as one column (i.e. a working memory factor of 1/ncol rather than 3).
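The point about functions can be illustrated with two small helpers. Both function names and the flag column are hypothetical; this is a sketch of the behaviour described above, not code from the data.table documentation.

```r
library(data.table)

# Modifies the caller's table: := does not copy on write, even inside a function
add_flag_in_place <- function(x) {
  x[, flag := TRUE]
  invisible(x)
}

# Works on a local copy, so the caller's table is left untouched
add_flag_on_copy <- function(x) {
  x <- copy(x)
  x[, flag := TRUE]
  x
}

DT <- data.table(a = 1:3)

res <- add_flag_on_copy(DT)
has_flag_after_copy <- "flag" %in% names(DT)      # caller's DT unchanged

add_flag_in_place(DT)
has_flag_after_in_place <- "flag" %in% names(DT)  # caller's DT now has the column
```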
Using %in% operator on data.table column
With data.table/tibble/data_frame etc., [,columnindex] for a single column still returns a data.table/tibble/data_frame. We need to use either $ or [[ to get a vector, because %in% works on vectors:
as.data.table(mtcars)[[2]] %in% 4
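A slightly longer sketch on the same built-in mtcars data, showing the difference between [, 2] and [[2]], plus the common alternative of using %in% directly inside i, where column names are evaluated as vectors:

```r
library(data.table)

DT <- as.data.table(mtcars)

# [, 2] keeps the data.table class, so it is not a plain vector
still_a_table <- is.data.table(DT[, 2])

# [[2]] (or DT$cyl) extracts the column as a vector, which %in% can test
is_four <- DT[[2]] %in% 4

# Inside i, the column name already refers to the vector
four_or_six <- DT[cyl %in% c(4, 6)]
```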
How to embed an and operator inside a data.table function?
Perhaps something like this:
f <- function(v1,v2,s) {
s[cumsum(abs(v1)+abs(v2))==0] <- "END"
s
}
setDT(data1)[order(-Period), State1:=f(Values_1, Values_2, State), by=ID]
Output:
ID Period Values_1 Values_2 State State1
1: 1 1 5 5 X0 X0
2: 1 2 0 2 X1 X1
3: 1 3 0 0 X2 X2
4: 1 4 0 12 X1 X1
5: 2 1 1 2 X0 X0
6: 2 2 -1 0 X2 X2
7: 2 3 0 1 X0 X0
8: 2 4 0 0 X0 END
9: 3 1 0 0 X2 X2
10: 3 2 0 0 X1 X1
11: 3 3 0 0 X9 X9
12: 3 4 0 2 X3 X3
13: 4 1 1 4 X2 X2
14: 4 2 2 5 X1 X1
15: 4 3 3 6 X9 X9
16: 4 4 0 0 X3 END
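To see why only the trailing all-zero rows are marked, f can be called directly on the values of ID 2 from the output above, supplied in descending Period order (which is what order(-Period) arranges within each group). The cumulative sum of absolute values stays at zero only until the first non-zero row is reached.

```r
f <- function(v1, v2, s) {
  # cumsum stays 0 only while every row seen so far is zero in both columns
  s[cumsum(abs(v1) + abs(v2)) == 0] <- "END"
  s
}

# ID 2 from the table above, Period 4 down to Period 1
v1 <- c(0, 0, -1, 1)
v2 <- c(0, 1, 0, 2)
s  <- c("X0", "X0", "X2", "X0")

res <- f(v1, v2, s)
res
# "END" "X0" "X2" "X0"
```

abs() is what keeps a pair such as -1 and 1 from cancelling out and being mistaken for a zero row.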
Does data.table's `:=` operator really operate by reference?
You (or perhaps the column) just got plonked ;) The plonk behaviour is rather thoroughly described in the help text (?`:=`):
Unlike <- for data.frame, the (potentially large) LHS is not coerced to match the type of the (often small) RHS. Instead the RHS is coerced to match the type of the LHS, if necessary. Where this involves double precision values being coerced to an integer column, a warning is given (whether or not fractional data is truncated). The motivation for this is efficiency. It is best to get the column types correct up front and stick to them. Changing a column type is possible but deliberately harder: provide a whole column as the RHS. This RHS is then plonked into that column slot and we call this plonk syntax, or replace column syntax if you prefer. By needing to construct a full length vector of a new type, you as the user are more aware of what is happening, and it's clearer to readers of your code that you really do intend to change the column type.
However, the relationship between plonking and memory is currently not explicitly addressed in the docs (but see below). Hence questions like yours and by others (on github: ":= does not update by reference existing column if i is missing", ":= doesn't always assign in-place").
There are a lot of interesting points in the github posts, but rather than me reiterating them, please just go there and enjoy! One quote from Matt Dowle though, which I believe nicely justifies the plonk behaviour:
Instead of 5 column allocations, there's just one now for the a+a expression (the RHS, which gets created anyway) which is then plonked into the column slot by reference i.e. address(DT) doesn't change but address(DT$a) will change. That's correct behaviour, and most efficient, to save copying the whole RHS into the existing column (which is only possible if they're the same type anyway). Since the RHS is as long as the number of rows, it is just plonked in.
(Disclaimer: things may have changed in both data.table and R since that post, but I think the main message is still valid.)
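The address() helper from data.table makes it possible to check this on your own installation. The sketch below uses a hypothetical one-column table; whether the column address changes may depend on the data.table and R versions, so the comments describe what the quote leads one to expect rather than a guaranteed result.

```r
library(data.table)

DT <- data.table(a = as.numeric(1:5))

table_before  <- address(DT)
column_before <- address(DT$a)

# Full-length RHS with no i: per the quote, the new vector is plonked in
DT[, a := a + a]

table_unchanged <- identical(address(DT), table_before)
column_replaced <- !identical(address(DT$a), column_before)

table_unchanged   # the quote says the table itself stays where it was
column_replaced   # the quote says the column slot now points at the new vector
```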
Regarding documentation, there is an open PR (update and clarify := docs), where a more explicit description of plonk and memory is suggested:
When a column is plonked, the original column is not updated by reference, because that needs to update every single element of that column.
Have I been plonked? Yes! For me it wasn't memory, but column classes which caused some head scratching, and I ended up here: Why is data.table casting column classes when I assign all columns by reference. After reading your question, I returned to that post and realized that the very nice answer by Matt not "only" addresses class but also memory. I think it's worth repeating here (my bold and comment in []):
if length(RHS) == nrow(DT) then the RHS (and whatever its type) is plonked into that column slot. Even if those lengths are 1. If length(RHS) < nrow(DT), the memory for the column (and its type) is kept in place [implicitly memory not kept in place when length(RHS) == nrow(DT), I assume] but the RHS is coerced and recycled to replace the (subset of) items in that column.
If I need to change a column's type in a large table I write:
DT[, col := as.numeric(col)]
here as.numeric allocates a new vector, coerces "col" into that new memory, which is then plonked into the column slot. It's as efficient as it can be. The reason that's a plonk is because length(RHS) == nrow(DT).
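The effect on column type can be sketched as follows, using a hypothetical integer column named col as in the quote. suppressWarnings() is used because, according to the help text quoted earlier, assigning a double into an integer column may warn; newer data.table versions may behave differently.

```r
library(data.table)

DT <- data.table(col = 1:3)          # integer column

# Sub-assignment with i: the RHS is coerced to the column's existing type
suppressWarnings(DT[2L, col := 10])
class_after_subassign <- class(DT$col)

# Full-length RHS: the new double vector replaces the column, changing its type
DT[, col := as.numeric(col)]
class_after_plonk <- class(DT$col)

class_after_subassign
class_after_plonk
```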
Why has data.table defined := rather than overloading <-?
I don't think there is any technical reason this should be necessary, for the following reason: := is only used inside [...] so it is always quoted. [...] goes through the expression tree to see if := is in it.
That means it's not really acting as an operator and it's not really overloaded; so they could have picked pretty much any operator they wanted. I guess maybe it looked better? Or less confusing because it's clearly not <-?
(Note that if := were used outside of [...] it could not be <-, because you can't actually overload <-. <- doesn't evaluate its left-hand argument so it doesn't know what the type is.)
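The idea of inspecting a quoted expression can be shown in base R. The helper uses_walrus below is hypothetical and much simpler than whatever [.data.table actually does; it only illustrates that R parses a := b into an ordinary call object that can be examined without being evaluated.

```r
# The parser accepts := and produces a call whose function name is `:=`
j <- quote(newcol := 42)
class(j)             # "call"
as.character(j[[1]]) # ":="
as.character(j[[2]]) # "newcol"

# Hypothetical helper: capture the argument unevaluated and check its head
uses_walrus <- function(expr) {
  e <- substitute(expr)
  is.call(e) && identical(e[[1L]], as.name(":="))
}

uses_walrus(newcol := 42)  # TRUE
uses_walrus(newcol + 42)   # FALSE
```

Because the argument is never evaluated, it does not matter that newcol does not exist.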
:= (pass by reference) operator in the data.table package modifies another data table object simultaneously
This piece of documentation in data.table would help (see ?data.table::copy):
No value is returned. The data.table is modified by reference. If you require a copy, take a copy first (using DT2=copy(DT)). copy() may also sometimes be useful before := is used to subassign to a column by reference.
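A short sketch of the situation in the question, with hypothetical object names: plain assignment only creates a second name for the same table, whereas copy() creates an independent one.

```r
library(data.table)

DT <- data.table(a = c(1, 2), b = c(11, 12))

alias    <- DT        # second name for the same object, nothing is copied
snapshot <- copy(DT)  # independent copy

alias[2, b := 600]    # modifies the shared object by reference

same_object <- identical(address(DT), address(alias))

DT$b        # 11 600: DT changed too, because alias and DT are one object
snapshot$b  # 11 12: the copy is unaffected
```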