Why does data.table update names(DT) by reference, even if I assign to another variable?
Update: This is now added in the documentation for ?copy in version 1.9.3. From NEWS:
- Moved ?copy to its own help page, and documented that dt_names <- copy(names(DT)) is necessary for dt_names to be not modified by reference as a result of updating DT by reference (e.g., adding a new column by reference). Closes #512. Thanks to Zach for this SO question and user1971988 for this SO question.
Part of your first question leaves it a bit unclear to me what you really mean about the <- operator (at least in the context of data.table), especially the part: In other instances, we are explicitly warned that <- creates copies, both of data.tables and data.frames.
So, before answering your actual question, I'll briefly touch on it here. For a data.table, a plain <- (assignment) is not sufficient to copy it. For example:
DT <- data.table(x = 1:5, y= 6:10)
# assign DT2 to DT
DT2 <- DT # assign by reference, no copy taken.
DT2[, z := 11:15]
# DT will also have the z column
If you want to create a copy, then you have to ask for it explicitly using the copy command.
DT2 <- copy(DT) # copied content to DT2
DT2[, z := 11:15] # only DT2 is affected
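The difference between the alias and the copy can also be checked directly. The following is a minimal sketch, assuming a reasonably recent data.table; the names DT_alias and DT_copy are made up for illustration, and address() is the data.table helper that reports where an object lives in memory.

```r
library(data.table)

DT <- data.table(x = 1:5, y = 6:10)

DT_alias <- DT        # plain assignment: both names point at the same object
DT_copy  <- copy(DT)  # explicit copy: a separate object

same_alias <- address(DT_alias) == address(DT)  # same memory address
same_copy  <- address(DT_copy)  == address(DT)  # different memory address

# Adding a column through the alias also changes DT, but not the copy.
DT_alias[, z := 11:15]

"z" %in% names(DT)       # DT gained the column
"z" %in% names(DT_copy)  # the copy did not
```

Comparing addresses is only a diagnostic; in everyday code the rule of thumb from above is enough: use copy() whenever the original must stay untouched.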
From CauchyDistributedRV, I understand that what you mean is the assignment names(dt) <- ., which results in the warning. I'll leave it at that.
Now, to answer your first question: It seems that names1 <- names(DT) also behaves similarly. I hadn't thought/known about this until now. The .Internal(inspect(.)) command is very useful here:
.Internal(inspect(names1))
# @7fc86a851480 16 STRSXP g0c7 [MARK,NAM(2)] (len=2, tl=100)
# @7fc86a069f68 09 CHARSXP g1c1 [MARK,gp=0x61] [ASCII] [cached] "x"
# @7fc86a0f96d8 09 CHARSXP g1c1 [MARK,gp=0x61] [ASCII] [cached] "y"
.Internal(inspect(names(DT)))
# @7fc86a851480 16 STRSXP g0c7 [MARK,NAM(2)] (len=2, tl=100)
# @7fc86a069f68 09 CHARSXP g1c1 [MARK,gp=0x61] [ASCII] [cached] "x"
# @7fc86a0f96d8 09 CHARSXP g1c1 [MARK,gp=0x61] [ASCII] [cached] "y"
Here, you see that they are pointing to the same memory location, @7fc86a851480. Even the truelength of names1 is 100 (which is allocated by default in data.table; check ?alloc.col for this).
truelength(names1)
# [1] 100
So basically, the assignment names1 <- names(DT) seems to happen by reference. That is, names1 is pointing to the same location as DT's column names pointer.
To answer your second question: The command c(.) seems to create a copy because it does not check whether the result of the concatenation differs from its input. That is, because a c(.) operation can change the contents of the vector, it immediately makes a "copy" without checking whether the contents were modified or not.
Variable containing data.table names changed in place?
We need to create a copy of the names, because names(DT) and idVars have the same memory location:
tracemem(names(DT))
#[1] "<0x7f9d74c99600>"
tracemem(idVars)
#[1] "<0x7f9d74c99600>"
So, instead, create a copy of the names:
idVars <- copy(names(DT))
tracemem(idVars)
#[1] "<0x7f9d7d2b97c8>"
and it won't change after the assignment:
DT[, "c" := 1:10]
idVars
#[1] "a" "b"
According to ?copy:
A copy() may be required when doing dt_names = names(DT). Due to R's copy-on-modify, dt_names still points to the same location in memory as names(DT). Therefore modifying DT by reference now, say by adding a new column, dt_names will also get updated. To avoid this, one has to explicitly copy: dt_names <- copy(names(DT)).
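As a small illustration of that advice, here is a sketch using a made-up table and the hypothetical names id_alias and id_copy. Whether the plain alias is actually updated may depend on the data.table and R versions in use, so the comments only make claims about the copied vector.

```r
library(data.table)

DT <- data.table(a = 1:10, b = letters[1:10])

# Plain assignment: may share memory with the names stored inside DT.
id_alias <- names(DT)

# Explicit copy: a separate character vector, detached from DT.
id_copy <- copy(names(DT))

# Add a column by reference.
DT[, c := 1:10]

id_copy    # still the two original names
names(DT)  # now includes "c"
```

If the stored names are used later (for example as id columns in a reshape), taking the copy up front avoids depending on when the vector was captured.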
R data.table 'variable - names(DT)': variable gets overwritten with := operator
Answered by the comments below the question:
Why does data.table update names(DT) by reference, even if I assign to another variable?
Update a data.table by reference that is stored within another data.table
I think you need to move the closing square brackets to the back, i.e.
x[, item_dt[[1]][, 'skill_freq' := item_dt[[1]][order(-N),N] / N_jobs], by = c(geo_level, 'noc')]
However, you will encounter a lot of problems with updating the list of data.tables when you keep a list of data.tables within a data.table. A sample warning below:
In `[.data.table`(item_dt[[1L]], , `:=`("skill_freq", ... :
Invalid .internal.selfref detected and fixed by taking a (shallow) copy of the data.table so that := can add this new column by reference. At an earlier point, this data.table has been copied by R (or was created manually using structure() or similar). Avoid names<- and attr<- which in R currently (and oddly) may copy the whole data.table. Use set* syntax instead to avoid copying: ?set, ?setnames and ?setattr. If this message doesn't help, please report your use case to the data.table issue tracker so the root cause can be fixed or this message improved.
If I understand your requirements correctly, you can keep your data in a long format as follows:
aggregator_by_geo <- function(dt_in, geo_level){
x <- dt_in[!is.na(noc4_code),
c(.(N_jobs=uniqueN(id),
noc1_code=unique(noc1_code),
noc4_code=unique(noc4_code)),
.SD[ , .N, .(item_type, items)]),
c(geo_level, 'noc')]
x[, skill_freq := N[order(-N)] / N_jobs, c(geo_level, 'noc')]
}
Since the dataset provided is incomplete, no desired result is given, and the actual calculation is not detailed, I am unable to check whether my understanding of the requirements is correct.
Understanding exactly when a data.table is a reference to (vs a copy of) another data.table
Yes, it's subassignment in R using <- (or = or ->) that makes a copy of the whole object. You can trace that using tracemem(DT) and .Internal(inspect(DT)), as below. The data.table features := and set() assign by reference to whatever object they are passed. So if that object was previously copied (by a subassigning <- or an explicit copy(DT)), then it's the copy that gets modified by reference.
DT <- data.table(a = c(1, 2), b = c(11, 12))
newDT <- DT
.Internal(inspect(DT))
# @0000000003B7E2A0 19 VECSXP g0c7 [OBJ,NAM(2),ATT] (len=2, tl=100)
# @00000000040C2288 14 REALSXP g0c2 [NAM(2)] (len=2, tl=0) 1,2
# @00000000040C2250 14 REALSXP g0c2 [NAM(2)] (len=2, tl=0) 11,12
# ATTRIB: # ..snip..
.Internal(inspect(newDT)) # precisely the same object at this point
# @0000000003B7E2A0 19 VECSXP g0c7 [OBJ,NAM(2),ATT] (len=2, tl=100)
# @00000000040C2288 14 REALSXP g0c2 [NAM(2)] (len=2, tl=0) 1,2
# @00000000040C2250 14 REALSXP g0c2 [NAM(2)] (len=2, tl=0) 11,12
# ATTRIB: # ..snip..
tracemem(newDT)
# [1] "<0x0000000003b7e2a0>"
newDT$b[2] <- 200
# tracemem[0000000003B7E2A0 -> 00000000040ED948]:
# tracemem[00000000040ED948 -> 00000000040ED830]: .Call copy $<-.data.table $<-
.Internal(inspect(DT))
# @0000000003B7E2A0 19 VECSXP g0c7 [OBJ,NAM(2),TR,ATT] (len=2, tl=100)
# @00000000040C2288 14 REALSXP g0c2 [NAM(2)] (len=2, tl=0) 1,2
# @00000000040C2250 14 REALSXP g0c2 [NAM(2)] (len=2, tl=0) 11,12
# ATTRIB: # ..snip..
.Internal(inspect(newDT))
# @0000000003D97A58 19 VECSXP g0c7 [OBJ,NAM(2),ATT] (len=2, tl=100)
# @00000000040ED7F8 14 REALSXP g0c2 [NAM(2)] (len=2, tl=0) 1,2
# @00000000040ED8D8 14 REALSXP g0c2 [NAM(2)] (len=2, tl=0) 11,200
# ATTRIB: # ..snip..
Notice how even the a vector was copied (a different hex value indicates a new copy of the vector), even though a wasn't changed. Even the whole of b was copied, rather than just changing the elements that need to be changed. That's important to avoid for large data, and it is why := and set() were introduced to data.table.
Now, with our copied newDT, we can modify it by reference:
newDT
# a b
# [1,] 1 11
# [2,] 2 200
newDT[2, b := 400]
# a b # See FAQ 2.21 for why this prints newDT
# [1,] 1 11
# [2,] 2 400
.Internal(inspect(newDT))
# @0000000003D97A58 19 VECSXP g0c7 [OBJ,NAM(2),ATT] (len=2, tl=100)
# @00000000040ED7F8 14 REALSXP g0c2 [NAM(2)] (len=2, tl=0) 1,2
# @00000000040ED8D8 14 REALSXP g0c2 [NAM(2)] (len=2, tl=0) 11,400
# ATTRIB: # ..snip ..
Notice that all 3 hex values (the vector of column pointers, and each of the 2 columns) remain unchanged. So it was truly modified by reference with no copies at all.
Or, we can modify the original DT by reference:
DT[2, b := 600]
# a b
# [1,] 1 11
# [2,] 2 600
.Internal(inspect(DT))
# @0000000003B7E2A0 19 VECSXP g0c7 [OBJ,NAM(2),ATT] (len=2, tl=100)
# @00000000040C2288 14 REALSXP g0c2 [NAM(2)] (len=2, tl=0) 1,2
# @00000000040C2250 14 REALSXP g0c2 [NAM(2)] (len=2, tl=0) 11,600
# ATTRIB: # ..snip..
Those hex values are the same as the original values we saw for DT above. Type example(copy) for more examples using tracemem and a comparison to data.frame.
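The same check can be done for set() without reading raw inspect output. This is a minimal sketch, assuming a small toy table; addr_before and addr_after are illustrative names, and address() is the data.table helper for an object's memory address.

```r
library(data.table)

DT <- data.table(a = c(1, 2), b = c(11, 12))

addr_before <- address(DT)

# set() updates row 2 of column "b" in place; the value is a double,
# matching the column type, so no type conversion is involved.
set(DT, i = 2L, j = "b", value = 600)

addr_after <- address(DT)

identical(addr_before, addr_after)  # DT is still the same object
DT$b                                # second element is now 600
```

set() is the function form of := and is mainly useful inside loops, where the overhead of calling [.data.table repeatedly would add up.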
Btw, if you tracemem(DT) and then run DT[2, b := 600], you'll see one copy reported. That is a copy of the first 10 rows that the print method makes. When wrapped with invisible(), or when called within a function or script, the print method isn't called.
All this applies inside functions too; i.e., := and set() do not copy on write, even within functions. If you need to modify a local copy, then call x = copy(x) at the start of the function. But remember, data.table is for large data (as well as faster programming advantages for small data). We deliberately don't want to copy large objects (ever). As a result, we don't need to allow for the usual 3* working memory factor rule of thumb. We try to only need working memory as large as one column (i.e., a working memory factor of 1/ncol rather than 3).
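To make the point about functions concrete, here is a sketch with two hypothetical helpers, add_flag_inplace() and add_flag_copy(); neither is part of data.table, and the column name flag is made up for the example.

```r
library(data.table)

# Hypothetical helper: := does not copy, so the caller's table is modified.
add_flag_inplace <- function(x) {
  x[, flag := TRUE]
  invisible(x)
}

# Hypothetical helper: takes a local copy first, so the caller's table is untouched.
add_flag_copy <- function(x) {
  x <- copy(x)
  x[, flag := TRUE]
  x
}

DT <- data.table(a = 1:3)

res <- add_flag_copy(DT)
flag_after_copy <- "flag" %in% names(DT)     # caller's DT has no new column

add_flag_inplace(DT)
flag_after_inplace <- "flag" %in% names(DT)  # caller's DT now has the column
```

Which variant to use is a design choice: the in-place version avoids duplicating large data, while the copying version behaves like an ordinary R function with no side effects.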
Why does setnames() affect copied tables?
Because R implements simple reference counting, and generally only copies on modification and not on assignment. So y = x, for any x and y, would not copy anything, and no new objects will be created.
Combined with the fact that some data.table functions, like setnames, can modify the object without copying, you get the effect you see.
Use copy, as Frank mentioned, to force an explicit copy.
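A short sketch of that effect, using made-up tables x, y and z:

```r
library(data.table)

x <- data.table(a = 1:3, b = 4:6)

y <- x        # no copy: y and x are the same object
z <- copy(x)  # explicit copy, taken before any renaming

# setnames() renames by reference, so the change is visible through x as well.
setnames(y, "a", "A")

names(x)  # "A" "b"
names(z)  # "a" "b"
```

The order matters here: the copy has to be taken before the by-reference change, since copy() duplicates the table in whatever state it is in at that moment.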
:= (pass by reference) operator in the data.table package modifies another data table object simultaneously
This piece of documentation in data.table should help; see ?data.table::copy:
No value is returned. The data.table is modified by reference. If you require a copy, take a copy first (using DT2=copy(DT)). copy() may also sometimes be useful before := is used to subassign to a column by reference.