update by reference vs shallow copy
In data.table, := and all set* functions update objects by reference. This was introduced sometime around 2012, IIRC. At that time, base R did not shallow copy, but deep copied; shallow copying was introduced in 3.1.0.
It's a wordy/lengthy answer, but I think this answers your first two questions:
How is this base R method different from the data.table equivalent? Is the difference related only to speed or also memory use?
In base R v3.1.0+ when we do:
DF1 = data.frame(x=1:5, y=6:10, z=11:15)
DF2 = DF1[, c("x", "y")]
DF3 = transform(DF2, y = ifelse(y>=8L, 1L, y))
DF4 = transform(DF2, y = 2L)
- From DF1 to DF2, both columns are only shallow copied.
- From DF2 to DF3, the column y alone had to be copied/re-allocated, but x gets shallow copied again.
- From DF2 to DF4, same as (2).
That is, columns are shallow copied as long as the column remains unchanged - in a way, the copy is being delayed unless absolutely necessary.
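This delayed, column-level copying can be mimicked outside R. Here is a minimal Python sketch (an analogy only, not how base R is implemented) that models a data.frame as a dict of column lists:

```python
# Hypothetical analogy: model a data.frame as a dict of column lists and
# mimic base R's shallow column copying.
df1 = {"x": [1, 2, 3, 4, 5], "y": [6, 7, 8, 9, 10], "z": [11, 12, 13, 14, 15]}

# Selecting columns shallow-copies: the new dict reuses the same column objects.
df2 = {k: df1[k] for k in ("x", "y")}
assert df2["x"] is df1["x"] and df2["y"] is df1["y"]

# A transform-style modification replaces only the touched column; the
# untouched column x stays shared (shallow copied).
df3 = dict(df2)
df3["y"] = [1 if v >= 8 else v for v in df3["y"]]
assert df3["x"] is df2["x"]      # x is still shared
assert df3["y"] is not df2["y"]  # y was re-allocated
```

Only the modified column pays the allocation cost, which is exactly the behaviour described above.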
In data.table, we modify in place, meaning that even for DF3 and DF4 the column y doesn't get copied.
DT2[y >= 8L, y := 1L] ## (a)
DT2[, y := 2L]
Here, since y is already an integer column and we are modifying it by integer, by reference, there's no new memory allocation made at all.
This is also particularly useful when you'd like to sub-assign by reference (marked as (a) above). This is a handy feature we really like in data.table.
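The no-reallocation behaviour of an in-place integer update can be illustrated with a Python analogy (again, an analogy only, not data.table internals):

```python
# Hypothetical analogy: modifying a list's elements in place keeps the same
# underlying object, so no second container is allocated and every reference
# to the column sees the update.
y = [6, 7, 8, 9, 10]
col = y                        # another reference to the same column
before = id(y)

# Sub-assign "by reference": only elements >= 8 are set to 1.
for i, v in enumerate(y):
    if v >= 8:
        y[i] = 1

assert id(y) == before         # same object, no re-allocation
assert col == [6, 7, 1, 1, 1]  # the other reference sees the change
```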
Another advantage that comes for free (that I came to know from our interactions) is when we have to, say, convert all columns of a data.table from character type to numeric type:
DT[, (cols) := lapply(.SD, as.numeric), .SDcols = cols]
Here, since we're updating by reference, each character column gets replaced by reference with its numeric counterpart. After that replacement, the earlier character column isn't required anymore and is up for grabs for garbage collection. But if you were to do this using base R:
DF[] = lapply(DF, as.numeric)
All the columns will have to be converted to numeric, held in a temporary variable, and then finally assigned back to DF. That means, if you have 10 columns with 100 million rows, each of character type, then your DF takes a space of:
10 * 100e6 * 4 / 1024^3 = ~3.7 GB
And since numeric type is twice that size, we'll need a total of 7.4 GB + 3.7 GB of space to make the conversion using base R.
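The back-of-envelope arithmetic above (using the answer's assumption of 4 bytes per character-column entry) can be checked in a few lines:

```python
# Verify the memory estimate from the text: 10 character columns of 100
# million rows at an assumed 4 bytes per entry, then the numeric copy on top.
n_cols, n_rows, bytes_per_entry = 10, 100e6, 4

char_gb = n_cols * n_rows * bytes_per_entry / 1024**3
print(round(char_gb, 1))                # ~3.7 GB for the character columns

numeric_gb = 2 * char_gb                # numeric (double) entries are twice the size
print(round(char_gb + numeric_gb, 1))   # ~11.2 GB peak for the base R route
```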
But note that data.table copies during DF1 to DF2. That is:
DT2 = DT1[, c("x", "y")]
results in a copy, because we can't sub-assign by reference on a shallow copy. It'll update all the clones.
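Why sub-assigning by reference on a shallow copy is unsafe can be seen in any language with shared references. A Python sketch (an analogy, not data.table internals):

```python
# If two "tables" shallow-share a column, an in-place update through one of
# them leaks into the other -- which is why data.table copies on selection.
dt1 = {"x": [1, 2, 3], "y": [6, 7, 8]}
dt2 = {k: dt1[k] for k in ("x", "y")}  # shallow: columns are shared

dt2["y"][0] = 999                      # sub-assign "by reference" on dt2

assert dt1["y"][0] == 999              # dt1 was silently modified too
```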
What would be great is if we could integrate the shallow copy feature seamlessly, but keep track of whether a particular object's columns have multiple references, and update by reference wherever possible. R's upgraded reference counting feature might be very useful in this regard. In any case, we're working towards it.
For your last question:
"When is the difference most sizeable?"
There are still people who have to use older versions of R, where deep copies can't be avoided.
It depends on how many columns end up being copied because of the operations you perform. The worst-case scenario is that you've copied all the columns, of course.
There are cases like this where shallow copying won't benefit:
- When you'd like to update columns of a data.frame for each group, and there are too many groups.
- When you'd like to update a column of, say, data.table DT1 based on a join with another data.table DT2. This can be done as:
DT1[DT2, col := i.val]
where i. refers to the value from the val column of DT2 (the i argument) for matching rows. This syntax allows performing this operation very efficiently, instead of having to first join the entire result and then update the required column.
All in all, there are strong arguments where update by reference would save a lot of time, and be fast. But people sometimes like to not update objects in-place, and are willing to sacrifice speed/memory for it. We're trying to figure out how best to provide this functionality as well, in addition to the already existing update by reference.
Hope this helps. This is already quite a lengthy answer. I'll leave any questions you might have left to others or for you to figure out (other than any obvious misconceptions in this answer).
How can I make update by reference faster than shallow copy?
Here is a computing-on-the-language way to approach it. Basically, data.table is awesome and does its best to optimize. However, programmatic trickery like mget makes data.table less efficient, as it has to assume that the j expression could include any column, and therefore it needs to allocate memory for all columns to evaluate it.
This translates the mget call into what is actually happening, using the language directly, i.e. building a call of the form a[b, (cols) := list(...)] with the column names substituted in:
nms = lapply(cols, as.name)
eval(substitute(a[b, (cols):= .x, on = .(id)], list(.x = as.call(c(quote(list), nms)))))
## performance against mget(); setup for "normal": a2 = copy(a); b2 = copy(b)
bench::mark(
  lang = {
    nms = lapply(cols, as.name)
    eval(substitute(a[b, (cols) := .x, on = .(id)],
                    list(.x = as.call(c(quote(list), nms)))))
  },
  mget_approach = a[b, (cols) := mget(cols), on = .(id)],
  normal = b2[a2, on = .(id)]
)
# A tibble: 3 x 13
  expression        min  median `itr/sec` mem_alloc `gc/sec` n_itr
  <bch:expr>    <bch:t> <bch:t>     <dbl> <bch:byt>    <dbl> <int>
1 lang            157ms   160ms      5.97      23MB     1.99     3
2 mget_approach   244ms   255ms      3.92    57.3MB     5.87     2
3 normal          162ms   174ms      5.82    34.4MB     3.88     3
Here is the source code where mget is detected and data.table needs to account for it:
https://github.com/Rdatatable/data.table/blob/94a12475f737892c542d3cb7daf42e534ea13a22/R/data.table.R#L1044-L1049
# added 'mget' - fix for #994
if (any(c("get", "mget") %chin% av)){
if (verbose)
catf("'(m)get' found in j. ansvars being set to all columns. Use .SDcols or a single j=eval(macro) instead. Both will detect the columns used which is important for efficiency.\nOld ansvars: %s \n", brackify(ansvars))
# get('varname') is too difficult to detect which columns are used in general
# eval(macro) column names are detected via the if jsub[[1]]==eval switch earlier above.
What is the difference between a deep copy and a shallow copy?
Shallow copies duplicate as little as possible. A shallow copy of a collection is a copy of the collection structure, not the elements. With a shallow copy, two collections now share the individual elements.
Deep copies duplicate everything. A deep copy of a collection is two collections with all of the elements in the original collection duplicated.
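The distinction can be demonstrated in a few lines of Python with the standard copy module:

```python
import copy

original = [[1, 2], [3, 4]]      # a collection of collections

shallow = copy.copy(original)    # new outer list, shared inner lists
deep = copy.deepcopy(original)   # everything duplicated

original[0].append(99)           # mutate an element in place

assert shallow[0] == [1, 2, 99]  # shallow copy sees the mutation
assert deep[0] == [1, 2]         # deep copy is unaffected
```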
R, deep vs. shallow copies, pass by reference
When R passes variables, it is always by copy rather than by reference. Sometimes, however, a copy will not actually be made until an assignment occurs. The real description of the process is pass-by-promise. Take a look at the documentation:
?force
?delayedAssign
One practical implication is that it is very difficult, if not impossible, to avoid needing at least twice as much RAM as your objects nominally occupy: modifying a large object will generally require making a temporary copy.
Update 2015: I do (and did) agree with Matt Dowle that his data.table package provides an alternate route to assignment that avoids the copy-duplication problem. If that was the update requested, then I didn't understand it at the time the suggestion was made.
There was a recent change in R 3.2.1 in the evaluation rules for apply and Reduce. It was SO-announced with reference to the News here: Returning anonymous functions from lapply - what is going wrong?
And the interesting paper cited by jhetzel in the comments is now here:
Understanding dict.copy() - shallow or deep?
By "shallow copying" it is meant that the content of the dictionary is not copied by value; the new dictionary just holds references to the same objects.
>>> a = {1: [1,2,3]}
>>> b = a.copy()
>>> a, b
({1: [1, 2, 3]}, {1: [1, 2, 3]})
>>> a[1].append(4)
>>> a, b
({1: [1, 2, 3, 4]}, {1: [1, 2, 3, 4]})
In contrast, a deep copy will copy all contents by value.
>>> import copy
>>> c = copy.deepcopy(a)
>>> a, c
({1: [1, 2, 3, 4]}, {1: [1, 2, 3, 4]})
>>> a[1].append(5)
>>> a, c
({1: [1, 2, 3, 4, 5]}, {1: [1, 2, 3, 4]})
So:
- b = a: reference assignment; makes a and b point to the same object.
- b = a.copy(): shallow copying; a and b become two isolated objects, but their contents still share the same references.
- b = copy.deepcopy(a): deep copying; a and b's structure and contents become completely isolated.
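These three levels can be verified directly with Python's is operator:

```python
import copy

a = {1: [1, 2, 3]}

b = a                      # reference assignment
assert b is a              # the very same object

b = a.copy()               # shallow copy
assert b is not a          # distinct dict objects...
assert b[1] is a[1]        # ...but the inner list is still shared

b = copy.deepcopy(a)       # deep copy
assert b is not a
assert b[1] is not a[1]    # the inner list was duplicated too
assert b == a              # still equal by value
```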