"Update by Reference" Vs Shallow Copy


In data.table, := and all set* functions update objects by reference. This was introduced sometime around 2012, IIRC. At that time, base R did not shallow copy but deep copied; shallow copying was introduced in R 3.1.0.


It's a wordy/lengthy answer, but I think this answers your first two questions:

How is this base R method different from the data.table equivalent? Is the difference related only to speed or also memory use?

In base R v3.1.0+ when we do:

DF1 = data.frame(x=1:5, y=6:10, z=11:15)
DF2 = DF1[, c("x", "y")]
DF3 = transform(DF2, y = ifelse(y>=8L, 1L, y))
DF4 = transform(DF2, y = 2L)
  1. From DF1 to DF2, both columns are only shallow copied.
  2. From DF2 to DF3, the column y alone has to be copied/re-allocated, but x gets shallow copied again.
  3. From DF2 to DF4, same as (2).

That is, columns are shallow copied as long as they remain unchanged - in a way, the copy is delayed until it is absolutely necessary.
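
One way to see this for yourself is to compare column addresses before and after (a small sketch; data.table::address() is used here purely to inspect where each column vector lives, and the expected TRUE/FALSE results follow the points above):

library(data.table)   # only for address()

DF1 = data.frame(x = 1:5, y = 6:10, z = 11:15)
DF2 = DF1[, c("x", "y")]
DF3 = transform(DF2, y = ifelse(y >= 8L, 1L, y))

address(DF1$x) == address(DF2$x)   # expected TRUE:  x is shared, not copied (point 1)
address(DF2$y) == address(DF3$y)   # expected FALSE: y had to be re-allocated (point 2)
address(DF2$x) == address(DF3$x)   # expected TRUE:  x is shallow copied again (point 2)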

In data.table, we modify in place. That means even for the operations corresponding to DF3 and DF4, column y doesn't get copied:

DT2[y >= 8L, y := 1L] ## (a)
DT2[, y := 2L]

Here, since y is already an integer column and we are modifying it with integers, by reference, there's no new memory allocation at all.

This is also particularly useful when you'd like to sub-assign by reference (marked as (a) above). This is a handy feature we really like in data.table.
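
A quick sketch to convince yourself that no new column is allocated (DT2 here is simply the data.table counterpart of DF2, and address() is again used only for inspection):

library(data.table)
DT2 = data.table(x = 1:5, y = 6:10)
before = address(DT2$y)

DT2[y >= 8L, y := 1L]   # (a) sub-assign by reference
DT2[, y := 2L]          # update the whole column by reference

identical(before, address(DT2$y))   # expected TRUE: the same vector was modified in place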

Another advantage that comes for free (which I came to know about from our interactions) is when we have to convert, say, all columns of a data.table from character type to numeric type:

DT[, (cols) := lapply(.SD, as.numeric), .SDcols = cols]

Here, since we're updating by reference, each character column gets replaced by reference with its numeric counterpart. After that replacement, the earlier character column isn't required anymore and is up for grabs for garbage collection. But if you were to do this using base R:

DF[] = lapply(DF, as.numeric)

All the columns will have to be converted to numeric first, held in a temporary variable, and then finally assigned back to DF. That means, if you have 10 columns with 100 million rows, each of character type, then your DF takes up:

10 * 100e6 * 4 / 1024^3 = ~ 3.7GB

And since the numeric type is twice as large, we'll need a total of 7.4GB + 3.7GB of space to make the conversion using base R.
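
If you'd rather measure this than compute it by hand, here is a rough sketch using bench::mark()'s mem_alloc column (with a much smaller, made-up table so it runs quickly; the size n and the names DF, DT, cols are purely illustrative):

library(data.table)
library(bench)

n    = 1e6
DF   = data.frame(a = as.character(1:n), b = as.character(1:n),
                  stringsAsFactors = FALSE)
DT   = as.data.table(DF)
cols = names(DT)

bench::mark(
  base       = { DF[] = lapply(DF, as.numeric) },                        # copies, then assigns back
  data.table = DT[, (cols) := lapply(.SD, as.numeric), .SDcols = cols],  # replaces by reference
  check = FALSE
)[, c("expression", "mem_alloc")]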

But note that data.table copies when going from DT1 to DT2 too. That is:

DT2 = DT1[, c("x", "y")]

results in a copy, because we can't sub-assign by reference on a shallow copy - it would update all the clones.
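
You can verify that copy with the same address() check as before (a sketch, with data.table loaded; the FALSE result follows the note above):

library(data.table)
DT1 = data.table(x = 1:5, y = 6:10, z = 11:15)
DT2 = DT1[, c("x", "y")]
address(DT1$x) == address(DT2$x)   # expected FALSE: the columns are copied, unlike the base R subset above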

What would be great is if we could seamlessly integrate the shallow-copy feature, but keep track of whether a particular object's columns have multiple references, and update by reference wherever possible. R's upgraded reference counting feature might be very useful in this regard. In any case, we're working towards it.


For your last question:

"When is the difference most sizeable?"

  1. There are still people who have to use older versions of R, where deep copies can't be avoided.

  2. It depends on how many columns are being copied because of the operations you perform on them. The worst case, of course, is that all the columns get copied.

  3. There are cases where shallow copying won't benefit at all.

  4. When you'd like to update columns of a data.frame for each group, and there are too many groups.

  5. When you'd like to update a column of, say, data.table DT1 based on a join with another data.table DT2 - this can be done as:

    DT1[DT2, col := i.val]

    where i.val refers to the value of DT2's val column (the i argument) for the matching rows. This syntax performs the operation very efficiently, instead of having to first join the entire result and then update the required column; see the self-contained sketch after this list.
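
Here is a minimal, self-contained sketch of that join-update (the table and column names are made up purely for illustration):

library(data.table)
DT1 = data.table(id = 1:5, col = NA_real_)
DT2 = data.table(id = c(2L, 4L), val = c(20.5, 40.5))

DT1[DT2, col := i.val, on = .(id)]   # update col by reference, only for matching ids
DT1
#    id  col
# 1:  1   NA
# 2:  2 20.5
# 3:  3   NA
# 4:  4 40.5
# 5:  5   NA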

All in all, there are strong arguments where update by reference would save a lot of time and memory. But people sometimes prefer not to update objects in place, and are willing to sacrifice speed/memory for that. We're trying to figure out how best to provide this functionality as well, in addition to the already existing update by reference.

Hope this helps. This is already quite a lengthy answer, so I'll leave any questions you might have left to others or for you to figure out (other than any obvious misconceptions in this answer).

How can I make update by reference faster than shallow copy?

Here is a base R "language" (metaprogramming) way to approach it. Basically, data.table is awesome and does its best to optimize. However, programmatic trickery like mget makes data.table less efficient, because it has to assume that the j expression could include any column, and therefore it needs to allocate memory for all columns before evaluating.

The idea is to translate what mget is actually doing into language objects, so that the call which gets evaluated is effectively a[b, (cols) := list(...), on = .(id)], with the columns in cols spelled out as plain symbols:

nms = lapply(cols, as.name)   # the column names in cols, as symbols
# build and evaluate: a[b, (cols) := list(col1, col2, ...), on = .(id)]
eval(substitute(a[b, (cols) := .x, on = .(id)],
                list(.x = as.call(c(quote(list), nms)))))

## performance against mget()

bench::mark(
  lang = {
    nms = lapply(cols, as.name)
    eval(substitute(a[b, (cols) := .x, on = .(id)], list(.x = as.call(c(quote(list), nms)))))
  },
  mget_approach = a[b, (cols) := mget(cols), on = .(id)],
  normal = b2[a2, on = .(id)]
)

# A tibble: 3 x 13
  expression       min  median `itr/sec` mem_alloc `gc/sec` n_itr
  <bch:expr>   <bch:t> <bch:t>     <dbl> <bch:byt>    <dbl> <int>
1 lang           157ms   160ms      5.97      23MB     1.99     3
2 mget_approach  244ms   255ms      3.92    57.3MB     5.87     2
3 normal         162ms   174ms      5.82    34.4MB     3.88     3

## additional data to make the above work: a2 = copy(a); b2 = copy(b)

Here is the data.table source code where mget is detected and has to be accounted for:

https://github.com/Rdatatable/data.table/blob/94a12475f737892c542d3cb7daf42e534ea13a22/R/data.table.R#L1044-L1049

# added 'mget' - fix for #994
if (any(c("get", "mget") %chin% av)){
  if (verbose)
    catf("'(m)get' found in j. ansvars being set to all columns. Use .SDcols or a single j=eval(macro) instead. Both will detect the columns used which is important for efficiency.\nOld ansvars: %s \n", brackify(ansvars))
  # get('varname') is too difficult to detect which columns are used in general
  # eval(macro) column names are detected via the if jsub[[1]]==eval switch earlier above.

What is the difference between a deep copy and a shallow copy?

Shallow copies duplicate as little as possible. A shallow copy of a collection is a copy of the collection structure, not the elements. With a shallow copy, two collections now share the individual elements.

Deep copies duplicate everything. A deep copy of a collection is two collections with all of the elements in the original collection duplicated.
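
In R terms, a small sketch tying this back to the examples above (data.table::copy() gives a deep copy, while a base R column subset on R >= 3.1.0 is the shallow case; address() is used only for inspection):

library(data.table)

DF1  = data.frame(x = 1:5, y = 6:10)
DF2  = DF1[, c("x", "y")]   # shallow copy: new data.frame structure, same column vectors
DT   = as.data.table(DF1)
deep = copy(DT)             # deep copy: new structure and new column vectors

address(DF1$x) == address(DF2$x)    # expected TRUE:  the elements are shared
address(DT$x)  == address(deep$x)   # expected FALSE: everything was duplicated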

R, deep vs. shallow copies, pass by reference

When R passes variables to functions, it is always by copy rather than by reference. Sometimes, however, a copy will not actually be made until an assignment occurs. The real description of the process is pass-by-promise. Take a look at the documentation:

?force
?delayedAssign

One practical implication is that it is very difficult if not impossible to avoid needing at least twice as much RAM as your objects nominally occupy. Modifying a large object will generally require making a temporary copy.
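
A small sketch with base R's tracemem() (available in standard CRAN builds) showing that the copy is made only once the object is actually modified:

DF = data.frame(x = 1:5)
tracemem(DF)                              # start reporting duplications of DF

read_only = function(d) d$x               # only reads its argument: no duplication reported
read_only(DF)

modify = function(d) { d$x[1] = 0L; d }   # modifying the argument forces a copy
DF2 = modify(DF)                          # tracemem reports the duplication here

untracemem(DF)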

Update (2015): I do (and did) agree with Matt Dowle that his data.table package provides an alternative route to assignment that avoids the copy-duplication problem. If that was the update requested, then I didn't understand it at the time the suggestion was made.

There was a recent change in R 3.2.1 to the evaluation rules for apply and Reduce. It was announced on SO, with reference to the NEWS file, here: Returning anonymous functions from lapply - what is going wrong?

And the interesting paper cited by jhetzel in the comments is now here:

Understanding dict.copy() - shallow or deep?

By "shallow copying" it means the content of the dictionary is not copied by value, but just creating a new reference.

>>> a = {1: [1,2,3]}
>>> b = a.copy()
>>> a, b
({1: [1, 2, 3]}, {1: [1, 2, 3]})
>>> a[1].append(4)
>>> a, b
({1: [1, 2, 3, 4]}, {1: [1, 2, 3, 4]})

In contrast, a deep copy will copy all contents by value.

>>> import copy
>>> c = copy.deepcopy(a)
>>> a, c
({1: [1, 2, 3, 4]}, {1: [1, 2, 3, 4]})
>>> a[1].append(5)
>>> a, c
({1: [1, 2, 3, 4, 5]}, {1: [1, 2, 3, 4]})

So:

  1. b = a: Reference assignment, makes a and b point to the same object.

    Illustration of 'b = a': 'a' and 'b' both point to '{1: L}', and 'L' points to '[1, 2, 3]'.

  2. b = a.copy(): Shallow copying, a and b become two isolated objects, but their contents still share the same references.

    Illustration of 'b = a.copy()': 'a' points to '{1: L}', 'b' points to '{1: M}', 'L' and 'M' both point to '[1, 2, 3]'.

  3. b = copy.deepcopy(a): Deep copying, a and b's structure and content become completely isolated.

    Illustration of 'b = copy.deepcopy(a)': 'a' points to '{1: L}', 'L' points to '[1, 2, 3]'; 'b' points to '{1: M}', 'M' points to a different instance of '[1, 2, 3]'.


