R, Deep VS. Shallow Copies, Pass by Reference

When R passes variables to a function, it always passes by copy rather than by reference. Sometimes, however, a copy is not actually made until an assignment occurs. A more accurate description of the process is pass-by-promise. Take a look at the documentation:

?force
?delayedAssign

One practical implication is that it is very difficult if not impossible to avoid needing at least twice as much RAM as your objects nominally occupy. Modifying a large object will generally require making a temporary copy.
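
One way to see this copy-on-modify behaviour directly is with tracemem() (a minimal sketch; addresses and exact messages will differ by machine and R version):

```r
x <- runif(1e7)   # a ~76 MB numeric vector
tracemem(x)       # start reporting duplications of x

f <- function(v) {
  # The promise is evaluated without a copy; only this modification
  # forces R to duplicate the vector:
  v[1] <- 0
  invisible(v)
}

y <- f(x)  # tracemem reports the copy here, at the modification
```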

Update (2015): I do (and did) agree with Matt Dowle that his data.table package provides an alternate route to assignment that avoids the copy-duplication problem. If that was the update requested, then I didn't understand it at the time the suggestion was made.

There was a recent change in R 3.2.1 in the evaluation rules for apply and Reduce. It was announced on Stack Overflow, with a reference to the NEWS file, here: Returning anonymous functions from lapply - what is going wrong?

And the interesting paper cited by jhetzel in the comments is now here:

update by reference vs shallow copy

In data.table, := and all set* functions update objects by reference. This was introduced sometime around 2012, IIRC. At that time, base R always deep copied; shallow copying was introduced in R 3.1.0.


It's a wordy/lengthy answer, but I think this answers your first two questions:

How is this base R method different from the data.table equivalent? Is the difference related only to speed or also memory use?

In base R v3.1.0+ when we do:

DF1 = data.frame(x=1:5, y=6:10, z=11:15)
DF2 = DF1[, c("x", "y")]
DF3 = transform(DF2, y = ifelse(y>=8L, 1L, y))
DF4 = transform(DF2, y = 2L)
  1. From DF1 to DF2, both columns are only shallow copied.
  2. From DF2 to DF3 the column y alone had to be copied/re-allocated, but x gets shallow copied again.
  3. From DF2 to DF4, same as (2).

That is, columns are shallow copied as long as the column remains unchanged - in a way, the copy is being delayed unless absolutely necessary.
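
You can verify the sharing yourself with .Internal(inspect(...)) (a sketch; the printed addresses are machine-specific). The column addresses reported for DF1 and DF2 should coincide:

```r
DF1 <- data.frame(x = 1:5, y = 6:10, z = 11:15)
DF2 <- DF1[, c("x", "y")]

# In R >= 3.1.0 the x and y entries show identical addresses in both
# outputs, i.e. the columns are shared rather than duplicated:
.Internal(inspect(DF1))
.Internal(inspect(DF2))
```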

In data.table, we modify in-place, meaning that even in the DF3 and DF4 cases, column y doesn't get copied:

DT2[y >= 8L, y := 1L] ## (a)
DT2[, y := 2L]

Here, since y is already an integer column and we are modifying it with integer values, by reference, there's no new memory allocation at all.

This is also particularly useful when you'd like to sub-assign by reference (marked as (a) above). This is a handy feature we really like in data.table.

Another advantage that comes for free (which I came to know from our interactions) is when we have to convert, say, all columns of a data.table from character type to numeric type:

DT[, (cols) := lapply(.SD, as.numeric), .SDcols = cols]

Here, since we're updating by reference, each character column gets replaced by reference with its numeric counterpart. After that replacement, the earlier character column isn't required anymore and is up for grabs for garbage collection. But if you were to do this using base R:

DF[] = lapply(DF, as.numeric)

All the columns will have to be converted to numeric and held in a temporary variable, which will then finally be assigned back to DF. That means, if you have 10 columns with 100 million rows, each of character type, then your DF takes a space of:

10 * 100e6 * 4 / 1024^3 = ~ 3.7GB
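
That figure can be checked in R itself (assuming, as above, 4 bytes per element):

```r
cols <- 10
rows <- 100e6
cols * rows * 4 / 1024^3  # ~3.73 GB for the character columns
cols * rows * 8 / 1024^3  # ~7.45 GB for their numeric replacements
```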

And since the numeric type is twice the size, we'll need a total of 7.4GB + 3.7GB of space to make the conversion using base R.

But note that data.table copies during DF1 to DF2. That is:

DT2 = DT1[, c("x", "y")]

results in a copy, because we can't sub-assign by reference on a shallow copy; it would update all the clones.

What would be great is if we could integrate the shallow copy feature seamlessly, but keep track of whether a particular object's columns have multiple references, and update by reference wherever possible. R's upgraded reference counting feature might be very useful in this regard. In any case, we're working towards it.


For your last question:

"When is the difference most sizeable?"

  1. There are still people who have to use older versions of R, where deep copies can't be avoided.

  2. It depends on how many columns are copied because of the operations you perform. The worst case scenario is, of course, that all the columns are copied.

  3. There are cases like this where shallow copying won't benefit.

  4. When you'd like to update columns of a data.frame for each group, and there are too many groups.

  5. When you'd like to update a column of say, data.table DT1 based on a join with another data.table DT2 - this can be done as:

    DT1[DT2, col := i.val]

    where i.val refers to the value of the val column of DT2 (the i argument) for matching rows. This syntax performs the operation very efficiently, instead of having to first join the entire result and then update the required column.
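
A toy illustration of that join-update, with made-up data (note: the on= argument shown here requires a reasonably recent data.table; with older versions, set keys on both tables first):

```r
library(data.table)

DT1 <- data.table(id = 1:4, col = NA_real_)
DT2 <- data.table(id = c(2L, 4L), val = c(20.5, 40.5))

# For rows of DT1 whose id matches a row of DT2, assign DT2's val
# into DT1's col by reference - no intermediate joined table is kept:
DT1[DT2, col := i.val, on = "id"]

DT1$col  # NA 20.5 NA 40.5
```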

All in all, there are strong arguments where update by reference saves a lot of time and is fast. But people sometimes prefer not to update objects in-place and are willing to sacrifice speed/memory for that. We're trying to figure out how best to provide this functionality as well, in addition to the already existing update by reference.

Hope this helps. This is already quite a lengthy answer. I'll leave any questions you might have left to others or for you to figure out (other than any obvious misconceptions in this answer).

Understanding exactly when a data.table is a reference to (vs a copy of) another data.table

Yes, it's subassignment in R using <- (or = or ->) that makes a copy of the whole object. You can trace that using tracemem(DT) and .Internal(inspect(DT)), as below. The data.table features := and set() assign by reference to whatever object they are passed. So if that object was previously copied (by a subassigning <- or an explicit copy(DT)) then it's the copy that gets modified by reference.

DT <- data.table(a = c(1, 2), b = c(11, 12)) 
newDT <- DT

.Internal(inspect(DT))
# @0000000003B7E2A0 19 VECSXP g0c7 [OBJ,NAM(2),ATT] (len=2, tl=100)
# @00000000040C2288 14 REALSXP g0c2 [NAM(2)] (len=2, tl=0) 1,2
# @00000000040C2250 14 REALSXP g0c2 [NAM(2)] (len=2, tl=0) 11,12
# ATTRIB: # ..snip..

.Internal(inspect(newDT)) # precisely the same object at this point
# @0000000003B7E2A0 19 VECSXP g0c7 [OBJ,NAM(2),ATT] (len=2, tl=100)
# @00000000040C2288 14 REALSXP g0c2 [NAM(2)] (len=2, tl=0) 1,2
# @00000000040C2250 14 REALSXP g0c2 [NAM(2)] (len=2, tl=0) 11,12
# ATTRIB: # ..snip..

tracemem(newDT)
# [1] "<0x0000000003b7e2a0"

newDT$b[2] <- 200
# tracemem[0000000003B7E2A0 -> 00000000040ED948]:
# tracemem[00000000040ED948 -> 00000000040ED830]: .Call copy $<-.data.table $<-

.Internal(inspect(DT))
# @0000000003B7E2A0 19 VECSXP g0c7 [OBJ,NAM(2),TR,ATT] (len=2, tl=100)
# @00000000040C2288 14 REALSXP g0c2 [NAM(2)] (len=2, tl=0) 1,2
# @00000000040C2250 14 REALSXP g0c2 [NAM(2)] (len=2, tl=0) 11,12
# ATTRIB: # ..snip..

.Internal(inspect(newDT))
# @0000000003D97A58 19 VECSXP g0c7 [OBJ,NAM(2),ATT] (len=2, tl=100)
# @00000000040ED7F8 14 REALSXP g0c2 [NAM(2)] (len=2, tl=0) 1,2
# @00000000040ED8D8 14 REALSXP g0c2 [NAM(2)] (len=2, tl=0) 11,200
# ATTRIB: # ..snip..

Notice how even the a vector was copied (different hex value indicates new copy of vector), even though a wasn't changed. Even the whole of b was copied, rather than just changing the elements that need to be changed. That's important to avoid for large data, and why := and set() were introduced to data.table.

Now, with our copied newDT we can modify it by reference:

newDT
# a b
# [1,] 1 11
# [2,] 2 200

newDT[2, b := 400]
# a b # See FAQ 2.21 for why this prints newDT
# [1,] 1 11
# [2,] 2 400

.Internal(inspect(newDT))
# @0000000003D97A58 19 VECSXP g0c7 [OBJ,NAM(2),ATT] (len=2, tl=100)
# @00000000040ED7F8 14 REALSXP g0c2 [NAM(2)] (len=2, tl=0) 1,2
# @00000000040ED8D8 14 REALSXP g0c2 [NAM(2)] (len=2, tl=0) 11,400
# ATTRIB: # ..snip ..

Notice that all 3 hex values (the vector of column pointers, and each of the 2 columns) remain unchanged. So it was truly modified by reference with no copies at all.

Or, we can modify the original DT by reference:

DT[2, b := 600]
# a b
# [1,] 1 11
# [2,] 2 600

.Internal(inspect(DT))
# @0000000003B7E2A0 19 VECSXP g0c7 [OBJ,NAM(2),ATT] (len=2, tl=100)
# @00000000040C2288 14 REALSXP g0c2 [NAM(2)] (len=2, tl=0) 1,2
# @00000000040C2250 14 REALSXP g0c2 [NAM(2)] (len=2, tl=0) 11,600
# ATTRIB: # ..snip..

Those hex values are the same as the original values we saw for DT above. Type example(copy) for more examples using tracemem and comparison to data.frame.

Btw, if you tracemem(DT) then DT[2,b:=600] you'll see one copy reported. That is a copy of the first 10 rows that the print method does. When wrapped with invisible() or when called within a function or script, the print method isn't called.

All this applies inside functions too; i.e., := and set() do not copy on write, even within functions. If you need to modify a local copy, then call x=copy(x) at the start of the function. But, remember data.table is for large data (as well as faster programming advantages for small data). We deliberately don't want to copy large objects (ever). As a result we don't need to allow for the usual 3* working memory factor rule of thumb. We try to only need working memory as large as one column (i.e. a working memory factor of 1/ncol rather than 3).
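
A small sketch of that pattern (hypothetical function names), contrasting a by-reference modification with an explicit copy() at the top of a function:

```r
library(data.table)

f_by_ref <- function(x) {
  x[, z := 1L]          # := does not copy, even inside a function:
}                       # the caller's table gains column z

f_local <- function(x) {
  x <- copy(x)          # take a deep copy up front...
  x[, z := 1L]          # ...so this := only touches the local copy
  x
}

DT <- data.table(a = 1:3)
f_local(DT)
"z" %in% names(DT)      # FALSE - DT untouched
f_by_ref(DT)
"z" %in% names(DT)      # TRUE  - DT modified by reference
```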

Shallow and deep copy of R6 elements in a list

You can just copy each element of the list one by one in a for loop. I suppose that might come at a performance cost when dealing with large lists, though, and it is probably not the cleanest method.

A = list(a1 = C$new(1), a2 = C$new(2))

B = list()
for (name in names(A)) {
  B[[name]] = A[[name]]$clone(deep = TRUE)
}

print(B[["a1"]]$x) # [1] 1
A[["a1"]]$x = 4
print(B[["a1"]]$x) # [1] 1 -- the deep clone is unaffected

What is the difference between shallow copy, deepcopy and normal assignment operation?

Normal assignment operations will simply point the new variable towards the existing object. The docs explain the difference between shallow and deep copies:

The difference between shallow and deep copying is only relevant for
compound objects (objects that contain other objects, like lists or
class instances):

  • A shallow copy constructs a new compound object and then (to the extent possible) inserts references into it to the objects found in the original.

  • A deep copy constructs a new compound object and then, recursively, inserts copies into it of the objects found in the
    original.

Here's a little demonstration:

import copy

a = [1, 2, 3]
b = [4, 5, 6]
c = [a, b]

Using normal assignment operations to copy:

d = c

print(id(c) == id(d))        # True - d is the same object as c
print(id(c[0]) == id(d[0]))  # True - d[0] is the same object as c[0]

Using a shallow copy:

d = copy.copy(c)

print(id(c) == id(d))        # False - d is now a new object
print(id(c[0]) == id(d[0]))  # True - d[0] is the same object as c[0]

Using a deep copy:

d = copy.deepcopy(c)

print(id(c) == id(d))        # False - d is now a new object
print(id(c[0]) == id(d[0]))  # False - d[0] is now a new object

Is data really copied four times in R's replacement functions?

NOTE: Unless otherwise specified, all explanations below are valid for R versions < 3.1.0. There are great improvements made in R v3.1.0, which is also briefly touched upon here.

To answer your first question, "why four copies and shouldn't one be enough?", we'll begin by quoting the relevant part from R-internals first:

A 'named' value of 2, NAM(2), means that the object must be duplicated before being changed. (Note that this does not say that it is necessary to duplicate, only that it should be duplicated whether necessary or not.) A value of 0 means that it is known that no other SEXP shares data with this object, and so it may safely be altered.

A value of 1 is used for situations like dim(a) <- c(7, 2) where in principle two copies of a exist for the duration of the computation as (in principle)
a <- dim<-(a, c(7, 2)) but for no longer, and so some primitive functions can be optimized to avoid a copy in this case.

NAM(1):

Let's start with NAM(1) objects. Here's an example:

x <- 1:5 # (1)
.Internal(inspect(x))
# @10374ecc8 13 INTSXP g0c3 [NAM(1)] (len=5, tl=0) 1,2,3,4,5
tracemem(x)
# [1] "<0x10374ecc8>"

x[2L] <- 10L # (2)
.Internal(inspect(x))
# @10374ecc8 13 INTSXP g0c3 [MARK,NAM(1),TR] (len=5, tl=0) 1,10,3,4,5

What's happening here? We created an integer vector using :; since : is a primitive, the result is a NAM(1) object. And when we used [<- on that object, the value got changed in-place (note that the pointers in (1) and (2) are identical). This is because [<-, being a primitive, knows quite well how to handle its inputs and is optimised to avoid a copy in this scenario.

y = x # (3)
.Internal(inspect(x))
# @10374ecc8 13 INTSXP g0c3 [MARK,NAM(2),TR] (len=5, tl=0) 1,10,3,4,5

x[2L] <- 20L # (4)
.Internal(inspect(x))
# tracemem[0x10374ecc8 -> 0x10372f328]:
# @10372f328 13 INTSXP g0c3 [NAM(1),TR] (len=5, tl=0) 1,20,3,4,5

Now the same assignment results in a copy. Why? By doing (3), the 'named' field gets incremented to NAM(2), as more than one object points to the same data. Even though [<- is optimised, the fact that the object is NAM(2) means it must be duplicated. That's why it's again a NAM(1) object after the assignment: the call to duplicate sets named to 0 and the new assignment bumps it back to 1.

Note: Peter Dalgaard explains this case nicely in this link as to why x = 2L results in NAM(2) object.



NAM(2):

Now let's return to your question on calling *<- on a data.frame which is a NAM(2) object.

The first question then is, why is data.frame() a NAM(2) object? Why not a NAM(1) like the earlier case x <- 1:5? Duncan Murdoch answers this very nicely on the same post:

data.frame() is a plain R function, so it is treated no differently than any user-written function. On the other hand, the internal function that implements the : operator is a primitive, so it has complete control over its return value, and it can set NAMED in the most efficient way.

This means any attempt to change the value would result in triggering a duplicate (a deep copy). From ?tracemem:

... any copying of the object by the C function duplicate produces a message to standard output.

So a message from tracemem helps understand the number of copies. To understand the first line of your tracemem output, let's construct the function f<-, which does no actual replacement. Also, let's construct a data.frame big enough so that we can measure the time taken for a single copy of that data.frame.

## R v 3.0.3
`f<-` = function(x, value) {
  return(x) ## no actual replacement
}

df <- data.frame(x=1:1e8, y=1:1e8) # 762.9 Mb
tracemem(df) # [1] "<0x7fbccd2f4ae8>"

require(data.table)
system.time(copy(df))
# tracemem[0x7fbccd2f4ae8 -> 0x7fbccd2f4ff0]: copy system.time
# user system elapsed
# 0.609 0.484 1.106

system.time(f(df) <- 3)
# tracemem[0x7fbccd2f4ae8 -> 0x7fbccd2f4f10]: system.time
# user system elapsed
# 0.608 0.480 1.101

I've used the function copy() from data.table (which basically calls the C duplicate function). The times for copying are more or less identical. So, the first step is clearly a deep copy, even if it did nothing.

This explains the first two verbose messages from tracemem in your post:

(1) From the global environment we called f(df) <- 3. Here's one copy.

(2) From within the function f<-, another assignment x[1,1] <- 3, which calls [<- (and hence the [<-.data.frame method). That makes the second copy immediately.

Finding the rest of the copies is easy with a debugonce() on [<-.data.frame. That is, doing:

debugonce(`[<-`)
df <- data.frame(x=1:1e8, y=1:1e8)
`f<-` = function(x, value) {
  x[1,1] = value
  return(x)
}
tracemem(df)
f(df) = 3

# first three lines:

# tracemem[0x7f8ba33d8a08 -> 0x7f8ba33d8d50]: (1)
# tracemem[0x7f8ba33d8d50 -> 0x7f8ba33d8a78]: f<- (2)
# debugging in: `[<-.data.frame`(`*tmp*`, 1L, 1L, value = 3L)

By hitting enter, you'll find the other two copies to be inside this function:

# debug: class(x) <- NULL
# tracemem[0x7f8ba33d8a78 -> 0x7f8ba3cd6078]: [<-.data.frame [<- f<- (3)

# debug: x[[jj]][iseq] <- vjj
# tracemem[0x7f8ba3cd6078 -> 0x7f882c35ed40]: [<-.data.frame [<- f<- (4)

Note that class is primitive but it's being called on a NAM(2) object. I suspect that's the reason for the copy there. And the last copy is inevitable as it modifies the column.

So, there you go.


Now a small note on R v3.1.0:

I also tested the same in R v3.1.0. tracemem provides all four lines. However, the only time-consuming step is (4). IIUC, the remaining cases, all due to [<- / class<-, should trigger a shallow copy instead of a deep copy. What's awesome is that, even in (4), only the column being modified seems to be deep copied. R 3.1.0 has great improvements!

This means tracemem produces output for shallow copies too, which is a bit confusing, since the documentation doesn't explicitly state that, and it makes it hard to tell a shallow copy from a deep copy except by measuring time. Perhaps that's my (incorrect) understanding. Feel free to correct me.


On your part 2, I'll quote Luke Tierney from here:

Calling a foo<- function directly is not a good idea unless you really understand what is going on in the assignment mechanism in general and in the particular foo<- function. It is definitely not something to be done in routine programming unless you like unpleasant surprises.

But I am unable to tell if these unpleasant surprises extend to an object that's already NAM(2). Matt was calling it on a list, created by a primitive and therefore NAM(1), so calling foo<- directly wouldn't increment its 'named' value.

But, the fact that R v3.1.0 has great improvements should already convince you that such a function call is not necessary anymore.

HTH.

PS: Feel free to correct me (and help me shorten this answer if possible) :).


Edit: I seem to have missed the point about a copy being reduced when calling f<- directly, as spotted in the comments. It's pretty easy to see using the function Simon Urbanek used in the post (that's been linked multiple times now):

# rm(list=ls()) # to make sure there's no other object in your workspace
`f<-` <- function(x, value) {
  print(ls(env = parent.frame()))
}

df <- data.frame(x=1, y=2)
tracemem(df) # [1] "<0x7fce01a65358>"

f(df) = 3
# tracemem[0x7fce0359b2a0 -> 0x7fce0359ae08]:
# [1] "*tmp*" "df" "f<-"

df <- data.frame(x=1, y=2)
tracemem(df) # [1] "<0x7fce03c505c0>"
df <- `f<-`(df, 3)
# [1] "df" "f<-"

As you can see, in the first method there's an object *tmp* being created, which is not the case in the second. And it seems that the creation of this *tmp* object for a NAM(2) input triggers a copy of the input before *tmp* gets assigned to the function argument. But that's as far as my understanding goes.


