What Exactly Is Copy-On-Modify Semantics in R, and Where Is the Canonical Source

What exactly is copy-on-modify semantics in R, and where is the canonical source?

Call-by-value

The R Language Definition says this (in section 4.3.3 Argument Evaluation)

The semantics of invoking a function in R argument are call-by-value. In general, supplied arguments behave as if they are local variables initialized with the value supplied and the name of the corresponding formal argument. Changing the value of a supplied argument within a function will not affect the value of the variable in the calling frame. [Emphasis added]

Whilst this does not describe the mechanism by which copy-on-modify works, it does mention that changing an object passed to a function doesn't affect the original in the calling frame.

Additional information, particularly on the copy-on-modify aspect are given in the description of SEXPs in the R Internals manual, section 1.1.2 Rest of Header. Specifically it states [Emphasis added]

The named field is set and accessed by the SET_NAMED and NAMED
macros, and take values 0, 1 and 2. R has a 'call by value'
illusion, so an assignment like
b <- a
appears to make a copy of a and refer to it as b. However, if
neither a nor b are subsequently altered there is no need to copy.
What really happens is that a new symbol b is bound to the same
value as a and the named field on the value object is set (in this
case to 2). When an object is about to be altered, the named field
is consulted. A value of 2 means that the object must be duplicated
before being changed. (Note that this does not say that it is
necessary to duplicate, only that it should be duplicated whether
necessary or not.) A value of 0 means that it is known that no other
SEXP shares data with this object, and so it may safely be altered.
A value of 1 is used for situations like
dim(a) <- c(7, 2)
where in principle two copies of a exist for the duration of the
computation as (in principle)
a <- `dim<-`(a, c(7, 2))
but for no longer, and so some primitive functions can be optimized to
avoid a copy in this case.

Whilst this doesn't describe the situation whereby objects are passed to functions as arguments, we might deduce that the same process operates, especially given the information from the R Language definition quoted earlier.

Promises in function evaluation

I don't think it is quite correct to say that a promise is passed to the function. The arguments are passed to the function and the actual expressions used are stored as promises (plus a pointer to the calling environment). Only when an argument gets evaluated is the expression stored in the promise retrieved and evaluated within the environment indicated by the pointer, a process known as forcing.

As such, I don't believe it is correct to talk about pass-by-reference in this regard. R has call-by-value semantics but tries to avoid copying unless a value passed to an argument is evaluated and modified.

The NAMED mechanism is an optimisation (as noted by @hadley in the comments) which allows R to track whether a copy needs to be made upon modification. There are some subtleties involved with exactly how the NAMED mechanism operates, as discussed by Peter Dalgaard (in the R Devel thread @mnel cites in their comment to the question)

Is slicing copy-on-modify in R?

This is a complex topic. You should start with reading about the NAMED mechanism.

If you run the following, you see that there is no copy of the list elements (because lists are basically pointers to their elements):

> a <- list(1, 2, 3, 4, 5)
> 
> b <- a[1:2]
> .Internal(inspect(b)) 
@0x000000001327e5b8 19 VECSXP g0c2 [NAM(3)] (len=2, tl=0)
  @0x00000000136f6b60 14 REALSXP g0c1 [NAM(3)] (len=1, tl=0) 1
  @0x00000000136f6b28 14 REALSXP g0c1 [NAM(3)] (len=1, tl=0) 2
> 
> 
> c <- a[1:2]
> .Internal(inspect(c)) 
@0x000000001327e678 19 VECSXP g0c2 [NAM(3)] (len=2, tl=0)
  @0x00000000136f6b60 14 REALSXP g0c1 [NAM(3)] (len=1, tl=0) 1
  @0x00000000136f6b28 14 REALSXP g0c1 [NAM(3)] (len=1, tl=0) 2
> 
> b[1] <- 6
> .Internal(inspect(b)) 
@0x000000001327e6f8 19 VECSXP g0c2 [NAM(1)] (len=2, tl=0)
  @0x0000000013745b58 14 REALSXP g0c1 [] (len=1, tl=0) 6
  @0x00000000136f6b28 14 REALSXP g0c1 [NAM(3)] (len=1, tl=0) 2
> 
> .Internal(inspect(c))
@0x000000001327e678 19 VECSXP g0c2 [NAM(3)] (len=2, tl=0)
  @0x00000000136f6b60 14 REALSXP g0c1 [NAM(3)] (len=1, tl=0) 1
  @0x00000000136f6b28 14 REALSXP g0c1 [NAM(3)] (len=1, tl=0) 2

This is different if you subset vectors.

You might also be interested in the new reference counting mechanism.

Copy-on-modify semantic on a vector does not append in a loop. Why?

I complete the @MikeH. awnser with this code

library(pryr)

x = runif(10)
y = numeric(length(x))
print(c(address(y), refs(y)))

for(i in 2:length(x))
{
  y[i] = x[i] - x[i-1]
  print(c(address(y), refs(y)))
}

print(c(address(y), refs(y)))

The output shows clearly what happened

[1] "0x7872180" "2"        
[1] "0x765b860" "1"        
[1] "0x765b860" "1"        
[1] "0x765b860" "1"        
[1] "0x765b860" "1"        
[1] "0x765b860" "1"        
[1] "0x765b860" "1"        
[1] "0x765b860" "1"        
[1] "0x765b860" "1"        
[1] "0x765b860" "1" 
[1] "0x765b860" "2"

There is a copy at the first iteration. Indeed because of Rstudio there are 2 refs. But after this first copy y belongs in the loops and is not available into the global environment. Then, Rstudio does not create any additional refs and thus no copy is made during the next updates. y is updated by reference. On loop exit y become available in the global environment. Rstudio creates an extra refs but this action does not change the address obviously.

First bracketed assignment is as time-consuming as full assignment?

Compare the "NAM()" part of

> a <- rep(1L, 10)
> .Internal(inspect(a))
@457b840 13 INTSXP g0c4 [NAM(1)] (len=10, tl=0) 1,1,1,1,1,...

versus

> system.time(a <- rep(1L, 10))
[...]
> .Internal(inspect(a))
@4626f88 13 INTSXP g0c4 [NAM(2)] (len=10, tl=0) 1,1,1,1,1,...

The "1" in the first example means that R thinks there is one reference to a, hence can be updated in place. The "2" means that R thinks there have been at least two references to a, hence duplication required if modified. Roughly, I rationalize this as the representation of the return value of rep() inside system.time, and its value outside system.time; the moral equivalent of f = function() { x <- rep(1L, 10); x }; a = f() rather than g = function() rep(1L, 10); a = g().

The real-world code a <- rep(1L, 10^8); a[123L] <- 231L would not involve a copy. We can time the assignment without artificially incrementing the NAMED count with

> a <- rep(1L, 10^8)
> .Internal(inspect(a))
@7f972b571010 13 INTSXP g0c7 [NAM(1)] (len=100000000, tl=0) 1,1,1,1,1,...
> system.time(a[123L] <- a[321L])
   user  system elapsed 
      0       0       0

Copy On Modify; What Happens When You Run This Code? x - list(1:10); x[[2]] - x

Maybe this will help. Here we use pryr::address to see the memory location where objects are stored (note, your actual addresses may vary, but when I have matching addresses, your addresses should match as well).

library(pryr)
x <- list(1:10)
pryr::address(x)
# [1] "0x3452810"
y <- x[[1]]
pryr::address(y)
# [1] "0x16b53bf0"

So we have an list x at a given location. We can think of lists in R as collections of pointers to other objects. We can't directly take the address of where it's storing it's first item (at least, I don't know how with address), but we can store that value to y and since R will only change address when objects are modified, we can assume this is where that first value is stored. Now let's update x

x[[2]] <- x
pryr::address(x)
# [1] "0x16001018"

we can see that x has changed and been given a new memory location

y <- x[[1]]
pryr::address(y)
# [1] "0x16b53bf0"

Note that the first element though is still at the same memory address. So a new copy of this vector hasn't been made. The new list just points to that same vector. Now let's look at the address of the value we just added

y <- x[[2]]
pryr::address(y)
# [1] "0x3452810"

Note that this value now points to the old memory address where the original x lived.

And further more

y <- x[[2]][[1]]
pryr::address(y)
# [1] "0x16b53bf0"

both lists point to the same 1:10 vector. It's only stored once.

So when you do x[[2]]<-x what you are doing is creating a new list. This new list contains two "pointers" essentially. One to the same vector that was in the original list, and one that points to the original address of the list.

Understanding exactly when a data.table is a reference to (vs a copy of) another data.table

Yes, it's subassignment in R using <- (or = or ->) that makes a copy of the whole object. You can trace that using tracemem(DT) and .Internal(inspect(DT)), as below. The data.table features := and set() assign by reference to whatever object they are passed. So if that object was previously copied (by a subassigning <- or an explicit copy(DT)) then it's the copy that gets modified by reference.

DT <- data.table(a = c(1, 2), b = c(11, 12)) 
newDT <- DT 

.Internal(inspect(DT))
# @0000000003B7E2A0 19 VECSXP g0c7 [OBJ,NAM(2),ATT] (len=2, tl=100)
#   @00000000040C2288 14 REALSXP g0c2 [NAM(2)] (len=2, tl=0) 1,2
#   @00000000040C2250 14 REALSXP g0c2 [NAM(2)] (len=2, tl=0) 11,12
# ATTRIB:  # ..snip..

.Internal(inspect(newDT))   # precisely the same object at this point
# @0000000003B7E2A0 19 VECSXP g0c7 [OBJ,NAM(2),ATT] (len=2, tl=100)
#   @00000000040C2288 14 REALSXP g0c2 [NAM(2)] (len=2, tl=0) 1,2
#   @00000000040C2250 14 REALSXP g0c2 [NAM(2)] (len=2, tl=0) 11,12
# ATTRIB:  # ..snip..

tracemem(newDT)
# [1] "<0x0000000003b7e2a0"

newDT$b[2] <- 200
# tracemem[0000000003B7E2A0 -> 00000000040ED948]: 
# tracemem[00000000040ED948 -> 00000000040ED830]: .Call copy $<-.data.table $<- 

.Internal(inspect(DT))
# @0000000003B7E2A0 19 VECSXP g0c7 [OBJ,NAM(2),TR,ATT] (len=2, tl=100)
#   @00000000040C2288 14 REALSXP g0c2 [NAM(2)] (len=2, tl=0) 1,2
#   @00000000040C2250 14 REALSXP g0c2 [NAM(2)] (len=2, tl=0) 11,12
# ATTRIB:  # ..snip..

.Internal(inspect(newDT))
# @0000000003D97A58 19 VECSXP g0c7 [OBJ,NAM(2),ATT] (len=2, tl=100)
#   @00000000040ED7F8 14 REALSXP g0c2 [NAM(2)] (len=2, tl=0) 1,2
#   @00000000040ED8D8 14 REALSXP g0c2 [NAM(2)] (len=2, tl=0) 11,200
# ATTRIB:  # ..snip..

Notice how even the a vector was copied (different hex value indicates new copy of vector), even though a wasn't changed. Even the whole of b was copied, rather than just changing the elements that need to be changed. That's important to avoid for large data, and why := and set() were introduced to data.table.

Now, with our copied newDT we can modify it by reference :

newDT
#      a   b
# [1,] 1  11
# [2,] 2 200

newDT[2, b := 400]
#      a   b        # See FAQ 2.21 for why this prints newDT
# [1,] 1  11
# [2,] 2 400

.Internal(inspect(newDT))
# @0000000003D97A58 19 VECSXP g0c7 [OBJ,NAM(2),ATT] (len=2, tl=100)
#   @00000000040ED7F8 14 REALSXP g0c2 [NAM(2)] (len=2, tl=0) 1,2
#   @00000000040ED8D8 14 REALSXP g0c2 [NAM(2)] (len=2, tl=0) 11,400
# ATTRIB:  # ..snip ..

Notice that all 3 hex values (the vector of column points, and each of the 2 columns) remain unchanged. So it was truly modified by reference with no copies at all.

Or, we can modify the original DT by reference :

DT[2, b := 600]
#      a   b
# [1,] 1  11
# [2,] 2 600

.Internal(inspect(DT))
# @0000000003B7E2A0 19 VECSXP g0c7 [OBJ,NAM(2),ATT] (len=2, tl=100)
#   @00000000040C2288 14 REALSXP g0c2 [NAM(2)] (len=2, tl=0) 1,2
#   @00000000040C2250 14 REALSXP g0c2 [NAM(2)] (len=2, tl=0) 11,600
#   ATTRIB:  # ..snip..

Those hex values are the same as the original values we saw for DT above. Type example(copy) for more examples using tracemem and comparison to data.frame.

Btw, if you tracemem(DT) then DT[2,b:=600] you'll see one copy reported. That is a copy of the first 10 rows that the print method does. When wrapped with invisible() or when called within a function or script, the print method isn't called.

All this applies inside functions too; i.e., := and set() do not copy on write, even within functions. If you need to modify a local copy, then call x=copy(x) at the start of the function. But, remember data.table is for large data (as well as faster programming advantages for small data). We deliberately don't want to copy large objects (ever). As a result we don't need to allow for the usual 3* working memory factor rule of thumb. We try to only need working memory as large as one column (i.e. a working memory factor of 1/ncol rather than 3).