Warning: 'Invalid .internal.selfref detected' when adding a column to a data.table returned from a function

This has nothing to do with fread per se; the cause is that you're calling list() and passing it a named object. We can recreate this by doing:

library(data.table)
DT <- data.table(x=1:2)   # name the object 'DT'
DT.l <- list(DT=DT)       # create a list containing one data.table
y <- DT.l$DT              # get back the data.table
y[, bla := 1L]            # now add by reference
# works fine, but the warning is issued (on R < 3.1.0)

DT.l <- list(DT=data.table(x=1:2))   # DT = a call, not a named object
y <- DT.l$DT
y[, bla := 1L]
# works fine and no warning message

Good news:

From R version 3.1.0 onwards (in development when this was written), passing a named object to list() no longer creates a copy; instead, its reference count (the number of objects pointing to this value) just gets bumped. So the problem goes away once you upgrade to R >= 3.1.0.
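The no-copy behaviour can be checked with data.table's address() function. This is a minimal sketch that assumes R >= 3.1.0; on older versions the comparison is expected to come out FALSE instead.

```r
library(data.table)

DT   <- data.table(x = 1:2)
DT.l <- list(DT = DT)            # named object passed to list()

# On R >= 3.1.0 the list element is the very same object, not a copy.
same_object <- identical(address(DT.l$DT), address(DT))
same_object

# Consequence: an update by reference through the list element is also
# visible through the original name, because both refer to one object.
y <- DT.l$DT
y[, bla := 1L]
"bla" %in% names(DT)
```

Keep in mind that sharing cuts both ways: if an independent table is wanted, copy() has to be called explicitly.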

To understand how data.table detects copies using .internal.selfref, we'll dive into some history of data.table.

First, some history:

You should know that data.table over-allocates column pointer slots on creation (truelength defaulted to 100 when this was written; see the datatable.alloccol option) so that := can be used to add columns by reference later on. This caused one problem: handling copies. For example, when we call list() and pass it a named object, R (before 3.1.0) makes a copy, as illustrated below.

tracemem(DT)
# [1] "<0x7fe23ac3e6d0>"
DT.list <- list(DT=DT) # `DT` is the named object on the RHS of = here
# tracemem[0x7fe23ac3e6d0 -> 0x7fe23cd72f48]:

The problem with any copy of a data.table that R makes (as opposed to data.table's own copy()) is that R does not carry the over-allocation over to the copy, even though truelength(.) still reports the old value as if it had. This led to segfaults when such a copy was updated by reference with :=, because the over-allocated slots were no longer really there. This happened in versions < 1.7.8. To overcome this, an attribute called .internal.selfref was introduced. You can check this attribute by doing attributes(DT).
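Both the over-allocation and the attribute can be inspected from R. This is a minimal sketch assuming a reasonably recent data.table; the number of spare slots depends on the version and on the datatable.alloccol option, so no specific counts are shown.

```r
library(data.table)

DT <- data.table(x = 1:2)

# length() counts the columns in use; truelength() counts the column
# pointer slots that were allocated, including the spare ones.
n_used  <- length(DT)
n_slots <- truelength(DT)
n_slots > n_used                 # spare slots exist

# Adding a column by reference consumes one spare slot and leaves the
# allocation itself untouched.
DT[, y := 3:4]
length(DT) == n_used + 1L
truelength(DT) == n_slots

# The self-reference is stored as an attribute holding an external pointer.
selfref <- attr(DT, ".internal.selfref")
typeof(selfref)                  # "externalptr"
```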

From NEWS (of v1.7.8):

o The 'Chris crash' is fixed. The root cause was that key<- always copies the whole table. The problem with that copy (other than being slower) is that R doesn't maintain the over allocated truelength, but it looks as though it has. key<- was used internally, in particular in merge(). So, adding a column using := after merge() was a memory overwrite, since the over allocated memory wasn't really there after key<-'s copy.

data.tables now have a new attribute .internal.selfref to catch and warn about such copies in future. All internal use of key<- has been replaced with setkey(), or new function setkeyv() which accepts a vector, and do not copy.

What does this .internal.selfref do?

It just points to itself, basically. It's simply an attribute attached to DT that contains the address in RAM of DT. If R inadvertently copies DT, the address of DT will move in RAM but the attribute attached to it will still contain the old memory address, so the two won't match any more. data.table checks that they do match (i.e. that the selfref is valid) before adding a new column by reference into a spare column pointer slot.
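The role of the address can be seen from the R side with address(). This is a small sketch rather than a view into the internal check itself: it only shows that := keeps the object where it is, while copy() produces a separate object that carries its own valid self-reference.

```r
library(data.table)

DT <- data.table(x = 1:2)
addr_before <- address(DT)

DT[, y := 3:4]                         # fills a spare slot; DT is not reallocated
identical(address(DT), addr_before)    # TRUE

DT2 <- copy(DT)                        # data.table's own deep copy
address(DT2) != address(DT)            # a different object in memory
DT2[, z := 5:6]                        # the copy can be updated by reference as well
"z" %in% names(DT)                     # FALSE: the original is untouched
```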

How is .internal.selfref implemented?

To understand the .internal.selfref attribute, we have to understand what an external pointer (EXTPTRSXP) is. The section on external pointers in Writing R Extensions explains this nicely. Copying the essential lines:

External pointer SEXPs are intended to handle references to C structures such as handles, and are used for this purpose in package RODBC for example. They are unusual in their copying semantics in that when an R object is copied, the external pointer object is not duplicated.

They are created as:

SEXP R_MakeExternalPtr(void *p, SEXP tag, SEXP prot);

where p is the pointer (and hence this cannot portably be a function pointer), and tag and prot are references to ordinary R objects which will remain in existence (be protected from garbage collection) for the lifetime of the external pointer object. A useful convention is to use the tag field for some form of type identification and the prot field for protecting the memory that the external pointer represents, if that memory is allocated from the R heap.

In our case, we create the attribute .internal.selfref for DT. Its value is an external pointer to NULL (this is the address you see when the attribute is printed), and that external pointer's prot field is another external pointer that points back to DT (hence "selfref"), this time with its own prot set to NULL.

Note: We have to use this "extptr to NULL whose prot is an extptr" strategy so that identical(DT1, DT2) returns TRUE when DT1 and DT2 are two different copies with the same content. (If you don't understand what this means, you can skip to the next part; it isn't needed to understand the answer to this question.)

Okay, how does this all work then?

We know that the external pointer does not get duplicated during a copy. When we create a data.table, the attribute .internal.selfref holds an external pointer to NULL whose prot field is an external pointer back to DT. Now, when an unintentional "copy" is made, the object's address changes but the address held by the attribute does not. It still points to the original DT, whether that object still exists or not, because it won't (and can't) be modified by the copy. The copy is therefore detected internally by comparing the address of the current object with the address held by the external pointer. If they don't match, then a "copy" has been made by R (one that would have lost the over-allocation that data.table carefully created). That is:

DT <- data.table(x=1:2)    # internal selfref set
DT.list <- list(DT=DT)     # copy made (R < 3.1.0): address(DT.list$DT) != address(DT),
                           # and truelength would be affected

DT.new <- DT.list$DT       # address of DT.new != address of DT,
                           # and it's not equal to the address pointed to by
                           # the attribute's 'prot' external pointer

# So a re-over-allocation has to be made by data.table at the next update by
# reference, and it warns so that you can fix the root cause by not using list(),
# key<-, names<- etc.
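On R >= 3.1.0 the list() route above may no longer produce a copy, so it may not warn. The warning text also mentions objects "created manually using structure()", and such an object has no self-reference at all. The sketch below builds one (the name manual is made up for illustration) and records whether a warning was raised instead of assuming it; behaviour may differ between data.table versions.

```r
library(data.table)

# A hand-built object with the data.table class but no .internal.selfref
# attribute (similar to what pasting dput() output can give).
manual <- structure(list(x = 1:2),
                    row.names = c(NA, -2L),
                    class = c("data.table", "data.frame"))
had_selfref <- !is.null(attr(manual, ".internal.selfref"))
had_selfref                      # FALSE: there is nothing to validate against

# Add a column with := and note whether a warning was signalled.
warned <- FALSE
withCallingHandlers(
  manual[, bla := 1L],
  warning = function(w) {
    warned <<- TRUE
    invokeRestart("muffleWarning")
  }
)
warned                           # TRUE if this version raised the warning
"bla" %in% names(manual)         # the column is added either way
```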

That's a lot to take in. I've tried to explain it as clearly as I can. If there are any mistakes (it took me a while to wrap my head around this) or possibilities for further clarity, feel free to edit or comment with your suggestions.

Hope this clears up things.

Invalid .internal.selfref in data.table

Yes, the problem is the list. Here is a simple example:

DT <- data.table(1:5)
mylist1 <- list(DT, "a")
mylist1[[1]][, id := .I]
# warning

mylist2 <- list(data.table(1:5), "a")
mylist2[[1]][, id := .I]
# no warning

You should avoid copying a data.table into a list (and to be on the safe side I would avoid having a DT in a list at all). Try this:

f1 <- function() {
  mylist <- list(res = data.table(
    id     = c("aaaa", "aaaa", "aaas", "aaas", "bbbb", "bbbb"),
    period = c("start", "end", "start", "end", "start", "end"),
    date   = structure(c(15401L, 15401L, 15581L, 15762L, 15430L, 15747L),
                       class = c("IDate", "Date"))))
  other_results <- ""
  mylist$other_results <- other_results
  mylist
}

Invalid .internal.selfref warning with no call to list() involved

The warning message tells you everything you need:

Warning message: In `[.data.table`(dt, , `:=`(new.col, 5)) :

Invalid .internal.selfref detected and fixed by taking a (shallow)
copy of the data.table so that := can add this new column by
reference. At an earlier point, this data.table has been copied by R
(or been created manually using structure() or similar). Avoid key<-,
names<- and attr<- which in R currently (and oddly) may copy the whole
data.table. Use set* syntax instead to avoid copying: ?set, ?setnames
and ?setattr. Also, in R<=v3.0.2, list(DT1,DT2) copied the entire DT1
and DT2 (R's list() used to copy named objects); please upgrade to
R>v3.0.2 if that is biting. If this message doesn't help, please
report to datatable-help so the root cause can be fixed.

The .internal.selfref pointer refers to the location in memory of the data.table. Using key<-, names<- or attr<- apparently causes R to make a copy of the data.table, and the copy lives at a different place in memory.

So, instead of using colnames you should use setnames:

initialize = function() {
  table = data.table(1:10)
  setnames(table, "V1", "old.col")   # rename by reference instead of colnames<-
  table
}
dt <- initialize()
dt[, new.col := 5]

Now you won't get a warning because the data.table is updated by reference without making a copy and thus keeping the same .internal.selfref pointer to the location in memory.
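That the set* functions really work in place can be confirmed with address(). A minimal sketch: the attribute name "note" is made up for illustration, and setattr() is included only because the warning text points to it alongside setnames().

```r
library(data.table)

DT <- data.table(1:10)               # single column, auto-named V1
addr <- address(DT)

setnames(DT, "V1", "old.col")        # rename by reference
setattr(DT, "note", "renamed")       # set an attribute by reference ("note" is hypothetical)
DT[, new.col := 5]                   # add a column by reference

identical(address(DT), addr)         # TRUE: still the same object throughout
names(DT)                            # "old.col" "new.col"
```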


