Add a Row by Reference at the End of a data.table Object

Add a row by reference at the end of a data.table object

To answer your edit, just run a benchmark:

library(data.table)
library(microbenchmark)

a = data.table(id=letters[1:2], var=1:2)
b = copy(a)
c = copy(b) # also try modifying an existing value in place,
            # to see how well changing existing values does
microbenchmark(a <- rbind(a, data.table(id="c", var=3)),
               b <- rbindlist(list(b, data.table(id="c", var=3))),
               c[1, var := 3L],
               set(c, 1L, 2L, 3L))
#Unit: microseconds
#                                                   expr     min        lq    median        uq      max neval
#           a <- rbind(a, data.table(id = "c", var = 3)) 865.460 1141.2585 1357.1230 1539.4300 6814.492   100
# b <- rbindlist(list(b, data.table(id = "c", var = 3))) 260.440  325.3835  445.4190  522.8825 1143.930   100
#                                    c[1, `:=`(var, 3L)] 482.147  626.5570  778.3135  904.3595 1109.539   100
#                                     set(c, 1L, 2L, 3L)   2.339     5.677    7.5140    9.5170   19.033   100

rbindlist is clearly faster than rbind. Thanks to Matthew Dowle for pointing out the problems with using [ in a loop, I added another benchmark with set.

From the above, your best options are using rbindlist, or sizing the data.table up front and then just populating the values. If you don't know the final size in advance, you can use a strategy similar to C++'s std::vector: double the number of rows every time you run out of space, and delete the extra rows once you are done filling them in.
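
As a rough illustration of the second option, here is a minimal sketch of the preallocate-and-fill approach; the sizes, column names and the grow() helper are illustrative, not from the original answer:

library(data.table)

n <- 4L                                    # initial capacity (a guess)
dt <- data.table(id = character(n), var = integer(n))

grow <- function(dt) {
  # double the capacity, std::vector style, by stacking an empty block underneath
  rbindlist(list(dt, data.table(id = character(nrow(dt)), var = integer(nrow(dt)))))
}

used <- 0L
for (v in 1:10) {                          # pretend we don't know 10 rows are coming
  if (used == nrow(dt)) dt <- grow(dt)     # out of space: double the size
  used <- used + 1L
  set(dt, i = used, j = "id", value = letters[v])
  set(dt, i = used, j = "var", value = v)
}

dt <- dt[seq_len(used)]                    # finally, drop the unused spare rows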

Insert a row in a data.table

To expand on @Frank's answer, if in your particular case you are appending a single row, start with some example data:

set.seed(12345) 
dt1 <- data.table(a=rnorm(5), b=rnorm(5))

The following are equivalent; I find the first easier to read but the second faster:

microbenchmark(
  rbind(dt1, list(5, 6)),
  rbindlist(list(dt1, list(5, 6)))
)

As we can see:

                            expr     min      lq  median       uq     max
          rbind(dt1, list(5, 6)) 160.516 166.058 175.089 185.1470 457.735
rbindlist(list(dt1, list(5, 6))) 130.137 134.037 140.605 149.6365 184.326

If you want to insert the row elsewhere, the following will work, but it's not pretty:

rbindlist(list(dt1[1:3, ], list(5, 6), dt1[4:5, ]))

or even

rbindlist(list(dt1[1:3, ], as.list(c(5, 6)), dt1[4:5, ]))

giving:

            a          b
1:  0.5855288 -1.8179560
2:  0.7094660  0.6300986
3: -0.1093033 -0.2761841
4:  5.0000000  6.0000000
5: -0.4534972 -0.2841597
6:  0.6058875 -0.9193220

If you are modifying a row in place (which is the preferred approach), you will need to define the size of the data.table in advance, e.g.

dt1 <- data.table(a=rnorm(6), b=rnorm(6))
set(dt1, i=6L, j="a", value=5) # refer to column by name
set(dt1, i=6L, j=2L, value=6) # refer to column by number

Thanks @Boxuan, I have modified this answer to take account of your suggestion, which is a little faster and easier to read.

Append a data.table to another data.table by reference similar to rbind

We need to assign it to the object in the global environment. In the OP's function, the result is assigned locally to an object named 'x1', and one of the nice things about functions is that global objects are not mutated (local scope).

dt.append <- function(x1, x2) {
  obj <- deparse(substitute(x1)) # get the actual object name as a string
  assign(obj, value = data.table::rbindlist(list(x1, x2)), envir = .GlobalEnv)
}

dt1
#    a     b
# 1: 1 hello
# 2: 2 world
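
A minimal usage sketch, assuming dt1 is the two-row table printed above; dt2 is a hypothetical second table made up here for illustration:

library(data.table)

dt2 <- data.table(a = 3L, b = "again")  # hypothetical row(s) to append

dt.append(dt1, dt2)  # rebinds the combined table to 'dt1' in the global environment
dt1
#    a     b
# 1: 1 hello
# 2: 2 world
# 3: 3 again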

How to store a reference to an object in System.DataTable, rather than a string?

That's because you have added Column1 without specifying its type, so by default it is typed as string.

Try to explicitly set it as Object instead:

dt.Columns.Add("Column1", typeof(Object));


:= (pass by reference) operator in the data.table package modifies another data.table object simultaneously

This piece of the data.table documentation helps; see ?data.table::copy:

No value is returned. The data.table is modified by reference. If you require a copy, take a copy first (using DT2=copy(DT)). copy() may also sometimes be useful before := is used to subassign to a column by reference.
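
A small sketch of the behaviour described there (object names chosen here for illustration):

library(data.table)

DT  <- data.table(x = 1:3)
DT2 <- DT           # DT2 is another name for the same underlying table, not a copy
DT2[, y := x * 2]   # := modifies by reference ...
names(DT)           # ... so DT has gained column y as well
# [1] "x" "y"

DT3 <- copy(DT)     # take a real copy first
DT3[, z := 0L]      # this no longer touches DT
"z" %in% names(DT)
# [1] FALSE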

data.table doesn't modify by reference if the object is freshly loaded from file?

A data.table needs to be over-allocated in memory for adding columns by reference to work. After loading it that's not the case:

load("ttt")
length(a)
#[1] 1
truelength(a)
#[1] 0

b <- data.table(x=1:2)
length(b)
#[1] 1
truelength(b)
#[1] 100

From help(truelength):

For tables loaded from disk however, truelength is 0 in R 2.14.0 and random in R <= 2.13.2; i.e., in both cases perhaps unexpected. data.table detects this state and over-allocates the loaded data.table when the next column addition or deletion occurs.

But it seems that if you pass a (freshly loaded) data.table to a function and then add a column by reference inside the function, the over-allocation happens but doesn't reach the symbol up in the global environment (only the local symbol inside the function). If you do it in the global environment directly, or don't pass the data.table as a function parameter, it works.

If the data.table is over-allocated already (as is normally the case, other than when freshly loaded from disk), then there are spare slots for the column to be added into by reference and no shallow copy (to achieve over-allocation) needs to be done by := inside the function.
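
One possible workaround, given the behaviour above (a sketch, not from the original answer): re-over-allocate the freshly loaded table yourself before passing it into a function, e.g. with alloc.col(), which data.table exports (renamed setalloccol() in later versions):

library(data.table)

load("ttt")            # 'a' comes back from disk with truelength 0
a <- alloc.col(a)      # over-allocate again; assigning the result back is the safe form
truelength(a)          # spare column slots are available once more

f <- function(dt) dt[, y := 1L]   # add a column by reference inside a function
f(a)
names(a)               # 'y' is now visible on the global 'a' as well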

This might be worth a bug report (but I haven't checked if there is already one).

R version 3.0.1 (2013-05-16)
Platform: x86_64-apple-darwin10.8.0 (64-bit)

locale:
[1] de_DE.UTF-8/de_DE.UTF-8/de_DE.UTF-8/C/de_DE.UTF-8/de_DE.UTF-8

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] microbenchmark_1.3-0 data.table_1.8.8

loaded via a namespace (and not attached):
[1] tools_3.0.1

data.table out of range, how to add value to new row

data.tables were designed to be much faster for common operations like subsetting, joining, grouping and sorting, and as a result they differ from data.frames in some ways.
Some operations, like the one you pointed out, will not work on data.tables; you need to use data.table-specific operations.

dt1 <- data.table(c(1:3))
dt1 <- rbindlist(list(dt1, list(1)), use.names=FALSE) # assign the result back; rbindlist does not modify dt1 itself
dt1

# V1
# 1: 1
# 2: 2
# 3: 3
# 4: 1

Why do I need to assign a data.table to a new object to filter rows?

Fundamentally, columns are much easier to modify by reference in R because a data.table is a list of column vectors, and those list elements do not need to sit contiguously with one another in memory.

Removing a column by reference just means releasing its allotted memory and removing the associated pointer.
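
For example, dropping a column by reference is a one-liner in data.table (example data made up here):

library(data.table)

DT <- data.table(A = 1:5, B = letters[1:5])
DT[, B := NULL]   # removes column B by reference; column A's memory is untouched
DT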

By contrast, removing some rows is a lot harder and can't really be done by-reference -- some copying is inevitable. Consider this simplified representation of a table with two columns, A and B:

    1  2  3  4  5
A: [ ][ ][ ][ ][ ]
B: [ ][ ][ ][ ][ ]

A is stored in contiguous memory as an array of size 5*sizeof(A); e.g. an integer column gets 4 bytes per cell, a numeric column 8 bytes per cell.

Deleting B is easy from a memory point of view: just tell R/your system you don't need that memory anymore:

    1  2  3  4  5
A: [ ][ ][ ][ ][ ]
B: [x][x][x][x][x]

A's memory allocation is unaffected.

By contrast, consider removing some rows from the table (i.e., both A and B):

    1  2  3  4  5
A: [ ][x][x][ ][ ]
B: [ ][x][x][ ][ ]

If we simply released the memory for these 4 cells, our table would be broken -- the memory backing each column would now have a 2*sizeof(A)-sized gap between its 1st and 4th rows.

The best we can do is to try to minimize copying by shifting rows 4 & 5 and leaving row 1 alone:

    1  2  3<-4<-5
A: [ ][x][x][ ][ ]
B: [ ][x][x][ ][ ]

    1  4  5
A: [ ][ ][ ]
B: [ ][ ][ ]

In the linked answer, Matt alludes to a very specific case in which the by-reference approach can work -- when the rows to add/drop come at the end. Hopefully the illustration makes it clear why this is easier to do.

This technical difficulty is the reason the linked feature request is so hard to fulfil. Copying many columns' data as illustrated is easier said than done, and requires a lot of finesse to get it working and communicated back to R from C properly.
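
In the meantime, the practical way to drop rows is to subset to the rows you keep and reassign, which copies the kept rows into a new table (example data made up here):

library(data.table)

DT <- data.table(A = 1:5, B = letters[1:5])
DT <- DT[-(2:3)]   # keep rows 1, 4 and 5; this builds a new table rather than editing in place
DT
#    A B
# 1: 1 a
# 2: 4 d
# 3: 5 e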


