Add a row by reference at the end of a data.table object
To answer your edit, just run a benchmark:
library(data.table)
library(microbenchmark)

a = data.table(id=letters[1:2], var=1:2)
b = copy(a)
c = copy(b) # also try modifying an existing value in place,
            # to see how well changing existing values does
microbenchmark(a <- rbind(a, data.table(id="c", var=3)),
               b <- rbindlist(list(b, data.table(id="c", var=3))),
               c[1, var := 3L],
               set(c, 1L, 2L, 3L))
#Unit: microseconds
#                                                   expr     min        lq    median        uq      max neval
#           a <- rbind(a, data.table(id = "c", var = 3)) 865.460 1141.2585 1357.1230 1539.4300 6814.492   100
# b <- rbindlist(list(b, data.table(id = "c", var = 3))) 260.440  325.3835  445.4190  522.8825 1143.930   100
#                                    c[1, `:=`(var, 3L)] 482.147  626.5570  778.3135  904.3595 1109.539   100
#                                     set(c, 1L, 2L, 3L)   2.339     5.677    7.5140    9.5170   19.033   100
rbindlist is clearly better than rbind. Thanks to Matthew Dowle for pointing out the problems with using [ in a loop; I added another benchmark with set.
From the above, your best options are using rbindlist, or sizing the data.table to begin with and then just populating the values. If you don't know the size of the data in advance, you can use a strategy similar to std::vector in C++: double the size every time you run out of space, and delete the extra rows once you're done filling it in.
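The pre-allocate-and-double strategy described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the starting `capacity`, the `blank()` helper and the `used` counter are hypothetical names that do not appear in the answer, and the column names simply mirror the benchmark table.

```r
library(data.table)

# Hypothetical helper: n placeholder rows with the same column types.
blank <- function(n) {
  data.table(id = rep(NA_character_, n), var = rep(NA_integer_, n))
}

capacity <- 4L          # assumed starting size, chosen only for illustration
dt <- blank(capacity)
used <- 0L              # number of rows actually filled so far

for (k in 1:10) {
  if (used == nrow(dt)) {
    # Out of space: double the table (this step copies, but happens rarely).
    dt <- rbindlist(list(dt, blank(nrow(dt))))
  }
  used <- used + 1L
  # Fill the next free row by reference; no copy of dt is made here.
  set(dt, i = used, j = "id", value = letters[k])
  set(dt, i = used, j = "var", value = k)
}

# Drop the unused placeholder rows once filling is done.
dt <- dt[seq_len(used)]
```

The expensive copying step runs only when the table is full, so most appends are cheap `set()` calls.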
Insert a row in a data.table
To expand on @Frank's answer for the particular case of appending a row, take this example table:
set.seed(12345)
dt1 <- data.table(a=rnorm(5), b=rnorm(5))
The following are equivalent; I find the first easier to read but the second faster:
microbenchmark(
rbind(dt1, list(5, 6)),
rbindlist(list(dt1, list(5, 6)))
)
As we can see:
                            expr     min      lq  median       uq     max
          rbind(dt1, list(5, 6)) 160.516 166.058 175.089 185.1470 457.735
rbindlist(list(dt1, list(5, 6))) 130.137 134.037 140.605 149.6365 184.326
If you want to insert the row elsewhere, the following will work, but it's not pretty:
rbindlist(list(dt1[1:3, ], list(5, 6), dt1[4:5, ]))
or even
rbindlist(list(dt1[1:3, ], as.list(c(5, 6)), dt1[4:5, ]))
giving:
            a          b
1:  0.5855288 -1.8179560
2:  0.7094660  0.6300986
3: -0.1093033 -0.2761841
4:  5.0000000  6.0000000
5: -0.4534972 -0.2841597
6:  0.6058875 -0.9193220
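If the split-and-rebind pattern is needed in more than one place, it could be wrapped in a small helper. The sketch below is an assumption, not part of the answer: `insert_row()` and its `after` argument are hypothetical names, and it only handles insertion strictly inside the table.

```r
library(data.table)

# Hypothetical helper: return a new table with `row` placed after row `after`.
# The input table is not modified; rbindlist() builds a new object.
insert_row <- function(dt, row, after) {
  stopifnot(after >= 1L, after < nrow(dt))
  rbindlist(list(dt[seq_len(after)], row, dt[(after + 1L):nrow(dt)]))
}

set.seed(12345)
dt1 <- data.table(a = rnorm(5), b = rnorm(5))
out <- insert_row(dt1, list(5, 6), after = 3L)
out
```

Because this copies the whole table, it suits occasional inserts rather than use inside a tight loop.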
If you are modifying a row in place (which is the preferred approach), you will need to define the size of the data.table in advance, i.e.:
dt1 <- data.table(a=rnorm(6), b=rnorm(6))
set(dt1, i=6L, j="a", value=5) # refer to column by name
set(dt1, i=6L, j=2L, value=6) # refer to column by number
Thanks @Boxuan, I have modified this answer to take account of your suggestion, which is a little faster and easier to read.
Append a data.table to another data.table by reference similar to rbind
We need to assign it to the object in the global environment. In the OP's function, it is assigned locally to an object named 'x1', and one of the nice things about functions is that global objects are not mutated (local scope).
dt.append <- function(x1, x2) {
obj <- deparse(substitute(x1)) # get the actual object name as a string
assign(obj, value = data.table::rbindlist(list(x1, x2)), envir = .GlobalEnv)
}
dt1
# a b
#1: 1 hello
#2: 2 world
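For reference, here is a self-contained sketch of how such a function might be called. Two things are assumptions rather than part of the answer: the input tables `dt1` and `dt2` are made up to match the printed output, and this variant assigns into the caller's environment with `parent.frame()` instead of `.GlobalEnv`, so it also works when called from inside another function.

```r
library(data.table)

# Variant of dt.append(): rebinds the caller's variable to the combined table.
dt.append <- function(x1, x2) {
  obj <- deparse(substitute(x1))  # name of the first argument as a string
  assign(obj, value = rbindlist(list(x1, x2)), envir = parent.frame())
  invisible(NULL)
}

# Assumed example inputs (not shown in the original answer).
dt1 <- data.table(a = 1, b = "hello")
dt2 <- data.table(a = 2, b = "world")

dt.append(dt1, dt2)
dt1
```

Note that this still copies the data; it only rebinds the name `dt1` to the new table, so it is not a true append by reference.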
How to store a reference to an object in System.DataTable, rather than a string?
That's because you have added Column1 without denoting its type, hence by default it will be assigned string.
Try to explicitly set it as Object instead:
dt.Columns.Add("Column1", typeof(Object));
:= (pass by reference) operator in the data.table package modifies another data table object simultaneously
This piece of documentation in data.table should help; see ?data.table::copy:
No value is returned. The data.table is modified by reference. If you require a copy, take a copy first (using DT2=copy(DT)). copy() may also sometimes be useful before := is used to subassign to a column by reference.
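A minimal sketch of the behaviour the documentation describes, using made-up table and column names: plain assignment gives a second name for the same table, whereas `copy()` gives an independent one.

```r
library(data.table)

DT <- data.table(x = 1:3)

# Plain assignment does not copy: both names refer to the same table.
alias <- DT
alias[, y := x * 2L]
names(DT)      # the new column is visible through DT as well

# copy() makes an independent table, so := on it leaves DT alone.
safe <- copy(DT)
safe[, z := 0L]
names(DT)      # z was only added to the copy
```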
data.table doesn't modify by reference if the object is freshly loaded from file?
A data.table needs to be over-allocated in memory for adding columns by reference to work. After loading it that's not the case:
load("ttt")
length(a)
#[1] 1
truelength(a)
#[1] 0
b <- data.table(x=1:2)
length(b)
#[1] 1
truelength(b)
#[1] 100
From help(truelength):
For tables loaded from disk however, truelength is 0 in R 2.14.0 and random in R <= 2.13.2; i.e., in both cases perhaps unexpected. data.table detects this state and over-allocates the loaded data.table when the next column addition or deletion occurs.
But it seems that if you pass a (freshly loaded) data.table to a function and then add a column by reference inside the function, over-allocation happens but doesn't reach the symbol in the global environment (only the local symbol inside the function). If you do it in the global environment directly, or don't pass the data.table as a function parameter, it works.
If the data.table is over-allocated already (as is normally the case, other than when freshly loaded from disk), then there are spare slots for the column to be added into by reference, and no shallow copy (to achieve over-allocation) needs to be done by := inside the function.
This might be worth a bug report (but I haven't checked if there is already one).
R version 3.0.1 (2013-05-16)
Platform: x86_64-apple-darwin10.8.0 (64-bit)
locale:
[1] de_DE.UTF-8/de_DE.UTF-8/de_DE.UTF-8/C/de_DE.UTF-8/de_DE.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] microbenchmark_1.3-0 data.table_1.8.8
loaded via a namespace (and not attached):
[1] tools_3.0.1
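One possible workaround for the freshly-loaded case is to restore over-allocation right after loading, before handing the table to any function. The sketch below assumes a reasonably recent data.table, where calling `setDT()` on a loaded table does this; it uses `saveRDS()`/`readRDS()` on a temporary file instead of the `load("ttt")` call above, and `add_col()` is a hypothetical function.

```r
library(data.table)

# Round-trip a small table through disk to mimic a freshly loaded object.
tmp <- tempfile(fileext = ".rds")
saveRDS(data.table(x = 1:2), tmp)
loaded <- readRDS(tmp)

# Restore over-allocation in the calling environment before further use.
setDT(loaded)
truelength(loaded) > length(loaded)   # spare column slots should now exist

# Hypothetical function that adds a column by reference.
add_col <- function(d) {
  d[, y := 2L]
  invisible(NULL)
}

add_col(loaded)
names(loaded)   # the column added inside the function should be visible here
```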
data.table out of range, how to add value to new row
data.tables were designed to make common operations such as subsetting, joining, grouping and sorting much faster, and as a result they differ from data.frames in some ways.
Some operations, like the one you pointed out, will not work on data.tables. You need to use data.table-specific operations.
dt1 <- data.table(c(1:3))
dt1 <- rbindlist(list(dt1, list(1)), use.names=FALSE) # rbindlist returns a new table, so assign it back
dt1
# V1
# 1: 1
# 2: 2
# 3: 3
# 4: 1
Why I need to assign a data.table to a new object to filter rows?
Fundamentally, columns are much easier to modify by-reference in R since columns are list elements, and list elements are not stored contiguously in memory.
Removing a column by reference just means unallocating its allotted memory and removing the associated pointers.
By contrast, removing some rows is a lot harder and can't really be done by-reference -- some copying is inevitable. Consider this simplified representation of a table with two columns, A and B:
1 2 3 4 5
A: [ ][ ][ ][ ][ ]
B: [ ][ ][ ][ ][ ]
A is stored in contiguous memory as an array with size 5*sizeof(A). E.g. if A is an integer, it's given 4 bytes per cell; numeric is 8 bytes per cell.
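The per-cell sizes can be checked approximately from R itself. This is a rough sketch using base R's `object.size()`; the vector length `n` is arbitrary, and the small fixed header on each vector means the figures are only approximately 4 and 8.

```r
n <- 1e6

# Total bytes for a long integer vector and a long numeric (double) vector.
int_bytes <- as.numeric(object.size(integer(n)))
dbl_bytes <- as.numeric(object.size(numeric(n)))

int_bytes / n   # close to 4 bytes per cell
dbl_bytes / n   # close to 8 bytes per cell
```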
Deleting B is easy from a memory point of view: just tell R/your system you don't need that memory anymore:
1 2 3 4 5
A: [ ][ ][ ][ ][ ]
B: [x][x][x][x][x]
A's memory allocation is unaffected.
By contrast, consider removing some rows from the table (i.e., from both A and B):
1 2 3 4 5
A: [ ][x][x][ ][ ]
B: [ ][x][x][ ][ ]
If we simply release the memory for these 4 cells, our table will be broken -- the memory of each column has been split by a gap of size 2*sizeof(A) between its 1st and 4th rows.
The best we can do is to try and minimize copying by shifting rows 4 & 5, and leaving row 1 alone:
1 2 3<-4<-5
A: [ ][x][x][ ][ ]
B: [ ][x][x][ ][ ]
1 4 5
A: [ ][ ][ ]
B: [ ][ ][ ]
In the linked answer, Matt alludes to a very specific case in which the by-reference approach can work -- when the rows to add/drop come at the end. Hopefully the illustration makes it clear why this is easier to do.
This technical difficulty is the reason why the linked feature request is so hard to fill. Copying many columns' data as illustrated is easier said than done & requires a lot of finesse to get it working & communicated back to R from C properly.
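The difference between the two cases can be observed with data.table's `address()` function. The following is only a sketch with a made-up table: dropping a column by reference leaves the other column's memory where it was, while dropping rows produces newly allocated columns.

```r
library(data.table)

DT <- data.table(A = 1:5, B = 6:10)
addr_A <- address(DT$A)   # memory address of column A

# Dropping column B by reference does not touch A's memory.
DT[, B := NULL]
same_A <- identical(address(DT$A), addr_A)

# Dropping rows 2 and 3 returns a new table whose columns are fresh vectors.
DT2 <- DT[-(2:3)]
moved <- !identical(address(DT2$A), addr_A)

same_A   # A was left in place
moved    # the row-filtered A lives somewhere else
```

This is why filtering rows requires assigning the result to a new (or the same) name, whereas column removal does not.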