Growing a Data.Frame in a Memory-Efficient Manner

First approach

I tried assigning to each element of a pre-allocated data.frame:

# note: rep(NA, 1000) would create logical columns, forcing an extra
# type-coercion copy on the first numeric assignment; NA_real_ avoids that
res <- data.frame(x = rep(NA_real_, 1000), y = rep(NA_real_, 1000))
tracemem(res)
for(i in 1:1000) {
  res[i, "x"] <- runif(1)
  res[i, "y"] <- rnorm(1)
}

But tracemem goes crazy (i.e. the data.frame is copied to a new address on each assignment).

Alternative approach (doesn't work either)

One alternative (I haven't benchmarked it, so I'm not sure it's faster) is to create a list of data.frames, then stack them all together:

makeRow <- function() data.frame(x = runif(1), y = rnorm(1))
res <- replicate(1000, makeRow(), simplify = FALSE)  # returns a list of data.frames
library(taRifx)
res.df <- stack(res)
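
taRifx's stack() is one way to do the final combine; if you'd rather not depend on it, base R's do.call(rbind, ...) or data.table's rbindlist() perform the same stacking:

res.df <- do.call(rbind, res)   # base R: rbind the whole list at once

library(data.table)
res.dt <- rbindlist(res)        # usually faster on long lists; returns a data.table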

Unfortunately, when creating the list I think you will be hard-pressed to benefit from pre-allocation. For instance:

> tracemem(res)
[1] "<0x79b98b0>"
> res[[2]] <- data.frame()
tracemem[0x79b98b0 -> 0x71da500]:

In other words, replacing an element of the list causes the list to be copied. I assume it's the whole list, but it may be only that element. I'm not intimately familiar with the details of R's memory management.
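
You can at least pre-size the list with vector("list", n) so R never has to grow it; any copies tracemem reports on element assignment are then shallow (the vector of pointers), not deep copies of every data.frame. A minimal sketch:

res <- vector("list", 1000)   # pre-allocate 1000 empty slots
for (i in seq_along(res)) {
  res[[i]] <- makeRow()       # fills a slot; the elements themselves are not copied
}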

Probably the best approach

As with many speed- or memory-limited processes these days, the best approach may well be to use a data.table instead of a data.frame. Since data.table has the := assign-by-reference operator, it can update without re-copying:

library(data.table)
dt <- data.table(x = rep(0, 1000), y = rep(0, 1000))
tracemem(dt)
for(i in 1:1000) {
  dt[i, x := runif(1)]
  dt[i, y := rnorm(1)]
}
# note no message from tracemem

But as @MatthewDowle points out, set() is the appropriate way to do this inside a loop. Doing so makes it faster still:

library(data.table)
n <- 10^6
dt <- data.table(x = rep(0, n), y = rep(0, n))

dt.colon <- function(dt) {
  for(i in 1:n) {
    dt[i, x := runif(1)]
    dt[i, y := rnorm(1)]
  }
}

dt.set <- function(dt) {
  for(i in 1:n) {
    set(dt, i, 1L, runif(1))
    set(dt, i, 2L, rnorm(1))
  }
}

library(microbenchmark)
m <- microbenchmark(dt.colon(dt), dt.set(dt), times = 2)

(Results shown below)

Benchmarking

With the loop run 10,000 times, data.table is almost a full order of magnitude faster:

Unit: seconds
          expr        min         lq     median         uq        max
1    test.df()  523.49057  523.49057  524.52408  525.55759  525.55759
2    test.dt()   62.06398   62.06398   62.98622   63.90845   63.90845
3 test.stack() 1196.30135 1196.30135 1258.79879 1321.29622 1321.29622

[benchmark plot: timings of test.df(), test.dt(), and test.stack()]

And a comparison of := with set():

> m
Unit: milliseconds
          expr       min        lq    median       uq      max
1 dt.colon(dt) 654.54996 654.54996 656.43429 658.3186 658.3186
2   dt.set(dt)  13.29612  13.29612  15.02891  16.7617  16.7617

Note that n here is 10^6, not 10^5 as in the benchmarks plotted above. So there's an order of magnitude more work, yet the result is measured in milliseconds rather than seconds. Impressive indeed.
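
As an aside, set() also accepts column names instead of integer positions, which reads better at a small cost in lookup; a variant of dt.set under the same setup:

dt.set.named <- function(dt) {
  for(i in 1:n) {
    set(dt, i, "x", runif(1))   # column by name rather than by position
    set(dt, i, "y", rnorm(1))
  }
}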

R memory efficient way to store many data frames?

Your example and your mention of the apply family of functions suggest that the structure of the data frames is identical, i.e. they all have the same columns.

If this is the case, and if the total volume of data (all data frames together) still fits in available RAM, then a solution could be to pack all the data into one large data.table with an extra id column. This can be achieved with the rbindlist function:

library(data.table)
x <- data.table(A = rnorm(100), B = rnorm(200))
y <- data.table(A = rnorm(30), B = rnorm(300))
z <- data.table(A = rnorm(20), B = rnorm(600))
dt <- rbindlist(list(x, y, z), idcol = TRUE)
dt
       .id           A           B
   1:    1 -0.10981198 -0.55483251
   2:    1 -0.09501871 -0.39602767
   3:    1  2.07894635  0.09838722
   4:    1 -2.16227936  0.04620932
   5:    1 -0.85767886 -0.02500463
  ---
1096:    3  1.65858606 -1.10010088
1097:    3 -0.52939876 -0.09720765
1098:    3  0.59847826  0.78347801
1099:    3  0.02024844 -0.37545346
1100:    3 -1.44481850 -0.02598364

The rows originating from the individual source data frames can be distinguished by the .id variable. All the memory-efficient data.table operations can be applied to all rows, to selected rows (dt[.id == 1, some_function(A)]), or group-wise (dt[, another_function(B), by = .id]).
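
For example, a one-line per-group summary over the combined table (using the columns of dt built above) looks like this:

# one summary row per original data frame
dt[, .(mean_A = mean(A), sd_B = sd(B)), by = .id]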

Although the data.table operations are memory efficient, RAM might still be a limiting factor. Use the tables() function to monitor memory consumption of all created data.table objects:

tables()
     NAME  NROW NCOL MB COLS    KEY
[1,] dt   1,100    3  1 .id,A,B
[2,] x      200    2  1 A,B
[3,] y      300    2  1 A,B
[4,] z      600    2  1 A,B
Total: 4MB

and remove objects from memory which are no longer needed:

rm(x, y, z)
tables()
     NAME  NROW NCOL MB COLS    KEY
[1,] dt   1,100    3  1 .id,A,B
Total: 1MB
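
Note that rm() only removes the bindings; R's garbage collector reclaims the memory lazily. If you are close to the RAM limit, call gc() afterwards to force a collection:

gc()   # force garbage collection; returns a table of memory used and freed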

How to append a growing array to itself efficiently

Try to avoid the nested for...for... loop; it is extremely inefficient. It is equivalent to:

# df is (presumably) a pre-allocated list, v the source data frame,
# and iter the number of resamples, as in the question
for (i in 1:iter) {
  df[[i]] <- v[sample(nrow(v), replace = TRUE), ]
}

a more verbose version of Gregor's solution...
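
Gregor's solution isn't reproduced here, but it was presumably the single-call lapply form; a hypothetical reconstruction, using the same v and iter as above:

# hypothetical reconstruction of the terser version
df <- lapply(1:iter, function(i) v[sample(nrow(v), replace = TRUE), ])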

Appending to large data.frame and hitting memory limit

If you are using data.table objects, you should use rbindlist to avoid making unnecessary copies of your data.table. Note that rbindlist expects a list, so the existing table d and the new rows n must be wrapped together. This should work...

d <- rbindlist(list(d, n))
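
If rows arrive repeatedly, it is better still to accumulate the chunks in a list and call rbindlist() once at the end, instead of growing d on every iteration. A minimal sketch, assuming a hypothetical read_chunk() that returns one data.table per call:

library(data.table)
n_chunks <- 100                    # assumed known up front
pieces <- vector("list", n_chunks)
for (i in seq_len(n_chunks)) {
  pieces[[i]] <- read_chunk(i)     # hypothetical reader
}
d <- rbindlist(pieces)             # one combine, one copy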

Why would rbind work faster than set for growing a data table?

set() is more often an alternative to := for fast assignment to elements of a data.table; the example below shows how it's normally used.
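
A minimal illustration on made-up data (not from the question):

library(data.table)
dt <- data.table(a = 1:5, b = letters[1:5])
set(dt, i = 2L, j = "a", value = 100L)    # overwrite a single cell in place
set(dt, i = NULL, j = "b", value = "z")   # i = NULL targets every row of the column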

As chinsoon12 points out, rbindlist(lapply(filepaths, fread)) should be a faster solution here. In terms of the example given, one option would be to define a list of the correct dimensions and use rbindlist:

library(data.table)

list.way <- function() {
  wildfire_data_list <- vector("list", length = 3)
  for(tile in 1:3) {
    # Normally this data would be read in from an external file,
    # but we'll make some dummy data for this example
    new_wildfire_data <- data.table(
      x = sample(1:1e6, 1000), y = sample(1:1e6, 1000),
      total_PM10 = sample(1:1e6, 1000), total_PM2.5 = sample(1:1e6, 1000),
      total_CH4 = sample(1:1e6, 1000), total_CO = sample(1:1e6, 1000),
      total_CO2 = sample(1:1e6, 1000), total_NOx = sample(1:1e6, 1000),
      total_SO2 = sample(1:1e6, 1000), total_VOC = sample(1:1e6, 1000),
      total_char = sample(1:1e6, 1000)
    )
    wildfire_data_list[[tile]] <- new_wildfire_data
  }
  wildfire_data <- rbindlist(wildfire_data_list)
  return(wildfire_data)
}
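
For contrast, the pattern the question title alludes to grows the table with rbind() inside the loop, which copies all rows accumulated so far on every pass; a stripped-down sketch (fewer columns, for brevity):

rbind.way <- function() {
  wildfire_data <- NULL   # rbind(NULL, dt) just returns dt on the first pass
  for(tile in 1:3) {
    new_wildfire_data <- data.table(x = sample(1:1e6, 1000), y = sample(1:1e6, 1000))
    wildfire_data <- rbind(wildfire_data, new_wildfire_data)  # full copy each iteration
  }
  return(wildfire_data)
}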

What is a memory-efficient method to spread then gather columns? (see example)

An option using data.table:

dcast(DT, rleid(Label) ~ rowid(Label), value.var = "X")
#    Label       1       2       3       4
# 1:     1 367.138 384.295 159.496 269.927
# 2:     2 364.118 154.475 265.861      NA

data

library(data.table)
DT <- fread(text = "X.1 Label       X
                     81    21 367.138
                     82    21 384.295
                     83    21 159.496
                     84    21 269.927
                     85    22 364.118
                     86    22 154.475
                     87    22 265.861")

