R: How to Rbind Two Huge Data-Frames Without Running Out of Memory

R: how to rbind two huge data-frames without running out of memory

Rather than reading the files into R at the beginning and then combining them, you could have SQLite read and combine them before sending the result to R. That way the files are never individually loaded into R.

# create two sample files
DF1 <- data.frame(A = 1:2, B = 2:3)
write.table(DF1, "data1.dat", sep = ",", quote = FALSE)
rm(DF1)

DF2 <- data.frame(A = 10:11, B = 12:13)
write.table(DF2, "data2.dat", sep = ",", quote = FALSE)
rm(DF2)

# now we do the real work
library(sqldf)

data1 <- file("data1.dat")
data2 <- file("data2.dat")

sqldf(c("select * from data1",
"insert into data1 select * from data2",
"select * from data1"),
dbname = tempfile())

This gives:

>  sqldf(c("select * from data1", "insert into data1 select * from data2", "select * from data1"), dbname = tempfile())
   A  B
1  1  2
2  2  3
3 10 12
4 11 13

This shorter version also works if row order is unimportant:

sqldf("select * from data1 union select * from data2", dbname = tempfile())

See the sqldf home page http://sqldf.googlecode.com and ?sqldf for more info. Pay particular attention to the file format arguments, since they are close to but not identical to those of read.table. Here we used the defaults, so it was less of an issue.
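
If your files do not use the default layout, it can be described via sqldf's file.format argument. A minimal sketch, assuming hypothetical semicolon-separated files with a header row (the component names used here are my recollection; check ?sqldf for the exact spelling):

# hypothetical: data1.dat is semicolon-separated with a header row
data1 <- file("data1.dat")
sqldf("select * from data1",
      dbname = tempfile(),
      file.format = list(header = TRUE, sep = ";"))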

Memory efficient alternative to rbind - in-place rbind?

For now I have worked out the following solution:

nextrow = nrow(df) + 1
df[nextrow:(nextrow + nrow(df.extension) - 1), ] = df.extension
# we need to ensure unique row names
row.names(df) = 1:nrow(df)

Now I don't run out of memory. I think it's because I only need to store

object.size(df) + 2 * object.size(df.extension)

while with rbind R would need

object.size(rbind(df, df.extension)) + object.size(df) + object.size(df.extension)

After that I use

rm(df.extension)
gc(reset=TRUE)

to free the memory I don't need anymore.

This solved my problem for now, but I feel there is a more advanced way to do a memory-efficient rbind. I'd appreciate any comments on this solution.
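
One alternative worth considering (not part of the original post) is data.table, whose rbindlist builds the combined table in a single allocation and avoids some of the intermediate copies base rbind makes. A minimal sketch, assuming df and df.extension as above:

library(data.table)

# convert both data.frames in place (by reference), so no extra copies are made
setDT(df)
setDT(df.extension)

df <- rbindlist(list(df, df.extension))
rm(df.extension)
gc(reset = TRUE)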

how do you process many files without running out of memory in R

You can try importing all your files into a list and then processing the list:

library(XML)

# parse every file, then extract one row of data from each parsed tree
all_files <- lapply(input_files, xmlTreeParse, isHTML = TRUE)
process_files <- lapply(all_files, function(myfile) {
  r <- xmlRoot(myfile)
  server <- xmlValue(r[[2]][1][1]$h1)
  time <- xmlValue(r[[2]][4][1]$dl[1]$dt)
  web_data <- data.frame(server, time, stringsAsFactors = FALSE)
  web_data
})
total_data <- do.call(rbind, process_files)

If you need to split your files into chunks, you can use the function seq to get the starting index of each chunk:

seq_ind <- seq(1, length(input_files), by=1000)

Then you can get the list of files corresponding to each chunk with:

# note the - 1 so that consecutive chunks do not overlap at their boundaries
files_in_chunks <- mapply(function(x, y) input_files[x:y],
                          x = seq_ind,
                          y = c(seq_ind[-1] - 1, length(input_files)),
                          SIMPLIFY = FALSE)
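
Each chunk can then be parsed and reduced to a small data frame before the next chunk is read, so only one chunk of parsed documents is in memory at a time. A sketch along the lines of the code above (not from the original answer):

# process one chunk at a time; only ~1000 parsed trees are held in memory at once
chunk_results <- lapply(files_in_chunks, function(chunk_files) {
  parsed <- lapply(chunk_files, xmlTreeParse, isHTML = TRUE)
  rows <- lapply(parsed, function(myfile) {
    r <- xmlRoot(myfile)
    data.frame(server = xmlValue(r[[2]][1][1]$h1),
               time   = xmlValue(r[[2]][4][1]$dl[1]$dt),
               stringsAsFactors = FALSE)
  })
  do.call(rbind, rows)
})
total_data <- do.call(rbind, chunk_results)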

Not enough memory to row bind two large datasets

If the combined dataset will fit into memory on its own (even though holding both inputs plus the result at once does not), you could build the combined table in a CSV via fwrite with append = TRUE.

library(data.table)

# write B first, seeding the column layout with one row of A (dropped again
# with [-1]) so the combined file starts with A's columns in A's order
fwrite(
  rbindlist(
    list(
      fread("A.csv", nrows = 1L),
      fread("B.csv")
    ),
    fill = TRUE
  )[-1],
  "AB.csv"
)

# then append A; only one input table is in memory at a time
fwrite(
  fread("A.csv"),
  "AB.csv",
  append = TRUE
)

# maybe restart R here to free memory before reading the combined file
AB <- fread("AB.csv", fill = TRUE)
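
The same append idea extends to more than two files. A minimal sketch, assuming all input files share the same columns in the same order (the fill = TRUE trick above is what handles mismatched columns):

library(data.table)

input_csvs <- c("A.csv", "B.csv", "C.csv")   # hypothetical list of input files
out <- "combined.csv"
if (file.exists(out)) file.remove(out)

for (f in input_csvs) {
  first <- !file.exists(out)
  # write the header only with the first file, then append the rest;
  # only one input table is read into memory at a time
  fwrite(fread(f), out, append = !first, col.names = first)
}

combined <- fread(out)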

