R: how to rbind two huge data-frames without running out of memory
Rather than reading the two files into R and then combining them, you could have SQLite read and combine them before sending the result to R. That way the files are never individually loaded into R.
# create two sample files
DF1 <- data.frame(A = 1:2, B = 2:3)
write.table(DF1, "data1.dat", sep = ",", quote = FALSE)
rm(DF1)
DF2 <- data.frame(A = 10:11, B = 12:13)
write.table(DF2, "data2.dat", sep = ",", quote = FALSE)
rm(DF2)
# now we do the real work
library(sqldf)
data1 <- file("data1.dat")
data2 <- file("data2.dat")
sqldf(c("select * from data1",
"insert into data1 select * from data2",
"select * from data1"),
dbname = tempfile())
This gives:
> sqldf(c("select * from data1", "insert into data1 select * from data2", "select * from data1"), dbname = tempfile())
A B
1 1 2
2 2 3
3 10 12
4 11 13
This shorter version also works if row order is unimportant:
sqldf("select * from data1 union select * from data2", dbname = tempfile())
See the sqldf home page http://sqldf.googlecode.com and ?sqldf for more info. Pay particular attention to the file format arguments, since they are close to but not identical with those of read.table. Here we have used the defaults, so it was less of an issue.
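For instance, if the files had used a different separator, it could presumably be passed through sqldf's file.format argument (a sketch, not tested against these files; see ?sqldf for the exact fields):
# hypothetical files delimited by ";" rather than ","
sqldf("select * from data1", dbname = tempfile(),
      file.format = list(header = TRUE, sep = ";"))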
Memory efficient alternative to rbind - in-place rbind?
For now, I have worked out the following solution:
# grow df in place by assigning the extension into new rows
nextrow = nrow(df) + 1
df[nextrow:(nextrow + nrow(df.extension) - 1), ] = df.extension
# we need to ensure unique row names
row.names(df) = 1:nrow(df)
Now I don't run out of memory. I think it's because I only need to store
object.size(df) + 2 * object.size(df.extension)
while with rbind R would need
object.size(rbind(df,df.extension)) + object.size(df) + object.size(df.extension).
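As a rough sanity check of that accounting, something like this can be run (with illustrative sizes, not my real data):
# illustrative sizes only
df <- data.frame(x = runif(1e6))
df.extension <- data.frame(x = runif(1e5))
print(object.size(df) + object.size(df.extension) * 2)    # in-place estimate
print(object.size(rbind(df, df.extension)) +
        object.size(df) + object.size(df.extension))      # rbind estimate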
After that I use
rm(df.extension)
gc(reset=TRUE)
to free the memory I don't need anymore.
This solved my problem for now, but I feel there must be a more advanced way to do a memory-efficient rbind. I would appreciate any comments on this solution.
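One option along those lines is data.table's rbindlist, which is generally faster and more memory-frugal than base rbind. A sketch (just an alternative, not the solution above):
library(data.table)
# note: the result is a data.table (which is also a data.frame)
df <- rbindlist(list(df, df.extension))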
How do you process many files without running out of memory in R
You can try importing all your files into a list and then processing the list:
library(XML)  # provides xmlTreeParse, xmlRoot, xmlValue
all_files <- lapply(input_files, xmlTreeParse, isHTML = TRUE)
process_files <- lapply(all_files, function(myfile){
  r = xmlRoot(myfile)  # use the function argument (not an undefined xmlobj)
  server = xmlValue(r[[2]][1][1]$h1)
  time = xmlValue(r[[2]][4][1]$dl[1]$dt)
  web_data = data.frame(server, time, stringsAsFactors = FALSE)
  web_data
})
total_data <- do.call(rbind, process_files)
If you need to split your data into chunks, you can use the function seq to get the starting indices of the chunks:
seq_ind <- seq(1, length(input_files), by=1000)
Then you can get the list of files corresponding to each chunk with (note the - 1 so consecutive chunks don't overlap):
files_in_chunks <- mapply(function(x, y) input_files[x:y],
                          x = seq_ind,
                          y = c(seq_ind[-1] - 1, length(input_files)),
                          SIMPLIFY = FALSE)
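Putting the pieces together, each chunk can then be parsed and reduced separately, so only one chunk's parse trees sit in memory at a time. A sketch, where extract_web_data stands in for the per-file anonymous function defined above:
# process one chunk at a time, keeping only the small result data frames
chunk_results <- lapply(files_in_chunks, function(chunk) {
  parsed <- lapply(chunk, xmlTreeParse, isHTML = TRUE)
  out <- do.call(rbind, lapply(parsed, extract_web_data))
  rm(parsed); gc()  # release the parse trees before the next chunk
  out
})
total_data <- do.call(rbind, chunk_results)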
Not enough memory to row bind two large datasets
If the combined dataset will fit into memory in the end, you could build it on disk first, combining the tables into a CSV via fwrite with append = TRUE so that only one table is loaded at a time.
library(data.table)
# Write B's rows first: read a single row of A to pick up A's columns,
# bind B onto it with fill = TRUE, then drop that one A row before writing.
fwrite(
  rbindlist(
    list(
      fread("A.csv", nrows = 1L),
      fread("B.csv")
    ),
    fill = TRUE
  )[-1],
  "AB.csv"
)
# Append A in full; at no point are both tables in memory together.
fwrite(
  fread("A.csv"),
  "AB.csv",
  append = TRUE
)
# maybe restart R here to free memory before reading the combined file
AB <- fread("AB.csv", fill = TRUE)