Best practices for storing and using data frames too large for memory?
You probably want to look at these packages:
- ff for 'flat-file' storage and very efficient retrieval (can do data.frames; different data types)
- bigmemory for out-of-R-memory but still in RAM (or file-backed) use (can only do matrices; same data type)
- biglm for out-of-memory model fitting with lm()- and glm()-style models.

Also see the High-Performance Computing task view.
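As a minimal sketch of the biglm approach, the model can be fitted chunk-wise so that only one chunk is ever in RAM. The file name (data.csv) and column names (y, x1, x2) below are hypothetical, and the chunk size is arbitrary:

```r
library(biglm)

# Fit on the first chunk, then update() the model with subsequent chunks
# read from the same open connection; only one chunk is in memory at a time.
con <- file("data.csv", open = "r")
chunk <- read.csv(con, nrows = 10000)
fit <- biglm(y ~ x1 + x2, data = chunk)
repeat {
  chunk <- tryCatch(
    read.csv(con, nrows = 10000, header = FALSE,
             col.names = c("y", "x1", "x2")),
    error = function(e) NULL)   # read.csv errors at end of file
  if (is.null(chunk) || nrow(chunk) == 0) break
  fit <- update(fit, chunk)
}
close(con)
summary(fit)
```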
R memory efficient way to store many data frames?
Your example and your mention of the apply family of functions suggest that the structure of the data frames is identical, i.e., they all have the same columns. If this is the case, and if the total volume of data (all data frames together) still fits in available RAM, then a solution could be to pack all data into one large data.table with an extra id column. This can be achieved with the function rbindlist:
library(data.table)
x <- data.table(A = rnorm(100), B = rnorm(200))
y <- data.table(A = rnorm(30), B = rnorm(300))
z <- data.table(A = rnorm(20), B = rnorm(600))
dt <- rbindlist(list(x, y, z), idcol = TRUE)
dt
.id A B
1: 1 -0.10981198 -0.55483251
2: 1 -0.09501871 -0.39602767
3: 1 2.07894635 0.09838722
4: 1 -2.16227936 0.04620932
5: 1 -0.85767886 -0.02500463
---
1096: 3 1.65858606 -1.10010088
1097: 3 -0.52939876 -0.09720765
1098: 3 0.59847826 0.78347801
1099: 3 0.02024844 -0.37545346
1100: 3 -1.44481850 -0.02598364
The rows originating from the individual source data frames can be distinguished by the .id variable. All the memory-efficient data.table operations can be applied to all rows, to selected rows (dt[.id == 1, some_function(A)]), or group-wise (dt[, another_function(B), by = .id]).
Although the data.table operations are memory efficient, RAM might still be a limiting factor. Use the tables() function to monitor the memory consumption of all created data.table objects:
tables()
NAME NROW NCOL MB COLS KEY
[1,] dt 1,100 3 1 .id,A,B
[2,] x 200 2 1 A,B
[3,] y 300 2 1 A,B
[4,] z 600 2 1 A,B
Total: 4MB
and remove objects from memory that are no longer needed:
rm(x, y, z)
tables()
NAME NROW NCOL MB COLS KEY
[1,] dt 1,100 3 1 .id,A,B
Total: 1MB
Out of memory when modifying a big R data.frame
Look up 'copy-on-write' in the context of R discussions related to memory. As soon as one part of a (potentially really large) data structure changes, a copy is made.
A useful rule of thumb is that if your largest object is N mb/gb/... large, you need around 3*N of RAM. Such is life with an interpreted system.
Years ago, when I had to handle large amounts of data on 32-bit machines with (relative to the data volume) little RAM, I got good use out of early versions of the bigmemory package. It uses the 'external pointer' interface to keep large gobs of memory outside of R. That saves you not only the '3x' factor, but possibly more, as you may get away with non-contiguous memory (which is the other thing R likes).
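You can watch this copy-on-modify behavior directly with tracemem(), which prints a message whenever R duplicates the traced object (a quick sketch):

```r
x <- data.frame(a = rnorm(1e6))
tracemem(x)        # start tracing duplications of x
x$a[1] <- 0        # modifying a single element still triggers a copy
# tracemem() prints a "tracemem[0x... -> 0x...]" line for each duplication
untracemem(x)
```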
Why is there such a large difference in memory usage for dataframes between pandas and R?
I found that link super useful and figured it's worth breaking out from the comments and summarizing:
Reducing Pandas memory usage #1: lossless compression
Load only columns of interest with usecols:

df = pd.read_csv('voters.csv', usecols=['First Name', 'Last Name'])

Shrink numerical columns with smaller dtypes:

- int64: (default) -9223372036854775808 to 9223372036854775807
- int16: -32768 to 32767
- int8: -128 to 127

df = pd.read_csv('voters.csv', dtype={'Ward Number': 'int8'})

Shrink categorical data with dtype category:

df = pd.read_csv('voters.csv', dtype={'Party Affiliation': 'category'})

Convert mostly-NaN data to dtype Sparse:
sparse_str_series = series.astype('Sparse[str]')
sparse_int16_series = series.astype('Sparse[int16]')
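To see the effect of these dtype choices, you can compare memory_usage(deep=True) before and after down-casting. A sketch with made-up data (the column names are illustrative, not from the voters file):

```python
import numpy as np
import pandas as pd

n = 100_000
df = pd.DataFrame({
    "ward": np.random.randint(0, 100, size=n),           # defaults to int64
    "party": np.random.choice(["Dem", "Rep", "Ind"], n)  # defaults to object
})
before = df.memory_usage(deep=True).sum()

# Shrink: int64 -> int8 (values fit in -128..127), object -> category
small = df.astype({"ward": "int8", "party": "category"})
after = small.memory_usage(deep=True).sum()

print(f"before: {before:,} bytes, after: {after:,} bytes")
```

The same dtype mapping can of course be passed straight to read_csv via dtype= so the large intermediate is never materialized.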
How can I better manage my use of memory in Pandas?
Here is a typical work-flow for this type of data:
1) Read in the csv data, convert to a DataFrame, coerce data types, and write out using HDFStore (depending on your needs, this could be 'fixed' or 'table' format). Do this in a separate process, then exit the process. When the dataset is large, I read it in by a logical criterion (e.g. a range of dates), then output a 'table'-format HDF5 file; I can then append to this.

2) Query (again by dates or some other criteria), perform calculations, then write out NEW HDF5 files. This can be done in parallel (multiple processes). MAKE SURE THAT YOU ARE WRITING SEPARATE FILES IN EACH PROCESS.

3) Combine the prior data files into single HDF5 files. This is a SINGLE-process event.

4) Repeat 2 & 3 as needed.
The key is to do discrete steps, writing out intermediate data in between and exiting processes in between. This keeps the in-memory data size manageable and makes in-memory calculations fast. Furthermore, it allows multiple processes to perform cpu-intensive operations on a read-only HDF5 file.
It is important to do this in separate system processes to allow the system to reclaim memory.
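The steps above can be sketched end to end. To keep the example dependency-free, this sketch uses CSV intermediates in place of HDFStore, and loops in place of separate OS processes; the structure (ingest in chunks, write intermediates, process each intermediate, combine in a single pass) is what the answer describes:

```python
import glob
import os
import tempfile

import pandas as pd

tmp = tempfile.mkdtemp()

# Step 1: ingest the raw data in logical chunks, coerce dtypes,
# and write one intermediate file per chunk.
pd.DataFrame({"day": [1]*4 + [2]*4 + [3]*4,
              "value": range(12)}).to_csv(os.path.join(tmp, "raw.csv"), index=False)
for i, chunk in enumerate(pd.read_csv(os.path.join(tmp, "raw.csv"), chunksize=4)):
    chunk = chunk.astype({"day": "int8", "value": "int32"})
    chunk.to_csv(os.path.join(tmp, f"stage1_{i}.csv"), index=False)

# Step 2: query/calculate on each intermediate, writing NEW files.
# In the real workflow each iteration would be a separate process,
# each writing its own output file.
for path in sorted(glob.glob(os.path.join(tmp, "stage1_*.csv"))):
    part = pd.read_csv(path)
    out = part.groupby("day", as_index=False)["value"].sum()
    out.to_csv(path.replace("stage1", "stage2"), index=False)

# Step 3: combine the per-chunk results into one file (single process).
combined = pd.concat(
    [pd.read_csv(p) for p in sorted(glob.glob(os.path.join(tmp, "stage2_*.csv")))]
).groupby("day", as_index=False)["value"].sum()
combined.to_csv(os.path.join(tmp, "final.csv"), index=False)
print(combined)
```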
HTH
What are helpful optimizations in R for big data sets?
What best practices can I apply and, in particular, what can I do to make these types of functions optimized for large datasets?
Use the data.table package:
library(data.table)
d1 = as.data.table(dataframe)
d2 = as.data.table(dataframe_two)
1.

Grouping by many columns is something that data.table is excellent at. See the barchart at the very bottom of the second plot at https://h2oai.github.io/db-benchmark for a comparison against dplyr, spark, and others on exactly this kind of grouping:
by_cols = paste("key", c("a","b","c","d","e","f","g","h","i"), sep="_")
a1 = d1[, .(min_date = min(date_sequence)), by=by_cols]
Note I changed date to date_sequence; I think you meant that as a column name.
2.

It is unclear which fields you want to merge the tables on; dataframe_two does not have the fields specified, so the query is invalid. Please clarify.
3.

data.table has a very useful type of join called a rolling join, which does exactly what you need:
a3 = d2[d1, on=c("key_a","date_sequence"), roll="nearest"]
# Error in vecseq(f__, len__, if (allow.cartesian || notjoin ||
#   anyDuplicated(f__, : Join results in more than 2^31 rows (internal
#   vecseq reached physical limit). Very likely misspecified join. Check
#   for duplicate key values in i each of which join to the same group
#   in x over and over again. If that's ok, try by=.EACHI to run j for
#   each group to avoid the large allocation. Otherwise, please search
#   for this error message in the FAQ, Wiki, Stack Overflow and
#   data.table issue tracker for advice.
It results in an error, but the error is in fact very useful. On your real data it may work perfectly fine, as the reason behind the error (the cardinality of matching rows) may be an artifact of how the sample data was generated. It is very tricky to create good dummy data for joining.
If you are getting the same error on your real data, you may want to review the design of that query, as it causes a row explosion by doing a many-to-many join, even after restricting to a single date_sequence identity (taking roll into account). I don't see this kind of query as valid for that data (strictly speaking, given the cardinalities of the join fields). You may want to introduce a data-quality-check layer in your workflow to ensure there are no duplicates on key_a and date_sequence combined.
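Such a check can be sketched in data.table by counting rows per join-key combination; any combination with N > 1 would multiply rows in the join:

```r
library(data.table)
# Flag duplicated (key_a, date_sequence) pairs before joining.
dupes <- d1[, .N, by = .(key_a, date_sequence)][N > 1]
if (nrow(dupes) > 0) stop("duplicate join keys found in d1")
```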
Memory Error when parsing two large data frames
The issue is that, in order to see all the values to filter on, you will need to store both DFs in memory at some point. You can improve your efficiency somewhat by not using apply(), which is still an iterator. The following code is a more efficient, vectorized approach using boolean masking directly:
dfB[~dfB["AccountID"].isin(dfA["AccountID"])]
However, if storage is the problem, then this may still not work. Some approaches to consider are chunking the data, as you say you've already tried, or some of the options in the pandas documentation on enhancing performance.