Best Practices for Storing and Using Data Frames Too Large for Memory

Best practices for storing and using data frames too large for memory?

You probably want to look at these packages:

  • ff for 'flat-file' storage and very efficient retrieval (can do data.frames; different data types)
  • bigmemory for out-of-R-memory but still in RAM (or file-backed) use (can only do matrices; same data type)
  • biglm for out-of-memory model fitting with lm() and glm()-style models.

and also see the High-Performance Computing task view.
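
As an illustration of the biglm route, here is a minimal sketch of chunked model fitting; the file name big.csv, the chunk size, and the columns y, x1, x2 are only placeholders:

library(biglm)

# Hypothetical input: big.csv with numeric columns y, x1, x2.
chunk_size <- 100000
con <- file("big.csv", open = "r")
chunk <- read.csv(con, nrows = chunk_size)            # first chunk, reads the header
col_names <- names(chunk)
fit <- biglm(y ~ x1 + x2, data = chunk)               # fit on the first chunk
repeat {
  chunk <- tryCatch(
    read.csv(con, nrows = chunk_size, header = FALSE, col.names = col_names),
    error = function(e) NULL                          # read.csv errors at end of file
  )
  if (is.null(chunk) || nrow(chunk) == 0) break
  fit <- update(fit, chunk)                           # add the next chunk to the fit
}
close(con)
summary(fit)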

R memory efficient way to store many data frames?

Your example and your mention of the apply family of functions suggest that the structure of the data frames is identical, i.e., they all have the same columns.

If this is the case, and if the total volume of data (all data frames together) still fits in available RAM, then a solution could be to pack all the data into one large data.table with an extra id column. This can be achieved with the function rbindlist:

library(data.table)
x <- data.table(A = rnorm(100), B = rnorm(200))
y <- data.table(A = rnorm(30), B = rnorm(300))
z <- data.table(A = rnorm(20), B = rnorm(600))
dt <- rbindlist(list(x, y, z), idcol = TRUE)
dt
      .id           A           B
   1:   1 -0.10981198 -0.55483251
   2:   1 -0.09501871 -0.39602767
   3:   1  2.07894635  0.09838722
   4:   1 -2.16227936  0.04620932
   5:   1 -0.85767886 -0.02500463
  ---
1096:   3  1.65858606 -1.10010088
1097:   3 -0.52939876 -0.09720765
1098:   3  0.59847826  0.78347801
1099:   3  0.02024844 -0.37545346
1100:   3 -1.44481850 -0.02598364

The rows originating from the individual source data frames can be distinguished by the .id variable. All the memory-efficient data.table operations can be applied to all rows, to selected rows (dt[.id == 1, some_function(A)]), or group-wise (dt[, another_function(B), by = .id]).
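
For instance, with concrete functions in place of the placeholders above:

dt[.id == 1, mean(A)]               # mean of A for the rows that came from x
dt[, .(sum_B = sum(B)), by = .id]   # per-source sums of B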

Although the data.table operations are memory efficient, RAM might still be a limiting factor. Use the tables() function to monitor memory consumption of all created data.table objects:

tables()
     NAME  NROW NCOL MB COLS    KEY
[1,] dt   1,100    3  1 .id,A,B
[2,] x      200    2  1 A,B
[3,] y      300    2  1 A,B
[4,] z      600    2  1 A,B
Total: 4MB

and remove objects from memory which are no longer needed

rm(x, y, z)
tables()
     NAME  NROW NCOL MB COLS    KEY
[1,] dt   1,100    3  1 .id,A,B
Total: 1MB

Out of memory when modifying a big R data.frame

Look up 'copy-on-write' in the context of R discussions related to memory. As soon as one part of a (potentially really large) data structure changes, a copy is made.
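
You can watch this happen with base R's tracemem(), which prints a message whenever the traced object is duplicated (a small illustration, not part of the original answer):

x <- runif(1e6)
tracemem(x)        # start reporting duplications of x
y <- x             # no copy yet; both names point to the same vector
y[1] <- 0          # modifying y forces the copy, and tracemem reports it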

A useful rule of thumb is that if your largest object is N MB/GB/... in size, you need around 3*N of RAM. Such is life with an interpreted system.

Years ago, when I had to handle large amounts of data on 32-bit machines with (relative to the data volume) relatively little RAM, I got good use out of early versions of the bigmemory package. It uses the 'external pointer' interface to keep large gobs of memory outside of R. That saves you not only the '3x' factor, but possibly more, as you may get away with non-contiguous memory (contiguous memory being the other thing R likes).
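
For reference, a minimal file-backed bigmemory sketch (the dimensions and file names are made up):

library(bigmemory)

# The matrix lives in a memory-mapped file, outside R's heap, so modifying it
# does not trigger R's copy-on-modify behaviour.
bm <- filebacked.big.matrix(nrow = 1e6, ncol = 10, type = "double",
                            backingfile = "data.bin",
                            descriptorfile = "data.desc")
bm[1:5, 1] <- rnorm(5)                   # writes go straight to the backing file
bm2 <- attach.big.matrix("data.desc")    # another R process can attach the same data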

Why is there such a large difference in memory usage for dataframes between pandas and R?

I found that link super useful and figured it's worth breaking out from the comments and summarizing:

Reducing Pandas memory usage #1: lossless compression

  1. Load only columns of interest with usecols

    df = pd.read_csv('voters.csv', usecols=['First Name', 'Last Name'])
  2. Shrink numerical columns with smaller dtypes

    • int64: (default) -9223372036854775808 to 9223372036854775807
    • int16: -32768 to 32767
    • int8: -128 to 127
    df = pd.read_csv('voters.csv', dtype={'Ward Number': 'int8'})
  3. Shrink categorical data with dtype category

    df = pd.read_csv('voters.csv', dtype={'Party Affiliation': 'category'})
  4. Convert mostly-NaN data to dtype Sparse

    sparse_str_series = series.astype('Sparse[str]')
    sparse_int16_series = series.astype('Sparse[int16]')
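
Taken together, items 1-3 can go into a single read_csv call, and DataFrame.memory_usage(deep=True) shows the effect; the file and column names below are the ones used in the examples above:

import pandas as pd

df = pd.read_csv(
    'voters.csv',
    usecols=['First Name', 'Last Name', 'Ward Number', 'Party Affiliation'],
    dtype={'Ward Number': 'int8', 'Party Affiliation': 'category'},
)
# Compare against an untyped read_csv of the same file to see the saving
print(df.memory_usage(deep=True).sum() / 1e6, 'MB')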

How can I better manage my use of memory in Pandas?

Here is a typical work-flow for this type of data:

  • 1) read in csv data, convert to DataFrame, coerce data types, and write out using HDFStore (depending on your needs, this could be 'fixed' or 'table' format). Do this in a separate process, then exit the process. When the dataset is large, I read the csv in logical chunks (e.g. a range of dates), then output a 'table'-format HDF5 file. You can then append to this file.

  • 2) query (again, this could be on dates or some other criteria), perform calculations, then write out NEW HDF5 files. This can be done in parallel (multiple processes). MAKE SURE THAT YOU ARE WRITING SEPARATE FILES IN EACH PROCESS.

  • 3) combine the prior data files into single HDF5 files. This is a SINGLE process event.

  • 4) repeat 2 & 3 as needed.

The key is to do discrete steps, writing out intermediate data in between and exiting processes in between. This keeps the in-memory data size manageable and makes in-memory calculations fast. Further, it allows multiprocessing for CPU-intensive operations on a read-only HDF5 file.

It is important to do this in separate system processes to allow the system to reclaim memory.
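
A minimal sketch of steps 1) and 2); the file names, chunk size, and columns (date, value) are only placeholders:

import pandas as pd

# 1) one-off process: read the csv in chunks, coerce dtypes, and append to a
#    'table'-format HDF5 file that can be appended to and queried later.
#    (HDF5 support requires the PyTables package.)
with pd.HDFStore('data.h5') as store:
    for chunk in pd.read_csv('raw.csv', chunksize=500000, parse_dates=['date']):
        chunk['value'] = chunk['value'].astype('float32')
        store.append('raw', chunk, format='table', data_columns=['date'])

# 2) separate process: query a slice, compute, write a NEW file
df = pd.read_hdf('data.h5', 'raw',
                 where="date >= '2020-01-01' & date < '2020-02-01'")
result = df.groupby('date')['value'].mean()
result.to_hdf('result_2020_01.h5', key='result', format='table')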

HTH

What are helpful optimizations in R for big data sets?

What best practices can I apply and, in particular, what can I do to make these types of functions optimized for large datasets?

Use the data.table package:

library(data.table)
d1 = as.data.table(dataframe)
d2 = as.data.table(dataframe_two)


1

Grouping by many columns is something that data.table is excellent at.

See the barchart at the very bottom of the second plot for a comparison against dplyr, spark, and others for exactly this kind of grouping:

https://h2oai.github.io/db-benchmark

by_cols = paste("key", c("a","b","c","d","e","f","g","h","i"), sep="_")
a1 = d1[, .(min_date = min(date_sequence)), by=by_cols]

Note that I changed date to date_sequence; I think you meant that as a column name.

2

It is unclear which fields you want to merge the tables on; dataframe_two does not have the specified fields, so the query is invalid.

Please clarify.

3

data.table has a very useful type of join called a rolling join, which does exactly what you need:

a3 = d2[d1, on=c("key_a","date_sequence"), roll="nearest"]
# Error in vecseq(f__, len__, if (allow.cartesian || notjoin ||
#   !anyDuplicated(f__, : Join results in more than 2^31 rows (internal
#   vecseq reached physical limit). Very likely misspecified join. Check for
#   duplicate key values in i each of which join to the same group in x over
#   and over again. If that's ok, try by=.EACHI to run j for each group to
#   avoid the large allocation. Otherwise, please search for this error
#   message in the FAQ, Wiki, Stack Overflow and data.table issue tracker
#   for advice.

It results in an error. The error is in fact very useful: on your real data it may work perfectly fine, as the reason behind the error (the cardinality of matching rows) may be related to the process of generating the sample data. It is very tricky to produce good dummy data for joining.
If you are getting the same error on your real data, you may want to review the design of that query, as it causes a row explosion by doing a many-to-many join, even after restricting to a single date_sequence identity (taking roll into account). I don't see this kind of query as valid for that data (strictly speaking, given the cardinalities of the join fields). You may want to introduce a data-quality-check layer in your workflow to ensure there are no duplicates on key_a and date_sequence combined.
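
A sketch of that data-quality check, using the d1/d2 names from above:

# rows in d2 whose (key_a, date_sequence) combination occurs more than once;
# any such combination multiplies rows in the join above
dups <- d2[, .N, by = .(key_a, date_sequence)][N > 1L]
dups
stopifnot(nrow(dups) == 0L)   # fail early if the join would explode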

Memory Error when parsing two large data frames

The issue is that, in order to see all the values to filter on, you will need to store both DataFrames in memory at some point. You can improve your efficiency somewhat by not using apply(), which still iterates row by row. The following code is a more efficient, vectorized approach using boolean masking directly:

dfB[~dfB["AccountID"].isin(dfA["AccountID"])]

However, if storage is the problem, then this may still not work. Some approaches to consider are chunking the data, as you say you've already tried, or some of the options in the pandas documentation on enhancing performance.
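
If both DataFrames really don't fit in memory at once, one hedged chunked variant is to keep only dfA's AccountID column around and stream dfB through read_csv(chunksize=...); the file names here are made up:

import pandas as pd

# Only the IDs from dfA are held in memory, not the whole frame
ids_a = set(pd.read_csv('dfA.csv', usecols=['AccountID'])['AccountID'])

filtered = []
for chunk in pd.read_csv('dfB.csv', chunksize=100000):
    filtered.append(chunk[~chunk['AccountID'].isin(ids_a)])

result = pd.concat(filtered, ignore_index=True)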


