How to Tell When My Dataset in R Is Going to Be Too Large

How can I tell when my dataset in R is going to be too large?

R is well suited for big datasets, either using out-of-the-box solutions like bigmemory or the ff package (especially read.csv.ffdf), or by processing your data in chunks using your own scripts. In almost all cases a little programming makes processing large datasets (>> memory, say 100 GB) very possible. Doing this kind of programming yourself takes some time to learn (I don't know your level), but it makes you really flexible. Whether this is your cup of tea, or whether you need to go this route, depends on the time you want to invest in learning these skills. But once you have them, they will make your life as a data analyst much easier.
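As a rough illustration of the chunked route, here is a minimal sketch using read.csv.ffdf from the ff package; the file name and chunk size are placeholders.

library(ff)
# read the file in chunks of 100,000 rows; the result stays file-backed
# ("logfile.csv" is a hypothetical file name)
logs <- read.csv.ffdf(file = "logfile.csv", header = TRUE,
                      next.rows = 100000)
dim(logs)   # an ffdf object; the data are not held in RAM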

In regard to analyzing logfiles, I know that the stats pages generated from Call of Duty 4 (computer multiplayer game) work by parsing the log file iteratively into a database, and then retrieving the statistics per user from the database. See here for an example of the interface. The iterative (in chunks) approach means that logfile size is (almost) unlimited. However, getting good performance is not trivial.
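The same pattern is easy to reproduce in R; below is a hedged sketch (the parse_chunk() helper, file name, and table name are hypothetical) that appends parsed log chunks to an SQLite database and then queries the per-user statistics.

library(DBI)
library(RSQLite)

con <- dbConnect(SQLite(), "gamestats.sqlite")
log_con <- file("server.log", open = "r")
repeat {
  lines <- readLines(log_con, n = 100000)     # one chunk of raw log lines
  if (length(lines) == 0) break
  chunk <- parse_chunk(lines)                 # hypothetical parser returning a data.frame
  dbWriteTable(con, "events", chunk, append = TRUE)
}
close(log_con)
# per-user statistics are then a simple query against the database
dbGetQuery(con, "SELECT player, COUNT(*) AS events FROM events GROUP BY player")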

A lot of the stuff you can do in R, you can also do in Python or Matlab, or even C++ or Fortran. But only if that tool has out-of-the-box support for what you want would I see a distinct advantage of that tool over R. For processing large data see the HPC Task view. See also an earlier answer of mine about reading a very large text file in chunks. Other related links that might be interesting for you:

  • Quickly reading very large tables as dataframes in R
  • https://stackoverflow.com/questions/1257021/suitable-functional-language-for-scientific-statistical-computing (the discussion includes what to use for large data processing).
  • Trimming a huge (3.5 GB) csv file to read into R
  • A blog post of mine showing how to estimate the RAM usage of a dataset (see the sketch after this list). Note that this assumes that the data will be stored in a matrix or array, with a single data type.
  • Log file processing with R
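As a back-of-the-envelope version of that RAM estimate: a numeric (double) value takes 8 bytes, so for a hypothetical dataset of 10 million rows and 15 columns:

rows <- 1e7; cols <- 15
rows * cols * 8 / 2^30   # roughly 1.1 GB, before any copies R makes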

In regard to choosing R or some other tool, I'd say if it's good enough for Google it is good enough for me ;).

Dealing with big datasets in R

So, to read your data into a Filebacked Big Matrix (FBM), you can do

# collect the paths of the input files
files <- list.files(path = "SST-CMEMS", pattern = "SST-CMEMS-198201*",
                    full.names = TRUE)

# read one file to get the number of SST values per file (= number of rows)
tmp <- sst_data_full(files[1])

library(bigstatsr)
# file-backed matrix: one row per SST value, one column per input file
mat <- FBM(length(tmp$sst), length(files))

# fill the FBM column by column
for (i in seq_along(files)) {
  mat[, i] <- sst_data_full(files[i])$sst
}
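Once filled, the FBM supports standard matrix subsetting, and bigstatsr provides block-wise helpers. As a small follow-up sketch (assuming the mat built above, and bigstatsr's big_apply, which works over blocks of columns), column means could be computed without pulling everything into RAM at once:

# each block of columns is loaded, summarised, and the results concatenated
col_means <- big_apply(mat, a.FUN = function(X, ind) colMeans(X[, ind]),
                       a.combine = "c")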

Work in R with very large data set

If you are working with package ff and have your data in SQL, you can easily get them into ff using package ETLUtils; see the documentation for an example using ROracle.
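As a rough sketch of that route (assuming ETLUtils' read.dbi.ffdf and, for illustration, an RSQLite connection in place of the ROracle example; the query, database file and chunk sizes are placeholders):

library(ETLUtils)
library(RSQLite)
# pull the result of the query into an ffdf, fetching in chunks
calls <- read.dbi.ffdf(
  query = "SELECT * FROM calls",
  dbConnect.args = list(drv = SQLite(), dbname = "calls.sqlite"),
  first.rows = 10000, next.rows = 50000)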

In my experience, ff is perfectly suited for the type of dataset you are working with (21 million rows and 15 columns) - in fact your setup is kind of small for ff, unless your columns contain a lot of character data, which will be converted to factors (meaning all your factor levels should be able to fit in your RAM).
Packages ETLUtils, ff and ffbase allow you to get your data into R using ff and do some basic statistics on it. Depending on what you will do with your data and on your hardware, you might have to consider sampling when you build models. I prefer having my data in R, building a model based on a sample, and scoring using the tools in ff (like chunking) or from package ffbase.

The drawback is that you have to get used to the fact that your data are ffdf objects and that might take some time - especially if you are new to R.
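To make the chunked scoring concrete, here is a hedged sketch: fit (a model built on an in-RAM sample) and mydata_ffdf (an existing ffdf) are hypothetical, and ff's chunk() is used to walk over the rows block by block.

library(ff)
library(ffbase)
scores <- ff(vmode = "double", length = nrow(mydata_ffdf))
for (idx in chunk(mydata_ffdf)) {
  block <- mydata_ffdf[idx, ]                   # only this chunk is in RAM
  scores[idx] <- predict(fit, newdata = block)  # score and store back on disk
}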

How to read large dataset in R

Sure:

  1. Get a bigger computer, in particular more RAM
  2. Run a 64-bit OS; see 1) about more RAM, now that you can use it
  3. Read only the columns you need
  4. Read fewer rows
  5. Read the data in binary rather than re-parsing 2 GB of text (which is mighty inefficient) - see the sketch below

There is also a manual for this at the R site.
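A small sketch of points 3-5, with hypothetical file names: skip unneeded columns via colClasses = "NULL", cap the number of rows with nrows, and cache a binary copy so later runs skip the text parsing entirely.

# keep columns 1 and 3, skip columns 2 and 4, and read only 1 million rows
dat <- read.csv("big.csv",
                colClasses = c("integer", "NULL", "numeric", "NULL"),
                nrows = 1e6)
saveRDS(dat, "big.rds")    # binary on-disk copy
dat <- readRDS("big.rds")  # fast to reload on subsequent runs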

Practical limits of R data frame

R is suited for large data sets, but you may have to change your way of working somewhat from what the introductory textbooks teach you. I did a post on Big Data for R which crunches a 30 GB data set and which you may find useful for inspiration.

The usual places to get started are the High-Performance Computing Task View on CRAN and the R-SIG HPC mailing list.

The main limit you have to work around is a historic limit on the length of a vector of 2^31 - 1 elements, which would not be so bad if R did not store matrices as vectors. (The limit exists for compatibility with some BLAS libraries.)
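To put that limit in numbers, a quick worked check in R:

.Machine$integer.max       # 2147483647, i.e. 2^31 - 1
(2^31 - 1) * 8 / 2^30      # ~16 GB for a numeric vector at that length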

We regularly analyse telco call data records and marketing databases with multi-million customers using R, so would be happy to talk more if you are interested.

Best practices for storing and using data frames too large for memory?

You probably want to look at these packages:

  • ff for 'flat-file' storage and very efficient retrieval (can do data.frames; different data types)
  • bigmemory for out-of-R-memory but still in RAM (or file-backed) use (can only do matrices; same data type)
  • biglm for out-of-memory model fitting with lm() and glm()-style models (see the sketch after this list).

and also see the High-Performance Computing task view.
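A minimal sketch of biglm's chunked fitting: fit on the first chunk of data, then feed further chunks through update(); chunk1 and chunk2 are hypothetical data.frames containing the columns y and x.

library(biglm)
fit <- biglm(y ~ x, data = chunk1)
fit <- update(fit, chunk2)   # adds more rows without refitting from scratch
summary(fit)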


