How to Load Data Quickly into R

How to load data quickly into R?

It depends on what you want to do and how you process the data further. In any case, loading from a binary R object is always going to be faster, provided you always need the same dataset. The limiting factor here is the speed of your hard drive, not R. The binary form is the internal representation of the data frame in the workspace, so no further transformation is needed.

Any kind of text file is a different story, as it invariably adds overhead: each time you read in the text file, the data has to be converted to the binary R object. I'd forget about them. They are only useful for porting datasets from one application to another.

Setting up a MySQL backend is very useful if you need different parts of the data, or different subsets in different combinations. Especially when working with huge datasets, the fact that you don't have to load the whole dataset before you can start selecting rows/columns can save you quite a bit of time. But this only pays off with huge datasets, as reading a binary file is quite a bit faster than querying a database.
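
A minimal sketch of that approach using DBI; the driver, connection details, table, and column names below are placeholders, not part of the original answer:

library(DBI)

con <- dbConnect(RMariaDB::MariaDB(),
                 dbname = "mydb", user = "me", password = "secret")

# Only the selected rows/columns are transferred into R
subset_df <- dbGetQuery(con, "SELECT id, value FROM measurements WHERE year = 2020")

dbDisconnect(con)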

If the data is not too big, you can save different data frames in one RData file, giving you the opportunity to streamline things a bit more. I often have a set of data frames in a list or in a separate environment (see also ?environment for some simple examples). This allows for lapply / eapply solutions to process multiple data frames at once.
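
A minimal sketch of the list/environment idea; the data frames here are made up for illustration:

df1 <- data.frame(x = rnorm(10))
df2 <- data.frame(x = rnorm(20))

# Keep related data frames together in a list and process them in one call
dfs <- list(first = df1, second = df2)
lapply(dfs, nrow)

# Or keep them in their own environment and use eapply
e <- new.env()
assign("first", df1, envir = e)
assign("second", df2, envir = e)
eapply(e, nrow)

# Either way, everything can be saved to a single RData file
save(dfs, file = "datasets.RData")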

Quickly reading very large tables as dataframes

An update, several years later

This answer is old, and R has moved on. Tweaking read.table to run a bit faster has precious little benefit. Your options are:

  1. Using vroom for importing data from csv/tab-delimited files directly into an R tibble (a short sketch follows this list). See Hector's answer.

  2. Using fread in data.table for importing data from csv/tab-delimited files directly into R. See mnel's answer.

  3. Using read_table in readr (on CRAN from April 2015). This works much like fread above. The readme in the link explains the difference between the two functions (readr currently claims to be "1.5-2x slower" than data.table::fread).

  4. read.csv.raw from iotools provides another option for quickly reading CSV files.

  5. Trying to store as much data as you can in databases rather than flat files. (As well as being a better permanent storage medium, data is passed to and from R in a binary format, which is faster.) read.csv.sql in the sqldf package, as described in JD Long's answer, imports data into a temporary SQLite database and then reads it into R. See also: the RODBC package, and the reverse depends section of the DBI package page. MonetDB.R gives you a data type that pretends to be a data frame but is really a MonetDB underneath, increasing performance. Import data with its monetdb.read.csv function. dplyr allows you to work directly with data stored in several types of database.

  6. Storing data in binary formats can also be useful for improving performance. Use saveRDS/readRDS (see below), the h5 or rhdf5 packages for HDF5 format, or write_fst/read_fst from the fst package.
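
A minimal sketch of options 1 and 2 above; "dataset.csv" is a stand-in for whatever file you are importing:

library(vroom)
library(data.table)

# vroom reads delimited files into a tibble
tbl <- vroom("dataset.csv")

# fread reads them into a data.table (which is also a data.frame)
DT <- fread("dataset.csv")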


The original answer

There are a few simple things to try, whether you use read.table or scan; the first three are combined in a sketch after the list.

  1. Set nrows=the number of records in your data (nmax in scan).

  2. Make sure that comment.char="" to turn off interpretation of comments.

  3. Explicitly define the classes of each column using colClasses in read.table.

  4. Setting multi.line=FALSE may also improve performance in scan.
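
A minimal sketch combining tips 1-3 for read.table; the file name, row count, and column classes are assumptions about your data:

dat <- read.table(
  "dataset.txt",
  header       = TRUE,
  nrows        = 1000000,   # the (roughly) known number of records
  comment.char = "",        # don't scan for comment characters
  colClasses   = c("integer", "numeric", "character")  # skip type guessing
)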

If none of these things works, then use one of the profiling packages to determine which lines are slowing things down. Perhaps you can write a cut-down version of read.table based on the results.

The other alternative is filtering your data before you read it into R.

Or, if the problem is that you have to read the data in regularly, then use these methods to read it in once, save the data frame as a binary blob with saveRDS, and then next time you can retrieve it much faster with readRDS.
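
A minimal sketch of that read-once, reload-fast workflow (file names are hypothetical):

# Slow path: parse the text file only once
dat <- read.table("dataset.txt", header = TRUE)
saveRDS(dat, "dataset.rds")

# Fast path: every later session just restores the binary object
dat <- readRDS("dataset.rds")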

Best file type for loading data into R (speed-wise)?

Based on the comments and some of my own research, I put together a benchmark.

library(bench)

nr_of_rows <- 1e7
set.seed(1)
df <- data.frame(
  Logical = sample(c(TRUE, FALSE, NA), prob = c(0.85, 0.1, 0.05), nr_of_rows, replace = TRUE),
  Integer = sample(1L:100L, nr_of_rows, replace = TRUE),
  Real = sample(sample(1:10000, 20) / 100, nr_of_rows, replace = TRUE),
  Factor = as.factor(sample(labels(UScitiesD), nr_of_rows, replace = TRUE))
)

baseRDS <- function() {
  saveRDS(df, "dataset.Rds")
  readRDS("dataset.Rds")
}

baseRDS_nocompress <- function() {
  saveRDS(df, "dataset.Rds", compress = FALSE)
  readRDS("dataset.Rds")
}

baseRData <- function() {
  save(list = "df", file = "dataset.Rdata")
  load("dataset.Rdata")
  df
}

data.table <- function() {
  data.table::fwrite(df, "dataset.csv")
  data.table::fread("dataset.csv")
}

feather <- function() {
  feather::write_feather(df, "dataset.feather")
  as.data.frame(feather::read_feather("dataset.feather"))
}

fst <- function() {
  fst::write.fst(df, "dataset.fst")
  fst::read.fst("dataset.fst")
}

# only works on Unix systems
# fastSave <- function() {
# fastSave::save.pigz(df, file = "dataset.RData", n.cores = 4)
# fastSave::load.pigz("dataset.RData")
# }

results <- mark(
  baseRDS(),
  baseRDS_nocompress(),
  baseRData(),
  data.table(),
  feather(),
  fst(),
  check = FALSE
)

Results

summary(results)
# A tibble: 6 x 13
  expression                min   median `itr/sec` mem_alloc
  <bch:expr>           <bch:tm> <bch:tm>     <dbl> <bch:byt>
1 baseRDS()              15.74s   15.74s    0.0635     191MB
2 baseRDS_nocompress() 720.82ms 720.82ms    1.39       191MB
3 baseRData()            18.14s   18.14s    0.0551     191MB
4 data.table()            4.43s    4.43s    0.226      297MB
5 feather()             794.13ms 794.13ms   1.26       191MB
6 fst()                 233.96ms 304.28ms   3.29       229MB
# ... with 8 more variables: `gc/sec` <dbl>, n_itr <int>,
#   n_gc <dbl>, total_time <bch:tm>, result <list>,
#   memory <list>, time <list>, gc <list>

summary(results, relative = TRUE)
# A tibble: 6 x 13
  expression             min median `itr/sec` mem_alloc
  <bch:expr>           <dbl>  <dbl>     <dbl>     <dbl>
1 baseRDS()            67.3   51.7       1.15      1.00
2 baseRDS_nocompress()  3.08   2.37     25.2       1.00
3 baseRData()          77.5   59.6       1         1.00
4 data.table()         18.9   14.5       4.10      1.56
5 feather()             3.39   2.61     22.8       1
6 fst()                 1      1        59.6       1.20
# ... with 8 more variables: `gc/sec` <dbl>, n_itr <int>,
#   n_gc <dbl>, total_time <bch:tm>, result <list>,
#   memory <list>, time <list>, gc <list>

Based on this, the fst package is the fastest, followed in second place by base R saveRDS with compress = FALSE, although that produces large files. I wouldn't recommend saving anything as csv unless you want to open it with a different program; in that case data.table would be your choice. Otherwise I would recommend either saveRDS or fst.

How to save a large dataframe and quickly load it in R?

You can serialize it easily with:

readr::write_rds(pageInfo_df, "pageInfo_df.Rds")

and then deserialize it like so:

readr::read_rds("pageInfo_df.Rds")

This should handle any valid R object of arbitrary complexity.

What is the fastest way and fastest format for loading large data sets into R

It depends on what you plan on doing with the data. If you want the entire dataset in memory for some operation, then your best bet is probably fread or readRDS (the file size for data saved as RDS is much, much smaller, if that matters to you).

If you will be doing summary operations on the data, I have found a one-time conversion to a database (using sqldf) a much better option, as subsequent operations are much faster when executed as SQL queries on the data; but that is also because I don't have enough RAM to load 13 GB files into memory.
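
A minimal sketch of that idea with sqldf::read.csv.sql; the file and column names here are made up:

library(sqldf)

# read.csv.sql imports the csv into a temporary SQLite database and
# returns only the rows matched by the query, so the full file never
# has to be held in R as one data frame.
subset_df <- read.csv.sql(
  "flights.csv",
  sql = "select * from file where carrier = 'AA'"
)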

How do I import a large (6 Gb) .csv file into R efficiently and quickly, without the R REPL crashing?

R will crash if you try to load a file that is larger than your available memory, so make sure you have at least 6 GB of RAM free (a 6 GB .csv is also roughly 6 GB in memory). Python will have the same problem (apparently someone asked the exact same question for Python a few years ago).

For reading large csv files, you should use either readr::read_csv() or data.table::fread(), as both are much faster than base R's read.table().

readr::read_csv_chunked supports reading csv files in chunks, so if you don't need all of your data at once, that might help. You could also try reading only the columns of interest to keep memory usage down.
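
A minimal sketch of both ideas; the file name ("big.csv") and column name (value) are assumptions:

library(readr)

# Process the file in chunks of 100,000 rows; DataFrameCallback
# row-binds whatever each chunk's function returns.
filtered <- read_csv_chunked(
  "big.csv",
  callback = DataFrameCallback$new(function(chunk, pos) {
    chunk[chunk$value > 0, ]
  }),
  chunk_size = 1e5
)

# Reading only the columns of interest also keeps memory usage down:
slim <- read_csv("big.csv", col_types = cols_only(value = col_double()))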


