How to load data quickly into R?
It depends on what you want to do and how you process the data further. In any case, loading from a binary R object is always going to be faster, provided you always need the same dataset. The limiting speed here is the speed of your harddrive, not R. The binary form is the internal representation of the dataframe in the workspace, so there is no transformation needed anymore.
Any kind of text file is a different story, as you include invariably an overhead : each time you read in the text file, the data has to be transformed to the binary R object. I'd forget about them. They are only useful for porting datasets from one application to another.
Setting up a MySQL backend is very useful if you need different parts of the data, or different subsets in different combinations. Especially when working with huge datasets, the fact that you don't have to load in the whole dataset before you can start selecting the rows/columns, can gain you quite some time. But this only works with huge datasets, as reading a binary file is quite a bit faster than searching a database.
If the data is not too big, you can save different dataframes in one RData file, giving you the opportunity to streamline things a bit more. I often have a set of dataframes in a list or in a seperate environment (see also ?environment
for some simple examples). This allows for lapply
/ eapply
solutions to process multiple dataframes at once.
Quickly reading very large tables as dataframes
An update, several years later
This answer is old, and R has moved on. Tweaking read.table
to run a bit faster has precious little benefit. Your options are:
Using
vroom
from the tidyverse packagevroom
for importing data from csv/tab-delimited files directly into an R tibble. See Hector's answer.Using
fread
indata.table
for importing data from csv/tab-delimited files directly into R. See mnel's answer.Using
read_table
inreadr
(on CRAN from April 2015). This works much likefread
above. The readme in the link explains the difference between the two functions (readr
currently claims to be "1.5-2x slower" thandata.table::fread
).read.csv.raw
fromiotools
provides a third option for quickly reading CSV files.Trying to store as much data as you can in databases rather than flat files. (As well as being a better permanent storage medium, data is passed to and from R in a binary format, which is faster.)
read.csv.sql
in thesqldf
package, as described in JD Long's answer, imports data into a temporary SQLite database and then reads it into R. See also: theRODBC
package, and the reverse depends section of theDBI
package page.MonetDB.R
gives you a data type that pretends to be a data frame but is really a MonetDB underneath, increasing performance. Import data with itsmonetdb.read.csv
function.dplyr
allows you to work directly with data stored in several types of database.Storing data in binary formats can also be useful for improving performance. Use
saveRDS
/readRDS
(see below), theh5
orrhdf5
packages for HDF5 format, orwrite_fst
/read_fst
from thefst
package.
The original answer
There are a couple of simple things to try, whether you use read.table or scan.
Set
nrows
=the number of records in your data (nmax
inscan
).Make sure that
comment.char=""
to turn off interpretation of comments.Explicitly define the classes of each column using
colClasses
inread.table
.Setting
multi.line=FALSE
may also improve performance in scan.
If none of these thing work, then use one of the profiling packages to determine which lines are slowing things down. Perhaps you can write a cut down version of read.table
based on the results.
The other alternative is filtering your data before you read it into R.
Or, if the problem is that you have to read it in regularly, then use these methods to read the data in once, then save the data frame as a binary blob with save
saveRDS
, then next time you can retrieve it faster with load
readRDS
.
Best file type for loading data in to R (speed wise)?
Based on the comments and some of my own research, I put together a benchmark.
library(bench)
nr_of_rows <- 1e7
set.seed(1)
df <- data.frame(
Logical = sample(c(TRUE, FALSE, NA), prob = c(0.85, 0.1, 0.05), nr_of_rows, replace = TRUE),
Integer = sample(1L:100L, nr_of_rows, replace = TRUE),
Real = sample(sample(1:10000, 20) / 100, nr_of_rows, replace = TRUE),
Factor = as.factor(sample(labels(UScitiesD), nr_of_rows, replace = TRUE))
)
baseRDS <- function() {
saveRDS(df, "dataset.Rds")
readRDS("dataset.Rds")
}
baseRDS_nocompress <- function() {
saveRDS(df, "dataset.Rds", compress = FALSE)
readRDS("dataset.Rds")
}
baseRData <- function() {
save(list = "df", file = "dataset.Rdata")
load("dataset.Rdata")
df
}
data.table <- function() {
data.table::fwrite(df, "dataset.csv")
data.table::fread("dataset.csv")
}
feather <- function(variables) {
feather::write_feather(df, "dataset.feather")
as.data.frame(feather::read_feather("dataset.feather"))
}
fst <- function() {
fst::write.fst(df, "dataset.fst")
fst::read.fst("dataset.fst")
}
fst <- function() {
fst::write.fst(df, "dataset.fst")
fst::read.fst("dataset.fst")
}
# only works on Unix systems
# fastSave <- function() {
# fastSave::save.pigz(df, file = "dataset.RData", n.cores = 4)
# fastSave::load.pigz("dataset.RData")
# }
results <- mark(
baseRDS(),
baseRDS_nocompress(),
baseRData(),
data.table(),
feather(),
fst(),
check = FALSE
)
Results
summary(results)
# A tibble: 6 x 13
expression min median `itr/sec` mem_alloc
<bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt>
1 baseRDS() 15.74s 15.74s 0.0635 191MB
2 baseRDS_nocompress() 720.82ms 720.82ms 1.39 191MB
3 baseRData() 18.14s 18.14s 0.0551 191MB
4 data.table() 4.43s 4.43s 0.226 297MB
5 feather() 794.13ms 794.13ms 1.26 191MB
6 fst() 233.96ms 304.28ms 3.29 229MB
# ... with 8 more variables: `gc/sec` <dbl>, n_itr <int>,
# n_gc <dbl>, total_time <bch:tm>, result <list>,
# memory <list>, time <list>, gc <list>
> summary(results, relative = TRUE)
# A tibble: 6 x 13
expression min median `itr/sec` mem_alloc
<bch:expr> <dbl> <dbl> <dbl> <dbl>
1 baseRDS() 67.3 51.7 1.15 1.00
2 baseRDS_nocompress() 3.08 2.37 25.2 1.00
3 baseRData() 77.5 59.6 1 1.00
4 data.table() 18.9 14.5 4.10 1.56
5 feather() 3.39 2.61 22.8 1
6 fst() 1 1 59.6 1.20
# ... with 8 more variables: `gc/sec` <dbl>, n_itr <int>,
# n_gc <dbl>, total_time <bch:tm>, result <list>,
# memory <list>, time <list>, gc <list>
Based on this, the fst
package is the fastest. It's followed by base R on the second place with the option compress = FALSE
. This produces large files though. I wouldn't recommend saving anything in csv except you want to open it with a different program. In that case data.table
would be your choice. Otherwise I would either recommend saveRDS
or fst
.
How to save a large dataframe and quickly load it in R?
You can serialize it easily with:
readr::write_rds(pageInfo_df, "pageInfo_df.Rds")
and then deserialize it like so:
readr::read_rds("pageInfo_df.Rds")
this should handle every valid R object of an arbitrary complexity.
What is the fastest way and fastest format for loading large data sets into R
It depends on what you plan on doing with the data. If you want the entire data in memory for some operation then I guess your best bet is fread or readRDS (the file size for a data saved in RDS is much much smaller if that matters to you).
If you will be doing summary operations on the data I have found one time conversion to a database (using sqldf) a much better option, as subsequent operations are much more faster by executing sql queries on the data, but that is also because I don't have enough RAM to load 13 GB files in memory.
How do I import a large (6 Gb) .csv file into R efficiently and quickly, without the R REPL crashing?
R will crash if you try to load a file that is larger than your available memory, so you should see that you have at least 6gb ram free (a 6gb .csv is roughly 6gb in memory also). Python will have the same problem
(apparently someone asked the exact same question for python a few years ago)
For reading large csv files, you should either use readr::read_csv()
or data.table::fread()
, as both are much faster than base::read.table()
.
readr::read_csv_chunked
supports reading csv files in chunks, so if you don't need your whole data at once, that might help. You could also try just reading the columns of interest, to keep the memory size smaller.
Related Topics
Paste All Combinations of a Vector in R
R:Pass Argument to Glm Inside an R Function
How to Subset a Matrix with Different Column Positions for Each Row
Calculate Total Miles Traveled from Vectors of Lat/Lon
R: Apt-Get Install R-Cran-Foo VS. Install.Packages("Foo")
Using Data.Table I and J Arguments in Functions
Save Imported CSV Data in Vector - R
Datalabels in R Highcharter Cannot Be Seen After Print as Png or Jpg
Determine the Number of Na Values in a Column
R: Removing Null Elements from a List
Converting Data Frame Column from Character to Numeric
Scale and Size of Plot in Rstudio Shiny
How to Align Multiple Ggplot2 Plots and Add Shadows Over All of Them
Plotting Multiple Time Series on the Same Plot Using Ggplot()
Export Data Frames to Excel via Xlsx with Conditional Formatting