Partially read really large csv.gz in R using vroom
I haven't been able to figure out a vroom solution for very large, more-than-RAM (gzipped) csv files. However, the following approach has worked well for me, and I'd be grateful to know about approaches with better querying speed that also save disk space.
- Use the split sub-command in xsv from https://github.com/BurntSushi/xsv to split the large csv file into comfortably-within-RAM chunks of, say, 10^5 lines, and save them in a folder.
- Read all chunks one-by-one (to avoid a low-memory error) using data.table::fread in a for loop and save them all into a folder as compressed parquet files using the arrow package, which saves space and prepares the large table for fast querying. For even faster operations, it is advisable to re-save the parquet files partitioned by the fields on which you frequently filter.
- Now you can use arrow::open_dataset and query that multi-file parquet folder using dplyr commands. It takes minimum disk space and gives the fastest results in my experience. A sketch of this loop-and-query workflow is shown after the list.
I use data.table::fread with an explicit definition of the column classes of each field for the fastest and most reliable parsing of csv files. readr::read_csv has also been accurate, but slower. However, read_csv's auto-assignment of column classes, as well as the ways in which you can custom-define column classes with it, is actually the best - so less human-time but more machine-time - which means it may be faster overall depending on the scenario. Other csv parsers have thrown errors for the kind of csv files that I work with and wasted time. A small sketch of both styles of column specification follows this paragraph.
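As an illustration (the file and column names here are made up, not from my actual data), the two styles of column specification look like this:
library(data.table)
library(readr)

# data.table: explicit classes, fastest when you already know the schema
DT <- fread("chunk_0001.csv",
            colClasses = c(id = "character", amount = "numeric", flag = "integer"))

# readr: either let read_csv guess the classes, or spell the spec out with cols()
TB <- read_csv("chunk_0001.csv",
               col_types = cols(id = col_character(),
                                amount = col_double(),
                                flag = col_integer()))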
You may now delete the folder containing the chunked csv files to save space, unless you want to experiment with loop-reading them using other csv parsers.
Other previously successful approaches: loop-read all csv chunks as mentioned above and save them into:
- a folder, using the disk.frame package. That folder may then be queried using dplyr or data.table commands as explained in the documentation. It has a facility to save compressed fst files, which saves space, though not as much as parquet files.
- a table in a DuckDB database, which allows querying with SQL or dplyr commands. The database-table approach won't save you disk space, but DuckDB also allows querying partitioned/un-partitioned parquet files (which does save disk space) with SQL commands.
EDIT: Improved Method Below
I experimented a little and found a much better way to do the above operations. Using the code below, the large (compressed) csv file will be chunked automatically within the R environment (no need to use any external tool like xsv), and all chunks will be written in parquet format to a folder, ready for querying.
library(readr)
library(arrow)

fyl <- "...path_to_big_data_file.csv.gz"
pqFolder <- "...path_to_folder_where_chunked_parquet_files_are_to_be_saved"

# Callback: write each chunk as a gzip-compressed parquet file named after its position
f <- function(x, pos){
  write_parquet(x,
                file.path(pqFolder, paste0(pos, ".parquet")),
                compression = "gzip",
                compression_level = 9)
}

read_csv_chunked(
  fyl,
  col_types = list(Column1 = "f", Column2 = "c", Column3 = "T", ...), # all column specifications
  callback = SideEffectChunkCallback$new(f),
  chunk_size = 10^6)
If, instead of parquet, you want to use:
- disk.frame, the callback function may be used to create chunked, compressed fst files for dplyr- or data.table-style querying.
- DuckDB, the callback function may be used to append the chunks into a database table for SQL- or dplyr-style querying (a sketch of such a callback follows this list).
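For the DuckDB variant, a minimal sketch of an append-style callback might look like the following (the database path, table name and column specification are placeholders, not part of the original answer):
library(readr)
library(duckdb)

con <- dbConnect(duckdb::duckdb(), dbdir = "bigdata.duckdb")  # placeholder path

# Write the first chunk to create the table, then append every later chunk
f <- function(x, pos){
  x <- as.data.frame(x)
  if (pos == 1) {
    dbWriteTable(con, "big_table", x, overwrite = TRUE)
  } else {
    dbWriteTable(con, "big_table", x, append = TRUE)
  }
}

read_csv_chunked(
  "...path_to_big_data_file.csv.gz",
  col_types = cols(.default = col_character()),  # placeholder column specification
  callback = SideEffectChunkCallback$new(f),
  chunk_size = 10^6)

dbDisconnect(con, shutdown = TRUE)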
By judiciously choosing the chunk_size parameter of readr::read_csv_chunked, the computer should never run out of RAM while running these operations.
PS: I use gzip compression for the parquet files since they can then be previewed with ParquetViewer from https://github.com/mukunku/ParquetViewer. Otherwise, zstd (not currently supported by ParquetViewer) decompresses faster and hence improves reading speed.
EDIT 2:
I got a csv file which was really big for my machine: 20 GB gzipped, expanding to about 83 GB, whereas my home laptop has only 16 GB of RAM. It turns out that the read_csv_chunked method I mentioned in the earlier EDIT fails to complete: it always stops working after some time and does not create all the parquet chunks. My previous method of splitting the csv file with xsv and then looping over the chunks to create parquet files worked. To be fair, I must mention that it took multiple attempts this way too, and I had programmed a check to create only the missing parquet chunks on successive runs.
EDIT 3:
vroom does have difficulty when dealing with huge files, since it needs to store the index in memory as well as any data you read from the file. See the development thread at https://github.com/r-lib/vroom/issues/203.
EDIT 4:
Additional tip: the chunked parquet files created by the above-mentioned method may be very conveniently queried using SQL with the DuckDB method described at
https://duckdb.org/docs/data/parquet
and
https://duckdb.org/2021/06/25/querying-parquet.html
The DuckDB method is significant because the R arrow method currently suffers from a very serious limitation, which is mentioned in the official documentation page https://arrow.apache.org/docs/r/articles/dataset.html.
Specifically, and I quote: "In the current release, arrow supports the dplyr verbs mutate(), transmute(), select(), rename(), relocate(), filter(), and arrange(). Aggregation is not yet supported, so before you call summarise() or other verbs with aggregate functions, use collect() to pull the selected subset of the data into an in-memory R data frame."
The problem is that if you use collect() on a very big dataset, RAM usage spikes and the system crashes, whereas using SQL statements to do the same aggregation job on the same big dataset with DuckDB does not cause a RAM spike or a crash. So until arrow fixes aggregation queries for big data, SQL via DuckDB provides a nice solution for querying big datasets in chunked parquet format; a sketch of such a query follows.
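For illustration, an aggregation over the chunked parquet folder via DuckDB might be sketched like this (the folder path and column name are placeholders; parquet_scan is DuckDB's built-in reader for parquet files):
library(duckdb)

con <- dbConnect(duckdb::duckdb())

# DuckDB streams the parquet files while aggregating, so RAM usage stays low
res <- dbGetQuery(con, "
  SELECT Column1, COUNT(*) AS n
  FROM parquet_scan('pq_folder/*.parquet')
  GROUP BY Column1
  ORDER BY n DESC
")

dbDisconnect(con, shutdown = TRUE)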
Strategies for reading in CSV files in pieces?
You could read it into a database using RSQLite, say, and then use an SQL statement to get a portion.
If you need only a single portion, then read.csv.sql in the sqldf package will read the data into an SQLite database. First, it creates the database for you, and the data does not go through R, so limitations of R won't apply (primarily RAM in this scenario). Second, after loading the data into the database, sqldf reads the output of the specified SQL statement into R and finally destroys the database. Depending on how fast it works with your data, you might be able to just repeat the whole process for each portion if you have several.
Only one line of code accomplishes all three steps, so it's a no-brainer to just try it.
DF <- read.csv.sql("myfile.csv", sql=..., ...other args...)
See ?read.csv.sql and ?sqldf and also the sqldf home page.
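For example (a hypothetical sketch; the column name "year" is made up), selecting a single portion might look like this:
library(sqldf)

# The file is loaded into a temporary SQLite database (the table is called "file"
# by default), filtered there, and only the result is returned to R
DF <- read.csv.sql("myfile.csv",
                   sql = "select * from file where year = 2020",
                   dbname = tempfile())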
Read a 20GB file in chunks without exceeding my RAM - R
The LaF package can read in ASCII data in chunks. It can be used directly or, if you are using dplyr, the chunked package builds on it to provide an interface for use with dplyr.
The readr package has read_csv_chunked and related functions.
The section of this web page entitled The Loop as well as subsequent sections of that page describes how to do chunked reads with base R.
It may be that if you remove all but the first three columns, the data will be small enough to read in and process in one go.
vroom in the vroom package can read in files very quickly and also has the ability to read in just the columns named in the col_select= argument, which may make the data small enough to read in one go.
fread in the data.table package is a fast reading function that also supports a select= argument, which reads only the specified columns.
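A minimal sketch of column selection with both packages (the file and column names are placeholders):
library(vroom)
library(data.table)

# vroom: read only the named columns
v <- vroom("myfile.csv", col_select = c(col1, col2, col3))

# data.table: the same idea with fread's select= argument
d <- fread("myfile.csv", select = c("col1", "col2", "col3"))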
read.csv.sql in the sqldf package (also see its github page) can read a file larger than R can handle into a temporary external SQLite database, which it creates for you and removes afterwards, and then reads the result of the given SQL statement into R. If the first three columns are named col1, col2 and col3, then try the code below. See ?read.csv.sql and ?sqldf for the remaining arguments, which will depend on your file.
library(sqldf)
DF <- read.csv.sql("myfile", "select col1, col2, col3 from file",
dbname = tempfile(), ...)
read.table and read.csv in base R have a colClasses= argument which takes a vector of column classes. If the file has nc columns, then use colClasses = rep(c(NA, "NULL"), c(3, nc-3)) to read only the first 3 columns.
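For instance, assuming (hypothetically) a file with 10 columns:
# Keep the first 3 columns (NA = guess the class) and drop ("NULL") the remaining 7
nc <- 10
DF <- read.csv("myfile.csv",
               colClasses = rep(c(NA, "NULL"), c(3, nc - 3)))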
Another approach is to pre-process the file using cut, sed or awk (available natively in UNIX and in the Rtools bin directory on Windows) or any of a number of free command line utilities such as csvfix, outside of R, to remove all but the first three columns and then see if that makes it small enough to read in one go.
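A related variant keeps the pre-processing step inside R by letting fread read from a shell pipeline; this sketch assumes a comma-separated file and a Unix-like cut available on the PATH, and is not part of the original answer:
library(data.table)

# cut keeps only the first three fields before the data ever reaches R
DF <- fread(cmd = "cut -d, -f1-3 myfile.csv")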
Also check out the High Performance Computing task view.
How do I import a large (6 Gb) .csv file into R efficiently and quickly, without the R REPL crashing?
R will crash if you try to load a file that is larger than your available memory, so you should check that you have at least 6 GB of RAM free (a 6 GB .csv takes up roughly 6 GB in memory as well). Python will have the same problem (apparently someone asked the exact same question for Python a few years ago).
For reading large csv files, you should either use readr::read_csv() or data.table::fread(), as both are much faster than base::read.table().
readr::read_csv_chunked supports reading csv files in chunks, so if you don't need your whole data at once, that might help. You could also try reading just the columns of interest, to keep the memory footprint smaller.
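As a hedged sketch of that chunked approach (the file name, column names and filter condition are made up), you can keep only the rows and columns of interest from each chunk so that the full 6 GB never sits in memory at once:
library(readr)

# Reduce each chunk as it is read; readr then binds the reduced pieces together
keep <- DataFrameCallback$new(function(x, pos) {
  x[x$col1 > 0, c("col1", "col2")]
})

DF <- read_csv_chunked("bigfile.csv", callback = keep, chunk_size = 1e5)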
Combining 100 RDS/RData files into one large file - large file too big
Update: using the newer purrr::map_df() function, which combines map and bind_rows and returns a data frame:
https://purrr.tidyverse.org/reference/map.html
library(tidyverse)
my_files = list.files(pattern = "\\.rds$")  # match files ending in .rds
my_all <- map_df(my_files, read_rds)
...
The dplyr::bind_rows()
function is explicitly an efficient implementation of the common pattern of do.call(rbind, dfs)
for binding many data frames into one.
https://dplyr.tidyverse.org/reference/bind.html
library(tidyverse)
write_rds(iris, "iris1.rds") #write three sample files
write_rds(iris, "iris2.rds")
write_rds(iris, "iris3.rds")
my_files = list.files(pattern = "\\.rds$")  # match files ending in .rds
dat_list = lapply(my_files, function (x) read_rds(x)) #switched to only read_rds()
my_all <- do.call("bind_rows", dat_list) #switched to bind_rows()