How to deal with a 50GB CSV file in R?
You can use R with SQLite behind the curtains with the sqldf package. You'd use the read.csv.sql function in the sqldf package, and then you can query the data however you want to obtain the smaller data frame.
The example from the docs:
library(sqldf)
# The query runs inside a temporary SQLite database, so only the
# matching rows ever reach R as a data frame.
iris2 <- read.csv.sql("iris.csv",
                      sql = "select * from file where Species = 'setosa' ")
I've used this library on VERY large CSV files with good results.
How to read only lines that fulfil a condition from a CSV into R?
You could use the read.csv.sql function in the sqldf package and filter using an SQL select. From the help page of read.csv.sql:
library(sqldf)
# Write the built-in iris data out as a CSV to have a file to demonstrate on.
write.csv(iris, "iris.csv", quote = FALSE, row.names = FALSE)
# Backquotes let SQLite handle the dot in the column name Sepal.Length.
iris2 <- read.csv.sql("iris.csv",
                      sql = "select * from file where `Sepal.Length` > 5", eol = "\n")
Quickly reading very large tables as dataframes
An update, several years later
This answer is old, and R has moved on. Tweaking read.table to run a bit faster has precious little benefit. Your options are:

- Using vroom from the tidyverse package vroom for importing data from csv/tab-delimited files directly into an R tibble. See Hector's answer.
- Using fread in data.table for importing data from csv/tab-delimited files directly into R. See mnel's answer, and the sketch after this list.
- Using read_table in readr (on CRAN from April 2015). This works much like fread above. The readme in the link explains the difference between the two functions (readr currently claims to be "1.5-2x slower" than data.table::fread).
- read.csv.raw from iotools provides a third option for quickly reading CSV files.
- Trying to store as much data as you can in databases rather than flat files. (As well as being a better permanent storage medium, data is passed to and from R in a binary format, which is faster.) read.csv.sql in the sqldf package, as described in JD Long's answer, imports data into a temporary SQLite database and then reads it into R. See also: the RODBC package, and the reverse-depends section of the DBI package page. MonetDB.R gives you a data type that pretends to be a data frame but is really a MonetDB underneath, increasing performance. Import data with its monetdb.read.csv function. dplyr allows you to work directly with data stored in several types of database.
- Storing data in binary formats can also be useful for improving performance. Use saveRDS/readRDS (see below), the h5 or rhdf5 packages for HDF5 format, or write_fst/read_fst from the fst package.
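A minimal sketch of the two fastest flat-file readers above, assuming a placeholder file named big.csv; both calls exist as written, though real use usually tunes more arguments:

library(data.table)
# fread samples the file to guess the separator and column classes,
# then reads using multiple threads.
dt <- fread("big.csv")

library(vroom)
# vroom indexes the file and parses columns lazily, on first access.
tbl <- vroom("big.csv")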
The original answer
There are a couple of simple things to try, whether you use read.table or scan (a short example follows the list).

- Set nrows = the number of records in your data (nmax in scan).
- Make sure that comment.char="" to turn off interpretation of comments.
- Explicitly define the classes of each column using colClasses in read.table.
- Setting multi.line=FALSE may also improve performance in scan.
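As a minimal sketch, assuming big.csv is comma-separated with a million rows and three columns of known types (the file name, row count, and classes are all placeholders for your own):

df <- read.table("big.csv", header = TRUE, sep = ",",
                 nrows = 1e6,           # known record count avoids re-allocation
                 comment.char = "",     # turn off comment interpretation
                 colClasses = c("numeric", "numeric", "character"))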
If none of these things work, then use one of the profiling packages to determine which lines are slowing things down. Perhaps you can write a cut-down version of read.table based on the results.
The other alternative is filtering your data before you read it into R.
Or, if the problem is that you have to read it in regularly, then use these methods to read the data in once, then save the data frame as a binary blob with saveRDS; next time you can retrieve it faster with readRDS.
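For example, with the same placeholder file:

# Pay the slow read once, then keep a binary snapshot on disk.
df <- read.table("big.csv", header = TRUE, sep = ",")
saveRDS(df, "big.rds")

# Later sessions restore the data frame far faster than re-parsing the CSV.
df <- readRDS("big.rds")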
Saving a flat file as an SQL database in R without loading it 100% into RAM
Apparently there is already a function for that:
https://raw.githubusercontent.com/inbo/inborutils/master/R/csv_to_sqlite.R
I am testing it. I do not see a progress bar even when the corresponding option is selected, but it appears to get the job done.
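A usage sketch based on the source at that link; the argument names shown here (csv_file, sqlite_file, table_name) are taken from that file but may differ between versions, so check before relying on them:

library(inborutils)  # or source() the linked file directly
# Streams the CSV into an on-disk SQLite table in chunks, so the
# whole file never needs to fit in RAM at once.
csv_to_sqlite(csv_file = "big.csv",
              sqlite_file = "big.sqlite",
              table_name = "big")

The resulting big.sqlite can then be queried through DBI/RSQLite or dplyr, pulling back only the rows you need.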
Buying RAM to avoid chunking for 30-50GB plus files
I think you have quite a few things that can be optimized (note: this advice is for pandas in Python):

- First of all, read only those columns that you really need instead of reading them all and then dropping them: use the usecols=list_of_needed_columns parameter.
- Increase your chunksize. Try different values; I would start with 10**5.
- Don't use chunk.apply(...) for converting your datetimes; it's very slow. Use pd.to_datetime(column, format='...') instead.
- You can filter your data a bit more efficiently by combining multiple conditions in one expression instead of applying them step by step.