Quickly reading very large tables as dataframes
An update, several years later
This answer is old, and R has moved on. Tweaking read.table
to run a bit faster has precious little benefit. Your options are:
- Using vroom from the tidyverse package for importing data from csv/tab-delimited files directly into an R tibble. See Hector's answer.
- Using fread in data.table for importing data from csv/tab-delimited files directly into R. See mnel's answer.
- Using read_table in readr (on CRAN from April 2015). This works much like fread above. The readme in the link explains the difference between the two functions (readr currently claims to be "1.5-2x slower" than data.table::fread).
- read.csv.raw from iotools provides a third option for quickly reading CSV files.
- Trying to store as much data as you can in databases rather than flat files. (As well as being a better permanent storage medium, data is passed to and from R in a binary format, which is faster.)
  - read.csv.sql in the sqldf package, as described in JD Long's answer, imports data into a temporary SQLite database and then reads it into R. See also: the RODBC package, and the reverse depends section of the DBI package page.
  - MonetDB.R gives you a data type that pretends to be a data frame but is really a MonetDB underneath, increasing performance. Import data with its monetdb.read.csv function.
  - dplyr allows you to work directly with data stored in several types of database.
- Storing data in binary formats can also be useful for improving performance. Use saveRDS/readRDS (see below), the h5 or rhdf5 packages for HDF5 format, or write_fst/read_fst from the fst package. A short sketch of these readers follows the list.
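A minimal sketch of the fastest flat-file readers plus a binary round trip, assuming the packages are installed; the file names are hypothetical:

library(data.table)
library(vroom)

dt  <- fread("big_file.csv")    # data.table: eager, multi-threaded CSV parser
tbl <- vroom("big_file.csv")    # vroom: indexes the file, reads columns lazily

# Binary formats round-trip much faster than re-parsing text
saveRDS(dt, "big_file.rds")
dt2 <- readRDS("big_file.rds")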
The original answer
There are a couple of simple things to try, whether you use read.table or scan; a combined example follows the list.

- Set nrows = the number of records in your data (nmax in scan).
- Make sure that comment.char = "" to turn off interpretation of comments.
- Explicitly define the classes of each column using colClasses in read.table.
- Setting multi.line = FALSE may also improve performance in scan.
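A hedged example combining those tweaks; the file name, separator, row count, and column classes below are hypothetical and must match your actual data:

df <- read.table("big_file.tsv", header = TRUE, sep = "\t",
                 nrows = 1000000,      # known record count, avoids re-allocation
                 comment.char = "",    # turn off comment scanning
                 colClasses = c("integer", "numeric", "character"))  # one entry per column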
If none of these things work, then use one of the profiling packages to determine which lines are slowing things down. Perhaps you can write a cut-down version of read.table based on the results.
The other alternative is filtering your data before you read it into R.
Or, if the problem is that you have to read it in regularly, then use these methods to read the data in once, then save the data frame as a binary blob with saveRDS; the next time, you can retrieve it faster with readRDS.
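A minimal caching pattern along those lines (paths hypothetical):

if (!file.exists("big_file.rds")) {
  df <- read.table("big_file.tsv", header = TRUE, sep = "\t")
  saveRDS(df, "big_file.rds")    # slow import runs once; result cached as binary
} else {
  df <- readRDS("big_file.rds")  # later runs reload the binary blob quickly
}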
Is there a faster way than fread() to read big data?
You can use select = columns to load only the relevant columns without saturating your memory. For example:
dt <- fread("./file.csv", select = c("column1", "column2", "column3"))
I used read.delim() to read a file that fread() could not load completely, so you could convert your data to .txt and use read.delim().
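A sketch of that fallback (the file name is hypothetical; read.delim() defaults to tab-separated input):

df <- read.delim("big_file.txt", stringsAsFactors = FALSE)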
However, why not open a connection to the SQL server you're pulling your data from? You can open connections to SQL servers with library(odbc) and write your query as you normally would, which lets you optimize your memory usage.
Check out this short introduction to odbc.
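A minimal connection sketch, assuming an appropriate ODBC driver is installed; every connection detail and the query below are hypothetical placeholders:

library(DBI)
library(odbc)

con <- dbConnect(odbc::odbc(),
                 Driver   = "SQL Server",   # hypothetical driver name
                 Server   = "my-server",    # hypothetical host
                 Database = "my_db",
                 UID      = "user",
                 PWD      = "password")
dt <- dbGetQuery(con, "SELECT column1, column2 FROM big_table")  # pull only what you need
dbDisconnect(con)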
How to save a large dataframe and quickly load it in R?
You can serialize it easily with:
readr::write_rds(pageInfo_df, "pageInfo_df.Rds")
and then deserialize it like so:
readr::read_rds("pageInfo_df.Rds")
This should handle any valid R object of arbitrary complexity.
Load large datasets into data frame
I recommend data.table, although you will end up with a data.table object rather than a plain data frame. If you would rather not work with a data.table, you can simply convert back to a normal data frame, as shown after the code.
require(data.table)
data <- fread("yourpathhere/yourfile")
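If you do want a plain data frame afterwards, the conversion is one line (setDF() converts in place without copying):

data <- as.data.frame(data)  # copies into a regular data.frame
setDF(data)                  # alternative: convert in place, no copy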
How to properly import CSV files with PySpark
If you can't correct the input file, you can try loading it as text and then splitting the values to get the desired columns. Here's an example:
input file
1,2,3,4,5,6,7,8,9,10,0,12,121
1,2,3,4,5,6,7,8,9,10,0,12,121
read and parse
from pyspark.sql import functions as F

nb_cols = 5  # number of leading fields to keep as separate columns

# Read every line as a single string column named "value"
df = spark.read.text("file.csv")

df = df.withColumn(
    "values",
    F.split("value", ",")  # split each raw line into an array of fields
).select(
    # the first nb_cols array elements become individual columns
    *[F.col("values")[i].alias(f"col_{i}") for i in range(nb_cols)],
    # the remaining fields are re-joined into one trailing column
    F.array_join(
        F.expr(f"slice(values, {nb_cols + 1}, size(values))"), ","
    ).alias(f"col_{nb_cols}")
)

df.show()
#+-----+-----+-----+-----+-----+-------------------+
#|col_0|col_1|col_2|col_3|col_4| col_5|
#+-----+-----+-----+-----+-----+-------------------+
#| 1| 2| 3| 4| 5|6,7,8,9,10,0,12,121|
#| 1| 2| 3| 4| 5|6,7,8,9,10,0,12,121|
#+-----+-----+-----+-----+-----+-------------------+