Reason Behind the Speed of fread in the data.table Package in R

I assume we are comparing to read.csv with all the known advice applied, such as setting colClasses, nrows, etc. read.csv(filename) without any other arguments is slow mainly because it first reads everything into memory as if it were character, and then attempts to coerce that to integer or numeric as a second step.
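
For concreteness, applying that advice might look like the call below. This is only a sketch: the file name, column types, and row count are assumptions for illustration, not values from the question.

# Hypothetical tuned baseline: tell read.csv everything up front so it does not
# have to guess column types or re-allocate as it reads.
df <- read.csv("test.csv",
               colClasses = c("integer", "numeric", "character",
                              "integer", "numeric", "numeric"),  # assumed types
               nrows = 10000000)                                 # known row count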

So, comparing fread to read.csv(filename, colClasses=, nrows=, etc) ...

They are both written in C so it's not that.

There isn't one reason in particular, but essentially, fread memory-maps the file and then iterates through it using pointers, whereas read.csv reads the file into a buffer via a connection.

If you run fread with verbose=TRUE it will tell you how it works and report the time spent in each of the steps. For example, notice that it skips straight to the middle and the end of the file to make a much better guess of the column types (although in this case the top 5 were enough).

> fread("test.csv",verbose=TRUE)
Input contains no \n. Taking this to be a filename to open
File opened, filesize is 0.486 GB
File is opened and mapped ok
Detected eol as \n only (no \r afterwards), the UNIX and Mac standard.
Using line 30 to detect sep (the last non blank line in the first 'autostart') ... sep=','
Found 6 columns
First row with 6 fields occurs on line 1 (either column names or first row of data)
All the fields on line 1 are character fields. Treating as the column names.
Count of eol after first data row: 10000001
Subtracted 1 for last eol and any trailing empty lines, leaving 10000000 data rows
Type codes ( first 5 rows): 113431
Type codes (+ middle 5 rows): 113431
Type codes (+ last 5 rows): 113431
Type codes: 113431 (after applying colClasses and integer64)
Type codes: 113431 (after applying drop or select (if supplied)
Allocating 6 column slots (6 - 0 dropped)
Read 10000000 rows and 6 (of 6) columns from 0.486 GB file in 00:00:44
13.420s ( 31%) Memory map (rerun may be quicker)
0.000s ( 0%) sep and header detection
3.210s ( 7%) Count rows (wc -l)
0.000s ( 0%) Column type detection (first, middle and last 5 rows)
1.310s ( 3%) Allocation of 10000000x6 result (xMB) in RAM
25.580s ( 59%) Reading data
0.000s ( 0%) Allocation for type bumps (if any), including gc time if triggered
0.000s ( 0%) Coercing data already read in type bumps (if any)
0.040s ( 0%) Changing na.strings to NA
43.560s Total

NB: these timings are from my very slow netbook with no SSD. Both the absolute and relative times of each step will vary widely from machine to machine. For example, if you rerun fread a second time you may notice the time to mmap is much less because your OS has cached the file from the previous run.

$ lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 2
On-line CPU(s) list: 0,1
Thread(s) per core: 1
Core(s) per socket: 2
Socket(s): 1
NUMA node(s): 1
Vendor ID: AuthenticAMD
CPU family: 20
Model: 2
Stepping: 0
CPU MHz: 800.000 # i.e. my slow netbook
BogoMIPS: 1995.01
Virtualisation: AMD-V
L1d cache: 32K
L1i cache: 32K
L2 cache: 512K
NUMA node0 CPU(s): 0,1
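
To see that caching effect for yourself, a quick check (a sketch, reusing the test.csv file from above) is to time the same read twice in one session:

# The first read pays for the memory map and the OS page-cache warm-up; the
# second read of the same file is often noticeably faster.
library(data.table)
system.time(dt1 <- fread("test.csv"))
system.time(dt2 <- fread("test.csv"))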

read.csv faster than data.table::fread

data.table::fread's significant performance advantage becomes clear if you consider larger files. Here is a fully reproducible example.

  1. Let's generate a CSV file consisting of 10^5 rows and 100 columns:

    if (!file.exists("test.csv")) {
      set.seed(2017)
      df <- as.data.frame(matrix(runif(10^5 * 100), nrow = 10^5))
      write.csv(df, "test.csv", quote = FALSE)
    }
  2. We run a microbenchmark analysis (note that this may take a couple of minutes depending on your hardware):

    library(microbenchmark)
    res <- microbenchmark(
      read.csv = read.csv("test.csv", header = TRUE, stringsAsFactors = FALSE, colClasses = "numeric"),
      fread = data.table::fread("test.csv", sep = ",", stringsAsFactors = FALSE, colClasses = "numeric"),
      times = 10)
    res
    # Unit: milliseconds
    #      expr        min         lq       mean     median         uq        max
    #  read.csv 17034.2886 17669.8653 19369.1286 18537.7057 20433.4933 23459.4308
    #     fread   287.1108   311.6304   432.8106   356.6992   460.6167   888.6531


    library(ggplot2)
    autoplot(res)

(Figure: autoplot() of the microbenchmark results above.)

Is there a faster way than fread() to read big data?

You can use the select argument to load only the relevant columns, without saturating your memory. For example:

dt <- fread("./file.csv", select = c("column1", "column2", "column3"))

I have used read.delim() to read a file that fread() could not load completely, so you could convert your data to a tab-separated .txt file and use read.delim().
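
A minimal sketch of that fallback, assuming the data has been exported as a tab-separated file called file.txt (a hypothetical name):

# read.delim defaults to tab-separated input; stringsAsFactors = FALSE keeps
# character columns as character rather than converting them to factors.
df <- read.delim("file.txt", header = TRUE, stringsAsFactors = FALSE)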

However, why not open a connection to the SQL server you're pulling your data from? You can open connections to SQL servers with library(odbc) and write your query as you normally would; that way the filtering happens on the server, and you can keep your memory usage under control.

Check out this short introduction to odbc.
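
A minimal sketch of that approach, assuming an ODBC data source named "my_dsn" and a table called "big_table" (both hypothetical; adjust to your own server and credentials):

library(DBI)
library(odbc)
library(data.table)

# Connect through an ODBC data source configured on your machine.
con <- dbConnect(odbc::odbc(), dsn = "my_dsn")

# Pull only the columns and rows you actually need, so the filtering happens
# on the server rather than in your R session's memory.
dt <- setDT(dbGetQuery(con, "SELECT column1, column2 FROM big_table WHERE column3 > 100"))

dbDisconnect(con)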

R fread data.table inconsistent speed

I've had a similar problem. Namely, the first time I ran fread it was very slow; successive runs, however, were much faster. In my case this was because I was working on a computer in my university's computer lab, so the data was not local to my machine but on a network drive. This meant that most of the time spent running fread was actually spent transferring the data across the network into my local working memory. This was corroborated by the fact that when I timed my code on the first run, the user time + system time was much smaller than the elapsed time.

When you load the data once, however, it is temporarily in your working memory, i.e. RAM. Successive calls to fread with the same data are therefore much faster.
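
If you suspect the same thing on your machine, a simple check (a sketch; the network path is hypothetical) is to time two consecutive reads and compare user + system time to elapsed time:

t1 <- system.time(dt <- data.table::fread("//server/share/big.csv"))
t2 <- system.time(dt <- data.table::fread("//server/share/big.csv"))
t1; t2
# If user + sys is much smaller than elapsed on the first run, most of the time
# is spent waiting on I/O (e.g. the network), not on parsing; the second run
# should be much faster once the data is cached locally.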

fread in data.table but too many columns

The quote rules changed as of version 1.10.6. They're now more robust and faster, but they will not handle unbalanced quotes and some other edge cases. Check the details on quotes in the current documentation of fread.

As an alternative, you can use functions based on scan(), which handle quotes inside quotes, like read.table:

read.table("example.txt", sep = ",", header = TRUE)

Or, as answered by @jared-mamrot, use vroom for better performance and convert the result to a data.table afterwards with setDT, as sketched below.
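
A minimal sketch of that route, reusing the example.txt file from above:

library(vroom)
library(data.table)

# vroom is an alternative fast reader; setDT then converts the result to a
# data.table by reference.
dt <- setDT(vroom("example.txt", delim = ","))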

Why are the results of read_csv larger than those of read.csv?

Selecting the right functions is of course very important for writing efficient code.
The degree of optimization present in different functions and packages will impact how objects are stored, their size, and the speed of operations running on them. Please consider the following.

library(data.table)
library(microbenchmark)

a <- 1:1000000
b <- rnorm(1000000)
mat <- as.matrix(cbind(a, b))
df <- data.frame(a, b)
dt <- data.table::as.data.table(mat)

cat(paste0("Matrix size: ", object.size(mat),
           "\ndf size: ", object.size(df), " (", round(object.size(df)/object.size(mat), 2), ")",
           "\ndt size: ", object.size(dt), " (", round(object.size(dt)/object.size(mat), 2), ")"))
Matrix size: 16000568
df size: 12000848 (0.75)
dt size: 4001152 (0.25)

So here already you see that data.table stores the same data in a quarter of the space of the matrix, and a third of the space of the data.frame. Now for operation speed:

> microbenchmark(df[df$a*df$b>500,], mat[mat[,1]*mat[,2]>500,], dt[a*b>500])
Unit: milliseconds
                             expr       min        lq     mean   median        uq      max neval
         df[df$a * df$b > 500, ]  23.766201 24.136201 26.49715 24.34380 30.243300  32.7245   100
 mat[mat[, 1] * mat[, 2] > 500, ] 13.010000 13.146301 17.18246 13.41555 20.105450 117.9497   100
                  dt[a * b > 500]  8.502102  8.644001 10.90873  8.72690  8.879352 112.7840   100

data.table does the filtering about 2.5 times faster than base R on a data.frame, and about 1.7 times faster than using a matrix.

And that's not all: for almost any CSV import, using data.table::fread will change your life. Give it a try instead of read.csv or read_csv.

IMHO, data.table doesn't get half the love it deserves: it is the best all-round package for performance, with a very concise syntax. The following vignettes should put you on your way quickly, and that is worth the effort, trust me.

For further performance improvements, Rfast contains many Rcpp implementations of popular functions, such as rowSort().
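
For example, a quick comparison of base R row sorting against Rfast::rowSort might look like this (a sketch; the matrix and its size are arbitrary):

library(Rfast)

m <- matrix(rnorm(1e6), ncol = 100)

s_base  <- t(apply(m, 1, sort))  # base R: sort each row, then transpose back
s_rfast <- rowSort(m)            # compiled Rfast implementation

all.equal(s_base, s_rfast)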


EDIT: fread's speed is due to optimizations done at the C level, involving memory mapping via pointers and coerce-as-you-go techniques, which frankly are beyond my knowledge to explain. This post contains some explanations by the author, Matt Dowle, as well as an interesting, if short, piece of discussion between him and the author of dplyr, Hadley Wickham.

How to speed up shuf -n dt.csv and setting column names using data.table?

Here's an attempt that does two things:

  1. Does a single read to get the column names. This is unavoidable, and the only way to know for certain that you get the actual column names (instead of trying to infer them after they have cluttered the sampled data); and

  2. Prevents the column names from being included in the actual sample, since they would turn any non-string columns into strings.

Working examples follow. My intent with these solutions, frankly, is to use the speed of shuf and fread while preserving as much data safety as possible. I take the latter to be of the utmost importance.

Option 1

Read the column names and data each time. Less efficient, but if you change datasets often and/or do not want two nearly-identical versions of the file in the directory, then this is a safe way to go.

library(data.table)
nms <- fread("mt.csv", nrows = 0)
nms
# Empty data.table (0 rows and 11 cols): mpg,cyl,disp,hp,drat,wt...

setnames(fread(cmd = "tail -n +2 mt.csv | shuf -n 3"), names(nms))[]
# mpg cyl disp hp drat wt qsec vs am gear carb
# <num> <int> <num> <int> <num> <num> <num> <int> <int> <int> <int>
# 1: 24.4 4 146.7 62 3.69 3.19 20.00 1 0 4 2
# 2: 13.3 8 350.0 245 3.73 3.84 15.41 0 0 3 4
# 3: 15.5 8 318.0 150 2.76 3.52 16.87 0 0 3 2

The tail -n +2 means to start from line 2, skipping the row of column names.

Option 2

(Similar, but avoids running tail with each read.)

You said that the data is "large". It might be that running tail -n +2 each time is more than you want to do. In that case, if you can afford the spare disk space, then

$ tail -n +2 mt.csv > mt_nocolnames.csv

and then

nms <- fread("mt.csv", nrows = 0)
setnames(fread(cmd = "shuf -n 3 mt_nocolnames.csv"), names(nms))[]

Option 3

  1. Create a shell script (I'll name it headshuf.sh), with the contents

    #!/bin/sh
    if [ "$1" = "-n" ]; then
      N=$2
      shift 2
    else
      N=10
    fi
    if [ $# -gt 0 ]; then
      head -n 1 "$1"
      tail -n +2 "$1" | shuf -n "$N"
    fi
  2. Make it executable (likely chmod +x headshuf.sh)

  3. Use it.

    fread(cmd = 'sh -c "./headshuf.sh -n 3 mt.csv"')
    # mpg cyl disp hp drat wt qsec vs am gear carb
    # <num> <int> <num> <int> <num> <num> <num> <int> <int> <int> <int>
    # 1: 17.3 8 275.8 180 3.07 3.73 17.60 0 0 3 3
    # 2: 16.4 8 275.8 180 3.07 4.07 17.40 0 0 3 3
    # 3: 18.1 6 225.0 105 2.76 3.46 20.22 1 0 3 1

    fread(cmd = 'sh -c "./headshuf.sh -n 3 mt.csv"')
    # mpg cyl disp hp drat wt qsec vs am gear carb
    # <num> <int> <int> <int> <num> <num> <num> <int> <int> <int> <int>
    # 1: 15.0 8 301 335 3.54 3.570 14.60 0 1 5 8
    # 2: 19.2 8 400 175 3.08 3.845 17.05 0 0 3 2
    # 3: 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4

Side note: while it seems innocuous, I specify cmd= to remove any chance of ambiguity.



P.S.

I think it's important to acknowledge the column names and move them out of the way. In this example all of the columns are numeric, but it only takes one numeric column for this to matter.

The normal flow of reading the data randomly (not safeguarding the column names) should produce:

str(fread(cmd = 'shuf -n 3 mt.csv'))
# Classes 'data.table' and 'data.frame': 3 obs. of 11 variables:
# $ V1 : num 18.7 15.5 14.7
# $ V2 : int 8 8 8
# $ V3 : int 360 318 440
# $ V4 : int 175 150 230
# $ V5 : num 3.15 2.76 3.23
# $ V6 : num 3.44 3.52 5.34
# $ V7 : num 17 16.9 17.4
# $ V8 : int 0 0 0
# $ V9 : int 0 0 0
# $ V10: int 3 3 3
# $ V11: int 2 2 4

However, if you run it enough times, you may easily see:

str(fread(cmd = 'shuf -n 3 mt.csv'))
# Classes 'data.table' and 'data.frame': 3 obs. of 11 variables:
# $ V1 : chr "19.7" "15.2" "mpg"
# $ V2 : chr "6" "8" "cyl"
# $ V3 : chr "145" "304" "disp"
# $ V4 : chr "175" "150" "hp"
# $ V5 : chr "3.62" "3.15" "drat"
# $ V6 : chr "2.77" "3.435" "wt"
# $ V7 : chr "15.5" "17.3" "qsec"
# $ V8 : chr "0" "0" "vs"
# $ V9 : chr "1" "0" "am"
# $ V10: chr "5" "3" "gear"
# $ V11: chr "6" "2" "carb"

It's apparent here that the row of column names has worked its way into the random sample, converting all numbers into strings. There are two ways to mitigate this problem:

  1. Try to detect it. Since you don't know the column names a priori, the only way you can really "know" is if you know that at least one of the columns must be numeric, in which case all(sapply(dat, is.character)) should be false (see the sketch after this list). If your data is naturally all text, then ... there is no way to determine if you accidentally have column names as data.

  2. Okay, only one way. fread(..., colClasses="numeric") works only as long as it is correct; once there is a problem, it'll complain with

    Warning in fread(cmd = "shuf -n 3 mt.csv", colClasses = "numeric") :
    Attempt to override column 1 of inherent type 'string' down to 'float64' ignored. Only overrides to a higher type are currently supported. If this was intended, please coerce to the lower type afterwards.

    and load it all as character anyway.
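
A minimal sketch of that detection idea (the object name dat is hypothetical):

dat <- fread(cmd = "shuf -n 3 mt.csv")
# If every column came back as character and we expected at least one numeric
# column, the header row has probably been sampled as data; resample.
if (all(sapply(dat, is.character))) {
  warning("all columns are character -- the header row may be in the sample")
}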


