Fast Reading and Combining Several Files Using data.table (With fread)

Fast reading and combining several files using data.table (with fread)

Use rbindlist(), which is designed to rbind a list of data.tables together:

mylist <- lapply(all.files, readdata)
mydata <- rbindlist( mylist )

And as @Roland says, do not set the key in each iteration of your function!

So, in summary, this is best:

l <- lapply(all.files, fread, sep=",")
dt <- rbindlist( l )
setkey( dt , ID, date )
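
To try the read, bind, then key-once pattern end to end, here is a minimal sketch. The two CSV files, their columns (ID, date, x) and the temporary directory are made up for illustration; only fread(), rbindlist() and setkey() come from the answer above.

```r
library(data.table)

# Hypothetical demo data: two small CSV files in a fresh temporary directory
demo_dir <- tempfile("fread_demo")
dir.create(demo_dir)
fwrite(data.table(ID = c(2L, 1L), date = c("2020-01-02", "2020-01-01"), x = c(0.5, 1.5)),
       file.path(demo_dir, "a.csv"))
fwrite(data.table(ID = c(4L, 3L), date = c("2020-01-04", "2020-01-03"), x = c(2.5, 3.5)),
       file.path(demo_dir, "b.csv"))

all.files <- list.files(demo_dir, pattern = "\\.csv$", full.names = TRUE)

l  <- lapply(all.files, fread, sep = ",")  # read every file, no key set yet
dt <- rbindlist(l)                         # bind the whole list in one call
setkey(dt, ID, date)                       # key (and sort) once, on the combined table

key(dt)
```

Setting the key sorts the table, so doing it once on the combined result avoids repeating that work for every file.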

How to read multiple files at once using 'fread' in R

First, you need to list all files that you want to read. Then, you could use a loop to capture the data in a list like so:

# pattern is a regular expression, so the dot is escaped to match a literal "."
filelist <- list.files(pattern = "\\.snplist")
datalist <- list()
for(i in seq_along(filelist)) {
  datalist[[i]] <- fread(filelist[i])
}

Note we use seq_along instead of 1:length(filelist) to avoid errors in case filelist is empty (length 0).
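
The difference is easy to see on an empty vector. This short sketch uses a made-up empty filelist and two counters (n_bad, n_good) that exist only for the demonstration:

```r
filelist <- character(0)   # what list.files() returns when nothing matches

1:length(filelist)         # 1:0 counts down, giving c(1, 0)
seq_along(filelist)        # integer(0)

# Count how many times each loop body runs
n_bad <- 0L
for (i in 1:length(filelist)) n_bad <- n_bad + 1L

n_good <- 0L
for (i in seq_along(filelist)) n_good <- n_good + 1L

c(n_bad = n_bad, n_good = n_good)
```

With 1:length(filelist) the body runs twice (for i = 1 and i = 0), so a call such as fread(filelist[1]) would be attempted on NA; with seq_along() the body is skipped entirely.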

Quick Read and Merge with data.table's fread and rbindlist

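You could do datatablelist = lapply(list.files("my/data/directory/", full.names = TRUE), fread) and then combine the resulting list of data.tables with rbindlist(). Note full.names = TRUE: without it, list.files() returns bare file names, and fread will only find them if the working directory is the data directory.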

Although lapply is cleaner than an explicit loop, your loop will work if you read the files directly into a list.

datatablelist = list()

for(i in seq_along(datafiles)){
  datatablelist[[datafiles[i]]] = fread(datafiles[i])
}
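
Because the loop names each list element after its file, those names can be carried into the combined table. The following is a sketch under assumptions: the two demo files and the column name "source" are invented here, and it relies on rbindlist()'s idcol argument to record the list names.

```r
library(data.table)

# Hypothetical demo files in a fresh temporary directory
demo_dir <- tempfile("named_demo")
dir.create(demo_dir)
fwrite(data.table(v = 1:2), file.path(demo_dir, "first.csv"))
fwrite(data.table(v = 3:5), file.path(demo_dir, "second.csv"))

datafiles <- list.files(demo_dir, pattern = "\\.csv$", full.names = TRUE)

# lapply() keeps the names of its input, so each element is named after its file
datatablelist <- lapply(setNames(datafiles, basename(datafiles)), fread)

# idcol adds a column holding the list name each row came from
combined <- rbindlist(datatablelist, idcol = "source")
combined[, .N, by = source]
```

This gives the same named list as the explicit loop, with the added benefit that every row can be traced back to its file after binding.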

read.csv faster than data.table::fread

data.table::fread's significant performance advantage becomes clear when you consider larger files. Here is a fully reproducible example.

Let's generate a CSV file consisting of 10^5 rows and 100 columns:

if (!file.exists("test.csv")) {
    set.seed(2017)
    df <- as.data.frame(matrix(runif(10^5 * 100), nrow = 10^5))
    write.csv(df, "test.csv", quote = FALSE)
}

We run a microbenchmark analysis (note that this may take a couple of minutes, depending on your hardware):

library(microbenchmark)
res <- microbenchmark(
    read.csv = read.csv("test.csv", header = TRUE, stringsAsFactors = FALSE, colClasses = "numeric"),
    fread = data.table::fread("test.csv", sep = ",", stringsAsFactors = FALSE, colClasses = "numeric"),
    times = 10)
res
#          Unit: milliseconds
#     expr        min         lq       mean     median         uq        max
# read.csv 17034.2886 17669.8653 19369.1286 18537.7057 20433.4933 23459.4308
#    fread   287.1108   311.6304   432.8106   356.6992   460.6167   888.6531

library(ggplot2)
autoplot(res)
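
Before comparing speed, it can be worth confirming that both readers return the same data. This is a small sanity-check sketch on a made-up 5 x 4 file (not the benchmark file above); it uses fread's data.table = FALSE argument so that both results are plain data frames.

```r
library(data.table)

# Hypothetical small file, written without row names
tmp <- tempfile(fileext = ".csv")
set.seed(1)
small <- as.data.frame(matrix(runif(20), nrow = 5))
write.csv(small, tmp, row.names = FALSE)

a <- read.csv(tmp)                     # base R reader
b <- fread(tmp, data.table = FALSE)    # fread, returned as a data.frame

same <- isTRUE(all.equal(a, b))        # compare names, types and values
same
```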

