Fast Reading and Combining Several Files Using data.table (with fread)

Use rbindlist(), which is designed to rbind a list of data.tables together:

mylist <- lapply(all.files, readdata)
mydata <- rbindlist( mylist )

And as @Roland says, do not set the key in each iteration of your function!

So, in summary, this is best:

l <- lapply(all.files, fread, sep=",")
dt <- rbindlist( l )
setkey( dt , ID, date )
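
For anyone who wants to try the pattern end to end, here is a minimal sketch. The temporary directory, the file names part1.csv and part2.csv, and the value column are assumptions made up for the demo; only the last three steps (lapply with fread, rbindlist, a single setkey) come from the answer above.

```r
library(data.table)

# Hypothetical demo data: two small CSV files in a temporary directory.
tmp_dir <- file.path(tempdir(), "fread_demo")
dir.create(tmp_dir, showWarnings = FALSE)
fwrite(data.table(ID = c(2L, 1L),
                  date = c("2020-01-02", "2020-01-01"),
                  value = c(0.5, 1.5)),
       file.path(tmp_dir, "part1.csv"))
fwrite(data.table(ID = c(3L, 1L),
                  date = c("2020-01-03", "2020-01-05"),
                  value = c(2.5, 3.5)),
       file.path(tmp_dir, "part2.csv"))

# full.names = TRUE so that fread receives usable paths.
all.files <- list.files(tmp_dir, pattern = "\\.csv$", full.names = TRUE)

# Read every file, bind once, and set the key once on the combined table.
l <- lapply(all.files, fread, sep = ",")
dt <- rbindlist(l)
setkey(dt, ID, date)
print(dt)
```

Setting the key once at the end sorts the combined table a single time, rather than sorting every piece separately before binding.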

How to read multiple files at once using 'fread' in R

First, you need to list all files that you want to read. Then, you could use a loop to capture the data in a list like so:

# pattern is a regular expression, so the dot is escaped to match a literal "."
filelist <- list.files(pattern = "\\.snplist")
datalist <- list()
for (i in seq_along(filelist)) {
  datalist[[i]] <- fread(filelist[i])
}

Note we use seq_along instead of 1:length(filelist) to avoid errors in case filelist is empty (length 0).
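
A small illustration of the difference, using an empty character vector to stand in for a directory with no matching files (the counter names n_bad and n_ok are only for this demo):

```r
filelist <- character(0)  # what list.files() returns when nothing matches

# 1:length(filelist) is 1:0, i.e. the two values 1 and 0,
# so the loop body runs twice even though there is nothing to read.
n_bad <- 0L
for (i in 1:length(filelist)) n_bad <- n_bad + 1L

# seq_along(filelist) is integer(0), so the loop body never runs.
n_ok <- 0L
for (i in seq_along(filelist)) n_ok <- n_ok + 1L

print(c(n_bad = n_bad, n_ok = n_ok))
```

In the real loop, the first of those two spurious iterations would pass filelist[1], which is NA for an empty vector, to fread.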

Quick Read and Merge with data.table's fread and rbindlist

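You could do datatablelist = lapply(list.files("my/data/directory/", full.names = TRUE), fread) and then combine the resulting list of data.tables with rbindlist(). Without full.names = TRUE, list.files() returns bare file names, which fread can only find if that directory is the working directory.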

Although lapply is cleaner than an explicit loop, your loop will work if you read the files directly into a list.

datatablelist = list()

# seq_along() is safe even if datafiles is empty; 1:length() is not
for (i in seq_along(datafiles)) {
  datatablelist[[datafiles[i]]] = fread(datafiles[i])
}
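
Naming the list elements after the files, as the loop does, pays off when binding: rbindlist() accepts an idcol argument and, for a named list, fills that column with the element names. The sketch below assumes two made-up files (a.csv and b.csv) in a temporary directory and a hypothetical column name source_file.

```r
library(data.table)

# Hypothetical demo files, written to a temporary directory.
tmp_dir <- file.path(tempdir(), "fread_idcol_demo")
dir.create(tmp_dir, showWarnings = FALSE)
fwrite(data.table(x = 1:2, y = c("a", "b")), file.path(tmp_dir, "a.csv"))
fwrite(data.table(x = 3:5, y = c("c", "d", "e")), file.path(tmp_dir, "b.csv"))

datafiles <- list.files(tmp_dir, pattern = "\\.csv$", full.names = TRUE)

# Read each file and name the list elements after the files.
datatablelist <- lapply(datafiles, fread)
names(datatablelist) <- basename(datafiles)

# idcol adds a column recording which list element each row came from.
combined <- rbindlist(datatablelist, idcol = "source_file")
print(combined)
```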

read.csv faster than data.table::fread

data.table::fread's significant performance advantage becomes clear once you consider larger files. Here is a fully reproducible example.

  1. Let's generate a CSV file consisting of 10^5 rows and 100 columns

    if (!file.exists("test.csv")) {
      set.seed(2017)
      df <- as.data.frame(matrix(runif(10^5 * 100), nrow = 10^5))
      write.csv(df, "test.csv", quote = FALSE)
    }
  2. We run a microbenchmark analysis (note that this may take a couple of minutes depending on your hardware)

    library(microbenchmark)
    res <- microbenchmark(
      read.csv = read.csv("test.csv", header = TRUE, stringsAsFactors = FALSE, colClasses = "numeric"),
      fread = data.table::fread("test.csv", sep = ",", stringsAsFactors = FALSE, colClasses = "numeric"),
      times = 10
    )
    res
    # Unit: milliseconds
    #      expr        min         lq       mean     median         uq        max
    #  read.csv 17034.2886 17669.8653 19369.1286 18537.7057 20433.4933 23459.4308
    #     fread   287.1108   311.6304   432.8106   356.6992   460.6167   888.6531

    library(ggplot2)
    autoplot(res)
