Fastest Way to Read Large Excel Xlsx Files? To Parallelize or Not

Fastest way to read large Excel xlsx files? To parallelize or not?

You could try to run it in parallel using the parallel package, but it is a bit hard to estimate how fast it will be without sample data:

library(parallel)
library(readxl)

excel_path <- ""  # path to your .xlsx file
sheets <- excel_sheets(excel_path)

Make a cluster with a specified number of cores:

cl <- makeCluster(detectCores() - 1)

Use parLapplyLB to go through the Excel sheets and read them in parallel, using load balancing:

parLapplyLB(cl, sheets, function(sheet, excel_path) {
  readxl::read_excel(excel_path, sheet = sheet)
}, excel_path)
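
parLapplyLB returns a plain, unnamed list in sheet order. As a small follow-up of my own (the names sheet_data and all_data are not from the original answer), you could label the results by sheet and, if every sheet has the same columns, stack them:

# follow-up sketch (not in the original answer): keep the results,
# name them by sheet, and optionally stack them into one data frame
sheet_data <- parLapplyLB(cl, sheets, function(sheet, excel_path) {
  readxl::read_excel(excel_path, sheet = sheet)
}, excel_path)
names(sheet_data) <- sheets
all_data <- do.call(rbind, sheet_data)  # only sensible if all sheets share the same columns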

You can use the microbenchmark package to test how fast the two options are:

library(microbenchmark)

microbenchmark(
  lapply = {lapply(sheets, function(sheet) {
    read_excel(excel_path, sheet = sheet)
  })},
  parallel = {parLapplyLB(cl, sheets, function(sheet, excel_path) {
    readxl::read_excel(excel_path, sheet = sheet)
  }, excel_path)},
  times = 10
)

In my case, the parallel version is faster:

Unit: milliseconds
     expr       min        lq     mean    median        uq      max neval
   lapply 133.44857 167.61801 179.0888 179.84616 194.35048 226.6890    10
 parallel  58.94018  64.96452 118.5969  71.42688  80.48588 316.9914    10

The test file contains 6 sheets, each containing this table:

   test test1 test3 test4 test5
1     1     1     1     1     1
2     2     2     2     2     2
3     3     3     3     3     3
4     4     4     4     4     4
5     5     5     5     5     5
6     6     6     6     6     6
7     7     7     7     7     7
8     8     8     8     8     8
9     9     9     9     9     9
10   10    10    10    10    10
11   11    11    11    11    11
12   12    12    12    12    12
13   13    13    13    13    13
14   14    14    14    14    14
15   15    15    15    15    15

Note: use stopCluster(cl) to shut down the workers once you are finished.
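
For instance, the whole pattern could be wrapped in a small helper (the function name read_xlsx_parallel is my own, not from the answer) so the cluster is always shut down, even if reading fails:

# hypothetical helper (not from the original answer): read all sheets of one
# workbook in parallel and always shut the cluster down afterwards
read_xlsx_parallel <- function(excel_path, n_cores = parallel::detectCores() - 1) {
  sheets <- readxl::excel_sheets(excel_path)
  cl <- parallel::makeCluster(n_cores)
  on.exit(parallel::stopCluster(cl), add = TRUE)  # runs even if read_excel() errors
  res <- parallel::parLapplyLB(cl, sheets, function(sheet, excel_path) {
    readxl::read_excel(excel_path, sheet = sheet)
  }, excel_path)
  setNames(res, sheets)
}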

Fast way to read xlsx files into R

Here is a small benchmark test. Result: readxl::read_xlsx is, on average, about twice as fast as openxlsx::read.xlsx across different numbers of rows (n) and columns (p) using standard settings.

[Plot: elapsed read time in seconds vs. number of rows, one panel per number of columns, comparing openxlsx::read.xlsx and readxl::read_xlsx]

library(openxlsx)      # provides write.xlsx()

options(scipen = 999)  # no scientific number format

nn <- c(1, 10, 100, 1000, 5000, 10000, 20000, 30000)
pp <- c(1, 5, 10, 20, 30, 40, 50)

# create some excel files
l <- list()  # save results
tmp_dir <- tempdir()

for (n in nn) {
  for (p in pp) {
    cat("\n\tn:", n, "p:", p)
    flush.console()
    m <- matrix(rnorm(n * p), n, p)
    file <- paste0(tmp_dir, "/n", n, "_p", p, ".xlsx")

    # write
    write.xlsx(m, file)

    # read
    elapsed <- system.time( x <- openxlsx::read.xlsx(file) )["elapsed"]
    df <- data.frame(fun = "openxlsx::read.xlsx", n = n, p = p,
                     elapsed = elapsed, stringsAsFactors = FALSE, row.names = NULL)
    l <- append(l, list(df))

    elapsed <- system.time( x <- readxl::read_xlsx(file) )["elapsed"]
    df <- data.frame(fun = "readxl::read_xlsx", n = n, p = p,
                     elapsed = elapsed, stringsAsFactors = FALSE, row.names = NULL)
    l <- append(l, list(df))
  }
}

# results
d <- do.call(rbind, l)

library(ggplot2)

ggplot(d, aes(n, elapsed, color = fun)) +
  geom_line() + geom_point() +
  facet_wrap(~ paste("columns:", p)) +
  xlab("Number of rows") +
  ylab("Seconds")
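
To put a rough number on the "about twice as fast" claim, one could also average the elapsed times per reader over all generated files (a small follow-up of my own, not part of the original benchmark):

# follow-up sketch: mean elapsed time per reader across all n/p combinations
aggregate(elapsed ~ fun, data = d, FUN = mean)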

How to read excel workbooks with varying sheet names in parallel?

parLapplyLB, as a parallelized variant of lapply, essentially loops over a single list. What you want is a multivariate variant that walks through the 1st, 2nd, 3rd, etc. elements of both of your lists at the same time; the single-threaded version of that is Map (with a capital M). Its parallelized counterpart, mcMap, lives in the parallel package you are already using.

We may use parallel::mcMap. (Note that mcMap is fork-based, so it will not run in parallel on Windows.)

library(parallel)
r <- mcMap(openxlsx::read.xlsx, file.list, sheet_in_file)
r
# $`./foo/wb1.xlsx`
#   X1 X2 X3
# 1  1  1  1
# 2  1  1  1
#
# $`./foo/wb2.xlsx`
#   X1 X2 X3
# 1  2  2  2
# 2  2  2  2
#
# $`./foo/wb3.xlsx`
#   X1 X2 X3
# 1  3  3  3
# 2  3  3  3
#
# $`./foo/wb4.xlsx`
#   X1 X2 X3
# 1  4  4  4
# 2  4  4  4

(Should work similarly with readxl::read_excel; I don't have it installed on this machine.)
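
A rough, untested sketch of that readxl variant (assuming readxl is installed) might look like this:

# untested sketch: the same multivariate mapping, but with readxl instead of openxlsx
r <- mcMap(function(path, sheet) readxl::read_excel(path, sheet = sheet),
           file.list, sheet_in_file)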

See also @HenrikB's comment on the original answer for a solution using a different package.


Data:

dir.create('foo')  ## create dir
## create four lists of four dataframes, write to dir as .xlsx
lapply(1:4, \(j) openxlsx::write.xlsx(
  lapply(1:4, \(i) data.frame(matrix(i, 2, 3))) |> setNames(LETTERS[1:4]),
  file = sprintf('./foo/wb%s.xlsx', j), overwrite = TRUE
)) |> invisible()

file.list <- list.files(path = "./foo", pattern="*.xlsx", full.names=TRUE)

sheet_in_file <- c('A', 'B', 'C', 'D')

Reading multiple xlsx files each with multiple sheets - purrr

You can use nested map_df calls to replace the for loop. As far as I know, map2 can only operate on two lists of length n and return a list of length n; I don't think there is a way to generate a list of length n * m from two lists of length n and m.

library(purrr)
library(readxl)
library(dplyr)  # map_df() row-binds its results via dplyr

files <- list.files(pattern = ".xlsx")

data_xlsx_df <- map_df(set_names(files), function(file) {
  file %>%
    excel_sheets() %>%
    set_names() %>%
    map_df(
      ~ read_xlsx(path = file, sheet = .x, range = "H3"),
      .id = "sheet")
}, .id = "file")
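
The result is a single data frame whose .id columns file and sheet record which workbook and which sheet each row (here, the cell read from range "H3") came from.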

