Fastest way to read large Excel xlsx files? To parallelize or not?
You could try to run it in parallel using the parallel
package, but it is a bit hard to estimate how fast it will be without sample data:
library(parallel)
library(readxl)
excel_path <- ""
sheets <- excel_sheets(excel_path)
Make a cluster with a specified number of cores:
cl <- makeCluster(detectCores() - 1)
Use parLapplyLB
to go through the excel sheets and read them in parallel using load balancing:
parLapplyLB(cl, sheets, function(sheet, excel_path) {
readxl::read_excel(excel_path, sheet = sheet)
}, excel_path)
You can use the package microbenchmark
to test how fast certain options are:
library(microbenchmark)
microbenchmark(
lapply = {lapply(sheets, function(sheet) {
read_excel(excel_path, sheet = sheet)
})},
parralel = {parLapplyLB(cl, sheets, function(sheet, excel_path) {
readxl::read_excel(excel_path, sheet = sheet)
}, excel_path)},
times = 10
)
In my case, the parallel version is faster:
Unit: milliseconds
expr min lq mean median uq max neval
lapply 133.44857 167.61801 179.0888 179.84616 194.35048 226.6890 10
parralel 58.94018 64.96452 118.5969 71.42688 80.48588 316.9914 10
The test file contains of 6 sheets, each containing this table:
test test1 test3 test4 test5
1 1 1 1 1 1
2 2 2 2 2 2
3 3 3 3 3 3
4 4 4 4 4 4
5 5 5 5 5 5
6 6 6 6 6 6
7 7 7 7 7 7
8 8 8 8 8 8
9 9 9 9 9 9
10 10 10 10 10 10
11 11 11 11 11 11
12 12 12 12 12 12
13 13 13 13 13 13
14 14 14 14 14 14
15 15 15 15 15 15
Note:
you can use stopCluster(cl)
to shut down the workers when the process is finished.
Fast way to read xlsx files into R
Here is a small benchmark test. Results: readxl::read_xlsx
on average about twice as fast as openxlsx::read.xlsx
across different number of rows (n
) and columns (p
) using standard settings.
options(scipen=999) # no scientific number format
nn <- c(1, 10, 100, 1000, 5000, 10000, 20000, 30000)
pp <- c(1, 5, 10, 20, 30, 40, 50)
# create some excel files
l <- list() # save results
tmp_dir <- tempdir()
for (n in nn) {
for (p in pp) {
name <-
cat("\n\tn:", n, "p:", p)
flush.console()
m <- matrix(rnorm(n*p), n, p)
file <- paste0(tmp_dir, "/n", n, "_p", p, ".xlsx")
# write
write.xlsx(m, file)
# read
elapsed <- system.time( x <- openxlsx::read.xlsx(file) )["elapsed"]
df <- data.frame(fun = "openxlsx::read.xlsx", n = n, p = p,
elapsed = elapsed, stringsAsFactors = F, row.names = NULL)
l <- append(l, list(df))
elapsed <- system.time( x <- readxl::read_xlsx(file) )["elapsed"]
df <- data.frame(fun = "readxl::read_xlsx", n = n, p = p,
elapsed = elapsed, stringsAsFactors = F, row.names = NULL)
l <- append(l, list(df))
}
}
# results
d <- do.call(rbind, l)
library(ggplot2)
ggplot(d, aes(n, elapsed, color= fun)) +
geom_line() + geom_point() +
facet_wrap( ~ paste("columns:", p)) +
xlab("Number of rows") +
ylab("Seconds")
How to read excel workbooks with varying sheet names in parallel?
parLapplyLB
as parallelized variant of lapply
essentially loops over one list. What you want is a multivariate variant which allows you to loop over the 1st, 2nd, 3rd, etc. element in both of your two lists, where the single-threaded version would be Map
(with a big M). The parallelized version mcMap
lives in the parallel
package you are using as well.
We may use parallel::mcMap
.
library(parallel)
r <- mcMap(openxlsx::read.xlsx, file.list, sheet_in_file)
r
# r$`./foo/wb1.xlsx`
# X1 X2 X3
# 1 1 1 1
# 2 1 1 1
#
# $`./foo/wb2.xlsx`
# X1 X2 X3
# 1 2 2 2
# 2 2 2 2
#
# $`./foo/wb3.xlsx`
# X1 X2 X3
# 1 3 3 3
# 2 3 3 3
#
# $`./foo/wb4.xlsx`
# X1 X2 X3
# 1 4 4 4
# 2 4 4 4
(Should work similarly with readxl::read_excel
, don't have it installed on this machine.)
Also see @HenrikB's comment below for a solution using a different package.
Data:
dir.create('foo') ## create dir
## create four lists of four dataframes, write to dir as .xlsx
lapply(1:4, \(j) openxlsx::write.xlsx(lapply(1:4, \(i) data.frame(matrix(i, 2, 3))) |> setNames(LETTERS[1:4]),
file=sprintf('./foo/wb%s.xlsx', j), overwrite=TRUE)) |>invisible()
file.list <- list.files(path = "./foo", pattern="*.xlsx", full.names=TRUE)
sheet_in_file <- c('A', 'B', 'C', 'D')
Reading multiple xlsx files each with multiple sheets - purrr
You can use nested map_df
calls to replace the for loop. As far as I know map2
can only operate on two lists of length n
and return a list of length n
, I don't think it's a way to generate a length n * m
list from two lists of length n
and m
.
files <- list.files(pattern = ".xlsx")
data_xlsx_df <- map_df(set_names(files), function(file) {
file %>%
excel_sheets() %>%
set_names() %>%
map_df(
~ read_xlsx(path = file, sheet = .x, range = "H3"),
.id = "sheet")
}, .id = "file")
Related Topics
Plot Curved Lines Between Two Locations in Ggplot2
How to Put a Complicated Equation into a R Formula
Independently Move 2 Legends Ggplot2 on a Map
Pass String as Name of Attached Data Column Name
Scale Back Linear Regression Coefficients in R from Scaled and Centered Data
R Doesn't Reset the Seed When "L'Ecuyer-Cmrg" Rng Is Used
Caret: There Were Missing Values in Resampled Performance Measures
How to Search for a String in One Column in Other Columns of a Data Frame
How to Set the Latex Path for Sweave in R
How to Turn the Filename into a Variable When Reading Multiple CSVS into R
Edit Individual Ggplots in Ggally::Ggpairs: How to Have the Density Plot Not Filled in Ggpairs
Running an R Script Using a Windows Shortcut
Split Column in Data.Table to Multiple Rows
How to Read the Files in a Directory in Sorted Order Using R
How to Control Label Color Depending on Fill Darkness of Bars