Importing a Big Xlsx File into R

Read / Import specific rows from large Excel files in R

Let's say a is our excel file, as data frame.

library(readxl)
a <- as.data.frame(read_excel("Pattern/File.xlsx",sheet = "Results"))

For instance, we want to select columns 1 to 3, so use

subset(a[,1:3],is.na(a[1])==FALSE)

By this function, you are subsetting the input data frame with values different than NA in first column.

Output:

  ...1 name   ID
1    1  Dan us1d
4   13  Nev sa2e
6   34  Sam il5a

Note first column name (" ...1 "). This is autogenerated by read_excel() function, but should not be a problem.

Reading a big xslx file in r

You can increase the memory available to java in the options (in R). Typically like this:

options(java.parameters = "-Xmx1000m")
data <- read.xlsx(filepath)

Fastest way to read large Excel xlsx files? To parallelize or not?

You could try to run it in parallel using the parallel package, but it is a bit hard to estimate how fast it will be without sample data:

library(parallel)
library(readxl)

excel_path <- ""
sheets <- excel_sheets(excel_path)

Make a cluster with a specified number of cores:

cl <- makeCluster(detectCores() - 1)

Use parLapplyLB to go through the excel sheets and read them in parallel using load balancing:

parLapplyLB(cl, sheets, function(sheet, excel_path) {
  readxl::read_excel(excel_path, sheet = sheet)
}, excel_path)

You can use the package microbenchmark to test how fast certain options are:

library(microbenchmark)

microbenchmark(
  lapply = {lapply(sheets, function(sheet) {
    read_excel(excel_path, sheet = sheet)
  })},
  parralel = {parLapplyLB(cl, sheets, function(sheet, excel_path) {
    readxl::read_excel(excel_path, sheet = sheet)
  }, excel_path)},
  times = 10
)

In my case, the parallel version is faster:

Unit: milliseconds
     expr       min        lq     mean    median        uq      max neval
   lapply 133.44857 167.61801 179.0888 179.84616 194.35048 226.6890    10
 parralel  58.94018  64.96452 118.5969  71.42688  80.48588 316.9914    10

The test file contains of 6 sheets, each containing this table:

    test test1 test3 test4 test5
 1     1     1     1     1     1
 2     2     2     2     2     2
 3     3     3     3     3     3
 4     4     4     4     4     4
 5     5     5     5     5     5
 6     6     6     6     6     6
 7     7     7     7     7     7
 8     8     8     8     8     8
 9     9     9     9     9     9
10    10    10    10    10    10
11    11    11    11    11    11
12    12    12    12    12    12
13    13    13    13    13    13
14    14    14    14    14    14
15    15    15    15    15    15

Note:
you can use stopCluster(cl) to shut down the workers when the process is finished.

Importing a Big Xlsx File into R

Read / Import specific rows from large Excel files in R

Reading a big xslx file in r

Fastest way to read large Excel xlsx files? To parallelize or not?

Related Topics

Leave a reply