Read / Import specific rows from large Excel files in R
Let's say a is our Excel file, read in as a data frame:
library(readxl)
a <- as.data.frame(read_excel("Pattern/File.xlsx",sheet = "Results"))
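For instance, to select columns 1 to 3 and keep only the rows whose first column is not NA, use
subset(a[, 1:3], !is.na(a[[1]]))
This subsets the input data frame to the rows where the first column holds a value other than NA. The whole sheet is still read first; the filtering happens afterwards in memory.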
Output:
  ...1 name   ID
1    1  Dan us1d
4   13  Nev sa2e
6   34  Sam il5a
Note the first column name, ...1. It is generated automatically by read_excel() when a column has no header, and should not be a problem.
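If only a known block of rows is needed, readxl can also restrict what it reads at import time instead of filtering afterwards. The following is a minimal sketch and not part of the original answer: it uses the datasets.xlsx example workbook that ships with readxl (located with readxl_example()) as a stand-in for the real file, and the row numbers are arbitrary.

```r
library(readxl)

# Assumption: the example workbook bundled with readxl stands in for the real file.
path <- readxl_example("datasets.xlsx")

# Read the header plus only the first 10 data rows of the "iris" sheet.
first_rows <- read_excel(path, sheet = "iris", n_max = 10)

# Read spreadsheet rows 102 to 151 only. The range does not include the header
# row, so col_names = FALSE keeps the first row of the range as data.
row_block <- read_excel(path, sheet = "iris",
                        range = cell_rows(102:151), col_names = FALSE)

# Filtering after import, as in the answer above: keep the rows whose
# first column is not NA.
kept <- first_rows[!is.na(first_rows[[1]]), ]
```

With col_names = FALSE the columns of row_block get autogenerated names (...1, ...2, and so on), the same mechanism that produced the ...1 column above.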
Reading a big xlsx file in R
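You can increase the memory available to Java through R's options. The setting is read when the JVM starts, so set it before loading the Java-based package (assumed here to be xlsx, since the answer mentions Java). Typically like this:
# Must run before library(xlsx), i.e. before the JVM is started
options(java.parameters = "-Xmx1000m")
library(xlsx)
data <- read.xlsx(filepath, sheetIndex = 1)  # xlsx::read.xlsx needs a sheet index or name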
Fastest way to read large Excel xlsx files? To parallelize or not?
You could try to run it in parallel using the parallel package, but it is hard to estimate how fast it will be without sample data:
library(parallel)
library(readxl)
excel_path <- ""
sheets <- excel_sheets(excel_path)
Make a cluster with a specified number of cores:
cl <- makeCluster(detectCores() - 1)
Use parLapplyLB to go through the Excel sheets and read them in parallel using load balancing:
parLapplyLB(cl, sheets, function(sheet, excel_path) {
  readxl::read_excel(excel_path, sheet = sheet)
}, excel_path)
You can use the microbenchmark package to test how fast certain options are:
library(microbenchmark)
microbenchmark(
  lapply = {lapply(sheets, function(sheet) {
    read_excel(excel_path, sheet = sheet)
  })},
  parallel = {parLapplyLB(cl, sheets, function(sheet, excel_path) {
    readxl::read_excel(excel_path, sheet = sheet)
  }, excel_path)},
  times = 10
)
In my case, the parallel version is faster by median (about 71 ms versus 180 ms), although its slowest run was the slowest overall:
Unit: milliseconds
     expr       min        lq     mean    median        uq      max neval
   lapply 133.44857 167.61801 179.0888 179.84616 194.35048 226.6890    10
 parallel  58.94018  64.96452 118.5969  71.42688  80.48588 316.9914    10
The test file consists of 6 sheets, each containing this table:
   test test1 test3 test4 test5
1     1     1     1     1     1
2     2     2     2     2     2
3     3     3     3     3     3
4     4     4     4     4     4
5     5     5     5     5     5
6     6     6     6     6     6
7     7     7     7     7     7
8     8     8     8     8     8
9     9     9     9     9     9
10   10    10    10    10    10
11   11    11    11    11    11
12   12    12    12    12    12
13   13    13    13    13    13
14   14    14    14    14    14
15   15    15    15    15    15
Note: you can use stopCluster(cl) to shut down the workers when the process is finished.
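The steps above can be wrapped in a function so that the cluster is always shut down, even if reading a sheet fails. This is a minimal sketch under stated assumptions: read_all_sheets and its n_workers argument are hypothetical names introduced here, and readxl's bundled datasets.xlsx stands in for the real workbook.

```r
library(parallel)
library(readxl)

# Hypothetical helper: read every sheet of a workbook in parallel and
# return a list of data frames named after the sheets.
read_all_sheets <- function(excel_path, n_workers = 2) {
  sheets <- excel_sheets(excel_path)
  cl <- makeCluster(n_workers)
  # Shut the workers down when the function exits, even on error.
  on.exit(stopCluster(cl), add = TRUE)
  tables <- parLapplyLB(cl, sheets, function(sheet, excel_path) {
    readxl::read_excel(excel_path, sheet = sheet)
  }, excel_path)
  setNames(tables, sheets)
}

# Assumption: readxl's example workbook stands in for the real file.
all_tables <- read_all_sheets(readxl_example("datasets.xlsx"))
names(all_tables)
```

A fixed default of two workers is used instead of detectCores() - 1 so the sketch also runs on a single-core machine; for real workloads the worker count is worth benchmarking as shown above.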