Processing Large Xlsx File

Processing large xlsx file

Try using the event API. See Event API (HSSF only) and XSSF and SAX (Event API) in the POI documentation for details. A couple of quotes from that page:

HSSF:

The event API is newer than the User API. It is intended for intermediate developers who are willing to learn a little bit of the low-level API structures. It's relatively simple to use, but requires a basic understanding of the parts of an Excel file (or willingness to learn). The advantage provided is that you can read an XLS with a relatively small memory footprint.
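
To make that concrete, here is a minimal sketch of the HSSF event style, assuming a placeholder .xls file name: a listener fires once per low-level record, so no workbook object graph is ever built.

import java.io.FileInputStream;
import org.apache.poi.hssf.eventusermodel.HSSFEventFactory;
import org.apache.poi.hssf.eventusermodel.HSSFRequest;
import org.apache.poi.hssf.record.SSTRecord;
import org.apache.poi.poifs.filesystem.POIFSFileSystem;

// Walk every record in the workbook stream, keeping memory flat
POIFSFileSystem fs = new POIFSFileSystem(new FileInputStream("workbook.xls"));
HSSFRequest request = new HSSFRequest();
request.addListenerForAllRecords(record -> {
    if (record instanceof SSTRecord) { // the shared-strings record, as one example
        System.out.println("shared strings: " + ((SSTRecord) record).getNumStrings());
    }
});
new HSSFEventFactory().processWorkbookEvents(request, fs);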

XSSF:

If memory footprint is an issue, then for XSSF, you can get at the underlying XML data and process it yourself. This is intended for intermediate developers who are willing to learn a little bit of the low-level structure of .xlsx files, and who are happy processing XML in Java. It's relatively simple to use, but requires a basic understanding of the file structure. The advantage provided is that you can read an XLSX file with a relatively small memory footprint.
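
In the same spirit, here is a minimal sketch of the XSSF/SAX route, again assuming a placeholder file name; it streams the first sheet and prints raw cell attributes only (a real reader would also resolve shared strings, which this skips):

import java.io.InputStream;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.apache.poi.openxml4j.opc.OPCPackage;
import org.apache.poi.xssf.eventusermodel.XSSFReader;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

try (OPCPackage pkg = OPCPackage.open("large.xlsx")) {
    XSSFReader xssfReader = new XSSFReader(pkg);
    SAXParser saxParser = SAXParserFactory.newInstance().newSAXParser();

    // Parse the first sheet's XML as a stream; the handler sees one
    // element at a time, so only the current cell is held in memory
    try (InputStream sheet = xssfReader.getSheetsData().next()) {
        saxParser.parse(sheet, new DefaultHandler() {
            @Override
            public void startElement(String uri, String local, String qName, Attributes attrs) {
                if ("c".equals(qName)) { // a cell: r = reference, t = cell type
                    System.out.println(attrs.getValue("r") + " type=" + attrs.getValue("t"));
                }
            }
        });
    }
}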

For output, one possible approach is described in the blog post Streaming xlsx files. (Basically, use XSSF to generate a container XML file, then stream the actual content as plain text into the appropriate xml part of the xlsx zip archive.)
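
Apache POI's SXSSFWorkbook packages essentially this idea: it keeps a sliding window of rows in memory and spools earlier rows to a temporary file. A minimal sketch, with an arbitrary file name and row count:

import java.io.FileOutputStream;
import org.apache.poi.ss.usermodel.Row;
import org.apache.poi.ss.usermodel.Sheet;
import org.apache.poi.xssf.streaming.SXSSFWorkbook;

// Keep only the most recent 100 rows in memory while writing
try (SXSSFWorkbook wb = new SXSSFWorkbook(100);
     FileOutputStream out = new FileOutputStream("big.xlsx")) {
    Sheet sheet = wb.createSheet("data");
    for (int r = 0; r < 500_000; r++) {
        Row row = sheet.createRow(r);
        for (int c = 0; c < 10; c++) {
            row.createCell(c).setCellValue("row:" + r + "-col:" + c);
        }
    }
    wb.write(out);
    wb.dispose(); // delete the temporary backing files
}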

How to process extremely large .xlsx files with C#

For reading Excel files I would recommend ExcelDataReader. It handles large files very well; I have personally tried files of 500k-1M rows:

using (var stream = File.Open("C:\\temp\\input.xlsx", FileMode.Open, FileAccess.Read))
{
    using (var reader = ExcelReaderFactory.CreateReader(stream))
    {
        while (reader.Read())
        {
            for (var i = 0; i < reader.FieldCount; i++)
            {
                var value = reader.GetValue(i)?.ToString();
            }
        }
    }
}

Writing data back as efficiently is trickier. I ended up creating my own SwiftExcel library, which is extremely fast and efficient (there is a performance chart comparing it to other NuGet libraries, including EPPlus) because it skips XML serialization and writes data directly to the file:

using (var ew = new ExcelWriter("C:\\temp\\test.xlsx"))
{
    for (var row = 1; row <= 100; row++)
    {
        for (var col = 1; col <= 10; col++)
        {
            ew.Write($"row:{row}-col:{col}", col, row);
        }
    }
}

Fastest way to read large Excel xlsx files? To parallelize or not?

You could try to run it in parallel using the parallel package, but it is a bit hard to estimate how fast it will be without sample data:

library(parallel)
library(readxl)

excel_path <- ""
sheets <- excel_sheets(excel_path)

Make a cluster with a specified number of cores:

cl <- makeCluster(detectCores() - 1)

Use parLapplyLB to go through the Excel sheets and read them in parallel, using load balancing:

parLapplyLB(cl, sheets, function(sheet, excel_path) {
  readxl::read_excel(excel_path, sheet = sheet)
}, excel_path)

You can use the package microbenchmark to test how fast certain options are:

library(microbenchmark)

microbenchmark(
  lapply = {lapply(sheets, function(sheet) {
    read_excel(excel_path, sheet = sheet)
  })},
  parallel = {parLapplyLB(cl, sheets, function(sheet, excel_path) {
    readxl::read_excel(excel_path, sheet = sheet)
  }, excel_path)},
  times = 10
)

In my case, the parallel version is faster:

Unit: milliseconds
    expr       min        lq     mean    median        uq      max neval
  lapply 133.44857 167.61801 179.0888 179.84616 194.35048 226.6890    10
parallel  58.94018  64.96452 118.5969  71.42688  80.48588 316.9914    10

The test file consists of 6 sheets, each containing this table:

   test test1 test3 test4 test5
1     1     1     1     1     1
2     2     2     2     2     2
3     3     3     3     3     3
4     4     4     4     4     4
5     5     5     5     5     5
6     6     6     6     6     6
7     7     7     7     7     7
8     8     8     8     8     8
9     9     9     9     9     9
10   10    10    10    10    10
11   11    11    11    11    11
12   12    12    12    12    12
13   13    13    13    13    13
14   14    14    14    14    14
15   15    15    15    15    15

Note: you can use stopCluster(cl) to shut down the workers when the process is finished.

Reading large XLSX files

If you only plan to read the Excel file's content, I suggest you use the ExcelDataReader library instead, which extracts the worksheet data into a DataSet object.

IExcelDataReader reader = null;
string FilePath = "PathToExcelFile";

// Load the file into a stream
FileStream stream = File.Open(FilePath, FileMode.Open, FileAccess.Read);

// Must check the file extension to pick the right reader for the Excel file type
if (Path.GetExtension(FilePath).Equals(".xls"))
    reader = ExcelReaderFactory.CreateBinaryReader(stream);
else if (Path.GetExtension(FilePath).Equals(".xlsx"))
    reader = ExcelReaderFactory.CreateOpenXmlReader(stream);

if (reader != null)
{
    // Fill a DataSet with the worksheet contents
    DataSet content = reader.AsDataSet();
    // Read....
}

How to load a large xlsx file with Apache POI?

I was in a similar situation with a webserver environment. The typical size of the uploads was ~150k rows, and it wouldn't have been good to consume a ton of memory for a single request. The Apache POI Streaming API works well for this, but it requires a total redesign of your read logic. I already had a bunch of read logic using the standard API that I didn't want to have to redo, so I wrote this instead: https://github.com/monitorjbl/excel-streaming-reader

It's not entirely a drop-in replacement for the standard XSSFWorkbook class, but if you're just iterating through rows it behaves similarly:

import com.monitorjbl.xlsx.StreamingReader;

InputStream is = new FileInputStream(new File("/path/to/workbook.xlsx"));
StreamingReader reader = StreamingReader.builder()
        .rowCacheSize(100)    // number of rows to keep in memory (defaults to 10)
        .bufferSize(4096)     // buffer size when reading the InputStream to a file (defaults to 1024)
        .sheetIndex(0)        // index of the sheet to read (defaults to 0)
        .read(is);            // InputStream or File for the XLSX file (required)

for (Row r : reader) {
    for (Cell c : r) {
        System.out.println(c.getStringCellValue());
    }
}

There are some caveats to using it; due to the way XLSX sheets are structured, not all data is available in the current window of the stream. However, if you're just trying to read simple data out of the cells, it works pretty well for that.


