Best Language to Parse Extremely Large Excel 2007 Files

I kept getting all kinds of weird errors when working with .xlsx files.

Here's a simple example of using Apache POI to traverse an .xlsx file, updated to POI v5. See also Upgrading to POI 3.5, including converting existing HSSF Usermodel code to SS Usermodel (for XSSF and HSSF).

import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import org.apache.poi.ss.usermodel.Cell;
import org.apache.poi.ss.usermodel.DateUtil;
import org.apache.poi.ss.usermodel.FormulaEvaluator;
import org.apache.poi.ss.usermodel.Row;
import org.apache.poi.ss.usermodel.Sheet;
import org.apache.poi.ss.usermodel.Workbook;
import org.apache.poi.xssf.usermodel.XSSFWorkbook;

/** @see https://stackoverflow.com/a/3562214/230513 */
public class XlsxReader {

    public static void main(String[] args) throws IOException {
        // try-with-resources closes the stream and workbook even if parsing fails
        try (InputStream myxls = new FileInputStream("test.xlsx");
             Workbook book = new XSSFWorkbook(myxls)) {
            FormulaEvaluator eval =
                book.getCreationHelper().createFormulaEvaluator();
            Sheet sheet = book.getSheetAt(0);
            for (Row row : sheet) {
                for (Cell cell : row) {
                    printCell(cell, eval);
                    System.out.print("; ");
                }
                System.out.println();
            }
        }
    }

    private static void printCell(Cell cell, FormulaEvaluator eval) {
        switch (cell.getCellType()) {
            case BLANK:
                System.out.print("EMPTY");
                break;
            case STRING:
                System.out.print(cell.getStringCellValue());
                break;
            case NUMERIC:
                if (DateUtil.isCellDateFormatted(cell)) {
                    System.out.print(cell.getDateCellValue());
                } else {
                    System.out.print(cell.getNumericCellValue());
                }
                break;
            case BOOLEAN:
                System.out.print(cell.getBooleanCellValue());
                break;
            case FORMULA:
                System.out.print(cell.getCellFormula());
                break;
            default:
                System.out.print("DEFAULT");
        }
    }
}

Smallest learning curve language to work with CSV files

There are many tools for the job, but yes, Python is perhaps the best these days. There is a special module for dealing with csv files. Check the official docs.
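
As a minimal sketch of the csv module mentioned above (the function name and file layout are made up for illustration), reading a file one row at a time looks roughly like this:

```python
import csv


def read_rows(path):
    """Yield each row of a CSV file as a list of strings, one row at a time."""
    # newline="" lets the csv module handle newlines embedded in quoted fields
    with open(path, newline="", encoding="utf-8") as handle:
        for row in csv.reader(handle):
            yield row
```

Because the function is a generator, something like `for row in read_rows("data.csv"): print(row)` (with a hypothetical data.csv) never holds more than one row in memory, which matters for large files.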

Reading Big XLS and XLSX files

Okay, so I've tried replicating your Excel file, and I completely threw XLSX2CSV out the window. I don't think converting the xlsx to csv is the right approach because, depending on your XLSX format, the converter can read all the empty rows (you probably know that, since you've set a row counter of 60k). Not only that, but depending on the fields, it may or may not produce incorrect output for special characters, which is your problem.

What I've done is use this library, https://github.com/davidpelfree/sjxlsx, to read and rewrite the file. It's pretty straightforward, and the newly generated xlsx file has the fields corrected.

I suggest you try this approach (maybe not with this lib): rewrite the file in order to correct it.
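
If you do end up emitting CSV yourself, the two problems described above (padding rows that are entirely empty, and fields with special characters) can be handled roughly as in this Python sketch. The function name and the row format are hypothetical; this is not what XLSX2CSV or sjxlsx do internally.

```python
import csv
import io


def rows_to_csv(rows):
    """Serialize rows to CSV text, dropping rows whose cells are all empty."""
    out = io.StringIO()
    # csv.writer quotes any field that contains a comma, a quote or a newline
    writer = csv.writer(out, lineterminator="\n")
    for row in rows:
        # skip padding rows: every cell is None or whitespace only
        if all(cell is None or str(cell).strip() == "" for cell in row):
            continue
        writer.writerow(["" if cell is None else cell for cell in row])
    return out.getvalue()
```

Leaving the quoting to the csv module, rather than joining fields with commas by hand, is what keeps values containing delimiters or quotes intact.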

Perl reading huge excel file

I'm working on a new module for fast and memory efficient reading of Excel xlsx files with Perl. It isn't on CPAN yet (it needs a good bit more work) but you can get it on GitHub.

Here is an example of how to use it:

use strict;
use warnings;
use Excel::Reader::XLSX;

my $reader   = Excel::Reader::XLSX->new();
my $workbook = $reader->read_file( 'Book1.xlsx' );

if ( !defined $workbook ) {
    die $reader->error(), "\n";
}

for my $worksheet ( $workbook->worksheets() ) {

    my $sheetname = $worksheet->name();

    print "Sheet = $sheetname\n";

    while ( my $row = $worksheet->next_row() ) {

        while ( my $cell = $row->next_cell() ) {

            my $row   = $cell->row();
            my $col   = $cell->col();
            my $value = $cell->value();

            print "  Cell ($row, $col) = $value\n";
        }
    }
}

__END__

Update: This module never made it to CPAN quality. Try Spreadsheet::ParseXLSX instead.

Fastest way to read the first row of big XLSX file in Ruby

The Ruby gem roo does not support file streaming; it reads the whole file into memory. As you say, that works fine for smaller files but not so well for reading small sections of huge files.

You need to use a different library/approach. For example, you can use the creek gem, which describes itself as:

a Ruby gem that provides a fast, simple and efficient method of parsing large Excel (xlsx and xlsm) files.

And, taking the example from the project's README, it's pretty straightforward to translate the code you wrote for roo into code that uses creek:

require 'creek'
creek = Creek::Book.new(file_path)
sheet = creek.sheets[0]
# rows is enumerated lazily, so take the first element instead of indexing
header = sheet.rows.first

Note: A quick google of your StackOverflow question title led me to this blog post as the top search result. It's always worth searching on Google first.
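
To illustrate what streaming buys you here, independent of Ruby or creek: an .xlsx file is a zip archive of XML parts, so the first row can be pulled out with an incremental XML parser that stops as soon as the first row element closes. The Python sketch below uses only the standard library. It assumes the sheet of interest is stored as xl/worksheets/sheet1.xml (common, but not guaranteed), and the function names are mine, not part of any library.

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespace used by the SpreadsheetML parts inside an .xlsx archive
NS = "{http://schemas.openxmlformats.org/spreadsheetml/2006/main}"


def _shared_strings(archive, wanted):
    """Stream xl/sharedStrings.xml and return {index: text} for wanted indices."""
    found = {}
    if not wanted:
        return found
    last = max(wanted)
    with archive.open("xl/sharedStrings.xml") as stream:
        index = 0
        for _event, elem in ET.iterparse(stream, events=("end",)):
            if elem.tag != NS + "si":
                continue
            if index in wanted:
                found[index] = "".join(t.text or "" for t in elem.iter(NS + "t"))
            elem.clear()  # release the memory held by the finished entry
            if index >= last:
                break  # everything we need has been read
            index += 1
    return found


def first_row(path, sheet_part="xl/worksheets/sheet1.xml"):
    """Return the first stored row's cell values as strings."""
    with zipfile.ZipFile(path) as archive:
        cells = []  # (cell type, raw text) for each cell in the first row
        with archive.open(sheet_part) as stream:
            for _event, elem in ET.iterparse(stream, events=("end",)):
                if elem.tag == NS + "row":
                    for cell in elem.iter(NS + "c"):
                        if cell.get("t") == "inlineStr":
                            raw = "".join(t.text or "" for t in cell.iter(NS + "t"))
                        else:
                            raw = cell.findtext(NS + "v", default="")
                        cells.append((cell.get("t"), raw))
                    break  # first row is complete; skip the rest of the sheet
        # cells of type "s" hold an index into the shared string table
        wanted = {int(raw) for kind, raw in cells if kind == "s" and raw}
        strings = _shared_strings(archive, wanted)
        return [strings[int(raw)] if kind == "s" and raw else raw
                for kind, raw in cells]
```

Note the limits of the sketch: it returns the first row that is actually stored in the sheet (which may not be row 1), it does not pad skipped cells, and numbers and dates come back as their raw stored text.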

excel xlsx file parsing - using koogra

Found it with some more googling.
The first line is for Excel 2007 (xlsx); the second, commented-out line is for the xls version.

        Net.SourceForge.Koogra.IWorkbook genericWB = Net.SourceForge.Koogra.WorkbookFactory.GetExcel2007Reader("tst.xlsx");

        //genericWB = Net.SourceForge.Koogra.WorkbookFactory.GetExcelBIFFReader("some.xls");

        Net.SourceForge.Koogra.IWorksheet genericWS = genericWB.Worksheets.GetWorksheetByIndex(0);

        for (uint r = genericWS.FirstRow; r <= genericWS.LastRow; ++r)
        {
            Net.SourceForge.Koogra.IRow row = genericWS.Rows.GetRow(r);

            for (uint c = genericWS.FirstCol; c <= genericWS.LastCol; ++c)
            {
                // raw value
                Console.WriteLine(row.GetCell(c).Value);

                // formatted value
                Console.WriteLine(row.GetCell(c).GetFormattedValue());
            }
        }

I hope this helps anyone else out there who ran into the same "out of memory" issue.
Enjoy.

A small update to the code above

OK, I have played with this a little. As far as the content of the file goes, the chart is ranked by unique IP, and the current code is:

            //place source file within your current:
            //project directory\bin\debug and you should find extracted file next to the source file 
            var pathtoRead = Path.Combine(Environment.CurrentDirectory, "tst.xlsx");
            var pathtoWrite = Path.Combine(Environment.CurrentDirectory, "tst.txt");

            Net.SourceForge.Koogra.IWorkbook genericWB = Net.SourceForge.Koogra.WorkbookFactory.GetExcel2007Reader(pathtoRead);
            Net.SourceForge.Koogra.IWorksheet genericWS = genericWB.Worksheets.GetWorksheetByIndex(0);
            StringBuilder SbXls = new StringBuilder();
            for (uint r = genericWS.FirstRow; r <= genericWS.LastRow; ++r)
            {
                Net.SourceForge.Koogra.IRow row = genericWS.Rows.GetRow(r);
                string LineEnding = string.Empty;
                for (uint ColCount = genericWS.FirstCol; ColCount <= genericWS.LastCol; ++ColCount)
                {
                    // only the first two columns (0 and 1) are exported
                    if (ColCount > 1)
                        break;

                    var formatted = row.GetCell(ColCount).GetFormattedValue();
                    // tab after column 0, newline after column 1
                    LineEnding = ColCount == 0 ? "\t" : Environment.NewLine;
                    SbXls.Append(string.Concat(formatted, LineEnding));
                }
            }
            File.WriteAllText(pathtoWrite, SbXls.ToString());
