Best language to parse extremely large Excel 2007 files
I kept getting all kinds of weird errors when working with .xlsx files.
Here's a simple example of using Apache POI to traverse an .xlsx
file, updated to POI v5. See also Upgrading to POI 3.5, including converting existing HSSF Usermodel code to SS Usermodel (for XSSF and HSSF).
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import org.apache.poi.ss.usermodel.Cell;
import org.apache.poi.ss.usermodel.DateUtil;
import org.apache.poi.ss.usermodel.FormulaEvaluator;
import org.apache.poi.ss.usermodel.Row;
import org.apache.poi.ss.usermodel.Sheet;
import org.apache.poi.ss.usermodel.Workbook;
import org.apache.poi.xssf.usermodel.XSSFWorkbook;
/** @see https://stackoverflow.com/a/3562214/230513 */
public class XlsxReader {
public static void main(String[] args) throws IOException {
InputStream myxls = new FileInputStream("test.xlsx");
Workbook book = new XSSFWorkbook(myxls);
FormulaEvaluator eval =
book.getCreationHelper().createFormulaEvaluator();
Sheet sheet = book.getSheetAt(0);
for (Row row : sheet) {
for (Cell cell : row) {
printCell(cell, eval);
System.out.print("; ");
}
System.out.println();
}
myxls.close();
}
private static void printCell(Cell cell, FormulaEvaluator eval) {
switch (cell.getCellType()) {
case BLANK:
System.out.print("EMPTY");
break;
case STRING:
System.out.print(cell.getStringCellValue());
break;
case NUMERIC:
if (DateUtil.isCellDateFormatted(cell)) {
System.out.print(cell.getDateCellValue());
} else {
System.out.print(cell.getNumericCellValue());
}
break;
case BOOLEAN:
System.out.print(cell.getBooleanCellValue());
break;
case FORMULA:
System.out.print(cell.getCellFormula());
break;
default:
System.out.print("DEFAULT");
}
}
}
Smallest learning curve language to work with CSV files
There are many tools for the job, but yes, Python is perhaps the best these days. There is a special module for dealing with csv files. Check the official docs.
Reading Big XLS and XLSX files
Okay, so I've tried replicating your excel file and I completly threw the XLSX2CSV out the window. I don't think the approach of converting the xlsx into csv is the right one because, as depending on your XLSX format, it can read all the empty rows (you probably know that because you've set a row counter of 60k). not only that but if we're taking into consideration fields, it may or may not cause incorrect output with special characters, like your problem.
What I've done is I've used this library https://github.com/davidpelfree/sjxlsx to read and re-write the file. It's pretty much straight-forward and the new xlsx generated file has the fields corrected.
I suggest you try this approach (maybe not with this lib), of trying to re-write the file in order to correct it.
Perl reading huge excel file
I'm working on a new module for fast and memory efficient reading of Excel xlsx files with Perl. It isn't on CPAN yet (it needs a good bit more work) but you can get it on GitHub.
Here is a example of how to use it:
use strict;
use warnings;
use Excel::Reader::XLSX;
my $reader = Excel::Reader::XLSX->new();
my $workbook = $reader->read_file( 'Book1.xlsx' );
if ( !defined $workbook ) {
die $reader->error(), "\n";
}
for my $worksheet ( $workbook->worksheets() ) {
my $sheetname = $worksheet->name();
print "Sheet = $sheetname\n";
while ( my $row = $worksheet->next_row() ) {
while ( my $cell = $row->next_cell() ) {
my $row = $cell->row();
my $col = $cell->col();
my $value = $cell->value();
print " Cell ($row, $col) = $value\n";
}
}
}
__END__
Update: This module never made it to CPAN quality. Try Spreadsheet::ParseXLSX instead.
Smallest learning curve language to work with CSV files
There are many tools for the job, but yes, Python is perhaps the best these days. There is a special module for dealing with csv files. Check the official docs.
Fastest way to read the first row of big XLSX file in Ruby
The ruby gem roo
does not support file streaming; it reads the whole file into memory. Which, as you say, works fine for smaller files but not so well for reading small sections of huge files.
You need to use a different library/approach. For example, you can use the gem: creek
, which describes itself as:
a Ruby gem that provides a fast, simple and efficient method of parsing large Excel (xlsx and xlsm) files.
And, taking the example from the project's README, it's pretty straightforward to translate the code you wrote for roo
into code that uses creek
:
require 'creek'
creek = Creek::Book.new(file_path)
sheet = creek.sheets[0]
header = sheet.rows[0]
Note: A quick google of your StackOverflow question title led me to this blog post as the top search result. It's always worth searching on Google first.
excel xlsx file parsing - using koogra
found it by some more .... google ing...
first row for usage on 2007 (xlsx
)
second row is for xls
version
Net.SourceForge.Koogra.IWorkbook genericWB = Net.SourceForge.Koogra.WorkbookFactory.GetExcel2007Reader("tst.xlsx");
//genericWB = Net.SourceForge.Koogra.WorkbookFactory.GetExcelBIFFReader("some.xls");
Net.SourceForge.Koogra.IWorksheet genericWS = genericWB.Worksheets.GetWorksheetByIndex(0);
for (uint r = genericWS.FirstRow; r <= genericWS.LastRow; ++r)
{
Net.SourceForge.Koogra.IRow row = genericWS.Rows.GetRow(r);
for (uint c = genericWS.FirstCol; c <= genericWS.LastCol; ++c)
{
// raw value
Console.WriteLine(row.GetCell(c).Value);
// formatted value
Console.WriteLine(row.GetCell(c).GetFormattedValue());
}
}
i hope that i helped anyone else out there that encountered same "out of memory" issue ... '
enjoy
a small update to the code above
OK.. I Have played with this a little , so as far as it is related to the content of the file
the chart is ranked based on Unique IP
and the current code is
//place source file within your current:
//project directory\bin\debug and you should find extracted file next to the source file
var pathtoRead = Path.Combine(Environment.CurrentDirectory, "tst.xlsx");
var pathtoWrite = Path.Combine(Environment.CurrentDirectory, "tst.txt");
Net.SourceForge.Koogra.IWorkbook genericWB = Net.SourceForge.Koogra.WorkbookFactory.GetExcel2007Reader(pathtoRead);
Net.SourceForge.Koogra.IWorksheet genericWS = genericWB.Worksheets.GetWorksheetByIndex(0);
StringBuilder SbXls = new StringBuilder();
for (uint r = genericWS.FirstRow; r <= genericWS.LastRow; ++r)
{
Net.SourceForge.Koogra.IRow row = genericWS.Rows.GetRow(r);
string LineEnding = string.Empty;
for (uint ColCount = genericWS.FirstCol; ColCount <= genericWS.LastCol; ++ColCount)
{
var formated = row.GetCell(ColCount).GetFormattedValue();
if (ColCount == 1)
LineEnding = Environment.NewLine;
else if (ColCount == 0)
LineEnding = "\t";
if (ColCount > 1 == false)
SbXls.Append(string.Concat(formated, LineEnding));
}
}
File.WriteAllText(pathtoWrite, SbXls.ToString());
Related Topics
Xml Element with Attribute and Content Using Jaxb
Multiple Axes on the Same Data
Difference Between Paint() and Paintcomponent()
CSV File with "Id" as First Item Is Corrupt in Excel
Does the Preparedstatement Avoid SQL Injection
Hibernate: "Field 'Id' Doesn't Have a Default Value"
Java. Ignore Accents When Comparing Strings
Good Examples Using Java.Util.Logging
Why Does Collections.Sort Use Mergesort But Arrays.Sort Does Not
How to Zip a Folder Itself Using Java
What's the Difference Between Getrequesturi and Getpathinfo Methods in Httpservletrequest
How to See Javadoc in Intellij Idea
How to Get the Unique Id of an Object Which Overrides Hashcode()
Put Byte Array to JSON and Vice Versa
Java: Difference Between Strong/Soft/Weak/Phantom Reference
Rounding Up a Number to Nearest Multiple of 5