Efficient Way of Processing Large CSV File Using Java

How to parse a huge CSV file efficiently in Java

univocity-parsers is your best bet for loading the CSV file; you probably won't be able to hand-code anything faster. The problems you are having most likely come from one of two things:

1 - Loading everything into memory. That's generally a bad design decision, but if you do it, make sure to allocate enough memory for your application. Give it more memory
using the flags -Xms8G and -Xmx8G, for example.

2 - You are probably not batching your insert statements.

My suggestion is to try this (using univocity-parsers):

    import com.univocity.parsers.csv.CsvParser;
    import com.univocity.parsers.csv.CsvParserSettings;

    import java.io.File;
    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.util.Iterator;

    //configure the input format (delimiter, quotes, etc.)
    CsvParserSettings settings = new CsvParserSettings();

    //get an iterator over the rows of the file
    CsvParser parser = new CsvParser(settings);
    Iterator<String[]> it = parser.iterate(new File("/path/to/your.csv"), "UTF-8").iterator();

    //connect to the database and create an insert statement
    Connection connection = getYourDatabaseConnectionSomehow();
    final int COLUMN_COUNT = 2;
    PreparedStatement statement = connection.prepareStatement("INSERT INTO some_table(column1, column2) VALUES (?,?)");

    //run batch inserts of 1000 rows per batch
    int batchSize = 0;
    while (it.hasNext()) {
        //get the next row from the parser and set the values in your statement
        String[] row = it.next();
        for (int i = 0; i < COLUMN_COUNT; i++) {
            if (i < row.length) {
                statement.setObject(i + 1, row[i]);
            } else { //row in the input is shorter than COLUMN_COUNT
                statement.setObject(i + 1, null);
            }
        }

        //add the values to the batch
        statement.addBatch();
        batchSize++;

        //once 1000 rows have made it into the batch, execute it
        if (batchSize == 1000) {
            statement.executeBatch();
            batchSize = 0;
        }
    }

    //the last batch probably won't have exactly 1000 rows
    if (batchSize > 0) {
        statement.executeBatch();
    }

This should execute pretty quickly, and you won't need even 100 MB of memory to run it.

For the sake of clarity, I didn't use any try/catch/finally blocks to close resources here. Your actual code must handle that.

Hope it helps.

What is the best way to process large CSV files?

Okay. After spending some time with this problem (reading, consulting, experimenting, and doing several PoCs), I came up with the following solution.

TL;DR

Database: PostgreSQL, as it handles CSV well and is free and open source.

Tool: Apache Spark is a good fit for this type of task, with good performance.

DB

The database is an important decision: what to pick, and how it will work in the future with this amount of data. It should definitely be a separate server instance, so it does not put additional load on the main database instance or block other applications.

NoSQL

I thought about using Cassandra here, but that solution would be too complex right now. Cassandra does not support ad-hoc queries; its data storage layer is basically a key-value store. That means you must "model" your data around the queries you need, rather than around the structure of the data itself.

RDBMS

I didn't want to over-engineer here, so I settled on a relational database.

MS SQL Server

It is a viable way to go, but the big downside is pricing: it is pretty expensive. The Enterprise edition costs a lot of money given our hardware. Regarding pricing, you can read this policy document.

Another drawback here was CSV support, and CSV files will be the main data source for us. MS SQL Server can neither import nor export CSV properly:

  • MS SQL Server silently truncating a text field.

  • MS SQL Server's text encoding handling going wrong.

  • MS SQL Server throwing an error message because it doesn't understand quoting or escaping.

More on that comparison can be found in the article PostgreSQL vs. MS SQL Server.

PostgreSQL

This database is a mature, well battle-tested product. I have heard a lot of positive feedback about it from others (of course, there are some trade-offs too). It has a more classic SQL syntax, good CSV support, and, moreover, it is open source.
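
As an illustration of that CSV support, here is a minimal sketch of bulk-loading a CSV file into PostgreSQL from Java via the JDBC driver's CopyManager (which wraps the COPY command); the connection details, table name, and file path are placeholders for the example, not part of the original setup.

    import java.io.FileReader;
    import java.io.Reader;
    import java.sql.Connection;
    import java.sql.DriverManager;

    import org.postgresql.PGConnection;
    import org.postgresql.copy.CopyManager;

    public class PostgresCsvLoad {
        public static void main(String[] args) throws Exception {
            //connection details are placeholders
            try (Connection connection = DriverManager.getConnection(
                    "jdbc:postgresql://localhost:5432/mydb", "user", "password");
                 Reader reader = new FileReader("/path/to/your.csv")) {

                //COPY streams the whole file through a single command
                //instead of issuing one INSERT per row
                CopyManager copyManager = connection.unwrap(PGConnection.class).getCopyAPI();
                long rowsLoaded = copyManager.copyIn(
                        "COPY some_table FROM STDIN WITH (FORMAT csv, HEADER)", reader);

                System.out.println("Loaded " + rowsLoaded + " rows");
            }
        }
    }

If COPY isn't an option, the batched PreparedStatement approach shown earlier works against PostgreSQL as well.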

It is worth mentioning that SSMS is way better than pgAdmin. SSMS has autocomplete and can show multiple result sets (when you run several queries you get all the results at once, whereas in pgAdmin you only get the last one).

Anyway, right now I'm using DataGrip from JetBrains.

Processing Tool

I've looked through Spring Batch and Apache Spark. Spring Batch is a bit too low-level for this task, and Apache Spark makes it easier to scale if that is needed in the future. That said, Spring Batch could do this work too.

Regarding an Apache Spark example, the code can be found in the learning-spark project.
My choice is Apache Spark for now.
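
For a rough idea of what this looks like in code, below is a minimal sketch of reading and aggregating a CSV with Spark's Java API; the file path, the "category" column, and the local[*] master are assumptions for illustration, not code taken from the learning-spark project.

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    public class CsvWithSpark {
        public static void main(String[] args) {
            //local[*] runs Spark on all local cores; point this at a real cluster in production
            SparkSession spark = SparkSession.builder()
                    .appName("csv-processing")
                    .master("local[*]")
                    .getOrCreate();

            //Spark reads the CSV lazily and in partitions, so the whole file
            //never has to fit in memory at once
            Dataset<Row> rows = spark.read()
                    .option("header", "true")
                    .option("inferSchema", "true")
                    .csv("/path/to/your.csv");

            //example transformation: count rows per value of a hypothetical "category" column
            rows.groupBy("category").count().show();

            spark.stop();
        }
    }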

How to handle processing a large CSV file or read a large CSV file in chunks

The enhanced for loop (for (MyObject myObject : myObjects)) is implemented using an Iterator: it requires that the instance returned by csv.parse(strat, getReader("file.txt")) implements the Iterable interface, which declares an iterator() method returning an Iterator. So there is no performance difference between the two code snippets.

P.S

In the second snippet, don't use the raw Iterator type; use Iterator<MyObject>:

    Iterator<MyObject> myObjects = csv.parse(strat, getReader("file.txt")).iterator();

    while (myObjects.hasNext()) {
        MyObject myObject = myObjects.next();
        System.out.println(myObject);
    }
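
For comparison, this is the enhanced for loop form of the same iteration; it assumes the object returned by csv.parse(strat, getReader("file.txt")) implements Iterable<MyObject> (a List does), and the compiler rewrites it into exactly the iterator-based loop above.

    for (MyObject myObject : csv.parse(strat, getReader("file.txt"))) {
        //each pass calls hasNext()/next() on the underlying Iterator
        System.out.println(myObject);
    }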

Best/efficient way to search a large CSV in Java

A file of about 1.5K entries and roughly 1 MB should take tens of milliseconds to read. A 1 GB file could take tens of seconds, and it might be worth saving an index for such a file to avoid having to re-read it each time.

You can load it into a Map to have an index by name.

You can also add a latitude and longitude index via a NavigableMap, which will speed up lookups by location; a sketch of both indexes follows below.

Loading the file once takes a little time, but reading the file from disk each time is much slower.
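
Here is a minimal sketch of both indexes, assuming each CSV line holds a name, a latitude and a longitude (the field layout, file path and naive comma splitting are simplifications; a real CSV parser should handle quoting):

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;
    import java.util.NavigableMap;
    import java.util.TreeMap;

    public class CsvIndexes {
        static class Place {
            final String name;
            final double latitude;
            final double longitude;

            Place(String name, double latitude, double longitude) {
                this.name = name;
                this.latitude = latitude;
                this.longitude = longitude;
            }

            @Override
            public String toString() {
                return name + " (" + latitude + ", " + longitude + ")";
            }
        }

        public static void main(String[] args) throws Exception {
            Map<String, Place> byName = new HashMap<>();
            NavigableMap<Double, List<Place>> byLatitude = new TreeMap<>();

            //load the file once and build both in-memory indexes
            try (BufferedReader reader = new BufferedReader(new FileReader("/path/to/places.csv"))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    String[] fields = line.split(","); //naive split; no quoting support
                    Place place = new Place(fields[0],
                            Double.parseDouble(fields[1]),
                            Double.parseDouble(fields[2]));
                    byName.put(place.name, place);
                    byLatitude.computeIfAbsent(place.latitude, k -> new ArrayList<>()).add(place);
                }
            }

            //O(1) lookup by name
            System.out.println(byName.get("London"));

            //range query by latitude: everything between 51.0 and 52.0 degrees
            for (List<Place> places : byLatitude.subMap(51.0, true, 52.0, true).values()) {
                places.forEach(System.out::println);
            }
        }
    }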

BTW, you can have hundreds of TB of data with trillions of rows; to use that much data in Java you have to get creative.

In short, if the file is much smaller than the memory you have, it's a relatively small file.

Read large CSV in Java

Read it line by line, with something like this:

    //opencsv's CSVReader (com.opencsv.CSVReader in recent versions)
    CSVReader reader = new CSVReader(new FileReader("yourfile.csv"));
    String[] nextLine;
    while ((nextLine = reader.readNext()) != null) {
        // nextLine[] is an array of values from the line
        System.out.println(nextLine[0] + nextLine[1] + "etc...");
    }
    reader.close();

