How to parse huge csv file efficiently in java
univocity-parsers is your best bet for loading the CSV file; you probably won't be able to hand-code anything faster. The problems you are having likely come from two things:
1 - Loading everything into memory. That's generally a bad design decision, but if you do it, make sure to allocate enough memory for your application, for example with the JVM flags -Xms8G and -Xmx8G.
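For illustration, a hypothetical launch command setting both the initial and maximum heap to 8 GB (the jar name here is made up):

```shell
java -Xms8G -Xmx8G -jar your-app.jar
```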
2 - You are probably not batching your insert statements.
My suggestion is to try this (using univocity-parsers):
//configure the input format
CsvParserSettings settings = new CsvParserSettings();

//get an iterator
CsvParser parser = new CsvParser(settings);
Iterator<String[]> it = parser.iterate(new File("/path/to/your.csv"), "UTF-8").iterator();

//connect to the database and create an insert statement
Connection connection = getYourDatabaseConnectionSomehow();
final int COLUMN_COUNT = 2;
PreparedStatement statement = connection.prepareStatement("INSERT INTO some_table(column1, column2) VALUES (?,?)");

//run batch inserts of 1000 rows per batch
int batchSize = 0;
while (it.hasNext()) {
    //get the next row from the parser and set the values in your statement
    String[] row = it.next();
    for (int i = 0; i < COLUMN_COUNT; i++) {
        if (i < row.length) {
            statement.setObject(i + 1, row[i]);
        } else { //row in the input is shorter than COLUMN_COUNT
            statement.setObject(i + 1, null);
        }
    }

    //add the values to the batch
    statement.addBatch();
    batchSize++;

    //once 1000 rows have made it into the batch, execute it
    if (batchSize == 1000) {
        statement.executeBatch();
        batchSize = 0;
    }
}

// the last batch probably won't have 1000 rows
if (batchSize > 0) {
    statement.executeBatch();
}
This should execute pretty quickly, and you won't need even 100 MB of memory to run it.
For the sake of clarity, I didn't use any try/catch/finally blocks to close resources here. Your actual code must handle that.
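The chunk-and-flush pattern in the loop above can be exercised without a database. This standalone sketch (the class and method names are mine, not from any library) replaces the PreparedStatement with a counter and checks how many executeBatch() calls the loop would make:

```java
import java.util.Collections;
import java.util.Iterator;

public class BatchFlushDemo {

    // Mirrors the loop above: flush every `batchSize` rows,
    // plus one final flush for the leftover partial batch.
    static int countFlushes(Iterator<String[]> rows, int batchSize) {
        int pending = 0, flushes = 0;
        while (rows.hasNext()) {
            rows.next();     // real code: statement.setObject(...) and addBatch()
            pending++;
            if (pending == batchSize) {
                flushes++;   // real code: statement.executeBatch()
                pending = 0;
            }
        }
        if (pending > 0) {
            flushes++;       // the last batch probably won't be full
        }
        return flushes;
    }

    public static void main(String[] args) {
        Iterator<String[]> rows = Collections.nCopies(2500, new String[]{"a", "b"}).iterator();
        System.out.println(countFlushes(rows, 1000)); // 2500 rows -> 3 flushes (1000 + 1000 + 500)
    }
}
```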
Hope it helps.
What is the best way to process large CSV files?
Okay. After spending some time with this problem (reading, consulting, experimenting, and doing several PoCs), I came up with the following solution.
Tl;dr
Database: PostgreSQL, as it is good with CSV, free, and open source.
Tool: Apache Spark is a good fit for this type of task, with good performance.
DB
Regarding the database, this is an important decision: what to pick, and how it will work in the future with this amount of data. It should definitely be a separate server instance, in order not to generate additional load on the main database instance and not to block other applications.
NoSQL
I thought about using Cassandra here, but this solution would be too complex right now. Cassandra does not have ad-hoc queries. Cassandra's data storage layer is basically a key-value storage system. That means you must "model" your data around the queries you need, rather than around the structure of the data itself.
RDBMS
I didn't want to over-engineer here, so I settled on an RDBMS.
MS SQL Server
It is a way to go, but the big downside here is pricing: it is pretty expensive, and the Enterprise edition costs a lot of money given our hardware. Regarding pricing, you can read this policy document.
Another drawback was its support for CSV files, which will be the main data source for us here. MS SQL Server can neither import nor export CSV properly:
- MS SQL Server silently truncates text fields.
- MS SQL Server's text encoding handling goes wrong.
- MS SQL Server throws an error message because it doesn't understand quoting or escaping.
More on that comparison can be found in the article PostgreSQL vs. MS SQL Server.
PostgreSQL
This database is a mature product and well battle-tested. I have heard a lot of positive feedback about it from others (of course, there are some tradeoffs too). It has a more classic SQL syntax, good CSV support, and moreover it is open source.
It is worth mentioning that SSMS is way better than pgAdmin. SSMS has autocomplete and multiple result sets (when you run several queries, you get all the results at once, whereas in pgAdmin you get only the last one).
Anyway, right now I'm using DataGrip from JetBrains.
Processing Tool
I've looked through Spring Batch and Apache Spark. Spring Batch is a bit too low-level for this task, and Apache Spark offers easier scaling if it is needed in the future. That said, Spring Batch could also do the job.
Regarding an Apache Spark example, the code can be found in the learning-spark project.
My choice is Apache Spark for now.
How to handle processing large csv file or read large CSV file in chunks
The enhanced for loop (for (MyObject myObject : myObjects)) is implemented using an Iterator (it requires that the instance returned by csv.parse(strat, getReader("file.txt")) implements the Iterable interface, which contains an iterator() method that returns an Iterator), so there's no performance difference between the two code snippets.
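That equivalence can be checked with a small self-contained example (the class and method names here are mine, and a plain List stands in for the CSV parser's result):

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

public class ForEachEquivalence {

    // The enhanced for loop form.
    static String joinEnhanced(List<String> items) {
        StringBuilder sb = new StringBuilder();
        for (String s : items) {
            sb.append(s);
        }
        return sb.toString();
    }

    // What the compiler turns it into: an explicit Iterator loop.
    static String joinExplicit(List<String> items) {
        StringBuilder sb = new StringBuilder();
        for (Iterator<String> it = items.iterator(); it.hasNext(); ) {
            sb.append(it.next());
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<String> items = Arrays.asList("a", "b", "c");
        System.out.println(joinEnhanced(items).equals(joinExplicit(items))); // true
    }
}
```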
P.S. In the second snippet, don't use the raw Iterator type; use Iterator<MyObject>:
Iterator<MyObject> myObjects = csv.parse(strat, getReader("file.txt")).iterator();
while (myObjects.hasNext()) {
    MyObject myObject = myObjects.next();
    System.out.println(myObject);
}
Best/efficient way to search a large csv in Java
A file of about 1.5K entries and 1 MB should take tens of milliseconds to read. A 1 GB file could take tens of seconds, and it might be worth saving an index for such a file to avoid re-reading it each time.
You can load the data into a Map to have an index by name.
You can add a latitude and longitude index via a NavigableMap; this will speed up lookups by location.
Loading the file once takes a little time, but reading the file from disk each time is much slower.
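As a sketch of those two indexes (the city rows here are made-up sample data, and the class and method names are mine):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

public class CsvIndexDemo {

    // Index rows (name,latitude,longitude) by name for O(1) lookups.
    static Map<String, String[]> indexByName(String[] lines) {
        Map<String, String[]> byName = new HashMap<>();
        for (String line : lines) {
            String[] fields = line.split(",");
            byName.put(fields[0], fields);
        }
        return byName;
    }

    // Sorted latitude index; a NavigableMap supports nearest-key lookups.
    static NavigableMap<Double, String> indexByLatitude(String[] lines) {
        NavigableMap<Double, String> byLatitude = new TreeMap<>();
        for (String line : lines) {
            String[] fields = line.split(",");
            byLatitude.put(Double.parseDouble(fields[1]), fields[0]);
        }
        return byLatitude;
    }

    public static void main(String[] args) {
        // Made-up sample rows: name,latitude,longitude
        String[] lines = { "Berlin,52.52,13.40", "Paris,48.85,2.35", "Madrid,40.42,-3.70" };

        System.out.println(indexByName(lines).get("Paris")[2]);                 // longitude of Paris: 2.35
        System.out.println(indexByLatitude(lines).floorEntry(50.0).getValue()); // closest city at or below latitude 50.0: Paris
    }
}
```

Building both maps in a single pass over the file costs the one-time load; after that, every lookup avoids touching the disk.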
BTW, you can have hundreds of TB of data with trillions of rows; to use that much data in Java you have to get creative.
In short, if it's much smaller than the memory you have, it's a relatively small file.
Read large CSV in java
Read it line by line, something like this:
CSVReader reader = new CSVReader(new FileReader("yourfile.csv"));
String[] nextLine;
while ((nextLine = reader.readNext()) != null) {
    // nextLine[] is an array of values from the line
    System.out.println(nextLine[0] + nextLine[1] + "etc...");
}