Process Very Big CSV File Without Timeout and Memory Error

I've used fgetcsv to read a 120 MB CSV in a streaming manner. It reads the file line by line, and I then inserted each line into a database, so only one line is held in memory on each iteration. The script still needed about 20 minutes to run; maybe I'll try Python next time. Don't try to load a huge CSV file into an array, as that really would consume a lot of memory.

// WDI_GDF_Data.csv (120.4 MB) is the World Bank's collection of development indicators:
// http://data.worldbank.org/data-catalog/world-development-indicators
if (($handle = fopen('WDI_GDF_Data.csv', 'r')) !== false)
{
    // get the first row, which contains the column titles (if necessary)
    $header = fgetcsv($handle);

    // loop through the file line by line
    while (($data = fgetcsv($handle)) !== false)
    {
        // resort/rewrite data and insert into DB here
        // try to use conditions sparingly here, as those will cause slow performance

        // I don't know if this is really necessary, but it couldn't hurt;
        // see also: http://php.net/manual/en/features.gc.php
        unset($data);
    }
    fclose($handle);
}
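
The answer doesn't show the actual database insert, but a minimal sketch of that step with a PDO prepared statement might look like the following. The DSN, credentials, table name (indicators) and column mapping are placeholders, not part of the original answer; preparing the statement once and reusing it for every row keeps per-iteration overhead low.

// Sketch only: connection details, table name and column mapping are placeholders.
$pdo = new PDO('mysql:host=localhost;dbname=worldbank', 'user', 'pass');
$stmt = $pdo->prepare(
    'INSERT INTO indicators (country, indicator, year, value) VALUES (?, ?, ?, ?)'
);

if (($handle = fopen('WDI_GDF_Data.csv', 'r')) !== false)
{
    $header = fgetcsv($handle);
    while (($data = fgetcsv($handle)) !== false)
    {
        // the column indexes below are placeholders; map them to the real CSV layout
        $stmt->execute([$data[0], $data[1], $data[2], $data[3]]);
    }
    fclose($handle);
}

Wrapping batches of inserts in a transaction (beginTransaction()/commit() every few thousand rows) is a common way to cut a runtime like the 20 minutes mentioned above, though how much it helps depends on the database and its indexes.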

Read large data from a CSV file in PHP

An excellent method for dealing with large files is described at: https://stackoverflow.com/a/5249971/797620

This method is used at http://www.cuddlycactus.com/knownpasswords/ (page has been taken down) to search through 170+ million passwords in just a few milliseconds.

What does 'killed' mean when a Python script processing a huge CSV suddenly stops?

Exit code 137 (128 + 9) indicates that your program exited due to receiving signal 9, which is SIGKILL. This also explains the 'killed' message. The question is, why did you receive that signal?

The most likely reason is that your process crossed some limit on the amount of system resources you are allowed to use. Depending on your OS and configuration, this could mean you had too many open files, used too much filesystem space, or something else. Most likely, your program was using too much memory. Rather than risking things breaking when memory allocations started failing, the system sent a kill signal to the process that was using too much memory.

As I commented earlier, one reason you might hit a memory limit after printing "finished counting" is that your call to counter.items() in your final loop allocates a list containing all the keys and values from your dictionary. If your dictionary holds a lot of data, this can be a very big list. A possible solution is to use counter.iteritems(), which is a generator. Rather than returning all the items in a list, it lets you iterate over them with much less memory usage.

So, I'd suggest trying this, as your final loop:

for key, value in counter.iteritems():
    writer.writerow([key, value])

Note that in Python 3, items returns a "dictionary view" object, which does not have the same overhead as Python 2's version. It replaces iteritems, so if you later upgrade Python versions, you'll end up changing the loop back to calling items().

Java - OutOfMemoryError while reading a huge CSV file

You can read the lines one at a time and do your processing as you go, until you have read the entire file:

String encoding = "UTF-8";
BufferedReader br = new BufferedReader(new InputStreamReader(new FileInputStream("/bigfile.txt"), encoding));
String line;
while ((line = br.readLine()) != null) {
    // process the line
}
br.close();

This should not go fubar, as long as you process each line immediately and don't store it in variables outside your loop.

PHP/Symfony: Writing a large CSV file

Solution 2, without a database: read from CSV, process, and output to CSV. As someone mentioned in the comments, use fgetcsv() and fputcsv(). Going line by line should hardly consume any memory, and it removes the need for a database in between. Because CSV files are read sequentially as streams, the speed of the process will ultimately come down to the speed of whatever you do to the data between the read and write operations. Using a database in between would just slow the entire process down and be somewhat wasteful.
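
As a minimal sketch of that read-process-write loop (not from the original answer): the input/output file names and the transformRow() helper below are placeholders for whatever the real per-row processing is.

$in  = fopen('input.csv', 'r');   // placeholder source file
$out = fopen('output.csv', 'w');  // placeholder destination file

while (($row = fgetcsv($in)) !== false) {
    // transformRow() stands in for your own per-row processing
    $processed = transformRow($row);
    fputcsv($out, $processed);
}

fclose($in);
fclose($out);

Only one row is held in memory at a time, so the memory footprint stays flat regardless of the file size.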

PHP: Using fgetcsv on a huge CSV file

From your problem description it really sounds like you need to switch hosts. Processing a 2 GB file with a hard time limit is not a very constructive environment. Having said that, deleting read lines from the file is even less constructive, since you would have to rewrite the entire 2 GB to disk minus the part you have already read, which is incredibly expensive.

Assuming you save how many rows you have already processed, you can skip rows like this:

$alreadyProcessed = 42; // for example

$i = 0;
while ($row = fgetcsv($fileHandle)) {
    if ($i++ < $alreadyProcessed) {
        continue;
    }

    ...
}

However, this means you're reading the entire 2 GB file from the beginning each time you go through it, which in itself already takes a while and you'll be able to process fewer and fewer rows each time you start again.

The best solution here is to remember the current position of the file pointer, for which ftell is the function you're looking for:

$lastPosition = file_get_contents('last_position.txt');
$fh = fopen('my.csv', 'r');
fseek($fh, $lastPosition);

while ($row = fgetcsv($fh)) {
    ...

    file_put_contents('last_position.txt', ftell($fh));
}

This allows you to jump right back to the last position you were at and continue reading. You obviously want to add a lot of error handling here, so you're never in an inconsistent state no matter which point your script is interrupted at.
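
One possible way to make the checkpoint itself more robust (an assumption on my part, not part of the original answer) is to record a row's position only after it has been fully processed, and to write the position file via a temporary file plus rename(), which is an atomic replace on the same filesystem on POSIX systems. File names here are placeholders:

// Sketch only: saveCheckpoint() is a hypothetical helper, file names are placeholders.
function saveCheckpoint($positionFile, $position)
{
    $tmp = $positionFile . '.tmp';
    file_put_contents($tmp, (string) $position);
    rename($tmp, $positionFile); // atomic replace on the same filesystem
}

$positionFile = 'last_position.txt';
$lastPosition = is_file($positionFile) ? (int) file_get_contents($positionFile) : 0;

$fh = fopen('my.csv', 'r');
fseek($fh, $lastPosition);

while (($row = fgetcsv($fh)) !== false) {
    // ... process $row ...
    saveCheckpoint($positionFile, ftell($fh));
}
fclose($fh);

This way a crash in the middle of writing the checkpoint can never leave a half-written position file behind.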


