Best Way to Work with Large Amounts of CSV Data Quickly

How about using a database?

Jam the records into tables, then query them out using joins.

The import might take a while, but the DB engine will be optimized for the join and retrieval part...
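
For illustration, here is a minimal sketch of that approach in Python with the built-in sqlite3 module (the file names, table names, and columns are made-up placeholders, not a prescribed schema):

import csv
import sqlite3

conn = sqlite3.connect("data.db")

# one-time import: load each CSV into its own table
conn.execute("CREATE TABLE IF NOT EXISTS orders (order_id TEXT, customer_id TEXT, total REAL)")
conn.execute("CREATE TABLE IF NOT EXISTS customers (customer_id TEXT, name TEXT)")

for table, path, width in [("orders", "orders.csv", 3), ("customers", "customers.csv", 2)]:
    with open(path, newline="") as f:
        rows = csv.reader(f)
        next(rows)  # skip the header line
        conn.executemany(f"INSERT INTO {table} VALUES ({','.join('?' * width)})", rows)
conn.commit()

# the join and retrieval work is then done by the engine
for name, total in conn.execute(
    "SELECT c.name, SUM(o.total) FROM orders o "
    "JOIN customers c ON c.customer_id = o.customer_id GROUP BY c.name"
):
    print(name, total)

The import is the slow part, as noted above; adding an index on the join column (CREATE INDEX) before querying helps the retrieval side.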

How to deal with large csv file quickly?

The simplest fix is to change:

test_callsites = callsites[0:n_test]

to

test_callsites = frozenset(callsites[:n_test])  # set also works; frozenset just reduces chance of mistakenly modifying it

This reduces the work for each if row[0] in test_callsites: test from O(n_test) to O(1), which is likely to be a huge improvement if n_test is on the order of four digits or more (likely, when we're talking about millions of rows).

You can also slightly reduce the work of creating it in the first place (mostly by improving memory locality, since you select from a smaller pool) by changing:

random.shuffle(callsites)
test_callsites = callsites[0:n_test]

to:

test_callsites = frozenset(random.sample(callsites, n_test))

which avoids reshuffling the whole of callsites in favor of selecting n_test values from it (which you then convert to a frozenset, or just set, for cheap lookup). Bonus, it's a one-liner. :-)


Side-note: Your code is potentially wrong as written. You must pass newline='' to your various calls to open to ensure that the chosen CSV dialect's newline preferences are honored.
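
Putting those pieces together, a rough sketch of what the whole split might look like (the file names, n_test value, and column layout here are assumptions for illustration, not the original code):

import csv
import random

n_test = 10_000  # assumed size of the held-out set

with open("callsites.txt", newline="") as f:
    callsites = [line.strip() for line in f]

# frozenset gives O(1) membership tests instead of scanning a list slice
test_callsites = frozenset(random.sample(callsites, n_test))

# newline='' so the csv module is in charge of newline handling
with open("data.csv", newline="") as src, \
     open("test.csv", "w", newline="") as test_out, \
     open("train.csv", "w", newline="") as train_out:
    test_writer = csv.writer(test_out)
    train_writer = csv.writer(train_out)
    for row in csv.reader(src):
        (test_writer if row[0] in test_callsites else train_writer).writerow(row)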

What is the best way to process large CSV files?

Okay. After spending some time with this problem (reading up, consulting people, experimenting, and doing several PoCs), I came up with the following solution.

Tl;dr

Database: PostgreSQL, as it has good CSV support and is free and open source.

Tool: Apache Spark is a good fit for this type of task, with good performance.

DB

The database is an important thing to decide on: what to pick, and how it will cope with this amount of data in the future. It should definitely be a separate server instance, so that it does not generate additional load on the main database instance and does not block other applications.

NoSQL

I thought about using Cassandra here, but that solution would be too complex right now. Cassandra does not support ad-hoc queries; its storage layer is basically a key-value system, which means you must "model" your data around the queries you need rather than around the structure of the data itself.

RDBMS

I didn't want to overengineer here, so I settled on an RDBMS.

MS SQL Server

It would be a way to go, but the big downside is pricing: the Enterprise edition costs a lot of money given our hardware. For details, you can read the pricing policy document.

Another drawback was the support for CSV files, which will be the main data source for us here. MS SQL Server can neither import nor export CSV:

  • MS SQL Server silently truncating a text field.

  • MS SQL Server's text encoding handling going wrong.

  • MS SQL Server throwing an error message because it doesn't understand quoting or escaping.

More on that comparison can be found in the article PostgreSQL vs. MS SQL Server.

PostgreSQL

This database is a mature, well battle-tested product. I have heard a lot of positive feedback about it from others (of course, there are some tradeoffs too). It has a more classic SQL syntax and good CSV support, and moreover it is open source.
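
For the CSV import itself, PostgreSQL's COPY does the heavy lifting on the server side. A minimal sketch from Python via psycopg2 (the connection string, table name, and file name are placeholders, and the staging table is assumed to already exist with matching columns):

import psycopg2

conn = psycopg2.connect("dbname=imports user=loader")  # placeholder DSN

with conn, conn.cursor() as cur:
    with open("huge_file.csv") as f:
        # stream the file into the existing table through the server-side CSV parser
        cur.copy_expert("COPY staging_table FROM STDIN WITH (FORMAT csv, HEADER true)", f)

The same statement can be run from psql with \copy if you don't want to go through Python at all.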

It is worth mentioning that SSMS is way better than pgAdmin: SSMS has autocomplete and shows multiple result sets (when you run several queries you get all the results at once, whereas pgAdmin shows only the last one).

Anyway, right now I'm using DataGrip from JetBrains.

Processing Tool

I've looked at Spring Batch and Apache Spark. Spring Batch is a bit too low-level for this task, and Apache Spark scales more easily if that is needed in the future. That said, Spring Batch could also do the job.

For an Apache Spark example, the code can be found in the learning-spark project.
My choice is Apache Spark for now.
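
To give an idea of what this looks like, here is a minimal PySpark sketch (the path, column name, and options are placeholders; this is not the learning-spark code):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("csv-import").getOrCreate()

# read the CSV into a DataFrame; inferSchema makes an extra pass over the data to guess column types
df = (spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("/data/huge_file.csv"))

# register it as a temporary view and query it with plain SQL
df.createOrReplaceTempView("records")
spark.sql("SELECT some_column, COUNT(*) AS cnt FROM records GROUP BY some_column").show()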

Best way to transform a huge CSV file content into quickly queryable data store?

You can use Amazon Athena or Amazon S3 Select.

Amazon Athena is a query engine that can read data directly from (multiple) files stored in Amazon S3. It works best when the files are in a columnar format (eg Parquet or ORC) and compressed, but it can work on normal CSV files too. It is highly scalable, especially where multiple files are being queried. However, it treats data as being stored in a 'table' based on its location in S3, so it isn't ideal for querying random files.

Amazon S3 Select only works on a single file at a time, but it can directly query a CSV file (and a few other formats). It has an SQL-like query capability.

If your need is to query a different file each time, I would recommend S3 Select.

The benefit of both of these options is that you do not need to 'load' the data into a database. However, that is certainly an option you might consider if you need very fast access to the data. (Amazon Redshift can handle billions of rows in a table quite quickly, but it is more expensive.)
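
For the S3 Select route, a rough sketch of the call from Python with boto3 (the bucket, key, and column names are placeholders):

import boto3

s3 = boto3.client("s3")

resp = s3.select_object_content(
    Bucket="my-bucket",
    Key="data/huge_file.csv",
    ExpressionType="SQL",
    Expression="SELECT s.id, s.amount FROM s3object s WHERE CAST(s.amount AS FLOAT) > 100",
    InputSerialization={"CSV": {"FileHeaderInfo": "USE"}},  # USE lets the query reference header names
    OutputSerialization={"CSV": {}},
)

# the result arrives as an event stream; Records events carry the matching CSV rows
for event in resp["Payload"]:
    if "Records" in event:
        print(event["Records"]["Payload"].decode(), end="")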

What is the fastest way to upload a big csv file in notebook to work with python pandas?

Here are the results of my read and write comparison for the DF (shape: 4000000 x 6, size in memory: 183.1 MB, size of the uncompressed CSV: 492 MB).

Comparison of the following storage formats: CSV, CSV.gzip, Pickle, and HDF5 (with various compression settings):

                  read_s  write_s  size_ratio_to_CSV
storage
CSV               17.900    69.00              1.000
CSV.gzip          18.900   186.00              0.047
Pickle             0.173     1.77              0.374
HDF_fixed          0.196     2.03              0.435
HDF_tab            0.230     2.60              0.437
HDF_tab_zlib_c5    0.845     5.44              0.035
HDF_tab_zlib_c9    0.860     5.95              0.035
HDF_tab_bzip2_c5   2.500    36.50              0.011
HDF_tab_bzip2_c9   2.500    36.50              0.011

(Charts were included here for reading times, writing/saving times, and the file size ratio relative to the uncompressed CSV file; the numbers are in the table above and the raw data below.)

RAW DATA:

CSV:

In [68]: %timeit df.to_csv(fcsv)
1 loop, best of 3: 1min 9s per loop

In [74]: %timeit pd.read_csv(fcsv)
1 loop, best of 3: 17.9 s per loop

CSV.gzip:

In [70]: %timeit df.to_csv(fcsv_gz, compression='gzip')
1 loop, best of 3: 3min 6s per loop

In [75]: %timeit pd.read_csv(fcsv_gz)
1 loop, best of 3: 18.9 s per loop

Pickle:

In [66]: %timeit df.to_pickle(fpckl)
1 loop, best of 3: 1.77 s per loop

In [72]: %timeit pd.read_pickle(fpckl)
10 loops, best of 3: 173 ms per loop

HDF (format='fixed') [Default]:

In [67]: %timeit df.to_hdf(fh5, 'df')
1 loop, best of 3: 2.03 s per loop

In [73]: %timeit pd.read_hdf(fh5, 'df')
10 loops, best of 3: 196 ms per loop

HDF (format='table'):

In [37]: %timeit df.to_hdf('D:\\temp\\.data\\37010212_tab.h5', 'df', format='t')
1 loop, best of 3: 2.6 s per loop

In [38]: %timeit pd.read_hdf('D:\\temp\\.data\\37010212_tab.h5', 'df')
1 loop, best of 3: 230 ms per loop

HDF (format='table', complib='zlib', complevel=5):

In [40]: %timeit df.to_hdf('D:\\temp\\.data\\37010212_tab_compress_zlib5.h5', 'df', format='t', complevel=5, complib='zlib')
1 loop, best of 3: 5.44 s per loop

In [41]: %timeit pd.read_hdf('D:\\temp\\.data\\37010212_tab_compress_zlib5.h5', 'df')
1 loop, best of 3: 854 ms per loop

HDF (format='table', complib='zlib', complevel=9):

In [36]: %timeit df.to_hdf('D:\\temp\\.data\\37010212_tab_compress_zlib9.h5', 'df', format='t', complevel=9, complib='zlib')
1 loop, best of 3: 5.95 s per loop

In [39]: %timeit pd.read_hdf('D:\\temp\\.data\\37010212_tab_compress_zlib9.h5', 'df')
1 loop, best of 3: 860 ms per loop

HDF (format='table', complib='bzip2', complevel=5):

In [42]: %timeit df.to_hdf('D:\\temp\\.data\\37010212_tab_compress_bzip2_l5.h5', 'df', format='t', complevel=5, complib='bzip2')
1 loop, best of 3: 36.5 s per loop

In [43]: %timeit pd.read_hdf('D:\\temp\\.data\\37010212_tab_compress_bzip2_l5.h5', 'df')
1 loop, best of 3: 2.5 s per loop

HDF (format='table', complib='bzip2', complevel=9):

In [42]: %timeit df.to_hdf('D:\\temp\\.data\\37010212_tab_compress_bzip2_l9.h5', 'df', format='t', complevel=9, complib='bzip2')
1 loop, best of 3: 36.5 s per loop

In [43]: %timeit pd.read_hdf('D:\\temp\\.data\\37010212_tab_compress_bzip2_l9.h5', 'df')
1 loop, best of 3: 2.5 s per loop

PS: I can't test Feather on my Windows notebook.

DF info:

In [49]: df.shape
Out[49]: (4000000, 6)

In [50]: df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4000000 entries, 0 to 3999999
Data columns (total 6 columns):
a datetime64[ns]
b datetime64[ns]
c datetime64[ns]
d datetime64[ns]
e datetime64[ns]
f datetime64[ns]
dtypes: datetime64[ns](6)
memory usage: 183.1 MB

In [41]: df.head()
Out[41]:
                     a                   b                   c  \
0  1970-01-01 00:00:00 1970-01-01 00:01:00 1970-01-01 00:02:00
1  1970-01-01 00:01:00 1970-01-01 00:02:00 1970-01-01 00:03:00
2  1970-01-01 00:02:00 1970-01-01 00:03:00 1970-01-01 00:04:00
3  1970-01-01 00:03:00 1970-01-01 00:04:00 1970-01-01 00:05:00
4  1970-01-01 00:04:00 1970-01-01 00:05:00 1970-01-01 00:06:00

                     d                   e                   f
0  1970-01-01 00:03:00 1970-01-01 00:04:00 1970-01-01 00:05:00
1  1970-01-01 00:04:00 1970-01-01 00:05:00 1970-01-01 00:06:00
2  1970-01-01 00:05:00 1970-01-01 00:06:00 1970-01-01 00:07:00
3  1970-01-01 00:06:00 1970-01-01 00:07:00 1970-01-01 00:08:00
4  1970-01-01 00:07:00 1970-01-01 00:08:00 1970-01-01 00:09:00

File sizes:

{ .data }  » ls -lh 37010212.*                                                                          /d/temp/.data
-rw-r--r-- 1 Max None 492M May 3 22:21 37010212.csv
-rw-r--r-- 1 Max None 23M May 3 22:19 37010212.csv.gz
-rw-r--r-- 1 Max None 214M May 3 22:02 37010212.h5
-rw-r--r-- 1 Max None 184M May 3 22:02 37010212.pickle
-rw-r--r-- 1 Max None 215M May 4 10:39 37010212_tab.h5
-rw-r--r-- 1 Max None 5.4M May 4 10:46 37010212_tab_compress_bzip2_l5.h5
-rw-r--r-- 1 Max None 5.4M May 4 10:51 37010212_tab_compress_bzip2_l9.h5
-rw-r--r-- 1 Max None 17M May 4 10:42 37010212_tab_compress_zlib5.h5
-rw-r--r-- 1 Max None 17M May 4 10:36 37010212_tab_compress_zlib9.h5

Conclusion:

Pickle and HDF5 are much faster, but HDF5 is more convenient: you can store multiple tables/frames inside, read your data conditionally (look at the where parameter in read_hdf()), and store your data compressed (zlib is faster; bzip2 gives a better compression ratio), etc.
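
For example, the conditional read is only a small change. A sketch (the column name and cutoff are made up, and the store has to be written with format='table' plus data_columns for the column you want to filter on):

import pandas as pd

# write once in the queryable 'table' format, with column 'a' indexed for queries
df.to_hdf("data.h5", "df", format="table", data_columns=["a"],
          complib="zlib", complevel=5)

# read back only the matching rows instead of loading the whole frame
subset = pd.read_hdf("data.h5", "df", where="a >= '1970-01-02'")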

PS: if you can build/use the Feather format, it should be even faster compared to HDF5 and Pickle.

PPS: don't use Pickle for big data frames, as you may end up with a "SystemError: error return without exception set" error message. It's also described here and here.

Faster method to read large CSV file

The basic mistake you are making is processing only a single line per frame, so at roughly 60 fps you can calculate how long it will take:

120,000 rows / 60fps = 2000 seconds = 33.3333 minutes

This is due to the yield return null;, which basically says "pause the routine, render this frame and continue in the next frame".


Of course, in absolute time it would be way faster not to use yield return null or a coroutine at all and let the entire thing be parsed in one single go, but then it freezes the UI/main thread for a moment.

To avoid that, the best way in my opinion would actually be to move the entire thing into a Thread/Task and only return the result.

File IO and string parsing are always quite slow.


However, I think you could already speed it up a lot by simply using a Stopwatch, like

...

var stopWatch = new Stopwatch();
stopWatch.Start();

// Use the last frame duration as a guide for how long one frame should take
var targetMilliseconds = Time.deltaTime * 1000f;

while ((line = sr.ReadLine()) != null)
{
    ....

    // If you are too long in this frame, render one and continue in the next frame;
    // otherwise keep going with the next line
    if (stopWatch.ElapsedMilliseconds > targetMilliseconds)
    {
        yield return null;
        stopWatch.Restart();
    }
}

This allows working through multiple lines within one frame while trying to keep a 60 fps frame rate. You might want to experiment a bit to find a good trade-off between frame rate and import duration. E.g. maybe you can allow it to run at only 30 fps but import faster, since it can then handle more rows in one frame.


In general, I wouldn't read "manually" through each byte/char. Rather, use the built-in methods for that, e.g. String.Split.

I am actually using a slightly more advanced Regex.Matches, since a CSV exported from Excel allows special cases like a cell itself containing a , or other special characters such as line breaks(!).

Excel handles this by wrapping the cell in " characters, which adds a second special case, namely a cell itself containing a ".

The Regex.Matches is of course quite complex and slow itself, but it covers these special cases. (See also Basic CSV rules for a more detailed explanation of the special cases.)

If you know the format of your CSV well and don't need those special cases, you could/should probably just stick to

var columns = row.Split(new []{ ','});

to always split just on the , character, which runs faster.

private const char Quote = '\"';
private const string LineBreak = "\r\n";
private const string DoubleQuote = "\"\"";

private IEnumerator readDataset(string path)
{
    starsRead = 0;

    // Use the last frame duration as a guide for how long one frame should take.
    // You can also experiment with hardcoded target framerates like e.g. "1000f / 30" for 30fps
    var targetMilliseconds = Time.deltaTime * 1000f;
    var stopWatch = new Stopwatch();

    // NOTE: YOU ARE ALREADY READING THE ENTIRE FILE HERE ONCE!
    // => Instead of reading it again line by line later, re-use this file content
    var lines = File.ReadLines(path).ToArray();
    var totalLines = lines.Length;
    totalStars = totalLines - 1;

    // HERE YOU READ THE FILE A SECOND TIME JUST TO GET THE FIRST LINE ;)
    string firstLine = lines[0];

    var firstLineColumns = GetColumns(firstLine);

    columnCount = firstLineColumns.Length;

    var datasetTable = new string[totalStars, columnCount];

    stopWatch.Start();
    for (var i = 0; i < totalStars; i++)
    {
        string row = lines[i + 1];

        string[] columns = GetColumns(row);

        var colIndex = 0;
        foreach (var column in columns)
        {
            // guard against rows that have more columns than the header row
            if (colIndex >= columnCount) break;
            datasetTable[i, colIndex] = column;
            colIndex++;
        }

        starsRead = i + 1;

        // If you are too long in this frame, render one and continue in the next frame;
        // otherwise keep going with the next line
        if (stopWatch.ElapsedMilliseconds > targetMilliseconds)
        {
            yield return null;
            stopWatch.Restart();
        }
    }
}

private string[] GetColumns(string row)
{
    var columns = new List<string>();

    // Look for the following alternatives (the "" in the verbatim string is an escaped double quote):
    // (?<x>(?=[,\r\n]+))      --> a match group (?<x>...) for an empty cell, i.e. a position right before a , a \r or a \n
    // OR |
    // ""(?<x>([^""]|"""")+)"" --> a cell wrapped in double quotes; its content is anything that is not a quote or is an escaped "" pair
    // OR |
    // (?<x>[^,\r\n]+)         --> a plain cell: a match group that contains no , \r or \n
    var matches = Regex.Matches(row, @"(((?<x>(?=[,\r\n]+))|""(?<x>([^""]|"""")+)""|(?<x>[^,\r\n]+)),?)", RegexOptions.ExplicitCapture);

    foreach (Match match in matches)
    {
        // a capture of exactly "" is treated as an empty cell; otherwise unescape "" back to a single quote
        var cleanedMatch = match.Groups[1].Value == "\"\"" ? "" : match.Groups[1].Value.Replace("\"\"", Quote.ToString());
        columns.Add(cleanedMatch);
    }

    // If the last character is a `,` then there is an empty item missing at the end
    if (row.Length > 0 && row[row.Length - 1].Equals(','))
    {
        columns.Add("");
    }

    return columns.ToArray();
}

