Reading a Huge .CSV File


You are reading all rows into a list, then processing that list. Don't do that.

Process your rows as you produce them. If you need to filter the data first, use a generator function:

import csv

def getstuff(filename, criterion):
    # newline="" is what Python 3's csv module expects; the original "rb"
    # mode only works with Python 2's csv module.
    with open(filename, newline="") as csvfile:
        datareader = csv.reader(csvfile)
        yield next(datareader)  # yield the header row
        count = 0
        for row in datareader:
            if row[3] == criterion:
                yield row
                count += 1
            elif count:
                # the matching rows form one consecutive block, so once we
                # have seen matches and hit a non-matching row, we are done
                return

I also simplified your filter test; the logic is the same but more concise.

Because you are only extracting a single consecutive run of rows that match the criterion, you could also use:

import csv
from itertools import dropwhile, takewhile

def getstuff(filename, criterion):
    with open(filename, newline="") as csvfile:
        datareader = csv.reader(csvfile)
        yield next(datareader)  # yield the header row
        # yield the first matching row, plus any subsequent rows that match,
        # then stop reading altogether.
        # On Python 2, use `for row in takewhile(...): yield row`
        # instead of `yield from takewhile(...)`.
        yield from takewhile(
            lambda r: r[3] == criterion,
            dropwhile(lambda r: r[3] != criterion, datareader))
        return

You can now loop over getstuff() directly. Do the same in getdata():

def getdata(filename, criteria):
    for criterion in criteria:
        for row in getstuff(filename, criterion):
            yield row

Now loop directly over getdata() in your code:

for row in getdata(somefilename, sequence_of_criteria):
    # process row

You now hold only one row in memory at a time, instead of thousands of rows per criterion.

yield makes a function a generator function, which means it won't do any work until you start looping over it.
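
For instance, a quick sketch of that laziness (the file name and criterion here are placeholders, not from the original question):

rows = getstuff("huge.csv", "some-value")  # returns a generator instantly; no file I/O yet
header = next(rows)                        # the file is opened and the header row read here
for row in rows:                           # remaining matching rows are read one at a time
    print(row)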

In Python, I would like to read a huge CSV file stored in a zip file, but only a fixed-size chunk at a time

You just need to do something like:

import zipfile
import io
import csv

with zipfile.ZipFile("test.zip") as zipf:
    with zipf.open("test.csv", "r") as f:
        reader = csv.reader(io.TextIOWrapper(f, newline=""))
        for row in reader:
            do_something(row)

Assuming you have a zip archive like:

jarrivillaga$ unzip -l test.zip
Archive:  test.zip
    Length      Date    Time    Name
----------  ---------- -----   ----
1308888890  04-01-2022 16:23   test.csv
----------                     -------
1308888890                     1 file

Note that zipf.open returns a binary stream, so you can wrap it in an io.TextIOWrapper to get a text stream, which works with both csv.reader and csv.DictReader.

This reads the file in reasonably sized chunks by default (probably io.DEFAULT_BUFFER_SIZE), because zipfile.ZipExtFile inherits from io.BufferedIOBase.
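
If you want explicit control over the chunk size rather than relying on that buffering, here is a minimal sketch that pulls rows in fixed-size batches with itertools.islice (the file names, the 10,000-row batch size, and process_batch are assumptions for illustration, not part of the original answer):

import csv
import io
import itertools
import zipfile

BATCH_SIZE = 10_000  # assumed batch size; tune to your memory budget

def process_batch(rows):
    # hypothetical stand-in for whatever work you do per batch
    print("processing", len(rows), "rows")

with zipfile.ZipFile("test.zip") as zipf:
    with zipf.open("test.csv", "r") as f:
        reader = csv.reader(io.TextIOWrapper(f, newline=""))
        next(reader)  # skip the header row
        while True:
            batch = list(itertools.islice(reader, BATCH_SIZE))
            if not batch:
                break
            process_batch(batch)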

Best way to transform a huge CSV file content into quickly queryable data store?

You can use Amazon Athena or Amazon S3 Select.

Amazon Athena is a query engine that can read data directly from (multiple) files stored in Amazon S3. It works best when the files are in a columnar format (e.g., Parquet or ORC) and compressed, but it can work on plain CSV files too. It is highly scalable, especially where multiple files are being queried. However, it treats data as being stored in a 'table' based on its location in S3, so it isn't ideal for querying arbitrary, one-off files.

Amazon S3 Select only works on a single file at a time, but it can directly query a CSV file (and a few other formats). It has an SQL-like query capability.

If your need is to query a different file each time, I would recommend S3 Select.
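
As a rough sketch of what an S3 Select call looks like with boto3 (the bucket, key, and column name are placeholders; check the select_object_content documentation for the exact serialization options you need):

import boto3

s3 = boto3.client("s3")

# Run an SQL-like query directly against a CSV object in S3.
response = s3.select_object_content(
    Bucket="my-bucket",    # placeholder bucket name
    Key="data/huge.csv",   # placeholder object key
    ExpressionType="SQL",
    Expression="SELECT s.* FROM s3object s WHERE s.\"status\" = 'active'",  # placeholder query
    InputSerialization={"CSV": {"FileHeaderInfo": "USE"}},
    OutputSerialization={"CSV": {}},
)

# The result arrives as an event stream; 'Records' events carry the matching rows.
for event in response["Payload"]:
    if "Records" in event:
        print(event["Records"]["Payload"].decode("utf-8"), end="")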

The benefit of both of these options is that you do not need to 'load' the data into a database. However, that is certainly an option you might consider if you need very fast access to the data. (Amazon Redshift can handle billions of rows in a table quite quickly, but it is more expensive.)


