Reading a huge .csv file
You are reading all rows into a list, then processing that list. Don't do that.
Process your rows as you produce them. If you need to filter the data first, use a generator function:
import csv

def getstuff(filename, criterion):
    # Python 3: open in text mode with newline="" as the csv module expects
    with open(filename, "r", newline="") as csvfile:
        datareader = csv.reader(csvfile)
        yield next(datareader)  # yield the header row
        count = 0
        for row in datareader:
            if row[3] == criterion:
                yield row
                count += 1
            elif count:
                # we have passed the consecutive block of matching rows; stop reading
                return
I also simplified your filter test; the logic is the same, just more concise.
Because you are only extracting a single consecutive run of rows matching the criterion, you could also use:
import csv
from itertools import dropwhile, takewhile

def getstuff(filename, criterion):
    with open(filename, "r", newline="") as csvfile:
        datareader = csv.reader(csvfile)
        yield next(datareader)  # yield the header row
        # skip rows until the first match, then yield rows while they
        # still match, then stop reading altogether.
        # Python 2: use `for row in takewhile(...): yield row`
        # instead of `yield from takewhile(...)`.
        yield from takewhile(
            lambda r: r[3] == criterion,
            dropwhile(lambda r: r[3] != criterion, datareader))
You can now loop over getstuff() directly. Do the same in getdata():
def getdata(filename, criteria):
    for criterion in criteria:
        for row in getstuff(filename, criterion):
            yield row
Now loop directly over getdata() in your code:
for row in getdata(somefilename, sequence_of_criteria):
    # process row
You now only hold one row in memory, instead of thousands of rows per criterion.
yield makes a function a generator function, which means it won't do any work until you start looping over it.
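To see the pattern in action, here is a small self-contained sketch: this variant of getstuff takes an already-open file object instead of a filename, and the column layout and data are made up for illustration.

```python
import csv
import io

def getstuff(csvfile, criterion):
    # Same pattern as above, but operating on any open text-file object.
    datareader = csv.reader(csvfile)
    yield next(datareader)          # header row
    count = 0
    for row in datareader:
        if row[3] == criterion:
            yield row
            count += 1
        elif count:
            return                  # past the matching block: stop reading

data = "a,b,c,d\n1,2,3,x\n4,5,6,y\n7,8,9,y\n0,0,0,z\n"
rows = list(getstuff(io.StringIO(data), "y"))
# rows holds only the header plus the consecutive "y" rows;
# the trailing "z" row is never yielded.
```

Note that no rows are read at all until list() starts iterating; calling getstuff() by itself only creates the generator.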
In Python, I would like to read a huge CSV file stored in a zip file, but only a fixed sized chunk at a time
You just need to do something like:
import zipfile
import io
import csv

with zipfile.ZipFile("test.zip") as zipf:
    with zipf.open("test.csv", "r") as f:
        reader = csv.reader(io.TextIOWrapper(f, newline=""))
        for row in reader:
            do_something(row)
Assuming you have a zip archive like:
jarrivillaga$ unzip -l test.zip
Archive: test.zip
Length Date Time Name
--------- ---------- ----- ----
1308888890 04-01-2022 16:23 test.csv
--------- -------
1308888890 1 file
Note, zipf.open returns a binary stream, so you can just wrap it in an io.TextIOWrapper to make it a text stream, which works with either csv.reader or csv.DictReader objects.
This should read it in reasonably sized chunks by default, probably whatever io.DEFAULT_BUFFER_SIZE is, because looking at the zipfile.ZipExtFile source code it inherits from io.BufferedIOBase.
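If you want explicit fixed-size chunks of rows rather than relying on the buffered reads, you could batch the reader with itertools.islice. A sketch, with hypothetical archive and member names:

```python
import csv
import io
import itertools
import zipfile

def csv_chunks(zip_path, member, chunk_size=1000):
    """Yield lists of up to chunk_size rows from a CSV inside a zip archive."""
    with zipfile.ZipFile(zip_path) as zipf:
        with zipf.open(member) as f:
            reader = csv.reader(io.TextIOWrapper(f, newline=""))
            while True:
                # islice pulls at most chunk_size rows without reading ahead
                chunk = list(itertools.islice(reader, chunk_size))
                if not chunk:
                    break
                yield chunk

# for chunk in csv_chunks("test.zip", "test.csv", chunk_size=5000):
#     process(chunk)
```

Only one chunk of rows is held in memory at a time, and the underlying ZipExtFile still decompresses lazily as the reader advances.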
Best way to transform a huge CSV file content into quickly queryable data store?
You can use Amazon Athena or Amazon S3 Select.
Amazon Athena is a query engine that can read data directly from (multiple) files stored in Amazon S3. It works best when the files are in a columnar format (e.g. Parquet or ORC) and compressed, but it can work on normal CSV files too. It is highly scalable, especially where multiple files are being queried. However, it treats data as being stored in a 'table' based on its location in S3, so it isn't ideal for querying random files.
Amazon S3 Select only works on a single file at a time, but it can directly query a CSV file (and a few other formats). It has an SQL-like query capability.
If your need is to query a different file each time, I would recommend S3 Select.
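A minimal S3 Select call might look like the following sketch. The bucket, key, and query are placeholders, and running it requires boto3 with configured AWS credentials; select_object_content returns an event stream of result chunks.

```python
def build_select_params(bucket, key, sql):
    """Build the request for S3 Select over a CSV that has a header row."""
    return dict(
        Bucket=bucket,
        Key=key,
        Expression=sql,
        ExpressionType="SQL",
        # FileHeaderInfo="USE" lets the query refer to columns by header name
        InputSerialization={"CSV": {"FileHeaderInfo": "USE"}},
        OutputSerialization={"CSV": {}},
    )

def query_csv(bucket, key, sql):
    import boto3  # imported here so the helper above is usable without it
    s3 = boto3.client("s3")
    resp = s3.select_object_content(**build_select_params(bucket, key, sql))
    for event in resp["Payload"]:          # event stream of result chunks
        if "Records" in event:
            yield event["Records"]["Payload"].decode("utf-8")

# for chunk in query_csv("my-bucket", "data.csv",
#                        "SELECT s.name FROM s3object s WHERE s.age > '30'"):
#     print(chunk, end="")
```

You pay per byte scanned, so a selective WHERE clause over a large CSV avoids transferring the whole file.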
The benefit of both of these options is that you do not need to 'load' the data into a database. However, that is certainly an option you might consider if you need very fast access to the data. (Amazon Redshift can handle billions of rows in a table quite quickly, but it is more expensive.)