Reading a huge .csv file
You are reading all rows into a list, then processing that list. Don't do that.
Process your rows as you produce them. If you need to filter the data first, use a generator function:
import csv

def getstuff(filename, criterion):
    # Python 3: open in text mode with newline=""; Python 2 needs "rb" instead
    with open(filename, newline="") as csvfile:
        datareader = csv.reader(csvfile)
        yield next(datareader)  # yield the header row
        count = 0
        for row in datareader:
            if row[3] == criterion:
                yield row
                count += 1
            elif count:
                # done when having read a consecutive series of rows
                return
I also simplified your filter test; the logic is the same but more concise.
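To check how the consecutive-run logic behaves, here is a minimal sketch that writes a small throwaway CSV and runs the generator over it. The sample rows, the column names, and the temporary file are invented for illustration; `getstuff` is repeated so the block runs on its own, and it assumes Python 3, where `csv.reader` expects a text-mode file.

```python
import csv
import os
import tempfile


def getstuff(filename, criterion):
    # Python 3: csv.reader expects a text-mode file opened with newline="".
    with open(filename, newline="") as csvfile:
        datareader = csv.reader(csvfile)
        yield next(datareader)  # header row
        count = 0
        for row in datareader:
            if row[3] == criterion:
                yield row
                count += 1
            elif count:
                return  # the consecutive run of matches has ended


# Hypothetical sample data: the fourth column holds the value we filter on.
sample = [
    ["id", "name", "qty", "colour"],
    ["1", "a", "10", "red"],
    ["2", "b", "20", "blue"],
    ["3", "c", "30", "blue"],
    ["4", "d", "40", "green"],
    ["5", "e", "50", "blue"],  # a later match that is never reached
]

with tempfile.NamedTemporaryFile("w", newline="", suffix=".csv", delete=False) as tmp:
    csv.writer(tmp).writerows(sample)
    path = tmp.name

try:
    result = list(getstuff(path, "blue"))
finally:
    os.remove(path)

print(result)
```

The result holds the header plus rows 2 and 3 only. Row 5 also matches, but the generator has already returned by then, which is the point: reading stops as soon as the first run of matches ends.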
Because you only want a single consecutive run of rows that match the criterion, you could also use:
import csv
from itertools import dropwhile, takewhile

def getstuff(filename, criterion):
    # Python 3: open in text mode with newline=""; Python 2 needs "rb" instead
    with open(filename, newline="") as csvfile:
        datareader = csv.reader(csvfile)
        yield next(datareader)  # yield the header row
        # first matching row, plus any subsequent rows that match, then stop
        # reading altogether
        # Python 2: use `for row in takewhile(...): yield row`
        # instead of `yield from takewhile(...)`.
        yield from takewhile(
            lambda r: r[3] == criterion,
            dropwhile(lambda r: r[3] != criterion, datareader))
        return
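To see what the `dropwhile`/`takewhile` pair selects without touching a file, here is a small sketch on in-memory rows. The helper name `first_run`, its `column` parameter, and the sample rows are made up for illustration; only the two `itertools` functions come from the answer above.

```python
from itertools import dropwhile, takewhile


def first_run(rows, criterion, column=3):
    """Yield the first consecutive run of rows whose `column` equals `criterion`."""
    def matches(row):
        return row[column] == criterion

    # dropwhile skips everything before the first match;
    # takewhile then keeps rows only for as long as they still match.
    yield from takewhile(matches, dropwhile(lambda r: not matches(r), rows))


# Hypothetical rows; the fourth field is the one being compared.
rows = [
    ["10", "north", "x", "pending"],
    ["11", "north", "x", "shipped"],
    ["12", "south", "x", "shipped"],
    ["13", "south", "x", "pending"],
    ["14", "east", "x", "shipped"],  # matches, but comes after the run ended
]

selected = list(first_run(rows, "shipped"))
print(selected)
```

Only rows 11 and 12 are selected. `takewhile` stops at the first non-matching row after the run, so the rest of the input is never examined.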
You can now loop over getstuff() directly. Do the same in getdata():
def getdata(filename, criteria):
    for criterion in criteria:
        for row in getstuff(filename, criterion):
            yield row
Now loop directly over getdata() in your code:
for row in getdata(somefilename, sequence_of_criteria):
    # process row
You now only hold one row in memory, instead of your thousands of lines per criterion.
yield makes a function a generator function, which means it won't do any work until you start looping over it.
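A short sketch of that laziness, using a toy generator rather than the CSV code. The names `numbers` and `log` are invented for the example; the list records when the function body actually runs.

```python
log = []  # records which parts of the generator body have executed


def numbers():
    log.append("started")
    yield 1
    log.append("resumed")
    yield 2


gen = numbers()              # creates the generator; the body has not run yet
log_after_call = list(log)   # snapshot: still empty

first = next(gen)            # runs the body up to the first yield
log_after_next = list(log)   # snapshot: only "started" so far

rest = list(gen)             # resumes after the first yield and runs to the end
print(log_after_call, first, log_after_next, rest, log)
```

Calling `numbers()` only creates the generator object. The body runs piece by piece as values are requested, which is why `getstuff()` does not open or read the file until something iterates over it.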
read.csv to import more than 2105 columns?
Try using data.table.
library(data.table)
data <- fread("data.csv")