read.csv Is Extremely Slow When Reading CSV Files with Large Numbers of Columns

Reading a huge .csv file

You are reading all rows into a list, then processing that list. Don't do that.

Process your rows as you produce them. If you need to filter the data first, use a generator function:

import csv

def getstuff(filename, criterion):
    # text mode with newline="" is what csv.reader expects on Python 3
    with open(filename, newline="") as csvfile:
        datareader = csv.reader(csvfile)
        yield next(datareader)  # yield the header row
        count = 0
        for row in datareader:
            if row[3] == criterion:
                yield row
                count += 1
            elif count:
                # the matching rows are consecutive, so once we have seen
                # some matches and hit a non-match, we are done
                return

I also simplified your filter test; the logic is the same, just expressed more concisely.

Because you only need a single consecutive run of rows matching the criterion, you could also use:

import csv
from itertools import dropwhile, takewhile

def getstuff(filename, criterion):
    with open(filename, newline="") as csvfile:
        datareader = csv.reader(csvfile)
        yield next(datareader)  # yield the header row
        # skip rows until the first match, yield the consecutive
        # matching rows, then stop reading altogether.
        # Python 2: use `for row in takewhile(...): yield row`
        # instead of `yield from takewhile(...)`.
        yield from takewhile(
            lambda r: r[3] == criterion,
            dropwhile(lambda r: r[3] != criterion, datareader))
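
If the dropwhile/takewhile combination is unfamiliar: dropwhile() discards rows while its predicate holds (everything before the first match), and takewhile() then yields rows while its predicate holds (the consecutive matches), stopping at the first non-match. A minimal sketch with plain integers:

from itertools import dropwhile, takewhile

data = [1, 2, 5, 5, 5, 3, 5]
# keep only the first consecutive run of 5s; the trailing 5 is never reached
run = takewhile(lambda x: x == 5, dropwhile(lambda x: x != 5, data))
print(list(run))  # [5, 5, 5]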

You can now loop over getstuff() directly. Do the same in getdata():

def getdata(filename, criteria):
    for criterion in criteria:
        for row in getstuff(filename, criterion):
            yield row
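
If you prefer itertools, an equivalent formulation of getdata() (a sketch, not part of the original answer) chains the per-criterion generators into one lazy stream:

from itertools import chain

def getdata(filename, criteria):
    # lazily concatenate one getstuff() generator per criterion
    return chain.from_iterable(getstuff(filename, c) for c in criteria)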

Now loop directly over getdata() in your code:

for row in getdata(somefilename, sequence_of_criteria):
    # process row

You now hold only one row in memory, instead of thousands of rows per criterion.

yield makes a function a generator function, which means it won't do any work until you start looping over it.
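
To see that laziness in action, here is a minimal sketch (the file name and criterion are placeholders):

rows = getstuff("huge.csv", "some-value")  # returns instantly; no file I/O yet
header = next(rows)  # only now is the file opened and the header row parsed
for row in rows:     # each step reads just enough of the file for one row
    print(row[0])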

read.csv to import more than 2105 columns?

Try the data.table package; its fread() is implemented in C, runs multi-threaded, and is typically far faster than read.csv() on wide files.

library(data.table)

# fread auto-detects the separator, header, and column types
data <- fread("data.csv")

