Counting CSV Column Occurrences on the Fly in Python

Don't use DictReader(). DictReader() does a lot of work converting each row to a dictionary, with configurable handling of missing and extra columns, none of which you need here. Just use a regular reader and access the 3rd column of each row.
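
For example, the counting loop can stay exactly as you have it and just switch to a plain reader; a minimal sketch (using the same placeholder path as below):

import csv

ID_dict = {}
with open('/PATH/file') as f:
    reader = csv.reader(f)       # plain reader, no per-row dict construction
    for row in reader:
        label = row[2]           # the ID is the 3rd column
        if label not in ID_dict:
            ID_dict[label] = 0
        ID_dict[label] += 1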

You can further speed this up by using a Counter() object to begin with (it'll automatically handle the 0 case for you). You may be able to get a very minor speed boost by opening the file with newline=''; the CSV module recommends you do this anyway as it wants to make sure it knows about line endings versus possible embedded newlines in columns.
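
A sketch of just that intermediate step (again with a placeholder path):

import csv
from collections import Counter

id_counts = Counter()
with open('/PATH/file', newline='') as f:   # newline='' as the csv module recommends
    for row in csv.reader(f):
        id_counts[row[2]] += 1               # missing keys start at 0 automatically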

If you use a map() object and operator.itemgetter(), you can further avoid the evaluation loop overhead, and pass the ids directly to the counter:

import csv
import os.path
from collections import Counter
from operator import itemgetter

filename = '/PATH/file'

with open(os.path.join(data_path, filename), newline='') as f:
    reader = csv.reader(f)
    id_counts = Counter(map(itemgetter(2), reader))

Still, 200 million rows is a lot of work to handle. I generated 1 million rows of semi-realistic data using Faker, copied those rows 200 times into a new file, and my 2017-model MacBook Pro with SSD processed the resulting 12GB of data in just over 6 minutes with tqdm included, and 5 minutes 14 seconds without. tqdm claims it adds only 60 nanoseconds to each iteration (12 seconds over 200 million rows), but in my tests it appears to be easily 3 or 4 times that number.
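
Generating that kind of test file with Faker can be done along these lines (a rough sketch only; the specific Faker providers and the five-column layout here are illustrative assumptions, not the exact fields behind the timings above):

import csv
from faker import Faker

fake = Faker()
with open('test_1M.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    for _ in range(10 ** 6):
        # five columns, with an ID-like value in the 3rd column
        writer.writerow([fake.name(), fake.company(), fake.uuid4(),
                         fake.date(), fake.city()])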

Pandas reading the data is going to be about the same speed, as Pandas' read_csv() is built on top of csv.reader(), and the above is as fast as Python can make reading a file with 200 million rows. However, it'll then build a dataframe for those 200 million rows, and that'll take a significant amount of memory to process. You'd have to process your data in chunks and aggregate the results for this to be feasible at all.
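
A chunked Pandas version would look roughly like this (a sketch; the chunk size is arbitrary, and the path and column names are the same assumptions used elsewhere in this answer):

import pandas as pd

field_names = ['A', 'B', 'ID', 'C', 'D']
counts = pd.Series(dtype='int64')
# chunksize makes read_csv yield an iterator of DataFrames instead of one huge frame
for chunk in pd.read_csv('/PATH/file', names=field_names, usecols=['ID'],
                         chunksize=10 ** 6):
    counts = counts.add(chunk['ID'].value_counts(), fill_value=0)
counts = counts.astype(int)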

Let's do some speed tests, comparing your version (one with and one without the tqdm speed bump), Pandas, and the above approach. We'll use a test set of 10k rows with about 100 unique ids to compare things evenly, without involving I/O; this tests just the counting capabilities of each approach. So, setting up the test data and the tests; the name=name keyword assignments help avoid global name lookups in the repeated test runs:

>>> import csv, pandas
>>> from timeit import Timer
>>> from collections import Counter
>>> from contextlib import redirect_stderr
>>> from io import StringIO
>>> from operator import itemgetter
>>> from random import randrange
>>> from tqdm import tqdm
>>> row = lambda: f",,{randrange(100)},,\r\n" # 5 columns, only care about middle column
>>> test_data = ''.join([row() for _ in range(10 ** 4)]) # CSV of 10.000 rows
>>> field_names = ['A', 'B', 'ID', 'C', 'D']
>>> filename = '/PATH/file'
>>> tests = []
>>> def as_test(f):
...     tests.append((f.__name__, f))
...
>>> @as_test
... def in_question(f, csv=csv, tqdm=tqdm, field_names=field_names):
...     ID_dict = {}
...     reader = csv.DictReader(f, field_names, delimiter=',')
...     for row in tqdm(reader):
...         label = row['ID']
...         if label not in ID_dict.keys():
...             ID_dict[label] = 0
...         ID_dict[label] += 1
...
>>> @as_test
... def in_question_no_tqdm(f, csv=csv, tqdm=tqdm, field_names=field_names):
...     ID_dict = {}
...     reader = csv.DictReader(f, field_names, delimiter=',')
...     for row in reader:
...         label = row['ID']
...         if label not in ID_dict.keys():
...             ID_dict[label] = 0
...         ID_dict[label] += 1
...
>>> @as_test
... def pandas_groupby_count(f, pandas=pandas, field_names=field_names):
...     df = pandas.read_csv(f, names=field_names)
...     grouped_counts = df.groupby('ID').count()
...
>>> @as_test
... def pandas_value_counts(f, pandas=pandas, field_names=field_names):
...     df = pandas.read_csv(f, names=field_names)
...     counts = df['ID'].value_counts()
...
>>> @as_test
... def counter_over_map(f, csv=csv, Counter=Counter, ig2=itemgetter(2)):
...     reader = csv.reader(f)
...     id_counts = Counter(map(ig2, reader))
...

and then running the timed tests:

>>> for testname, testfunc in tests:
...     timer = Timer(lambda s=StringIO, t=test_data: testfunc(s(t)))
...     with redirect_stderr(StringIO()):  # silence tqdm
...         count, totaltime = timer.autorange()
...     print(f"{testname:>25}: {totaltime / count * 1000:6.3f} milliseconds ({count:>2d} runs)")
...
              in_question: 33.303 milliseconds (10 runs)
      in_question_no_tqdm: 30.467 milliseconds (10 runs)
     pandas_groupby_count:  5.298 milliseconds (50 runs)
      pandas_value_counts:  5.975 milliseconds (50 runs)
         counter_over_map:  4.047 milliseconds (50 runs)

The combination of a DictReader() and a Python for loop is what really causes your version to be 6 to 7 times as slow. tqdm's overhead has dropped to about 0.3 microseconds per iteration with stderr suppressed; dropping the with redirect_stderr() context manager makes the output more verbose and increases the time to 50 milliseconds, so about 2 microseconds per iteration:

>>> timer = Timer(lambda s=StringIO, t=test_data: tests[0][1](s(t)))
>>> count, totaltime = timer.autorange()
10000it [00:00, 263935.46it/s]
10000it [00:00, 240672.96it/s]
10000it [00:00, 215298.98it/s]
10000it [00:00, 226025.18it/s]
10000it [00:00, 201787.96it/s]
10000it [00:00, 202984.24it/s]
10000it [00:00, 192296.06it/s]
10000it [00:00, 195963.46it/s]
>>> print(f"{totaltime / count * 1000:6.3f} milliseconds ({count:>2d} runs)")
50.193 milliseconds ( 5 runs)

Pandas is keeping up well here, however! But without chunking, reading all 200 million rows of data into memory (with your actual dataset, not the empty columns I produced here) is going to take gigabytes of memory, be a lot slower, and perhaps be more than your machine can actually handle. Using a Counter() doesn't require gigabytes of memory here.

If you need to do more processing of the CSV dataset, then using SQLite is going to be a good idea too. I wouldn't even use Python then; just use the SQLite command line tool to import the CSV data directly:

$ sqlite3 csvanalysis.db
SQLite version 3.19.3 2017-06-27 16:48:08
Enter ".help" for usage hints.
sqlite> CREATE TABLE csvdata (A, B, ID, C, D);
sqlite> CREATE INDEX csvdata_id on csvdata (ID);
sqlite> .import /path/to/file.csv csvdata
sqlite> SELECT ID, COUNT(*) FROM csvdata GROUP BY ID;

etc.

Counting word occurrences in csv and determining row appearances

You are comparing with if row == 'apple'; what you need is if 'apple' in row. And to count the occurrences you can use str.count(), for example:

import csv
my_reader = csv.reader(open('file.csv', encoding='utf-8'))
ctr = 0
rows = 0
for record in my_reader:
    if 'apples' in record[0]:
        rows += 1
        ctr += record[0].count('apples')

print('apples: {}, rows: {}'.format(ctr, rows))

This way you check whether the row contains 'apples'; if it does, you increment rows by one and increment ctr by the number of times 'apples' appears in that row.

Counting occurrences of a string in a column of a csv file

I think you can use str.count with the sentence column:

print df
                                           sentence
0  Sam ate an apple and she felt great apple apple
1  Jill thinks the sky is purple but Bob says it'...
2         Ralph wants to go apple picking this fall

print df.columns
Index([u'sentence'], dtype='object')

df['count'] = df['sentence'].str.count('apple')
print df
                                           sentence  count
0  Sam ate an apple and she felt great apple apple      3
1  Jill thinks the sky is purple but Bob says it'...      0
2         Ralph wants to go apple picking this fall      1

Python algorithm of counting occurrence of specific word in csv

Basic example, using csv and collections.Counter (Python 2.7+) from the standard Python library:

import csv
import collections

grades = collections.Counter()
with open('file.csv') as input_file:
    for row in csv.reader(input_file, delimiter=';'):
        grades[row[1]] += 1

print 'Number of A grades: %s' % grades['A']
print grades.most_common()

Output (for small dataset):

Number of A grades: 2055
[('A', 2055), ('B', 2034), ('D', 1995), ('E', 1977), ('C', 1939)]

