Counting csv column occurrences on the fly in Python
Don't use DictReader(). DictReader() does a lot of work converting a row to a dictionary, with configurable handling of missing and extra columns, that you really don't need here. Just use a regular reader and access the 3rd column of each row.
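A minimal sketch of that approach (the path and column index are placeholders for your data):

```python
import csv

def count_ids(path):
    """Tally the 3rd column with a plain csv.reader and a regular dict."""
    counts = {}
    with open(path) as f:
        for row in csv.reader(f):
            label = row[2]  # 3rd column, zero-indexed
            counts[label] = counts.get(label, 0) + 1
    return counts
```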
You can further speed this up by using a Counter() object to begin with (it'll automatically handle the 0 case for you). You may be able to get a very minor speed boost by opening the file with newline=''; the csv module recommends you do this anyway, as it wants to make sure it knows about line endings versus possible embedded newlines in columns.
If you use a map() object and operator.itemgetter(), you can further avoid the evaluation loop overhead and pass the ids directly to the counter:
import csv
import os.path
from collections import Counter
from operator import itemgetter

filename = '/PATH/file'
with open(os.path.join(data_path, filename), newline='') as f:
    reader = csv.reader(f)
    id_counts = Counter(map(itemgetter(2), reader))
Still, 200 million rows is a lot of work to handle. I generated 1 million rows of semi-realistic data using Faker, copied those rows 200 times into a new file, and my 2017 model MacBook Pro with SSD processed the resulting 12GB of data in just over 6 minutes with tqdm included, and 5 minutes 14 seconds without. tqdm claims it adds only 60 nanoseconds to each iteration (12 seconds over 200 million rows), but in my tests it appears to be easily 3 or 4 times that number.
Pandas will read the data at about the same speed, as Pandas' read_csv() is built on top of csv.reader(), and the above is as fast as Python can make reading a file with 200 million rows. However, it'll then build a dataframe for those 200 million rows, and that'll take a significant amount of memory to process. You'd have to process your data in chunks and aggregate the results for this to be feasible at all.
Let's do some speed tests, comparing your version (one with and one without the tqdm speed bump), Pandas, and the above approach. We'll use a test set of 10k rows with about 100 unique ids to compare things evenly, without using I/O; this tests just the counting capabilities of each approach. So, setting up the test data and tests; the name=name keyword assignments help avoid global name lookups for repeated tests:
>>> import csv, pandas
>>> from timeit import Timer
>>> from collections import Counter
>>> from contextlib import redirect_stderr
>>> from io import StringIO
>>> from operator import itemgetter
>>> from random import randrange
>>> from tqdm import tqdm
>>> row = lambda: f",,{randrange(100)},,\r\n" # 5 columns, only care about middle column
>>> test_data = ''.join([row() for _ in range(10 ** 4)])  # CSV of 10,000 rows
>>> field_names = ['A', 'B', 'ID', 'C', 'D']
>>> filename = '/PATH/file'
>>> tests = []
>>> def as_test(f):
...     tests.append((f.__name__, f))
...
>>> @as_test
... def in_question(f, csv=csv, tqdm=tqdm, field_names=field_names):
...     ID_dict = {}
...     reader = csv.DictReader(f, field_names, delimiter=',')
...     for row in tqdm(reader):
...         label = row['ID']
...         if label not in ID_dict.keys():
...             ID_dict[label] = 0
...         ID_dict[label] += 1
...
>>> @as_test
... def in_question_no_tqdm(f, csv=csv, tqdm=tqdm, field_names=field_names):
...     ID_dict = {}
...     reader = csv.DictReader(f, field_names, delimiter=',')
...     for row in reader:
...         label = row['ID']
...         if label not in ID_dict.keys():
...             ID_dict[label] = 0
...         ID_dict[label] += 1
...
>>> @as_test
... def pandas_groupby_count(f, pandas=pandas, field_names=field_names):
...     df = pandas.read_csv(f, names=field_names)
...     grouped_counts = df.groupby('ID').count()
...
>>> @as_test
... def pandas_value_counts(f, pandas=pandas, field_names=field_names):
...     df = pandas.read_csv(f, names=field_names)
...     counts = df['ID'].value_counts()
...
>>> @as_test
... def counter_over_map(f, csv=csv, Counter=Counter, ig2=itemgetter(2)):
...     reader = csv.reader(f)
...     id_counts = Counter(map(ig2, reader))
...
and then running the timed tests:
>>> for testname, testfunc in tests:
...     timer = Timer(lambda s=StringIO, t=test_data: testfunc(s(t)))
...     with redirect_stderr(StringIO()):  # silence tqdm
...         count, totaltime = timer.autorange()
...     print(f"{testname:>25}: {totaltime / count * 1000:6.3f} milliseconds ({count:>2d} runs)")
...
              in_question: 33.303 milliseconds (10 runs)
      in_question_no_tqdm: 30.467 milliseconds (10 runs)
     pandas_groupby_count:  5.298 milliseconds (50 runs)
      pandas_value_counts:  5.975 milliseconds (50 runs)
         counter_over_map:  4.047 milliseconds (50 runs)
The combination of a DictReader() and a Python for loop is what really causes your version to be 6 to 7 times as slow. tqdm's overhead has dropped to 0.3 microseconds per iteration with stderr suppressed; dropping the with redirect_stderr() context manager made the output more verbose and increased the time to 50 milliseconds, so about 2 microseconds per iteration:
>>> timer = Timer(lambda s=StringIO, t=test_data: tests[0][1](s(t)))
>>> count, totaltime = timer.autorange()
10000it [00:00, 263935.46it/s]
10000it [00:00, 240672.96it/s]
10000it [00:00, 215298.98it/s]
10000it [00:00, 226025.18it/s]
10000it [00:00, 201787.96it/s]
10000it [00:00, 202984.24it/s]
10000it [00:00, 192296.06it/s]
10000it [00:00, 195963.46it/s]
>>> print(f"{totaltime / count * 1000:6.3f} milliseconds ({count:>2d} runs)")
50.193 milliseconds ( 5 runs)
Pandas is keeping up well here, however! But without chunking, the gigabytes of memory needed to read all 200 million rows of data into memory (with the actual dataset, not the empty columns I produced here) are going to make it a lot slower and perhaps not something your machine can actually carry. Using a Counter() doesn't require gigabytes of memory here.
If you need to do more processing of the CSV dataset, then using SQLite is going to be a good idea too. I wouldn't even use Python then; just use the SQLite command line tool to import the CSV data directly:
$ sqlite3 csvanalysis.db
SQLite version 3.19.3 2017-06-27 16:48:08
Enter ".help" for usage hints.
sqlite> CREATE TABLE csvdata (A, B, ID, C, D);
sqlite> CREATE INDEX csvdata_id on csvdata (ID);
sqlite> .mode csv
sqlite> .import /path/to/file.csv csvdata
sqlite> SELECT ID, COUNT(*) FROM csvdata GROUP BY ID;
etc.
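If you'd rather run the queries from Python afterwards, the stdlib sqlite3 module can execute the same aggregation; a minimal sketch, assuming the table and column names from the shell session above:

```python
import sqlite3

def id_counts(db_path='csvanalysis.db'):
    """Return (ID, count) pairs via the same GROUP BY aggregation."""
    conn = sqlite3.connect(db_path)
    try:
        return conn.execute(
            "SELECT ID, COUNT(*) FROM csvdata GROUP BY ID"
        ).fetchall()
    finally:
        conn.close()
```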
Counting word occurrences in csv and determine row appearances
You are comparing with row == 'apple', but what you need is 'apple' in row. And to count the occurrences you can use str.count(), for example:
import csv

my_reader = csv.reader(open('file.csv', encoding='utf-8'))
ctr = 0
rows = 0
for record in my_reader:
    if 'apples' in record[0]:
        rows += 1
        ctr += record[0].count('apples')
print('apples: {}, rows: {}'.format(ctr, rows))
This way you check whether the row contains apples; if it does, you increment rows by one and increment ctr by the number of times apples appears in that row.
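If the word can appear in any column rather than only the first, a variant of the same idea (a hypothetical helper, joining the columns before searching) would be:

```python
import csv

def count_word(path, word='apples'):
    """Total occurrences of word, and the number of rows containing it,
    searching every column instead of only record[0]."""
    total = rows = 0
    with open(path, encoding='utf-8') as f:
        for record in csv.reader(f):
            line = ','.join(record)  # search the whole row
            if word in line:
                rows += 1
                total += line.count(word)
    return total, rows
```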
Counting occurrences of a string in a column of a csv file
I think you can use str.count with the column sentence:
print(df)
#                                          sentence
#0  Sam ate an apple and she felt great apple apple
#1  Jill thinks the sky is purple but Bob says it'...
#2        Ralph wants to go apple picking this fall

print(df.columns)
#Index([u'sentence'], dtype='object')

df['count'] = df['sentence'].str.count('apple')
print(df)
#                                          sentence  count
#0  Sam ate an apple and she felt great apple apple      3
#1  Jill thinks the sky is purple but Bob says it'...      0
#2        Ralph wants to go apple picking this fall      1
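Since str.count() accepts a regular expression, a case-insensitive count is a small variation; a sketch, assuming the same sentence column (the sample sentences here are made up):

```python
import re
import pandas as pd

df = pd.DataFrame({'sentence': ['Apple pie', 'no fruit here', 'apple APPLE']})
# flags is passed through to the re module
df['count'] = df['sentence'].str.count('apple', flags=re.IGNORECASE)
print(df['count'].tolist())  # [1, 0, 2]
```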
Python algorithm of counting occurrence of specific word in csv
Basic example using csv and collections.Counter (Python 2.7+) from the standard Python library:
import csv
import collections

grades = collections.Counter()
with open('file.csv') as input_file:
    for row in csv.reader(input_file, delimiter=';'):
        grades[row[1]] += 1

print('Number of A grades: %s' % grades['A'])
print(grades.most_common())
Output (for small dataset):
Number of A grades: 2055
[('A', 2055), ('B', 2034), ('D', 1995), ('E', 1977), ('C', 1939)]