What Is the Most Efficient Way of Counting Occurrences in Pandas

What is the most efficient way of counting occurrences in pandas?

I think df['word'].value_counts() should serve. By skipping the groupby machinery, you'll save some time. I'm not sure why count should be much slower than max; both take extra time to avoid missing values. (Compare with size, which skips that check.)

In any case, value_counts has been specifically optimized to handle object type, like your words, so I doubt you'll do much better than that.
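A minimal sketch of the comparison, with a made-up df['word'] column since the question's data isn't shown:

import pandas as pd

df = pd.DataFrame({'word': ['apple', 'banana', 'apple', 'cherry', 'apple']})

# value_counts: per-word occurrence counts, sorted by frequency, no groupby needed
print(df['word'].value_counts())
# apple     3
# banana    1
# cherry    1

# The groupby equivalent, which goes through the heavier groupby machinery
print(df.groupby('word').size())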

Efficient way in Pandas to count occurrences of Series of values by row

We can compare the transposed df.T directly to the row-wise maxima df.max(axis=1), thanks to broadcasting:

(df.T == df.max(axis=1)).sum()

# result
0    2
1    1
2    1
3    2
dtype: int64

(Transposing also has the added benefit that we can use sum without specifying the axis, i.e. with the default axis=0.)
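For concreteness, a self-contained sketch with made-up data that reproduces the output above (the question's original frame isn't shown here):

import pandas as pd

# Hypothetical data: each row's maximum appears once or twice
df = pd.DataFrame([[1, 3, 3],
                   [5, 2, 1],
                   [0, 4, 2],
                   [7, 7, 6]])

# Broadcasting aligns the columns of df.T (the original rows) with the
# row-wise maxima, so summing counts how often each row's max occurs
print((df.T == df.max(axis=1)).sum())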

Fastest way to count occurrence of words in Pandas

Perhaps you can use collections.Counter. If you have multiple sets of words to test against the same text, just save the intermediate result after applying Counter. Since the counted words then live in a dictionary keyed by word, testing whether a given word is present is an O(1) operation.

from collections import Counter

# `words` is the collection of target words to look up (defined in the question)
data["Count"] = (
    data['col'].str.split()
               .apply(Counter)
               .apply(lambda counts: sum(word in counts for word in words))
)
>>> data
                    col  Count
0        I want to find      2
1       the fastest way      0
2   to count occurrence      0
3  of words in a column      0
4   Can you help please      1
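A plausible end-to-end setup, assuming a target list like words = ['want', 'find', 'help'] (consistent with the counts above). Saving the intermediate Counters also shows the reuse the answer mentions:

from collections import Counter

import pandas as pd

data = pd.DataFrame({'col': ['I want to find',
                             'the fastest way',
                             'to count occurrence',
                             'of words in a column',
                             'Can you help please']})
words = ['want', 'find', 'help']  # hypothetical target words

# Build one Counter per row and keep it around
counters = data['col'].str.split().apply(Counter)
data['Count'] = counters.apply(lambda c: sum(w in c for w in words))

# A second word set reuses the saved Counters; each lookup is O(1)
other_words = ['fastest', 'column']
data['Count2'] = counters.apply(lambda c: sum(w in c for w in other_words))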

What is the most efficient way to count the number of instances occurring within a time frame in Python?

Without example data it's not entirely clear what you want, but this should help you vectorise:

numSurgeries = {
    shift: np.sum((OR['PATIENT_IN_ROOM_DTTM'] >= df.DateTime[shift]) &
                  (OR['PATIENT_IN_ROOM_DTTM'] < df.DateTime[shift + 1]))
    for shift in range(len(df.Date))
}

The output is a dictionary mapping each integer shift index to the number of surgeries falling in that shift's window.
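A self-contained sketch of the same idea with made-up boundaries and timestamps (column names follow the snippet above; note the iteration stops one short so that shift + 1 always exists):

import numpy as np
import pandas as pd

# Hypothetical shift boundaries: shift i runs from DateTime[i] to DateTime[i+1]
df = pd.DataFrame({'DateTime': pd.to_datetime(['2023-01-01 06:00',
                                               '2023-01-01 14:00',
                                               '2023-01-01 22:00'])})

# Hypothetical patient-in-room timestamps
OR = pd.DataFrame({'PATIENT_IN_ROOM_DTTM': pd.to_datetime(
    ['2023-01-01 07:30', '2023-01-01 09:00', '2023-01-01 15:45'])})

numSurgeries = {
    shift: np.sum((OR['PATIENT_IN_ROOM_DTTM'] >= df.DateTime[shift]) &
                  (OR['PATIENT_IN_ROOM_DTTM'] < df.DateTime[shift + 1]))
    for shift in range(len(df.DateTime) - 1)
}
# shift 0 (06:00-14:00) -> 2 surgeries, shift 1 (14:00-22:00) -> 1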

Fast way to count occurrences of all values in a pandas DataFrame

Approach #1

Well, the NumPy trick would be to convert to numbers (that's where NumPy shines) and simply let bincount do the counting -

# Fill NaNs with '[' (ASCII 91, just past 'Z') so every cell is a single character
a = df.fillna('[').values.astype(str).view(np.uint8)
# The slice 65:-1 keeps codes 65..90, i.e. 'A'..'Z', and drops the '[' filler
count = np.bincount(a.ravel())[65:-1]

This works for single characters; np.bincount(a.ravel()) holds the counts for every character code.

Approach #1S (super-charged)

The previous approach had a bottleneck at the string conversion, astype(str), and the fillna() was another show-stopper. Getting around those bottlenecks needed more trickery: astype('S1') can be used upfront to force everything down to a single byte. Single characters stay put, while each NaN gets reduced to the single character 'n' (the first byte of 'nan'). This lets us skip fillna entirely, since the count for 'n' can simply be excluded later with indexing.

Hence, the implementation would be -

def app1S(df):
    # 'S1' truncates each value to its first byte; NaN becomes b'n'
    ar = df.values.astype('S1')
    a = ar.view(np.uint8)
    # Keep only codes 65..90 ('A'..'Z'); 'n' (code 110) falls outside
    count = np.bincount(a.ravel())[65:65 + 26]
    return count
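To read the result back per letter, one small assumed convenience (not part of the original answer) is to wrap it in a labelled Series:

from string import ascii_uppercase

import pandas as pd

# Map each of the 26 counts back to its letter
letter_counts = pd.Series(app1S(df), index=list(ascii_uppercase))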

Timings on pandas-0.20.3 and numpy-1.13.3 -

In [3]: # Setup input
   ...: random.seed(100)
   ...: n = 1000000
   ...: data = {letter: [random.choice(list(ascii_uppercase) + [np.nan])
   ...:                  for _ in range(n)] for letter in ascii_uppercase}
   ...: df = pd.DataFrame(data)
   ...:

# @Wen's soln
In [4]: %timeit df.melt().value.value_counts()
1 loop, best of 3: 2.5 s per loop

# @andrew_reece's soln
In [5]: %timeit df.apply(pd.value_counts).sum(axis=1)
1 loop, best of 3: 2.14 s per loop

# Super-charged one
In [6]: %timeit app1S(df)
1 loop, best of 3: 501 ms per loop

Generic case

We can also use np.unique to cover generic cases (data with more than single characters) -

unq, count = np.unique(df.fillna(-999), return_counts=True)
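A quick sketch of the generic path on made-up multi-character data. The string filler here is an assumption on my part: with a numeric filler like -999 the object array mixes types, which Python 3 can no longer sort inside np.unique:

import numpy as np
import pandas as pd

# Hypothetical multi-character data; a string filler keeps the array sortable
df = pd.DataFrame({'x': ['ab', 'cd', 'ab', np.nan],
                   'y': ['cd', 'ab', np.nan, 'ef']})

unq, count = np.unique(df.fillna('NA').values, return_counts=True)
# Label the counts for readable output
print(pd.Series(count, index=unq))
# NA    2
# ab    3
# cd    2
# ef    1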

Fastest way to count number of occurrences in a Python list

a = ['1', '1', '1', '1', '1', '1', '2', '2', '2', '2', '7', '7', '7', '10', '10']
print(a.count("1"))

It's probably optimized heavily at the C level.

Edit: I randomly generated a large list.

In [8]: len(a)
Out[8]: 6339347

In [9]: %timeit a.count("1")
10 loops, best of 3: 86.4 ms per loop

Edit 2: This could also be done with collections.Counter

from collections import Counter

counts = Counter(your_list)
print(counts['1'])

Using the same list as in my last timing example:

In [17]: %timeit Counter(a)['1']
1 loops, best of 3: 1.52 s per loop

My timing is simplistic and conditional on many different factors, but it gives you a good clue as to performance.
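The trade-off: a.count(x) scans the whole list on every query, while building a Counter costs one pass and then answers each query in O(1), so it pays off once you need counts for many distinct values. A rough sketch of my own, not from the original answer:

from collections import Counter

a = ['1', '1', '2', '7', '10'] * 1000000  # hypothetical large list
targets = ['1', '2', '7', '10']

# k queries via list.count: k full scans, O(k * n)
counts_by_scan = {t: a.count(t) for t in targets}

# One Counter pass, O(n), then O(1) lookups per query
c = Counter(a)
counts_by_counter = {t: c[t] for t in targets}

assert counts_by_scan == counts_by_counter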

Here is some profiling:

In [24]: profile.run("a.count('1')")
3 function calls in 0.091 seconds

Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    0.091    0.091 <string>:1(<module>)
        1    0.091    0.091    0.091    0.091 {method 'count' of 'list' objects}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}

In [25]: profile.run("b = Counter(a); b['1']")
6339356 function calls in 2.143 seconds

Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    2.143    2.143 <string>:1(<module>)
        2    0.000    0.000    0.000    0.000 _weakrefset.py:68(__contains__)
        1    0.000    0.000    0.000    0.000 abc.py:128(__instancecheck__)
        1    0.000    0.000    2.143    2.143 collections.py:407(__init__)
        1    1.788    1.788    2.143    2.143 collections.py:470(update)
        1    0.000    0.000    0.000    0.000 {getattr}
        1    0.000    0.000    0.000    0.000 {isinstance}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
  6339347    0.356    0.000    0.356    0.000 {method 'get' of 'dict' objects}

