Pandas groupby.size vs series.value_counts vs collections.Counter with multiple series
There's actually a bit of hidden overhead in zip(df.A.values, df.B.values). The key here comes down to numpy arrays being stored in memory in a fundamentally different way than Python objects.
A numpy array, such as np.arange(10), is essentially stored as a contiguous block of memory, not as individual Python objects. Conversely, a Python list, such as list(range(10)), is stored in memory as pointers to individual Python objects (i.e. the integers 0-9). This difference is the basis for why numpy arrays are smaller in memory than the equivalent Python lists, and why you can perform faster computations on numpy arrays.
So, as Counter is consuming the zip, the associated tuples need to be created as Python objects. This means that Python needs to extract the tuple values from numpy data and create corresponding Python objects in memory. There is noticeable overhead to this, which is why you want to be very careful when combining pure Python functions with numpy data. A common example of this pitfall is using the built-in Python sum on a numpy array: sum(np.arange(10**5)) is actually a bit slower than the pure Python sum(range(10**5)), and both are of course significantly slower than np.sum(np.arange(10**5)).
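As a quick sanity check of the claim above, all three variants compute the same result; only np.sum avoids boxing each element as a Python object (actual timings will vary by machine, so they are left as an exercise):

```python
import numpy as np

arr = np.arange(10**5)

# All three agree on the result...
assert sum(arr) == sum(range(10**5)) == np.sum(arr)

# ...but only np.sum operates directly on the contiguous buffer; the
# built-in sum forces each element to be converted to a Python int first.
# Compare for yourself: %timeit sum(arr); %timeit sum(range(10**5));
# %timeit np.sum(arr)
```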
See this video for a more in-depth discussion of this topic.
As an example specific to this question, observe the following timings comparing the performance of Counter on zipped numpy arrays vs. the corresponding zipped Python lists.
In [2]: a = np.random.randint(10**4, size=10**6)
...: b = np.random.randint(10**4, size=10**6)
...: a_list = a.tolist()
...: b_list = b.tolist()
In [3]: %timeit Counter(zip(a, b))
455 ms ± 4.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [4]: %timeit Counter(zip(a_list, b_list))
334 ms ± 4.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
The difference between these two timings gives you a reasonable estimate of the overhead discussed earlier.
This isn't quite the end of the story, though. Constructing a groupby object in pandas involves some overhead too, at least as it relates to this problem, since there's some groupby metadata that isn't strictly necessary just to get size, whereas Counter does the one singular thing you care about. Usually this overhead is far less than the overhead associated with Counter, but from some quick experimentation I've found that you can actually get marginally better performance from Counter when the majority of your groups consist of single elements.
Consider the following timings (using @BallpointBen's sort=False suggestion) that go along the spectrum of few large groups <--> many small groups:
def grouper(df):
    return df.groupby(['A', 'B'], sort=False).size()

def count(df):
    return Counter(zip(df.A.values, df.B.values))

for m, n in [(10, 10**6), (10**3, 10**6), (10**7, 10**6)]:
    df = pd.DataFrame({'A': np.random.randint(0, m, n),
                       'B': np.random.randint(0, m, n)})
    print(m, n)
    %timeit grouper(df)
    %timeit count(df)
Which gives me the following table:
m grouper counter
10 62.9 ms 315 ms
10**3 191 ms 535 ms
10**7 514 ms 459 ms
Of course, any gains from Counter would be offset by converting back to a Series, if that's what you want as your final object.
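For completeness, a minimal sketch of that conversion (the small df here is made up for illustration):

```python
import pandas as pd
from collections import Counter

# Toy data standing in for the larger frames timed above.
df = pd.DataFrame({'A': [1, 1, 2], 'B': ['x', 'x', 'y']})

# A Counter is a dict subclass, so it feeds straight into pd.Series;
# the result is indexed by (A, B) tuples rather than the MultiIndex
# that groupby(...).size() would produce.
counts = pd.Series(Counter(zip(df.A.values, df.B.values)))
```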
Python: get a frequency count based on two columns (variables) in a pandas dataframe when some rows appear more than once
You can use groupby's size:
In [11]: df.groupby(["Group", "Size"]).size()
Out[11]:
Group Size
Moderate Medium 1
Small 1
Short Small 2
Tall Large 1
dtype: int64
In [12]: df.groupby(["Group", "Size"]).size().reset_index(name="Time")
Out[12]:
Group Size Time
0 Moderate Medium 1
1 Moderate Small 1
2 Short Small 2
3 Tall Large 1
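For reference, here is a runnable sketch with an assumed input frame (the question's df isn't shown) that reproduces the output above:

```python
import pandas as pd

# Assumed input; chosen so the counts match the session above.
df = pd.DataFrame({'Group': ['Moderate', 'Moderate', 'Short', 'Short', 'Tall'],
                   'Size': ['Medium', 'Small', 'Small', 'Small', 'Large']})

# size() counts rows per (Group, Size) pair; reset_index(name=...) turns
# the resulting series back into a flat frame with a named count column.
out = df.groupby(['Group', 'Size']).size().reset_index(name='Time')
```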
Pandas DataFrame Groupby two columns and get counts
Following up on @Andy's answer, you can do the following to solve your second question:
In [56]: df.groupby(['col5','col2']).size().reset_index().groupby('col2')[[0]].max()
Out[56]:
0
col2
A 3
B 2
C 1
D 3
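To unpack what that chain does, here is a hypothetical step-by-step sketch with toy data (the column names follow the answer above; reset_index(name='n') is used instead of the positional column 0 for readability):

```python
import pandas as pd

# Assumed toy data for illustration.
df = pd.DataFrame({'col5': ['x', 'x', 'x', 'y', 'y', 'y'],
                   'col2': ['A', 'A', 'A', 'A', 'B', 'B']})

# Step 1: count rows per (col5, col2) pair.
sizes = df.groupby(['col5', 'col2']).size().reset_index(name='n')

# Step 2: for each col2 value, keep the largest per-(col5, col2) count.
result = sizes.groupby('col2')['n'].max()
```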
What is the most efficient way of counting occurrences in pandas?
I think df['word'].value_counts() should serve. By skipping the groupby machinery, you'll save some time. I'm not sure why count should be much slower than max. Both take some time to avoid missing values. (Compare with size.)
In any case, value_counts has been specifically optimized to handle object type, like your words, so I doubt you'll do much better than that.
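As a quick illustration that value_counts gives the same counts as the groupby route, just pre-sorted by frequency (toy data made up here):

```python
import pandas as pd

s = pd.Series(['the', 'cat', 'the', 'dog', 'the'], name='word')

vc = s.value_counts()  # counts, sorted descending by frequency

# Equivalent counts via groupby, sorted to match.
gb = s.groupby(s).size().sort_values(ascending=False)

assert vc.to_dict() == gb.to_dict()
```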
How to group by on two levels in python pandas and count values?
Use groupby + size and reset_index:
df1 = dfs.groupby(['Cat','Number']).size().reset_index(name='Count')
Or:
df1 = dfs.groupby(['Cat','Number'])['Email'].value_counts().reset_index(name='Count')
print(df1)
Cat Number Count
0 ab1 1 2
1 ab1 2 1
2 ab2 1 3
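For reference, a runnable sketch with an assumed dfs (the original isn't shown; the Email values are made up) that reproduces the counts above:

```python
import pandas as pd

# Assumed input reconstructing the printed counts.
dfs = pd.DataFrame({'Cat': ['ab1', 'ab1', 'ab1', 'ab2', 'ab2', 'ab2'],
                    'Number': [1, 1, 2, 1, 1, 1],
                    'Email': ['a@x', 'b@x', 'a@x', 'a@x', 'b@x', 'c@x']})

# One row per (Cat, Number) pair with its occurrence count.
df1 = dfs.groupby(['Cat', 'Number']).size().reset_index(name='Count')
```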
When is it appropriate to use df.value_counts() vs df.groupby('...').count()?
There is a difference in what value_counts returns:
The resulting object will be in descending order so that the first element is the most frequently-occurring element.
count does not sort by frequency; it sorts the output by the index (created from the column passed to groupby('col')).
df.groupby('colA').count() aggregates all columns of df with the count function, so it counts values excluding NaNs.
If you need to count only one column, use:
df.groupby('colA')['colA'].count()
Sample:
import numpy as np
import pandas as pd

df = pd.DataFrame({'colB':list('abcdefg'),
                   'colC':[1,3,5,7,np.nan,np.nan,4],
                   'colD':[np.nan,3,6,9,2,4,np.nan],
                   'colA':['c','c','b','a',np.nan,'b','b']})
print (df)
colA colB colC colD
0 c a 1.0 NaN
1 c b 3.0 3.0
2 b c 5.0 6.0
3 a d 7.0 9.0
4 NaN e NaN 2.0
5 b f NaN 4.0
6 b g 4.0 NaN
print (df['colA'].value_counts())
b 3
c 2
a 1
Name: colA, dtype: int64
print (df.groupby('colA').count())
colB colC colD
colA
a 1 1 1
b 3 2 2
c 2 2 1
print (df.groupby('colA')['colA'].count())
colA
a 1
b 3
c 2
Name: colA, dtype: int64
GroupBy pandas DataFrame and select most common value
You can use value_counts() to get a count series, and get the first row:
source.groupby(['Country','City']).agg(lambda x: x.value_counts().index[0])
In case you are wondering about performing other agg functions in the .agg(), try this.
# Let's add a new col, "account"
source['account'] = [1, 2, 3, 3]

source.groupby(['Country','City']).agg(
    mod=('Short name', lambda x: x.value_counts().index[0]),
    avg=('account', 'mean'))
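A self-contained sketch with an assumed source frame (the question's data isn't shown here; the values below are invented to exercise both aggregations):

```python
import pandas as pd

# Assumed toy data standing in for the question's `source`.
source = pd.DataFrame({
    'Country': ['USA', 'USA', 'Russia', 'USA'],
    'City': ['New-York', 'New-York', 'Sankt-Petersburg', 'New-York'],
    'Short name': ['NY', 'New', 'Spb', 'NY']})
source['account'] = [1, 2, 3, 3]

# Named aggregation (pandas >= 0.25): 'mod' takes the most common
# Short name per group, 'avg' takes the mean account value.
out = source.groupby(['Country', 'City']).agg(
    mod=('Short name', lambda x: x.value_counts().index[0]),
    avg=('account', 'mean'))
```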
Number of occurrence of pair of value in dataframe
For performance implications of the below solutions, see Pandas groupby.size vs series.value_counts vs collections.Counter with multiple series. They are presented below with best performance first.
GroupBy.size
You can create a series of counts with a (Name, Surname) MultiIndex using GroupBy.size:
res = df.groupby(['Name', 'Surname']).size().sort_values(ascending=False)
By sorting these values, we can easily extract the most common:
most_common = res.head(1)
most_common_dups = res[res == res.iloc[0]].index.tolist() # handles duplicate top counts
value_counts
Another way is to construct a series of tuples, then apply pd.Series.value_counts:
res = pd.Series(list(zip(df.Name, df.Surname))).value_counts()
The result will be a series of counts indexed by Name-Surname combinations, sorted from most common to least.
name, surname = res.index[0] # return most common
most_common_dups = res[res == res.max()].index.tolist()
collections.Counter
If you wish to create a dictionary of (name, surname): counts entries, you can do so via collections.Counter:
from collections import Counter
zipper = zip(df.Name, df.Surname)
c = Counter(zipper)
Counter has useful methods such as most_common, which you can use to extract your result.