Difference between Size and Count in Pandas

What is the difference between size and count in pandas?

size includes NaN values, count does not:

In [46]:
import numpy as np
import pandas as pd

df = pd.DataFrame({'a':[0,0,1,2,2,2], 'b':[1,2,3,4,np.nan,4], 'c':np.random.randn(6)})
df

Out[46]:
   a   b         c
0  0   1  1.067627
1  0   2  0.554691
2  1   3  0.458084
3  2   4  0.426635
4  2 NaN -2.238091
5  2   4  1.256943

In [48]:
print(df.groupby(['a'])['b'].count())
print(df.groupby(['a'])['b'].size())

a
0    2
1    1
2    2
Name: b, dtype: int64

a
0    2
1    1
2    3
dtype: int64

What's the difference between count(), size(), unique() in pandas?

For example, suppose you have a DataFrame like the one below:

df=pd.DataFrame({
    'key1':['a','a','b','b','a'],
    'data1':[1,1,np.nan,1,2]
})
grouped=df['data1'].groupby(df['key1'])

grouped.size()  # returns the length of each group, including NaN values

Out[413]:
key1
a    3
b    2
Name: data1, dtype: int64

grouped.count()  # excludes NaN, so the np.nan in group b is ignored
Out[414]:
key1
a    3
b    1
Name: data1, dtype: int64

grouped.nunique()  # number of distinct non-NaN values: group a holds 1 and 2, so 2; group b holds NaN and 1, so 1
Out[415]: 
key1
a    2
b    1
Name: data1, dtype: int64
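If NaN should be treated as a distinct value of its own, nunique takes a dropna flag. The sketch below rebuilds the same sample frame and compares the default with dropna=False; the variable names are only illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'key1': ['a', 'a', 'b', 'b', 'a'],
    'data1': [1, 1, np.nan, 1, 2],
})
grouped = df['data1'].groupby(df['key1'])

# Default: NaN is ignored when counting distinct values.
distinct_without_nan = grouped.nunique()

# dropna=False: NaN counts as one more distinct value in group b.
distinct_with_nan = grouped.nunique(dropna=False)

print(distinct_without_nan)
print(distinct_with_nan)
```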

Why do I see a different result when using size() or count()?

For a DataFrame, size returns the number of rows times the number of columns, while count returns the number of non-NaN values in each column.
I suggest you check the pandas documentation for both:

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.size.html

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.count.html
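As a small illustration of the two attributes linked above, here is a sketch on a made-up frame (the data is an assumption, not the asker's):

```python
import numpy as np
import pandas as pd

# Hypothetical 3 x 2 frame with some missing values.
df = pd.DataFrame({'a': [1, 2, np.nan], 'b': [4.0, np.nan, np.nan]})

total_cells = df.size              # rows * columns, NaN included
non_null_per_column = df.count()   # non-NaN values, one entry per column

print(total_cells)
print(non_null_per_column)
```

Here size is a single integer for the whole frame, whereas count gives one number per column, which is why the two can look so different.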

Please also consider posting your original dataframe (or a sample), so that answers can be more specific and helpful to you.

Count and Sort with Pandas

I think you need to add reset_index and then pass ascending=False to sort_values, because sort returns:

FutureWarning: sort(columns=....) is deprecated, use sort_values(by=.....)
.sort_values(['count'], ascending=False)

df = df[['STNAME','CTYNAME']].groupby(['STNAME'])['CTYNAME'] \
                             .count() \
                             .reset_index(name='count') \
                             .sort_values(['count'], ascending=False) \
                             .head(5)

Sample:

df = pd.DataFrame({'STNAME':list('abscscbcdbcsscae'),
                   'CTYNAME':[4,5,6,5,6,2,3,4,5,6,4,5,4,3,6,5]})

print (df)
    CTYNAME STNAME
0         4      a
1         5      b
2         6      s
3         5      c
4         6      s
5         2      c
6         3      b
7         4      c
8         5      d
9         6      b
10        4      c
11        5      s
12        4      s
13        3      c
14        6      a
15        5      e

df = df[['STNAME','CTYNAME']].groupby(['STNAME'])['CTYNAME'] \
                             .count() \
                             .reset_index(name='count') \
                             .sort_values(['count'], ascending=False) \
                             .head(5)

print (df)
  STNAME  count
2      c      5
5      s      4
1      b      3
0      a      2
3      d      1

But it seems what you really need is Series.nlargest:

df = df[['STNAME','CTYNAME']].groupby(['STNAME'])['CTYNAME'].count().nlargest(5)

or:

df = df[['STNAME','CTYNAME']].groupby(['STNAME'])['CTYNAME'].size().nlargest(5)

The difference between size and count is:

size counts NaN values, count does not.

Sample:

df = pd.DataFrame({'STNAME':list('abscscbcdbcsscae'),
                   'CTYNAME':[4,5,6,5,6,2,3,4,5,6,4,5,4,3,6,5]})

print (df)
    CTYNAME STNAME
0         4      a
1         5      b
2         6      s
3         5      c
4         6      s
5         2      c
6         3      b
7         4      c
8         5      d
9         6      b
10        4      c
11        5      s
12        4      s
13        3      c
14        6      a
15        5      e

df = df[['STNAME','CTYNAME']].groupby(['STNAME'])['CTYNAME'] \
                             .size() \
                             .nlargest(5) \
                             .reset_index(name='top5')
print (df)
  STNAME  top5
0      c     5
1      s     4
2      b     3
3      a     2
4      d     1
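The sample above has no missing values, so size and count give the same top 5. A minimal sketch with a made-up frame that does contain NaN shows where they diverge (the frame and variable names are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical data: some CTYNAME values are missing.
demo = pd.DataFrame({
    'STNAME': list('aabbbc'),
    'CTYNAME': [1.0, np.nan, 2.0, 3.0, np.nan, np.nan],
})
by_state = demo.groupby('STNAME')['CTYNAME']

top_by_size = by_state.size().nlargest(2)    # rows per group, NaN included
top_by_count = by_state.count().nlargest(2)  # non-NaN values per group

print(top_by_size)
print(top_by_count)
```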

Pandas group by column and count values

You can try groupby.agg:

d = dict(zip(['sum','count'],['Positive','Both']))
# astype(int) turns the boolean mask into 0/1 (Series.view is deprecated in recent pandas)
(df['result'].eq('Positive').astype(int).groupby(df['code'])
 .agg(['sum','count']).rename(columns=d))

        Positive  Both
code                  
2069.0         1     3
2070.0         1     2

Pandas, groupby and count

You seem to want to group by several columns at once:

df.groupby(['revenue','session','user_id'])['user_id'].count()

This should give you what you want.
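A minimal sketch of that call on made-up data (only the column names come from the question; the rows and the name n are assumptions):

```python
import pandas as pd

# Hypothetical rows using the question's column names.
visits = pd.DataFrame({
    'revenue': [10, 10, 20, 20, 20],
    'session': [1, 1, 1, 2, 2],
    'user_id': ['u1', 'u1', 'u2', 'u2', 'u2'],
})

# One count per distinct (revenue, session, user_id) combination.
counts = visits.groupby(['revenue', 'session', 'user_id'])['user_id'].count()

# Flatten the MultiIndex into ordinary columns; 'n' is an arbitrary name.
flat = counts.reset_index(name='n')
print(flat)
```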

What is the difference between sum() and count() in pandas?

sum() adds the values together, e.g. 1 + 0 = 1. If the data is 3 and 3, it returns 6.

count() returns the number of non-NaN values, so for the same data it returns 2.
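The same numbers as a quick sketch, with one NaN appended to show that count skips it while len does not:

```python
import numpy as np
import pandas as pd

s = pd.Series([3, 3])
total = s.sum()        # 3 + 3
n_values = s.count()   # two non-NaN values

with_gap = pd.Series([3, 3, np.nan])
total_gap = with_gap.sum()       # NaN is skipped by default
n_values_gap = with_gap.count()  # NaN is not counted
n_rows_gap = len(with_gap)       # NaN still occupies a row

print(total, n_values)
print(total_gap, n_values_gap, n_rows_gap)
```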

Need pandas groupby.count() or groupby.size.unstack() to output a dataframe I can use

Try:

x = df.pivot_table(
    index=["Animal", "Year"], columns="Value", aggfunc="size", fill_value=0
).reset_index()
x.columns.name = None
print(x)

Prints:

   Animal  Year  A  B
0       1  2019  0  2
1       1  2020  2  0
2       2  2020  1  0

How to make pandas groupby().count() sum values rather than rows?

I think what you want is GroupBy.sum, which adds up the values in each group instead of counting the rows.
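A small sketch of the difference, on a made-up frame (the column names group and value are assumptions):

```python
import pandas as pd

# Hypothetical data.
df = pd.DataFrame({'group': ['x', 'x', 'y'], 'value': [2, 3, 5]})

row_counts = df.groupby('group')['value'].count()  # how many values per group
totals = df.groupby('group')['value'].sum()        # the values added together

print(row_counts)
print(totals)
```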

How do I get the row count of a Pandas DataFrame?

For a dataframe df, one can use any of the following:

len(df.index)
df.shape[0]
df[df.columns[0]].count() (== number of non-NaN values in first column)
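Note that the third option only matches the other two when the first column has no missing values. A quick sketch on a hypothetical frame:

```python
import numpy as np
import pandas as pd

# Hypothetical frame whose first column contains one NaN.
df = pd.DataFrame({'a': [1.0, np.nan, 3.0], 'b': [4, 5, 6]})

rows_by_len = len(df.index)
rows_by_shape = df.shape[0]
non_nan_first_col = df[df.columns[0]].count()

print(rows_by_len, rows_by_shape, non_nan_first_col)
```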

Performance plot

Code to reproduce the plot:

import numpy as np
import pandas as pd
import perfplot

perfplot.save(
    "out.png",
    setup=lambda n: pd.DataFrame(np.arange(n * 3).reshape(n, 3)),
    n_range=[2**k for k in range(25)],
    kernels=[
        lambda df: len(df.index),
        lambda df: df.shape[0],
        lambda df: df[df.columns[0]].count(),
    ],
    labels=["len(df.index)", "df.shape[0]", "df[df.columns[0]].count()"],
    xlabel="Number of rows",
)