Pandas Counting and Summing Specific Conditions

Pandas counting and summing specific conditions

First make a conditional selection, then add up the selected rows with sum():

>> import pandas as pd
>> df = pd.DataFrame({'a': [1, 2, 3]})
>> df[df.a > 1].sum()
a    5
dtype: int64

Having more than one condition:

>> df[(df.a > 1) & (df.a < 3)].sum()
a    2
dtype: int64

For a COUNTIF equivalent, replace sum() with count(), which counts the non-missing values in the selection.
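As a minimal sketch of the SUMIF/COUNTIF pattern described above, the snippet below reuses the same three-row frame. The variable names (mask, sumif, countif, countif_from_mask) are only illustrative and are not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3]})
mask = df["a"] > 1                      # boolean Series: False, True, True

sumif = df.loc[mask, "a"].sum()         # add up the values that pass the condition
countif = df.loc[mask, "a"].count()     # count the non-missing values that pass
countif_from_mask = int(mask.sum())     # each True counts as 1

print(sumif, countif, countif_from_mask)  # 5 2 2
```

Summing the boolean mask directly avoids building the filtered frame, which can be convenient when only the count is needed.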

Pandas counting and summing specific conditions returns only nan

Instead of looping with apply, you can use a vectorized solution: convert the columns to NumPy arrays, compare them with broadcasting, chain the two conditions with &, and count the True values with sum:

a = df['datet']
b = a + pd.Timedelta(days=1)
c = a - pd.Timedelta(days=1)

mask = (a.to_numpy() <= b.to_numpy()[:, None]) & (a.to_numpy() >= c.to_numpy()[:, None])

df["caseIntensity"] = mask.sum(axis=1)
print (df)
       datet  caseIntensity
0 2020-03-04              2
1 2020-03-05              2
2 2020-03-09              2
3 2020-03-10              3
4 2020-03-11              3
5 2020-03-12              2

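Here is the performance for 6k rows, compared with the apply-based approach (get_dates_in_range is the row-wise function from the question and is not shown here):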

df = pd.DataFrame({'datet': [pd.to_datetime("2020-03-04 00:00:00"), pd.to_datetime("2020-03-05 00:00:00"),\
pd.to_datetime("2020-03-09 00:00:00"), pd.to_datetime("2020-03-10 00:00:00"),\
pd.to_datetime("2020-03-11 00:00:00"), pd.to_datetime("2020-03-12 00:00:00")]})
df = pd.concat([df] * 1000, ignore_index=True)


In [140]: %%timeit
...: a = df['datet']
...: b = a + pd.Timedelta(days=1)
...: c = a - pd.Timedelta(days=1)
...:
...: mask = (a.to_numpy() <= b.to_numpy()[:, None]) & (a.to_numpy() >= c.to_numpy()[:, None])
...:
...: df["caseIntensity"] = mask.sum(axis=1)
...:
469 ms ± 16.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [141]: %%timeit
...: df["caseIntensity1"] = df.apply(lambda row: get_dates_in_range(df, row), axis=1)
...:
...:
6.2 s ± 368 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
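The broadcast mask above has one cell per pair of rows, so its memory use grows quadratically with the number of rows. For larger frames, one possible alternative (not part of the original answer) is to sort the dates once and count the window with numpy.searchsorted. The sketch below assumes the same six dates and the same inclusive window of plus or minus one day.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"datet": pd.to_datetime([
    "2020-03-04", "2020-03-05", "2020-03-09",
    "2020-03-10", "2020-03-11", "2020-03-12",
])})

dates = df["datet"].to_numpy()
sorted_dates = np.sort(dates)           # searchsorted needs sorted input
one_day = np.timedelta64(1, "D")

# number of dates <= date + 1 day, minus number of dates < date - 1 day
upper = np.searchsorted(sorted_dates, dates + one_day, side="right")
lower = np.searchsorted(sorted_dates, dates - one_day, side="left")

df["caseIntensity"] = upper - lower
print(df)
```

This should give the same counts as the mask-based version (2, 2, 2, 3, 3, 2) while only holding arrays of length n in memory.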

Counting the number of rows that meet certain sum condition in pandas dataframe

cumsum + idxmax should work. This returns the index label of the first row at which the running total of A exceeds 5:

df.A.cumsum().gt(5).idxmax()
3
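To make the intermediate steps visible, here is a small sketch on hypothetical data (the values in column A are made up for illustration). It also guards against the case where the running total never exceeds the threshold, because idxmax on an all-False Series returns the first label.

```python
import pandas as pd

# Hypothetical data, chosen so that the running total passes 5 at label 3
df = pd.DataFrame({"A": [1, 2, 1, 3, 4]})

running = df["A"].cumsum()      # 1, 3, 4, 7, 11
exceeds = running.gt(5)         # False, False, False, True, True

# idxmax returns the label of the first True; fall back to None if there is none
first_label = exceeds.idxmax() if exceeds.any() else None
print(first_label)              # 3
```

With the default RangeIndex, the number of rows needed to pass the threshold is first_label + 1.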

Pandas Pivot Table counting based on condition and sum columns

Use:

dft=df.pivot_table(values='sold_kg',columns='day', index='Product', aggfunc=['sum','size'])

First flatten the MultiIndex in the columns with map:

dft.columns = dft.columns.map(lambda x: f'{x[0]}_{x[1]}')

Then select the sum columns with DataFrame.filter and add them up. To count the values greater than or equal to 2, compare the size columns with DataFrame.ge and count the True values with sum:

dft['Fruit Total'] = dft.filter(like='sum').sum(axis=1)

dft['Count >= 2'] = dft.filter(like='size').ge(2).sum(axis=1)
print (dft)
         sum_22  sum_23  sum_25  size_22  size_23  size_25  Fruit Total  \
Product
apple         8       2       2        3        1        1           12
orange        7       7       2        2        2        1           16

         Count >= 2
Product
apple             1
orange            2
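The steps above can be put together in one runnable sketch. The input frame is not shown in the original, so the rows below are hypothetical, chosen only so that the totals match the printed output; the column names Product, day and sold_kg are taken from the pivot_table call.

```python
import pandas as pd

# Hypothetical input rows (assumption: one row per sale)
df = pd.DataFrame({
    "Product": ["apple"] * 5 + ["orange"] * 5,
    "day":     [22, 22, 22, 23, 25, 22, 22, 23, 23, 25],
    "sold_kg": [3, 3, 2, 2, 2, 3, 4, 5, 2, 2],
})

# One block of columns per aggregation: total kg and number of rows
dft = df.pivot_table(values="sold_kg", columns="day", index="Product",
                     aggfunc=["sum", "size"])

# Flatten ('sum', 22) into 'sum_22', and so on
dft.columns = dft.columns.map(lambda x: f"{x[0]}_{x[1]}")

# Row-wise total over the sum_* columns
dft["Fruit Total"] = dft.filter(like="sum").sum(axis=1)

# Number of days with at least two rows
dft["Count >= 2"] = dft.filter(like="size").ge(2).sum(axis=1)
print(dft)
```

Note that filter(like=...) matches substrings, so any new column whose name contains "sum" or "size" would be picked up by a later filter call.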

Python Pandas Counting and Summing columns based on datetime values

You have to loop over the dataframe, because each row must be compared with every other row. One possible improvement to the solution below is to sort by Submit_Date first, so that for the Submit_Date comparison each row only has to be compared with the records on one side of it (above or below).

result = list()
# compare every row against the whole frame
for _, cur_data in df.iterrows():
    # other rows submitted strictly inside this row's interval
    submitted_inside = (cur_data['Submit_Date'] < df['Submit_Date']) & (df['Submit_Date'] < cur_data['Resolved_Date'])
    # other rows resolved strictly inside this row's interval
    resolved_inside = (cur_data['Submit_Date'] < df['Resolved_Date']) & (df['Resolved_Date'] < cur_data['Resolved_Date'])
    result.append((submitted_inside | resolved_inside).sum())
df['count'] = result


          Submit_Date       Resolved_Date  count
1 2016-10-01 23:41:00 2016-10-02 02:27:00      2
2 2016-10-01 23:50:00 2017-03-09 19:39:00      3
3 2016-10-02 00:05:00 2016-11-15 12:46:00      2
4 2016-10-03 05:17:00 2016-11-14 17:37:00      0
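For frames small enough to hold an n-by-n boolean array, the same all-pairs comparison could also be written with NumPy broadcasting instead of an explicit loop. This is a sketch of an alternative, not the original answer's method; it rebuilds the four rows shown above and applies the same strict inequalities.

```python
import pandas as pd

df = pd.DataFrame({
    "Submit_Date": pd.to_datetime([
        "2016-10-01 23:41:00", "2016-10-01 23:50:00",
        "2016-10-02 00:05:00", "2016-10-03 05:17:00",
    ]),
    "Resolved_Date": pd.to_datetime([
        "2016-10-02 02:27:00", "2017-03-09 19:39:00",
        "2016-11-15 12:46:00", "2016-11-14 17:37:00",
    ]),
}, index=[1, 2, 3, 4])

s = df["Submit_Date"].to_numpy()
r = df["Resolved_Date"].to_numpy()

# Cell [i, j] is True when row j was submitted inside row i's interval
submitted_inside = (s[:, None] < s) & (s < r[:, None])
# Cell [i, j] is True when row j was resolved inside row i's interval
resolved_inside = (s[:, None] < r) & (r < r[:, None])

df["count"] = (submitted_inside | resolved_inside).sum(axis=1)
print(df)
```

Because the inequalities are strict, a row never counts itself.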

Count values in column with ranges given a specific condition

You need to loop here.

Either using Series.apply with a lambda function and sum:

df['ct'] = df['nv1'].apply(lambda s: sum(e<-1 for e in s))

or with a plain list comprehension:

df['ct'] = [sum(e<-1 for e in s) for s in df['nv1']]

output:

   R       an                       nv1  ct
0  1        f                    [-1.0]   0
1  2        i                    [-1.0]   0
2  3        -                        []   0
3  4        -                        []   0
4  5        f                    [-2.0]   1
5  6  c,f,i,j  [-2.0, -1.0, -3.0, -1.0]   2
6  7  c,d,e,j  [-2.0, -1.0, -2.0, -1.0]   2

If you really want empty strings in place of zeros (the := operator requires Python 3.8 or later):

df['ct'] = [S if (S:=sum(e<-1 for e in s)) else '' for s in df['nv1']]

output:

   R       an                       nv1 ct
0  1        f                    [-1.0]
1  2        i                    [-1.0]
2  3        -                        []
3  4        -                        []
4  5        f                    [-2.0]  1
5  6  c,f,i,j  [-2.0, -1.0, -3.0, -1.0]  2
6  7  c,d,e,j  [-2.0, -1.0, -2.0, -1.0]  2
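For reference, here is a self-contained version of the list-comprehension approach. The frame is reconstructed from the printed output, so treat its exact construction as an assumption; ct_blank is an illustrative name for the variant that shows empty strings instead of zeros.

```python
import pandas as pd

# Frame reconstructed from the output shown above (assumption)
df = pd.DataFrame({
    "R": [1, 2, 3, 4, 5, 6, 7],
    "an": ["f", "i", "-", "-", "f", "c,f,i,j", "c,d,e,j"],
    "nv1": [[-1.0], [-1.0], [], [], [-2.0],
            [-2.0, -1.0, -3.0, -1.0], [-2.0, -1.0, -2.0, -1.0]],
})

# Count, per row, the list elements strictly below -1
df["ct"] = [sum(e < -1 for e in s) for s in df["nv1"]]

# Variant with '' in place of 0 (the column becomes object dtype)
ct_blank = [n if n else "" for n in df["ct"]]

print(df)
print(ct_blank)
```

Keeping the numeric column and deriving the blank variant only for display avoids mixing strings and integers in a column that may be used in later calculations.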

Countif in Pandas Dataframe

Since (df[cols] == 2) outputs a df of True or False values, and True is equivalent to 1, while False is equivalent to 0, you should use sum instead of count:

Twos = (df[cols] == 2).sum(axis=1)

count counts all non-missing values, whereas sum applied to a boolean condition gives the number of values that satisfy the condition.
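The difference is easy to see on a small example. The frame and the column list cols below are hypothetical, made up only to illustrate the point.

```python
import pandas as pd

# Hypothetical data with one missing value
df = pd.DataFrame({"x": [2, 1, 2], "y": [2, None, 3], "z": [0, 2, 2]})
cols = ["x", "y", "z"]

is_two = df[cols] == 2                # DataFrame of True/False

twos = is_two.sum(axis=1)             # per-row number of cells equal to 2
wrong = is_two.count(axis=1)          # counts every cell, True or False
non_missing = df[cols].count(axis=1)  # per-row number of non-missing cells

print(twos.tolist())         # [2, 1, 2]
print(wrong.tolist())        # [3, 3, 3]
print(non_missing.tolist())  # [3, 2, 3]
```

Calling count on the boolean frame returns the row width every time, because True and False are both non-missing values.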


