How do I sum values in a column that match a given condition using pandas?
The essential idea here is to select the data you want to sum, and then sum it. The selection can be done in several different ways, a few of which are shown below.
Boolean indexing
Arguably the most common way to select the values is to use Boolean indexing.
With this method, you find the rows where column 'a' is equal to 1 and then sum the corresponding values of column 'b'. You can use loc to handle the indexing of rows and columns:
>>> df.loc[df['a'] == 1, 'b'].sum()
15
Boolean indexing can be extended to other columns. For example, if df also contained a column 'c' and we wanted to sum the rows in 'b' where 'a' was 1 and 'c' was 2, we'd write:
df.loc[(df['a'] == 1) & (df['c'] == 2), 'b'].sum()
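As a self-contained illustration, the sketch below builds a small hypothetical DataFrame (the values are made up so that the first sum matches the 15 shown above; they are not from the original question) and applies both the single-condition and the two-condition selection.

```python
import pandas as pd

# Hypothetical sample data, chosen only for illustration.
df = pd.DataFrame({
    'a': [1, 1, 1, 2, 2],
    'b': [4, 5, 6, 3, 5],
    'c': [2, 2, 3, 2, 2],
})

# Rows where a == 1, column 'b' only, then sum.
sum_a1 = df.loc[df['a'] == 1, 'b'].sum()

# Each comparison needs its own parentheses because & binds tighter than ==.
sum_a1_c2 = df.loc[(df['a'] == 1) & (df['c'] == 2), 'b'].sum()

print(sum_a1)     # 15
print(sum_a1_c2)  # 9
```

Note that the element-wise operators & and | are required here; the Python keywords and / or do not work on a boolean Series.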
Query
Another way to select the data is to use query to filter the rows you're interested in, select column 'b' and then sum:
>>> df.query("a == 1")['b'].sum()
15
Again, the method can be extended to make more complicated selections of the data:
df.query("a == 1 and c == 2")['b'].sum()
Note this is a little more concise than the Boolean indexing approach.
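When the values to match live in Python variables, query can refer to them with an @ prefix. The sketch below uses hypothetical data and variable names (target_a, target_c) and checks that the query form agrees with the equivalent Boolean mask.

```python
import pandas as pd

# Hypothetical sample data, chosen only for illustration.
df = pd.DataFrame({
    'a': [1, 1, 1, 2, 2],
    'b': [4, 5, 6, 3, 5],
    'c': [2, 2, 3, 2, 2],
})

target_a = 1
target_c = 2

# @name pulls the value of a local Python variable into the query string.
via_query = df.query("a == @target_a and c == @target_c")['b'].sum()

# The same selection written as a Boolean mask.
via_mask = df.loc[(df['a'] == target_a) & (df['c'] == target_c), 'b'].sum()

print(via_query, via_mask)  # 9 9
```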
Groupby
A third approach is to use groupby to split the DataFrame into parts according to the value in column 'a'. You can then sum each part and pull out the total for the rows where 'a' is 1:
>>> df.groupby('a')['b'].sum()[1]
15
This approach is likely to be slower than using Boolean indexing, but it is useful if you also want to check the sums for the other values in column 'a':
>>> df.groupby('a')['b'].sum()
a
1 15
2 8
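One practical detail with the groupby route is what happens when the value you ask for never occurs in column 'a'. The sketch below, on hypothetical data, stores the grouped sums once and then looks values up by label; Series.get is used to fall back to 0 for a missing group instead of raising a KeyError.

```python
import pandas as pd

# Hypothetical sample data, chosen only for illustration.
df = pd.DataFrame({
    'a': [1, 1, 1, 2, 2],
    'b': [4, 5, 6, 3, 5],
})

# One pass over the data gives the sum of 'b' for every value of 'a'.
sums = df.groupby('a')['b'].sum()

sum_for_1 = sums.loc[1]     # explicit label-based lookup
sum_for_3 = sums.get(3, 0)  # 3 never appears in 'a', so the default is returned

print(sum_for_1)  # 15
print(sum_for_3)  # 0
```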
python sum a column's value with condition
To get the sum of the positive values in the column, use the appropriate condition:
import pandas as pd
df = pd.DataFrame({'price': [12, 14, 15, 10, 2, 4, -5, -4, -3, -5, 16, 15]})
total = df.loc[df['price'] > 0, 'price'].sum()
print(total) # 88
It isn't a good idea to fill a column with a value that is unrelated to the individual rows, as is the case with this single total. But to show the logic:
# pad with zeros; otherwise 88 would be repeated on every row
df['total'] = [total] + [0] * (len(df) - 1)
print(df)
price total
0 12 88
1 14 0
2 15 0
3 10 0
4 2 0
5 4 0
6 -5 0
7 -4 0
8 -3 0
9 -5 0
10 16 0
11 15 0
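Since a single total does not really belong in a per-row column, one alternative is to keep the totals outside the DataFrame. The sketch below reuses the price data from above and collects the conditional sums in a small Series; the names is_positive and summary are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({'price': [12, 14, 15, 10, 2, 4, -5, -4, -3, -5, 16, 15]})

# Build the condition once and reuse it; ~ inverts the Boolean mask.
is_positive = df['price'] > 0
positive_total = df.loc[is_positive, 'price'].sum()
non_positive_total = df.loc[~is_positive, 'price'].sum()

# Keep the totals in a separate summary object rather than a padded column.
summary = pd.Series({'positive': positive_total, 'non_positive': non_positive_total})

print(positive_total)      # 88
print(non_positive_total)  # -17
```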
Sum column based on another column in Pandas DataFrame
I ended up using this script:
dff = df.groupby(["SINID","EXTRA"]).MONTREGL.sum().reset_index()
It works both in this test and in production.
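To make the shape of the result concrete, here is a sketch on made-up data. Only the column names SINID, EXTRA and MONTREGL come from the line above; the values are hypothetical. It also shows as_index=False, which should give the same flat result as calling reset_index afterwards.

```python
import pandas as pd

# Hypothetical values; only the column names come from the answer above.
df = pd.DataFrame({
    'SINID': [1, 1, 1, 2, 2],
    'EXTRA': ['x', 'x', 'y', 'x', 'x'],
    'MONTREGL': [100.0, 50.0, 25.0, 10.0, 5.0],
})

# Sum MONTREGL for every (SINID, EXTRA) pair, then flatten the index.
dff_reset = df.groupby(['SINID', 'EXTRA']).MONTREGL.sum().reset_index()

# as_index=False keeps the grouping keys as ordinary columns from the start.
dff = df.groupby(['SINID', 'EXTRA'], as_index=False)['MONTREGL'].sum()

print(dff['MONTREGL'].tolist())  # [150.0, 25.0, 15.0]
```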
Python: sum values in column where condition is met
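You can first group the deposit rows by "exchange", then take the cumulative sum of each group with cumsum, and finally assign the result to the rows where type is "deposit".
import pandas as pd
# The groupby cumsum keeps the original row index, so the assignment lines up with the deposit rows.
# (In recent pandas versions, .apply(np.cumsum) can add the group key to the index and break this alignment.)
df.loc[df["type"]=="deposit", "balance"] = df.loc[df["type"]=="deposit"].groupby("exchange", sort=False)["value"].cumsum()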
Finally, you can fill the missing values with a forward fill, as you mentioned.
# fillna(method='ffill') is deprecated in recent pandas versions; ffill() does the same thing
df = df.ffill()
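The steps above can be tried on a small hypothetical table. The column names follow the answer; the rows are made up. The sketch also contrasts a plain forward fill with a per-exchange forward fill, because a plain fill can carry a balance from one exchange into a row that belongs to another.

```python
import pandas as pd

# Hypothetical data; only the column names come from the answer above.
df = pd.DataFrame({
    'exchange': ['A', 'A', 'B', 'A', 'B'],
    'type': ['deposit', 'trade', 'deposit', 'deposit', 'trade'],
    'value': [10, 5, 20, 30, 7],
})

is_deposit = df['type'] == 'deposit'

# Running total of deposits per exchange; non-deposit rows become NaN
# because column assignment aligns on the row index.
df['balance'] = df.loc[is_deposit].groupby('exchange', sort=False)['value'].cumsum()

# Plain forward fill: copies the previous row's balance regardless of exchange.
global_fill = df['balance'].ffill()

# Per-exchange forward fill: only carries balances forward within the same exchange.
per_exchange_fill = df.groupby('exchange')['balance'].ffill()

print(global_fill.tolist())        # [10.0, 10.0, 20.0, 40.0, 40.0]
print(per_exchange_fill.tolist())  # [10.0, 10.0, 20.0, 40.0, 20.0]
```

Which of the two fills is appropriate depends on whether rows from different exchanges are interleaved in your data.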
Python sum values in column given a condition
You can use groupby to do this efficiently.
Assuming that the DataFrame is df:
ans = df.groupby(df['Item Code'])['Units Sold'].sum()
This is the output:
Item Code
179 3
180 5
190 8
Name: Units Sold, dtype: int64
Hope this helps!
How to sum over some columns based on condition in pandas
You can use mask. The idea is to create a boolean mask from the w columns, use it to hide the corresponding values in the p columns, and then sum what remains:
df['top_p'] = df.filter(like='p').mask(df.filter(like='w').isin(['CUSTOM_MASK','CUSTOM_UNKNOWN']).to_numpy()).sum(axis=1)
Output:
    p1   p2    p3    p4   p5      w1    w2           w3              w4           w5  top_p
0  0.1  0.2  0.10  0.11  0.3  cancel  good       thanks     CUSTOM_MASK  CUSTOM_MASK   0.40
1  0.2  0.1  0.90  0.20  0.1   hello   bad  CUSTOM_MASK  CUSTOM_UNKNOWN  CUSTOM_MASK   0.30
2  0.3  0.3  0.01  0.40  0.5      hi  ugly        great          trible          job   1.51
Before summing, the output of mask looks like:
p1 p2 p3 p4 p5
0 0.1 0.2 0.10 NaN NaN
1 0.2 0.1 NaN NaN NaN
2 0.3 0.3 0.01 0.4 0.5
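For a runnable version, the sketch below rebuilds the input from the output table shown above and splits the one-liner into steps. It selects columns with a regex instead of like='p' (an assumption about intent) so that the new top_p column is not picked up if the code is run twice.

```python
import pandas as pd

# Input reconstructed from the output table above.
df = pd.DataFrame({
    'p1': [0.1, 0.2, 0.3],
    'p2': [0.2, 0.1, 0.3],
    'p3': [0.10, 0.90, 0.01],
    'p4': [0.11, 0.20, 0.40],
    'p5': [0.3, 0.1, 0.5],
    'w1': ['cancel', 'hello', 'hi'],
    'w2': ['good', 'bad', 'ugly'],
    'w3': ['thanks', 'CUSTOM_MASK', 'great'],
    'w4': ['CUSTOM_MASK', 'CUSTOM_UNKNOWN', 'trible'],
    'w5': ['CUSTOM_MASK', 'CUSTOM_MASK', 'job'],
})

p_cols = df.filter(regex=r'^p\d+$')  # p1..p5 only
w_cols = df.filter(regex=r'^w\d+$')  # w1..w5 only

# True wherever the w value is one of the placeholder tokens.
# to_numpy() drops the labels so the mask is applied by position to the p columns.
hide = w_cols.isin(['CUSTOM_MASK', 'CUSTOM_UNKNOWN']).to_numpy()

# mask() replaces values with NaN where the condition is True;
# sum() skips NaN by default.
masked = p_cols.mask(hide)
df['top_p'] = masked.sum(axis=1)

print(df['top_p'].round(2).tolist())  # [0.4, 0.3, 1.51]
```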
Pandas: How to sum columns based on conditional of other column values?
The following should work. Here we select only the rows where the condition is met and sum across the chosen columns. Assigning the result back leaves NaN in the rows where the condition isn't met, so we call fillna on the new column:
In [67]:
df = pd.DataFrame(np.random.randn(5,3), columns=list('ABC'))
df
Out[67]:
A B C
0 0.197334 0.707852 -0.443475
1 -1.063765 -0.914877 1.585882
2 0.899477 1.064308 1.426789
3 -0.556486 -0.150080 -0.149494
4 -0.035858 0.777523 -0.453747
In [73]:
df['total'] = df.loc[df['A'] > 0,['A','B']].sum(axis=1)
df['total'] = df['total'].fillna(0)
df
Out[73]:
A B C total
0 0.197334 0.707852 -0.443475 0.905186
1 -1.063765 -0.914877 1.585882 0.000000
2 0.899477 1.064308 1.426789 1.963785
3 -0.556486 -0.150080 -0.149494 0.000000
4 -0.035858 0.777523 -0.453747 0.000000
Another approach is to call where on the sum result; it takes a second argument giving the value to use where the condition isn't met:
In [75]:
df['total'] = df[['A','B']].sum(axis=1).where(df['A'] > 0, 0)
df
Out[75]:
A B C total
0 0.197334 0.707852 -0.443475 0.905186
1 -1.063765 -0.914877 1.585882 0.000000
2 0.899477 1.064308 1.426789 1.963785
3 -0.556486 -0.150080 -0.149494 0.000000
4 -0.035858 0.777523 -0.453747 0.000000
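Because the session above uses random numbers, the results cannot be reproduced exactly. The sketch below uses a small fixed DataFrame (hypothetical values) to check that the fillna route and the where route give the same totals.

```python
import pandas as pd

# Fixed, hypothetical data so the result is reproducible.
df = pd.DataFrame({
    'A': [1.0, -1.0, 2.0],
    'B': [10.0, 20.0, 30.0],
    'C': [0.5, 0.5, 0.5],
})

cond = df['A'] > 0

# Route 1: sum only the matching rows, then fill the gaps left by alignment.
df['total_fillna'] = df.loc[cond, ['A', 'B']].sum(axis=1)
df['total_fillna'] = df['total_fillna'].fillna(0)

# Route 2: sum every row, then replace the non-matching rows with 0.
df['total_where'] = df[['A', 'B']].sum(axis=1).where(cond, 0)

print(df['total_fillna'].tolist())  # [11.0, 0.0, 32.0]
print(df['total_where'].tolist())   # [11.0, 0.0, 32.0]
```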
How do I sum up values in a column into groups that match a given condition by date in pandas?
You can extract the first number from each age group with Series.str.extract, compare it with 60, and use np.where to assign one of 2 groups:
m = df['AgeGroup'].str.extract(r'(\d+)', expand=False).astype(int) < 60
df['AgeGroup'] = np.where(m, '18 - 59', '60+')
df1 = df.groupby(['Date', 'AgeGroup'])['Quantity'].sum()
print(df1)
Date AgeGroup
2020-12-08 18 - 59 7
60+ 6
2020-12-09 18 - 59 5
60+ 5
Name: Quantity, dtype: int64
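The original input is not shown, so the sketch below assumes AgeGroup holds strings such as '18 - 29' or '70+' and uses made-up quantities chosen to reproduce the totals above. It writes the two-way split to a new, hypothetical column AgeBand rather than overwriting AgeGroup.

```python
import numpy as np
import pandas as pd

# Hypothetical input; the label format and the quantities are assumptions.
df = pd.DataFrame({
    'Date': ['2020-12-08'] * 4 + ['2020-12-09'] * 4,
    'AgeGroup': ['18 - 29', '30 - 59', '60 - 69', '70+'] * 2,
    'Quantity': [3, 4, 2, 4, 1, 4, 3, 2],
})

# First run of digits in each label, e.g. '30 - 59' -> 30 and '70+' -> 70.
first_age = df['AgeGroup'].str.extract(r'(\d+)', expand=False).astype(int)

# Two bands: lower bound below 60, or 60 and above.
df['AgeBand'] = np.where(first_age < 60, '18 - 59', '60+')

result = df.groupby(['Date', 'AgeBand'])['Quantity'].sum()

print(result.tolist())  # [7, 6, 5, 5]
```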