Groupby & Sum - Create new column with added If Condition
We can use Series.where to replace the values that don't match the condition with NaN, then groupby + transform('sum'), since NaN values are ignored by 'sum' by default:
df['Overspend Total'] = (
df['Variance'].where(df['Variance'] > 0).groupby(df['ID']).transform('sum')
)
Or explicitly replace with the additive identity (0) which will not affect the sum:
df['Overspend Total'] = (
df['Variance'].where(df['Variance'] > 0, 0)
.groupby(df['ID']).transform('sum')
)
Or with a lambda inside groupby + transform:
df['Overspend Total'] = df.groupby('ID')['Variance'].transform(
lambda s: s[s > 0].sum()
)
In any case, df is:
ID Start End Variance Overspend Total
0 1 100000.00 120000.00 20000.00 20000.0
1 1 1.00 0.00 -1.00 20000.0
2 1 7815.58 7815.58 0.00 20000.0
3 1 5261.00 5261.00 0.00 20000.0
4 1 138783.20 89969.37 -48813.83 20000.0
5 1 2459.92 2459.92 0.00 20000.0
6 2 101421.99 93387.45 -8034.54 3000.0
7 2 940.04 940.04 0.00 3000.0
8 2 63.06 63.06 0.00 3000.0
9 2 2454.86 2454.86 0.00 3000.0
10 2 830.00 830.00 0.00 3000.0
11 2 299.00 299.00 0.00 3000.0
12 2 14000.00 12000.00 2000.00 3000.0
13 2 1500.00 500.00 1000.00 3000.0
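As a runnable sketch of the first approach, using a small hypothetical frame in place of the question's full data:

```python
import pandas as pd

# Hypothetical frame with the same shape as the question's data
df = pd.DataFrame({
    'ID': [1, 1, 2, 2],
    'Variance': [20000.0, -1.0, 2000.0, 1000.0],
})

# Keep only positive variances (others become NaN), then sum within each ID
df['Overspend Total'] = (
    df['Variance'].where(df['Variance'] > 0).groupby(df['ID']).transform('sum')
)
print(df)
```

Each row receives the total of the positive variances of its ID group (20000.0 for ID 1, 3000.0 for ID 2), matching the pattern in the output above.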
Sum a column based on groupby and condition
The idea is to convert the time column to datetimes, floor them to 10-minute intervals, then convert back to HH:MM:SS strings:
d = {'slot1': '00:00:00', 'slot2': '00:10:00', 'slot3': '00:20:00'}
d1 = {v:k for k, v in d.items()}
df['time'] = pd.to_datetime(df['time']).dt.floor('10Min').dt.strftime('%H:%M:%S')
print (df)
region date time gap
0 1 2016-01-01 00:00:00 1
1 1 2016-01-01 00:00:00 0
2 1 2016-01-01 00:00:00 1
3 1 2016-01-01 00:00:00 0
4 1 2016-01-01 00:10:00 0
5 1 2016-01-01 00:10:00 1
6 1 2016-01-01 00:10:00 0
7 1 2016-01-01 00:10:00 0
Then aggregate with sum and finally map the values using the dictionary with keys and values swapped:
regres = df.groupby(['region','date','time'], as_index=False)['gap'].sum()
regres['time'] = regres['time'] + '/' + regres['time'].map(d1)
print (regres)
region date time gap
0 1 2016-01-01 00:00:00/slot1 2
1 1 2016-01-01 00:10:00/slot2 1
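The whole flow can be sketched end to end; the input frame below is reconstructed to match the printed output above:

```python
import pandas as pd

# Hypothetical input matching the printed frame above
df = pd.DataFrame({
    'region': [1] * 8,
    'date': ['2016-01-01'] * 8,
    'time': ['00:00:00', '00:01:00', '00:05:00', '00:09:00',
             '00:10:00', '00:12:00', '00:15:00', '00:19:00'],
    'gap': [1, 0, 1, 0, 0, 1, 0, 0],
})

d = {'slot1': '00:00:00', 'slot2': '00:10:00', 'slot3': '00:20:00'}
d1 = {v: k for k, v in d.items()}

# Floor each time to its 10-minute slot, then label it via the swapped dict
df['time'] = pd.to_datetime(df['time']).dt.floor('10Min').dt.strftime('%H:%M:%S')
regres = df.groupby(['region', 'date', 'time'], as_index=False)['gap'].sum()
regres['time'] = regres['time'] + '/' + regres['time'].map(d1)
print(regres)
```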
If you want to display the next 10-minute slot instead:
d = {'slot1': '00:00:00', 'slot2': '00:10:00', 'slot3': '00:20:00'}
d1 = {v:k for k, v in d.items()}
times = pd.to_datetime(df['time']).dt.floor('10Min')
df['time'] = times.dt.strftime('%H:%M:%S')
df['time1'] = times.add(pd.Timedelta('10Min')).dt.strftime('%H:%M:%S')
print (df)
region date time gap time1
0 1 2016-01-01 00:00:00 1 00:10:00
1 1 2016-01-01 00:00:00 0 00:10:00
2 1 2016-01-01 00:00:00 1 00:10:00
3 1 2016-01-01 00:00:00 0 00:10:00
4 1 2016-01-01 00:10:00 0 00:20:00
5 1 2016-01-01 00:10:00 1 00:20:00
6 1 2016-01-01 00:10:00 0 00:20:00
7 1 2016-01-01 00:10:00 0 00:20:00
regres = df.groupby(['region','date','time','time1'], as_index=False)['gap'].sum()
regres['time'] = regres.pop('time1') + '/' + regres['time'].map(d1)
print (regres)
region date time gap
0 1 2016-01-01 00:10:00/slot1 2
1 1 2016-01-01 00:20:00/slot2 1
EDIT:
An improvement over flooring and converting to strings is to use binning with cut or searchsorted:
import numpy as np
import pandas as pd

df['time'] = pd.to_timedelta(df['time'])
bins = pd.timedelta_range('00:00:00', '24:00:00', freq='10Min')
labels = np.array([str(x)[-8:] for x in bins])[:-1]
df['time1'] = pd.cut(df['time'], bins=bins, labels=labels)
df['time11'] = labels[np.searchsorted(bins, df['time'].values) - 1]
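A runnable sketch of the binning approach with a few hypothetical timestamps:

```python
import numpy as np
import pandas as pd

# Small hypothetical frame to show both binning approaches
df = pd.DataFrame({'time': ['00:03:20', '00:11:05', '00:19:59']})

df['time'] = pd.to_timedelta(df['time'])
bins = pd.timedelta_range('00:00:00', '24:00:00', freq='10Min')
# Each label is the HH:MM:SS of the left edge of a 10-minute bin
labels = np.array([str(x)[-8:] for x in bins])[:-1]

# cut assigns each timedelta to its 10-minute bin's label
df['time1'] = pd.cut(df['time'], bins=bins, labels=labels)
# searchsorted finds the insertion index; -1 selects the left bin edge
df['time11'] = labels[np.searchsorted(bins, df['time'].values) - 1]
print(df)
```

Both columns agree here: 00:03:20 falls in the 00:00:00 slot, while 00:11:05 and 00:19:59 fall in the 00:10:00 slot.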
How to SUM a column by different conditions, then group by date
Use conditional aggregation:
SELECT delivery_date, SUM(subtotal) AS TotalSale,
       SUM(CASE WHEN State = 'Rejected by business' THEN subtotal ELSE 0 END) AS Rejected,
       SUM(CASE WHEN State = 'Delivery Completed' THEN subtotal ELSE 0 END) AS actual
FROM table_name
GROUP BY delivery_date
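An equivalent sketch in pandas, assuming a frame with the same column names as the query:

```python
import pandas as pd

# Hypothetical data matching the query's columns
df = pd.DataFrame({
    'delivery_date': ['2023-01-01', '2023-01-01', '2023-01-02'],
    'State': ['Rejected by business', 'Delivery Completed', 'Delivery Completed'],
    'subtotal': [10.0, 20.0, 30.0],
})

# Mask each conditional column to 0 outside its condition, then sum per date
out = (
    df.assign(
        Rejected=df['subtotal'].where(df['State'] == 'Rejected by business', 0),
        actual=df['subtotal'].where(df['State'] == 'Delivery Completed', 0),
    )
    .groupby('delivery_date')
    .agg(TotalSale=('subtotal', 'sum'),
         Rejected=('Rejected', 'sum'),
         actual=('actual', 'sum'))
)
print(out)
```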
Groupby multiple columns & Sum - Create new column with added If Condition
Cause of error
- The syntax to select multiple columns, df['column1', 'column2'], is wrong. It should be df[['column1', 'column2']].
- Even if you use df[['column1', 'column2']] for groupby, pandas will raise another error complaining that the grouper should be one dimensional. This is because df[['column1', 'column2']] returns a dataframe, which is a two-dimensional object.
How to fix the error?
Hard way:
Pass each of the grouping columns as a one-dimensional Series to groupby:
df['new_column'] = (
df['value']
.where(df['value'] > 0)
.groupby([df['column1'], df['column2']]) # Notice the change
.transform('sum')
)
Easy way:
First assign the masked column values to the target column, then do groupby + transform as you would normally do:
df['new_column'] = df['value'].where(df['value'] > 0)
df['new_column'] = df.groupby(['column1', 'column2'])['new_column'].transform('sum')
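A minimal runnable sketch of the easy way, using a small hypothetical frame:

```python
import pandas as pd

# Hypothetical frame with two grouping columns
df = pd.DataFrame({
    'column1': ['a', 'a', 'b', 'b'],
    'column2': [1, 1, 1, 2],
    'value': [5, -3, 2, -1],
})

# Mask out non-positive values, then sum within each (column1, column2) group
df['new_column'] = df['value'].where(df['value'] > 0)
df['new_column'] = df.groupby(['column1', 'column2'])['new_column'].transform('sum')
print(df)
```

Note that a group with no positive values, like ('b', 2) here, gets 0.0 rather than NaN, because sum skips NaN by default.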
How can I sum rows of a column based on an index condition to create a % of group column?
First of all, I would like to commend you for working row by row. I still use loops from time to time, because they can be easier for someone else to read and understand without running the code.
That said, for this solution I have written a couple of one-liners; let me explain each.
df['% Quantity of Menu'] = ((df['Sales Quantity']/df['Sales Quantity'].sum())*100).round(2)
For your first problem, instead of looping row by row, this divides each column value by a scalar (the column total, df['Sales Quantity'].sum()), multiplies the ratio by 100 for a percentage, then rounds to 2 decimal places.
df['%Qty of Menu Category'] = ((df['Sales Quantity']/df.groupby(['Menu Category'])['Sales Quantity'].transform('sum'))*100).round(2)
For the second problem, we need to divide each value by the total of its corresponding category instead of the whole column. So we get the per-category totals with df.groupby(['Menu Category'])['Sales Quantity'].transform('sum'), then proceed as in the first one-liner, replacing that portion of the code.
Why use df.groupby(['Menu Category'])['Sales Quantity'].transform('sum') instead of df.groupby(['Menu Category'])['Sales Quantity'].sum()? Because a Series can be divided either by a scalar or by a Series of the same length, and transform gives us a Series aligned with the original column, while sum returns one value per category.
df['Sales Quantity']
0 100
1 50
2 40
3 200
4 400
5 250
6 100
7 120
8 50
Name: Sales Quantity, dtype: int64
df.groupby(['Menu Category'])['Sales Quantity'].transform('sum')
0 190
1 190
2 190
3 850
4 850
5 850
6 270
7 270
8 270
Name: Sales Quantity, dtype: int64
df.groupby(['Menu Category'])['Sales Quantity'].sum()
Menu Category
Appetizers 190
Desserts 270
Mains 850
Name: Sales Quantity, dtype: int64
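Putting it together as a runnable sketch; the Menu Category assignment below is inferred from the per-category totals shown above:

```python
import pandas as pd

# Reconstructed from the Series outputs above; category order inferred
df = pd.DataFrame({
    'Menu Category': ['Appetizers'] * 3 + ['Mains'] * 3 + ['Desserts'] * 3,
    'Sales Quantity': [100, 50, 40, 200, 400, 250, 100, 120, 50],
})

# Share of each item in the whole menu
df['% Quantity of Menu'] = (
    (df['Sales Quantity'] / df['Sales Quantity'].sum()) * 100
).round(2)
# Share of each item within its own category
df['%Qty of Menu Category'] = (
    (df['Sales Quantity'] / df.groupby(['Menu Category'])['Sales Quantity'].transform('sum')) * 100
).round(2)
print(df)
```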
Pandas: Group by and conditional sum based on value of current row
Try:
df["total_successfully_previously_paid"] = (df["payment_successful"].mul(df["order_value"])
.groupby(df["customer_nr"])
.transform(lambda x: x.cumsum().shift().fillna(0))
)
>>> df
customer_nr ... total_successfully_previously_paid
0 1 ... 0.0
1 1 ... 50.0
2 1 ... 50.0
3 2 ... 0.0
4 2 ... 55.0
5 2 ... 355.0
[6 rows x 5 columns]
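The input frame is elided in the output above; here is a sketch with hypothetical order values chosen to reproduce it:

```python
import pandas as pd

# Hypothetical input reconstructed to match the output above
df = pd.DataFrame({
    'customer_nr': [1, 1, 1, 2, 2, 2],
    'order_value': [50, 100, 30, 55, 300, 70],
    'payment_successful': [True, False, True, True, True, False],
})

# Running sum of successfully paid orders, shifted so each row
# sees only strictly earlier orders for the same customer
df['total_successfully_previously_paid'] = (
    df['payment_successful'].mul(df['order_value'])
    .groupby(df['customer_nr'])
    .transform(lambda x: x.cumsum().shift().fillna(0))
)
print(df)
```

Multiplying by the boolean column zeroes out failed payments, cumsum accumulates per customer, and shift pushes the total down one row so the current order is excluded.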
Pandas multiple groupby and sum if conditions
Here is a way that works, similar to your attempt. The idea is to replace the values in D with 0 where column C is over 6, then groupby.transform with 'sum':
df['E'] = (
    df['D'].where(df['C'].le(6), other=0)
    .groupby([df['A'], df['B']])
    .transform('sum')
)
print(df)
# A B C D E
# 0 75987 1 0 0 4
# 1 75987 1 1 1 4
# 2 75987 2 1 1 6
# 3 75987 2 2 1 6
# 4 75987 2 6 4 6
# 5 75987 1 6 2 4
# 6 75987 1 6 1 4
# 7 59221 2 18 4 0
# 8 59221 1 18 0 0
# 9 59221 2 18 1 0
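As a self-contained sketch, rebuilding the frame from the printed output:

```python
import pandas as pd

# Data taken from the printed frame above
df = pd.DataFrame({
    'A': [75987] * 7 + [59221] * 3,
    'B': [1, 1, 2, 2, 2, 1, 1, 2, 1, 2],
    'C': [0, 1, 1, 2, 6, 6, 6, 18, 18, 18],
    'D': [0, 1, 1, 1, 4, 2, 1, 4, 0, 1],
})

# Zero out D wherever C exceeds 6, then sum within each (A, B) group
df['E'] = (
    df['D'].where(df['C'].le(6), other=0)
    .groupby([df['A'], df['B']])
    .transform('sum')
)
print(df)
```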