Groupby & Sum - Create new column with added If Condition
We can use Series.where to replace the values that don't match the condition with NaN, then groupby transform 'sum', since NaN values are ignored by 'sum' by default:
df['Overspend Total'] = (
df['Variance'].where(df['Variance'] > 0).groupby(df['ID']).transform('sum')
)
Or explicitly replace with the additive identity (0) which will not affect the sum:
df['Overspend Total'] = (
df['Variance'].where(df['Variance'] > 0, 0)
.groupby(df['ID']).transform('sum')
)
Or with a lambda inside groupby transform:
df['Overspend Total'] = df.groupby('ID')['Variance'].transform(
lambda s: s[s > 0].sum()
)
In any case, df is:
ID Start End Variance Overspend Total
0 1 100000.00 120000.00 20000.00 20000.0
1 1 1.00 0.00 -1.00 20000.0
2 1 7815.58 7815.58 0.00 20000.0
3 1 5261.00 5261.00 0.00 20000.0
4 1 138783.20 89969.37 -48813.83 20000.0
5 1 2459.92 2459.92 0.00 20000.0
6 2 101421.99 93387.45 -8034.54 3000.0
7 2 940.04 940.04 0.00 3000.0
8 2 63.06 63.06 0.00 3000.0
9 2 2454.86 2454.86 0.00 3000.0
10 2 830.00 830.00 0.00 3000.0
11 2 299.00 299.00 0.00 3000.0
12 2 14000.00 12000.00 2000.00 3000.0
13 2 1500.00 500.00 1000.00 3000.0
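As a runnable sketch of the first approach (the DataFrame here is a small hypothetical subset, not the full data above):

```python
import pandas as pd

# Hypothetical subset of the data above; column names match the answer.
df = pd.DataFrame({
    'ID': [1, 1, 1, 2, 2],
    'Variance': [20000.0, -1.0, 0.0, 2000.0, 1000.0],
})

# Keep only positive variances (others become NaN), then broadcast
# the per-ID sum back onto every row of the group.
df['Overspend Total'] = (
    df['Variance'].where(df['Variance'] > 0).groupby(df['ID']).transform('sum')
)
print(df['Overspend Total'].tolist())  # [20000.0, 20000.0, 20000.0, 3000.0, 3000.0]
```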
Pandas create new column with count from groupby
That's not a new column, that's a new DataFrame:
In [11]: df.groupby(["item", "color"]).count()
Out[11]:
id
item color
car black 2
truck blue 1
red 2
To get the result you want, use reset_index:
In [12]: df.groupby(["item", "color"])["id"].count().reset_index(name="count")
Out[12]:
item color count
0 car black 2
1 truck blue 1
2 truck red 2
To get a "new column" you could use transform:
In [13]: df.groupby(["item", "color"])["id"].transform("count")
Out[13]:
0 2
1 2
2 2
3 1
4 2
dtype: int64
I recommend reading the split-apply-combine section of the docs.
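A minimal sketch contrasting the two forms, on a hypothetical item/color frame:

```python
import pandas as pd

# Hypothetical frame matching the item/color example above.
df = pd.DataFrame({
    'id': [1, 2, 3, 4, 5],
    'item': ['car', 'car', 'truck', 'truck', 'truck'],
    'color': ['black', 'black', 'blue', 'red', 'red'],
})

# reset_index(name=...) flattens the grouped count into a new DataFrame...
counts = df.groupby(['item', 'color'])['id'].count().reset_index(name='count')

# ...while transform broadcasts each group's count back onto the original rows.
df['count'] = df.groupby(['item', 'color'])['id'].transform('count')
```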
How to create a new column that increments within a subgroup of a group in Python?
You could use groupby + ngroup:
df['colC'] = df.groupby('colA').apply(lambda x: x.groupby('colB').ngroup()+1).droplevel(0)
Output:
colA colB colC
0 1 a 1
1 1 a 1
2 1 c 2
3 1 c 2
4 1 f 3
5 1 z 4
6 1 z 4
7 1 z 4
8 2 a 1
9 2 b 2
10 2 b 2
11 2 b 2
12 3 c 1
13 3 d 2
14 3 k 3
15 3 k 3
16 3 m 4
17 3 m 4
18 3 m 4
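A runnable sketch of the same idea on a smaller hypothetical frame:

```python
import pandas as pd

# Hypothetical frame with two outer groups.
df = pd.DataFrame({'colA': [1, 1, 1, 2, 2],
                   'colB': ['a', 'a', 'c', 'b', 'b']})

# Within each colA group, ngroup() numbers the colB subgroups 0, 1, ...;
# +1 makes the numbering 1-based, and droplevel(0) drops the added colA
# index level so the result aligns with df's original index.
df['colC'] = (df.groupby('colA')
                .apply(lambda x: x.groupby('colB').ngroup() + 1)
                .droplevel(0))
```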
pandas - groupby a column, apply a function to create a new column - giving incorrect results
You can remove .values:
df['num_col1_SMA'] = things_groupby['num_col1'].apply(pandas_rolling)
df['num_col2_SMA'] = things_groupby['num_col2'].apply(pandas_rolling)
Or:
df[['num_col1_SMA', 'num_col2_SMA']] = (things_groupby[['num_col1','num_col2']]
.apply(pandas_rolling))
If you want to avoid groupby.apply, it is necessary to remove the first level of the resulting MultiIndex:
df[['num_col1_SMA', 'num_col2_SMA']] = (things_groupby[['num_col1','num_col2']]
.rolling(window=N)
.mean()
.droplevel(0))
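Assuming things_groupby is a groupby over some key column and N is the window size (both come from the question, so the names and data here are hypothetical), the last variant runs like this:

```python
import pandas as pd

# Hypothetical setup: 'thing' is the grouping key, N the rolling window.
df = pd.DataFrame({'thing': ['x', 'x', 'x', 'y', 'y', 'y'],
                   'num_col1': [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
                   'num_col2': [10.0, 20.0, 30.0, 40.0, 50.0, 60.0]})
N = 2
things_groupby = df.groupby('thing')

# groupby.rolling yields a (group key, original index) MultiIndex;
# droplevel(0) restores the original index so the assignment aligns row-by-row.
df[['num_col1_SMA', 'num_col2_SMA']] = (things_groupby[['num_col1', 'num_col2']]
    .rolling(window=N)
    .mean()
    .droplevel(0))
```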
How to assign group by sum results to new columns in Pandas
We pivot here; I am using crosstab, then merge:
s = (pd.crosstab(df.SKU, df.Calendar.dt.year, df.Quantity, aggfunc='sum')
       .fillna(0)
       .add_prefix('Year_Quantity_')
       .reset_index())
df = df.merge(s, how='left')
Calendar SKU Quantity Year_Quantity_2017 Year_Quantity_2018
0 2017-10-01 1001 10 50.0 160.0
1 2017-10-01 1002 20 70.0 80.0
2 2017-10-01 1003 30 90.0 0.0
3 2017-11-01 1001 40 50.0 160.0
4 2017-11-01 1002 50 70.0 80.0
5 2017-11-01 1003 60 90.0 0.0
6 2018-11-01 1001 70 50.0 160.0
7 2018-11-01 1002 80 70.0 80.0
8 2018-03-01 1001 90 50.0 160.0
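The approach can be reproduced end-to-end on a tiny hypothetical frame (one SKU across two years):

```python
import pandas as pd

# Hypothetical subset: one SKU observed across two years.
df = pd.DataFrame({'Calendar': pd.to_datetime(['2017-10-01', '2017-11-01', '2018-11-01']),
                   'SKU': [1001, 1001, 1001],
                   'Quantity': [10, 40, 70]})

# Pivot Quantity into one column per calendar year...
s = (pd.crosstab(df.SKU, df.Calendar.dt.year, df.Quantity, aggfunc='sum')
       .fillna(0)
       .add_prefix('Year_Quantity_')
       .reset_index())

# ...then merge the yearly totals back onto every original row (joining on SKU).
df = df.merge(s, how='left')
```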
create a new column with pandas groupby division between two columns excluding the current row
Try with transform:
g = df.groupby('Group')
df['New'] = (g['Col_2'].transform('sum')-df.Col_2)/(g['Col_1'].transform('sum')-df.Col_1)
df
Out[339]:
Group Col_1 Col_2 New
0 A 100 55 0.286000
1 A 200 66 0.330000
2 A 300 77 0.403333
3 B 400 88 0.198000
4 B 500 99 0.220000
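The same leave-one-out idea as a self-contained sketch on a hypothetical frame:

```python
import pandas as pd

# Hypothetical frame with the column layout above.
df = pd.DataFrame({'Group': ['A', 'A', 'B', 'B'],
                   'Col_1': [100, 200, 400, 500],
                   'Col_2': [55, 66, 88, 99]})

g = df.groupby('Group')
# Group total minus the row's own value = the sum over the *other* rows
# of the group, applied to numerator and denominator alike.
df['New'] = ((g['Col_2'].transform('sum') - df.Col_2)
             / (g['Col_1'].transform('sum') - df.Col_1))
```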
Pandas create new column base on groupby and apply lambda if statement
Use GroupBy.transform with a lambda function, then compare against the threshold and convert the True/False result to 1/0 integers:
from scipy import stats
s = df.groupby('A')['B'].transform(lambda x: np.abs(stats.zscore(x, nan_policy='omit')))
df['C'] = (s > 2).astype(int)
Or use numpy.where
:
df['C'] = np.where(s > 2, 1, 0)
The error in your solution arises because zscore is computed per group and returns an array:
from scipy import stats
df = df.groupby('A')['B'].apply(lambda x: 1 if np.abs(stats.zscore(x, nan_policy='omit')) > 2 else 0)
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
This is the gotcha described in the pandas docs:
pandas follows the NumPy convention of raising an error when you try to convert something to a bool. This happens in an if-statement or when using the boolean operations: and, or, and not.
So use a vectorized comparison instead of if-else:
from scipy import stats
df = df.groupby('A')['B'].apply(lambda x: (np.abs(stats.zscore(x, nan_policy='omit')) > 2).astype(int))
print (df)
A
a [0, 0, 0]
b [0, 0, 0, 0]
Name: B, dtype: object
But then you would still need to convert the result back into a column; to avoid this problem, groupby.transform is used instead.
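A scipy-free sketch of the transform approach: the per-group z-score is computed by hand as (x - mean) / std(ddof=0), which is what stats.zscore does (the data and the 2-sigma threshold here are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical data: group 'a' contains one clear outlier.
df = pd.DataFrame({'A': ['a'] * 6 + ['b'] * 3,
                   'B': [1.0, 1.0, 1.0, 1.0, 1.0, 100.0, 5.0, 6.0, 7.0]})

# Per-group z-score by hand; (x - mean) / std(ddof=0) matches stats.zscore.
s = df.groupby('A')['B'].transform(lambda x: np.abs((x - x.mean()) / x.std(ddof=0)))
df['C'] = (s > 2).astype(int)
```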
How to create new column in pandas based on result of groupby without needing to use join
You can use transform():
df["max_date"] = df.groupby("name")['date'].transform('max')
Output:
date name max_date
0 2020-01-01 Romulo 2020-03-01
1 2020-02-01 Romulo 2020-03-01
2 2020-03-01 Romulo 2020-03-01
3 2020-01-01 Daniel 2020-03-01
4 2020-02-01 Daniel 2020-03-01
5 2020-03-01 Daniel 2020-03-01
6 2020-01-01 Fernando 2020-03-01
7 2020-02-01 Fernando 2020-03-01
8 2020-03-01 Fernando 2020-03-01
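A small runnable sketch with hypothetical dates:

```python
import pandas as pd

# Hypothetical frame with two people and out-of-order dates.
df = pd.DataFrame({'date': pd.to_datetime(['2020-01-01', '2020-03-01', '2020-02-01']),
                   'name': ['Romulo', 'Romulo', 'Daniel']})

# transform('max') broadcasts each group's latest date to all of its rows,
# so no merge/join back onto df is needed.
df['max_date'] = df.groupby('name')['date'].transform('max')
```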