Pandas fillna using groupby
If each group has only one non-NaN value, use ffill (forward filling) followed by bfill (backward filling) per group. Run both through transform with a lambda, so the result keeps the original index and can be assigned straight back:
df['three'] = (df.groupby(['one','two'], sort=False)['three']
                 .transform(lambda x: x.ffill().bfill()))
print (df)
one two three
0 1 1 10.0
1 1 1 10.0
2 1 1 10.0
3 1 2 20.0
4 1 2 20.0
5 1 2 20.0
6 1 3 NaN
7 1 3 NaN
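A minimal, self-contained sketch of the per-group ffill/bfill step. The input values below are an assumption (the original input frame is not shown), chosen so that the result matches the output above:

```python
import numpy as np
import pandas as pd

# Assumed input: one known value per (one, two) group, the last group all NaN
df = pd.DataFrame({
    "one": [1] * 8,
    "two": [1, 1, 1, 2, 2, 2, 3, 3],
    "three": [10, np.nan, np.nan, np.nan, 20, np.nan, np.nan, np.nan],
})

# Fill forward, then backward, inside each group only
df["three"] = (
    df.groupby(["one", "two"], sort=False)["three"]
      .transform(lambda x: x.ffill().bfill())
)
print(df)
```

Row 3 is filled with 20 from its own group rather than 10 from the previous one, and the all-NaN group stays NaN because it has nothing to fill from.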
If a group has several values and NaN should be replaced by a per-group constant, e.g. the group mean:
print (df)
one two three
0 1 1 10.0
1 1 1 40.0
2 1 1 NaN
3 1 2 NaN
4 1 2 20.0
5 1 2 NaN
6 1 3 NaN
7 1 3 NaN
df['three'] = (df.groupby(['one','two'], sort=False)['three']
                 .transform(lambda x: x.fillna(x.mean())))
print (df)
one two three
0 1 1 10.0
1 1 1 40.0
2 1 1 25.0
3 1 2 20.0
4 1 2 20.0
5 1 2 20.0
6 1 3 NaN
7 1 3 NaN
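For reference, the same result can be reproduced end to end with the frame printed above. This sketch uses the built-in transform("mean") instead of a lambda; that is a swapped-in spelling, not the code from the answer:

```python
import numpy as np
import pandas as pd

# The frame shown above
df = pd.DataFrame({
    "one": [1] * 8,
    "two": [1, 1, 1, 2, 2, 2, 3, 3],
    "three": [10, 40, np.nan, np.nan, 20, np.nan, np.nan, np.nan],
})

# Per-row group mean, aligned with the original index
group_mean = df.groupby(["one", "two"], sort=False)["three"].transform("mean")
df["three"] = df["three"].fillna(group_mean)
print(df)
```

The group (1, 3) has no values at all, so its mean is NaN and its rows stay NaN.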
How to fillna by groupby outputs in pandas?
Filling from transform('mean') is faster than apply:
df['D'].fillna(df.groupby(['A','B','C'])['D'].transform('mean'))
In [2400]: df
Out[2400]:
A B C D
0 1 1 1 1.0
1 1 1 1 NaN
2 1 1 1 3.0
3 3 3 3 5.0
In [2401]: df['D'].fillna(df.groupby(['A','B','C'])['D'].transform('mean'))
Out[2401]:
0 1.0
1 2.0
2 3.0
3 5.0
Name: D, dtype: float64
In [2402]: df['D'] = df['D'].fillna(df.groupby(['A','B','C'])['D'].transform('mean'))
In [2403]: df
Out[2403]:
A B C D
0 1 1 1 1.0
1 1 1 1 2.0
2 1 1 1 3.0
3 3 3 3 5.0
Details
In [2396]: df.shape
Out[2396]: (10000, 4)
In [2398]: %timeit df['D'].fillna(df.groupby(['A','B','C'])['D'].transform('mean'))
100 loops, best of 3: 3.44 ms per loop
In [2397]: %timeit df.groupby(['A','B','C'])['D'].apply(lambda x: x.fillna(x.mean()))
100 loops, best of 3: 5.34 ms per loop
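The frame used for the timings is not shown, so here is a rough sketch for re-running the comparison outside IPython. The random data, the seed, and the helper names with_transform and with_apply are assumptions for illustration:

```python
import timeit

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 10_000
df = pd.DataFrame({
    "A": rng.integers(0, 5, size=n),
    "B": rng.integers(0, 5, size=n),
    "C": rng.integers(0, 5, size=n),
    "D": rng.normal(size=n),
})
# Knock out roughly 10% of D so there is something to fill
df.loc[rng.random(n) < 0.1, "D"] = np.nan

keys = ["A", "B", "C"]

def with_transform(frame):
    return frame["D"].fillna(frame.groupby(keys)["D"].transform("mean"))

def with_apply(frame):
    out = frame.groupby(keys, group_keys=False)["D"].apply(lambda x: x.fillna(x.mean()))
    return out.sort_index()  # make sure rows are back in the original order

# Both approaches should agree up to floating-point rounding
same = np.allclose(with_transform(df), with_apply(df), equal_nan=True)
print("same result:", same)
print("transform:", timeit.timeit(lambda: with_transform(df), number=10))
print("apply:    ", timeit.timeit(lambda: with_apply(df), number=10))
```

Absolute numbers depend on the machine and the pandas version, so only the relative difference is meaningful.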
Pandas fillna using groupby and mode
You can use GroupBy.transform with an if-else that returns the median for numeric columns and the mode for categorical columns:
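import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': list('ebcded'),
    'B': [np.nan, np.nan, 4, 5, 5, 4],
    'C': [7, np.nan, 9, 4, 2, 3],
    'D': [1, 3, 5, np.nan, 1, 0],
    'F': list('aaabbb'),
    'make': list('aaabbb')
})
# blank out a few entries in the categorical columns
df.loc[[2,4], 'A'] = np.nan
df.loc[[2,5], 'F'] = np.nan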
print (df)
A B C D F make
0 e NaN 7.0 1.0 a a
1 b NaN NaN 3.0 a a
2 NaN 4.0 9.0 5.0 NaN a
3 d 5.0 4.0 NaN b b
4 NaN 5.0 2.0 1.0 b b
5 d 4.0 3.0 0.0 NaN b
f = lambda x: x.median() if np.issubdtype(x.dtype, np.number) else x.mode().iloc[0]
df = df.fillna(df.groupby('make').transform(f))
print (df)
A B C D F make
0 e 4.0 7.0 1.0 a a
1 b 4.0 8.0 3.0 a a
2 b 4.0 9.0 5.0 a a
3 d 5.0 4.0 0.5 b b
4 d 5.0 2.0 1.0 b b
5 d 4.0 3.0 0.0 b b
Fillna using groupby and mode not working
Use transform:
vals = df.groupby(['region', 'basin'])['installer'] \
.transform(lambda x: x.mode(dropna=False).iloc[0])
df['installer'] = df['installer'].fillna(vals)
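One thing to keep in mind with mode(dropna=False): when NaN is the most frequent entry in a group, NaN itself is the mode and nothing gets filled. The sketch below shows one possible alternative that takes the mode of the non-missing values and only falls back to NaN when a group is entirely missing. The sample data and the helper name group_mode are hypothetical:

```python
import numpy as np
import pandas as pd

# Hypothetical data: north/x is mostly NaN, east/z is entirely NaN
df = pd.DataFrame({
    "region": ["north", "north", "north", "south", "south", "east"],
    "basin": ["x", "x", "x", "y", "y", "z"],
    "installer": ["acme", np.nan, np.nan, "best", np.nan, np.nan],
})

def group_mode(s):
    """Most frequent non-missing value, or NaN if the group has none."""
    modes = s.mode()  # NaN is dropped by default
    return modes.iloc[0] if not modes.empty else np.nan

vals = df.groupby(["region", "basin"])["installer"].transform(group_mode)
df["installer"] = df["installer"].fillna(vals)
print(df)
```

Here north/x is filled with "acme" even though NaN outnumbers it, while east/z stays NaN because there is no value to borrow.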
pandas groupby fillna code does not work and gives error
For your purpose Andrej Kesely's answer might be enough, but keep in mind that apply is slow in pandas.
A faster option is .transform('mean'), like this:
df["age"] = df["age"].fillna(df.groupby(["name", "job"])["age"].transform("mean"))
df["weight"] = df["weight"].fillna(df.groupby(["name", "job"])["weight"].transform("mean"))
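When several columns share the same grouping, the two lines can be folded into one by transforming a list of columns. A small sketch with made-up data (the values are assumptions for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "name": ["Alex", "Alex", "Ben", "Ben", "Ben"],
    "job": ["teacher", "teacher", "doctor", "doctor", "doctor"],
    "age": [30.0, np.nan, 40.0, 50.0, np.nan],
    "weight": [70.0, 80.0, np.nan, 90.0, 100.0],
})

cols = ["age", "weight"]
# One groupby pass gives the per-row group mean for every column in cols
group_means = df.groupby(["name", "job"])[cols].transform("mean")
df[cols] = df[cols].fillna(group_means)
print(df)
```

fillna with a DataFrame aligns on both index and column labels, so each column is filled from its own group means.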
The code below compares the groupby-apply strategy with the transform one; slow refers to groupby-apply and fast to transform.
To reproduce the comparison, run:
import time

import numpy as np
import pandas as pd
from tqdm import tqdm


def build_dataset(N):
    names = ['Alex', 'Ben', 'Marry', 'Alex', 'Ben', 'Marry']
    jobs = ['teacher', 'doctor', 'engineer', 'teacher', 'doctor', 'engineer']
    data = {
        'name': np.random.choice(names, size=N),
        'job': np.random.choice(jobs, size=N),
        'age': np.random.uniform(low=10, high=90, size=N),
        'weight': np.random.uniform(low=60, high=150, size=N)
    }
    df = pd.DataFrame(data)
    # blank out 10% of each column at random
    df.loc[df.sample(frac=0.1).index, "age"] = np.nan
    df.loc[df.sample(frac=0.1).index, "weight"] = np.nan
    return df


def slow_way(df):
    # groupby + apply: each group is filled inside a Python function
    df = df.copy()

    def fn(x):
        x = x.copy()
        x["age"] = x["age"].fillna(x["age"].mean())
        x["weight"] = x["weight"].fillna(x["weight"].mean())
        return x

    return df.groupby(["name", "job"]).apply(fn)


def fast_way(df):
    # transform: group means are computed once and aligned to the rows
    df = df.copy()
    df["age"] = df["age"].fillna(df.groupby(["name", "job"])["age"].transform("mean"))
    df["weight"] = df["weight"].fillna(df.groupby(["name", "job"])["weight"].transform("mean"))
    return df


Ns = np.arange(10, 100000, step=1000, dtype=np.int32)
slow = []
fast = []
size = []
for N in tqdm(Ns):
    # 10 repetitions per size, averaged later
    for i in range(10):
        df = build_dataset(N)
        start = time.time()
        _ = slow_way(df)
        end = time.time()
        slow.append(end - start)
        start = time.time()
        _ = fast_way(df)
        end = time.time()
        fast.append(end - start)
        size.append(N)

df = pd.DataFrame({"N": size, "slow": slow, "fast": fast})
df_group = df.groupby("N").mean()
df_group.plot(figsize=(30, 10))
pandas: fillna whole df with groupby
You can use:
df = df.groupby("id").apply(lambda x: x.ffill(limit=2))
print (df)
id date number number2
0 1 2001 4.0 11.0
1 1 2002 4.0 45.0
2 1 2003 4.0 13.0
3 2 2001 7.0 NaN
4 2 2002 8.0 2.0
This also works for me:
df.groupby("id").fillna(method="ffill", limit=2)
so I think you need to upgrade pandas.
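On recent pandas releases GroupBy.fillna and its method argument are deprecated, so a spelling that should hold up across versions is GroupBy.ffill, which takes the same limit. A minimal sketch; the sample values are assumptions, chosen to show both the limit and the group boundary:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "id": [1, 1, 1, 1, 2, 2],
    "date": [2001, 2002, 2003, 2004, 2001, 2002],
    "number": [4.0, np.nan, np.nan, np.nan, np.nan, 8.0],
})

# Forward-fill at most two consecutive NaN values, never across ids
df["number"] = df.groupby("id")["number"].ffill(limit=2)
print(df)
```

The third NaN after 4.0 is left alone because of limit=2, and the first row of id 2 stays NaN because nothing precedes it within its own group.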
pandas fillna using dict map and groupby
Because the conditions are separate, you need several lines.
You can still refactor the code to reuse the groups and a single function:
f = lambda x: x.ffill().bfill()
g1 = df.groupby(['subj'], sort=False)
g2 = df.groupby(['region'], sort=False)
df['qty_min'] = g1['qty_min'].transform(f)
df['qty_max'] = g1['qty_max'].transform(f)
df['region_min'] = g2['region_min'].transform(f)
df['region_max'] = g2['region_max'].transform(f)
Using your dictionary:
f = lambda x: x.ffill().bfill()
fillna_dict = {
    "subj": ['qty_min','qty_max'],
    "region": ['region_min','region_max']
}
for k, cols in fillna_dict.items():
    df[cols] = df.groupby(df[k])[cols].transform(f)
output:
qty_min qty_max region_min region_max subj region
0 11.0 1.0 10.0 10.0 ab UK
1 21.0 1.0 10.0 20.0 ab UK
2 21.0 1.0 10.0 30.0 ab UK
3 10.0 2.0 20.0 34.0 bc US
4 10.0 2.0 20.0 34.0 bc US
5 10.0 2.0 109.0 47.0 bc TZ
6 11.0 3.0 109.0 47.0 de TZ
7 13.0 3.0 109.0 31.0 de TZ
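The input frame for this answer is not shown, so here is a small runnable sketch of the dictionary-driven loop on made-up data. The values are assumptions and are not the data behind the output above:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "subj": ["ab", "ab", "bc", "bc"],
    "region": ["UK", "US", "UK", "US"],
    "qty_min": [11.0, np.nan, np.nan, 10.0],
    "qty_max": [np.nan, 1.0, 2.0, np.nan],
    "region_min": [10.0, np.nan, np.nan, 20.0],
    "region_max": [np.nan, 34.0, 30.0, np.nan],
})

f = lambda x: x.ffill().bfill()

# grouping column -> columns to fill within that grouping
fillna_dict = {
    "subj": ["qty_min", "qty_max"],
    "region": ["region_min", "region_max"],
}
for k, cols in fillna_dict.items():
    df[cols] = df.groupby(df[k])[cols].transform(f)
print(df)
```

The qty columns are filled within each subj, while the region columns are filled within each region, even though the two groupings cut across the rows differently.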