Pandas Fillna Using Groupby

Pandas fillna using groupby

If each group contains only one non-NaN value, use ffill (forward fill) followed by bfill (backward fill) within each group. This requires apply with a lambda:

# group_keys=False keeps the original index, so the result aligns with df
df['three'] = (df.groupby(['one','two'], sort=False, group_keys=False)['three']
                 .apply(lambda x: x.ffill().bfill()))
print(df)
   one  two  three
0    1    1   10.0
1    1    1   10.0
2    1    1   10.0
3    1    2   20.0
4    1    2   20.0
5    1    2   20.0
6    1    3    NaN
7    1    3    NaN
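
As a minimal, self-contained sketch of the per-group forward and backward fill: the input frame is not shown above, so the values below are an assumption chosen to match the printed result. The sketch also swaps apply for transform, which returns a result aligned with the original index.

```python
import numpy as np
import pandas as pd

# Hypothetical input: one known value per group; group (1, 3) has none.
df = pd.DataFrame({
    "one": [1] * 8,
    "two": [1, 1, 1, 2, 2, 2, 3, 3],
    "three": [10.0, np.nan, np.nan, np.nan, 20.0, np.nan, np.nan, np.nan],
})

# Forward fill, then backward fill, inside each (one, two) group.
df["three"] = (
    df.groupby(["one", "two"], sort=False)["three"]
      .transform(lambda x: x.ffill().bfill())
)
print(df)
```

A group with no known value at all, such as (1, 3) here, stays NaN because there is nothing to propagate.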

But if a group contains multiple values and you need to replace NaN with a single constant, e.g. the group mean:

print(df)
   one  two  three
0    1    1   10.0
1    1    1   40.0
2    1    1    NaN
3    1    2    NaN
4    1    2   20.0
5    1    2    NaN
6    1    3    NaN
7    1    3    NaN

# group_keys=False keeps the original index, so the result aligns with df
df['three'] = (df.groupby(['one','two'], sort=False, group_keys=False)['three']
                 .apply(lambda x: x.fillna(x.mean())))
print(df)
   one  two  three
0    1    1   10.0
1    1    1   40.0
2    1    1   25.0
3    1    2   20.0
4    1    2   20.0
5    1    2   20.0
6    1    3    NaN
7    1    3    NaN
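
The example above can be reproduced end to end from the printed input. The sketch below uses fillna with transform("mean") rather than apply. It also shows one possible way to handle groups that have no values at all: the fallback to the overall column mean is an assumption added for illustration, not part of the original answer.

```python
import numpy as np
import pandas as pd

# Input copied from the printed frame above.
df = pd.DataFrame({
    "one": [1] * 8,
    "two": [1, 1, 1, 2, 2, 2, 3, 3],
    "three": [10.0, 40.0, np.nan, np.nan, 20.0, np.nan, np.nan, np.nan],
})

# Mean of each (one, two) group, broadcast back to the original rows.
group_mean = df.groupby(["one", "two"], sort=False)["three"].transform("mean")
by_group = df["three"].fillna(group_mean)

# Optional fallback (assumption): where the whole group is NaN,
# fall back to the mean of the entire column.
with_fallback = by_group.fillna(df["three"].mean())

print(by_group)
print(with_fallback)
```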

How to fillna by groupby outputs in pandas?

Using df['D'].fillna(df.groupby(['A','B','C'])['D'].transform('mean')) is faster than apply:

In [2400]: df
Out[2400]:
   A  B  C    D
0  1  1  1  1.0
1  1  1  1  NaN
2  1  1  1  3.0
3  3  3  3  5.0

In [2401]: df['D'].fillna(df.groupby(['A','B','C'])['D'].transform('mean'))
Out[2401]:
0    1.0
1    2.0
2    3.0
3    5.0
Name: D, dtype: float64

In [2402]: df['D'] = df['D'].fillna(df.groupby(['A','B','C'])['D'].transform('mean'))

In [2403]: df
Out[2403]:
   A  B  C    D
0  1  1  1  1.0
1  1  1  1  2.0
2  1  1  1  3.0
3  3  3  3  5.0

Details

In [2396]: df.shape
Out[2396]: (10000, 4)

In [2398]: %timeit df['D'].fillna(df.groupby(['A','B','C'])['D'].transform('mean'))
100 loops, best of 3: 3.44 ms per loop

In [2397]: %timeit df.groupby(['A','B','C'])['D'].apply(lambda x: x.fillna(x.mean()))
100 loops, best of 3: 5.34 ms per loop
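
The timings above come from an IPython session. Outside IPython, a comparable check could be sketched with the standard timeit module. How the original 10000-row frame was built is not shown, so the random data below (key ranges, 10% missing values, fixed seed) is an assumption, and absolute numbers will differ by machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

# Assumed test data: 10,000 rows, three integer keys, about 10% NaN in D.
rng = np.random.default_rng(0)
n = 10_000
df = pd.DataFrame({
    "A": rng.integers(0, 10, size=n),
    "B": rng.integers(0, 10, size=n),
    "C": rng.integers(0, 10, size=n),
    "D": rng.normal(size=n),
})
df.loc[rng.random(n) < 0.1, "D"] = np.nan

keys = ["A", "B", "C"]

def with_transform():
    # Built-in group mean, broadcast to the original rows.
    return df["D"].fillna(df.groupby(keys)["D"].transform("mean"))

def with_apply():
    # Python-level lambda, called once per group.
    return df.groupby(keys, group_keys=False)["D"].apply(lambda x: x.fillna(x.mean()))

# Both approaches should agree before their speed is compared.
pd.testing.assert_series_equal(
    with_transform().sort_index(), with_apply().sort_index(), check_names=False
)

print("transform:", timeit.timeit(with_transform, number=5))
print("apply:    ", timeit.timeit(with_apply, number=5))
```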

Pandas fillna using groupby and mode

You can use GroupBy.transform with an if-else expression that returns the median for numeric columns and the mode for categorical columns:

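import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': list('ebcded'),
    'B': [np.nan, np.nan, 4, 5, 5, 4],
    'C': [7, np.nan, 9, 4, 2, 3],
    'D': [1, 3, 5, np.nan, 1, 0],
    'F': list('aaabbb'),
    'make': list('aaabbb')
})

# Introduce missing values into the categorical columns
df.loc[[2,4], 'A'] = np.nan
df.loc[[2,5], 'F'] = np.nan
print(df)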
     A    B    C    D    F make
0    e  NaN  7.0  1.0    a    a
1    b  NaN  NaN  3.0    a    a
2  NaN  4.0  9.0  5.0  NaN    a
3    d  5.0  4.0  NaN    b    b
4  NaN  5.0  2.0  1.0    b    b
5    d  4.0  3.0  0.0  NaN    b


# Median for numeric columns, most frequent value (mode) for all others
f = lambda x: x.median() if np.issubdtype(x.dtype, np.number) else x.mode().iloc[0]
df = df.fillna(df.groupby('make').transform(f))
print(df)
   A    B    C    D  F make
0  e  4.0  7.0  1.0  a    a
1  b  4.0  8.0  3.0  a    a
2  b  4.0  9.0  5.0  a    a
3  d  5.0  4.0  0.5  b    b
4  d  5.0  2.0  1.0  b    b
5  d  4.0  3.0  0.0  b    b
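
One edge case of the mode-based fill: if a group has no non-missing value in a categorical column, x.mode() is empty and .iloc[0] raises an IndexError. A possible guard is sketched below. The helper name group_fill_value and the sample columns color and price are hypothetical, and the fill is done column by column so the helper only ever receives a Series.

```python
import numpy as np
import pandas as pd

def group_fill_value(x):
    """Median for numeric data, most frequent value otherwise (NaN if none)."""
    if pd.api.types.is_numeric_dtype(x):
        return x.median()
    modes = x.mode()  # NaN is dropped by default
    return modes.iloc[0] if not modes.empty else np.nan

# Hypothetical data: group "b" has no known color at all.
df = pd.DataFrame({
    "make": ["a", "a", "b", "b"],
    "color": ["red", np.nan, np.nan, np.nan],
    "price": [1.0, np.nan, 4.0, 6.0],
})

filled = df.copy()
for col in ["color", "price"]:
    fill = df.groupby("make")[col].transform(group_fill_value)
    filled[col] = df[col].fillna(fill)

print(filled)
```

Rows in a group without any known value simply stay NaN instead of raising.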

Fillna using groupby and mode not working

Use transform:

vals = df.groupby(['region', 'basin'])['installer'] \
.transform(lambda x: x.mode(dropna=False).iloc[0])
df['installer'] = df['installer'].fillna(vals)
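
Note that mode(dropna=False) counts NaN as a candidate, so a group in which NaN is the most frequent entry yields NaN as its "mode" and nothing gets filled. The sketch below contrasts this with the default dropna=True. The sample values are hypothetical; only the column names come from the snippet above.

```python
import numpy as np
import pandas as pd

# Hypothetical data: in the ("S", "y") group NaN occurs more often than "bolt".
df = pd.DataFrame({
    "region": ["N", "N", "N", "S", "S", "S"],
    "basin": ["x", "x", "x", "y", "y", "y"],
    "installer": ["acme", "acme", np.nan, "bolt", np.nan, np.nan],
})

grouped = df.groupby(["region", "basin"])["installer"]

# NaN may win the vote here, in which case the group stays unfilled.
mode_keep_nan = grouped.transform(lambda x: x.mode(dropna=False).iloc[0])
# NaN is ignored, so the most frequent real value is used.
mode_drop_nan = grouped.transform(lambda x: x.mode().iloc[0])

filled_keep = df["installer"].fillna(mode_keep_nan)
filled_drop = df["installer"].fillna(mode_drop_nan)

print(filled_keep)
print(filled_drop)
```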

pandas groupby fillna code does not work and gives error

For your purpose, Andrej Kesely's answer might be enough, but keep in mind that using apply with pandas is not good performance-wise.

A better option for performance is to use .transform('mean'), like this:

# Assign the result back instead of calling fillna(..., inplace=True) on a column selection
df["age"] = df["age"].fillna(df.groupby(["name", "job"])["age"].transform("mean"))
df["weight"] = df["weight"].fillna(df.groupby(["name", "job"])["weight"].transform("mean"))
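
When several columns share the same grouping keys, the group means can be computed in one transform call. This is a small sketch with made-up values; the column names follow the snippet above.

```python
import numpy as np
import pandas as pd

# Made-up sample data using the column names from the answer.
df = pd.DataFrame({
    "name": ["Alex", "Alex", "Ben", "Ben"],
    "job": ["teacher", "teacher", "doctor", "doctor"],
    "age": [30.0, np.nan, 40.0, 50.0],
    "weight": [np.nan, 70.0, 80.0, np.nan],
})

cols = ["age", "weight"]

# One groupby, one transform: a frame of group means aligned with df.
group_means = df.groupby(["name", "job"])[cols].transform("mean")
df[cols] = df[cols].fillna(group_means)

print(df)
```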

This is a comparison between the groupby-apply strategy and the transform strategy, where slow refers to groupby-apply and fast to transform.

[Figure: mean runtime of slow and fast versus dataset size N]

To reproduce this comparison, you can run this:

import time

import numpy as np
import pandas as pd
from tqdm import tqdm

def build_dataset(N):
    names = ['Alex', 'Ben', 'Marry', 'Alex', 'Ben', 'Marry']
    jobs = ['teacher', 'doctor', 'engineer', 'teacher', 'doctor', 'engineer']

    data = {
        'name': np.random.choice(names, size=N),
        'job': np.random.choice(jobs, size=N),
        'age': np.random.uniform(low=10, high=90, size=N),
        'weight': np.random.uniform(low=60, high=150, size=N)
    }

    df = pd.DataFrame(data)
    # Blank out a random 10% of each numeric column
    df.loc[df.sample(frac=0.1).index, "age"] = np.nan
    df.loc[df.sample(frac=0.1).index, "weight"] = np.nan
    return df

def slow_way(df):
    df = df.copy()
    def fn(x):
        # Fill each column with the mean of the current group
        return x.fillna({"age": x["age"].mean(), "weight": x["weight"].mean()})

    return df.groupby(["name", "job"]).apply(fn)

def fast_way(df):
    df = df.copy()
    # Assign back instead of using inplace=True on a column selection
    df["age"] = df["age"].fillna(df.groupby(["name", "job"])["age"].transform("mean"))
    df["weight"] = df["weight"].fillna(df.groupby(["name", "job"])["weight"].transform("mean"))

    return df

Ns = np.arange(10, 100000, step=1000, dtype=np.int32)
slow = []
fast = []
size = []
for N in tqdm(Ns):
    # Repeat each size 10 times; the runs are averaged below
    for i in range(10):
        df = build_dataset(N)
        start = time.perf_counter()
        _ = slow_way(df)
        end = time.perf_counter()
        slow.append(end - start)
        start = time.perf_counter()
        _ = fast_way(df)
        end = time.perf_counter()
        fast.append(end - start)
        size.append(N)

df = pd.DataFrame({"N": size, "slow": slow, "fast": fast})
df_group = df.groupby("N").mean()
df_group.plot(figsize=(30,10))

pandas: fillna whole df with groupby

You can use:

# group_keys=False keeps the original flat index shown in the output below
df = df.groupby("id", group_keys=False).apply(lambda x: x.ffill(limit=2))
print(df)
   id  date  number  number2
0   1  2001     4.0     11.0
1   1  2002     4.0     45.0
2   1  2003     4.0     13.0
3   2  2001     7.0      NaN
4   2  2002     8.0      2.0

This also works for me:

df.groupby("id").fillna(method="ffill", limit=2)

so I think you need to upgrade pandas.
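
Recent pandas releases deprecate the method argument of fillna, so a version-independent alternative is the dedicated GroupBy.ffill method. The sketch below assumes a hypothetical input frame (the original input is not shown above) chosen to be consistent with the printed result.

```python
import numpy as np
import pandas as pd

# Hypothetical input consistent with the output printed above.
df = pd.DataFrame({
    "id": [1, 1, 1, 2, 2],
    "date": [2001, 2002, 2003, 2001, 2002],
    "number": [4.0, np.nan, np.nan, 7.0, 8.0],
    "number2": [11.0, 45.0, 13.0, np.nan, 2.0],
})

value_cols = ["number", "number2"]

# Forward fill at most 2 consecutive gaps, never across different ids.
df[value_cols] = df.groupby("id")[value_cols].ffill(limit=2)

print(df)
```

The leading NaN in number2 for id 2 stays NaN, because a forward fill has no earlier value within that group to copy from.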

pandas fillna using dict map and groupby

As you have separate conditions, you need several lines.

What you can do is refactor the code to reuse the groups and a single function:

f = lambda x: x.ffill().bfill()

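# group_keys=False keeps the original index so the results can be assigned back
g1 = df.groupby(['subj'], sort=False, group_keys=False)
g2 = df.groupby(['region'], sort=False, group_keys=False)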

df['qty_min'] = g1['qty_min'].apply(f)
df['qty_max'] = g1['qty_max'].apply(f)
df['region_min'] = g2['region_min'].apply(f)
df['region_max'] = g2['region_max'].apply(f)

Using your dictionary:

f = lambda x: x.ffill().bfill()

fillna_dict= {
"subj": ['qty_min','qty_max'],
"region": ['region_min','region_max']
}

for k, cols in fillna_dict.items():
    df[cols] = df.groupby(df[k], group_keys=False)[cols].apply(f)

output:

   qty_min  qty_max  region_min  region_max subj region
0     11.0      1.0        10.0        10.0   ab     UK
1     21.0      1.0        10.0        20.0   ab     UK
2     21.0      1.0        10.0        30.0   ab     UK
3     10.0      2.0        20.0        34.0   bc     US
4     10.0      2.0        20.0        34.0   bc     US
5     10.0      2.0       109.0        47.0   bc     TZ
6     11.0      3.0       109.0        47.0   de     TZ
7     13.0      3.0       109.0        31.0   de     TZ
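
The dictionary-driven loop could also be written with transform, which keeps the result aligned with the original index regardless of pandas version. This is a sketch on a small hypothetical frame (the original input is not shown); only the column names and the dictionary layout follow the answer above.

```python
import numpy as np
import pandas as pd

# Hypothetical input using the column names from the answer.
df = pd.DataFrame({
    "subj": ["ab", "ab", "bc", "bc"],
    "region": ["UK", "UK", "US", "US"],
    "qty_min": [11.0, np.nan, np.nan, 10.0],
    "qty_max": [np.nan, 1.0, 2.0, np.nan],
    "region_min": [10.0, np.nan, np.nan, 20.0],
    "region_max": [np.nan, 10.0, 34.0, np.nan],
})

# Grouping column -> columns to fill within that grouping.
fillna_dict = {
    "subj": ["qty_min", "qty_max"],
    "region": ["region_min", "region_max"],
}

for key, cols in fillna_dict.items():
    # Forward fill, then backward fill, inside each group.
    df[cols] = df.groupby(key, sort=False)[cols].transform(lambda x: x.ffill().bfill())

print(df)
```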

