Pandas: Filling Missing Values by Mean in Each Group

pandas: Filling missing values within a group

An alternative approach is to use first_valid_index and a transform:

In [11]: g = df.groupby('trial')

In [12]: g['cs_name'].transform(lambda s: s.loc[s.first_valid_index()])
Out[12]:
0 A1
1 A1
2 A1
3 A1
4 B2
5 B2
6 B2
7 B2
8 A1
9 A1
10 A1
11 A1
Name: cs_name, dtype: object

This ought to be more efficient than using ffill followed by a bfill...

And use this to change the cs_name column:

df['cs_name'] = g['cs_name'].transform(lambda s: s.loc[s.first_valid_index()])
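One caveat with this approach: first_valid_index() returns None when a group contains no valid value at all, and s.loc[None] then fails with a lookup error. Here is a minimal sketch of a guarded version. The data is made up (the trial and cs_name columns mirror the example above), and the helper name first_valid is hypothetical:

```python
import numpy as np
import pandas as pd

# Made-up data: trial 3 has no valid cs_name at all
df = pd.DataFrame({
    "trial": [1, 1, 1, 2, 2, 2, 3, 3],
    "cs_name": [np.nan, "A1", np.nan, "B2", np.nan, np.nan, np.nan, np.nan],
})

def first_valid(s):
    """Return the first non-null value of s, or NaN if there is none."""
    idx = s.first_valid_index()
    return s.loc[idx] if idx is not None else np.nan

# transform broadcasts the scalar returned for each group back to its rows
filled = df.groupby("trial")["cs_name"].transform(first_valid)
```

Groups that have at least one valid value are filled with it, while a group with none stays missing instead of raising.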

Note: I think a method to grab the first non-null object would be a nice enhancement to pandas. In NumPy it's an open request, and I don't think there is currently such a method (I could be wrong!)...

Filling missing values of test from groupby mean of training set

IIUC, here's one way:

from statistics import mode

test_mode = test.set_index('A').fillna(train.groupby('A').agg(mode)).reset_index()
test_mean = test.set_index('A').fillna(train.groupby('A').mean()).reset_index()

If you want a function:

from statistics import mode

def evaluate_nan(strategy='mean'):
    # fill NaNs in test with the per-group statistic computed on train
    return test.set_index('A').fillna(train.groupby('A').agg(strategy)).reset_index()

test_mean = evaluate_nan()
test_mode = evaluate_nan(strategy = mode)
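A variant of the same idea, sketched for a single column: compute the per-group statistic on the training set, then map it onto the test rows by group key. This avoids relying on global variables and on the set_index/reset_index round trip. The function name fill_from_train and the sample frames are assumptions made for illustration:

```python
import numpy as np
import pandas as pd

# Made-up training and test sets sharing a group key "A" and a value column "B"
train = pd.DataFrame({"A": ["x", "x", "y", "y"], "B": [1.0, 3.0, 10.0, 20.0]})
test = pd.DataFrame({"A": ["x", "y", "x"], "B": [np.nan, np.nan, 5.0]})

def fill_from_train(train, test, key, col, strategy="mean"):
    """Fill NaNs in test[col] with a per-group statistic computed on train."""
    stats = train.groupby(key)[col].agg(strategy)  # one value per group
    out = test.copy()                              # leave the input untouched
    # map looks up each row's group statistic; fillna only touches the NaNs
    out[col] = out[col].fillna(out[key].map(stats))
    return out

test_mean = fill_from_train(train, test, "A", "B")
```

Groups that appear in test but not in train have no statistic to look up, so their missing values stay NaN.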

Pandas Fill NA with Group Value

You can use transform to fill each missing value with the mean of its group:

df['Value'] = df.groupby('Site')['Value'].transform(lambda x: x.fillna(x.mean()))


  Site  Value
0    A      1
1    A      1
2    A      1
3    B      2
4    B      2
5    B      2
6    C      3
7    C      3
8    C      3
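For a self-contained check, here is a small sketch with invented input data chosen to give the output above. It passes the string "mean" to transform, which lets pandas use its built-in group mean rather than calling a Python lambda once per group:

```python
import numpy as np
import pandas as pd

# Invented input: one missing value per site
df = pd.DataFrame({
    "Site": list("AAABBBCCC"),
    "Value": [1, np.nan, 1, 2, 2, np.nan, np.nan, 3, 3],
})

# Per-row group mean (NaNs are skipped when the mean is computed)
group_mean = df.groupby("Site")["Value"].transform("mean")

# Only the missing entries are replaced; existing values are kept
df["Value"] = df["Value"].fillna(group_mean)
```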

How to fill missing values based on grouped average?

I cannot reproduce the first error, so I'll use an example like the one below:

import pandas as pd
import numpy as np
np.random.seed(111)
df = pd.DataFrame({
    'Title': np.random.choice(['Mr', 'Miss', 'Mrs'], 20),
    # float dtype so that NaN can be assigned below without an upcast
    'Age': np.random.randint(20, 50, 20).astype(float),
})
df.loc[[5, 9, 10, 11, 12], 'Age'] = np.nan

the data frame looks like:

    Title   Age
0      Mr  42.0
1      Mr  28.0
2      Mr  25.0
3      Mr  32.0
4     Mrs  26.0
5    Miss   NaN
6     Mrs  32.0
7     Mrs  33.0
8     Mrs  25.0
9      Mr   NaN
10   Miss   NaN
11     Mr   NaN
12    Mrs   NaN
13   Miss  38.0
14     Mr  31.0
15     Mr  42.0
16     Mr  24.0
17    Mrs  23.0
18    Mrs  49.0
19   Miss  27.0

We can then fill the missing ages in one more step: compute the average age per title, and look it up for every row where Age is missing:

ave_age = df.groupby('Title')['Age'].mean()   # average age per title
df.loc[pd.isna(df['Age']),'Age'] = ave_age[df.loc[pd.isna(df['Age']),'Title']].values

Pandas: Fill missing values by mean in each group faster than transform

Here's a NumPy approach using np.bincount that's pretty efficient for such bin-based summing/averaging operations. It assumes the group column holds non-negative integer ids, which is what np.bincount requires:

ids = df.group.values                    # Extract 2 columns as two arrays
vals = df.value.values

m = np.isnan(vals) # Mask of NaNs
grp_sums = np.bincount(ids,np.where(m,0,vals)) # Group sums with NaNs as 0s
avg_vals = grp_sums*(1.0/np.bincount(ids,~m)) # Group averages
vals[m] = avg_vals[ids[m]] # Set avg values into NaN positions

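Note that this updates the value column in place, at least in pandas versions where .values returns a writable view of the column's data rather than a copy.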
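The timings in the runtime test call a function named bincount_based, whose definition is not shown. One plausible reconstruction is sketched below, under these assumptions: the column names default to group and value, group labels contain no missing values, and (unlike the snippet above) the result is returned as a new Series instead of being written in place. pd.factorize is used so that non-integer group labels also work:

```python
import numpy as np
import pandas as pd

def bincount_based(df, group_col="group", value_col="value"):
    """Return value_col with NaNs replaced by the mean of their group."""
    # factorize maps arbitrary group labels to integer codes 0..n-1,
    # which is the form np.bincount requires
    ids, _ = pd.factorize(df[group_col])
    # work on a float copy so the caller's DataFrame is left untouched
    vals = df[value_col].to_numpy(dtype=float, copy=True)

    m = np.isnan(vals)                                           # mask of NaNs
    grp_sums = np.bincount(ids, weights=np.where(m, 0.0, vals))  # NaNs count as 0
    grp_counts = np.bincount(ids, weights=(~m).astype(float))    # valid values per group
    # a group with no valid values gives 0/0 = NaN; silence the warning
    with np.errstate(invalid="ignore", divide="ignore"):
        avg_vals = grp_sums / grp_counts
    vals[m] = avg_vals[ids[m]]                                   # fill NaN positions
    return pd.Series(vals, index=df.index, name=value_col)

# Small made-up example
df = pd.DataFrame({"group": [0, 0, 1, 1, 1],
                   "value": [1.0, np.nan, 2.0, 4.0, np.nan]})
filled = bincount_based(df)
```

To update the column, assign the result back with df["value"] = bincount_based(df).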

Runtime test

Data sizes:

size = 1000000  # DataFrame length
ngroups = 10 # Number of Groups

Timings:

In [17]: %timeit df.groupby("group").transform(lambda x: x.fillna(x.mean()))
1 loops, best of 3: 276 ms per loop

In [18]: %timeit bincount_based(df)
100 loops, best of 3: 13.6 ms per loop

In [19]: 276.0/13.6 # Speedup
Out[19]: 20.294117647058822

20x+ speedup there!


