pandas: Filling missing values within a group
An alternative approach is to use first_valid_index and a transform:
In [11]: g = df.groupby('trial')
In [12]: g['cs_name'].transform(lambda s: s.loc[s.first_valid_index()])
Out[12]:
0 A1
1 A1
2 A1
3 A1
4 B2
5 B2
6 B2
7 B2
8 A1
9 A1
10 A1
11 A1
Name: cs_name, dtype: object
This ought to be more efficient than using ffill followed by a bfill...
You can use this to update the cs_name column:
df['cs_name'] = g['cs_name'].transform(lambda s: s.loc[s.first_valid_index()])
Note: I think it would be a nice enhancement to have a method in pandas that grabs the first non-null object. In NumPy it's an open request, and I don't think such a method currently exists (I could be wrong!)...
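One caveat with the lambda above: first_valid_index returns None when a group has no valid value at all, and s.loc[None] then raises. Below is a minimal sketch that guards against that case. The data and the helper name first_valid are made up for illustration. It also shows groupby's built-in 'first' aggregation, which skips nulls by default and so may cover the same need; check it against your pandas version.

```python
import numpy as np
import pandas as pd

# Hypothetical data: trials 1 and 2 have one known cs_name, trial 3 has none.
df = pd.DataFrame({
    "trial": [1, 1, 1, 2, 2, 2, 3, 3],
    "cs_name": ["A1", np.nan, np.nan, np.nan, "B2", np.nan, np.nan, np.nan],
})

def first_valid(s):
    """Return the first non-null value of s, or NaN if every value is null."""
    idx = s.first_valid_index()
    return s.loc[idx] if idx is not None else np.nan

g = df.groupby("trial")

# The scalar returned for each group is broadcast to every row of that group.
filled = g["cs_name"].transform(first_valid)

# Built-in alternative: 'first' takes the first non-null entry per group.
builtin = g["cs_name"].transform("first")
```

Both results leave trial 3 as missing, since there is nothing to fill it with.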
Filling missing values of test from groupby mean of training set
IIUC, here's one way:
from statistics import mode
test_mode = test.set_index('A').fillna(train.groupby('A').agg(mode)).reset_index()
test_mean = test.set_index('A').fillna(train.groupby('A').mean()).reset_index()
If you want a function:
from statistics import mode
def evaluate_nan(strategy='mean'):
    return test.set_index('A').fillna(train.groupby('A').agg(strategy)).reset_index()
test_mean = evaluate_nan()
test_mode = evaluate_nan(strategy=mode)
Pandas Fill NA with Group Value
You can use transform, as answered in another question:
df['Value'] = df.groupby('Site')['Value'].transform(lambda x: x.fillna(x.mean()))
Site Value
0 A 1
1 A 1
2 A 1
3 B 2
4 B 2
5 B 2
6 C 3
7 C 3
8 C 3
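As a variation, the lambda can be avoided by asking transform for the built-in 'mean' and passing the result to fillna. The sketch below uses made-up data with the same Site and Value column names.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing Value in each site.
df = pd.DataFrame({
    "Site": ["A", "A", "A", "B", "B", "B"],
    "Value": [1.0, np.nan, 1.0, np.nan, 2.0, 2.0],
})

# Per-row group mean: same length and index as df, NaNs are skipped by mean.
group_mean = df.groupby("Site")["Value"].transform("mean")

# fillna aligns on the index, so each NaN gets the mean of its own group.
df["Value"] = df["Value"].fillna(group_mean)
```

Using the string 'mean' lets pandas take its optimized path rather than calling a Python function once per group.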
How to fill missing values based on grouped average?
I cannot reproduce the first error, so I will use an example like the one below:
import pandas as pd
import numpy as np
np.random.seed(111)
df = pd.DataFrame({'Title':np.random.choice(['Mr','Miss','Mrs'],20),'Age':np.random.randint(20,50,20)})
df.loc[[5,9,10,11,12],['Age']]=np.nan
The data frame looks like:
Title Age
0 Mr 42.0
1 Mr 28.0
2 Mr 25.0
3 Mr 32.0
4 Mrs 26.0
5 Miss NaN
6 Mrs 32.0
7 Mrs 33.0
8 Mrs 25.0
9 Mr NaN
10 Miss NaN
11 Mr NaN
12 Mrs NaN
13 Miss 38.0
14 Mr 31.0
15 Mr 42.0
16 Mr 24.0
17 Mrs 23.0
18 Mrs 49.0
19 Miss 27.0
We can then fill the missing ages by looking up each row's group average:
ave_age = df.groupby('Title')['Age'].mean()
df.loc[pd.isna(df['Age']),'Age'] = ave_age[df.loc[pd.isna(df['Age']),'Title']].values
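The same lookup can also be written with Series.map, which some readers may find easier to follow than the .loc and .values indexing. A minimal sketch on made-up data with the same column names:

```python
import numpy as np
import pandas as pd

# Hypothetical data: two missing ages, one for 'Miss' and one for 'Mr'.
df = pd.DataFrame({
    "Title": ["Mr", "Mr", "Miss", "Miss", "Mrs", "Mr"],
    "Age": [30.0, 40.0, 20.0, np.nan, 50.0, np.nan],
})

# Average age per title (NaNs are ignored by mean).
ave_age = df.groupby("Title")["Age"].mean()

# map turns each row's Title into that title's average age;
# fillna then uses those values only where Age is missing.
df["Age"] = df["Age"].fillna(df["Title"].map(ave_age))
```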
Pandas: Fill missing values by mean in each group faster than transform
Here's a NumPy approach using np.bincount, which is pretty efficient for such bin-based summing/averaging operations:
ids = df.group.values # Extract 2 columns as two arrays
vals = df.value.values
m = np.isnan(vals) # Mask of NaNs
grp_sums = np.bincount(ids,np.where(m,0,vals)) # Group sums with NaNs as 0s
avg_vals = grp_sums*(1.0/np.bincount(ids,~m)) # Group averages
vals[m] = avg_vals[ids[m]] # Set avg values into NaN positions
Note that this would update the value column.
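The timings further down call a function named bincount_based, whose definition is not shown. The following is one possible reconstruction, not the original author's code: the signature and the column-name parameters are assumptions, it uses pd.factorize so that group labels need not already be small non-negative integers, and it returns a new Series instead of modifying the frame. It assumes there are no missing group labels.

```python
import numpy as np
import pandas as pd

def bincount_based(df, group_col="group", value_col="value"):
    """Return df[value_col] with NaNs replaced by the mean of their group."""
    # bincount needs integer ids 0..k-1; factorize builds them from any labels.
    ids, _ = pd.factorize(df[group_col])
    vals = df[value_col].to_numpy(dtype=float, copy=True)
    m = np.isnan(vals)                                        # mask of NaNs
    grp_sums = np.bincount(ids, weights=np.where(m, 0.0, vals))
    grp_counts = np.bincount(ids, weights=(~m).astype(float))
    # A group with no valid values gives 0/0 = NaN; silence that warning.
    with np.errstate(invalid="ignore", divide="ignore"):
        avg_vals = grp_sums / grp_counts
    vals[m] = avg_vals[ids[m]]                                # fill NaN slots
    return pd.Series(vals, index=df.index, name=value_col)

# Hypothetical usage on a tiny frame.
df = pd.DataFrame({
    "group": [0, 0, 1, 1, 1],
    "value": [1.0, np.nan, 2.0, 4.0, np.nan],
})
filled = bincount_based(df)
```

Working on a copy avoids depending on whether .values hands back a writable view of the column, which can differ between pandas versions.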
Runtime test
Data sizes:
size = 1000000 # DataFrame length
ngroups = 10 # Number of Groups
Timings:
In [17]: %timeit df.groupby("group").transform(lambda x: x.fillna(x.mean()))
1 loops, best of 3: 276 ms per loop
In [18]: %timeit bincount_based(df)
100 loops, best of 3: 13.6 ms per loop
In [19]: 276.0/13.6 # Speedup
Out[19]: 20.294117647058822
That's a 20x+ speedup!