Pandas Groupby Columns With Nan (Missing) Values

Pandas Groupby with Categorical Columns returns NaN

By default, groupby on categorical columns produces every possible combination of categories, so unused combinations show up as missing values.

So if you need to remove the missing values:

print(df.groupby(["a_bin", "b_bin"]).c.mean().dropna())
a_bin          b_bin
(0.0, 0.0101]  (0.0, 0.0101]     0.381681
               (0.0505, 0.0606]  0.148762
               (0.0909, 0.101]   0.313093
               (0.101, 0.111]    0.488104
               (0.313, 0.323]    0.518599
                                   ...
(0.99, 1.0]    (0.505, 0.515]    0.149027
               (0.576, 0.586]    0.099652
               (0.778, 0.788]    0.220360
               (0.828, 0.838]    0.166424
               (0.97, 0.98]      0.516558
Name: c, Length: 948, dtype: float64
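A minimal sketch of this behavior, using made-up data (the column names mirror the question, but the frame itself is invented). With categorical bins, the default grouping enumerates all category combinations; dropna removes the empty ones, and passing observed=True skips them up front:

```python
import numpy as np
import pandas as pd

# Hypothetical data: cutting into bins yields categorical columns.
rng = np.random.default_rng(0)
df = pd.DataFrame({"a": rng.random(100), "b": rng.random(100), "c": rng.random(100)})
df["a_bin"] = pd.cut(df["a"], bins=4)
df["b_bin"] = pd.cut(df["b"], bins=4)

# observed=False (the classic default) yields all 4 * 4 = 16 combinations,
# with NaN for combinations that never occur in the data.
full = df.groupby(["a_bin", "b_bin"], observed=False)["c"].mean()

# Option 1: drop the empty combinations afterwards.
cleaned = full.dropna()

# Option 2: skip unused combinations up front with observed=True.
observed = df.groupby(["a_bin", "b_bin"], observed=True)["c"].mean()
```

Both options end up with the same set of groups; observed=True just avoids materializing the empty ones.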

How to impute missing values with groupby if the group has fewer than 3 NaNs

Try slightly modifying your solution:

df_daily_grouped['price'].transform(lambda x : x.fillna(x.mean()) if x.isnull().sum()<3 else x)
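A self-contained sketch of the same idea, with an invented frame (the original df_daily_grouped is a groupby object from the question; the data below is made up). Each group's NaNs are filled with the group mean only when the group has fewer than 3 NaNs:

```python
import numpy as np
import pandas as pd

# Hypothetical data: "mon" has 1 NaN (gets filled), "tue" has 3 (left alone).
df = pd.DataFrame({
    "day":   ["mon", "mon", "mon", "tue", "tue", "tue", "tue"],
    "price": [1.0, np.nan, 3.0, np.nan, np.nan, np.nan, 4.0],
})

df["price_filled"] = df.groupby("day")["price"].transform(
    lambda x: x.fillna(x.mean()) if x.isnull().sum() < 3 else x
)
```

transform runs the lambda once per group and stitches the results back into a column aligned with the original index.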

Pandas Grouping by Id and getting non-NaN values

This should do what you want:

df.groupby('salesforce_id').first().reset_index(drop=True)

That will collapse each group's rows into one, keeping the first non-NaN value in each column (unless a column has no non-NaN values at all for that group; then the value in the merged row stays NaN).
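A quick sketch with invented data (the salesforce_id column name comes from the question; everything else is made up). first() takes the first non-NaN value per column within each group:

```python
import numpy as np
import pandas as pd

# Hypothetical frame: each group's non-NaN values are scattered across rows.
df = pd.DataFrame({
    "salesforce_id": ["A", "A", "B"],
    "email": ["x@example.com", np.nan, np.nan],
    "phone": [np.nan, "555-0100", np.nan],
})

# One row per id; each column keeps its first non-NaN value in the group.
merged = df.groupby("salesforce_id").first().reset_index(drop=True)
```

Group "A" ends up with both the email and the phone even though they came from different rows; group "B" stays all-NaN because it had no values to keep.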

Take min and max with null values - pandas groupby

IIUC, use DataFrame.mask to set NaN wherever a group has any NaN in that column:

new_df = (df.groupby('id')
            .agg({'start': 'min', 'end': 'max'})
            .mask(df[['start', 'end']].isna()
                    .groupby(df['id'])
                    .max())
            .reset_index())

print(new_df)
  id               start        end
0  a 2020-01-01 00:00:00 2020-01-02
1  b 2020-01-01 18:37:00        NaT
2  c 2020-02-04 00:00:00 2020-07-13
3  d 2020-04-19 20:45:00 2021-03-02

Detail:

print(df[['start', 'end']].isna()
        .groupby(df['id'])
        .max())

    start    end
id
a   False  False
b   False   True
c   False  False
d   False  False

In the case of multiple columns to group by:

new_df = (df.groupby(['id', 'status'])
            .agg({'start': 'min', 'end': 'max'})
            .mask(df[['start', 'end']].isna()
                    .groupby([df['id'], df['status']])
                    .max())
            .reset_index())

Pandas groupby NaN/None values in non-key columns

last is designed to get the last non-NA value, independently in each column.

What you want (last row per group) is tail:

df.groupby(by='a', as_index=False).tail(1)

Output:

   a     b     c
2  1   NaN     z
3  2  12.0  None
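A small sketch of the difference, with invented data shaped like the question's output. tail(1) returns the literal last row of each group, NaNs and all, while last() picks the last non-NA value per column independently:

```python
import numpy as np
import pandas as pd

# Hypothetical frame: group a=1's last row has NaN in b.
df = pd.DataFrame({
    "a": [1, 1, 1, 2],
    "b": [10.0, 11.0, np.nan, 12.0],
    "c": ["x", "y", "z", None],
})

last_rows = df.groupby("a", as_index=False).tail(1)   # last row, NaN preserved
last_values = df.groupby("a", as_index=False).last()  # last non-NA per column
```

In last_rows, group 1's b stays NaN (its last row really is missing); in last_values, group 1's b becomes 11.0, stitched together from an earlier row.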

How to groupby df according to two column values and handling missing values in pandas?

IIUC, you want:

  1. groupby the ID and MODE columns and interpolate all numeric columns
  2. groupby the ID and MODE columns and ffill all non-numeric columns

import numpy as np

# replace string "NaN" with numpy.nan
df = df.replace("NaN", np.nan)

numeric = df.filter(like="Signal").select_dtypes(np.number).columns
others = df.filter(like="Signal").select_dtypes(exclude=np.number).columns

df[numeric] = df.groupby(["ID", "MODE"])[numeric].transform(pd.Series.interpolate, limit_direction="forward")
df[others] = df.groupby(["ID", "MODE"])[others].transform("ffill")

>>> df
   ID      MODE  Signal1  Signal2 Signal3
0  0A    active     13.0      NaN      on
1  0A    active      8.5      0.1      on
2  0A    active      4.0      0.3      on
3  0A  inactive     11.0      NaN     off
4  0A  inactive     11.0      4.5     off
5  1C    active     22.0      NaN      on
6  1C    active     25.0      2.0      on
7  1C    active     25.0      3.0      on
8  1C  inactive     19.0      NaN     NaN

>>> df.dropna()
   ID      MODE  Signal1  Signal2 Signal3
1  0A    active      8.5      0.1      on
2  0A    active      4.0      0.3      on
4  0A  inactive     11.0      4.5     off
6  1C    active     25.0      2.0      on
7  1C    active     25.0      3.0      on
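A self-contained sketch of the two-step approach, with a smaller invented frame (column names follow the answer; the values are made up). Numeric signals are interpolated within each (ID, MODE) group; the string signal is forward-filled:

```python
import numpy as np
import pandas as pd

# Hypothetical data: one numeric and one string signal per group.
df = pd.DataFrame({
    "ID":      ["0A", "0A", "0A", "1C", "1C"],
    "MODE":    ["active"] * 3 + ["inactive"] * 2,
    "Signal1": [13.0, np.nan, 4.0, 22.0, np.nan],
    "Signal3": ["on", np.nan, np.nan, "off", np.nan],
})

numeric = ["Signal1"]
others = ["Signal3"]

# Linear interpolation within each group for the numeric columns...
df[numeric] = df.groupby(["ID", "MODE"])[numeric].transform(
    pd.Series.interpolate, limit_direction="forward"
)
# ...and a plain forward-fill within each group for the rest.
df[others] = df.groupby(["ID", "MODE"])[others].transform("ffill")
```

The gap in group (0A, active) becomes the midpoint 8.5, while the trailing NaN in group (1C, inactive) is carried forward from the last valid value.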

