Pandas - Replace Outliers With Groupby Mean

Pandas - Replace outliers with groupby mean

Try this:

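def replace(group):
    mean, std = group.mean(), group.std()
    outliers = (group - mean).abs() > 3 * std
    # Replace flagged values with the group mean (or with group[~outliers].mean())
    # mask returns a new Series, so the group is not mutated inside transform
    return group.mask(outliers, mean)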

df.groupby('a').transform(replace)

Note: if you want to eliminate the 100 in your last group, you can replace 3*std with just 1*std. The standard deviation in that group is 48.33, so with the 3*std threshold the 100 is not flagged and stays in the result.
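A minimal, self-contained sketch of the idea above, using made-up data (the column names a and b and the n_std parameter are assumptions for illustration, not part of the original question). It also illustrates why the threshold matters: a single extreme value inflates the standard deviation of a small group, and with ten or fewer values in a group no value can lie more than 3 sample standard deviations from the group mean.

```python
import pandas as pd

def replace_outliers(group, n_std=3.0):
    """Replace values further than n_std standard deviations from the group mean."""
    mean, std = group.mean(), group.std()
    outliers = (group - mean).abs() > n_std * std
    return group.mask(outliers, mean)

# Hypothetical data: a large group "x" and a small group "y", each with one 100.
df = pd.DataFrame({
    "a": ["x"] * 20 + ["y"] * 5,
    "b": [1.0] * 19 + [100.0] + [1.0, 2.0, 3.0, 4.0, 100.0],
})

# 3 * std: flags the 100 in the 20-row group, but not in the 5-row group.
strict = df.groupby("a")["b"].transform(replace_outliers)

# 1 * std: flags the 100 in both groups.
loose = df.groupby("a")["b"].transform(lambda g: replace_outliers(g, n_std=1.0))

print(pd.DataFrame({"a": df["a"], "b": df["b"], "strict": strict, "loose": loose}))
```

Note that the replacement value here is the group mean computed with the outlier still included; the group[~outliers].mean() variant mentioned in the code comment avoids that.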

Replace outliers from all columns with mean

I think you should do it like this instead:

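import numpy as np

median = dt[feature].median()
std = dt[feature].std()
# Blank out values more than one standard deviation from the median, then fill them
dt.loc[(dt[feature] - median).abs() > std, feature] = np.nan
dt[feature] = dt[feature].fillna(median)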

My guess is that the problem with your old code was this line:

dt[outliers] = np.nan
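To apply the same replacement to every numeric column, one option is to loop over the columns. The sketch below assumes a hypothetical frame dt with columns height and weight, and uses Series.mask to do the replacement in one step instead of going through NaN; like the snippet above, it substitutes the median rather than the mean.

```python
import pandas as pd

# Hypothetical data: each column contains one obvious outlier.
dt = pd.DataFrame({
    "height": [1.0, 1.1, 0.9, 1.0, 9.0],
    "weight": [10.0, 11.0, 10.5, 50.0, 10.2],
})

for feature in dt.select_dtypes(include="number").columns:
    median = dt[feature].median()
    std = dt[feature].std()
    # Flag values more than one standard deviation away from the median.
    is_outlier = (dt[feature] - median).abs() > std
    # Replace the flagged values with the column median.
    dt[feature] = dt[feature].mask(is_outlier, median)

print(dt)
```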

How to replace outlier after groupby?

The medians seem to differ slightly from what you're saying (see the comment in the output dataframe). Here's one approach using GroupBy.transform with where:

g = df.groupby('Port').Risk
df['Risk'] = df.Risk.where(g.transform('quantile', q=0.95) > df.Risk,
                           g.transform('median'))

        Date Port   Risk
0 2019-04-30    a  21.80
1 2019-03-29    a  22.60
2 2019-02-28    a  24.35  # -> np.median([21.8, 22.6, 500, 26.1]) = 24.35
3 2019-01-31    a  26.10
4 2019-04-30    b  36.40
5 2019-03-29    b  43.30
6 2019-02-28    b  40.00
7 2019-01-31    b  41.65
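A runnable sketch of the same where-based replacement is shown here. The Risk values are hypothetical (only the four values for port a appear in the comment above), and the quantile is computed through a lambda purely for explicitness. One thing worth knowing about this approach: the largest value in a group is never strictly below the group's 0.95 quantile, so the maximum of every group is always replaced by the median.

```python
import pandas as pd

# Hypothetical data: port "a" uses the values from the comment above,
# port "b" is made up for illustration.
df = pd.DataFrame({
    "Port": ["a", "a", "a", "a", "b", "b", "b", "b"],
    "Risk": [21.8, 22.6, 500.0, 26.1, 36.4, 43.3, 40.0, 45.0],
})

g = df.groupby("Port")["Risk"]
q95 = g.transform(lambda s: s.quantile(0.95))  # per-row 0.95 quantile of its group
median = g.transform("median")                 # per-row median of its group

# Keep values strictly below the group's 0.95 quantile, otherwise use the median.
df["Risk"] = df["Risk"].where(df["Risk"] < q95, median)

print(df)
```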

Remove outliers in Pandas dataframe with groupby

One way is to filter out as follows:

In [11]: res = df.groupby("Date")["Interval"].quantile([0.05, 0.95]).unstack(level=1)

In [12]: res
Out[12]:
             0.05   0.95
Date
2016-03-01  489.6  913.4

Now we can lookup these values for each row using loc and filter:

In [13]: (res.loc[df.Date, 0.05] < df.Interval.values) & (df.Interval.values < res.loc[df.Date, 0.95])
Out[13]:
Date
2016-03-01 False
2016-03-01 True
2016-03-01 True
2016-03-01 True
2016-03-01 False
dtype: bool

In [14]: df.loc[((res.loc[df.Date, 0.05] < df.Interval.values) & (df.Interval.values < res.loc[df.Date, 0.95])).values]
Out[14]:
   Report        Date  Time  Interval  Total Volume
1    5785  2016-03-01    25     580.0           NaN
2    5786  2016-03-01    26     716.0           NaN
3    5787  2016-03-01    27     803.0           NaN

Note: grouping by 'Time Interval' works the same way, but in your example it doesn't filter out any rows!
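A possible alternative that avoids the lookup with loc is to broadcast the per-group quantiles back onto the rows with transform and filter directly. This is only a sketch: the frame below is hypothetical, with Interval values picked to be consistent with the quantiles shown above (489.6 and 913.4), and all other columns are omitted.

```python
import pandas as pd

# Hypothetical data for a single date.
df = pd.DataFrame({
    "Date": ["2016-03-01"] * 5,
    "Interval": [467.0, 580.0, 716.0, 803.0, 941.0],
})

grouped = df.groupby("Date")["Interval"]
low = grouped.transform(lambda s: s.quantile(0.05))   # per-row lower bound
high = grouped.transform(lambda s: s.quantile(0.95))  # per-row upper bound

# Keep only rows strictly between the group's 5th and 95th percentiles.
filtered = df[(df["Interval"] > low) & (df["Interval"] < high)]

print(filtered)
```

Because low and high share the index of df, the comparison aligns row by row and no .values conversion is needed.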

Pandas finding and replacing outliers based on a group of two columns

First you can identify the outliers. This code flags any value that is more than one standard deviation away from the mean of the whole column (not the per-group mean).

outliers = df.loc[(df.value - df.value.mean()).abs() > df.value.std() * 1].index

Then you can determine the median of each group:

medians = df.groupby('group')['value'].median()

Finally, locate the outliers and replace with the medians:

df.loc[outliers, 'value'] = medians.loc[df.loc[outliers, 'group']].to_list()

All together it looks like:

import pandas as pd
index = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
s = pd.Series(['A', 'A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'B', 'B'], index=index)
t = pd.Series(['2022-06-28', '2022-06-28', '2022-06-28', '2022-06-27', '2022-06-27', '2022-06-27',
               '2022-06-28', '2022-06-28', '2022-06-28', '2022-06-27', '2022-06-27', '2022-06-27'],
              index=index)
r = pd.Series([1, 2, 1, 2, 3, 10, 2, 3, 2, 3, 4, 20], index=index)
df = pd.DataFrame(s, columns=['group'])
df['date'] = t
df['value'] = r
# Index labels of values more than one standard deviation from the overall mean
outliers = df.loc[(df.value - df.value.mean()).abs() > df.value.std() * 1].index
# Median of each group
medians = df.groupby('group')['value'].median()
# Replace each outlier with the median of its own group
df.loc[outliers, 'value'] = medians.loc[df.loc[outliers, 'group']].to_list()

Output:

   group        date  value
0      A  2022-06-28      1
1      A  2022-06-28      2
2      A  2022-06-28      1
3      A  2022-06-27      2
4      A  2022-06-27      3
5      A  2022-06-27      2
6      B  2022-06-28      2
7      B  2022-06-28      3
8      B  2022-06-28      2
9      B  2022-06-27      3
10     B  2022-06-27      4
11     B  2022-06-27      3
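The code above takes the median per group only. If the median should instead be taken per (group, date) pair, as the question title suggests, one possible variant is sketched below. It is an assumption about the intent, not the original answer: it keeps the same global outlier test, but uses transform('median') on both columns together with mask, so no index lookup is needed.

```python
import pandas as pd

df = pd.DataFrame({
    "group": ["A"] * 6 + ["B"] * 6,
    "date": (["2022-06-28"] * 3 + ["2022-06-27"] * 3) * 2,
    "value": [1, 2, 1, 2, 3, 10, 2, 3, 2, 3, 4, 20],
})

# Same global test as above: more than one standard deviation from the overall mean.
is_outlier = (df["value"] - df["value"].mean()).abs() > df["value"].std()

# Median of each (group, date) pair, broadcast back onto the rows.
pair_median = df.groupby(["group", "date"])["value"].transform("median")

# Replace the flagged rows with the median of their own (group, date) pair.
df["value"] = df["value"].mask(is_outlier, pair_median)

print(df)
```

With this data the 10 and the 20 are flagged and replaced by the medians of their (group, date) pairs. Those medians are computed with the outlier still included, and the value column becomes float because the medians are floats.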

