Pandas - Replace Outliers With Groupby Mean

Try this:

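def replace(group):
    mean, std = group.mean(), group.std()
    outliers = (group - mean).abs() > 3*std
    # mask returns a new object, so the group passed in by transform is not mutated
    return group.mask(outliers, mean)        # or use "group[~outliers].mean()" as the replacement

df.groupby('a').transform(replace)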

Note: If you want to eliminate the 100 in your last group, you can replace 3*std with just 1*std. The standard deviation in this group is 48.33, so with 3*std the 100 stays within the threshold and is kept in the result.
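
For a version you can run end to end, here is a minimal sketch of the same idea. The data, the column names a and b, and the n_std parameter are all made up for illustration; it also uses the mean of the non-outlier values as the replacement, which is the alternative mentioned in the comment above.

```python
import pandas as pd

# Hypothetical data: group "y" holds one extreme value (1000).
df = pd.DataFrame({
    "a": ["x"] * 4 + ["y"] * 12,
    "b": [1.0, 2.0, 3.0, 4.0] + [10.0] * 11 + [1000.0],
})

def replace_outliers(group, n_std=3):
    """Replace values more than n_std standard deviations from the group mean."""
    mean, std = group.mean(), group.std()
    outliers = (group - mean).abs() > n_std * std
    # Use the mean of the remaining values so the outlier does not skew it.
    return group.mask(outliers, group[~outliers].mean())

result = df.groupby("a")["b"].transform(replace_outliers)
print(result.tolist())
```

Keep in mind that a single extreme value inflates the group's standard deviation, so in small groups it may never exceed 3*std. That is why the sketch uses twelve rows in the group that contains the outlier.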

Replace outliers from all columns with mean

I think you should do it like this instead:

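import numpy as np

median = dt[feature].median()
std = dt[feature].std()
# mark values more than one standard deviation from the median as missing
dt.loc[(dt[feature] - median).abs() > std, feature] = np.nan
# assign the result back; fillna(inplace=True) on a selected column may not update dt
dt[feature] = dt[feature].fillna(median)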

My guess is that the problem with your old code was this line:

dt[outliers] = np.nan
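
To cover every column rather than a single feature, one option is to loop over the numeric columns. This is only a sketch: the DataFrame and its column names (height, weight, label) are invented, and it uses Series.mask instead of the NaN round trip so that values which were already missing are left alone. Swap median() for mean() if you want the mean, as in the question title.

```python
import pandas as pd

# Hypothetical data: "height" contains one extreme value (100).
dt = pd.DataFrame({
    "height": [1.0, 2.0, 3.0, 4.0, 100.0],
    "weight": [10.0, 11.0, 12.0, 13.0, 14.0],
    "label": list("abcde"),
})

for col in dt.select_dtypes(include="number").columns:
    median = dt[col].median()
    std = dt[col].std()
    # Flag values more than one standard deviation away from the median.
    outliers = (dt[col] - median).abs() > std
    dt[col] = dt[col].mask(outliers, median)

print(dt)
```

With a threshold of one standard deviation, even the mild extremes 10 and 14 in weight get replaced by the median 12. A multiple such as 3 * std is more conservative.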

How to replace outlier after groupby?

The medians seem to differ slightly from what you're saying (see the comment in the output dataframe). Here's one approach using GroupBy.transform with where:

g = df.groupby('Port').Risk
df['Risk'] = (df.Risk.where(g.transform('quantile', q=0.95) > df.Risk, 
                            g.transform('median')))

      Date     Port  Risk
0  2019-04-30    a  21.80
1  2019-03-29    a  22.60
2  2019-02-28    a  24.35 # -> np.median([21.8, 22.6, 500, 26.1]) = 24.35
3  2019-01-31    a  26.10
4  2019-04-30    b  36.40
5  2019-03-29    b  43.30
6  2019-02-28    b  40.00
7  2019-01-31    b  41.65
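
A self-contained sketch of the same approach, if you want to try it out. The Risk values here are illustrative, not the asker's actual data; in particular the 900 in port b is made up.

```python
import pandas as pd

# Illustrative data: one extreme value per port (500 and 900).
df = pd.DataFrame({
    "Port": ["a"] * 4 + ["b"] * 4,
    "Risk": [21.8, 22.6, 500.0, 26.1, 36.4, 43.3, 40.0, 900.0],
})

g = df.groupby("Port")["Risk"]
upper = g.transform("quantile", q=0.95)   # per-row 95th percentile of its group
median = g.transform("median")            # per-row median of its group

# Keep values below the group's 95th percentile, otherwise use the group median.
df["Risk"] = df["Risk"].where(df["Risk"] < upper, median)
print(df)
```

Note that the median is computed with the outlier still in the group, and that the largest value in a group is always at or above the 0.95 quantile, so it gets replaced even when it is not really extreme.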

Remove outliers in Pandas dataframe with groupby

One way is to filter them out as follows:

In [11]: res = df.groupby("Date")["Interval"].quantile([0.05, 0.95]).unstack(level=1)

In [12]: res
Out[12]:
             0.05   0.95
Date
2016-03-01  489.6  913.4

Now we can look up these values for each row using loc and filter:

In [13]: (res.loc[df.Date, 0.05] < df.Interval.values) & (df.Interval.values < res.loc[df.Date, 0.95])
Out[13]:
Date
2016-03-01    False
2016-03-01     True
2016-03-01     True
2016-03-01     True
2016-03-01    False
dtype: bool

In [14]: df.loc[((res.loc[df.Date, 0.05] < df.Interval.values) & (df.Interval.values < res.loc[df.Date, 0.95])).values]
Out[14]:
   Report        Date  Time  Interval  Total Volume
1    5785  2016-03-01    25     580.0           NaN
2    5786  2016-03-01    26     716.0           NaN
3    5787  2016-03-01    27     803.0           NaN

Note: grouping by 'Time Interval' works the same way, but in your example it doesn't filter out any rows!
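
An alternative that avoids the loc lookup is to broadcast the group quantiles back onto the rows with transform. This is a sketch under assumptions: the DataFrame is cut down to two columns, and the first and last Interval values are illustrative.

```python
import pandas as pd

# Illustrative data for a single date.
df = pd.DataFrame({
    "Date": ["2016-03-01"] * 5,
    "Interval": [467.0, 580.0, 716.0, 803.0, 941.0],
})

g = df.groupby("Date")["Interval"]
low = g.transform("quantile", q=0.05)    # 5th percentile of each row's group
high = g.transform("quantile", q=0.95)   # 95th percentile of each row's group

# Keep only rows strictly inside the per-group quantile band.
filtered = df[(df["Interval"] > low) & (df["Interval"] < high)]
print(filtered)
```

Because low and high share the index of df, the boolean mask lines up row by row and no .values conversion is needed.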

Pandas finding and replacing outliers based on a group of two columns

First you can identify the outliers. This code finds every value that is more than one standard deviation away from the overall mean of the column.

outliers = df.loc[(df.value - df.value.mean()).abs() > df.value.std() * 1].index

Then you can determine the median of each group:

medians = df.groupby('group')['value'].median()

Finally, locate the outliers and replace with the medians:

df.loc[outliers, 'value'] = medians.loc[df.loc[outliers, 'group']].values

All together it looks like:

import pandas as pd
index = [0,1,2,3,4,5,6,7,8,9,10,11]
s = pd.Series(['A','A','A','A','A','A','B','B','B','B','B','B'],index= index)
t = pd.Series(['2022-06-28','2022-06-28','2022-06-28','2022-06-27','2022-06-27','2022-06-27',
               '2022-06-28','2022-06-28','2022-06-28','2022-06-27','2022-06-27','2022-06-27'],index= index)
r = pd.Series([1,2,1,2,3,10,2,3,2,3,4,20],index= index)
df = pd.DataFrame(s,columns = ['group'])
df['date'] = t
df['value'] = r
outliers = df.loc[(df.value - df.value.mean()).abs() > df.value.std() * 1].index
medians = df.groupby('group')['value'].median()
df.loc[outliers, 'value'] = medians.loc[df.loc[outliers, 'group']].values

Output:

   group        date  value
0      A  2022-06-28      1
1      A  2022-06-28      2
2      A  2022-06-28      1
3      A  2022-06-27      2
4      A  2022-06-27      3
5      A  2022-06-27      2
6      B  2022-06-28      2
7      B  2022-06-28      3
8      B  2022-06-28      2
9      B  2022-06-27      3
10     B  2022-06-27      4
11     B  2022-06-27      3
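
If the replacement should come from the combination of both columns (group and date) rather than from group alone, a possible variation is to compute the medians with transform over both keys. This is a sketch, not part of the answer above: it keeps the same global one-standard-deviation rule for detection and only changes where the median comes from.

```python
import pandas as pd

df = pd.DataFrame({
    "group": ["A"] * 6 + ["B"] * 6,
    "date": (["2022-06-28"] * 3 + ["2022-06-27"] * 3) * 2,
    "value": [1, 2, 1, 2, 3, 10, 2, 3, 2, 3, 4, 20],
})

# Same detection rule as above: more than one std from the overall mean.
is_outlier = (df["value"] - df["value"].mean()).abs() > df["value"].std()

# Median of each (group, date) pair, aligned row by row with df.
medians = df.groupby(["group", "date"])["value"].transform("median")

df["value"] = df["value"].mask(is_outlier, medians)
print(df)
```

Here the 10 in (A, 2022-06-27) becomes 3 and the 20 in (B, 2022-06-27) becomes 4, since each median is taken over that pair's three values, the outlier included.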
