How to Replace Outliers with the 5th and 95th Percentile Values in R

This will do it: the function caps (winsorizes) anything below the 5th percentile at the 5th percentile, and anything above the 95th at the 95th.

fun <- function(x) {
  quantiles <- quantile(x, c(0.05, 0.95))
  # Cap low values at the 5th percentile and high values at the 95th
  x[x < quantiles[1]] <- quantiles[1]
  x[x > quantiles[2]] <- quantiles[2]
  x
}
fun(yourdata)
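
For example, on a vector where the percentiles are easy to verify (a minimal sketch on made-up data):

x <- 1:100
# quantile(x, c(0.05, 0.95)) is c(5.95, 95.05), so fun() raises the
# values 1:5 to 5.95 and lowers the values 96:100 to 95.05
fun(x)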

How to remove 5th and 95th percentile values in ddply while calculating mean for each group

Here is a function mean2 that computes a trimmed mean, ignoring values outside the 5th to 95th percentile range.

mean2 <- function(x, na.rm = FALSE, probs = c(0.05, 0.95), ...) {
  if (na.rm) x <- x[!is.na(x)]
  qq <- quantile(x, probs = probs)
  # Keep only values strictly between the two percentiles
  keep <- x > qq[1] & x < qq[2]
  mean(x[keep], ...)
}
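
As a quick sanity check on made-up data (a sketch):

# The 5th and 95th percentiles of c(1:9, 100) are 1.45 and 59.05,
# so 1 and 100 are dropped and the result is mean(2:9) = 5.5
mean2(c(1:9, 100))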

Now mutate the data.frame with the function after grouping by species; this adds the group's trimmed mean as a new column on every row.

library(dplyr)

df %>%
  group_by(species) %>%
  mutate(mean = mean2(trait))
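
If you want a single summary row per group instead of a column repeated on every row, summarise() should work the same way (a sketch):

df %>%
  group_by(species) %>%
  summarise(mean = mean2(trait))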

Test data creation code

set.seed(1234)
species <- sample(LETTERS[1:3], 20, TRUE)
trait <- sample(2:8, 20, TRUE)
# Inject a few high outliers and one low outlier
trait[sample(20, 3)] <- sample(50:60, 3)
trait[sample(20, 1)] <- -2
df <- data.frame(species, trait)

How to replace the outliers with the 95th and 5th percentile in Python?

I'm not sure if this approach is a suitable way to deal with outliers, but to achieve what you want, the clip function is useful: it assigns values outside the boundaries to the boundary values. You can read more in the pandas documentation.

import numpy as np
import pandas as pd

data = pd.Series(np.random.randn(100))
data.clip(lower=data.quantile(0.05), upper=data.quantile(0.95))

Remove data greater than the 95th percentile in a data frame

Use the quantile function:

> quantile(d$Point, 0.95)
 95%
5800

> d[d$Point < quantile(d$Point, 0.95), ]
  Group Point
2     B  5000
3     C  1000
4     D   100
5     F    70
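
If you prefer dplyr, the same subset can be written as follows (a sketch assuming the same d):

library(dplyr)
d %>% filter(Point < quantile(Point, 0.95))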

Replace and delete the first and last percentile in a data frame or multiple columns at once

Using both across and c_across in dplyr, you may also do this:

Steps explained:

  • c_across is usually used with rowwise() as it creates a complete copy of the data subsetted through its inner argument. But I have used it without rowwise(), so instead of one row it creates a copy of the whole data, as desired.
  • Thereafter, two quantiles of this data are computed (these are scalar quantities).
  • Now the only job remaining is to check these values against every other value in the data, so I used across directly.
  • Using across, I built a lambda formula, which starts with a tilde and whose only argument is the dot (.). This formula style, ~ ., is equivalent to function(x) x, and the rest is clear.
DF %>%
  mutate(across(starts_with('X'),
                ~ ifelse(. > quantile(c_across(starts_with('X')), 0.99) |
                           . < quantile(c_across(starts_with('X')), 0.01),
                         NA, .))) %>%
  na.omit()

#>           A some_number X1  X2  X3  X4  X5
#> 6   event_6          69  6 106 206 306 406
#> 7   event_7         871  7 107 207 307 407
#> 8   event_8         356  8 108 208 308 408
#> ...
#> 93 event_93         432 93 193 293 393 493
#> 94 event_94         967 94 194 294 394 494
#> 95 event_95         516 95 195 295 395 495

Since starts_with works only inside across or c_across, and to avoid the slower rowwise() here, we can also filter directly:

# str_detect() here comes from the stringr package
DF %>%
  filter(rowSums(cur_data()[str_detect(names(DF), 'X')] > quantile(c_across(starts_with('X')), 0.99)) == 0 &
           rowSums(cur_data()[str_detect(names(DF), 'X')] < quantile(c_across(starts_with('X')), 0.01)) == 0)

This will also give the desired 90 rows of output.
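
On newer dplyr versions, where cur_data() is deprecated, the same filter can be sketched with if_all and precomputed quantiles; qs is a helper name introduced here:

# Compute the global 1st and 99th percentiles over all X columns,
# then keep rows where every X value lies inside that range
qs <- quantile(unlist(DF[startsWith(names(DF), "X")]), c(0.01, 0.99))
DF %>%
  filter(if_all(starts_with("X"), ~ . >= qs[1] & . <= qs[2]))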

Remove Outliers in Pandas DataFrame using Percentiles

The initial dataset.

print(df.head())

   Col0  Col1  Col2  Col3  Col4  User_id
0    49    31    93    53    39       44
1    69    13    84    58    24       47
2    41    71     2    43    58       64
3    35    56    69    55    36       67
4    64    24    12    18    99       67

First removing the User_id column

filt_df = df.loc[:, df.columns != 'User_id']

Then, computing percentiles.

low = 0.05
high = 0.95
quant_df = filt_df.quantile([low, high])
print(quant_df)

       Col0   Col1  Col2   Col3   Col4
0.05   2.00   3.00   6.9   3.95   4.00
0.95  95.05  89.05  93.0  94.00  97.05

Next, filtering values based on the computed percentiles. To do that, I use an apply by columns and that's it! Values outside the range become NaN.

filt_df = filt_df.apply(lambda x: x[(x > quant_df.loc[low, x.name]) &
                                    (x < quant_df.loc[high, x.name])], axis=0)

Bringing the User_id back.

filt_df = pd.concat([df.loc[:,'User_id'], filt_df], axis=1)

Checking the result: values outside the percentile range have become NaN.

print(filt_df.head())

   User_id  Col0  Col1  Col2  Col3  Col4
0       44    49    31   NaN    53    39
1       47    69    13    84    58    24
2       64    41    71   NaN    43    58
3       67    35    56    69    55    36
4       67    64    24    12    18   NaN

print(filt_df.describe())

          User_id       Col0       Col1       Col2       Col3       Col4
count  100.000000  89.000000  88.000000  88.000000  89.000000  89.000000
mean    48.230000  49.573034  45.659091  52.727273  47.460674  57.157303
std     28.372292  25.672274  23.537149  26.509477  25.823728  26.231876
min      0.000000   3.000000   5.000000   7.000000   4.000000   5.000000
25%     23.000000  29.000000  29.000000  29.500000  24.000000  36.000000
50%     47.000000  50.000000  40.500000  52.500000  49.000000  59.000000
75%     74.250000  69.000000  67.000000  75.000000  70.000000  79.000000
max     99.000000  95.000000  89.000000  92.000000  91.000000  97.000000

Last, rows with NaN values can be dropped simply like this.

filt_df.dropna(inplace=True)
print(filt_df.head())

   User_id  Col0  Col1  Col2  Col3  Col4
1       47    69    13    84    58    24
3       67    35    56    69    55    36
5        9    95    79    44    45    69
6       83    69    41    66    87     6
9       87    50    54    39    53    40

How to generate the test dataset

import numpy as np
from pandas import DataFrame

np.random.seed(0)
nb_sample = 100
num_sample = (0, 100)

# Build one User_id column and five random integer columns
d = dict()
d['User_id'] = np.random.randint(num_sample[0], num_sample[1], nb_sample)
for i in range(5):
    d['Col' + str(i)] = np.random.randint(num_sample[0], num_sample[1], nb_sample)

df = DataFrame.from_dict(d)

