How to Replace Outliers with the 5th and 95th Percentile Values in R

This will do it: the function caps (winsorizes) anything below the 5th percentile at the 5th percentile, and anything above the 95th at the 95th.

fun <- function(x) {
  quantiles <- quantile(x, c(0.05, 0.95))
  # Cap low values at the 5th percentile and high values at the 95th
  x[x < quantiles[1]] <- quantiles[1]
  x[x > quantiles[2]] <- quantiles[2]
  x
}
fun(yourdata)
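
For example, on a vector where the percentiles are easy to verify (a minimal sketch on made-up data):

x <- 1:100
# quantile(x, c(0.05, 0.95)) is c(5.95, 95.05), so fun() raises the
# values 1:5 to 5.95 and lowers the values 96:100 to 95.05
fun(x)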

How to remove 5th and 95th percentile values in ddply while calculating mean for each group

Here is a function mean2 that computes a trimmed mean, ignoring values outside the 5th to 95th percentile range.

mean2 <- function(x, na.rm = FALSE, probs = c(0.05, 0.95), ...) {
  if (na.rm) x <- x[!is.na(x)]
  qq <- quantile(x, probs = probs)
  # Keep only values strictly between the two percentiles
  keep <- x > qq[1] & x < qq[2]
  mean(x[keep], ...)
}
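
As a quick sanity check on made-up data (a sketch):

# The 5th and 95th percentiles of c(1:9, 100) are 1.45 and 59.05,
# so 1 and 100 are dropped and the result is mean(2:9) = 5.5
mean2(c(1:9, 100))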

Now mutate the data.frame with the function after grouping by species; this adds the group's trimmed mean as a new column on every row.

library(dplyr)

df %>%
  group_by(species) %>%
  mutate(mean = mean2(trait))
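
If you want a single summary row per group instead of a column repeated on every row, summarise() should work the same way (a sketch):

df %>%
  group_by(species) %>%
  summarise(mean = mean2(trait))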

Test data creation code

set.seed(1234)
species <- sample(LETTERS[1:3], 20, TRUE)
trait <- sample(2:8, 20, TRUE)
# Inject a few high outliers and one low outlier
trait[sample(20, 3)] <- sample(50:60, 3)
trait[sample(20, 1)] <- -2
df <- data.frame(species, trait)

How to replace the outliers with the 95th and 5th percentile in Python?

I'm not sure if this approach is a suitable way to deal with outliers, but to achieve what you want, the clip function is useful: it assigns values outside the boundaries to the boundary values. You can read more in the pandas documentation.

import numpy as np
import pandas as pd

data = pd.Series(np.random.randn(100))
data.clip(lower=data.quantile(0.05), upper=data.quantile(0.95))

Remove data greater than the 95th percentile in a data frame

Use the quantile function:

> quantile(d$Point, 0.95)
 95%
5800

> d[d$Point < quantile(d$Point, 0.95), ]
  Group Point
2     B  5000
3     C  1000
4     D   100
5     F    70
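
If you prefer dplyr, the same subset can be written as follows (a sketch assuming the same d):

library(dplyr)
d %>% filter(Point < quantile(Point, 0.95))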

Replace and delete the first and last percentile in a data frame or multiple columns at once

Using both across and c_across in dplyr, you may also do this:

Steps explained:

  • c_across is usually used with rowwise() as it creates a complete copy of the data subsetted through its inner argument. But I have used it without rowwise(), so instead of one row it creates a copy of the whole data, as desired.
  • Thereafter, two quantiles of this data are computed (these are scalar quantities).
  • Now the only job remaining is to check these values against every other value in the data, so I used across directly.
  • Using across, I built a lambda formula, which starts with a tilde and whose only argument is the dot (.). This formula style, ~ ., is equivalent to function(x) x, and the rest is clear.
DF %>%
  mutate(across(starts_with('X'),
                ~ ifelse(. > quantile(c_across(starts_with('X')), 0.99) |
                           . < quantile(c_across(starts_with('X')), 0.01),
                         NA, .))) %>%
  na.omit()

#>           A some_number X1  X2  X3  X4  X5
#> 6   event_6          69  6 106 206 306 406
#> 7   event_7         871  7 107 207 307 407
#> 8   event_8         356  8 108 208 308 408
#> ...
#> 93 event_93         432 93 193 293 393 493
#> 94 event_94         967 94 194 294 394 494
#> 95 event_95         516 95 195 295 395 495

Since starts_with works only inside across or c_across, and to avoid the slower rowwise() here, we can also filter directly:

# str_detect() here comes from the stringr package
DF %>%
  filter(rowSums(cur_data()[str_detect(names(DF), 'X')] > quantile(c_across(starts_with('X')), 0.99)) == 0 &
           rowSums(cur_data()[str_detect(names(DF), 'X')] < quantile(c_across(starts_with('X')), 0.01)) == 0)

This will also give the desired 90 rows of output.
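
On newer dplyr versions, where cur_data() is deprecated, the same filter can be sketched with if_all and precomputed quantiles; qs is a helper name introduced here:

# Compute the global 1st and 99th percentiles over all X columns,
# then keep rows where every X value lies inside that range
qs <- quantile(unlist(DF[startsWith(names(DF), "X")]), c(0.01, 0.99))
DF %>%
  filter(if_all(starts_with("X"), ~ . >= qs[1] & . <= qs[2]))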

Remove Outliers in Pandas DataFrame using Percentiles

The initial dataset.

print(df.head())

   Col0  Col1  Col2  Col3  Col4  User_id
0    49    31    93    53    39       44
1    69    13    84    58    24       47
2    41    71     2    43    58       64
3    35    56    69    55    36       67
4    64    24    12    18    99       67

First removing the User_id column

filt_df = df.loc[:, df.columns != 'User_id']

Then, computing percentiles.

low = 0.05
high = 0.95
quant_df = filt_df.quantile([low, high])
print(quant_df)

       Col0   Col1  Col2   Col3   Col4
0.05   2.00   3.00   6.9   3.95   4.00
0.95  95.05  89.05  93.0  94.00  97.05

Next, filtering values based on the computed percentiles. To do that, I use an apply by columns and that's it! Values outside the range become NaN.

filt_df = filt_df.apply(lambda x: x[(x > quant_df.loc[low, x.name]) &
                                    (x < quant_df.loc[high, x.name])], axis=0)

Bringing the User_id back.

filt_df = pd.concat([df.loc[:,'User_id'], filt_df], axis=1)

Checking the result: values outside the percentile range have become NaN.

print(filt_df.head())

   User_id  Col0  Col1  Col2  Col3  Col4
0       44    49    31   NaN    53    39
1       47    69    13    84    58    24
2       64    41    71   NaN    43    58
3       67    35    56    69    55    36
4       67    64    24    12    18   NaN

print(filt_df.describe())

          User_id       Col0       Col1       Col2       Col3       Col4
count  100.000000  89.000000  88.000000  88.000000  89.000000  89.000000
mean    48.230000  49.573034  45.659091  52.727273  47.460674  57.157303
std     28.372292  25.672274  23.537149  26.509477  25.823728  26.231876
min      0.000000   3.000000   5.000000   7.000000   4.000000   5.000000
25%     23.000000  29.000000  29.000000  29.500000  24.000000  36.000000
50%     47.000000  50.000000  40.500000  52.500000  49.000000  59.000000
75%     74.250000  69.000000  67.000000  75.000000  70.000000  79.000000
max     99.000000  95.000000  89.000000  92.000000  91.000000  97.000000

Last, rows with NaN values can be dropped simply like this.

filt_df.dropna(inplace=True)
print(filt_df.head())

   User_id  Col0  Col1  Col2  Col3  Col4
1       47    69    13    84    58    24
3       67    35    56    69    55    36
5        9    95    79    44    45    69
6       83    69    41    66    87     6
9       87    50    54    39    53    40

How to generate the test dataset

import numpy as np
from pandas import DataFrame

np.random.seed(0)
nb_sample = 100
num_sample = (0, 100)

# Build one User_id column and five random integer columns
d = dict()
d['User_id'] = np.random.randint(num_sample[0], num_sample[1], nb_sample)
for i in range(5):
    d['Col' + str(i)] = np.random.randint(num_sample[0], num_sample[1], nb_sample)

df = DataFrame.from_dict(d)

