How to replace outliers with the 5th and 95th percentile values in R
This would do it.
fun <- function(x) {
  quantiles <- quantile(x, c(0.05, 0.95))
  x[x < quantiles[1]] <- quantiles[1]
  x[x > quantiles[2]] <- quantiles[2]
  x
}
fun(yourdata)
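For readers working in Python, the same winsorizing logic can be sketched with NumPy (the function name fun and the sample values here are invented for illustration):

```python
import numpy as np

def fun(x):
    # Clamp values below the 5th percentile and above the 95th
    # to the percentile values themselves (winsorizing).
    quantiles = np.quantile(x, [0.05, 0.95])
    x = np.asarray(x, dtype=float).copy()
    x[x < quantiles[0]] = quantiles[0]
    x[x > quantiles[1]] = quantiles[1]
    return x

print(fun([-100, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 100]))
```

Values strictly inside the percentile range are left untouched; only the two extremes are pulled in.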
How to remove 5th and 95th percentile values in ddply while calculating mean for each group
Here is a function mean2 that computes the trimmed mean.
mean2 <- function(x, na.rm = FALSE, probs = c(0.05, 0.95), ...) {
  if (na.rm) x <- x[!is.na(x)]
  qq <- quantile(x, probs = probs)
  keep <- x > qq[1] & x < qq[2]
  mean(x[keep], ...)
}
Now mutate the data.frame with the function after grouping by species.
library(dplyr)
df %>%
  group_by(species) %>%
  mutate(mean = mean2(trait))
Test data creation code
set.seed(1234)
species <- sample(LETTERS[1:3], 20, TRUE)
trait <- sample(2:8, 20, TRUE)
trait[sample(20, 3)] <- sample(50:60, 3)
trait[sample(20, 1)] <- -2
df <- data.frame(species, trait)
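For comparison, the same trimmed group mean can be sketched in pandas; the column names species and trait mirror the R example, and the sample values here are made up:

```python
import pandas as pd

def mean2(x, probs=(0.05, 0.95)):
    # Drop values at or beyond the given quantiles, then average the rest.
    lo, hi = x.quantile(list(probs))
    return x[(x > lo) & (x < hi)].mean()

df = pd.DataFrame({
    "species": ["A", "A", "A", "A", "B", "B", "B", "B"],
    "trait":   [1,   2,   3,   100, 4,   5,   6,   -50],
})
print(df.groupby("species")["trait"].apply(mean2))
```

Each group's extremes (100 in A, -50 in B) fall outside the 5th-95th percentile band and are excluded from its mean.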
How to replace the outliers with the 95th and 5th percentile in Python?
I'm not sure whether this approach is a suitable way to deal with outliers, but to achieve what you want, the clip function is useful. It assigns values outside the boundaries to the boundary values. You can read more in the documentation.
import numpy as np
import pandas as pd

data = pd.Series(np.random.randn(100))
data.clip(lower=data.quantile(0.05), upper=data.quantile(0.95))
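Note that clip returns a new Series rather than modifying in place, so assign the result back; a small sketch with made-up numbers:

```python
import pandas as pd

s = pd.Series([-50.0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 50])
# clip does not mutate s; reassign to keep the clipped values.
s = s.clip(lower=s.quantile(0.05), upper=s.quantile(0.95))
print(s.min(), s.max())
```

After clipping, the minimum and maximum of the Series equal the 5th and 95th percentile values of the original data.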
remove data greater than 95th percentile in data frame
Use the quantile function:
> quantile(d$Point, 0.95)
95%
5800
> d[d$Point < quantile(d$Point, 0.95), ]
Group Point
2 B 5000
3 C 1000
4 D 100
5 F 70
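The same row filter in pandas, for comparison. The frame is reconstructed from the printed output; Group A's value of 6000 is an assumption, chosen to be consistent with the printed 5800 quantile:

```python
import pandas as pd

# Reconstructed data; the Point value for Group A (6000) is assumed.
d = pd.DataFrame({
    "Group": ["A", "B", "C", "D", "F"],
    "Point": [6000, 5000, 1000, 100, 70],
})
# Keep only rows strictly below the 95th percentile.
kept = d[d["Point"] < d["Point"].quantile(0.95)]
print(kept)
```

As in the R output, the row holding the largest value is dropped.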
Replace and delete first and last percentile in dataframe or multiple columns at once
Using both across and c_across in dplyr, you may also do this.

Steps explained:
- c_across is usually used with rowwise(), as it creates a complete copy of the data subsetted through its inner argument. Here it is used without rowwise(), so instead of producing one row it produces a copy of the whole data, as desired.
- Thereafter, two quantiles of this data are computed (which are scalar quantities).
- The only remaining job is to compare these values with every other value in the data, so across is used directly.
- Inside across, a lambda formula is built: it starts with a tilde (~) and its only argument is the dot (.). This formula style, ~ ., is equivalent to function(x) x, and the rest is clear.
DF %>%
  mutate(across(starts_with('X'),
                ~ ifelse(. > quantile(c_across(starts_with('X')), 0.99) |
                           . < quantile(c_across(starts_with('X')), 0.01),
                         NA, .))) %>%
  na.omit()
#> A some_number X1 X2 X3 X4 X5
#> 6 event_6 69 6 106 206 306 406
#> 7 event_7 871 7 107 207 307 407
#> 8 event_8 356 8 108 208 308 408
#> ...
#> 93 event_93 432 93 193 293 393 493
#> 94 event_94 967 94 194 294 394 494
#> 95 event_95 516 95 195 295 395 495
Since starts_with works only inside across or c_across, and to avoid the slower rowwise here, we can also do this directly:
DF %>%
  filter(rowSums(cur_data()[str_detect(names(DF), 'X')] > quantile(c_across(starts_with('X')), 0.99)) == 0 &
           rowSums(cur_data()[str_detect(names(DF), 'X')] < quantile(c_across(starts_with('X')), 0.01)) == 0)
This will also give 90 rows in the output, as desired.
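A pandas analogue of the pooled-quantile trim (the frame here is toy data invented for illustration; the quantiles are taken over all X columns pooled together, which matches the c_across behaviour above):

```python
import numpy as np
import pandas as pd

# Toy frame: 100 rows with one extreme value planted at each end.
DF = pd.DataFrame({"A": range(100),
                   "X1": list(range(100)),
                   "X2": list(range(100, 200))})
DF.loc[0, "X1"] = -9999    # below the pooled 1st percentile
DF.loc[99, "X2"] = 9999    # above the pooled 99th percentile

x = DF.filter(regex="^X")
lo, hi = np.quantile(x.to_numpy(), [0.01, 0.99])  # pooled, like c_across()
kept = DF[((x >= lo) & (x <= hi)).all(axis=1)]
print(len(kept))
```

A row survives only if every one of its X values lies within the pooled 1st-99th percentile band.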
Remove Outliers in Pandas DataFrame using Percentiles
The initial dataset.
print(df.head())
Col0 Col1 Col2 Col3 Col4 User_id
0 49 31 93 53 39 44
1 69 13 84 58 24 47
2 41 71 2 43 58 64
3 35 56 69 55 36 67
4 64 24 12 18 99 67
First, removing the User_id column.
filt_df = df.loc[:, df.columns != 'User_id']
Then, computing percentiles.
low = .05
high = .95
quant_df = filt_df.quantile([low, high])
print(quant_df)
Col0 Col1 Col2 Col3 Col4
0.05 2.00 3.00 6.9 3.95 4.00
0.95 95.05 89.05 93.0 94.00 97.05
Next, filtering values based on the computed percentiles. To do that, an apply over the columns does the job, and that's it!
filt_df = filt_df.apply(lambda x: x[(x > quant_df.loc[low, x.name]) &
                                    (x < quant_df.loc[high, x.name])], axis=0)
Bringing the User_id column back.
filt_df = pd.concat([df.loc[:,'User_id'], filt_df], axis=1)
Last, rows with NaN values can be dropped simply like this.
filt_df.dropna(inplace=True)
print(filt_df.head())
User_id Col0 Col1 Col2 Col3 Col4
1 47 69 13 84 58 24
3 67 35 56 69 55 36
5 9 95 79 44 45 69
6 83 69 41 66 87 6
9 87 50 54 39 53 40
Checking the result before the drop: values outside the percentiles became NaN, which is also why describe below counts fewer than 100 values per column.
print(filt_df.head())
User_id Col0 Col1 Col2 Col3 Col4
0 44 49 31 NaN 53 39
1 47 69 13 84 58 24
2 64 41 71 NaN 43 58
3 67 35 56 69 55 36
4 67 64 24 12 18 NaN
print(filt_df.describe())
User_id Col0 Col1 Col2 Col3 Col4
count 100.000000 89.000000 88.000000 88.000000 89.000000 89.000000
mean 48.230000 49.573034 45.659091 52.727273 47.460674 57.157303
std 28.372292 25.672274 23.537149 26.509477 25.823728 26.231876
min 0.000000 3.000000 5.000000 7.000000 4.000000 5.000000
25% 23.000000 29.000000 29.000000 29.500000 24.000000 36.000000
50% 47.000000 50.000000 40.500000 52.500000 49.000000 59.000000
75% 74.250000 69.000000 67.000000 75.000000 70.000000 79.000000
max 99.000000 95.000000 89.000000 92.000000 91.000000 97.000000
How to generate the test dataset
import numpy as np
import pandas as pd

np.random.seed(0)
nb_sample = 100
num_sample = (0, 100)
d = dict()
d['User_id'] = np.random.randint(num_sample[0], num_sample[1], nb_sample)
for i in range(5):
    d['Col' + str(i)] = np.random.randint(num_sample[0], num_sample[1], nb_sample)
df = pd.DataFrame.from_dict(d)