How to Remove Outliers from a Dataset

OK, you can apply something like the function below to your dataset. Do not overwrite and save the original, or you'll destroy your data! And, by the way, you should (almost) never remove outliers from your data:

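remove_outliers <- function(x, na.rm = TRUE, ...) {
  # First and third quartiles; extra arguments are passed on to quantile()
  qnt <- quantile(x, probs = c(.25, .75), na.rm = na.rm, ...)
  # Fence width: 1.5 times the interquartile range
  H <- 1.5 * IQR(x, na.rm = na.rm)
  y <- x
  # Set values outside the fences to NA instead of dropping them
  y[x < (qnt[1] - H)] <- NA
  y[x > (qnt[2] + H)] <- NA
  y
}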

To see it in action:

set.seed(1)
x <- rnorm(100)
x <- c(-10, x, 10)
y <- remove_outliers(x)
## png()
par(mfrow = c(1, 2))
boxplot(x)
boxplot(y)
## dev.off()

And once again: don't remove outliers without a very good reason. Outliers are just meant to be! =)

EDIT: I added na.rm = TRUE as default.

EDIT2: Removed quantile function, added subscripting, hence made the function faster! =)
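For readers working in Python, the same rule can be sketched with pandas. This is only an illustrative port, not part of the original answer: the function name is reused for symmetry, and the parameter `k` (the fence multiplier) is an assumption added here.

```python
import numpy as np
import pandas as pd


def remove_outliers(x: pd.Series, k: float = 1.5) -> pd.Series:
    """Return a copy of x with values outside the Tukey fences set to NaN."""
    # quantile() skips NaN by default, like na.rm = TRUE in the R version
    q1, q3 = x.quantile([0.25, 0.75])
    h = k * (q3 - q1)
    # where() keeps values that satisfy the condition and fills the rest with NaN
    return x.where((x >= q1 - h) & (x <= q3 + h))


# Same idea as the R demo: 100 normal draws plus two planted extreme values
rng = np.random.default_rng(1)
x = pd.Series(np.concatenate(([-10.0], rng.normal(size=100), [10.0])))
y = remove_outliers(x)
```

As in the R function, `x` itself is left untouched and `y` has the same length, so the flagged positions stay visible as missing values.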


R - how to remove outliers from dataset by two different groups

You could use anti_join() from dplyr after identifying the outliers. Note that my df_outliers contains only IDs 1, 7 and 10.

library(tidyverse)
library(rstatix)

df <- tibble(
  ID = c(1L,2L,3L,4L,5L,6L,7L,8L,
         9L,10L,11L,12L,13L,14L,15L,16L,17L,18L,19L,
         20L,21L,22L,23L,24L,25L,26L,27L,28L,19L),
  Treatment = c("A","A","A","A","A","A",
                "A","A","A","A","A","A","A","A","A","A","A","A",
                "A","A","B","B","B","B","B","B","B","B","B"),
  conc = c(40,40,40,40,40,40,20,20,
           20,20,20,20,10,10,10,10,10,10,5,5,0.8,0.8,
           0.8,0.8,0.8,0.8,0.6,0.6,0.6),
  relabs = c(1.0793923,0.6436631,0.5556844,
             0.4834845,0.7224756,0.6804259,0.9958288,0.709936,
             0.7028124,0.5016352,0.6860346,0.734197,0.8175491,
             0.690091,0.5278228,0.7560026,0.8841343,0.6687616,
             0.8563232,0.7419997,1.2049695,0.4969811,0.2835814,0.670025,
             1.3126651,0.4510617,0.7629639,0.7513716,0.7956074)
)

df_outliers <- df %>%
  group_by(Treatment, conc) %>%
  identify_outliers("relabs")

# A tibble: 3 x 6
  Treatment  conc    ID relabs is.outlier is.extreme
  <chr>     <dbl> <int>  <dbl> <lgl>      <lgl>
1 A            20     7  0.996 TRUE       TRUE
2 A            20    10  0.502 TRUE       TRUE
3 A            40     1  1.08  TRUE       FALSE

# without outliers
df %>%
  anti_join(df_outliers, by = "ID") %>%
  view()

# A tibble: 26 x 4
      ID Treatment  conc relabs
   <int> <chr>     <dbl>  <dbl>
 1     2 A            40  0.644
 2     3 A            40  0.556
 3     4 A            40  0.483
 4     5 A            40  0.722
 5     6 A            40  0.680
 6     8 A            20  0.710
 7     9 A            20  0.703
 8    11 A            20  0.686
 9    12 A            20  0.734
10    13 A            10  0.818
# … with 16 more rows
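The group-wise idea above (flag outliers within each group, then drop the flagged rows) can also be sketched in pandas. The toy data, the `value` column and the helper `is_outlier` are hypothetical and only for illustration; the helper applies the 1.5 × IQR boxplot rule and is not a reimplementation of rstatix.

```python
import pandas as pd

# Hypothetical toy data, not the data from the answer above
df = pd.DataFrame({
    "ID": range(1, 13),
    "Treatment": ["A"] * 6 + ["B"] * 6,
    "value": [1.0, 1.1, 0.9, 1.0, 1.2, 5.0,
              2.0, 2.1, 1.9, 2.0, 2.2, 2.1],
})


def is_outlier(s: pd.Series, k: float = 1.5) -> pd.Series:
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR] within one group."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return (s < q1 - k * iqr) | (s > q3 + k * iqr)


# transform() applies the rule per group and returns a mask aligned with df
mask = df.groupby("Treatment")["value"].transform(is_outlier).astype(bool)
outliers = df[mask]   # rows flagged within their own group
cleaned = df[~mask]   # everything else
```

Filtering with a boolean mask plays the role of the anti_join() step here and does not depend on the ID column being unique.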

How to remove outliers from a dataset using bivariate boxplot

You can extract the desired subset like so:


How to remove Outliers in Python?

Your code fails because you're trying to compute z-scores on categorical columns.

To avoid this, first split your train DataFrame into its numerical and categorical columns:

num_train = train.select_dtypes(include=["number"])
cat_train = train.select_dtypes(exclude=["number"])

and only after that compute the index of the rows to keep:

# Take absolute values so that large negative z-scores are excluded too
idx = np.all(np.abs(stats.zscore(num_train)) < 3, axis=1)

and finally add the two pieces together:

train_cleaned = pd.concat([num_train.loc[idx], cat_train.loc[idx]], axis=1)
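Putting the z-score steps together, a self-contained sketch might look like the following. The small `train` DataFrame and its column names are made up for illustration. The sketch filters `train` directly with the row mask, which should give the same rows as splitting and concatenating while keeping the original column order.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Hypothetical stand-in for the `train` DataFrame from the question
train = pd.DataFrame({
    "price": [10.0] * 19 + [1000.0],   # one planted extreme value
    "area": np.linspace(50, 69, 20),
    "kind": ["flat", "house"] * 10,    # categorical column, excluded from z-scores
})

num_train = train.select_dtypes(include=["number"])

# Column-wise z-scores; absolute value so both tails are treated alike
z = np.abs(stats.zscore(num_train.to_numpy(), axis=0))

# Keep a row only if every numeric column is within 3 standard deviations
idx = (z < 3).all(axis=1)
train_cleaned = train.loc[idx]
```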

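For the IQR approach:

# Note: 0.02 and 0.98 are the 2nd and 98th percentiles, a much wider band
# than the quartiles (0.25 and 0.75) used by the standard IQR rule.
Q1 = num_train.quantile(0.02)
Q3 = num_train.quantile(0.98)
IQR = Q3 - Q1
# Keep only rows where no numeric column falls outside the fences
idx = ~((num_train < (Q1 - 1.5 * IQR)) | (num_train > (Q3 + 1.5 * IQR))).any(axis=1)
train_cleaned = pd.concat([num_train.loc[idx], cat_train.loc[idx]], axis=1)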

Please let us know if you have any further questions.

PS

You might also consider another way of dealing with outliers: pandas.DataFrame.clip, which caps each outlying value at a threshold, column by column, instead of dropping the whole row.
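A minimal sketch of the clipping idea, assuming per-column fences built from the quartiles (the toy `num_train` data and the choice of 1.5 × IQR fences are assumptions made for illustration):

```python
import pandas as pd

# Hypothetical numeric data with one extreme value per column
num_train = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 100.0],
    "b": [-50.0, 10.0, 11.0, 12.0, 13.0],
})

q1 = num_train.quantile(0.25)
q3 = num_train.quantile(0.75)
iqr = q3 - q1

# axis=1 aligns the per-column thresholds (Series indexed by column name)
# with the DataFrame's columns
clipped = num_train.clip(lower=q1 - 1.5 * iqr, upper=q3 + 1.5 * iqr, axis=1)
```

All five rows survive; only the two extreme cells are pulled back to their column's fence.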


