How to Remove Outliers from a Dataset

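OK, you can apply something like the function below to your dataset. It uses the common 1.5 * IQR rule and returns a copy of the vector with the outliers set to NA. Do not overwrite and save your original data with the result, or you'll destroy your data! And, btw, you should (almost) never remove outliers from your data: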

remove_outliers <- function(x, na.rm = TRUE, ...) {
  qnt <- quantile(x, probs=c(.25, .75), na.rm = na.rm, ...)
  H <- 1.5 * IQR(x, na.rm = na.rm)
  y <- x
  y[x < (qnt[1] - H)] <- NA
  y[x > (qnt[2] + H)] <- NA
  y
}
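
The fences that the function computes can be checked by hand on a small vector. The sketch below uses a made-up toy vector (not from the original answer) and the default quantile type, so treat it as an illustration of the rule rather than a recommendation:

```r
# Hypothetical toy vector: four ordinary values and one extreme one.
x <- c(1, 2, 3, 4, 100)

qnt <- quantile(x, probs = c(.25, .75))  # Q1 = 2, Q3 = 4 (default type 7)
H <- 1.5 * IQR(x)                        # IQR = 2, so H = 3

lower <- qnt[1] - H                      # 2 - 3 = -1
upper <- qnt[2] + H                      # 4 + 3 = 7

# Values outside [lower, upper] are the ones remove_outliers() sets to NA.
x[x < lower | x > upper]
# [1] 100
```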

To see it in action:

set.seed(1)
x <- rnorm(100)
x <- c(-10, x, 10)
y <- remove_outliers(x)
## png()
par(mfrow = c(1, 2))
boxplot(x)
boxplot(y)
## dev.off()
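
To avoid overwriting the raw values, one option is to store the filtered copy in a new column next to the original. This is only a sketch: the data frame `dat` and the column names `value` and `value_clean` are made up for illustration, and the function is repeated so the block runs on its own.

```r
# Same rule as remove_outliers() above, repeated so this block is self-contained.
remove_outliers <- function(x, na.rm = TRUE, ...) {
  qnt <- quantile(x, probs = c(.25, .75), na.rm = na.rm, ...)
  H <- 1.5 * IQR(x, na.rm = na.rm)
  y <- x
  y[x < (qnt[1] - H)] <- NA
  y[x > (qnt[2] + H)] <- NA
  y
}

set.seed(1)
dat <- data.frame(value = c(-10, rnorm(100), 10))  # hypothetical data frame

# Keep the raw column untouched and put the filtered copy beside it.
dat$value_clean <- remove_outliers(dat$value)

# Flagged rows, with their original values still available for inspection.
dat[is.na(dat$value_clean), ]
```

Because `value` is never modified, the decision to drop anything can be reviewed or reversed later.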

And once again, you should never do this without a good reason. Outliers are just meant to be! =)

EDIT: I added na.rm = TRUE as default.

EDIT2: Removed quantile function, added subscripting, hence made the function faster! =)

R - how to remove outliers from dataset by two different groups

You could use anti_join() from dplyr after identifying the outliers. Note that in my df_outliers only IDs 1, 7 and 10 are flagged.

library(tidyverse)
library(rstatix)

df <- tibble(
                ID = c(1L,2L,3L,4L,5L,6L,7L,8L,
                       9L,10L,11L,12L,13L,14L,15L,16L,17L,18L,19L,
                       20L,21L,22L,23L,24L,25L,26L,27L,28L,29L),
         Treatment = c("A","A","A","A","A","A",
                       "A","A","A","A","A","A","A","A","A","A","A","A",
                       "A","A","B","B","B","B","B","B","B","B","B"),
              conc = c(40,40,40,40,40,40,20,20,
                       20,20,20,20,10,10,10,10,10,10,5,5,0.8,0.8,
                       0.8,0.8,0.8,0.8,0.6,0.6,0.6),
            relabs = c(1.0793923,0.6436631,0.5556844,
                       0.4834845,0.7224756,0.6804259,0.9958288,0.709936,
                       0.7028124,0.5016352,0.6860346,0.734197,0.8175491,
                       0.690091,0.5278228,0.7560026,0.8841343,0.6687616,
                       0.8563232,0.7419997,1.2049695,0.4969811,0.2835814,0.670025,
                       1.3126651,0.4510617,0.7629639,0.7513716,0.7956074)
)

df_outliers <- df %>% 
  group_by(Treatment, conc) %>% 
  identify_outliers("relabs") 

# A tibble: 3 x 6
  Treatment  conc    ID relabs is.outlier is.extreme
  <chr>     <dbl> <int>  <dbl> <lgl>      <lgl>     
1 A            20     7  0.996 TRUE       TRUE      
2 A            20    10  0.502 TRUE       TRUE      
3 A            40     1  1.08  TRUE       FALSE  

# without outliers
df %>% 
  anti_join(df_outliers, by = "ID") %>% 
  view()

# A tibble: 26 x 4
      ID Treatment  conc relabs
   <int> <chr>     <dbl>  <dbl>
 1     2 A            40  0.644
 2     3 A            40  0.556
 3     4 A            40  0.483
 4     5 A            40  0.722
 5     6 A            40  0.680
 6     8 A            20  0.710
 7     9 A            20  0.703
 8    11 A            20  0.686
 9    12 A            20  0.734
10    13 A            10  0.818
# … with 16 more rows
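
If you would rather not depend on rstatix, a per-group flag can be sketched with dplyr alone. The block below uses a small made-up data set (`toy`, `group`, `value` are hypothetical names) and the plain 1.5 * IQR rule, so it is not guaranteed to match identify_outliers() in every detail:

```r
library(dplyr)

# Hypothetical toy data: two groups, one obvious outlier in group "A".
toy <- tibble(
  ID    = 1:10,
  group = rep(c("A", "B"), each = 5),
  value = c(1, 2, 3, 4, 100, 10, 11, 12, 13, 14)
)

flagged <- toy %>%
  group_by(group) %>%
  mutate(
    # Fences are computed separately within each group.
    lower = quantile(value, 0.25, names = FALSE) - 1.5 * IQR(value),
    upper = quantile(value, 0.75, names = FALSE) + 1.5 * IQR(value),
    is_outlier = value < lower | value > upper
  ) %>%
  ungroup()

# Keep the flag column in `flagged`; filter only when you need the reduced set.
kept <- filter(flagged, !is_outlier)
```

Filtering on the flag avoids the join altogether, so it does not rely on ID being unique.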