Changing the Outlier Rule in a Boxplot

Method of Outlier Removal for Boxplots

tl;dr: outliers are points that lie more than approximately twice the interquartile range from the median (in the symmetric case). More precisely, they are points beyond a cutoff equal to the 'hinges' (approximately the 1st and 3rd quartiles) +/- 1.5 times the interquartile range.

R's boxplot() function does not actually remove outliers at all; all observations in the data set are represented in the plot (unless the outline argument is FALSE). The information on the calculation for which points are plotted as outliers (i.e., as individual points beyond the whiskers) is, implicitly, contained in the description of the range parameter:

range [default 1.5]: this determines how far the plot whiskers extend out from the
box. If ‘range’ is positive, the whiskers extend to the most
extreme data point which is no more than ‘range’ times the
interquartile range from the box. A value of zero causes the
whiskers to extend to the data extremes.
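To check which points a given range value would flag without drawing anything, boxplot.stats() can be called directly; its coef argument plays the role of range. The following is a minimal sketch on made-up numbers (the vector x and the variable names are illustrative only):

```r
# Made-up data: eight ordinary values and two extreme ones
x <- c(2, 3, 4, 5, 5, 6, 7, 8, 15, 30)

# coef = 1.5 corresponds to the default boxplot(range = 1.5)
default_stats <- boxplot.stats(x, coef = 1.5)
hinges <- default_stats$stats[c(2, 4)]   # lower and upper hinge
spread <- diff(hinges)                   # hinge-based interquartile range

# Cutoffs beyond which points are drawn individually
fences <- hinges + c(-1, 1) * 1.5 * spread
manual_out <- x[x < fences[1] | x > fences[2]]

print(fences)
print(default_stats$out)   # same points as manual_out

# A larger coefficient flags fewer points
wide_out <- boxplot.stats(x, coef = 3)$out
print(wide_out)
```

Here the cutoffs are -2 and 14, so 15 and 30 are flagged with the default coefficient, and only 30 remains flagged with coef = 3.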

This has to be deconstructed a little bit more: what does "from the box" mean? To figure this out, we need to look at the Details of ?boxplot.stats:

The two ‘hinges’ are versions of the first and third quartile,
i.e., close to ‘quantile(x, c(1,3)/4)’ [... see ?boxplot.stats for slightly more detail ...]

The reason for all the complexity/"approximately equal to the quartile" language is that the hinges are defined so that they fall on actual observations more often than the quartiles do. Per ?boxplot.stats, the quartiles equal observations only when n %% 4 == 1, whereas the hinges also do so when n %% 4 == 2 and otherwise lie midway between two observations. The whiskers, by contrast, always end at an observed data point.
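The difference between hinges and default quartiles is easy to see on a small even-sized sample. A brief sketch, assuming nothing beyond base R (fivenum() returns the hinges as its 2nd and 4th elements):

```r
x <- 1:10

hinges    <- fivenum(x)[c(2, 4)]                      # Tukey's hinges, as used by boxplot()
quartiles <- quantile(x, c(1, 3) / 4, names = FALSE)  # default (type 7) quartiles

print(hinges)      # 3 8
print(quartiles)   # 3.25 7.75
```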

Example:

set.seed(101)
z <- rnorm(100000)
boxplot(z)
hinges <- qnorm(c(0.25,0.75))
IQR <- diff(qnorm(c(0.25,0.75)))
abline(h=hinges,lty=2,col=4) ## hinges ~ quartiles
abline(h=hinges+c(-1,1)*1.5*IQR,col=2)
## in this case hinges = +/- IQR/2, so whiskers ~ +/- 2*IQR
abline(h=c(-1,1)*IQR*2,lty=2,col="purple")
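The same relationship can be checked numerically for the standard normal distribution, which also gives the share of points one should expect beyond the whiskers. This is a worked calculation rather than part of the original example; the variable names are arbitrary:

```r
q   <- qnorm(c(0.25, 0.75))       # theoretical quartiles, about -0.674 and 0.674
iqr <- diff(q)                    # about 1.349

upper_cutoff <- q[2] + 1.5 * iqr  # hinge + 1.5 * IQR
two_iqr      <- 2 * iqr           # the "twice the IQR from the median" shortcut

# Expected fraction of standard normal draws outside both cutoffs
frac_outside <- 2 * pnorm(-upper_cutoff)

print(c(upper_cutoff = upper_cutoff, two_iqr = two_iqr))
print(frac_outside)
```

Both cutoffs come out at about 2.698, and roughly 0.7% of normal draws fall outside them, so a sample of 100000 is expected to show several hundred individually plotted points.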


How to remove outliers in boxplot in R?

See ?boxplot for all the help you need.

 outline: if ‘outline’ is not true, the outliers are not drawn (as
points whereas S+ uses lines).

boxplot(x,horizontal=TRUE,axes=FALSE,outline=FALSE)

And for extending the range of the whiskers and suppressing the outliers inside this range:

   range: this determines how far the plot whiskers extend out from the
box. If ‘range’ is positive, the whiskers extend to the most
extreme data point which is no more than ‘range’ times the
interquartile range from the box. A value of zero causes the
whiskers to extend to the data extremes.

# change the value of range to change the whisker length
boxplot(x,horizontal=TRUE,axes=FALSE,range=2)
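To see what the range argument changes without opening a graphics device, boxplot() can be run with plot = FALSE and its return value inspected. A small sketch on made-up data (x here is illustrative, not the x from the question):

```r
x <- c(10, 12, 13, 14, 15, 16, 17, 18, 27, 40)

default_box <- boxplot(x, plot = FALSE)             # range = 1.5
wide_box    <- boxplot(x, range = 2, plot = FALSE)  # longer whiskers
full_box    <- boxplot(x, range = 0, plot = FALSE)  # whiskers reach the extremes

# Upper whisker end (5th row of $stats) and the points flagged as outliers
print(default_box$stats[5, 1]); print(default_box$out)
print(wide_box$stats[5, 1]);    print(wide_box$out)
print(full_box$stats[5, 1]);    print(full_box$out)
```

With these numbers the upper whisker moves from 18 to 27 when range goes from 1.5 to 2, and only 40 remains flagged; with range = 0 nothing is flagged. The outline argument, in contrast, only controls whether the flagged points are drawn.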

geom_boxplot outlier shape from Sample ID

I figured out a really ugly solution. I'm pretty sure there is a prettier way to do this but here is the full code:

First we create dummy data:

# start with a clean environment
rm(list = ls())
# create a function to load (and, if missing, install) all necessary libraries
install.load.package <- function(x) {
  if (!require(x, character.only = TRUE))
    install.packages(x)
  require(x, character.only = TRUE)
}
package_vec <- c("ggplot2",
                 "dplyr"
)
sapply(package_vec, install.load.package)

# now to the data
df <- data.frame()
set.seed(42)
os <- 0   # row offset; also shifts the values of each block of ten rows
sam <- 1  # running sample number
for (time in c('T0', 'T1')) {
  if (time == 'T1') {
    sam <- 1  # restart numbering so the same sample IDs recur at T1
  }
  for (group in c('A', 'B')) {
    for (pat in 1:10) {
      df[pat + os, 'Sample'] <- paste('P', pat, '_', sam, sep = '')
      df[pat + os, 'Time'] <- time
      df[pat + os, 'Group'] <- group
      df[pat + os, 'Value'] <- rnorm(1) + os
      # add outliers; they are the same in each group in this example,
      # but can differ in the real data set
      if (pat == 2 || pat == 9) {
        df[pat + os, 'Value'] <- df[pat + os, 'Value'] + 10
      }
      sam <- sam + 1
    }
    os <- os + 10
  }
}

Then we calculate the outliers as follows and create a new column that holds the ID of each outlier. If a point is not an outlier, an 'X' is inserted instead:

# calculate outliers
df <- df %>%
  group_by(Group, Time) %>%
  mutate(is_outlier = case_when(
    Value > quantile(Value)[4] + 1.5 * IQR(Value) ~ as.character(Sample),
    Value < quantile(Value)[2] - 1.5 * IQR(Value) ~ as.character(Sample),
    TRUE ~ as.character('X')
  ))
df$Group <- as.factor(df$Group)
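The quantile()-based rule above follows the default quartile definition, whereas base R's boxplot() works with hinges, so the two can disagree slightly for some group sizes. If the flags should match base boxplot(), a sketch along these lines could be used (the toy data frame and the helper flag_outliers are hypothetical, not part of the original answer):

```r
# Hypothetical toy data: two groups with one obvious outlier each
toy <- data.frame(
  Sample = paste0("S", 1:12),
  Group  = rep(c("A", "B"), each = 6),
  Value  = c(1, 2, 3, 4, 5, 40, 10, 11, 12, 13, 14, -30)
)

# TRUE for values that boxplot.stats() would report as outliers in their group
flag_outliers <- function(v) v %in% boxplot.stats(v)$out

# ave() applies the helper per group; it returns 0/1, hence as.logical()
toy$is_outlier <- as.logical(ave(toy$Value, toy$Group, FUN = flag_outliers))

print(toy[toy$is_outlier, ])
```

In this toy example the flagged rows are S6 (value 40) and S12 (value -30).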

Now we replace the sample ID with a number. The first outlier pair gets the number 1, the second gets 2, and so on. If there are more outliers than available geom_point shapes, the code has to be adapted. But let's just assume we don't have more than 25 outliers (R's numeric point shapes run from 0 to 25, and the counter here starts at 1).

for (group in levels(df$Group)) {
  count <- 1
  for (id in levels(as.factor(df$is_outlier[which(df$Group == group)]))) {
    if (id == 'X') {
      df[which(df$is_outlier == id), 'is_outlier'] <- as.character(NA)
    } else {
      df[which(df$is_outlier == id), 'is_outlier'] <- as.character(count)
      count <- count + 1
    }
  }
}

This overwrites the previously created column and introduces NAs for the 'X' values.
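The numbering step can also be expressed without loops. As a rough sketch, assuming a character vector in which non-outliers are already NA (outlier_ids is a made-up example), converting to a factor and taking its integer codes gives one number per distinct ID:

```r
outlier_ids <- c(NA, "P2_2", NA, "P9_9", "P2_2", NA, "P9_9")

# factor() leaves NA out of the levels, so NA stays NA and each ID gets a code
shape_code <- as.integer(factor(outlier_ids))
print(shape_code)   # NA 1 NA 2 1 NA 2
```

The codes follow the sorted order of the IDs rather than their order of appearance; applying this within each group would mimic the per-group counter used in the loop.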

Now we can plot the data:

ggplot(df, aes(x = Time,
               y = Value,
               label = Time)) +
  geom_boxplot(outlier.shape = NA) +
  geom_point(data = df,
             shape = as.numeric(df$is_outlier),
             color = 'red') +
  facet_grid(~factor(Group),
             switch = 'x',
             scales = 'free_y')

This results in this plot:

[Image: outlier with identity shape]

Now we can see whether an outlier stays an outlier from T0 to T1. Be aware that Group B reuses the same shapes as Group A, although these are entirely different samples. The code before the plotting step would have to be adapted to account for this, but that would potentially leave fewer shapes available.
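One possibly tidier variant is to map the outlier ID to the shape aesthetic and let ggplot2 pick the shapes and build a legend. This is only a sketch on hypothetical toy data (toy and outlier_id are made-up names), not a drop-in replacement for the code above:

```r
library(ggplot2)

# Hypothetical toy data: outlier_id holds the sample ID for flagged points, NA otherwise
toy <- data.frame(
  Time       = rep(c("T0", "T1"), each = 6),
  Value      = c(1, 2, 3, 4, 5, 40, 2, 3, 4, 5, 6, 45),
  outlier_id = c(rep(NA, 5), "P6", rep(NA, 5), "P6")
)

p <- ggplot(toy, aes(x = Time, y = Value)) +
  geom_boxplot(outlier.shape = NA) +                  # hide the default outlier points
  geom_point(data = subset(toy, !is.na(outlier_id)),  # draw only the flagged points
             aes(shape = outlier_id), colour = "red") +
  labs(shape = "Outlier sample")

print(p)
```

Because the shape comes from the ID itself, the same sample keeps the same symbol at both time points. ggplot2's default shape palette covers only six values, so more distinct IDs would need scale_shape_manual().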

If one of you has a smoother and more elegant solution, I'd be happy to learn.

Best TMC

Changing whisker end in geom_boxplot

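Adapted from the answer to "Changing whisker definition in geom_boxplot". The function f returns the five values a boxplot needs, with ymin and ymax set to the minimum and maximum, so that the whiskers span the full range of the data: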

# ylim, cex.axis and cex.lab are base-graphics arguments that ggplot() ignores,
# so they are omitted here
p <- ggplot(data = concentration, aes(factor(location), formaldehyde))

# five-number summary with the whiskers at the minimum and maximum
f <- function(x) {
  r <- quantile(x, probs = c(0, 0.25, 0.5, 0.75, 1))
  names(r) <- c("ymin", "lower", "middle", "upper", "ymax")
  r
}

p + stat_summary(fun.data = f, aes(fill = Condition), geom = "boxplot", position = "dodge")
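The same idea extends to other whisker ends by changing the probabilities. A hedged sketch, where whisker_stats is a hypothetical helper (not a ggplot2 function) that puts the whiskers at the 5th and 95th percentiles by default:

```r
# Hypothetical helper: returns the five values geom = "boxplot" expects
whisker_stats <- function(x, probs = c(0.05, 0.25, 0.5, 0.75, 0.95)) {
  r <- quantile(x, probs = probs, names = FALSE)
  names(r) <- c("ymin", "lower", "middle", "upper", "ymax")
  r
}

# Quick check on the integers 0..100, where the percentiles are easy to read off
print(whisker_stats(0:100))
```

It can be passed to stat_summary(fun.data = whisker_stats, geom = "boxplot") in the same way as f.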


Manual outlier plotting in grouped boxplot with jittered points

An answer was provided in a very helpful comment by @Cucumiiis on Twitter, which I want to share here.

The solution is to create the boxplots as you normally would, and then use a second data set where the outliers are removed for the points. The code then looks like this:

# flag points outside the boxplot cutoffs (quartiles -/+ 1.5 * IQR) within each group
without_outliers <- example_data %>%
  group_by(cut, clarity) %>%
  mutate(outlier = price > quantile(price, 0.75) + IQR(price) * 1.5 |
                   price < quantile(price, 0.25) - IQR(price) * 1.5) %>%
  filter(!outlier)

example_data %>%
  ggplot(aes(y = price, x = cut, colour = clarity)) +
  geom_point(
    data = without_outliers,
    position = position_jitterdodge()
  ) +
  geom_boxplot(fill = NA, outlier.colour = "red") +
  theme_classic() +
  theme(legend.position = "top") +
  scale_shape_manual(values = c(NA, 25))
