Method of Outlier Removal for Boxplots
tl;dr: outliers are points that lie more than approximately twice the interquartile range away from the median (in a symmetric case). More precisely, they are points beyond a cutoff equal to the 'hinges' (approx. 1st and 3rd quartiles) +/- 1.5 times the interquartile range.
R's boxplot() function does not actually remove outliers at all; all observations in the data set are represented in the plot (unless the outline argument is FALSE). The information on the calculation for which points are plotted as outliers (i.e., as individual points beyond the whiskers) is implicitly contained in the description of the range parameter:
range [default 1.5]: this determines how far the plot whiskers extend out from the
box. If ‘range’ is positive, the whiskers extend to the most
extreme data point which is no more than ‘range’ times the
interquartile range from the box. A value of zero causes the
whiskers to extend to the data extremes.
This has to be deconstructed a little bit more: what does "from the box" mean? To figure this out, we need to look at the Details section of ?boxplot.stats:
The two ‘hinges’ are versions of the first and third quartile,
i.e., close to ‘quantile(x, c(1,3)/4)’ [... see ?boxplot.stats
for slightly more detail ...]
The reason for the "approximately equal to the quartile" language is that the whiskers always end at actual observations in the data set, and the hinges are defined so that they fall on observed values more often than the quartiles do (the quartiles returned by quantile() often lie between observed points; according to ?boxplot.stats, hinges and quartiles coincide when the number of observations is odd).
Example:
set.seed(101)
z <- rnorm(100000)
boxplot(z)
hinges <- qnorm(c(0.25,0.75))
IQR <- diff(qnorm(c(0.25,0.75)))
abline(h=hinges,lty=2,col=4) ## hinges ~ quartiles
abline(h=hinges+c(-1,1)*1.5*IQR,col=2)
## in this case hinges = +/- IQR/2, so whiskers ~ +/- 2*IQR
abline(h=c(-1,1)*IQR*2,lty=2,col="purple")
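The cutoffs can also be checked numerically without plotting. The following is a minimal sketch on a small made-up vector (x is illustrative, not from the original answer); it uses boxplot.stats(), which boxplot() calls internally, and recomputes the fences by hand from the hinges.

```r
# Small illustrative vector with one obvious extreme value
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 30)

s <- boxplot.stats(x, coef = 1.5)   # coef plays the role of boxplot()'s range
hinges <- s$stats[c(2, 4)]          # lower and upper hinge
fences <- hinges + c(-1, 1) * 1.5 * diff(hinges)

# Points outside the fences are the ones drawn individually
manual_out <- x[x < fences[1] | x > fences[2]]

s$stats      # lower whisker, lower hinge, median, upper hinge, upper whisker
s$out        # values flagged as outliers
manual_out   # same values, computed by hand
```

Note that the whisker ends in s$stats are actual data values (the most extreme observations inside the fences), not the fences themselves.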
How to remove outliers in boxplot in R?
See ?boxplot for all the help you need.
outline: if ‘outline’ is not true, the outliers are not drawn (as
points whereas S+ uses lines).
boxplot(x,horizontal=TRUE,axes=FALSE,outline=FALSE)
And for extending the range of the whiskers and suppressing the outliers inside this range:
range: this determines how far the plot whiskers extend out from the
box. If ‘range’ is positive, the whiskers extend to the most
extreme data point which is no more than ‘range’ times the
interquartile range from the box. A value of zero causes the
whiskers to extend to the data extremes.
# change the value of range to change the whisker length
boxplot(x,horizontal=TRUE,axes=FALSE,range=2)
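Neither option changes the data; they only change what is drawn. A minimal sketch, using a made-up vector x and plot = FALSE so that only the computed statistics are returned, shows the effect of range and one possible way to actually drop the flagged points before plotting (the filtering step is an addition, not part of the original answer):

```r
x <- c(1:10, 30)

default <- boxplot(x, plot = FALSE)             # range = 1.5
wide    <- boxplot(x, range = 5, plot = FALSE)  # fences move further out
full    <- boxplot(x, range = 0, plot = FALSE)  # whiskers at the data extremes

default$out           # 30 is beyond the upper fence
wide$out              # empty: 30 is now inside the fence
full$stats[c(1, 5)]   # whisker ends equal min(x) and max(x)

# To really remove the points, filter the data yourself
x_trimmed <- x[!x %in% boxplot.stats(x)$out]
```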
geom_boxplot outlier shape from Sample ID
I figured out a really ugly solution. I'm pretty sure there is a prettier way to do this but here is the full code:
First we create dummy data:
# start with a clean environment
rm(list=ls())
# create a function to load or install all necessary libraries
install.load.package <- function(x) {
  if (!require(x, character.only = TRUE))
    install.packages(x)
  require(x, character.only = TRUE)
}
package_vec <- c("ggplot2",
                 "dplyr"
)
sapply(package_vec, install.load.package)
# now to the data
df <- data.frame()
set.seed(42)
os <- 0
sam <- 1
for (time in as.factor(c('T0', 'T1'))) {
  if (time == 'T1') {
    sam <- 1
  }
  for (group in as.factor(c('A','B'))) {
    for (pat in 1:10) {
      df[pat + os, 'Sample'] <- paste('P', pat, '_', sam, sep = '')
      df[pat + os, 'Time'] <- time
      df[pat + os, 'Group'] <- group
      df[pat + os, 'Value'] <- rnorm(1) + os
      # add outlier, they are the same in each group in this example,
      # but can differ in the real data set
      if (pat == 2 | pat == 9) {
        print(pat)
        df[pat + os, 'Value'] <- df[pat + os, 'Value'] + 10
      }
      sam <- sam + 1
    }
    os <- os + 10
  }
}
Then we calculate the outliers as follows and create a new column holding the Sample ID of each outlier. If a point is not an outlier, an 'X' is inserted instead:
# calculate outliers
df <- df %>%
  group_by(Group, Time) %>%
  mutate(is_outlier = case_when(
    Value > quantile(Value)[4] + 1.5 * IQR(Value) ~ as.character(Sample),
    Value < quantile(Value)[2] - 1.5 * IQR(Value) ~ as.character(Sample),
    TRUE ~ "X"))
df$Group <- as.factor(df$Group)
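The two case_when() conditions encode the usual 1.5 × IQR rule. As a sketch, the same rule can be factored into a small helper and applied per group in base R; the helper name is_outlier_value and the toy data are hypothetical and only serve to illustrate the logic:

```r
# Hypothetical helper: TRUE where a value lies beyond quartile +/- 1.5 * IQR
is_outlier_value <- function(v) {
  q <- quantile(v, c(0.25, 0.75), names = FALSE)
  v < q[1] - 1.5 * IQR(v) | v > q[2] + 1.5 * IQR(v)
}

# Toy data: group A contains one extreme value, group B none
toy <- data.frame(
  Group = rep(c("A", "B"), each = 11),
  Value = c(1:10, 30, 1:11)
)

# ave() applies the helper within each group; it returns 0/1, hence as.logical()
toy$flag <- as.logical(ave(toy$Value, toy$Group, FUN = is_outlier_value))
toy[toy$flag, ]
```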
Now we replace the Sample ID with a number. The first outlier pair(s) gets the number 1, the second gets a 2, and so on. If there are more outliers than available geom_point shapes, the code has to be adapted, but let's assume we don't have more than 25 outliers per group (the numeric point shapes run from 0 to 25, and the numbering here starts at 1).
for (group in levels(df$Group)) {
  count <- 1
  for (id in levels(as.factor(df$is_outlier[which(df$Group == group)]))) {
    if (id == 'X') {
      df[which(df$is_outlier == id), 'is_outlier'] <- as.character(NA)
    } else {
      df[which(df$is_outlier == id), 'is_outlier'] <- as.character(count)
      count <- count + 1
    }
  }
}
This overwrites the previously created column, introducing NAs for the 'X' values.
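The relabelling can also be written without explicit loops. The following base R sketch is one possible alternative, not the original author's code; the vectors ids and grp are made-up stand-ins for df$is_outlier and df$Group, and number_outliers is a hypothetical helper. Unlike the loop, which numbers the IDs in sorted order, it numbers them in order of first appearance within each group.

```r
# Made-up stand-ins for df$is_outlier (before relabelling) and df$Group
ids <- c("X", "P2_2", "X", "P9_9", "P2_2", "X", "P2_12", "P9_19")
grp <- c("A", "A",    "A", "A",    "A",    "B", "B",     "B")

# Hypothetical helper: NA for non-outliers, running number for each distinct ID
number_outliers <- function(v) {
  res <- rep(NA_character_, length(v))
  is_out <- v != "X"
  # The same ID always gets the same number within a group
  res[is_out] <- as.character(match(v[is_out], unique(v[is_out])))
  res
}

# ave() applies the helper separately within each group
shape_id <- as.integer(ave(ids, grp, FUN = number_outliers))
shape_id
```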
Now we can plot the data:
ggplot(df, aes(x = Time,
               y = Value,
               label = Time)) +
  geom_boxplot(outlier.shape = NA) +
  geom_point(data = df,
             shape = as.numeric(df$is_outlier),
             color = 'red') +
  facet_grid(~factor(Group),
             switch = 'x',
             scales = 'free_y')
This results in this plot:
Now we can see whether an outlier stays an outlier from T0 to T1. Be aware that the same shapes are reused in Group B, although these are entirely different samples. One has to adapt the code above the plotting code to account for this, but then potentially fewer shapes would be available per group.
If one of you has a smoother and more elegant solution, I'd be happy to learn.
Best TMC
Changing whisker end in geom_boxplot
Adapted from the answer Changing whisker definition in geom_boxplot
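# ylim, cex.axis and cex.lab are base-graphics arguments, not ggplot() arguments,
# so they are dropped here; axis limits can be set with coord_cartesian(ylim = c(0, 0.15))
p <- ggplot(data = concentration, aes(factor(location), formaldehyde))
f <- function(x) {
  # whiskers (ymin, ymax) run to the minimum and maximum of the data
  r <- quantile(x, probs = c(0, 0.25, 0.5, 0.75, 1))
  names(r) <- c("ymin", "lower", "middle", "upper", "ymax")
  r
}
p + stat_summary(fun.data = f, aes(fill = Condition), geom = "boxplot", position = "dodge")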
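To see what stat_summary() receives, the summary function can be called directly on a vector. The sketch below repeats f and adds a hypothetical variant, f_p05_p95, that ends the whiskers at the 5th and 95th percentiles instead of the extremes; the variant and the vector x are illustrative and not part of the original answer.

```r
f <- function(x) {
  r <- quantile(x, probs = c(0, 0.25, 0.5, 0.75, 1))
  names(r) <- c("ymin", "lower", "middle", "upper", "ymax")
  r
}

# Hypothetical variant: whiskers at the 5th and 95th percentiles
f_p05_p95 <- function(x) {
  r <- quantile(x, probs = c(0.05, 0.25, 0.5, 0.75, 0.95))
  names(r) <- c("ymin", "lower", "middle", "upper", "ymax")
  r
}

x <- c(1:10, 30)
f(x)          # ymin and ymax are min(x) and max(x)
f_p05_p95(x)  # whisker ends move inside the data range
```

Either function can be passed as fun.data, since both return the five named values that geom = "boxplot" expects.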
Manual outlier plotting in grouped boxplot with jittered points
An answer was provided in a very helpful comment by @Cucumiiis on Twitter, which I want to share here.
The solution is to create the boxplots as you normally would, and then use a second data set where the outliers are removed for the points. The code then looks like this:
without_outliers <- example_data %>%
  group_by(cut, clarity) %>%
  # flag points beyond the quartiles +/- 1.5 * IQR, the rule the boxplot uses
  mutate(outlier = price > quantile(price, 0.75) + IQR(price) * 1.5 |
                   price < quantile(price, 0.25) - IQR(price) * 1.5) %>%
  filter(!outlier)
example_data %>%
  ggplot(aes(y = price, x = cut, colour = clarity)) +
  geom_point(
    data = without_outliers,
    position = position_jitterdodge()
  ) +
  geom_boxplot(fill = NA, outlier.colour = "red") +
  theme_classic() +
  theme(legend.position = "top") +
  scale_shape_manual(values = c(NA, 25))
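One detail worth checking is where the cutoff is measured from. The 1.5 × IQR rule is applied from the quartiles, so a cutoff measured from the median flags more points than the boxplot draws as outliers, and those extra points would disappear from the plot entirely. A small base R sketch with a made-up price vector (illustrative only) shows the difference:

```r
price <- c(1:10, 16, 30)

q3  <- quantile(price, 0.75, names = FALSE)
iqr <- IQR(price)

by_median   <- price > median(price) + 1.5 * iqr  # cutoff measured from the median
by_quartile <- price > q3 + 1.5 * iqr             # cutoff measured from the upper quartile

price[by_median]    # flags 16 as well as 30
price[by_quartile]  # flags only 30
```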