Labeling Outliers of Boxplots in R

Labeling outliers on boxplot in R

In the example given, the result is a bit boring because all the outliers come from the same row, but here is the code:

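bxpdat <- boxplot(vv)
# For each outlier, find the row of vv that holds that value in the outlier's own column
out_rows <- mapply(function(g, o) which(vv[, g] == o)[1], bxpdat$group, bxpdat$out)
text(bxpdat$group,           # the x locations
     bxpdat$out,             # the y values
     rownames(vv)[out_rows], # the labels
     pos = 4)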

This picks the rownames that have values equal to the "out" list (i.e., the outliers) in the result of boxplot(). boxplot() calls boxplot.stats() for each group and returns the collected values invisibly. Take a look at:

 str(bxpdat)
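
As a rough illustration of what str(bxpdat) reveals, the sketch below builds a small made-up matrix (the name m and its values are assumptions, not the data from the question) and inspects the out and group components that the labeling code relies on.

```r
# Hypothetical data: 3 columns, with one planted extreme value in column "b"
set.seed(1)
m <- matrix(rnorm(30), nrow = 10,
            dimnames = list(paste0("row", 1:10), c("a", "b", "c")))
m["row4", "b"] <- 25

res <- boxplot(m, plot = FALSE)   # plot = FALSE: compute the statistics only
res$out     # y values of the outliers
res$group   # index of the box (column) each outlier belongs to

# Look up the row name of each outlier within its own column
lab <- mapply(function(g, o) rownames(m)[which(m[, g] == o)[1]],
              res$group, res$out)
lab
```

Because out and group have the same length and order, they can be used directly as the y and x positions in text().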

Show outlier labels ggplot and geom_boxplot r for multiple variables?

Here is what I tried. I simplified your code a bit to focus on the point you are asking about: you need the label information for the outliers. You can identify the outliers with the borrowed function below, then store the car names of those rows in a new column called outlier. That column is what geom_text_repel() from the ggrepel package uses for the labels.

library(tidyverse)
library(ggrepel)

z_mtcars <- data.frame(scale(mtcars)) # standardize all 11 numeric columns
z_mtcars$type <- rownames(mtcars)

I borrowed this function from this question. Credit goes to JasonAizkalns.

is_outlier <- function(x) {
  return(x < quantile(x, 0.25) - 1.5 * IQR(x) | x > quantile(x, 0.75) + 1.5 * IQR(x))
}

z_mtcars %>%
  pivot_longer(cols = -type, names_to = "variable", values_to = "value") %>%
  group_by(variable) %>%
  mutate(outlier = if_else(is_outlier(value), type, NA_character_)) %>%
  ggplot(aes(x = variable, y = value, color = variable)) +
  geom_boxplot() +
  geom_text_repel(aes(label = outlier), na.rm = TRUE, show.legend = FALSE)
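
To get a feel for what is_outlier() returns, here is a minimal check on a made-up vector (the values in x are assumptions chosen for illustration). The function is repeated so the snippet runs on its own.

```r
# Same rule as above: TRUE for values beyond 1.5 * IQR from the quartiles
is_outlier <- function(x) {
  x < quantile(x, 0.25) - 1.5 * IQR(x) | x > quantile(x, 0.75) + 1.5 * IQR(x)
}

x <- c(1, 2, 3, 4, 100)   # made-up values; 100 is far from the rest
flags <- is_outlier(x)    # quartiles are 2 and 4, so the fences are -1 and 7
x[flags]                  # only 100 lies outside the fences
```

Note that base boxplot() derives its fences from the hinges returned by fivenum(), which can differ slightly from the quantile() defaults used here.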


How to label ggplot2 boxplot outliers with a third variable?

Here's a solution to label only the outliers in your data:

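library(tidyverse)

# Collect the outliers of C within each level of B
outlier <- dff %>%
  group_by(B) %>%
  summarise(outlier = list(boxplot.stats(C)$out))

ggplot(dff, aes(x = B, y = C, fill = B)) +
  geom_boxplot() +
  # Label a point with A only if its C value appears among the collected outliers
  geom_text(aes(label = if_else(C %in% unlist(outlier$outlier), as.character(A), "")),
            position = position_nudge(x = -0.1))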

which produces this plot:

[Image: the resulting boxplot with labeled outliers]
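
Because %in% compares against the pooled outliers of all groups, a value that is an outlier in one group would also be labeled in another group where it is unremarkable. One possible variation, sketched here on mtcars rather than the dff from the question (the column names car and label are made up), computes the label inside each group:

```r
library(dplyr)

labelled <- mtcars %>%
  mutate(car = rownames(mtcars)) %>%
  group_by(cyl) %>%
  # boxplot.stats() runs once per cyl group because of group_by()
  mutate(label = if_else(mpg %in% boxplot.stats(mpg)$out, car, NA_character_)) %>%
  ungroup()

# Rows that would receive a label
filter(labelled, !is.na(label))
```

The label column could then be mapped with aes(label = label) in geom_text(), with na.rm = TRUE to skip the unlabeled rows quietly.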

Labeling Outliers of Boxplots in a loop R

The issue is that the column names, which are passed as strings, are not evaluated as columns. One option is to pass the strings directly to across(); another is to convert them to symbols and evaluate them (!!). As the former is easier, it is shown first:

library(dplyr) # >= 1.0.0
library(stringr)
# is_outlier() is assumed to be defined already (e.g., the 1.5 * IQR helper)
for (i in seq_along(ens_id)) {

  dat <- test %>%
    tibble::rownames_to_column(var = "outlier") %>%
    group_by(cond) %>%
    mutate(across(all_of(ens_id[i]), ~ replace(., !is_outlier(.), NA), .names = "{col}_is_outlier")) %>%
    # or use mutate_at (if the dplyr version is older than 1.0.0)
    # mutate_at(vars(ens_id[i]), list(is_outlier = ~ replace(., !is_outlier(.), NA))) %>%
    rename_at(vars(ends_with("is_outlier")), ~ str_remove(., str_c(ens_id[i], "_")))

  # Blank out the row name wherever the value is not an outlier
  dat$outlier[which(is.na(dat$is_outlier))] <- NA_character_
  print(head(dat))

}

Or, as mentioned above, the second option is to evaluate (!!) after converting to a symbol:

for (i in seq_along(ens_id)) {
  dat <- test %>%
    tibble::rownames_to_column(var = "outlier") %>%
    group_by(cond) %>%
    mutate(is_outlier = replace(!!sym(ens_id[i]),
                                !is_outlier(!!sym(ens_id[i])), NA))
  # Blank out the row name wherever the value is not an outlier
  dat$outlier[which(is.na(dat$is_outlier))] <- NA_character_
  print(head(dat))
}
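
To see the difference between a column name held as a string and the column itself, here is a small sketch on mtcars (the variable col and the result names are made up for illustration). The bare string is treated as a constant, whereas !!sym() or the .data pronoun look the column up.

```r
library(dplyr)

col <- "mpg"

# The string itself is just a constant: every row gets the text "mpg"
as_constant <- mtcars %>% mutate(value = col)

# Converting to a symbol and unquoting refers to the column
via_sym <- mtcars %>% summarise(avg = mean(!!sym(col)))

# The .data pronoun is another way to look a column up by name
via_data <- mtcars %>% summarise(avg = mean(.data[[col]]))
```

Both via_sym and via_data hold the mean of the mpg column, while as_constant only repeats the text "mpg".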

Using a reproducible example

library(ggplot2)

ens_id <- c("mpg", "wt")
test <- mtcars
test$mpg[10] <- 9800
test$wt[22] <- 4895
plist <- vector("list", length(ens_id))
for (i in seq_along(ens_id)) {

  dat <- test %>%
    tibble::rownames_to_column(var = "outlier") %>%
    group_by(gear) %>%
    mutate(across(all_of(ens_id[i]), ~ replace(., !is_outlier(.), NA), .names = "{col}_is_outlier")) %>%
    # or use mutate_at (if the dplyr version is older than 1.0.0)
    # mutate_at(vars(ens_id[i]), list(is_outlier = ~ replace(., !is_outlier(.), NA))) %>%
    rename_at(vars(ends_with("is_outlier")), ~ str_remove(., str_c(ens_id[i], "_")))

  dat$outlier[which(is.na(dat$is_outlier))] <- NA_character_
  # !!sym() is resolved when aes() is called, so each stored plot keeps its own column
  plist[[i]] <- ggplot(dat, aes(y = !!sym(ens_id[i]), x = gear, group = gear)) +
    geom_boxplot() +
    ylab(ens_id[i]) +
    geom_text(aes(label = outlier), na.rm = TRUE, nudge_x = 0.15)

}

plist[[1]]
plist[[2]]

Labelling outliers with ggplot

If you want the IQR to be calculated by country, you need to group the data. You could probably do it globally (i.e., before you send the data to ggplot) or locally in the layer.

library(dplyr)
library(ggplot2)

ggplot(df, aes(x = as.factor(WAVE), y = PERCENT, fill = COUNTRY)) +
  geom_boxplot(alpha = 0.3) +
  geom_point(aes(color = AGE_GROUP, group = COUNTRY), position = position_dodge(width = 0.75)) +
  geom_text(aes(group = COUNTRY,
                label = ifelse(!between(PERCENT, -1.3 * IQR(PERCENT), 1.3 * IQR(PERCENT)),
                               paste(" ", COUNTRY, ",", AGE_GROUP, ",", round(PERCENT, 1), "%, n =", round(N, 0)),
                               "")),
            position = position_dodge(width = 0.75),
            hjust = "left", size = 3)
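
One way to do the grouping "globally" is to flag outliers per group with dplyr before the data reaches ggplot. The sketch below uses mtcars in place of the df from the question and assumes the usual 1.5 * IQR fences around the quartiles; the column names car, lower, upper and label are made up.

```r
library(dplyr)

flagged <- mtcars %>%
  mutate(car = rownames(mtcars)) %>%
  group_by(gear) %>%
  mutate(
    lower = quantile(hp, 0.25) - 1.5 * IQR(hp),   # per-gear lower fence
    upper = quantile(hp, 0.75) + 1.5 * IQR(hp),   # per-gear upper fence
    label = if_else(hp < lower | hp > upper, car, "")
  ) %>%
  ungroup()
```

The label column could then be passed to geom_text(aes(label = label)), so that no IQR() call is needed inside aes().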

Boxplot outlier labeling in R

I took a look at this with debug(boxplot.with.outlier.label), and ... it turns out there's a bug in the function.

The error occurs on line 125, where the data.frame DATA is constructed from x, y and label_name.

By that point x and y have been reordered, while lab_y hasn't been. When the supplied value of x (your x1) isn't itself already in order, you'll get the kind of jumbling you experienced.

As an immediate fix, you can pre-order the x values like this (or do something more elegant):

df <- data.frame(y, x1, lab_y, stringsAsFactors=FALSE)
df <- df[order(df$x1), ]
# Needed since lab_y is not searched for in data (though it probably should be)
lab_y <- df$lab_y

boxplot.with.outlier.label(y~x1, lab_y, data=df)

Boxplot produced by procedure described above


