Labeling outliers on boxplot in R
The example given is a bit boring because all the outliers come from the same row, but here is the code:
bxpdat <- boxplot(vv)
text(bxpdat$group,  # the x locations
     bxpdat$out,    # the y values
     rownames(vv)[which(vv == bxpdat$out, arr.ind = TRUE)[, 1]],  # the labels
     pos = 4)
This picks the row names whose values equal the "out" list (i.e., the outliers) in the result of boxplot. boxplot calls boxplot.stats and returns its values. Take a look at:
str(bxpdat)
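The comparison vv == bxpdat$out relies on recycling, so it may not line up when the outliers come from several rows or columns. Here is a minimal sketch that looks up each outlier separately. It assumes vv is a numeric data frame with row names; a subset of mtcars stands in here because the asker's data is not shown.

```r
# Hypothetical input: a numeric data frame with row names (not the asker's vv).
vv <- mtcars[, c("mpg", "hp", "qsec", "wt")]

bxpdat <- boxplot(vv)

# For each outlier, search its own column (given by bxpdat$group) for the row
# holding that value; [1] keeps the first match if the value is tied.
out_labels <- mapply(
  function(g, v) rownames(vv)[which(vv[[g]] == v)[1]],
  bxpdat$group, bxpdat$out
)

text(bxpdat$group, bxpdat$out, out_labels, pos = 4)
```

Matching on exact equality should be safe here because bxpdat$out holds values copied from the data rather than recomputed.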
Show outlier labels ggplot and geom_boxplot r for multiple variables?
Here is what I tried. I simplified your code a bit to highlight the point you are asking about. You need label information for the outliers. You can identify them with the borrowed function below and, once they are identified, add the car names in a new column called outlier. You then use this column in geom_text_repel() from the ggrepel package.
library(tidyverse)
library(ggrepel)
z_mtcars <- data.frame(scale(mtcars[-12]))
z_mtcars$type <- rownames(mtcars)
I borrowed this function from this question. Credit goes to JasonAizkalns.
is_outlier <- function(x) {
  return(x < quantile(x, 0.25) - 1.5 * IQR(x) | x > quantile(x, 0.75) + 1.5 * IQR(x))
}
z_mtcars %>%
  pivot_longer(names_to = "variable", values_to = "value", -type) %>%
  group_by(variable) %>%
  mutate(outlier = if_else(is_outlier(value), type, NA_character_)) %>%
  ggplot(aes(x = variable, y = value, color = variable)) +
  geom_boxplot() +
  geom_text_repel(aes(label = outlier), na.rm = TRUE, show.legend = FALSE)
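To see what is_outlier flags before using it in the pipeline, it can be run on a small vector. The values below are made up for illustration; the function is the same 1.5 * IQR rule as above.

```r
# Same rule as the borrowed function: outside 1.5 * IQR beyond the quartiles.
is_outlier <- function(x) {
  x < quantile(x, 0.25) - 1.5 * IQR(x) | x > quantile(x, 0.75) + 1.5 * IQR(x)
}

# Hypothetical values; 40 sits far from the rest.
x <- c(10, 12, 11, 13, 12, 11, 40)

flags <- is_outlier(x)  # logical vector, one entry per element of x
x[flags]                # the values that would be labelled
```

Because the function returns a logical vector of the same length as its input, it can be dropped into mutate() after group_by() to flag outliers per group.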
How to label ggplot2 boxplot outliers with a third variable?
Here's a solution to label only the outliers in your data:
library(tidyverse)
outlier <- dff %>%
  group_by(B) %>%
  summarise(outlier = list(boxplot.stats(C)$out))
ggplot(dff, aes(x = B, y = C, fill = B)) +
  geom_boxplot() +
  geom_text(aes(label = if_else(C %in% unlist(outlier$outlier), as.character(A), "")),
            position = position_nudge(x = -.1))
This labels each outlier with its value of A, nudged slightly to the left of the box.
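One thing to watch: unlist(outlier$outlier) pools the outliers of all groups, so a value that is an outlier in one group would also be labelled in another group where it is ordinary. A possible variation computes the label inside each group instead. The data frame below is hypothetical and only mimics the column names A, B and C used above.

```r
library(dplyr)
library(ggplot2)

# Hypothetical stand-in for dff: A = label, B = group, C = value.
# The value 20 is an outlier in g1 but a typical value in g2.
dff <- data.frame(
  A = paste0("id", 1:12),
  B = rep(c("g1", "g2"), each = 6),
  C = c(1, 2, 3, 2, 3, 20, 20, 21, 19, 20, 21, 22)
)

# Build the label per group, so each value is only compared with the
# outliers of its own group.
labelled <- dff %>%
  group_by(B) %>%
  mutate(lab = if_else(C %in% boxplot.stats(C)$out, as.character(A), "")) %>%
  ungroup()

p <- ggplot(labelled, aes(x = B, y = C, fill = B)) +
  geom_boxplot() +
  geom_text(aes(label = lab), position = position_nudge(x = -0.1))
```

Printing p draws the plot; only the g1 point at 20 should carry a label.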
Labeling Outliers of Boxplots in a loop R
The issue is that the column names, which are stored as strings, are not evaluated. One option is to pass the strings directly to across; another is to convert them to symbols with sym and evaluate them with !!. As the former is easier, it is shown first.
library(dplyr) # 1.0.0
library(stringr)
for (i in seq_along(ens_id)) {
  dat <- test %>%
    tibble::rownames_to_column(var = "outlier") %>%
    group_by(cond) %>%
    # all_of() makes it explicit that ens_id[i] is a character vector of column names
    mutate(across(all_of(ens_id[i]), ~ replace(., !is_outlier(.), NA), .names = "{col}_is_outlier")) %>%
    # or use mutate_at (if the dplyr version is less than 1.0.0)
    # mutate_at(vars(ens_id[i]), list(is_outlier = ~ replace(., !is_outlier(.), NA))) %>%
    rename_at(vars(ends_with('is_outlier')), ~ str_remove(., str_c(ens_id[i], "_")))
  # keep the row name only where the value was flagged as an outlier
  dat$outlier[which(is.na(dat$is_outlier))] <- NA_character_
  print(head(dat))
}
Or, as mentioned above, the second option is to convert the string to a symbol with sym and evaluate it with !!:
for (i in seq_along(ens_id)) {
  dat <- test %>%
    tibble::rownames_to_column(var = "outlier") %>%
    group_by(cond) %>%
    mutate(is_outlier = replace(!!sym(ens_id[i]),
                                !is_outlier(!!sym(ens_id[i])), NA))
  dat$outlier[which(is.na(dat$is_outlier))] <- as.numeric(NA)
  print(head(dat))
}
Using a reproducible example
library(ggplot2)  # needed for the plots below
ens_id <- c("mpg", "wt")
test <- mtcars
test$mpg[10] <- 9800
test$wt[22] <- 4895
plist <- vector('list', length(ens_id))
for (i in seq_along(ens_id)) {
  dat <- test %>%
    tibble::rownames_to_column(var = "outlier") %>%
    group_by(gear) %>%
    mutate(across(ens_id[i], ~ replace(., !is_outlier(.), NA), .names = "{col}_is_outlier")) %>%
    # or use mutate_at (if the dplyr version is less than 1.0.0)
    # mutate_at(vars(ens_id[i]), list(is_outlier = ~ replace(., !is_outlier(.), NA))) %>%
    rename_at(vars(ends_with('is_outlier')), ~ str_remove(., str_c(ens_id[i], "_")))
  dat$outlier[which(is.na(dat$is_outlier))] <- as.numeric(NA)
  plist[[i]] <- ggplot(dat, aes_string(y = ens_id[i], x = "gear", group = "gear")) +
    geom_boxplot() +
    ylab(ens_id[i]) +
    geom_text(aes(label = outlier), na.rm = TRUE, nudge_x = 0.15)
}
plist[[1]]
plist[[2]]
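A third way to refer to a column by its string name, not used in the answer above, is dplyr's .data pronoun. The sketch below assumes the same planted outliers as the reproducible example and stores each result in a list named results, a name introduced here purely for illustration.

```r
library(dplyr)

# Same 1.5 * IQR rule used earlier on this page.
is_outlier <- function(x) {
  x < quantile(x, 0.25) - 1.5 * IQR(x) | x > quantile(x, 0.75) + 1.5 * IQR(x)
}

ens_id <- c("mpg", "wt")
test <- mtcars
test$mpg[10] <- 9800  # planted outliers, as in the reproducible example
test$wt[22] <- 4895

results <- list()  # hypothetical container, one data frame per column
for (col in ens_id) {
  results[[col]] <- test %>%
    tibble::rownames_to_column(var = "outlier") %>%
    group_by(gear) %>%
    # .data[[col]] looks up the column whose name is stored in col
    mutate(outlier = if_else(is_outlier(.data[[col]]), outlier, NA_character_)) %>%
    ungroup()
}

head(results$mpg)
```

This keeps the car name in the outlier column only for flagged rows, so the column can be passed straight to geom_text(aes(label = outlier), na.rm = TRUE).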
Labelling outliers with ggplot
If you want IQR to be calculated by country, you need to group the data. You could probably do it globally (i.e. before you send the data to ggplot) or locally in the layer.
library(dplyr)
library(ggplot2)
ggplot(df, aes(x = as.factor(WAVE), y = PERCENT, fill = COUNTRY)) +
  geom_boxplot(alpha = 0.3) +
  geom_point(aes(color = AGE_GROUP, group = COUNTRY), position = position_dodge(width = 0.75)) +
  geom_text(aes(group = COUNTRY,
                label = ifelse(!between(PERCENT, -1.3 * IQR(PERCENT), 1.3 * IQR(PERCENT)),
                               paste(" ", COUNTRY, ",", AGE_GROUP, ",", round(PERCENT, 1), "%, n =", round(N, 0)),
                               '')),
            position = position_dodge(width = 0.75),
            hjust = "left", size = 3)
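The "global" route could look roughly like the sketch below: group by country first, build the label column, and only then hand the data to ggplot. The data frame is made up for illustration and keeps only the COUNTRY and PERCENT columns; the threshold is the same 1.3 * IQR expression used in the layer above.

```r
library(dplyr)

# Hypothetical data: country A has one extreme value, country B is widely spread.
df <- data.frame(
  COUNTRY = rep(c("A", "B"), each = 5),
  PERCENT = c(-1, 0, 1, 0, 8, -10, -5, 0, 5, 10)
)

# Grouped: IQR(PERCENT) is computed separately within each country.
labelled <- df %>%
  group_by(COUNTRY) %>%
  mutate(label = ifelse(!between(PERCENT, -1.3 * IQR(PERCENT), 1.3 * IQR(PERCENT)),
                        paste(COUNTRY, ",", round(PERCENT, 1), "%"), "")) %>%
  ungroup()

# Ungrouped, for comparison: a single IQR over all rows flags different points.
n_global <- sum(!between(df$PERCENT, -1.3 * IQR(df$PERCENT), 1.3 * IQR(df$PERCENT)))
n_grouped <- sum(labelled$label != "")
```

With the label precomputed, the text layer reduces to geom_text(aes(label = label), ...), and comparing n_grouped with n_global shows how much the grouping changes which points get labelled.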
Boxplot outlier labeling in R
I took a look at this with debug(boxplot.with.outlier.label), and ... it turns out there's a bug in the function.
The error occurs on line 125, where the data.frame DATA is constructed from x, y and label_name.
Previously x and y have been reordered, while lab_y hasn't been. When the supplied value of x (your x1) isn't itself already in order, you'll get the kind of jumbling you experienced.
As an immediate fix, you can pre-order the x values like this (or do something more elegant):
df <- data.frame(y, x1, lab_y, stringsAsFactors=FALSE)
df <- df[order(df$x1), ]
# Needed since lab_y is not searched for in data (though it probably should be)
lab_y <- df$lab_y
boxplot.with.outlier.label(y~x1, lab_y, data=df)