Coloring Boxplot Outlier Points in ggplot2

Coloring boxplot outlier points in ggplot2?

To color the outlier points the same as their boxplots, you need to calculate the outliers and plot them separately. As far as I know, the built-in option for coloring outliers gives all outliers the same color.

The help file example

Using the same data as the 'geom_boxplot' help file:

library(ggplot2)
ggplot(mtcars, aes(x=factor(cyl), y=mpg, col=factor(cyl))) +
  geom_boxplot()

help file demo

Coloring the outlier points

There may be a more streamlined way to do this, but I prefer to calculate things by hand so I don't have to guess what's going on under the hood. Using the 'plyr' package, we can quickly get the upper and lower limits under the default (Tukey) method, which treats any point outside the range [Q1 - 1.5 * IQR, Q3 + 1.5 * IQR] as an outlier. Q1 and Q3 are the 1/4 and 3/4 quantiles of the data, and IQR = Q3 - Q1. We could write this all as one huge statement, but since the 'plyr' package's 'mutate' function lets us reference newly created columns, we might as well split it up for easier reading and debugging, like so:

library(plyr)
plot_Data <- ddply(mtcars, .(cyl), mutate, Q1=quantile(mpg, 1/4), Q3=quantile(mpg, 3/4), IQR=Q3-Q1, upper.limit=Q3+1.5*IQR, lower.limit=Q1-1.5*IQR)

We use the 'ddply' function because we are passing in a data frame and want a data frame as output ("d->d" ply). The 'mutate' function in the 'ddply' statement above preserves the original data frame and adds the new columns, and specifying .(cyl) tells 'ddply' to run the calculations separately for each group of 'cyl' values.
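
As a quick sanity check of the Tukey limits, here is a minimal base-R sketch for a single group (the 8-cylinder cars). The variable names ('mpg8', 'q1', 'lower_limit' and so on) are only illustrative and not part of the original answer; the limits should agree with the cyl == 8 rows of the 'ddply' output.

```r
# Tukey limits for one group, computed by hand (illustrative names)
mpg8 <- mtcars$mpg[mtcars$cyl == 8]

q1  <- quantile(mpg8, 1/4, names = FALSE)  # first quartile
q3  <- quantile(mpg8, 3/4, names = FALSE)  # third quartile
iqr <- q3 - q1                             # inter-quartile range

lower_limit <- q1 - 1.5 * iqr
upper_limit <- q3 + 1.5 * iqr

# Points outside [lower_limit, upper_limit] count as outliers
outliers8 <- mpg8[mpg8 < lower_limit | mpg8 > upper_limit]
outliers8
```

Doing this for every 'cyl' value by hand is exactly the repetition that 'ddply' takes care of.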

At this point, we can now plot the boxplot and then overwrite the outliers with new, colored points.

ggplot() +
  geom_boxplot(data=plot_Data, aes(x=factor(cyl), y=mpg, col=factor(cyl))) +
  geom_point(data=plot_Data[plot_Data$mpg > plot_Data$upper.limit | plot_Data$mpg < plot_Data$lower.limit,],
             aes(x=factor(cyl), y=mpg, col=factor(cyl)))

colored outliers

In this code we start from an empty 'ggplot' call and then add the boxplot and point geometries, each with its own data. The boxplot geometry could use the original data frame, but I am using our new 'plot_Data' to be consistent. The point geometry then plots only the outlier points, using our new 'lower.limit' and 'upper.limit' columns to determine outlier status. Since we use the same specification for the 'x' and 'col' aesthetics in both layers, the colors match between the boxplots and their corresponding outlier points.
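
To see which rows the point layer actually receives, the subset can be inspected on its own. The sketch below swaps in base R's 'ave()' for the 'plyr' call so that it runs without extra packages; that substitution, and the names 'd' and 'outlier_rows', are assumptions made for illustration rather than the answer's own method.

```r
# Per-group quartiles with base R's ave() (stand-in for ddply + mutate)
d <- mtcars
d$Q1  <- ave(d$mpg, d$cyl, FUN = function(v) quantile(v, 1/4, names = FALSE))
d$Q3  <- ave(d$mpg, d$cyl, FUN = function(v) quantile(v, 3/4, names = FALSE))
d$IQR <- d$Q3 - d$Q1
d$upper.limit <- d$Q3 + 1.5 * d$IQR
d$lower.limit <- d$Q1 - 1.5 * d$IQR

# The same condition that is used to build the data for geom_point()
outlier_rows <- subset(d, mpg > upper.limit | mpg < lower.limit)
outlier_rows[, c("mpg", "cyl", "lower.limit", "upper.limit")]
```

With 'mtcars', only 8-cylinder cars end up in 'outlier_rows', so only that boxplot gets extra colored points.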

Update: The OP requested a more complete explanation of the 'ddply' function used in this code. Here it is:

The 'plyr' family of functions are basically a way of subsetting data and performing a function on each subset of the data. In this particular case, we have the statement:

ddply(mtcars, .(cyl), mutate, Q1=quantile(mpg, 1/4), Q3=quantile(mpg, 3/4), IQR=Q3-Q1, upper.limit=Q3+1.5*IQR, lower.limit=Q1-1.5*IQR)

Let's break this down in the order the statement would be written. First, the selection of the 'ddply' function. We want to calculate the lower and upper limits for each value of 'cyl' in the 'mtcars' data. We could write a 'for' loop or other statement to calculate these values, but then we would have to write another logic block later to assess outlier status. Instead, we want to use 'ddply' to calculate the lower and upper limits and add those values to every line. We choose 'ddply' (as opposed to 'dlply', 'd_ply', etc.), because we are inputting a data frame and wanting a data frame as output. This gives us:

ddply(

We want to perform the statement on the 'mtcars' data frame, so we add that.

ddply(mtcars, 

Now, we want to perform our calculations using the 'cyl' values as a grouping variable. We use the 'plyr' function .() to refer to the variable itself rather than to the variable's value, like so:

ddply(mtcars, .(cyl),

The next argument specifies the function to apply to every group. We want our calculation to add new columns to the old data, so we choose the 'mutate' function. This preserves the old data and adds the new calculations as new columns. This is in contrast to other functions like 'summarize', which removes all of the old columns except the grouping variable(s).

ddply(mtcars, .(cyl), mutate, 

The final series of arguments are all of the new columns of data we want to create. We define these by specifying a name (unquoted) and an expression. First, we create the 'Q1' column.

ddply(mtcars, .(cyl), mutate, Q1=quantile(mpg, 1/4), 

The 'Q3' column is calculated similarly.

ddply(mtcars, .(cyl), mutate, Q1=quantile(mpg, 1/4), Q3=quantile(mpg, 3/4), 

Luckily, with the 'mutate' function, we can use newly created columns as part of the definition of other columns. This saves us from having to write one giant function or from having to run multiple functions. We need to use 'Q1' and 'Q3' in the calculation of the inter-quartile range for the 'IQR' variable, and that's easy with the 'mutate' function.

ddply(mtcars, .(cyl), mutate, Q1=quantile(mpg, 1/4), Q3=quantile(mpg, 3/4), IQR=Q3-Q1, 

We're finally where we want to be. We technically don't need the 'Q1', 'Q3', and 'IQR' columns, but they make our lower limit and upper limit equations a lot easier to read and debug. We can write our expressions just like the theoretical formulas: upper limit = Q3 + 1.5 * IQR and lower limit = Q1 - 1.5 * IQR.

ddply(mtcars, .(cyl), mutate, Q1=quantile(mpg, 1/4), Q3=quantile(mpg, 3/4), IQR=Q3-Q1, upper.limit=Q3+1.5*IQR, lower.limit=Q1-1.5*IQR)

Cutting out the middle columns for readability, this is what the new data frame looks like:

plot_Data[, c(-3:-11)]
# mpg cyl Q1 Q3 IQR upper.limit lower.limit
# 1 22.8 4 22.80 30.40 7.60 41.800 11.400
# 2 24.4 4 22.80 30.40 7.60 41.800 11.400
# 3 22.8 4 22.80 30.40 7.60 41.800 11.400
# 4 32.4 4 22.80 30.40 7.60 41.800 11.400
# 5 30.4 4 22.80 30.40 7.60 41.800 11.400
# 6 33.9 4 22.80 30.40 7.60 41.800 11.400
# 7 21.5 4 22.80 30.40 7.60 41.800 11.400
# 8 27.3 4 22.80 30.40 7.60 41.800 11.400
# 9 26.0 4 22.80 30.40 7.60 41.800 11.400
# 10 30.4 4 22.80 30.40 7.60 41.800 11.400
# 11 21.4 4 22.80 30.40 7.60 41.800 11.400
# 12 21.0 6 18.65 21.00 2.35 24.525 15.125
# 13 21.0 6 18.65 21.00 2.35 24.525 15.125
# 14 21.4 6 18.65 21.00 2.35 24.525 15.125
# 15 18.1 6 18.65 21.00 2.35 24.525 15.125
# 16 19.2 6 18.65 21.00 2.35 24.525 15.125
# 17 17.8 6 18.65 21.00 2.35 24.525 15.125
# 18 19.7 6 18.65 21.00 2.35 24.525 15.125
# 19 18.7 8 14.40 16.25 1.85 19.025 11.625
# 20 14.3 8 14.40 16.25 1.85 19.025 11.625
# 21 16.4 8 14.40 16.25 1.85 19.025 11.625
# 22 17.3 8 14.40 16.25 1.85 19.025 11.625
# 23 15.2 8 14.40 16.25 1.85 19.025 11.625
# 24 10.4 8 14.40 16.25 1.85 19.025 11.625
# 25 10.4 8 14.40 16.25 1.85 19.025 11.625
# 26 14.7 8 14.40 16.25 1.85 19.025 11.625
# 27 15.5 8 14.40 16.25 1.85 19.025 11.625
# 28 15.2 8 14.40 16.25 1.85 19.025 11.625
# 29 13.3 8 14.40 16.25 1.85 19.025 11.625
# 30 19.2 8 14.40 16.25 1.85 19.025 11.625
# 31 15.8 8 14.40 16.25 1.85 19.025 11.625
# 32 15.0 8 14.40 16.25 1.85 19.025 11.625

Just to give a contrast, if we were to do the same 'ddply' statement with the 'summarize' function, instead, we would have all of the same answers but without the columns of the other data.

ddply(mtcars, .(cyl), summarize, Q1=quantile(mpg, 1/4), Q3=quantile(mpg, 3/4), IQR=Q3-Q1, upper.limit=Q3+1.5*IQR, lower.limit=Q1-1.5*IQR)
# cyl Q1 Q3 IQR upper.limit lower.limit
# 1 4 22.80 30.40 7.60 41.800 11.400
# 2 6 18.65 21.00 2.35 24.525 15.125
# 3 8 14.40 16.25 1.85 19.025 11.625

Boxplot, how to match outliers' color to fill aesthetics?

As @koshke said, having the outliers colored like the lines of the box (not the fill color) is now easily possible by setting outlier.colour = NULL:

library(ggplot2)
library(ggplot2movies)  # the movies data moved here from ggplot2
m <- ggplot(movies, aes(y = votes, x = factor(round(rating)),
                        colour = factor(Animation)))
m + geom_boxplot(outlier.colour = NULL) + scale_y_log10()

boxplot with coloured outliers

  • outlier.colour must be written with "ou"
  • outlier.colour must be outside aes ()
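
If the 'movies' data is not at hand (in recent ggplot2 versions it is distributed in the separate ggplot2movies package), the same idea can be sketched with 'mtcars', which ships with R. The object name 'p' is just illustrative; note that 'outlier.colour' is passed to 'geom_boxplot()' directly and not inside 'aes()'.

```r
library(ggplot2)

# colour is mapped inside aes(); outlier.colour = NULL is a plain argument,
# so the outliers pick up the colour of their own box
p <- ggplot(mtcars, aes(x = factor(cyl), y = mpg, colour = factor(cyl))) +
  geom_boxplot(outlier.colour = NULL)
p
```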

I'm posting this as a late answer because I find myself looking this up again and again, and I also posted it for the related question Coloring boxplot outlier points in ggplot2?.

geom_boxplot, how to specifically color only outliers based on group and keep everything black?

I take back my comment: you can do something about it, namely plotting the outliers as separate points.

First, you'd make a boxplot as per usual and take the layer data.

g <- ggplot(mpg, aes(class, hwy)) + geom_boxplot()

ld <- layer_data(g)
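
Before using the layer data, it can help to look at what it contains. The following is a small sketch rather than part of the original answer: each row of 'ld' describes one box, 'ymin' and 'ymax' are the whisker ends, and the 'outliers' list column holds the outlying values that ggplot2 found for that box.

```r
library(ggplot2)

g  <- ggplot(mpg, aes(class, hwy)) + geom_boxplot()
ld <- layer_data(g)

# One row per box: whisker ends, hinges and median
ld[, c("group", "ymin", "lower", "middle", "upper", "ymax")]

# Number of outliers ggplot2 computed for each box
n_outliers <- lengths(ld$outliers)
n_outliers
```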

Now you split the original data on the same variable as your x-axis and use the boxplot data to figure out which of your datapoints are outliers.

split <- split(mpg, mpg$class)

outliers <- lapply(seq_along(split), function(i) {
  box <- ld[ld$group == i, ]
  data <- split[[i]]
  data <- data[data$hwy > box$ymax | data$hwy < box$ymin, ]
  data
})
outliers <- do.call(rbind, outliers)

Then you plot the boxplot and points as different layers, and you'll have the usual level of control over your points:

ggplot(mpg, aes(class, hwy)) +
  geom_boxplot(outlier.shape = NA) +
  geom_point(data = outliers, aes(colour = manufacturer))

Color outliers multiple factors in boxplot

Do you just want to change the outliers' colour? If so, you can do it easily by drawing the boxplot twice.

p <- ggplot(data = df, aes(x = factor(delta), y = value)) +
  geom_boxplot(aes(colour=factor(metric))) +
  geom_boxplot(aes(fill=factor(metric)), outlier.colour = NA)
# outlier.shape = 21 # if you want a border
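
The question's 'df' is not shown here, so the sketch below builds a hypothetical stand-in with the three columns the code relies on ('delta', 'metric', 'value'); the values are random and purely illustrative. The first boxplot layer draws coloured lines and outliers, and the second draws the filled boxes on top with its own outliers hidden.

```r
library(ggplot2)

# Hypothetical stand-in for the question's 'df' (column names assumed)
set.seed(1)
df <- data.frame(
  delta  = rep(c(1, 2), each = 60),
  metric = rep(c("P", "R", "C"), times = 40),
  value  = rt(120, df = 2)  # heavy-tailed draws, so outliers are likely
)

p <- ggplot(data = df, aes(x = factor(delta), y = value)) +
  geom_boxplot(aes(colour = factor(metric))) +                   # coloured outliers
  geom_boxplot(aes(fill = factor(metric)), outlier.colour = NA)  # filled boxes
p
```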

[EDITED]

colss <- c(P="firebrick3", R="skyblue", C="mediumseagreen")
p + scale_colour_manual(values = colss) + # outlier colours
  scale_fill_manual(values = colss)       # box colours

# The development version (2.1.0.9001) of geom_boxplot() has an outlier.fill argument,
# so I guess the code below will produce similar output in the near future.
p2 <- ggplot(data = df, aes(x = factor(delta), y = value)) +
  geom_boxplot(aes(fill=factor(metric)), outlier.shape = 21, outlier.colour = NA)

Fill outliers with same color as Boxplots Fill color in ggplot R?

We can set outlier.shape to 21 (a fillable point shape) so that the outliers inherit the fill colors, and plot the points before the boxplot:

ToothGrowth$dose <- as.factor(ToothGrowth$dose)
ggplot(ToothGrowth, aes(x=dose, y=len)) +
  geom_point() +
  geom_boxplot(width=0.3, aes(fill=dose), outlier.shape = 21) +
  scale_fill_manual(values=c("firebrick", "royalblue", "yellow"))

In the resulting plot it is hardly visible, but the outlier at dose 0.5 is red.

Edit
To also display points inside the box:

library(dplyr)    # for %>%, group_by() and mutate()
library(ggplot2)

# TRUE for values outside the Tukey limits of x
is_outlier <- function(x) {
  return(x < quantile(x, 0.25) - 1.5 * IQR(x) | x > quantile(x, 0.75) + 1.5 * IQR(x))
}

ToothGrowth %>%
  group_by(dose) %>%
  # keep non-outliers for geom_point(); outliers become NA and are drawn by the boxplot
  mutate(outlier = ifelse(is_outlier(len), as.numeric(NA), as.numeric(len))) %>%
  ggplot(aes(x=dose, y=len)) +
  geom_boxplot(width=0.3, aes(fill=dose), outlier.shape = 21) +
  geom_point(aes(x = dose, y = outlier)) +
  scale_fill_manual(values=c("firebrick", "royalblue", "yellow"))
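
The 'is_outlier' helper can be tried on a small vector to see what it flags. This is only an illustrative check with made-up numbers ('x' and 'flags' are hypothetical names): for these five values Q1 is 2, Q3 is 4 and the IQR is 2, so the limits are -1 and 7.

```r
# Same helper as above, repeated so the snippet runs on its own
is_outlier <- function(x) {
  x < quantile(x, 0.25) - 1.5 * IQR(x) | x > quantile(x, 0.75) + 1.5 * IQR(x)
}

x <- c(1, 2, 3, 4, 100)   # made-up values; 100 lies far above the upper limit of 7
flags <- is_outlier(x)
x[flags]                  # the values flagged as outliers
```

Inside 'group_by(dose)', 'mutate' applies the same check separately to each dose group.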

