Ggplot Boxplot - Length of Whiskers with Logarithmic Axis

ggplot boxplot - length of whiskers with logarithmic axis

The problem arises because scale_y_log10 transforms the data before the stats are calculated. This does not matter for the median and quartiles, because e.g. 10^median(log10(x)) is still the median of x (up to interpolation between neighbouring points), so these are plotted in the correct location. But it does matter for the whiskers, which are calculated using 1.5 * IQR, because 10^(1.5 * IQR(log10(x))) is not equal to 1.5 * IQR(x). So the whiskers end up in different places.

This error becomes evident if we compare

boxplot.stats(my.df$b)$stats
# [1] 117.4978 407.3983 502.0460 601.2937 873.0992
10^boxplot.stats(log10(my.df$b))$stats
# [1] 231.1603 407.3983 502.0459 601.2935 975.1906

Here we see that the median and percentile points are essentially identical, but the whisker ends (first and last elements of the stats vector) differ.
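Since my.df is not shown here, the same comparison can be reproduced with made-up data. The sketch below is only an illustration: the sample x and the names raw, via_log, fence_raw and fence_log are assumptions, not part of the original answer. It also spells out why the limits differ: on the log scale the upper limit back-transforms to Q3 * (Q3 / Q1)^1.5 rather than Q3 + 1.5 * (Q3 - Q1).

```r
# Made-up sample (an assumption; not the my.df used above).
# n = 101 makes the median and hinges fall exactly on data points.
set.seed(1)
x <- rlnorm(101, meanlog = 6, sdlog = 0.5)

raw     <- boxplot.stats(x)$stats             # stats on the original scale
via_log <- 10^boxplot.stats(log10(x))$stats   # stats on the log10 scale, back-transformed

# Hinges and median agree, because log10 keeps the ordering of the data
all.equal(raw[2:4], via_log[2:4])

# The 1.5 * IQR limits do not agree
q1 <- raw[2]
q3 <- raw[4]
fence_raw <- q3 + 1.5 * (q3 - q1)    # upper limit computed on the original scale
fence_log <- q3 * (q3 / q1)^1.5      # upper limit computed on the log10 scale
c(raw = fence_raw, log = fence_log)
```

Whether the whisker ends themselves move depends on whether any data points lie between the two limits.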

This detailed and useful answer by @eipi10 shows how to calculate the stats yourself and force ggplot to use these user-defined stats rather than its internal algorithm (which gives the wrong whiskers here). Using this approach, it is relatively simple to calculate the statistics on the untransformed data and plot those instead:

library(ggplot2)

# Function to use boxplot.stats to set the box-and-whisker locations.
# stat_summary receives the log10-transformed values, so undo the transform
# (10^x), compute the stats on the original scale, then re-apply log10.
mybxp = function(x) {
  bxp = log10(boxplot.stats(10^x)[["stats"]])
  names(bxp) = c("ymin", "lower", "middle", "upper", "ymax")
  return(bxp)
}

# Function to use boxplot.stats for the outliers
myout = function(x) {
  data.frame(y = log10(boxplot.stats(10^x)[["out"]]))
}

ggplot(my.df.long, aes(x = variable, y = vals)) + theme_bw() + coord_flip() +
  scale_y_log10(breaks = c(5, 10, 20, 50, 100, 200, 500, 1000), limits = c(5, 1000)) +
  stat_summary(fun.data = mybxp, geom = "boxplot") +
  stat_summary(fun.data = myout, geom = "point")

This produces the correct plot.


A note on using coord_trans as an alternative approach:

Using coord_trans(y = "log10") instead of scale_y_log10 causes the stats to be calculated (correctly) on the untransformed data. However, coord_trans cannot be used in combination with coord_flip, so this does not solve the issue of creating horizontal boxplots with a log axis. A suggestion to use ggdraw(switch_axis_position()) from the cowplot package to flip the axes after using coord_trans did not work either; it throws an error (cowplot v0.4.0 with ggplot2 v2.1.0):

Error in Ops.unit(gyl$x, grid::unit(0.5, "npc")) : both operands must be units
In addition: Warning message:
axis.ticks.margin is deprecated. Please set margin property of axis.text instead

Resize whiskers (width) in a ggplot boxplot with a grouping variable

You can get the desired outcome by adjusting the position of your stat_boxplot(). For me, the error bars line up with the boxes after adding the argument position = position_dodge(width = 0.75). I found the value 0.75 by trial and error; it is likely no coincidence that stat_boxplot's default box width is 0.75 of the data resolution.

library(ggplot2)
data(Salaries, package = "carData")  # assumption: the Salaries data set from carData

p <- ggplot(Salaries, aes(x = rank, y = salary, fill = sex)) +
  stat_boxplot(geom = "errorbar", width = 0.3, position = position_dodge(width = 0.75)) +
  geom_boxplot() +
  labs(title = "Salary by Rank and Sex", x = "Rank", y = "Salary")

print(p)


geom_boxplot gave wrong whiskers

From the quoted section:

The upper whisker extends from the hinge to the largest value no further than 1.5 * IQR from the hinge (where IQR is the inter-quartile range, or distance between the first and third quartiles).

By "value" they mean one of the original data points. If you plot the data, you can see that there are no values between the top hinge at 7.09 and the upper limit at 16.15 (hinge + 1.5 * IQR), so there is nothing for the upper whisker to extend to. If these quartiles had arisen from data with a value lying in that range, the upper whisker would extend to it.

ggplot(data, aes(y = value)) +
  geom_jitter(aes(x = 0.5), width = 0.05) +
  stat_boxplot(geom = "errorbar", width = 0.3,
               color = "red", size = 1.5) +
  geom_boxplot(width = 0.5, alpha = 0.5) +
  geom_hline(yintercept = c(7.09, 16.15), lty = "dashed")
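The rule quoted above can also be checked on a tiny made-up vector with base R. This is only a sketch: the data are invented, and boxplot.stats uses Tukey hinges rather than the quantiles geom_boxplot computes, although for this particular vector the two coincide.

```r
# Made-up data: eight evenly spaced values and one far-away point
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 20)

s <- boxplot.stats(x)                     # coef = 1.5 by default
upper_hinge <- s$stats[4]                 # 7
iqr <- s$stats[4] - s$stats[2]            # 7 - 3 = 4
upper_limit <- upper_hinge + 1.5 * iqr    # 13

s$stats[5]   # 8: the largest data point that is <= 13, not 13 itself
s$out        # 20: beyond the limit, so it is treated as an outlier
```

The whisker stops at 8 because no data point lies between 8 and the limit of 13.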


Label whiskers on ggplot boxplot when there are outliers

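The five boxplot statistics can be computed with boxplot.stats (for small samples its hinges can differ slightly from the quartiles geom_boxplot draws). You can use it directly in your stat_summary: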

library(ggplot2)

ggplot(mtcars, aes(x = factor(cyl), y = mpg, fill = factor(cyl))) +
  geom_boxplot(width = 0.6) +
  stat_summary(
    # after_stat(y) and fun replace the deprecated ..y.. and fun.y (ggplot2 >= 3.3.0)
    aes(label = sprintf("%1.1f", after_stat(y)), color = factor(cyl)),
    geom = "text",
    fun = function(y) boxplot.stats(y)$stats,
    position = position_nudge(x = 0.33),
    size = 3.5) +
  theme_bw()


If you only need the whiskers, simply use boxplot.stats(y)$stats[c(1, 5)] instead.
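To see which values the whisker labels would carry without drawing anything, the same boxplot.stats call can be applied per group in base R. This is a small sketch; the name whisk is an assumption made for illustration.

```r
# Whisker ends (elements 1 and 5 of the stats vector) for each cyl group
whisk <- sapply(split(mtcars$mpg, mtcars$cyl),
                function(y) boxplot.stats(y)$stats[c(1, 5)])
rownames(whisk) <- c("lower", "upper")

whisk   # one column per cyl group: "4", "6" and "8"
```

Each column holds the two numbers that stat_summary would print next to the corresponding box.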

Sample size over whiskers of boxplot

Okay, scratch that last attempt; I figured it out. boxplot.stats and geom_boxplot calculate the quartiles differently, and that skews everything for small sample sizes. We can retrieve the actual stats geom_boxplot uses with ggplot_build.

First, make your plot as above (I called it p). Then calculate the sample size for each x variable:

library(dplyr)  # count() comes from dplyr
samp <- count(mtcars, cyl)

Now retrieve the data from the plot using ggplot_build:

ggstat <- ggplot_build(p)$data
ggwhisk1 <- ggstat[[1]]$ymax

Now combine that with the sample size, and use the result as the data for geom_text:

ggwhisk2 <- data.frame(samp, whisk = ggwhisk1)
p <- p + geom_text(data = ggwhisk2, size = 2, vjust = -.5,
                   aes(x = cyl, y = whisk, label = paste0("n = ", n)))

Voila!!
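The difference between boxplot.stats and the quartiles used by geom_boxplot is easy to see on a small made-up sample. This is a sketch under the assumption that stat_boxplot relies on quantile() with its default settings, which matches my understanding of ggplot2 but is not stated in the answer above.

```r
x <- c(1, 2, 3, 4, 5, 6)   # made-up small sample

hinges    <- boxplot.stats(x)$stats[c(2, 4)]               # Tukey hinges (via fivenum)
quartiles <- quantile(x, c(0.25, 0.75), names = FALSE)     # default quantile type 7

hinges      # 2 5
quartiles   # 2.25 4.75
```

With larger samples the two definitions converge, which is why the mismatch mostly shows up for small groups.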

Changing whisker length of multiple boxplot in R

By default (notch = FALSE), geom_boxplot() should already give you the whiskers you want: they extend to the most extreme data points within Q1 - 1.5*IQR and Q3 + 1.5*IQR. The exact values depend on how the quantiles and the IQR are defined.

If you insist on setting them manually with stat_summary:

library(ggplot2)
library(tidyr)  # for gather()

# geom_boxplot parameters with stat_summary
f <- function(x) {
  r <- quantile(x, probs = c(0.25, 0.25, 0.5, 0.75, 0.75))
  r[[1]] <- r[[1]] - 1.5 * IQR(x)  # ymin: lower whisker drawn at Q1 - 1.5*IQR
  r[[5]] <- r[[5]] + 1.5 * IQR(x)  # ymax: upper whisker drawn at Q3 + 1.5*IQR
  names(r) <- c("ymin", "lower", "middle", "upper", "ymax")
  r
}

# To subset the outlying points for plotting
o <- function(x) {
  r <- quantile(x, probs = c(0.25, 0.75))
  r[[1]] <- r[[1]] - 1.5 * IQR(x)
  r[[2]] <- r[[2]] + 1.5 * IQR(x)
  subset(x, x < r[[1]] | r[[2]] < x)
}

# added seed for consistency
set.seed(123)

df <- data.frame(matrix(rnorm(2000), ncol = 10))
plot.data <- gather(df, variable, value)
# plot.data$out <- as.numeric(rep(input_data, each = nrow(x_train)))
p <- ggplot(plot.data, aes(x = 0, y = value))
p <- p + stat_summary(fun.data = f, geom = "boxplot") +
  stat_summary(fun = o, geom = "point")  # fun replaces the deprecated fun.y
#p <- p + geom_point(aes(x = 0, y = test_data), color = "red")
p <- p + facet_wrap(~variable, scales = "free_x", strip.position = 'top', ncol = 2)
p <- p + coord_flip()
p <- p + xlab("") + ylab("")
# theme_bw() goes first so that it does not reset the theme() tweaks
p <- p + theme_bw() + theme(legend.position = "none")
p <- p + theme(axis.text.y = element_blank(),
               axis.ticks.y = element_blank())
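To see what the helpers f() and o() above actually draw, the same arithmetic can be run on one column of the simulated data without ggplot. This is only a sketch; the names q, fence and inside are assumptions made for illustration.

```r
set.seed(123)
x <- rnorm(200)   # the same draw as the first column of df above

q <- quantile(x, probs = c(0.25, 0.75), names = FALSE)
fence <- c(lower = q[1] - 1.5 * IQR(x), upper = q[2] + 1.5 * IQR(x))

# Whisker ends as f() draws them: exactly at Q1 - 1.5*IQR and Q3 + 1.5*IQR
fence

# Whisker ends under the default rule: the most extreme data points inside those limits
inside <- x[x >= fence["lower"] & x <= fence["upper"]]
range(inside)

# Points that o() would return for plotting as outliers
x[x < fence["lower"] | x > fence["upper"]]
```

The whiskers drawn by f() are therefore at least as long as the default ones, since they run all the way to the limits rather than stopping at a data point.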

Can I use log whiskers on a log-scale boxplot?

You can take the values calculated by boxplot(log(x)) and transform them back onto the original scale of x. I'm not sure how meaningful the resulting plot is, though:

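library('beeswarm')

x <- rlnorm(n = 50, meanlog = 0, sdlog = 1)
beeswarm(x, log = TRUE)
# Compute the stats on the log scale without plotting
box = boxplot(log(x), plot = FALSE)
# Transform them back onto the original scale of x
box$stats = exp(box$stats)
box$conf = exp(box$conf)
# outline = FALSE: box$out is still on the log scale, and beeswarm already shows every point
bxp(box, add = TRUE, outline = FALSE)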
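If the beeswarm package is not available, roughly the same picture can be drawn with base graphics alone. This is a sketch under that assumption, not the original answer's method: it back-transforms the outliers as well and lets bxp draw its own logarithmic axis.

```r
set.seed(42)
x <- rlnorm(n = 50, meanlog = 0, sdlog = 1)   # made-up data, as in the example above

box <- boxplot(log(x), plot = FALSE)   # stats computed on the log scale
box$stats <- exp(box$stats)            # back-transform box, median and whisker ends
box$conf  <- exp(box$conf)             # ... the notch limits
box$out   <- exp(box$out)              # ... and any outliers

bxp(box, log = "y")                    # draw on a logarithmic y axis
```

On the logarithmic axis the box then looks the same as boxplot(log(x)) would, only labelled in the original units.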



