ggplot boxplot - length of whiskers with logarithmic axis
The problem arises because scale_y_log10 transforms the data before the stats are calculated. This does not matter for the median and percentile points, because e.g. 10^log10(median) is still the median value, which will be plotted in the correct location. But it does matter for the whiskers, which are calculated using 1.5 * IQR, because 10^(1.5 * IQR(log10(x))) is not equal to 1.5 * IQR(x). So the calculation fails for the whiskers.
This error becomes evident if we compare
boxplot.stats(my.df$b)$stats
# [1] 117.4978 407.3983 502.0460 601.2937 873.0992
10^boxplot.stats(log10(my.df$b))$stats
# [1] 231.1603 407.3983 502.0459 601.2935 975.1906
Here we see that the median and percentile points are essentially identical, but the whisker ends (the first and last elements of the stats vector) differ.
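The same effect can be reproduced without the question's data. The sketch below uses made-up values spaced evenly on a log10 scale (an assumption, chosen so that the hinges fall exactly on data points) and only base R's boxplot.stats:

```r
# Illustrative data only (not my.df): 41 values evenly spaced on a log10 scale
x <- 10^seq(0, 3, length.out = 41)

raw  <- boxplot.stats(x)$stats             # stats computed on the raw values
back <- 10^boxplot.stats(log10(x))$stats   # stats computed on log10(x), back-transformed

# Hinges and median agree: here they are data points, and a monotone
# transform maps data points one-to-one
all.equal(raw[2:4], back[2:4])

# The upper whisker does not agree: the 1.5 * IQR fence sits elsewhere
raw[5]    # roughly 422
back[5]   # 1000, the maximum of x
```

On the raw scale the upper fence cuts off the largest values, while on the log10 scale the same rule reaches all the way to the maximum.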
This detailed and useful answer by @eipi10 shows how to calculate the stats yourself and force ggplot to use these user-defined stats rather than its internal (and incorrect) algorithm. Using this approach, it becomes relatively simple to calculate the correct statistics and use these instead.
# Function to use boxplot.stats to set the box-and-whisker locations
mybxp = function(x) {
  bxp = log10(boxplot.stats(10^x)[["stats"]])
  names(bxp) = c("ymin","lower", "middle","upper","ymax")
  return(bxp)
}
# Function to use boxplot.stats for the outliers
myout = function(x) {
  data.frame(y=log10(boxplot.stats(10^x)[["out"]]))
}
ggplot(my.df.long, aes(x=variable, y=vals)) + theme_bw() + coord_flip() +
  scale_y_log10(breaks=c(5,10,20,50,100,200,500,1000), limits=c(5,1000)) +
  stat_summary(fun.data=mybxp, geom="boxplot") +
  stat_summary(fun.data=myout, geom="point")
This produces the correct plot.
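As a quick sanity check of the helper (a sketch with made-up data, not the question's my.df.long): because the scale transformation is applied before the stats, stat_summary hands mybxp values that are already log10-transformed, so the function should return the log10 of the raw-scale boxplot statistics.

```r
# Same helper as above, repeated so the snippet runs on its own
mybxp <- function(x) {
  bxp <- log10(boxplot.stats(10^x)[["stats"]])
  names(bxp) <- c("ymin", "lower", "middle", "upper", "ymax")
  bxp
}

# Made-up positive data for illustration
set.seed(1)
vals <- rlnorm(100, meanlog = 4, sdlog = 1)

# What the helper would receive after scale_y_log10: log10(vals)
res <- mybxp(log10(vals))

# Back on the raw scale, this matches boxplot.stats on the untransformed data
all.equal(unname(10^res), boxplot.stats(vals)$stats)
```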
A note on using coord_trans as an alternative approach: using coord_trans(y = "log10") instead of scale_y_log10 causes the stats to be calculated (correctly) on the untransformed data. However, coord_trans cannot be used in combination with coord_flip, so this does not solve the issue of creating horizontal boxplots with a log axis. The suggestion here to use ggdraw(switch_axis_position()) from the cowplot package to flip the axes after using coord_trans did not work either; it throws an error (cowplot v0.4.0 with ggplot2 v2.1.0):
Error in Ops.unit(gyl$x, grid::unit(0.5, "npc")) : both operands must be units
In addition: Warning message:
axis.ticks.margin is deprecated. Please set margin property of axis.text instead
Resize whiskers (width) in a ggplot boxplot with a grouping variable
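I can get your desired outcome by adjusting the position of your stat_boxplot(). For me, it appears correct after adding the following argument: position = position_dodge(width = 0.75). It took trial and error to arrive at 0.75, though as far as I can tell this is no coincidence: by default the boxes are 0.75 of the data resolution wide, so the error bars need the same dodge width to line up with them.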
library(ggplot2)
# Salaries is assumed to be the data set from the carData package
p <- ggplot(Salaries, aes(x=rank, y=salary, fill=sex)) +
  stat_boxplot(geom= 'errorbar' , width = 0.3, position = position_dodge(width = 0.75) ) +
  geom_boxplot() +
  labs(title="Salary by Rank and Sex", x="Rank", y="Salary")
show(p)
geom_boxplot gave wrong whiskers
From the quoted section:
The upper whisker extends from the hinge to the largest value no
further than 1.5 * IQR from the hinge (where IQR is the inter-quartile
range, or distance between the first and third quartiles).
By "value" they mean a value from among the original data points. If you plot the data, there are no values between the top hinge at 7.09 and the upper limit at 16.15 (the hinge plus 1.5 * IQR), so the whisker has no data point in that range to extend to. If these quartiles had arisen from data with a value lying in that range, the upper whisker would go there.
ggplot(data, aes(y = value)) +
  geom_jitter(aes(x = 0.5), width = 0.05) +
  stat_boxplot(geom = "errorbar", width = 0.3,
               color = "red", size = 1.5) +
  geom_boxplot(width = 0.5, alpha = 0.5) +
  geom_hline(yintercept = c(7.09, 16.15), lty = "dashed")
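The rule can also be worked through by hand. This is a minimal sketch on made-up numbers (not the question's data), and it assumes the hinges are taken from quantile() with its default settings:

```r
# Made-up data: a gap between 7 and the single large value 30
x <- c(1, 2, 3, 4, 5, 6, 7, 30)

q   <- quantile(x, c(0.25, 0.75))        # hinges via quantile()'s defaults
iqr <- unname(q[2] - q[1])
upper_fence <- unname(q[2]) + 1.5 * iqr  # furthest the whisker may reach

# The whisker ends at the largest data point that is not beyond the fence
upper_whisker <- max(x[x <= upper_fence])

upper_fence     # 11.5
upper_whisker   # 7
```

The fence is at 11.5, but the whisker stops at 7 because no data point lies between 7 and 11.5; the value 30 is beyond the fence and would be drawn as an outlier.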
Label whiskers on ggplot boxplot when there are outliers
Boxplots use boxplot.stats. You can use this directly in your stat_summary:
# Note: in ggplot2 >= 3.3.0, fun.y and ..y.. are deprecated in favour of fun and after_stat(y)
ggplot(mtcars, aes(x=factor(cyl), y=mpg, fill=factor(cyl))) +
  geom_boxplot(width=0.6) +
  stat_summary(
    aes(label=sprintf("%1.1f", ..y..), color=factor(cyl)),
    geom="text",
    fun.y = function(y) boxplot.stats(y)$stats,
    position=position_nudge(x=0.33),
    size=3.5) +
  theme_bw()
If you only need the whiskers, simply use boxplot.stats(y)$stats[c(1, 5)] instead.
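To see which numbers would end up as whisker labels, the same call can be run outside ggplot. A small base R sketch on the built-in mtcars data (the name whiskers is just for illustration):

```r
# Whisker ends (elements 1 and 5 of $stats) of mpg for each cyl group in mtcars
whiskers <- sapply(split(mtcars$mpg, mtcars$cyl),
                   function(y) boxplot.stats(y)$stats[c(1, 5)])
rownames(whiskers) <- c("lower", "upper")
whiskers
```

Each column corresponds to one value of cyl, with the lower and upper whisker end in the two rows.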
Sample size over whiskers of boxplot
Okay, scratch that last attempt; I figured it out. boxplot.stats and geom_boxplot calculate the quartile stats differently, and that skews everything for small sample sizes. We can retrieve the actual stats geom_boxplot uses with ggplot_build.
Here is how it's done. First, make your plot as above; I called it p.
Now calculate the sample size for each x variable:
library(dplyr)  # count() here is dplyr's
samp <- count(mtcars, cyl)
Now retrieve the data from the plot using ggplot_build:
ggstat <- ggplot_build(p)$data
ggwhisk1 <- ggstat[[1]]$ymax
Now combine that with the sample size, and use that data in geom_text:
ggwhisk2 <- data.frame(samp, whisk = ggwhisk1)
p <- p + geom_text(data = ggwhisk2, size = 2,
aes(x = cyl, y = whisk, label = paste0("n =", n), vjust = -.5))
Voila!!
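The difference between the two quartile definitions can be seen in base R. This sketch assumes, as the answer suggests, that geom_boxplot relies on quantile()-style quartiles, and compares them with the hinges from boxplot.stats for the 8-cylinder cars in mtcars:

```r
# mpg for the 8-cylinder cars in mtcars (14 values)
y <- mtcars$mpg[mtcars$cyl == 8]

hinges    <- boxplot.stats(y)$stats[c(2, 4)]     # Tukey hinges (via fivenum)
quartiles <- unname(quantile(y, c(0.25, 0.75)))  # quantile()'s default (type 7)

hinges      # 14.3 16.4
quartiles   # 14.4 16.25
```

With only 14 observations the two definitions already disagree, which is why reading the values back from ggplot_build is the safer way to position the labels.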
Changing whisker length of multiple boxplot in R
By default (notch=FALSE), geom_boxplot() should give you the whiskers you want (Q1 - 1.5*IQR / Q3 + 1.5*IQR). See a more current question link. Note, though, that the result depends on which definitions of the quantiles and the IQR are used.
If you insist on setting them manually with stat_summary:
# geom_boxplot parameters with stat summary
f <- function(x) {
  r <- quantile(x, probs = c(0.25, 0.25, 0.5, 0.75, 0.75))
  r[[1]]<-r[[1]]-1.5*IQR(x) #ymin lower whisker, as per geom_boxplot
  r[[5]]<-r[[5]]+1.5*IQR(x) #ymax upper whisker
  names(r) <- c("ymin", "lower", "middle", "upper", "ymax")
  r
}
# To subset the outlying points for plotting,
o <- function(x) {
  r <- quantile(x, probs = c(0.25, 0.75))
  r[[1]]<-r[[1]]-1.5*IQR(x)
  r[[2]]<-r[[2]]+1.5*IQR(x)
  subset(x, x < r[[1]] | r[[2]] < x)
}
library(ggplot2)
library(tidyr)  # gather()
# added seed for consistency
set.seed(123)
df <- data.frame(matrix(rnorm(2000), ncol = 10))
plot.data <- gather(df, variable, value)
# plot.data$out <- as.numeric(rep(input_data, each = nrow(x_train)))
p <- ggplot(plot.data, aes(x = 0, y=value))
p <- p + stat_summary(fun.data = f, geom="boxplot")+
stat_summary(fun.y = o, geom="point")
#p <- p + geom_point(aes(x = 0, y = test_data), color = "red")
p <- p + facet_wrap(~variable, scales = "free_x", strip.position = 'top', ncol = 2)
p <- p + coord_flip()
p <- p + xlab("") + ylab("")
p <- p + theme_bw() + theme(legend.position="none")  # complete theme first, so it does not reset legend.position
p <- p + theme(axis.text.y=element_blank(),
axis.ticks.y=element_blank())
Can I use log whiskers on a log-scale boxplot?
You can take the values calculated by boxplot(log(x)) and transform them back onto the original scale of x. I'm not sure how meaningful the resulting plot is, though:
x <- rlnorm(n=50, meanlog=0, sdlog=1)
library('beeswarm')
beeswarm(x, log=TRUE)
box = boxplot(log(x), add = FALSE, plot = FALSE, outline = FALSE)
box$stats = exp(box$stats)
box$conf = exp(box$conf)
bxp(box, add=TRUE)
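One way to read the back-transformed whiskers is that the usual additive 1.5 * IQR fences become multiplicative on the original scale. A sketch with made-up data (powers of two, chosen so that the hinges fall exactly on data points):

```r
# Made-up data: powers of two, so the hinges fall exactly on data points
x <- 2^(0:8)

s  <- exp(boxplot.stats(log(x))$stats)  # log-scale stats, back-transformed
q1 <- s[2]
q3 <- s[4]

# On the log scale the fences are log(q1) - 1.5 * (log(q3) - log(q1)) and
# log(q3) + 1.5 * (log(q3) - log(q1)); back-transformed, the hinges are
# divided or multiplied by (q3 / q1)^1.5
lower_fence <- q1 / (q3 / q1)^1.5
upper_fence <- q3 * (q3 / q1)^1.5

c(lower_fence, upper_fence)   # 0.0625 and 4096
```

So on the original scale the whiskers reach out by a factor based on the ratio of the quartiles rather than by a fixed distance based on their difference.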