Overlaying Histograms With Ggplot2 in R

Overlaying histograms with ggplot2 in R

Your current code:

ggplot(histogram, aes(f0, fill = utt)) + geom_histogram(alpha = 0.2)

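tells ggplot to construct one histogram from all the values in f0 and then color the bars of that single histogram according to the variable utt. With the default position = "stack", the groups end up stacked on top of each other rather than overlaid.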

What you want instead is to create three separate histograms, with alpha blending so that they are visible through each other. So you probably want to use three separate calls to geom_histogram, where each one gets its own data frame and fill:

ggplot(histogram, aes(f0)) + 
geom_histogram(data = lowf0, fill = "red", alpha = 0.2) +
geom_histogram(data = mediumf0, fill = "blue", alpha = 0.2) +
geom_histogram(data = highf0, fill = "green", alpha = 0.2)

Here's a concrete example with some output:

dat <- data.frame(xx = c(runif(100,20,50),runif(100,40,80),runif(100,0,30)),yy = rep(letters[1:3],each = 100))

ggplot(dat,aes(x=xx)) +
geom_histogram(data=subset(dat,yy == 'a'),fill = "red", alpha = 0.2) +
geom_histogram(data=subset(dat,yy == 'b'),fill = "blue", alpha = 0.2) +
geom_histogram(data=subset(dat,yy == 'c'),fill = "green", alpha = 0.2)

which produces something like this:

Sample Image

Edited to fix typos; you wanted fill, not colour.
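
As an alternative to three separate layers, a single layer with position = "identity" should give a similar overlay. This is a minimal sketch and not part of the original answer: the seed, the binwidth of 5, and the name p_overlay are assumptions chosen for illustration.

```r
library(ggplot2)

set.seed(42)  # assumption: seed added only for reproducibility
dat <- data.frame(
  xx = c(runif(100, 20, 50), runif(100, 40, 80), runif(100, 0, 30)),
  yy = rep(letters[1:3], each = 100)
)

# One layer: position = "identity" draws every group's bars from zero
# instead of stacking them, and a shared binwidth keeps the bins aligned.
p_overlay <- ggplot(dat, aes(x = xx, fill = yy)) +
  geom_histogram(position = "identity", alpha = 0.2, binwidth = 5) +
  scale_fill_manual(values = c(a = "red", b = "blue", c = "green"))

p_overlay
```

Because fill is mapped inside aes(), this version also gets a legend, which the three-layer version does not.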

Overlaying two histograms with different rows using ggplot2

You can make a "long" data.frame and plot that with ggplot2:

set.seed(1)
library(ggplot2)
dist1 <- rnorm(1000, 35, 3)
dist2 <- rnorm(1200, 40, 5)

df <- data.frame(variable = c(rep("dist1", length(dist1)),
rep("dist2", length(dist2))),
value=c(dist1, dist2))
ggplot(df, aes(x=value, fill=variable))+
geom_histogram()
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Sample Image

You could also consider density plots, as they are easier to overlay:

ggplot(df, aes(x=value, fill=variable))+
geom_density(alpha=.5)

Sample Image
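
If the vectors are already in your workspace, base R's stack() can build the long data frame for you. A minimal sketch: the column names values and ind come from stack(), while position = "identity", alpha = 0.5, and binwidth = 1 are my own choices, so that the bars overlap instead of stacking (the default) and the bins message goes away.

```r
library(ggplot2)

set.seed(1)
dist1 <- rnorm(1000, 35, 3)
dist2 <- rnorm(1200, 40, 5)

# stack() turns a named list of vectors into two columns:
# `values` (the numbers) and `ind` (a factor holding the list names).
long <- stack(list(dist1 = dist1, dist2 = dist2))

# position = "identity" overlays the two groups instead of stacking them.
p_long <- ggplot(long, aes(x = values, fill = ind)) +
  geom_histogram(position = "identity", alpha = 0.5, binwidth = 1)

p_long
```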

Overlay KDE and filled histogram with ggplot2 (R)

The problem is that the histogram displays counts, whose bars integrate to the number of observations times the bin width, while the density plot shows, well, density, which integrates to 1. To make the two compatible you have to use the 'computed variables' of the stat parts of the layers, which are accessible with after_stat(). You can either scale the density so that it integrates to the same total as the histogram, or scale the histogram so that it integrates to 1.
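
To see the mismatch in numbers, here is a small base-R check. It is only a sketch: the break points are an assumption, chosen to give 0.2-wide bins that cover the range of Sepal.Length.

```r
x <- iris$Sepal.Length

# assumption: 0.2-wide bins from 4 to 8, which covers all 150 values
breaks <- seq(4, 8, by = 0.2)
h <- hist(x, breaks = breaks, plot = FALSE)
width <- diff(h$breaks)

# Area under the count bars: number of observations * bin width
area_counts <- sum(h$counts * width)    # approximately 150 * 0.2 = 30

# Area under the density bars: always 1
area_density <- sum(h$density * width)  # approximately 1
```

That factor of n * binwidth is exactly what separates the two y-axes.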

Scaling the histogram to the density:

library(ggplot2)
ggplot(iris, aes(Sepal.Length, fill = Species)) +
geom_histogram(aes(y = after_stat(density)),
position = 'identity') +
geom_density(bw = 0.1, alpha = 0.3)
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Sample Image

Scaling the density to the counts. To do this properly, multiply the density layer's count computed variable by the binwidth parameter of the histogram:

ggplot(iris, aes(Sepal.Length, fill = Species)) +
geom_histogram(binwidth = 0.2, position = 'identity') +
geom_density(aes(y = after_stat(count * 0.2)),
bw = 0.1, alpha = 0.3)

Sample Image

Created on 2021-06-22 by the reprex package (v1.0.0)

As a side note, the default position argument for the histogram stacks bars on top of one another. Setting position = "identity" prevents this. Alternatively, you could set position = "stack" in the density layer.

EDIT: Sorry, I seem to have glossed over the 'I want 1 KDE for the entire Sepal.Length' part of the question. You'd have to manually set the group, like so:

ggplot(iris, aes(Sepal.Length, fill = Species)) +
geom_histogram(binwidth = 0.2) +
geom_density(bw = 0.1, alpha = 0.3,
aes(group = 1, y = after_stat(count * 0.2)))

Overlaid histograms in R (ggplot2 preferred)

I believe this is what you are looking for:

Overlaid histograms

Note that I changed your treatment indicator variable to be TRUE/FALSE rather than 0/1, since it needs to be discrete (a logical or a factor) for ggplot to split on it. The alpha levels are set with scale_alpha_manual(), because scale_alpha() is meant for continuous variables and rejects a logical in current ggplot2 versions.

library('ggplot2')
my.data <- data.frame(treat = rep(c(FALSE, TRUE), 100), prop_score = runif(2 * 100))
ggplot(my.data) +
geom_histogram(binwidth = 0.05
, aes( x = prop_score
, alpha = treat
, linetype = treat)
, colour="black"
, fill="white"
, position="stack") +
# FALSE is drawn opaque, TRUE almost transparent
scale_alpha_manual(values = c("FALSE" = 1, "TRUE" = 0.1))

Overlaying histogram with different y-scales

Consider the following situation where you have 800 versus 200 observations:

library(ggplot2)

df <- data.frame(
x = rnorm(1000, rep(c(1, 2), c(800, 200))),
class = rep(c("A", "B"), c(800, 200))
)

ggplot(df, aes(x, fill = class)) +
geom_histogram(bins = 20, position = "identity", alpha = 0.5,
# Note that y = stat(count) is the default behaviour
mapping = aes(y = stat(count)))

Sample Image

You could scale the counts for each group to a maximum of 1 by using y = stat(ncount):

ggplot(df, aes(x, fill = class)) +
geom_histogram(bins = 20, position = "identity", alpha = 0.5,
mapping = aes(y = stat(ncount)))

Sample Image

Alternatively, you can set y = stat(density) to have the total area integrate to 1.

ggplot(df, aes(x, fill = class)) +
geom_histogram(bins = 20, position = "identity", alpha = 0.5,
mapping = aes(y = stat(density)))

Sample Image

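Note that since ggplot2 3.3.0, after_stat() is the replacement for stat(), and stat() has been deprecated since 3.4.0. In current versions, write after_stat(count), after_stat(ncount), and after_stat(density) instead.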
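
For reference, here is how the ncount variant might look with after_stat(), together with a quick check of what the computed variables contain. This is a sketch that assumes ggplot2 3.3.0 or later; the seed and the names p_ncount, ld, max_ncount, and area_density are mine.

```r
library(ggplot2)

set.seed(1)  # assumption: seed added only for reproducibility
df <- data.frame(
  x = rnorm(1000, rep(c(1, 2), c(800, 200))),
  class = rep(c("A", "B"), c(800, 200))
)

p_ncount <- ggplot(df, aes(x, fill = class)) +
  geom_histogram(aes(y = after_stat(ncount)),
                 bins = 20, position = "identity", alpha = 0.5)

# layer_data() returns the values the stat computed for each bar
ld <- layer_data(p_ncount)

# ncount: the tallest bar in each group is exactly 1
max_ncount <- tapply(ld$ncount, ld$group, max)

# density: bar areas (height * width) sum to 1 within each group
area_density <- tapply(ld$density * (ld$xmax - ld$xmin), ld$group, sum)
```

So ncount equalises the peak heights of the two groups, while density equalises their areas.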

General rule of overlaying density plot using ggplot2

You need to multiply ..count.. in the density call (after_stat(count) in ggplot2 3.3.0 and later) by whatever binwidth you use in the histogram call.

You can do it as follows:

set.seed(100)
a = data.frame(z = rnorm(10000))
binwidthVal=0.1
ggplot(a, aes(x=z)) +
geom_histogram(binwidth = binwidthVal) +
geom_density(colour='red', aes(y=binwidthVal * ..count..))

Sample Image
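
The same plot can be sketched with the newer after_stat() notation, plus a rough numerical check that the two layers cover about the same area. This assumes ggplot2 3.3.0 or later; the names p_scaled, bars, dens, area_bars, and area_curve are mine, not from the original answer.

```r
library(ggplot2)

set.seed(100)
a <- data.frame(z = rnorm(10000))
binwidthVal <- 0.1

p_scaled <- ggplot(a, aes(x = z)) +
  geom_histogram(binwidth = binwidthVal) +
  geom_density(aes(y = after_stat(count * binwidthVal)), colour = "red")

# Pull out what each layer actually draws
bars <- layer_data(p_scaled, 1)  # histogram bars
dens <- layer_data(p_scaled, 2)  # scaled density curve

# Area under the bars: nrow(a) * binwidthVal
area_bars <- sum(bars$count * (bars$xmax - bars$xmin))

# Area under the curve, by the trapezoidal rule over the evaluated grid
area_curve <- sum(diff(dens$x) * (head(dens$y, -1) + tail(dens$y, -1)) / 2)
```

Both areas should come out close to 10000 * 0.1 = 1000. The curve's area is slightly smaller because the density is only evaluated over the range of the data.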

Credit to Brian Diggs for the idea.

EDIT: Seems like there is already a perfectly good answer here


