Add Density Lines to Histogram and Cumulative Histogram

Add density lines to histogram and cumulative histogram

Offered without explanation:

## Make some sample data
x <- sample(0:30, 200, replace=T, prob=15 - abs(15 - 0:30))

## Calculate and plot the two histograms
hcum <- h <- hist(x, plot=FALSE)
hcum$counts <- cumsum(hcum$counts)
plot(hcum, main="")
plot(h, add=T, col="grey")

## Plot the density and cumulative density
d <- density(x)
lines(x = d$x, y = d$y * length(x) * diff(h$breaks)[1], lwd = 2)
lines(x = d$x, y = cumsum(d$y)/max(cumsum(d$y)) * length(x), lwd = 2)

Sample Image

Overlay histogram with density curve

Here you go!

# create some data to work with
x = rnorm(1000);

# overlay histogram, empirical density and normal density
p0 = qplot(x, geom = 'blank') +
geom_line(aes(y = ..density.., colour = 'Empirical'), stat = 'density') +
stat_function(fun = dnorm, aes(colour = 'Normal')) +
geom_histogram(aes(y = ..density..), alpha = 0.4) +
scale_colour_manual(name = 'Density', values = c('red', 'blue')) +
theme(legend.position = c(0.85, 0.85))

print(p0)

How to plot a density estimate on top of the histogram?

You need to call hist and density separately. Something like this:

hist(x, prob=TRUE)
lines(density(x))

Overlay normal curve to histogram in R

Here's a nice easy way I found:

h <- hist(g, breaks = 10, density = 10,
col = "lightgray", xlab = "Accuracy", main = "Overall")
xfit <- seq(min(g), max(g), length = 40)
yfit <- dnorm(xfit, mean = mean(g), sd = sd(g))
yfit <- yfit * diff(h$mids[1:2]) * length(g)

lines(xfit, yfit, col = "black", lwd = 2)

How can one add a cumulative trend line based on weight to a histogram in R?

geom_histogram does not have a weight aesthetic so I do not understand how do you want to do with tonne.km. But if you want to superimpose the CDF to the histogram, here is a way.

First realize that a density such as the empirical histogram density and a ECDF are many times on different scales, specially if the distribution is continuous and the sample is large. Then, the main trick is to scale the ECDF by the maximum density y value.

library(ggplot2)
library(scales)

distance <- rnorm(1000000, mean = 1000, sd = 500)
tonne.km <- rnorm(1000000, mean = 25000, sd = 500)
dist.tk.test <- data.frame(distance, tonne.km)

bins <- 50L
x_breaks <- 10L

max_y <- max(density(dist.tk.test$distance)$y)

ggplot(dist.tk.test) +
geom_histogram(
aes(x = distance, y = ..density..), bins = bins
) +
geom_line(
aes(
x = sort(distance),
y = max_y * seq_along(distance)/length(distance)
),
color = "red"
) +
scale_x_continuous(label = comma,
breaks = extended_breaks(x_breaks)) +
scale_y_continuous(
name = "Density",
sec.axis = sec_axis(~ .x / max_y ,
labels = scales::percent,
name = "Cumulative Share (%)")
)

Created on 2022-08-17 by the reprex package (v2.0.1)



Edit

Following the comment below, here is another solution.

The total tonne.km by bins of distance is computed first.

In order to do this, the distances must be binned. I use findInterval to bin them and then sum the tonne.km per bin (variable breaks) with aggregate. This is the data.frame used in the plot.

library(ggplot2)
library(scales)

set.seed(2022)
distance <- rnorm(1000000, mean = 1000, sd = 500)
tonne.km <- rnorm(1000000, mean = 25000, sd = 500)
dist.tk.test <- data.frame(distance, tonne.km)

breaks <- range(dist.tk.test$distance)
breaks <- round(breaks/100)*100
breaks <- seq(breaks[1], breaks[2], by = 50)
bins <- findInterval(dist.tk.test$distance, breaks)
breaks <- breaks[bins]

new_df <- aggregate(tonne.km ~ breaks, dist.tk.test, sum, na.rm = TRUE)
y_max <- max(new_df$tonne.km, na.rm = TRUE)

x_axis_breaks <- 10L

ggplot(new_df, aes(breaks, tonne.km)) +
geom_col(position = position_dodge(), width = 100) +
geom_line(
aes(
y = y_max * cumsum(tonne.km)/sum(tonne.km)
),
color = "red"
) +
scale_x_continuous(
name = "Distance",
label = comma,
breaks = extended_breaks(x_axis_breaks)) +
scale_y_continuous(
name = "Tonne/Km",
sec.axis = sec_axis(~ .x/y_max,
labels = scales::percent,
name = "Cumulative Share (%)")
)
#> Warning: position_dodge requires non-overlapping x intervals

Created on 2022-08-17 by the reprex package (v2.0.1)



Related Topics



Leave a reply



Submit