R - Emulate the Default Behavior of hist() with ggplot2 for Bin Width

R - emulate the default behavior of hist() with ggplot2 for bin width

Without sample data, it's always difficult to get reproducible results, so I've created a sample dataset:

set.seed(16)
mydata <- data.frame(myvariable=rnorm(500, 1500000, 10000))

#base histogram
hist(mydata$myvariable)

As you've learned, hist() is a generic function. To see the available methods, type methods(hist). Most of the time you'll be running hist.default. If we borrow the break-finding logic from that function, we come up with

brx <- pretty(range(mydata$myvariable), 
n = nclass.Sturges(mydata$myvariable),min.n = 1)

which is how hist() by default calculates the breaks. We can then use these breaks with the ggplot command

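library(ggplot2)

# Pass the hist()-style breaks explicitly instead of relying on ggplot2's default binning
ggplot(mydata, aes(x=myvariable)) +
  geom_histogram(color="darkgray", fill="white", breaks=brx) +
  scale_x_continuous("My variable") +
  theme(axis.text=element_text(size=14), axis.title=element_text(size=16, face="bold"))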

The plot below shows the two results side by side; as you can see, they are quite similar.

Sample Image
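
As a quick sanity check, the sketch below compares the breaks computed with pretty() against the ones hist() actually returns for the same sample data. The names h and breaks_match are introduced here only for illustration; plot = FALSE just suppresses the drawing.

```r
# Recreate the sample data used above
set.seed(16)
mydata <- data.frame(myvariable = rnorm(500, 1500000, 10000))

# Breaks computed by hand, borrowing the logic of hist.default
brx <- pretty(range(mydata$myvariable),
              n = nclass.Sturges(mydata$myvariable), min.n = 1)

# Breaks computed by hist() itself (plot = FALSE skips the drawing)
h <- hist(mydata$myvariable, plot = FALSE)

# TRUE if both approaches give the same breaks
breaks_match <- isTRUE(all.equal(brx, h$breaks))
print(breaks_match)
```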

Also, that empty bin was probably caused by your y-axis limits. If a shape extends beyond the limits you specify in scale_y_continuous, it is simply dropped from the plot. It looks like that bin wanted to be 14 tall, but you clipped y at 12.5.
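
If the goal is only to zoom in on the y axis without losing bars, one option is coord_cartesian(), which changes the visible region instead of the scale limits. This is a minimal sketch reusing the sample data from above; the limit of 20 and the object names p_dropped, p_zoomed and zoomed_counts are arbitrary choices for illustration.

```r
library(ggplot2)

set.seed(16)
mydata <- data.frame(myvariable = rnorm(500, 1500000, 10000))
brx <- pretty(range(mydata$myvariable),
              n = nclass.Sturges(mydata$myvariable), min.n = 1)

p <- ggplot(mydata, aes(x = myvariable)) +
  geom_histogram(color = "darkgray", fill = "white", breaks = brx)

# Limits set on the scale: bars taller than 20 are dropped (ggplot2 warns when drawing)
p_dropped <- p + scale_y_continuous(limits = c(0, 20))

# Limits set on the coordinate system: the view is zoomed, no bar is dropped
p_zoomed <- p + coord_cartesian(ylim = c(0, 20))

# Every observation is still counted in the zoomed plot
zoomed_counts <- ggplot_build(p_zoomed)$data[[1]]$count
print(sum(zoomed_counts))
```

Printing p_dropped and p_zoomed side by side should make the difference visible: the first loses its tallest bars, the second only cuts them off at the top of the panel.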

geom_histogram: What is the default origin of the first bin?

By default, the histogram is centered at 0: the first bar's x limits are at -0.5*binwidth and 0.5*binwidth. From there, the bars continue with width = binwidth in both directions until they cover the minimum and maximum of the data. If your data is all > 0, the bars start at the first edge of the form (k+0.5)*binwidth, for an integer k, whose bin contains data.

For your example (using a set.seed for reproducibility):

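library(ggplot2)

set.seed(1)
x <- rnorm(25)
binwidth <- (range(x)[2]-range(x)[1])/10
# ..density.. is the older spelling; recent ggplot2 versions prefer after_stat(density)
p <- ggplot(data.frame(x=x), aes(x = x)) +
  geom_histogram(aes(y = ..density..), binwidth = binwidth)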

We can get the breaks out by using:

x1 <- ggplot_build(p)$data

giving us our breaks:

x1[[1]]$x
[1] -2.4764874 -2.0954894 -1.7144913 -1.3334932 -0.9524952 -0.5714971 -0.1904990 0.1904990 0.5714971
[10] 0.9524952 1.3334932 1.7144913 2.0954894

So, to get the minimum, we need to round the lowest value of the data down to a value of the form (k + 0.5)*binwidth, for an integer k (NB I'm sure there is a better formula, but this works):

binwidth*(floor((min(x)-binwidth/2)/binwidth)+0.5)
-2.476487

similarly the maximum is:

binwidth*(ceiling((max(x)+binwidth/2)/binwidth)+0.5)
2.095489
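
The two formulas can be wrapped in a small helper to reproduce the whole sequence printed above. This is only a sketch: the function name bin_edge_sequence is made up for illustration, and it mirrors the formulas in this answer rather than ggplot2's internal code, so other ggplot2 versions may report different values.

```r
# Hypothetical helper: rebuilds the x values from the two formulas above
bin_edge_sequence <- function(x, binwidth) {
  lower <- binwidth * (floor((min(x) - binwidth / 2) / binwidth) + 0.5)
  upper <- binwidth * (ceiling((max(x) + binwidth / 2) / binwidth) + 0.5)
  # Number of steps of size binwidth between lower and upper
  n_steps <- round((upper - lower) / binwidth)
  lower + binwidth * (0:n_steps)
}

set.seed(1)
x <- rnorm(25)
binwidth <- (range(x)[2] - range(x)[1]) / 10

edges <- bin_edge_sequence(x, binwidth)
print(edges)
```

With the data above this gives 13 values running from -2.476487 to 2.095489, each a half-integer multiple of binwidth.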

Is there a way to create a histogram in R using ggplot so that only the vertical lines of the bins that are protruding show?

Maybe using hist to generate the values then plotting in ggplot:

library(ggplot2)
set.seed(1)
x = hist(rchisq(1000, df = 4), 100)
df = data.frame(
x = rep(x$breaks, each=2),
y = c(0, rep(x$counts, each = 2), 0))

ggplot(df, aes(x,y)) +
geom_polygon(fill='grey80') +
geom_line(col='red')

Sample Image
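
To see why the rep() calls trace the outline, here is the same construction on a tiny input that can be checked by hand. The vector toy and the breaks 0:3 are invented for illustration.

```r
# Three bins with counts 1, 2 and 3
toy <- c(1, 2, 2, 3, 3, 3)
h <- hist(toy, breaks = 0:3, plot = FALSE)

# Each break appears twice (the two ends of a vertical segment),
# each count appears twice (left and right end of a bar top),
# and the outline starts and ends on the baseline (0).
outline <- data.frame(
  x = rep(h$breaks, each = 2),
  y = c(0, rep(h$counts, each = 2), 0))
print(outline)
```

Reading the rows in order gives the path (0,0), (0,1), (1,1), (1,2), (2,2), (2,3), (3,3), (3,0), which is the silhouette of the histogram with no lines between bars of equal height.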

Adjusting the x-Axis and Bins when Making a Histogram with ggplot2

binwidth sets the width of each bin, while bins sets the number of bins and lets ggplot work out the width.

Depending on how much control you want over your age buckets this may do the job:

ggplot(Df, aes(Age)) + geom_histogram(binwidth = 5)

Edit: for closer control of the axis breaks (the tick positions), experiment with:

+ scale_x_continuous(breaks = seq(0, 100, 5))

To label the actual spans, not the middle of the bar, which is what you need for something like an age histogram, use something like this:

ggplot(Df, aes(Age)) +
  geom_histogram(
    breaks = seq(10, 90, by = 10),
    aes(fill = ..count..),
    # a constant colour belongs outside aes(); inside it would be mapped as data
    colour = "black") +
  scale_x_continuous(breaks = seq(10, 90, by=10))
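
When you use binwidth rather than explicit breaks, ggplot2 by default centers the bins on multiples of the width, so with binwidth = 5 the edges fall at ..., 17.5, 22.5, ... A possible way to line them up with round ages is the boundary argument of geom_histogram(). The sketch below uses a made-up Df with an Age column, since the original data is not shown.

```r
library(ggplot2)

# Hypothetical stand-in for the asker's data
set.seed(42)
Df <- data.frame(Age = sample(18:80, 200, replace = TRUE))

# boundary = 0 asks for bin edges at multiples of the bin width (..., 15, 20, 25, ...)
p <- ggplot(Df, aes(Age)) +
  geom_histogram(binwidth = 5, boundary = 0, colour = "black") +
  scale_x_continuous(breaks = seq(0, 100, 5))

# Left edges of the bins as computed by ggplot2
bin_left <- ggplot_build(p)$data[[1]]$xmin
print(bin_left)
```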

Plot histogram using ggplots to be similar as hist() in R base

You're not getting value ranges, because you converted d to a factor. Leave it as numeric, and you'll get bars that span ranges. Also, I've converted your data to a data frame, because ggplot requires a data frame.

dat = data.frame(d=d)

ggplot(dat, aes(x=d)) +
geom_histogram(breaks=seq(0,max(dat$d)+10,10),
fill="lightblue", colour="black")

Sample Image

Setting histogram breaks in ggplot2

If you run the code below, you will see how R decided to break up your data:

data(mtcars)
histinfo <- hist(mtcars$mpg)

From the histinfo you will get the necessary information concerning the breaks.

$breaks
[1] 10 15 20 25 30 35

$counts
[1] 6 12 8 2 4

$density
[1] 0.0375 0.0750 0.0500 0.0125 0.0250

$mids
[1] 12.5 17.5 22.5 27.5 32.5

$xname
[1] "mtcars$mpg"

$equidist
[1] TRUE

attr(,"class")
[1] "histogram"

Now you can tweak the code below to make your ggplot histogram look more like the base one. You will have to change the axis labels, scale and colours. theme_bw() will help you get some of the settings in order.

data(mtcars)
require(ggplot2)
qplot(mtcars$mpg,
geom="histogram",
binwidth = 5) +
theme_bw()

and change the binwidth value to whatever suits you.
histogram
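
Note that binwidth = 5 alone does not guarantee the same bin edges as hist(). A hedged alternative is to pass the breaks from histinfo straight to geom_histogram(); the sketch below then compares the counts. It assumes ggplot2's default right-closed bins, which is also what hist() uses by default, and the names p and gg_counts are chosen for illustration.

```r
library(ggplot2)

# Breaks and counts as computed by base R (plot = FALSE skips the drawing)
histinfo <- hist(mtcars$mpg, plot = FALSE)

# Reuse the same breaks in ggplot2
p <- ggplot(mtcars, aes(x = mpg)) +
  geom_histogram(breaks = histinfo$breaks, fill = "white", colour = "black") +
  theme_bw()

# Counts per bin as computed by ggplot2
gg_counts <- ggplot_build(p)$data[[1]]$count
print(gg_counts)
```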

Why are R hist and ggplot histograms output so different?

By default, ggplot uses range/30 as the binwidth, as the message it prints tells you. In your case, that is approximately 48/30 (depending on the seed), i.e. about 1.6, which is more than 1.

Now, your data is not continuous: you only get integers. With a bin width of about 1.5, adjacent bins behave irregularly, because one bin contains only one possible integer, the next contains two, and so on. As a result, you'll see the count approximately doubled in every second bin.

Say, your data looks like

value: 1 2 3 4 5 6
count: 5 5 5 5 5 5

and if you start counting from 0.5, you'll get these bins:

(0.5, 2]  (2, 3.5]  (3.5, 5]  (5, 6.5]
10        5         10        5

which are exactly the spikes you see in the first of your plots.

As you have already found out, this won't be a problem if binwidth is strictly 1.
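
The toy example is easy to reproduce with cut() and table(), which makes the alternating pattern explicit. The names toy, wide and unit are just for illustration.

```r
# Integers 1..6, each observed 5 times
toy <- rep(1:6, each = 5)

# Bins of width 1.5 starting at 0.5: bins alternately hold two integers and one
wide <- table(cut(toy, breaks = seq(0.5, 6.5, by = 1.5)))
print(wide)

# Bins of width 1: every bin holds exactly one integer, so no spikes
unit <- table(cut(toy, breaks = seq(0.5, 6.5, by = 1)))
print(unit)
```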

Edit:

As pointed out by @James, use the following to reproduce ggplot's picture with base graphics:

hist(RB, breaks=seq(min(RB), max(RB), length.out=30))

It may look a bit different, but the spikes are there.


