R - What Algorithm Does Geom_Density() Use and How to Extract Points/Equation of Curves

R - What algorithm does geom_density() use and how to extract points/equation of curves?

Typing get("compute_group", ggplot2::StatDensity) (or, formerly, get("calculate", ggplot2:::StatDensity)) will get you the algorithm used to calculate the density. (At root, it's a call to density() with kernel="gaussian" the default.)

The points used in the plot are invisibly returned by print.ggplot(), so you can access them like this:

library(ggplot2)
m <- ggplot(movies, aes(x = rating))
m <- m + geom_density()
p <- print(m)
head(p$data[[1]], 3)
# y x density scaled count PANEL group ymin ymax
# 1 0.0073761 1.0000 0.0073761 0.025917 433.63 1 1 0 0.0073761
# 2 0.0076527 1.0176 0.0076527 0.026888 449.88 1 1 0 0.0076527
# 3 0.0078726 1.0352 0.0078726 0.027661 462.81 1 1 0 0.0078726

## Just to show that those are the points you are after,
## extract and use them to create a lattice xyplot
library(gridExtra)
library(lattice)
mm <- xyplot(y ~x, data=p$data[[1]], type="l")

Sample Image

R - What algorithm does geom_density() use and how to extract points/equation of curves?

Typing get("compute_group", ggplot2::StatDensity) (or, formerly, get("calculate", ggplot2:::StatDensity)) will get you the algorithm used to calculate the density. (At root, it's a call to density() with kernel="gaussian" the default.)

The points used in the plot are invisibly returned by print.ggplot(), so you can access them like this:

library(ggplot2)
m <- ggplot(movies, aes(x = rating))
m <- m + geom_density()
p <- print(m)
head(p$data[[1]], 3)
# y x density scaled count PANEL group ymin ymax
# 1 0.0073761 1.0000 0.0073761 0.025917 433.63 1 1 0 0.0073761
# 2 0.0076527 1.0176 0.0076527 0.026888 449.88 1 1 0 0.0076527
# 3 0.0078726 1.0352 0.0078726 0.027661 462.81 1 1 0 0.0078726

## Just to show that those are the points you are after,
## extract and use them to create a lattice xyplot
library(gridExtra)
library(lattice)
mm <- xyplot(y ~x, data=p$data[[1]], type="l")

Sample Image

How to extract the density value from ggplot in r

Save the plot in a variable, build the data structure with ggplot_build and split the data by panel. Then interpolate with approx to get the new values.

g <- ggplot(df, aes(x = weight)) +
geom_density() +
facet_grid(fruits ~ ., scales = "free", space = "free")

p <- ggplot_build(g)

# These are the columns of interest
p$data[[1]]$x
p$data[[1]]$density
p$data[[1]]$PANEL

Split the list member p$data[[1]] by panel but keep only the x and density values. Then loop through the split data to interpolate by group of fruit.

sp <- split(p$data[[1]][c("x", "density")], p$data[[1]]$PANEL)

new_weight <- 71
sapply(sp, \(DF){
with(DF, approx(x, density, xout = new_weight))
})
# 1 2 3 4
#x 71 71 71 71
#y 0.04066888 0.05716947 0.001319164 0.07467761

Or, without splitting the data previously, use by.

b <- by(p$data[[1]][c("x", "density")], p$data[[1]]$PANEL, \(DF){
with(DF, approx(x, density, xout = new_weight))
})
do.call(rbind, lapply(b, as.data.frame))
# x y
#1 71 0.040668880
#2 71 0.057169474
#3 71 0.001319164
#4 71 0.074677607

How calculate probabillity of density plot?

Your data is not included in the question, so let's make up a small random sample:

library(ggplot2)

set.seed(69)

df <- data.frame(x = rnorm(10))

Now we can create a density plot as per your example:

p <- ggplot(df, aes(x)) + 
geom_density() +
xlim(c(-5, 5))

p

Sample Image

Now, we can actually find the x and y coordinates of this line using the base R function density and extracting its x and y components into a data frame:

dens <- density(df$x)
d <- data.frame(x = dens$x, y = dens$y)

head(d)
#> x y
#> 1 -3.157056 0.0009453767
#> 2 -3.144949 0.0010145927
#> 3 -3.132841 0.0010870523
#> 4 -3.120733 0.0011665920
#> 5 -3.108625 0.0012488375
#> 6 -3.096517 0.0013382316

We can see plotting this as a red dashed geom_line it is the same as geom_density:

p + geom_line(data = d, aes(x, y), col = "red", linetype = 2, size = 2) 

Sample Image

Now suppose we want to know the probability of having a value of more than one. We can show the area we are interested in like this:

p + geom_area(data = d[d$x >= 1,], aes(x, y), fill = "red")

Sample Image

Since the x values are all equally spaced in our data frame d, then the red area's proportion of the area under the line is a simple ratio of the sum of all y values at x values greater than one to the grand sum of y:

sum(d$y[d$x > 1])/sum(d$y)
#> [1] 0.1599931

So the probability of getting an x value of > 1 is 0.15999, or 16%

Created on 2020-08-17 by the reprex package (v0.3.0)

Showing major peaks in densities across facets using R

I'm assuming you want to identify the single largest peak in each facet – this would be the mode of the distribution. If your distribution is multimodal, my answer will only identify the largest peak. This answer to another question explains that geom_density() uses the density() function w/ default arguments.

That being said, the following code should work for you:

library(ggplot2)
library(grid)
library(plyr)

data <- data.frame("frame"=c(rep("A",9), rep("B", 13), rep("C", 7)), "val"=c(1,rep(2,4),4,5,6,rep(1,6),2,rep(3,7),1,rep(4,6)))
attach(data)

densMode <- function(x){
td <- density(x)
maxDens <- which.max(td$y)
list(x=td$x[maxDens], y=td$y[maxDens])
}
xdat <- ddply(data,"frame", transform, val_mean = signif(densMode(val)$x,3), med.x = signif(densMode(val)$x,3), med.y=signif(densMode(val)$y,3))

hp <- ggplot(data=data, aes(x=val)) +
geom_density() +
geom_vline(aes(xintercept=val_mean),xdat, color="red",linetype="dashed",size=1) +
theme_bw()

hp<- hp +
facet_wrap (~ frame, ncol=2, scales="free_y") +
geom_text(data = xdat, aes(x=med.x,y=med.y,label=val_mean))

hp

The only lines I changed were those that determined how the graph was created (I didn't use png()), inserting the densMode() function, and using densMode() in the definition of xdat. I also created a data.frame based on your example data (which I've submitted as an edit to your question, for the convenience of others who may want to answer).

The code produces the following figure:
Sample Image

How to put the legends in the peaks of multiple distributions using ggplot?

How's this?

library(ggplot2)

set.seed(1)
a <- rnorm(100, mean = 20, sd = 1)
b <- rnorm(100, mean = 22, sd = 1)
aleg <- rep("A",length(a))
bleg <- rep("B",length(b))

data <- data.frame(value=c(a,b), variable=c(aleg,bleg))

labels <-
lapply(split(data, data$variable), function(x){
dens <- density(x$value) # compute density of each variable

data.frame(y = max(dens$y), # get maximum density of each variable
x = dens$x[which(dens$y == max(dens$y))], # get corresponding x value
label = x$variable[1])
})

ggplot(data,aes(x=value, fill=variable)) +
geom_density(alpha=0.25) +
geom_text(data = do.call(rbind, labels), aes(x = x, y = y, label = label),
inherit.aes = F,
nudge_y = 0.03) +
labs(fill="")

Sample Image

How does ggplot2 density differ from the density function?

In this case, it is not the density calculation that is different but how
the log10 transform is applied.

First check the densities are similar without transform

library(ggplot2)
library(fueleconomy)

d <- density(vehicles$cty, from=min(vehicles$cty), to=max(vehicles$cty))
ggplot(data.frame(x=d$x, y=d$y), aes(x=x, y=y)) + geom_line()
ggplot(vehicles, aes(x=cty)) + stat_density(geom="line")

So the issue seems to be the transform. In the stat_density below, it seems as
if the log10 transform is applied to the x variable before the density calculation.
So to reproduce the results manually you have to transform the variable prior to the
calculating the density. Eg

d2 <- density(log10(vehicles$cty), from=min(log10(vehicles$cty)), 
to=max(log10(vehicles$cty)))
ggplot(data.frame(x=d2$x, y=d2$y), aes(x=x, y=y)) + geom_line()
ggplot(vehicles, aes(x=cty)) + stat_density(geom="line") + scale_x_log10()

PS: To see how ggplot prepares the data for the density, you can look at the code as.list(StatDensity) leads to StatDensity$compute_group to ggplot2:::compute_density

calculate area of overlapping density plot by ggplot using R

I was looking for a way to do this for empirical data, and had the problem of multiple intersections as mentioned by user5878028. After some digging I found a very simple solution, even for a total R noob like me:

Install and load the libraries "overlapping" (which performs the calculation) and "lattice" (which displays the result):

library(overlapping)
library(lattice)

Then define a variable "x" as a list that contains the two density distributions you want to compare. For this example, the two datasets "data1" and "data2" are both columns in a text file called "yourfile":

x <- list(X1=yourfile$data1, X2=yourfile$data2)

Then just tell it to display the output as a plot which will also display the estimated % overlap:

out <- overlap(x, plot=TRUE)

I hope this helps someone like it helped me! Here's an example overlap plot

overlapping plot



Related Topics



Leave a reply



Submit