R - What algorithm does geom_density() use and how to extract points/equation of curves?
Typing get("compute_group", ggplot2::StatDensity)
(or, formerly, get("calculate", ggplot2:::StatDensity)
) will get you the algorithm used to calculate the density. (At root, it's a call to density()
with kernel="gaussian"
the default.)
The points used in the plot are invisibly returned by print.ggplot()
, so you can access them like this:
library(ggplot2)
m <- ggplot(movies, aes(x = rating))
m <- m + geom_density()
p <- print(m)
head(p$data[[1]], 3)
# y x density scaled count PANEL group ymin ymax
# 1 0.0073761 1.0000 0.0073761 0.025917 433.63 1 1 0 0.0073761
# 2 0.0076527 1.0176 0.0076527 0.026888 449.88 1 1 0 0.0076527
# 3 0.0078726 1.0352 0.0078726 0.027661 462.81 1 1 0 0.0078726
## Just to show that those are the points you are after,
## extract and use them to create a lattice xyplot
library(gridExtra)
library(lattice)
mm <- xyplot(y ~x, data=p$data[[1]], type="l")
R - What algorithm does geom_density() use and how to extract points/equation of curves?
Typing get("compute_group", ggplot2::StatDensity)
(or, formerly, get("calculate", ggplot2:::StatDensity)
) will get you the algorithm used to calculate the density. (At root, it's a call to density()
with kernel="gaussian"
the default.)
The points used in the plot are invisibly returned by print.ggplot()
, so you can access them like this:
library(ggplot2)
m <- ggplot(movies, aes(x = rating))
m <- m + geom_density()
p <- print(m)
head(p$data[[1]], 3)
# y x density scaled count PANEL group ymin ymax
# 1 0.0073761 1.0000 0.0073761 0.025917 433.63 1 1 0 0.0073761
# 2 0.0076527 1.0176 0.0076527 0.026888 449.88 1 1 0 0.0076527
# 3 0.0078726 1.0352 0.0078726 0.027661 462.81 1 1 0 0.0078726
## Just to show that those are the points you are after,
## extract and use them to create a lattice xyplot
library(gridExtra)
library(lattice)
mm <- xyplot(y ~x, data=p$data[[1]], type="l")
How to extract the density value from ggplot in r
Save the plot in a variable, build the data structure with ggplot_build
and split the data by panel. Then interpolate with approx
to get the new values.
g <- ggplot(df, aes(x = weight)) +
geom_density() +
facet_grid(fruits ~ ., scales = "free", space = "free")
p <- ggplot_build(g)
# These are the columns of interest
p$data[[1]]$x
p$data[[1]]$density
p$data[[1]]$PANEL
Split the list member p$data[[1]]
by panel but keep only the x
and density
values. Then loop through the split data to interpolate by group of fruit.
sp <- split(p$data[[1]][c("x", "density")], p$data[[1]]$PANEL)
new_weight <- 71
sapply(sp, \(DF){
with(DF, approx(x, density, xout = new_weight))
})
# 1 2 3 4
#x 71 71 71 71
#y 0.04066888 0.05716947 0.001319164 0.07467761
Or, without splitting the data previously, use by
.
b <- by(p$data[[1]][c("x", "density")], p$data[[1]]$PANEL, \(DF){
with(DF, approx(x, density, xout = new_weight))
})
do.call(rbind, lapply(b, as.data.frame))
# x y
#1 71 0.040668880
#2 71 0.057169474
#3 71 0.001319164
#4 71 0.074677607
How calculate probabillity of density plot?
Your data is not included in the question, so let's make up a small random sample:
library(ggplot2)
set.seed(69)
df <- data.frame(x = rnorm(10))
Now we can create a density plot as per your example:
p <- ggplot(df, aes(x)) +
geom_density() +
xlim(c(-5, 5))
p
Now, we can actually find the x and y coordinates of this line using the base R function density
and extracting its x and y components into a data frame:
dens <- density(df$x)
d <- data.frame(x = dens$x, y = dens$y)
head(d)
#> x y
#> 1 -3.157056 0.0009453767
#> 2 -3.144949 0.0010145927
#> 3 -3.132841 0.0010870523
#> 4 -3.120733 0.0011665920
#> 5 -3.108625 0.0012488375
#> 6 -3.096517 0.0013382316
We can see plotting this as a red dashed geom_line
it is the same as geom_density
:
p + geom_line(data = d, aes(x, y), col = "red", linetype = 2, size = 2)
Now suppose we want to know the probability of having a value of more than one. We can show the area we are interested in like this:
p + geom_area(data = d[d$x >= 1,], aes(x, y), fill = "red")
Since the x values are all equally spaced in our data frame d
, then the red area's proportion of the area under the line is a simple ratio of the sum of all y values at x values greater than one to the grand sum of y:
sum(d$y[d$x > 1])/sum(d$y)
#> [1] 0.1599931
So the probability of getting an x value of > 1 is 0.15999, or 16%
Created on 2020-08-17 by the reprex package (v0.3.0)
Showing major peaks in densities across facets using R
I'm assuming you want to identify the single largest peak in each facet – this would be the mode of the distribution. If your distribution is multimodal, my answer will only identify the largest peak. This answer to another question explains that geom_density()
uses the density()
function w/ default arguments.
That being said, the following code should work for you:
library(ggplot2)
library(grid)
library(plyr)
data <- data.frame("frame"=c(rep("A",9), rep("B", 13), rep("C", 7)), "val"=c(1,rep(2,4),4,5,6,rep(1,6),2,rep(3,7),1,rep(4,6)))
attach(data)
densMode <- function(x){
td <- density(x)
maxDens <- which.max(td$y)
list(x=td$x[maxDens], y=td$y[maxDens])
}
xdat <- ddply(data,"frame", transform, val_mean = signif(densMode(val)$x,3), med.x = signif(densMode(val)$x,3), med.y=signif(densMode(val)$y,3))
hp <- ggplot(data=data, aes(x=val)) +
geom_density() +
geom_vline(aes(xintercept=val_mean),xdat, color="red",linetype="dashed",size=1) +
theme_bw()
hp<- hp +
facet_wrap (~ frame, ncol=2, scales="free_y") +
geom_text(data = xdat, aes(x=med.x,y=med.y,label=val_mean))
hp
The only lines I changed were those that determined how the graph was created (I didn't use png()
), inserting the densMode()
function, and using densMode()
in the definition of xdat
. I also created a data.frame based on your example data (which I've submitted as an edit to your question, for the convenience of others who may want to answer).
The code produces the following figure:
How to put the legends in the peaks of multiple distributions using ggplot?
How's this?
library(ggplot2)
set.seed(1)
a <- rnorm(100, mean = 20, sd = 1)
b <- rnorm(100, mean = 22, sd = 1)
aleg <- rep("A",length(a))
bleg <- rep("B",length(b))
data <- data.frame(value=c(a,b), variable=c(aleg,bleg))
labels <-
lapply(split(data, data$variable), function(x){
dens <- density(x$value) # compute density of each variable
data.frame(y = max(dens$y), # get maximum density of each variable
x = dens$x[which(dens$y == max(dens$y))], # get corresponding x value
label = x$variable[1])
})
ggplot(data,aes(x=value, fill=variable)) +
geom_density(alpha=0.25) +
geom_text(data = do.call(rbind, labels), aes(x = x, y = y, label = label),
inherit.aes = F,
nudge_y = 0.03) +
labs(fill="")
How does ggplot2 density differ from the density function?
In this case, it is not the density calculation that is different but how
the log10 transform is applied.
First check the densities are similar without transform
library(ggplot2)
library(fueleconomy)
d <- density(vehicles$cty, from=min(vehicles$cty), to=max(vehicles$cty))
ggplot(data.frame(x=d$x, y=d$y), aes(x=x, y=y)) + geom_line()
ggplot(vehicles, aes(x=cty)) + stat_density(geom="line")
So the issue seems to be the transform. In the stat_density
below, it seems as
if the log10 transform is applied to the x variable before the density calculation.
So to reproduce the results manually you have to transform the variable prior to the
calculating the density. Eg
d2 <- density(log10(vehicles$cty), from=min(log10(vehicles$cty)),
to=max(log10(vehicles$cty)))
ggplot(data.frame(x=d2$x, y=d2$y), aes(x=x, y=y)) + geom_line()
ggplot(vehicles, aes(x=cty)) + stat_density(geom="line") + scale_x_log10()
PS: To see how ggplot
prepares the data for the density, you can look at the code as.list(StatDensity)
leads to StatDensity$compute_group
to ggplot2:::compute_density
calculate area of overlapping density plot by ggplot using R
I was looking for a way to do this for empirical data, and had the problem of multiple intersections as mentioned by user5878028. After some digging I found a very simple solution, even for a total R noob like me:
Install and load the libraries "overlapping" (which performs the calculation) and "lattice" (which displays the result):
library(overlapping)
library(lattice)
Then define a variable "x" as a list that contains the two density distributions you want to compare. For this example, the two datasets "data1" and "data2" are both columns in a text file called "yourfile":
x <- list(X1=yourfile$data1, X2=yourfile$data2)
Then just tell it to display the output as a plot which will also display the estimated % overlap:
out <- overlap(x, plot=TRUE)
I hope this helps someone like it helped me! Here's an example overlap plot
Related Topics
R Data.Table Apply Function to Rows Using Columns as Arguments
Filling in Missing (Blanks) in a Data Table, Per Category - Backwards and Forwards
Shiny Leaflet Ploygon Click Event
R Sum a Variable by Two Groups
Passing Data Within Shiny Modules from Module 1 to Module 2
Appending a List to a List of Lists in R
Interactive Directory Input in Shiny App (R)
Can You Specify Different Geoms for Different Facets in a Ggplot
R Tm Package Vcorpus: Error in Converting Corpus to Data Frame
How to Cross-Paste All Combinations of Two Vectors (Each-To-Each)
Calculate the Mean of One Column from Several CSV Files
R: Sample() Command Subject to a Constraint
Row-Wise Sort Then Concatenate Across Specific Columns of Data Frame
How to Scrape/Automatically Download PDF Files from a Document Search Web Interface in R
Plotting Multiple Time Series on the Same Plot Using Ggplot()