Easier Way to Plot the Cumulative Frequency Distribution in Ggplot

Easier way to plot the cumulative frequency distribution in ggplot?

There is a built-in ecdf() function in R that should make things easier. Here's some sample code using plyr:

library(plyr)
library(ggplot2)
data(iris)

## Ecdf over all species
iris.all <- summarize(iris, Sepal.Length = unique(Sepal.Length), 
                            ecdf = ecdf(Sepal.Length)(unique(Sepal.Length)))

ggplot(iris.all, aes(Sepal.Length, ecdf)) + geom_step()

#Ecdf within species
iris.species <- ddply(iris, .(Species), summarize,
                            Sepal.Length = unique(Sepal.Length),
                            ecdf = ecdf(Sepal.Length)(unique(Sepal.Length)))

ggplot(iris.species, aes(Sepal.Length, ecdf, color = Species)) + geom_step()

Edit: I just realized that you want the cumulative frequency. You can get that by multiplying the ecdf value by the total number of observations:

iris.all <- summarize(iris, Sepal.Length = unique(Sepal.Length), 
                            ecdf = ecdf(Sepal.Length)(unique(Sepal.Length)) * length(Sepal.Length))

iris.species <- ddply(iris, .(Species), summarize,
                            Sepal.Length = unique(Sepal.Length),
                            ecdf = ecdf(Sepal.Length)(unique(Sepal.Length))*length(Sepal.Length))
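
As a quick sanity check of the scaling step above, here is a minimal base-R sketch (no plyr or ggplot2 needed). It assumes the built-in iris data; the names vals, cum_freq and manual are illustrative only. It compares ecdf() times n against a direct count of observations at or below each value.

```r
data(iris)
x <- iris$Sepal.Length

# Distinct values, in increasing order
vals <- sort(unique(x))

# Cumulative frequency: ECDF value times the number of observations
cum_freq <- ecdf(x)(vals) * length(x)

# The same quantity counted directly: how many observations are <= each value
manual <- sapply(vals, function(v) sum(x <= v))

# Both approaches should agree (up to floating-point tolerance)
all.equal(cum_freq, as.numeric(manual))
```

The last cumulative frequency equals the number of rows, which is a handy check that the scaling was applied to the right group.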

Plotting cumulative counts in ggplot2

This does not directly solve the problem with the grouping of lines, but it is a workaround.

You can add three calls to stat_bin(), each one subsetting your data to a single level of A.

ggplot(x,aes(x=X,color=A)) +
  stat_bin(data=subset(x,A=="a"),aes(y=cumsum(..count..)),geom="step")+
  stat_bin(data=subset(x,A=="b"),aes(y=cumsum(..count..)),geom="step")+
  stat_bin(data=subset(x,A=="c"),aes(y=cumsum(..count..)),geom="step")

UPDATE - solution using geom_step()

Another possibility is to multiply the ..y.. values by the number of observations in each level. The only way I have found so far to get these counts is to precalculate them before plotting and add them to the original data frame. I named this column len. Then, inside aes() in geom_step(), map the variable with len=len and define the y values as y=..y.. * len.

set.seed(123)
x <- data.frame(A=replicate(200,sample(c("a","b","c"),1)),X=rnorm(200))
library(plyr)
library(ggplot2)
df <- ddply(x,.(A),transform,len=length(X))
ggplot(df,aes(x=X,color=A)) + geom_step(aes(len=len,y=..y.. * len),stat="ecdf")
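
If mapping an extra len aesthetic feels fragile, the cumulative counts can also be precomputed per group and plotted as ordinary y values. The following is only a sketch under that assumption, not the approach from the answer above: the cum_count column is a made-up name, and the data are simulated in the same shape as the example (columns A and X).

```r
# Simulated data shaped like the example above (column names A and X)
set.seed(123)
x <- data.frame(A = sample(c("a", "b", "c"), 200, replace = TRUE),
                X = rnorm(200))

# Sort by level of A and then by X, and number the rows 1..n within each level.
# After sorting, the row number is the count of observations <= X in that level.
x <- x[order(x$A, x$X), ]
x$cum_count <- ave(x$X, x$A, FUN = seq_along)

# Draw the step plot only if ggplot2 is installed
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(x, aes(x = X, y = cum_count, color = A)) + geom_step()
  print(p)
}
```

Because X is continuous here there are no ties; with tied values, each tied observation would get its own step rather than one combined jump.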

How to create a grouped cumulative frequency graph with ggplot2

I think you'd like to use stat_ecdf from ggplot2:

ggplot(df, aes(Con, color = Zone)) + stat_ecdf(geom = "point")
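
The df, Con and Zone names above come from the original question and are not defined here. A self-contained sketch of the same idea, assuming the built-in iris data instead, might look like this. The ecdf_by_species list is an illustrative name; it shows the per-group ECDFs that a grouped stat_ecdf() layer is expected to draw.

```r
data(iris)

# One ECDF function per species, built with base R
ecdf_by_species <- lapply(split(iris$Sepal.Length, iris$Species), ecdf)

# Proportion of setosa sepal lengths that are <= 5.0
ecdf_by_species$setosa(5.0)

# Grouped ECDF plot; drawn only if ggplot2 is installed
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(iris, aes(Sepal.Length, color = Species)) +
    stat_ecdf(geom = "step")
  print(p)
}
```

Using geom = "step" instead of geom = "point" gives the staircase look that is usual for empirical distribution functions.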

R Add Frequency Distribution Ticks to ggplot

You can also use the geom_segment function if you want to specify the height of the tick.

library(tidyverse)
library(mlbench)

data(BostonHousing)

ggplot(data = BostonHousing) +
  geom_line(aes(x = lstat, y = medv)) +
  geom_segment(aes(x = lstat, xend = lstat, yend = 3, y = 0))

Cumulative histogram with ggplot2

Building on Didzis's answer, here's a way to get the ggplot2 (author: hadley) data into a geom_line to reproduce the look of the base R hist.

Brief explanation: to get the bins to position in the same way as base R, I set binwidth=1 and boundary=0. To get a similar look, I used color=black and fill=white. And to get the same position of the line segments, I used ggplot_build. You will find other answers by Didzis that use this trick.

# make a dataframe for ggplot
set.seed(1)
x = runif(100, 0, 10)
y = cumsum(x)
df <- data.frame(x = sort(x), y = y)

# make geom_histogram 
p <- ggplot(data = df, aes(x = x)) + 
    geom_histogram(aes(y = cumsum(..count..)), binwidth = 1, boundary = 0,
                color = "black", fill = "white")

# extract ggplot data
d <- ggplot_build(p)$data[[1]]

# make a data.frame for geom_line and geom_point
# add (0,0) to mimic base-R plots
df2 <- data.frame(x = c(0, d$xmax), y = c(0, d$y))

# combine plots: note that geom_line and geom_point use the new data in df2
p + geom_line(data = df2, aes(x = x, y = y),
        color = "darkblue", size = 1) +
    geom_point(data = df2, aes(x = x, y = y),
        color = "darkred", size = 1) +
    ylab("Frequency") + 
    scale_x_continuous(breaks = seq(0, 10, 2))

# save for posterity
ggsave("ggplot-histogram-cumulative-2.png")

There may be easier ways, mind you! As it happens, the data extracted with ggplot_build also stores two other x values for each bin: the minimum and the maximum. So you can build other polygons with this convenience function:

# Make polygons: takes a plot object, returns a data.frame
get_hist <- function(p, pos = 2) {
    d <- ggplot_build(p)$data[[1]]
    if (pos == 1) { x = d$xmin; y = d$y; }
    if (pos == 2) { x = d$x; y = d$y; }
    if (pos == 3) { x = c(0, d$xmax); y = c(0, d$y); }
    data.frame(x = x, y = y)
}
df2 = get_hist(p, pos = 3)  # play around with pos=1, pos=2, pos=3
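
For comparison, the same cumulative bin counts can be computed without building a plot first. This is a base-R sketch, assuming the same simulated x and unit-width bins starting at 0; breaks, counts and cum_df are illustrative names. The result has the same shape as the pos = 3 data frame, though values that fall exactly on a bin edge could be assigned differently by hist() and geom_histogram().

```r
set.seed(1)
x <- runif(100, 0, 10)

# Unit-width bins with edges at 0, 1, ..., 10
breaks <- seq(0, 10, by = 1)

# Per-bin counts without drawing anything
counts <- hist(x, breaks = breaks, plot = FALSE)$counts

# Cumulative counts at each bin edge, starting from (0, 0)
cum_df <- data.frame(x = breaks, y = c(0, cumsum(counts)))
cum_df
```

This cum_df can be passed to geom_line() and geom_point() in the same way as df2.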

How to define xaxis for a cumulative distribution function using ggplot and geom_ribbon in R?

I found a "manual" solution. First, I created a variable equal to the cumulative distribution of my variable of interest:

df <- 
  df %>%
  dplyr::mutate(cumula_var = cume_dist(var_x))

Then, I made the graph:

Graph <- 
  ggplot(df, aes(x=var_x, y=cumula_var)) +
  geom_line() +
  geom_ribbon(aes(ymin = 0, ymax = ..y..,
                  xmin = 0, xmax = 20))+
  coord_cartesian(xlim = c(0, 20))
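
To make the first step concrete: cume_dist() returns, for each value, the proportion of values less than or equal to it, which is what ecdf(x)(x) gives in base R. The sketch below checks that equivalence on made-up data (var_x here is simulated, not the data from the question) and compares against dplyr only if it is installed.

```r
set.seed(42)
var_x <- round(runif(50, 0, 20))   # hypothetical data, rounded so that ties occur

# Base-R cumulative distribution evaluated at each observation
cum_base <- ecdf(var_x)(var_x)

# Equivalent rank-based definition: rank with ties set to the maximum, divided by n
cum_rank <- rank(var_x, ties.method = "max") / length(var_x)
all.equal(cum_base, cum_rank)

# Compare against dplyr::cume_dist() when dplyr is available
if (requireNamespace("dplyr", quietly = TRUE)) {
  stopifnot(isTRUE(all.equal(cum_base, dplyr::cume_dist(var_x))))
}
```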

What is the best plot to show a distribution in R?

Try geom_boxplot():

ggplot(iris, aes(x = Petal.Length, fill=Species)) + 
  geom_boxplot() + 
  scale_fill_brewer(palette="Set1")

Or geom_histogram(), as @akrun suggests. I've combined it with facet_grid().

# geom_histogram() bins along one axis only, so map x and leave y unmapped
ggplot(iris, aes(x = Petal.Length, fill=Species)) + 
  geom_histogram() + 
  scale_fill_brewer(palette="Set1")+
  facet_grid("Species")

And the popular geom_violin() plot:

ggplot(iris, aes(x = Petal.Length, y=Species, fill=Species)) + 
  geom_violin() + 
  scale_fill_brewer(palette="Set1")
