Making Binned Scatter Plots for Two Variables in ggplot2 in R

Interpreting binned scatterplot (R) and calculating variance of the mean

We can bin the data with the cut() function as follows:

mybin <- cut(df$x,20,include.lowest=TRUE,right = FALSE)
df$Bins <- mybin

Then, to calculate the mean of x and y within each bin:

library(tidyverse)

out <- df %>%
  group_by(Bins) %>%
  summarise(x = mean(x), y = mean(y)) %>%
  as.data.frame()

To compare our results with ggplot2's stat_summary_bin() function, we can plot them together:

(ggplot(df, aes(x = x, y = y)) +
  geom_point(alpha = 0.4) +
  stat_summary_bin(fun = 'mean', bins = 20,
                   color = 'orange', size = 2, geom = 'point') +
  geom_point(data = out, color = "green"))

# The green dots are the bin means we calculated; they line up with the orange stat_summary_bin() points.


Now, to calculate the variance, we can follow the same process with the var() function:

df %>%
  group_by(Bins) %>%
  summarise(Varx = var(x), Vary = var(y)) %>%
  as.data.frame()

This gives the variance within each bin. Note that, since the x axis is binned, the variance of x within each bin will be almost zero, so the quantity that matters here is the variance of y.

  • The variances of the binned data give a rough indication of the heteroscedasticity of the data.

  • The path of the binned means also shows the overall pattern of the data, so your data have a positive trend (the path does not need to be a perfectly smooth line). The trend looks weaker, though, because the bin means scatter around it, as you suggested.
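The question title also asks about the variance of the mean, which is not the same thing as the within-bin variance computed above. A minimal sketch, assuming the observations inside a bin are independent, estimates the variance of each bin mean as var(y) divided by the number of points in the bin. The column names n_obs, var_y, var_mean and se_mean are illustrative choices, not something from the original question.

```r
library(dplyr)

# Same simulated data and binning as in this answer
set.seed(42)
x <- runif(1000)
y <- x^2 + x + 4 * rnorm(1000)
df <- data.frame(x = x, y = y)
df$Bins <- cut(df$x, 20, include.lowest = TRUE, right = FALSE)

bin_stats <- df %>%
  group_by(Bins) %>%
  summarise(
    n_obs    = n(),              # number of points in the bin
    mean_y   = mean(y),          # binned mean (what the plot shows)
    var_y    = var(y),           # spread of the points inside the bin
    var_mean = var_y / n_obs,    # estimated variance of the bin mean
    se_mean  = sqrt(var_mean)    # standard error of the bin mean
  ) %>%
  as.data.frame()

bin_stats
```

Comparing var_y across bins speaks to heteroscedasticity, while se_mean indicates how precisely each binned mean is estimated.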

Data:

set.seed(42)
x <- runif(1000)
y <- x^2 + x + 4 * rnorm(1000)
df <- data.frame(x=x, y=y)

Note: The data and some of the ggplot2 code have been taken from the question the OP referred to.

How to create two lines and scatter plots using ggplot

How about something like this?

data %>%
  gather(k, value, -id) %>%
  mutate(
    state = gsub("(\\.e$|\\.f$)", "", k),
    what = gsub("(initial\\.|final\\.)", "", k)) %>%
  ggplot(aes(id, value, colour = what)) +
  geom_line() +
  facet_wrap(~ state)


Or with points:

data %>%
  gather(k, value, -id) %>%
  mutate(
    state = gsub("(\\.e$|\\.f$)", "", k),
    what = gsub("(initial\\.|final\\.)", "", k)) %>%
  ggplot(aes(id, value, colour = what)) +
  geom_line() +
  geom_point() +
  facet_wrap(~ state)



Update

data %>%
  gather(k, value, -id) %>%
  mutate(
    state = gsub("(\\.e$|\\.f$)", "", k),
    what = gsub("(initial\\.|final\\.)", "", k)) %>%
  select(-k) %>%
  spread(state, value) %>%
  ggplot(aes(x = initial, y = final, colour = what, fill = what)) +
  geom_smooth(fullrange = TRUE, method = "lm") +
  geom_point()


This shows a trend line based on a simple linear regression (method = "lm"), including a confidence band (disable it with se = FALSE inside geom_smooth). You could also show a LOESS trend with method = "loess" inside geom_smooth. See ?geom_smooth for more details.
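Since the original data is not shown, here is a small sketch of what the reshaping steps do. The data frame below is hypothetical: its column names (initial.e, final.e, initial.f, final.f) are an assumption inferred from the regular expressions in the pipeline, and the values are made up.

```r
library(dplyr)
library(tidyr)

# Hypothetical wide data: one row per id, columns named <state>.<what>
data <- data.frame(
  id        = 1:4,
  initial.e = c(1.0, 2.0, 3.0, 4.0),
  final.e   = c(1.5, 2.4, 3.9, 5.1),
  initial.f = c(2.0, 2.5, 3.5, 4.5),
  final.f   = c(2.2, 3.1, 3.6, 5.0)
)

# Long format: one row per id and original column
long <- data %>%
  gather(k, value, -id) %>%
  mutate(
    state = gsub("(\\.e$|\\.f$)", "", k),           # "initial" or "final"
    what  = gsub("(initial\\.|final\\.)", "", k))   # "e" or "f"

# Back to one column per state, as used for the x/y scatter plot
wide <- long %>%
  select(-k) %>%
  spread(state, value)

wide
```

The first two plots work from the long form (one panel per state), while the update uses the wide form so that initial and final can be mapped to x and y.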

How to have two variable in a scatter qplot or ggplot2?

Basically, you need to reshape your data with melt() into one long data frame, so that Min and Max end up in a single value column:

library(reshape)
M <- melt(baseSenior,
          id.vars = c("Date", "Type", "Rating", "Amount.Outstanding"),
          measure.vars = c("Min", "Max"))

library(ggplot2)
ggplot(data = M, aes(x = Date, y = value, colour = Type, shape = variable)) +
  geom_point() +
  facet_grid(Rating ~ Amount.Outstanding)
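To make the effect of the reshape concrete, here is a rough sketch on a made-up data frame. It uses tidyr's pivot_longer() rather than melt(), which is a swapped-in alternative and not what the answer above uses; the baseSenior values are hypothetical, and only the column names come from the answer.

```r
library(tidyr)

# Hypothetical stand-in for baseSenior (values are invented)
baseSenior <- data.frame(
  Date               = as.Date(c("2012-01-02", "2012-01-03")),
  Type               = c("A", "B"),
  Rating             = c("AA", "BBB"),
  Amount.Outstanding = c(100, 250),
  Min                = c(1.2, 2.0),
  Max                = c(1.8, 2.9)
)

# Stack Min and Max into one "value" column, labelled by "variable"
M <- pivot_longer(baseSenior, cols = c(Min, Max),
                  names_to = "variable", values_to = "value")

M
```

Each original row becomes two rows, one for Min and one for Max, which is what lets a single geom_point() layer distinguish them by shape.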

ggplot2 geom_point with binned x-axis for binary data

As @Kohske said, there is no direct way to do that in ggplot2; you have to pre-summarize the data and pass that to ggplot. Your approach works, but I would have done it slightly differently, using the plyr package instead of aggregate.

library("plyr")
data$bin <- cut(data$x, seq(0, 1, 0.05))
data.bin <- ddply(data, "bin", function(DF) {
  data.frame(mean = numcolwise(mean)(DF), length = numcolwise(length)(DF))
})
ggplot(data.bin, aes(x = mean.x, y = mean.y, size = length.x)) +
  geom_point() +
  ylim(0, 1)


The advantage, in my opinion, is that you get a simple data frame with better names this way, rather than a data frame where some columns are matrices. But that is probably more a matter of personal style than of correctness.
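For readers who prefer dplyr over plyr, the same pre-summarising step could be sketched as follows. The binary data here is simulated and purely hypothetical (the question's data is not shown); the output column names are chosen to match the ddply() result above so the same ggplot call would apply.

```r
library(dplyr)

# Hypothetical binary data: y is 0/1, with P(y = 1) increasing in x
set.seed(1)
data <- data.frame(x = runif(500))
data$y <- rbinom(500, size = 1, prob = data$x)

data.bin <- data %>%
  mutate(bin = cut(x, seq(0, 1, 0.05))) %>%   # 20 bins of width 0.05
  group_by(bin) %>%
  summarise(
    mean.x   = mean(x),   # average x position within the bin
    mean.y   = mean(y),   # proportion of 1s within the bin
    length.x = n()        # number of observations, used for point size
  ) %>%
  as.data.frame()

head(data.bin)
```

Because y is binary, mean.y is the observed proportion of ones in each bin, which is why the plot limits the y axis to the range 0 to 1.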


