Binning data, finding results by group, and plotting using R
First classify into depth classes with cut:
depth.class <- cut(quakes$depth, c(40, 120, 200, 300, 400, 500, 600, 680), include.lowest = TRUE)
(Note that your class definitions may need to change depending on exactly what you are after and on the details of cut()'s behaviour, such as which end of each interval is closed.)
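To make the point about cut()'s behaviour concrete, here is a minimal sketch using a few made-up depths that sit exactly on the break points. The depths vector is an illustrative assumption; the breaks are the ones used above.

```r
# Toy depths chosen to sit on or near the break points (illustrative values)
depths <- c(40, 120, 121, 680)
breaks <- c(40, 120, 200, 300, 400, 500, 600, 680)

# Default: intervals are (a, b], so a value equal to the lowest break gets NA
default_bins <- cut(depths, breaks)

# include.lowest = TRUE closes the first interval on the left: [40, 120]
closed_bins <- cut(depths, breaks, include.lowest = TRUE)

print(data.frame(depths, default_bins, closed_bins))
```

Comparing the two columns shows why include.lowest = TRUE matters when the smallest observation equals the first break.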
Find the mean magnitude within each depth.class (assumes no NAs):
mean.mag <- tapply(quakes$mag, depth.class, mean)
(Add na.rm for data sets with missing values where appropriate, e.g. mean.mag <- tapply(quakes$mag, depth.class, mean, na.rm = TRUE).)
Plot as a line:
plot(mean.mag, type = "l", xlab = "depth class")
It's a little extra work to put the class labels on the X-axis, but at that point you might question if a line plot is really appropriate here.
A quick stab: turn off the axes and then put up the classes directly from the cut factor:
plot(mean.mag, type = "l", xlab = "depth class", axes = FALSE)
axis(1, 1:nlevels(depth.class), levels(depth.class))
axis(2)
box()
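Since the depth classes are discrete, a bar plot may be a more natural fit than a line. A minimal sketch, assuming the built-in quakes data and the same breaks as above; the axis labels and the cex.names value are my own choices:

```r
# Recreate the depth classes and per-class mean magnitude from the built-in quakes data
depth.class <- cut(quakes$depth, c(40, 120, 200, 300, 400, 500, 600, 680),
                   include.lowest = TRUE)
mean.mag <- tapply(quakes$mag, depth.class, mean)

# barplot() takes the bar labels from the names of the vector,
# so the class labels appear on the x-axis without extra axis() calls
barplot(mean.mag, xlab = "depth class", ylab = "mean magnitude", cex.names = 0.7)
```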
R: Graphing binned data
Maybe:
lines(mean.yaxis ~ seq(0, 30, length=length(mean.yaxis)))
HTH
R summing up binned data
I think I understand it now: you want the sum of all values that fit into a bin?
You can use tapply for this:
n = 100
x = rnorm(n)
n.breaks = as.integer(sqrt(n))
bins = cut(x, breaks = n.breaks)
sums = tapply(x, bins, sum)
print(sums)
  (-2.65,-2.17]    (-2.17,-1.7]    (-1.7,-1.23]  (-1.23,-0.754] (-0.754,-0.281]
     -7.5825100      -7.6457772      -5.6796399      -8.6823512     -12.8808658
 (-0.281,0.193]   (0.193,0.666]    (0.666,1.14]     (1.14,1.61]     (1.61,2.09]
     -0.8756864       8.1137694       7.5262578       4.1649094      10.8759823
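Because x is random, the numbers above will differ from run to run. A reproducible sketch of the same idea, with an arbitrary seed (my assumption) and a count per bin added alongside the sums:

```r
set.seed(42)                      # arbitrary seed, only for reproducibility
x <- rnorm(100)
bins <- cut(x, breaks = as.integer(sqrt(length(x))))

bin_sums   <- tapply(x, bins, sum)   # sum of the values in each bin
bin_counts <- table(bins)            # number of values in each bin

# Sanity check: the bin sums add up to the overall sum.
# tapply() returns NA for empty bins, hence na.rm = TRUE.
stopifnot(isTRUE(all.equal(sum(bin_sums, na.rm = TRUE), sum(x))))
```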
Group the data by column and obtain the mean of the rest of the variables in R
I'll emulate with mtcars and dplyr.
library(dplyr)
quant <- c("mpg", "disp", "hp")
qual <- c("vs", "am", "gear")
mtcars %>%
group_by(cyl) %>%
summarize(across(all_of(quant), mean), across(all_of(qual), ~ names(sort(table(.),decreasing=TRUE))[1]))
# # A tibble: 3 x 7
# cyl mpg disp hp vs am gear
# <dbl> <dbl> <dbl> <dbl> <chr> <chr> <chr>
# 1 4 26.7 105. 82.6 1 1 4
# 2 6 19.7 183. 122. 1 0 4
# 3 8 15.1 353. 209. 0 0 3
The names(sort(table(.), decreasing = TRUE))[1] expression is meant to be your "mode" of a qualitative variable. We can validate that it is doing what we expect with a quick table:
xtabs(~cyl+vs, data=mtcars)
# vs
# cyl 0 1
# 4 1 10
# 6 3 4
# 8 14 0
xtabs(~cyl+am, data=mtcars)
# am
# cyl 0 1
# 4 3 8
# 6 4 3
# 8 12 2
xtabs(~cyl+gear, data=mtcars)
# gear
# cyl 3 4 5
# 4 1 8 2
# 6 2 4 1
# 8 12 0 2
showing that for cyl 4, 6, and 8, respectively, the most common vs is 1, 1, and 0; for am: 1, 0, and 0; for gear: 4, 4, and 3. Those correspond to the values in the return above.
In your case, change cyl to neighborhood, and make sure your qual and quant have the desired variables listed.
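If the inline "mode" expression is used in several places, it may be clearer to give it a name. A small sketch in base R; most_common is a hypothetical helper name, not part of dplyr or base R:

```r
# Hypothetical helper: the most frequent value of a vector, returned as a character string
most_common <- function(v) names(sort(table(v), decreasing = TRUE))[1]

# Check it against the built-in mtcars data, one result per cyl group
gear_mode <- tapply(mtcars$gear, mtcars$cyl, most_common)
print(gear_mode)
```

The helper returns a character value because it reads the winner from the names of the table, which is why the qualitative columns in the summary above are of type chr.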
How to create dodge bar plot from binned/interval data in r?
Always a good idea to take a look at the dataframe you are passing to ggplot to see if the data is making sense.
In your case, the dataframe is:
mpg %>% mutate(cty_interval = cut(cty,5)) %>% add_count(cty_interval)
manufacturer model displ year cyl trans drv cty hwy fl class cty_interval n
<chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr> <fct> <int>
1 audi a4 1.8 1999 4 auto(l5) f 18 29 p compact (14.2,19.4] 105
2 audi a4 1.8 1999 4 manual(m5) f 21 29 p compact (19.4,24.6] 46
3 audi a4 2 2008 4 manual(m6) f 20 31 p compact (19.4,24.6] 46
4 audi a4 2 2008 4 auto(av) f 21 30 p compact (19.4,24.6] 46
5 audi a4 2.8 1999 6 auto(l5) f 16 26 p compact (14.2,19.4] 105
6 audi a4 2.8 1999 6 manual(m5) f 18 26 p compact (14.2,19.4] 105
7 audi a4 3.1 2008 6 auto(av) f 18 27 p compact (14.2,19.4] 105
8 audi a4 quattro 1.8 1999 4 manual(m5) 4 18 26 p compact (14.2,19.4] 105
9 audi a4 quattro 1.8 1999 4 auto(l5) 4 16 25 p compact (14.2,19.4] 105
10 audi a4 quattro 2 2008 4 manual(m6) 4 20 28 p compact (19.4,24.6] 46
So, the n column shows the total number of cars in each bin (regardless of the number of cyl). So when you split the bars by cyl, each bar still shows the bin-wide value of n (which is the same for all rows in the same bin -- compare rows 1 and 6).
It is also probably overplotting a lot of bars in the same position (since it plots one bar for each row and there is a lot of repetition). So you could simply use add_count(cty_interval, cyl) (like @qdread suggested in the comment above), but this would still have this issue of overplotting the same bar over and over.
I think that the right way to do this is by using dplyr::group_by and dplyr::summarise (included in tidyverse). You should group by the two variables you are interested in (cty_interval and cyl) and count the number of occurrences in each group with summarise. Also, because this will not show empty groups, I used complete to add rows for the empty groups (otherwise the column plot would look weird).
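library(tidyverse)  # dplyr, tidyr::complete, ggplot2 and the mpg data
df.1 <- mpg %>%
  mutate(cty_interval = cut(cty,5)) %>%
  dplyr::group_by(cty_interval, cyl) %>%
  # drop the grouping so that complete() can fill in both columns
  summarise(n=n(), .groups = "drop") %>%
  complete(cty_interval, cyl, fill = list(n = 0))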
Which results in:
cty_interval cyl n
<fct> <int> <dbl>
1 (8.97,14.2] 6 14
2 (8.97,14.2] 8 59
3 (14.2,19.4] 6 65
4 (14.2,19.4] 8 11
5 (19.4,24.6] 6 0
6 (19.4,24.6] 8 0
7 (24.6,29.8] 6 0
8 (24.6,29.8] 8 0
And the plot now looks like this:
ggplot(data=df.1, aes(x = cty_interval, y = n, fill = as.factor(cyl))) +
geom_col(position = "dodge")
You can probably improve it by changing the width of the bars (I think the groups in adjacent bins are too close to each other, which looks confusing).
How to plot line graph of normalized differences from binned data with ggplot?
library(tidyverse)
Creating example data as shown in the question, but adding different probabilities to the two sample() calls, to create a visible difference between the two sets of randomized data.
dat1 <- data.frame(a = replicate(1,sample(25:50,10000,rep=TRUE, prob = 25:0/100))) %>% as_tibble()
dat2 <- data.frame(a = replicate(1,sample(25:50,9500,rep=TRUE, prob = 0:25/100))) %>% as_tibble()
Using dplyr we can handle this within data.frames (tibbles) without the need to switch to other datatypes.
Let's define a function that can be applied to both datasets to get the preprocessing done. We use base::cut() to create a new column that pairs each value with its bin. We then group the data by bin, calculate the sum for each bin and finally divide the bin sums by the total sum.
calc_bin_props <- function(data) {
as_tibble(data) %>%
mutate(bin = cut(a, breaks = seq(25, 50, by = 2), labels = seq(25, 48, by = 2))) %>%
group_by(bin) %>%
summarise(sum = sum(a), .groups = "drop") %>%
filter(!is.na(bin)) %>%
ungroup() %>%
mutate(sum = sum / sum(sum))
}
Now we call calc_bin_props() on both datasets and join them by bin. This gives us a dataframe with the columns bin, sum.x and sum.y. The latter two correspond to the bin sums derived from dat1 and dat2. With the mutate() line we calculate the differences between the two columns.
diff_data <-
full_join(
calc_bin_props(data = dat1),
calc_bin_props(dat2),
by = "bin") %>%
mutate(dbin = (sum.x - sum.y),
bin = as.numeric(as.character(bin))) %>%
select(-starts_with("trsh"))
Before we feed the data into ggplot() we convert it to the long format using pivot_longer(). This allows us to instruct ggplot() to plot the results for sum.x, sum.y and dbin as separate lines.
diff_data %>%
pivot_longer(-bin) %>%
ggplot(aes(as.numeric(bin), value, color = name, linetype = name)) +
geom_line() +
scale_linetype_manual(values=c("longdash", "solid", "solid")) +
scale_color_manual(values = c("black", "purple", "green"))
Using cut() with group_by()
If you want to use cut, you could do it this way:
df %>%
group_by(group, subgroup) %>%
mutate(bin = cut(value, breaks = c(-Inf, mean(value), Inf), labels = c(1,2)))
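The same split at the per-group mean can be sketched in base R with ave() instead of dplyr, which makes it easy to check what the bins contain. The data frame below is made up for illustration and uses a single grouping column for brevity:

```r
# Toy data (illustrative values only)
df <- data.frame(group = c("a", "a", "a", "b", "b", "b"),
                 value = c(1, 2, 6, 10, 20, 60))

# ave() returns the per-group mean aligned to the original rows
group_mean <- ave(df$value, df$group)

# Bin 1: value <= group mean, bin 2: value > group mean,
# matching cut(value, c(-Inf, mean(value), Inf)) with its right-closed intervals
df$bin <- ifelse(df$value <= group_mean, 1, 2)
print(df)
```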
Splitting a continuous variable into equal sized groups
try this:
split(das, cut(das$anim, 3))
if you want to split based on the value of wt, then
library(Hmisc) # cut2
split(das, cut2(das$wt, g=3))
anyway, you can do that by combining cut, cut2 and split.
UPDATED
if you want a group index as an additional column, then
das$group <- cut(das$anim, 3)
if the column should be index like 1, 2, ..., then
das$group <- as.numeric(cut(das$anim, 3))
UPDATED AGAIN
try this:
> das$wt2 <- as.numeric(cut2(das$wt, g=3))
> das
anim wt wt2
1 1 181.0 1
2 2 179.0 1
3 3 180.5 1
4 4 201.0 2
5 5 201.5 2
6 6 245.0 2
7 7 246.4 3
8 8 189.3 1
9 9 301.0 3
10 10 354.0 3
11 11 369.0 3
12 12 205.0 2
13 13 199.0 1
14 14 394.0 3
15 15 231.3 2
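If you would rather not depend on Hmisc, equal-sized groups can be sketched in base R by cutting at quantiles. This is a substitute technique, not what cut2 does internally; the wt values are copied from the output above.

```r
# Weights copied from the das output above
wt <- c(181.0, 179.0, 180.5, 201.0, 201.5, 245.0, 246.4, 189.3,
        301.0, 354.0, 369.0, 205.0, 199.0, 394.0, 231.3)

# Equal-width intervals: group sizes can be very uneven
width_group <- as.numeric(cut(wt, 3))

# Equal-count groups: use the tertiles as break points
tertiles <- quantile(wt, probs = 0:3 / 3)
count_group <- as.numeric(cut(wt, tertiles, include.lowest = TRUE))

print(table(width_group))
print(table(count_group))
```

With heavily tied data the quantiles can coincide, in which case cut() will complain about non-unique breaks; that case needs separate handling.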
Group/bin/bucket data in R and get count per bucket and sum of values per bucket
From the comments, "C2" seems to be a "character" column with % as a suffix. Before creating a group, remove the % using sub and convert to "numeric" (as.numeric). The variable "group" is created (transform(df, ...)) by using the function cut with the breaks (group buckets/intervals) and labels (the desired group labels) arguments. Once the group variable is created, the sum of "C1" by "group" and the count of elements within each "group" can be computed using aggregate from base R:
df1 <- transform(df, group=cut(as.numeric(sub('[%]', '', C2)),
breaks=c(-Inf,0.005, 0.010, 0.014, Inf),
labels=c('<0.005', 0.005, 0.01, 0.014)))
res <- do.call(data.frame,aggregate(C1~group, df1,
FUN=function(x) c(Count=length(x), Sum=sum(x))))
dNew <- data.frame(group=levels(df1$group))
merge(res, dNew, all=TRUE)
# group C1.Count C1.Sum
#1 <0.005 2 3491509.6
#2 0.005 NA NA
#3 0.01 2 302997.1
#4 0.014 8 364609.5
or you can use data.table. setDT converts the data.frame to a data.table. Specify the grouping variable with by= and create the two summary variables "Count" and "Sum" within list(). .N gives the count of elements within each "group".
library(data.table)
setDT(df1)[, list(Count=.N, Sum=sum(C1)), by=group][]
Or using dplyr. The %>% operator pipes the left-hand side into the right-hand side, chaining the calls together. Use group_by to specify the "group" variable, and then use summarise_each or summarise to get the count and sum of the column concerned. summarise_each is useful when there is more than one column.
library(dplyr)
df1 %>%
group_by(group) %>%
summarise_each(funs(n(), Sum=sum(.)), C1)
Update
Using the new dataset df
df1 <- transform(df, group=cut(C2, breaks=c(-Inf,0.005, 0.010, 0.014, Inf),
labels=c('<0.005', 0.005, 0.01, 0.014)))
res <- do.call(data.frame,aggregate(cbind(C1,C3)~group, df1,
FUN=function(x) c(Count=length(x), Sum=sum(x))))
res
# group C1.Count C1.Sum C3.Count C3.Sum
#1 <0.005 2 3491509.6 2 91233
#2 0.01 2 302997.1 2 88843
#3 0.014 8 364609.5 8 268809
and you can do the merge as detailed above.
The dplyr approach would be the same except for specifying the additional variable:
df1%>%
group_by(group) %>%
summarise_each(funs(n(), Sum=sum(.)), C1, C3)
#Source: local data frame [3 x 5]
# group C1_n C3_n C1_Sum C3_Sum
#1 <0.005 2 2 3491509.6 91233
#2 0.01 2 2 302997.1 88843
#3 0.014 8 8 364609.5 268809
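summarise_each() and funs() are deprecated in current dplyr releases. A rough equivalent with across(), assuming dplyr 1.0 or later; the small df1 below is a made-up stand-in, not the original data:

```r
library(dplyr)

# Toy stand-in for df1 (values are illustrative only)
df1 <- data.frame(
  group = factor(c("<0.005", "<0.005", "0.01", "0.014", "0.014")),
  C1 = c(10, 20, 5, 1, 2),
  C3 = c(1, 2, 3, 4, 5)
)

# One count and one sum per column; output columns are named <column>_<function>
res <- df1 %>%
  group_by(group) %>%
  summarise(across(c(C1, C3), list(n = length, Sum = sum)))

print(res)
```

Note that the columns come out grouped by variable (C1_n, C1_Sum, C3_n, C3_Sum) rather than by function as in the older output above.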
data
df <-structure(list(C1 = c(49488.01172, 268221.1563, 34775.96094,
13046.98047, 2121699.75, 71155.09375, 1369809.875, 750, 44943.82813,
85585.04688, 31090.10938, 68550.40625), C2 = c("0.0512%", "0.0128%",
"0.0128%", "0.07241%", "0.00453%", "0.0181%", "0.00453%", "0.2048%",
"0.0362%", "0.0362%", "0.0362%", "0.0181%")), .Names = c("C1",
"C2"), row.names = c(NA, -12L), class = "data.frame")
dplyr: Find mean for each bin by groups
You seem to be flailing a bit. You've got correct code, then you've got extra code.
Starting from a fresh R session and defining your data, then
library(dplyr)
res <- df %>% group_by(id, bin, sign) %>%
summarise(Num = n(), value = mean(value,na.rm=TRUE))
The above code is from your question, but I replaced length(bin) with the built-in dplyr::n() function. The above code accurately gives the group-wise averages:
head(res)
# id bin sign Num value
# 1 A [0,1] - 122 -0.08330338
# 2 A [0,1] + 111 0.11394381
# 3 A [0,1] NULL 2 0.75232462
# 4 A (1,2] - 54 -0.09236725
# 5 A (1,2] + 45 0.20581095
# 6 A (2,3] - 12 -0.08998771
Jumping ahead to your last couple lines in the code block:
groupA = df[df$id=="A" & df$bin=="[0, 1]" & df$sign=="NULL", ]
# mean(groupA$value, na.rm=T)
# [1] 0.7523246
Which matches the 3rd line of the above results. So you did it, it works fine!
The rest of your code is confused:
res %>% group_by(id) %>%
summarise(total= sum(Num))
I'm not sure what you're trying to accomplish with this, but you don't assign it to anything so it is run but not saved.
As for your ddply attempt:
ddply(df, .(id, bin, sign), summarize, mean = mean(value,na.rm=TRUE))
You'll notice that if you have dplyr loaded and then load the plyr library, there's a message that:
You have loaded plyr after dplyr - this is likely to cause problems.
If you need functions from both plyr and dplyr, please load plyr first, then dplyr:
library(plyr); library(dplyr)
Do not ignore this warning! My guess is this happened, you ignored it, and that's part of the source of your troubles. Probably you don't need plyr at all, but if you do, load it before dplyr!