Binning Data, Finding Results by Group, and Plotting Using R

First classify into depth classes with cut:

depth.class <- cut(quakes$depth, c(40, 120, 200, 300, 400, 500, 600, 680), include.lowest = TRUE)

(Note that your class definitions may need to vary depending on exactly what you are after and on the details of cut()'s behaviour.)
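To make the remark about cut()'s behaviour concrete, here is a minimal sketch on a few made-up depth values (the vector x and the shortened break list are illustrative assumptions, not part of the quakes data). By default the intervals are open on the left and closed on the right, so a value equal to the lowest break falls outside every class unless include.lowest = TRUE is set:

```r
# Illustrative values only: two of them sit exactly on break points
x <- c(40, 120, 121, 680)
breaks <- c(40, 120, 200, 680)

# Default: intervals are (a, b], so 40 is not in any class and becomes NA
default_bins <- cut(x, breaks)

# include.lowest = TRUE closes the first interval on the left: [40, 120]
closed_bins <- cut(x, breaks, include.lowest = TRUE)

# right = FALSE flips the convention to [a, b), so now 680 becomes NA
left_closed <- cut(x, breaks, right = FALSE)

print(data.frame(x, default_bins, closed_bins, left_closed))
```

Which convention is right depends on whether a depth of exactly 120 should count as shallow or as the start of the next class.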

Find the mean magnitude within each depth.class (assumes no NAs):

mean.mag <- tapply(quakes$mag, depth.class, mean)

(Add na.rm, e.g. mean.mag <- tapply(quakes$mag, depth.class, mean, na.rm = TRUE), for data sets with missing values where appropriate.)

Plot as a line:

plot(mean.mag, type = "l", xlab = "depth class")

It's a little extra work to put the class labels on the X-axis, but at that point you might question if a line plot is really appropriate here.

A quick stab: turn off the axes and then put up the classes directly from the cut factor:

plot(mean.mag, type = "l", xlab = "depth class", axes = FALSE)
axis(1, 1:nlevels(depth.class), levels(depth.class))
axis(2)
box()
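If a line plot does turn out to be the wrong choice, a bar plot is one alternative worth trying, since the depth classes are discrete categories. This is only a sketch: it rebuilds depth.class and mean.mag from the built-in quakes data exactly as above, and the las and cex.names settings are cosmetic assumptions to keep the labels readable.

```r
# Rebuild the classes and per-class means from the built-in quakes data
depth.class <- cut(quakes$depth,
                   c(40, 120, 200, 300, 400, 500, 600, 680),
                   include.lowest = TRUE)
mean.mag <- tapply(quakes$mag, depth.class, mean)

# barplot() uses the names of mean.mag (the class labels) on the x-axis
barplot(mean.mag,
        xlab = "depth class", ylab = "mean magnitude",
        las = 2, cex.names = 0.7)
```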

R: Graphing binned data

Maybe:

lines(mean.yaxis ~ seq(0, 30, length=length(mean.yaxis)))

HTH

R summing up binned data

I think I understand it now: you want the sum of all values that fall into each bin?
You can use tapply for this:

n = 100
x = rnorm(n)
n.breaks = as.integer(sqrt(n))
bins = cut(x, breaks = n.breaks)

sums = tapply(x, bins, sum)
print(sums)

(-2.65,-2.17] (-2.17,-1.7] (-1.7,-1.23] (-1.23,-0.754] (-0.754,-0.281]
-7.5825100 -7.6457772 -5.6796399 -8.6823512 -12.8808658
(-0.281,0.193] (0.193,0.666] (0.666,1.14] (1.14,1.61] (1.61,2.09]
-0.8756864 8.1137694 7.5262578 4.1649094 10.8759823
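One thing to watch for: tapply returns NA, not 0, for a bin that contains no values. A small sketch with fixed, made-up numbers (so the result is reproducible, unlike the rnorm example) shows this, along with the per-bin counts from table:

```r
# Fixed illustrative data; the last bin (25, 35] is deliberately empty
x <- c(1, 2, 3, 10, 11, 20)
bins <- cut(x, breaks = c(0, 5, 15, 25, 35))

sums_raw <- tapply(x, bins, sum)   # empty bin comes back as NA
counts <- table(bins)              # empty bin is counted as 0

# Replace NA with 0 if an empty bin should contribute a zero sum
sums <- sums_raw
sums[is.na(sums)] <- 0

print(sums_raw)
print(sums)
print(counts)
```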

Group the data by column and obtain the mean of the rest of the variables in R

I'll emulate with mtcars and dplyr.

library(dplyr)
quant <- c("mpg", "disp", "hp")
qual <- c("vs", "am", "gear")

mtcars %>%
  group_by(cyl) %>%
  summarize(
    across(all_of(quant), mean),  # all_of() for column names held in a character vector
    across(all_of(qual), ~ names(sort(table(.), decreasing = TRUE))[1]))
# # A tibble: 3 x 7
# cyl mpg disp hp vs am gear
# <dbl> <dbl> <dbl> <dbl> <chr> <chr> <chr>
# 1 4 26.7 105. 82.6 1 1 4
# 2 6 19.7 183. 122. 1 0 4
# 3 8 15.1 353. 209. 0 0 3

The names(sort(table(.), decreasing=TRUE))[1] expression is meant to be the "mode" of a qualitative variable, i.e. its most frequent value. We can validate that it is doing what we expect with a quick table:

xtabs(~cyl+vs, data=mtcars)
# vs
# cyl 0 1
# 4 1 10
# 6 3 4
# 8 14 0
xtabs(~cyl+am, data=mtcars)
# am
# cyl 0 1
# 4 3 8
# 6 4 3
# 8 12 2
xtabs(~cyl+gear, data=mtcars)
# gear
# cyl 3 4 5
# 4 1 8 2
# 6 2 4 1
# 8 12 0 2

showing that for cyl 4, 6, and 8, respectively, the most common vs is 1, 1, and 0; for am: 1, 0, and 0; for gear: 4, 4, and 3. Those correspond to the values in the return above.

In your case, change cyl to neighborhood, and make sure your qual and quant have the desired variables listed.
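If you would rather not depend on dplyr for the check, the same "mode per group" idea can be sketched in base R. The helper name stat_mode is hypothetical (it is not a built-in function), and it uses which.max, which returns the first of any tied values, so ties are resolved by level order:

```r
# Hypothetical helper: the most frequent value of x, returned as a string
stat_mode <- function(x) names(which.max(table(x)))

# Mode of each qualitative column within each cyl group of mtcars
modes <- sapply(mtcars[c("vs", "am", "gear")],
                function(col) tapply(col, mtcars$cyl, stat_mode))
print(modes)
```

The rows of modes are the cyl groups and the columns are the qualitative variables, so it can be compared directly against the xtabs output.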

How to create dodge bar plot from binned/interval data in r?

Always a good idea to take a look at the dataframe you are passing to ggplot to see if the data is making sense.

In your case, the dataframe is:

mpg %>% mutate(cty_interval = cut(cty,5)) %>% add_count(cty_interval)

manufacturer model displ year cyl trans drv cty hwy fl class cty_interval n
<chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr> <fct> <int>
1 audi a4 1.8 1999 4 auto(l5) f 18 29 p compact (14.2,19.4] 105
2 audi a4 1.8 1999 4 manual(m5) f 21 29 p compact (19.4,24.6] 46
3 audi a4 2 2008 4 manual(m6) f 20 31 p compact (19.4,24.6] 46
4 audi a4 2 2008 4 auto(av) f 21 30 p compact (19.4,24.6] 46
5 audi a4 2.8 1999 6 auto(l5) f 16 26 p compact (14.2,19.4] 105
6 audi a4 2.8 1999 6 manual(m5) f 18 26 p compact (14.2,19.4] 105
7 audi a4 3.1 2008 6 auto(av) f 18 27 p compact (14.2,19.4] 105
8 audi a4 quattro 1.8 1999 4 manual(m5) 4 18 26 p compact (14.2,19.4] 105
9 audi a4 quattro 1.8 1999 4 auto(l5) 4 16 25 p compact (14.2,19.4] 105
10 audi a4 quattro 2 2008 4 manual(m6) 4 20 28 p compact (19.4,24.6] 46

So the n column shows the total number of cars in each bin, regardless of cyl. When you split the bars by cyl, each bar still shows the value in n, which is the same for all rows in the same bin (compare rows 1 and 6).

It is also probably overplotting a lot of bars in the same position, since it plots one bar for each row and there is a lot of repetition. You could simply use add_count(cty_interval, cyl) (as @qdread suggested in the comment above), but that would still overplot the same bar over and over.

I think that the right way to do this is by using dplyr::group_by and dplyr::summarise (included in tidyverse). You should group by the two variables you are interested in (cty_interval and cyl) and count the number of occurrences in each group with summarise. Also, because this will not show empty groups, I used tidyr::complete to add rows for the empty groups (otherwise the column plot would look weird).

df.1 <- mpg %>%
  mutate(cty_interval = cut(cty, 5)) %>%
  dplyr::group_by(cty_interval, cyl) %>%
  summarise(n = n(), .groups = "drop") %>%  # drop grouping so complete() can expand both columns
  complete(cty_interval, cyl, fill = list(n = 0))

Which results in:

   cty_interval   cyl     n
   <fct>        <int> <dbl>
 1 (8.97,14.2]      6    14
 2 (8.97,14.2]      8    59
 3 (14.2,19.4]      6    65
 4 (14.2,19.4]      8    11
 5 (19.4,24.6]      6     0
 6 (19.4,24.6]      8     0
 7 (24.6,29.8]      6     0
 8 (24.6,29.8]      8     0

And the plot now looks like this:

ggplot(data=df.1, aes(x = cty_interval, y = n, fill = as.factor(cyl))) +
geom_col(position = "dodge")

You can probably improve it by changing the width of the bars (I think the groups in neighbouring bins sit too close to each other, which looks confusing).
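As a cross-check on counts per bin and group, base R's table() builds the full cross-tabulation, including the zero cells, without any extra completion step. The sketch below uses the built-in mtcars data rather than ggplot2's mpg so that it runs without extra packages; the variable names are illustrative assumptions.

```r
# Bin mpg into 3 equal-width intervals and cross-tabulate against cyl
mpg_interval <- cut(mtcars$mpg, 3)
tab <- table(mpg_interval = mpg_interval, cyl = mtcars$cyl)

# Long format: one row per (interval, cyl) pair, zeros included
counts <- as.data.frame(tab)
print(counts)

# Base graphics analogue of a dodged bar chart:
# one cluster per interval, one bar per cyl value
barplot(t(tab), beside = TRUE, legend.text = TRUE,
        xlab = "mpg interval", ylab = "count")
```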

How to plot line graph of normalized differences from binned data with ggplot?

library(tidyverse)

Creating example data as shown in the question, but adding different probabilities to the two sample() calls to create a visible difference between the two sets of randomized data.

dat1 <- tibble(a = sample(25:50, 10000, replace = TRUE, prob = 25:0/100))
dat2 <- tibble(a = sample(25:50, 9500, replace = TRUE, prob = 0:25/100))

Using dplyr we can handle this within data.frames (tibbles) without
the need to switch to other datatypes.

Let’s define a function that can be applied to both datasets to get
the preprocessing done.

We use base::cut() to create
a new column that pairs each value with its bin. We then group the data
by bin, calculate the sum for each bin and finally divide the bin sums
by the total sum.

calc_bin_props <- function(data) {
  as_tibble(data) %>%
    mutate(bin = cut(a, breaks = seq(25, 50, by = 2), labels = seq(25, 48, by = 2))) %>%
    group_by(bin) %>%
    summarise(sum = sum(a), .groups = "drop") %>%
    filter(!is.na(bin)) %>%
    ungroup() %>%
    mutate(sum = sum / sum(sum))
}
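To see what the binning step does to individual values, here is a small base R sketch of the same computation on a handful of made-up numbers (the vector a below is an assumption for illustration). Note that with these breaks a value of exactly 25 sits on the open left edge of the first interval and 50 lies beyond the last break (49), so both end up as NA and are dropped:

```r
# Illustrative input, including the two edge values 25 and 50
a <- c(25, 26, 27, 28, 30, 49, 50)

# Same breaks and labels as in calc_bin_props()
bin <- cut(a, breaks = seq(25, 50, by = 2), labels = seq(25, 48, by = 2))
print(data.frame(a, bin))

# Sum per bin (NA bins are ignored by tapply), drop empty bins, normalise
bin_sums <- tapply(a, bin, sum)
bin_sums <- bin_sums[!is.na(bin_sums)]
bin_props <- bin_sums / sum(bin_sums)
print(bin_props)
```

If the edge values should be kept, include.lowest = TRUE and a final break at or above 50 would be one way to do it, at the cost of changing the bin definitions.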

Now we call calc_bin_props() on both datasets and join them by bin.
This gives us a dataframe with the columns bin, sum.x and sum.y.
The latter two correspond to the bin sums derived from dat1 and
dat2. With the mutate() line we calculate the differences between the
two columns.

diff_data <-
  full_join(
    calc_bin_props(dat1),
    calc_bin_props(dat2),
    by = "bin") %>%
  mutate(dbin = sum.x - sum.y,
         bin = as.numeric(as.character(bin)))

Before we feed the data into ggplot() we convert it to the long
format using pivot_longer(). This allows us to instruct ggplot() to
plot the results for sum.x, sum.y and dbin as separate lines.

diff_data %>% 
pivot_longer(-bin) %>%
ggplot(aes(as.numeric(bin), value, color = name, linetype = name)) +
geom_line() +
scale_linetype_manual(values=c("longdash", "solid", "solid")) +
scale_color_manual(values = c("black", "purple", "green"))

Using cut() with group_by()

If you want to use cut, you could do it this way:

df %>% 
group_by(group, subgroup) %>%
mutate(bin = cut(value, breaks = c(-Inf, mean(value), Inf), labels = c(1,2)))
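For comparison, roughly the same per-group binning can be sketched in base R with ave(), which applies a function within each group and returns the results in the original row order. The data frame below is made up for illustration and uses a single grouping column; ave() accepts further grouping vectors if a subgroup is needed as well.

```r
# Illustrative data: two groups with different means
df <- data.frame(
  group = c("a", "a", "a", "b", "b"),
  value = c(1, 2, 3, 10, 20)
)

# Within each group: 1 if value <= group mean, 2 if above it
df$bin <- ave(df$value, df$group, FUN = function(v) {
  as.integer(cut(v, breaks = c(-Inf, mean(v), Inf), labels = c(1, 2)))
})
print(df)
```

Because cut() intervals are closed on the right by default, a value exactly equal to its group mean lands in bin 1.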

Splitting a continuous variable into equal sized groups

try this:

split(das, cut(das$anim, 3))

if you want to split based on the value of wt, then

library(Hmisc) # cut2
split(das, cut2(das$wt, g=3))

anyway, you can do that by combining cut, cut2 and split.

UPDATED

if you want a group index as an additional column, then

das$group <- cut(das$anim, 3)

if the column should be index like 1, 2, ..., then

das$group <- as.numeric(cut(das$anim, 3))

UPDATED AGAIN

try this:

> das$wt2 <- as.numeric(cut2(das$wt, g=3))
> das
anim wt wt2
1 1 181.0 1
2 2 179.0 1
3 3 180.5 1
4 4 201.0 2
5 5 201.5 2
6 6 245.0 2
7 7 246.4 3
8 8 189.3 1
9 9 301.0 3
10 10 354.0 3
11 11 369.0 3
12 12 205.0 2
13 13 199.0 1
14 14 394.0 3
15 15 231.3 2
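It may help to see why cut(x, 3) and cut2(x, g=3) give different answers: the former makes intervals of equal width, the latter groups of (roughly) equal size. The sketch below reproduces the idea in base R only, using quantile() breaks as a stand-in for cut2; this is an assumption on my part, and cut2 may choose its cut points slightly differently on other data.

```r
# The wt values from the output above
das <- data.frame(
  anim = 1:15,
  wt = c(181, 179, 180.5, 201, 201.5, 245, 246.4, 189.3,
         301, 354, 369, 205, 199, 394, 231.3)
)

# Equal-width intervals: group sizes can be very uneven
equal_width <- cut(das$wt, 3)
print(table(equal_width))

# Equal-size groups: break at the tertiles instead
equal_size <- cut(das$wt,
                  breaks = quantile(das$wt, probs = 0:3 / 3),
                  include.lowest = TRUE)
print(table(equal_size))

# Integer group index, analogous to as.numeric(cut2(das$wt, g = 3))
das$wt_group <- as.integer(equal_size)
```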

Group/bin/bucket data in R and get count per bucket and sum of values per bucket

From the comments, "C2" seems to be a "character" column with % as a suffix. Before creating the group, remove the % using sub and convert to "numeric" (as.numeric). The variable "group" is created (transform(df, ...)) using cut with the breaks (bucket boundaries) and labels (the desired group labels) arguments. Once the group variable exists, the sum of "C1" by "group" and the count of elements within each "group" can be obtained with aggregate from base R:

df1 <-  transform(df, group=cut(as.numeric(sub('[%]', '', C2)), 
breaks=c(-Inf,0.005, 0.010, 0.014, Inf),
labels=c('<0.005', 0.005, 0.01, 0.014)))

res <- do.call(data.frame,aggregate(C1~group, df1,
FUN=function(x) c(Count=length(x), Sum=sum(x))))

dNew <- data.frame(group=levels(df1$group))
merge(res, dNew, all=TRUE)
# group C1.Count C1.Sum
#1 <0.005 2 3491509.6
#2 0.005 NA NA
#3 0.01 2 302997.1
#4 0.014 8 364609.5
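In case the do.call(data.frame, ...) wrapper looks mysterious: when FUN returns more than one value, aggregate stores the results as a single matrix column, and do.call(data.frame, ...) flattens that matrix into ordinary columns. A minimal sketch on made-up data (the toy data frame is an assumption, not the data from the question):

```r
# Toy data: three rows in two groups
toy <- data.frame(group = factor(c("low", "low", "high")),
                  C1 = c(10, 20, 5))

agg <- aggregate(C1 ~ group, toy,
                 FUN = function(x) c(Count = length(x), Sum = sum(x)))
print(ncol(agg))    # group plus one matrix column called C1
print(dim(agg$C1))  # that matrix holds Count and Sum side by side

# Flatten the matrix column into separate C1.Count and C1.Sum columns
flat <- do.call(data.frame, agg)
print(names(flat))
```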

Or you can use data.table. setDT converts the data.frame to a data.table. Specify the "grouping" variable with by= and create the two summary variables "Count" and "Sum" within list(). .N gives the count of elements within each "group".

 library(data.table)
setDT(df1)[, list(Count=.N, Sum=sum(C1)), by=group][]

Or using dplyr. The %>% operator passes the LHS as the first argument of the RHS, chaining the steps together. Use group_by to specify the "group" variable, and then use summarise_each or summarise to get the count and sum of the column concerned. summarise_each is useful if there is more than one column. (Note that summarise_each and funs are deprecated in current dplyr releases in favour of summarise with across.)

 library(dplyr)
df1 %>%
group_by(group) %>%
summarise_each(funs(n(), Sum=sum(.)), C1)
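On a recent dplyr, a sketch of the same summary with summarise and across might look like the following. The toy data frame stands in for df1 and its values are invented for illustration; the "_Sum" naming via .names is my choice, made to mimic the column names summarise_each produced.

```r
library(dplyr)

# Toy stand-in for df1 (illustrative values, not the original data)
toy <- data.frame(
  group = factor(c("low", "low", "high")),
  C1 = c(10, 20, 5),
  C3 = c(1, 2, 3)
)

res_new <- toy %>%
  group_by(group) %>%
  summarise(n = n(),
            across(c(C1, C3), sum, .names = "{.col}_Sum"))
print(res_new)
```

With a single n column there is one count per group, not one per summarised column, since the count does not depend on which column is being summed.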

Update

Using the new dataset df

df1 <- transform(df, group=cut(C2,  breaks=c(-Inf,0.005, 0.010, 0.014, Inf),
labels=c('<0.005', 0.005, 0.01, 0.014)))

res <- do.call(data.frame,aggregate(cbind(C1,C3)~group, df1,
FUN=function(x) c(Count=length(x), Sum=sum(x))))
res
# group C1.Count C1.Sum C3.Count C3.Sum
#1 <0.005 2 3491509.6 2 91233
#2 0.01 2 302997.1 2 88843
#3 0.014 8 364609.5 8 268809

and you can do the merge as detailed above.

The dplyr approach would be the same except specifying the additional variable

 df1%>%
group_by(group) %>%
summarise_each(funs(n(), Sum=sum(.)), C1, C3)
#Source: local data frame [3 x 5]

# group C1_n C3_n C1_Sum C3_Sum
#1 <0.005 2 2 3491509.6 91233
#2 0.01 2 2 302997.1 88843
#3 0.014 8 8 364609.5 268809

data

df <-structure(list(C1 = c(49488.01172, 268221.1563, 34775.96094, 
13046.98047, 2121699.75, 71155.09375, 1369809.875, 750, 44943.82813,
85585.04688, 31090.10938, 68550.40625), C2 = c("0.0512%", "0.0128%",
"0.0128%", "0.07241%", "0.00453%", "0.0181%", "0.00453%", "0.2048%",
"0.0362%", "0.0362%", "0.0362%", "0.0181%")), .Names = c("C1",
"C2"), row.names = c(NA, -12L), class = "data.frame")

dplyr: Find mean for each bin by groups

You seem to be flailing a bit. You've got correct code, then you've got extra code.

Starting from a fresh R session and defining your data, then

library(dplyr)
res <- df %>% group_by(id, bin, sign) %>%
summarise(Num = n(), value = mean(value,na.rm=TRUE))

The above code is from your question, but I replaced length(bin) with the built-in dplyr::n() function. The above code accurately gives the group-wise averages:

head(res)
# id bin sign Num value
# 1 A [0,1] - 122 -0.08330338
# 2 A [0,1] + 111 0.11394381
# 3 A [0,1] NULL 2 0.75232462
# 4 A (1,2] - 54 -0.09236725
# 5 A (1,2] + 45 0.20581095
# 6 A (2,3] - 12 -0.08998771

Jumping ahead to your last couple lines in the code block:

groupA <- df[df$id == "A" & df$bin == "[0,1]" & df$sign == "NULL", ]
mean(groupA$value, na.rm = TRUE)
# [1] 0.7523246

Which matches the 3rd line of the above results. So you did it, it works fine!

The rest of your code is confused:

res %>% group_by(id) %>%
summarise(total= sum(Num))

I'm not sure what you're trying to accomplish with this, but you don't assign it to anything so it is run but not saved.

As for your ddply attempt:

ddply(df, .(id, bin, sign), summarize, mean = mean(value,na.rm=TRUE))

You'll notice that if you have dplyr loaded and then load the plyr library, there's a message that:

You have loaded plyr after dplyr - this is likely to cause problems.
If you need functions from both plyr and dplyr, please load plyr first, then dplyr:
library(plyr); library(dplyr)

Do not ignore this warning! My guess is this happened, you ignored it, and that's part of the source of your troubles. Probably you don't need plyr at all, but if you do, load it before dplyr!


