R Calculate the Average of One Column Corresponding to Each Bin of Another Column

R calculate the average of one column corresponding to each bin of another column

Alternatively, you can use the wonderful plyr package.

library(plyr)
ddply(df, .(cut(df$r, 5)), colwise(mean))

However, if you have to ask a question like the above, you are just fine with the tapply solution.

R aggregate data in one column based on 2 other columns

library(plyr) 

#I am using cut function with 50 breaks for both v1 and v2 and ddply from plyr package for computing the mean

newdata<-ddply(df,.(cut(v1,50),cut(v2,50)),summarise,mean.v3=mean(v3))
> head(newdata)
cut(v1, 50) cut(v2, 50) mean.v3
1 (-49.4,-47.5] (-34.7,-32.7] 18.123
2 (-49.4,-47.5] (-0.576,1.43] 20.887
3 (-49.4,-47.5] (15.5,17.5] 20.887
4 (-47.5,-45.5] (-52.7,-50.7] 9.918
5 (-47.5,-45.5] (-44.7,-42.7] 14.477
6 (-47.5,-45.5] (-34.7,-32.7] 16.314

Updated as per the comments: If you want the lower, middle and mid-points, you can use the following function or use with details as follow(you need to use the sub function to deal with ( and ]):

    df$newv1<-with(df,cut(v1,50)) 
df$newv2<-with(df,cut(v2,50))
df$lowerv1<-with(df,as.numeric( sub("\\((.+),.*", "\\1", newv1))) #lower value
df$upperv1<-with(df,as.numeric( sub("[^,]*,([^]]*)\\]", "\\1", newv1))) # upper value
df$midv1<-with(df,(lowerv1+upperv1)/2) #mid value
df$lowerv2<-with(df,as.numeric( sub("\\((.+),.*", "\\1",newv2))) #lower value
df$upperv2<-with(df,as.numeric( sub("[^,]*,([^]]*)\\]", "\\1", newv2))) # upper value
df$midv2<-with(df,(lowerv2+upperv2)/2)#mid value
newdata<-ddply(df,.(newv1,newv2),transform,mean.v3=mean(v3))

> head(newdata)
v1 v2 v3 newv1 newv2 lowerv1 upperv1 midv1 lowerv2 upperv2 midv2 mean.v3
1 -47.456 -32.714 18.123 (-49.4,-47.5] (-34.7,-32.7] -49.4 -47.5 -48.45 -34.700 -32.70 -33.700 18.123
2 -49.329 -0.465 20.887 (-49.4,-47.5] (-0.576,1.43] -49.4 -47.5 -48.45 -0.576 1.43 0.427 20.887
3 -48.652 16.558 20.800 (-49.4,-47.5] (15.5,17.5] -49.4 -47.5 -48.45 15.500 17.50 16.500 20.887
4 -48.323 17.153 20.974 (-49.4,-47.5] (15.5,17.5] -49.4 -47.5 -48.45 15.500 17.50 16.500 20.887
5 -45.713 -52.599 9.918 (-47.5,-45.5] (-52.7,-50.7] -47.5 -45.5 -46.50 -52.700 -50.70 -51.700 9.918
6 -45.805 -43.071 14.477 (-47.5,-45.5] (-44.7,-42.7] -47.5 -45.5 -46.50 -44.700 -42.70 -43.700 14.477

Timeseries average based on a defined time interval (bin)

There are many ways to calculate a binned average: with base aggregate,by, with the packages dplyr, data.table, probably with zoo and surely other timeseries packages...

library(dplyr)
df %>%
group_by(interval = round(df$ts/10)*10) %>%
summarize(Var_mean = mean(Var))
# A tibble: 11 x 2
interval Var_mean
<dbl> <dbl>
1 0 4.561653
2 10 6.544980
3 20 6.110336
4 30 4.288523
5 40 5.339249
6 50 6.811147
7 60 6.180795
8 70 4.920476
9 80 5.486937
10 90 5.284871
11 100 5.917074

That's the dplyr approach, see how it and data.table let you name the intermediate variables, which keeps code clean and legible.

dplyr: Find mean for each bin by groups

You seem to be flailing a bit. You've got correct code, then you've got extra code.

Starting from a fresh R session and defining your data, then

library(dplyr)
res <- df %>% group_by(id, bin, sign) %>%
summarise(Num = n(), value = mean(value,na.rm=TRUE))

The above code is from your question, but I replaced length(bin) with the built-in dplyr::n() function. The above code accurately gives the group-wise averages:

head(res)
# id bin sign Num value
# 1 A [0,1] - 122 -0.08330338
# 2 A [0,1] + 111 0.11394381
# 3 A [0,1] NULL 2 0.75232462
# 4 A (1,2] - 54 -0.09236725
# 5 A (1,2] + 45 0.20581095
# 6 A (2,3] - 12 -0.08998771

Jumping ahead to your last couple lines in the code block:

groupA = df[df$id=="A" & df$bin=="[0, 1]" & df$sign=="NULL", ]
# mean(groupA$value, na.rm=T)
# [1] 0.7523246

Which matches the 3rd line of the above results. So you did it, it works fine!

The rest of your code is confused:

res %>% group_by(id) %>%
summarise(total= sum(Num))

I'm not sure what you're trying to accomplish with this, but you don't assign it to anything so it is run but not saved.

As for your ddply attempt:

ddply(df, .(id, bin, sign), summarize, mean = mean(value,na.rm=TRUE))

You'll notice that if you have dplyr loaded and then load the plyr library, there's a message that:

You have loaded plyr after dplyr - this is likely to cause problems.
If you need functions from both plyr and dplyr, please load plyr first, then dplyr:
library(plyr); library(dplyr)

Do not ignore this warning! My guess is this happened, you ignored it, and that's part of the source of your troubles. Probably you don't need plyr at all, but if you do, load it before dplyr!

Group/bin/bucket data in R and get count per bucket and sum of values per bucket

From the comments, "C2" seems to be "character" column with % as suffix. Before, creating a group, remove the % using sub, convert to "numeric" (as.numeric). The variable "group" is created (transform(df,...)) by using the function cut with breaks (group buckets/intervals) and labels (for the desired group labels) arguments. Once the group variable is created, the sum of the "C1" by "group" and the "count" of elements within "group" can be done using aggregate from "base R"

df1 <-  transform(df, group=cut(as.numeric(sub('[%]', '', C2)), 
breaks=c(-Inf,0.005, 0.010, 0.014, Inf),
labels=c('<0.005', 0.005, 0.01, 0.014)))

res <- do.call(data.frame,aggregate(C1~group, df1,
FUN=function(x) c(Count=length(x), Sum=sum(x))))

dNew <- data.frame(group=levels(df1$group))
merge(res, dNew, all=TRUE)
# group C1.Count C1.Sum
#1 <0.005 2 3491509.6
#2 0.005 NA NA
#3 0.01 2 302997.1
#4 0.014 8 364609.5

or you can use data.table. setDT converts the data.frame to data.table. Specify the "grouping" variable with by= and summarize/create the two variables "Count" and "Sum" within the list(. .N gives the count of elements within each "group".

 library(data.table)
setDT(df1)[, list(Count=.N, Sum=sum(C1)), by=group][]

Or using dplyr. The %>% connect the LHS with RHS arguments and chains them together. Use group_by to specify the "group" variable, and then use summarise_each or summarise to get summary count and sum of the concerned column. summarise_each would be useful if there are more than one column.

 library(dplyr)
df1 %>%
group_by(group) %>%
summarise_each(funs(n(), Sum=sum(.)), C1)

Update

Using the new dataset df

df1 <- transform(df, group=cut(C2,  breaks=c(-Inf,0.005, 0.010, 0.014, Inf),
labels=c('<0.005', 0.005, 0.01, 0.014)))

res <- do.call(data.frame,aggregate(cbind(C1,C3)~group, df1,
FUN=function(x) c(Count=length(x), Sum=sum(x))))
res
# group C1.Count C1.Sum C3.Count C3.Sum
#1 <0.005 2 3491509.6 2 91233
#2 0.01 2 302997.1 2 88843
#3 0.014 8 364609.5 8 268809

and you can do the merge as detailed above.

The dplyr approach would be the same except specifying the additional variable

 df1%>%
group_by(group) %>%
summarise_each(funs(n(), Sum=sum(.)), C1, C3)
#Source: local data frame [3 x 5]

# group C1_n C3_n C1_Sum C3_Sum
#1 <0.005 2 2 3491509.6 91233
#2 0.01 2 2 302997.1 88843
#3 0.014 8 8 364609.5 268809

data

df <-structure(list(C1 = c(49488.01172, 268221.1563, 34775.96094, 
13046.98047, 2121699.75, 71155.09375, 1369809.875, 750, 44943.82813,
85585.04688, 31090.10938, 68550.40625), C2 = c("0.0512%", "0.0128%",
"0.0128%", "0.07241%", "0.00453%", "0.0181%", "0.00453%", "0.2048%",
"0.0362%", "0.0362%", "0.0362%", "0.0181%")), .Names = c("C1",
"C2"), row.names = c(NA, -12L), class = "data.frame")

Calculate mean of column based on another column

The shortest solution with GNU datamash:

datamash -st, -g1 mean 2 mean 3 mean 4 <file
  • -s - sort records

  • -t, - set comma , as field separator

  • -g1 - group records by the 1st field


The output:

0.5,4.178,0.7669464,0.009579418
0.6,3.736,0.7655912,0.011483042
0.7,3.8425,0.77699725,0.01570746

Replacing data in column with mean value of corresponding bin?

It's exactly as you laid out. Using this technique to get nearest

df = pd.DataFrame({"col":[4, 8, 15, 21, 21, 24, 25, 28, 34]})

df2 = df.assign(bin=pd.qcut(df.col, 3),
colbmean=lambda dfa: dfa.groupby("bin").transform("mean"),
colbin=lambda dfa: dfa.apply(lambda r: min([r.bin.left,r.bin.right], key=lambda x: abs(x-r.col)), axis=1))



Leave a reply



Submit