R calculate the average of one column corresponding to each bin of another column
Alternatively, you can use the wonderful plyr package.
library(plyr)
ddply(df, .(cut(df$r, 5)), colwise(mean))
However, for a question like the one above, the tapply solution is perfectly adequate.
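The tapply solution mentioned above is not reproduced in this answer. A minimal base R sketch of the idea, assuming a data frame df with a column r to bin and a hypothetical value column v (the data and the name v are made up for illustration), might look like this:

```r
# Hypothetical data: r is the column to bin, v is the column to average.
set.seed(1)
df <- data.frame(r = runif(100), v = rnorm(100))

# cut() splits r into 5 equal-width bins (a factor);
# tapply() then averages v within each level of that factor.
bin <- cut(df$r, 5)
bin_means <- tapply(df$v, bin, mean)
bin_means
```

The result is a named vector with one mean per bin, named by the interval labels that cut produces.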
R aggregate data in one column based on 2 other columns
library(plyr)
# Use cut() with 50 breaks for both v1 and v2, and ddply() from the plyr package to compute the mean
newdata<-ddply(df,.(cut(v1,50),cut(v2,50)),summarise,mean.v3=mean(v3))
> head(newdata)
cut(v1, 50) cut(v2, 50) mean.v3
1 (-49.4,-47.5] (-34.7,-32.7] 18.123
2 (-49.4,-47.5] (-0.576,1.43] 20.887
3 (-49.4,-47.5] (15.5,17.5] 20.887
4 (-47.5,-45.5] (-52.7,-50.7] 9.918
5 (-47.5,-45.5] (-44.7,-42.7] 14.477
6 (-47.5,-45.5] (-34.7,-32.7] 16.314
Updated as per the comments: if you want the lower bound, upper bound and mid-point of each bin, you can extract them as follows (you need the sub function to strip the ( and ] from the bin labels):
df$newv1<-with(df,cut(v1,50))
df$newv2<-with(df,cut(v2,50))
df$lowerv1<-with(df,as.numeric( sub("\\((.+),.*", "\\1", newv1))) #lower value
df$upperv1<-with(df,as.numeric( sub("[^,]*,([^]]*)\\]", "\\1", newv1))) # upper value
df$midv1<-with(df,(lowerv1+upperv1)/2) #mid value
df$lowerv2<-with(df,as.numeric( sub("\\((.+),.*", "\\1",newv2))) #lower value
df$upperv2<-with(df,as.numeric( sub("[^,]*,([^]]*)\\]", "\\1", newv2))) # upper value
df$midv2<-with(df,(lowerv2+upperv2)/2)#mid value
newdata<-ddply(df,.(newv1,newv2),transform,mean.v3=mean(v3))
> head(newdata)
v1 v2 v3 newv1 newv2 lowerv1 upperv1 midv1 lowerv2 upperv2 midv2 mean.v3
1 -47.456 -32.714 18.123 (-49.4,-47.5] (-34.7,-32.7] -49.4 -47.5 -48.45 -34.700 -32.70 -33.700 18.123
2 -49.329 -0.465 20.887 (-49.4,-47.5] (-0.576,1.43] -49.4 -47.5 -48.45 -0.576 1.43 0.427 20.887
3 -48.652 16.558 20.800 (-49.4,-47.5] (15.5,17.5] -49.4 -47.5 -48.45 15.500 17.50 16.500 20.887
4 -48.323 17.153 20.974 (-49.4,-47.5] (15.5,17.5] -49.4 -47.5 -48.45 15.500 17.50 16.500 20.887
5 -45.713 -52.599 9.918 (-47.5,-45.5] (-52.7,-50.7] -47.5 -45.5 -46.50 -52.700 -50.70 -51.700 9.918
6 -45.805 -43.071 14.477 (-47.5,-45.5] (-44.7,-42.7] -47.5 -45.5 -46.50 -44.700 -42.70 -43.700 14.477
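As a possible alternative to parsing the labels with sub, the bounds can be taken from an explicit break vector. This is only a sketch under assumptions: the data below are made up, and the breaks run exactly from min to max, whereas cut(v1, 50) pads the range by 0.1% and rounds its labels to 3 significant digits by default, so the numbers will differ slightly from the output above.

```r
# Hypothetical data standing in for the v1 column.
set.seed(42)
v1 <- runif(200, -50, 50)

# 51 break points define 50 equal-width bins over the observed range.
breaks <- seq(min(v1), max(v1), length.out = 51)

# labels = FALSE returns the integer bin index instead of a factor;
# include.lowest = TRUE keeps the minimum value inside the first bin.
idx <- cut(v1, breaks, include.lowest = TRUE, labels = FALSE)

lower <- breaks[idx]        # lower bound of each observation's bin
upper <- breaks[idx + 1]    # upper bound of each observation's bin
mid   <- (lower + upper) / 2
head(data.frame(v1, lower, upper, mid))
```

Because the bounds come straight from the break vector, they keep full numeric precision rather than the rounded values shown in the labels.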
Timeseries average based on a defined time interval (bin)
There are many ways to calculate a binned average: with base aggregate or by, with the packages dplyr or data.table, probably with zoo, and surely with other time-series packages...
library(dplyr)
df %>%
group_by(interval = round(ts/10)*10) %>%
summarize(Var_mean = mean(Var))
# A tibble: 11 x 2
interval Var_mean
<dbl> <dbl>
1 0 4.561653
2 10 6.544980
3 20 6.110336
4 30 4.288523
5 40 5.339249
6 50 6.811147
7 60 6.180795
8 70 4.920476
9 80 5.486937
10 90 5.284871
11 100 5.917074
That's the dplyr approach. Note how it and data.table let you name the intermediate variables, which keeps the code clean and legible.
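For comparison, the base aggregate route mentioned at the start could be sketched as follows. The data frame here is made up for illustration; only the column names ts and Var are taken from the example above.

```r
# Hypothetical time series: one observation per time unit from 0 to 100.
set.seed(1)
df <- data.frame(ts = 0:100, Var = rnorm(101, mean = 5))

# Same binning rule as above: round each timestamp to the nearest 10.
# Note that R's round() sends exact halves to the even neighbour,
# so ts = 5 goes to bin 0 and ts = 15 goes to bin 20.
df$interval <- round(df$ts / 10) * 10

# Average Var within each interval.
res <- aggregate(Var ~ interval, data = df, FUN = mean)
names(res)[2] <- "Var_mean"
res
```

This gives one row per interval, as in the dplyr output, at the cost of adding the interval column to df first.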
dplyr: Find mean for each bin by groups
You seem to be flailing a bit. You've got correct code, then you've got extra code.
Starting from a fresh R session and defining your data, then
library(dplyr)
res <- df %>% group_by(id, bin, sign) %>%
summarise(Num = n(), value = mean(value,na.rm=TRUE))
The above code is from your question, but I replaced length(bin) with the built-in dplyr::n() function. The above code accurately gives the group-wise averages:
head(res)
# id bin sign Num value
# 1 A [0,1] - 122 -0.08330338
# 2 A [0,1] + 111 0.11394381
# 3 A [0,1] NULL 2 0.75232462
# 4 A (1,2] - 54 -0.09236725
# 5 A (1,2] + 45 0.20581095
# 6 A (2,3] - 12 -0.08998771
Jumping ahead to your last couple lines in the code block:
groupA = df[df$id=="A" & df$bin=="[0, 1]" & df$sign=="NULL", ]
# mean(groupA$value, na.rm=T)
# [1] 0.7523246
Which matches the 3rd line of the above results. So you did it, it works fine!
The rest of your code is confused:
res %>% group_by(id) %>%
summarise(total= sum(Num))
I'm not sure what you're trying to accomplish with this, but you don't assign it to anything so it is run but not saved.
As for your ddply attempt:
ddply(df, .(id, bin, sign), summarize, mean = mean(value,na.rm=TRUE))
You'll notice that if you have dplyr loaded and then load the plyr library, there's a message that:
You have loaded plyr after dplyr - this is likely to cause problems.
If you need functions from both plyr and dplyr, please load plyr first, then dplyr:
library(plyr); library(dplyr)
Do not ignore this warning! My guess is this happened, you ignored it, and that's part of the source of your troubles. Probably you don't need plyr at all, but if you do, load it before dplyr!
Group/bin/bucket data in R and get count per bucket and sum of values per bucket
From the comments, "C2" seems to be a "character" column with % as a suffix. Before creating the group, remove the % using sub and convert the result to "numeric" (as.numeric). The variable "group" is created (transform(df, ...)) by using the function cut with the breaks (group buckets/intervals) and labels (the desired group labels) arguments. Once the group variable is created, the sum of "C1" by "group" and the count of elements within each "group" can be computed using aggregate from base R:
df1 <- transform(df, group=cut(as.numeric(sub('[%]', '', C2)),
breaks=c(-Inf,0.005, 0.010, 0.014, Inf),
labels=c('<0.005', 0.005, 0.01, 0.014)))
res <- do.call(data.frame,aggregate(C1~group, df1,
FUN=function(x) c(Count=length(x), Sum=sum(x))))
dNew <- data.frame(group=levels(df1$group))
merge(res, dNew, all=TRUE)
# group C1.Count C1.Sum
#1 <0.005 2 3491509.6
#2 0.005 NA NA
#3 0.01 2 302997.1
#4 0.014 8 364609.5
Or you can use data.table. setDT converts the data.frame to a data.table. Specify the grouping variable with by= and create the two summary variables "Count" and "Sum" within list(). .N gives the count of elements within each "group".
library(data.table)
setDT(df1)[, list(Count=.N, Sum=sum(C1)), by=group][]
Or using dplyr. The %>% operator passes the LHS as the first argument of the RHS, chaining the calls together. Use group_by to specify the "group" variable, and then use summarise_each or summarise to get the count and sum of the column concerned. summarise_each is useful if there is more than one column. Note that summarise_each and funs are deprecated in current dplyr releases in favour of across.
library(dplyr)
df1 %>%
group_by(group) %>%
summarise_each(funs(n(), Sum=sum(.)), C1)
Update
Using the new dataset df
df1 <- transform(df, group=cut(C2, breaks=c(-Inf,0.005, 0.010, 0.014, Inf),
labels=c('<0.005', 0.005, 0.01, 0.014)))
res <- do.call(data.frame,aggregate(cbind(C1,C3)~group, df1,
FUN=function(x) c(Count=length(x), Sum=sum(x))))
res
# group C1.Count C1.Sum C3.Count C3.Sum
#1 <0.005 2 3491509.6 2 91233
#2 0.01 2 302997.1 2 88843
#3 0.014 8 364609.5 8 268809
and you can do the merge as detailed above.
The dplyr approach would be the same, except for specifying the additional variable:
df1%>%
group_by(group) %>%
summarise_each(funs(n(), Sum=sum(.)), C1, C3)
#Source: local data frame [3 x 5]
# group C1_n C3_n C1_Sum C3_Sum
#1 <0.005 2 2 3491509.6 91233
#2 0.01 2 2 302997.1 88843
#3 0.014 8 8 364609.5 268809
data
df <-structure(list(C1 = c(49488.01172, 268221.1563, 34775.96094,
13046.98047, 2121699.75, 71155.09375, 1369809.875, 750, 44943.82813,
85585.04688, 31090.10938, 68550.40625), C2 = c("0.0512%", "0.0128%",
"0.0128%", "0.07241%", "0.00453%", "0.0181%", "0.00453%", "0.2048%",
"0.0362%", "0.0362%", "0.0362%", "0.0181%")), .Names = c("C1",
"C2"), row.names = c(NA, -12L), class = "data.frame")
Calculate mean of column based on another column
The shortest solution with GNU datamash:
datamash -st, -g1 mean 2 mean 3 mean 4 <file
-s - sort records
-t, - set comma (,) as the field separator
-g1 - group records by the 1st field
The output:
0.5,4.178,0.7669464,0.009579418
0.6,3.736,0.7655912,0.011483042
0.7,3.8425,0.77699725,0.01570746
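For readers working in R rather than the shell, roughly the same grouping can be sketched with base aggregate. The input rows below are invented for illustration and are not the data behind the output above; the assumption is a header-less comma-separated file whose first column is the group key.

```r
# Hypothetical comma-separated input without a header line.
txt <- "0.5,4.0,0.70,0.010
0.5,5.0,0.80,0.020
0.6,3.0,0.75,0.015"

# read.csv names the columns V1..V4 when header = FALSE.
dat <- read.csv(text = txt, header = FALSE)

# Mean of every remaining column, grouped by the first column.
res <- aggregate(. ~ V1, data = dat, FUN = mean)
res
```

To read from a real file, replace text = txt with the file path.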
Replacing data in column with mean value of corresponding bin?
It's exactly as you laid out. Using this technique to get nearest
import pandas as pd
df = pd.DataFrame({"col":[4, 8, 15, 21, 21, 24, 25, 28, 34]})
df2 = df.assign(bin=pd.qcut(df.col, 3),
colbmean=lambda dfa: dfa.groupby("bin").transform("mean"),
colbin=lambda dfa: dfa.apply(lambda r: min([r.bin.left,r.bin.right], key=lambda x: abs(x-r.col)), axis=1))
 | col | bin | colbmean | colbin |
---|---|---|---|---|
0 | 4 | (3.999, 19.0] | 9 | 3.999 |
1 | 8 | (3.999, 19.0] | 9 | 3.999 |
2 | 15 | (3.999, 19.0] | 9 | 19 |
3 | 21 | (19.0, 24.333] | 22 | 19 |
4 | 21 | (19.0, 24.333] | 22 | 19 |
5 | 24 | (19.0, 24.333] | 22 | 24.333 |
6 | 25 | (24.333, 34.0] | 29 | 24.333 |
7 | 28 | (24.333, 34.0] | 29 | 24.333 |
8 | 34 | (24.333, 34.0] | 29 | 34 |