Group/bin/bucket data in R and get count per bucket and sum of values per bucket
From the comments, "C2" seems to be "character" column with %
as suffix. Before, creating a group, remove the %
using sub
, convert to "numeric" (as.numeric
). The variable "group" is created (transform(df,...)
) by using the function cut
with breaks
(group buckets/intervals) and labels
(for the desired group labels) arguments. Once the group variable is created, the sum
of the "C1" by "group" and the "count" of elements within "group" can be done using aggregate
from "base R"
df1 <- transform(df, group=cut(as.numeric(sub('[%]', '', C2)),
breaks=c(-Inf,0.005, 0.010, 0.014, Inf),
labels=c('<0.005', 0.005, 0.01, 0.014)))
res <- do.call(data.frame,aggregate(C1~group, df1,
FUN=function(x) c(Count=length(x), Sum=sum(x))))
dNew <- data.frame(group=levels(df1$group))
merge(res, dNew, all=TRUE)
# group C1.Count C1.Sum
#1 <0.005 2 3491509.6
#2 0.005 NA NA
#3 0.01 2 302997.1
#4 0.014 8 364609.5
or you can use data.table
. setDT
converts the data.frame
to data.table
. Specify the "grouping" variable with by=
and summarize/create the two variables "Count" and "Sum" within the list(
. .N
gives the count of elements within each "group".
library(data.table)
setDT(df1)[, list(Count=.N, Sum=sum(C1)), by=group][]
Or using dplyr
. The %>%
connect the LHS with RHS arguments and chains them together. Use group_by
to specify the "group" variable, and then use summarise_each
or summarise
to get summary count and sum
of the concerned column. summarise_each
would be useful if there are more than one column.
library(dplyr)
df1 %>%
group_by(group) %>%
summarise_each(funs(n(), Sum=sum(.)), C1)
Update
Using the new dataset df
df1 <- transform(df, group=cut(C2, breaks=c(-Inf,0.005, 0.010, 0.014, Inf),
labels=c('<0.005', 0.005, 0.01, 0.014)))
res <- do.call(data.frame,aggregate(cbind(C1,C3)~group, df1,
FUN=function(x) c(Count=length(x), Sum=sum(x))))
res
# group C1.Count C1.Sum C3.Count C3.Sum
#1 <0.005 2 3491509.6 2 91233
#2 0.01 2 302997.1 2 88843
#3 0.014 8 364609.5 8 268809
and you can do the merge
as detailed above.
The dplyr
approach would be the same except specifying the additional variable
df1%>%
group_by(group) %>%
summarise_each(funs(n(), Sum=sum(.)), C1, C3)
#Source: local data frame [3 x 5]
# group C1_n C3_n C1_Sum C3_Sum
#1 <0.005 2 2 3491509.6 91233
#2 0.01 2 2 302997.1 88843
#3 0.014 8 8 364609.5 268809
data
df <-structure(list(C1 = c(49488.01172, 268221.1563, 34775.96094,
13046.98047, 2121699.75, 71155.09375, 1369809.875, 750, 44943.82813,
85585.04688, 31090.10938, 68550.40625), C2 = c("0.0512%", "0.0128%",
"0.0128%", "0.07241%", "0.00453%", "0.0181%", "0.00453%", "0.2048%",
"0.0362%", "0.0362%", "0.0362%", "0.0181%")), .Names = c("C1",
"C2"), row.names = c(NA, -12L), class = "data.frame")
How to bin data based on values in one column, and count occurrences from another column excluding duplicates in R?
Will This work?
df <- data.frame(CNV=c("1:10405137","1:10405137","1:10405137","1:101161140","1:110028467")
,r_value=c(0.035118621,0.070643341,0.391963719,0.376573375,0.950231679))
> df # minimal example
CNV r_value
1 1:10405137 0.03511862
2 1:10405137 0.07064334
3 1:10405137 0.39196372
4 1:101161140 0.37657337
5 1:110028467 0.95023168
df1 <- transform(df, group=cut(r_value,
breaks=c(0,0.1,0.2, 0.3, 0.4, 0.5,1),
labels=c("<0.1","0.1","0.2", "0.3", "0.4", "0.5<")))
res <- do.call(data.frame,aggregate(r_value~group, df1,
FUN=function(x) c(Count=length(x))))
> res # counts of intervals
group r_value
1 <0.1 2
2 0.3 2
3 0.5< 1
dNew <- data.frame(group=levels(df1$group))
dNew <- merge(res, dNew, all=TRUE)
colnames(dNew) <- c("interval","count")
> dNew # count of CNV by interval
interval count
1 <0.1 2
2 0.1 NA
3 0.2 NA
4 0.3 2
5 0.4 NA
6 0.5< 1
adapted from Group/bin/bucket data in R and get count per bucket and sum of values per bucket
Restructuring values in unequal buckets by aggregating on a column in R
One of the issues was having 'NA'
as a character string instead of NA
. Here is a solution with:
df <- data.frame(
id = c('50', '84'),
name = c('david', 'ron'),
date_diff = c(0, -16, -4, -1, 0, -2, -11, -12, -168, -8, 16, na)
)
library(dplyr)
library(tidyr)
df %>%
mutate(
group = cut(
Date_diff,
breaks = c(-Inf,-13,-7,-4,-2,-1,Inf),
labels = c('<-12', '-7 to -12','-4 to -6','-2 to -3', '-1','>0')
),
group = if_else(is.na(group), "NA", as.character(group))
) %>%
group_by(Id, Name, group) %>%
summarise(n = n()) %>%
mutate(Total = sum(n, na.rm = T)) %>%
pivot_wider(names_from = group, values_from = n)
Creating buckets based on a target variable
It's not clear how close to 10% you need the values to add to. Here's a solution that should give acceptable results unless you need the values to add up to exactly 10, in which case you may be very limited in the sets that you can "randomly" choose, as per Konrad Rudolph's comments.
We can replicate your data structure like this:
set.seed(3)
df <- data.frame(DSA = apply(replicate(3, sample(LETTERS, 180, TRUE)), 1,
paste, collapse = ""),
x = rexp(180),
y = rexp(180))
df$x <- 100 * df$x/sum(df$x)
df$y <- 100 * df$y/sum(df$y)
names(df) <- c("DSA", "3M Contribution", "6M Contribution")
So now we have:
head(df)
#> DSA 3M Contribution 6M Contribution
#> 1 EFS 0.6825126 1.2142943
#> 2 ZDL 0.0751734 0.1383280
#> 3 LWK 1.6487536 0.6453974
#> 4 GBY 2.1218616 0.2870302
#> 5 DDU 0.3414601 2.6318194
#> 6 ZHT 0.2341914 0.4497800
There are 180 rows, each with a unique 3 letter identifier for DSA
and each of the 3M Contribtion
and 6M contribution
columns add to 100:
length(unique(df$DSA))
#> [1] 180
sum(df$`3M Contribution`)
#> [1] 100
sum(df$`6M Contribution`)
#> [1] 100
Now that we have the data, we can select a random ordering of the data frame by doing:
sample_df <- df[sample(nrow(df)),]
All this does is put the data frame rows in an arbitrary order. However, we can now get only the first n
rows that allow 3M contribution
to (approximately) add to 10 like this:
sample_df <- sample_df[cumsum(sample_df$`3M Contribution`) < 10,]
sample_df
#> DSA 3M Contribution 6M Contribution
#> 59 QQQ 0.85167563 0.02718667
#> 43 WJY 0.35287431 1.40795918
#> 165 ISF 0.47397401 0.32810467
#> 114 GVV 0.35121926 0.56594478
#> 42 EHT 1.44584224 0.33704927
#> 71 DJX 0.84489716 0.16338381
#> 2 ZDL 0.07517340 0.13832804
#> 110 PZM 0.98701927 0.07642120
#> 11 JME 0.22959383 2.31503532
#> 153 BER 0.09994387 0.61655857
#> 81 UXU 0.22244869 0.31506248
#> 166 QYV 0.86343465 1.62441563
#> 171 SPH 2.43818827 0.84201677
#> 131 MAB 0.64740835 1.06533067
And we can see that we have a completely random selection of 14 DSAs that add up to (almost) 10:
sum(sample_df$`3M Contribution`)
#> [1] 9.883693
And if we run it again, we get a completely different set
sample_df <- df[sample(nrow(df)),]
sample_df <- sample_df[cumsum(sample_df$`3M Contribution`) < 10,]
sample_df
#> DSA 3M Contribution 6M Contribution
#> 165 ISF 0.47397401 0.32810467
#> 2 ZDL 0.07517340 0.13832804
#> 129 XAN 0.68953299 0.23862892
#> 37 OFV 0.32093896 0.54626344
#> 56 ZSU 0.85837341 0.15694326
#> 42 EHT 1.44584224 0.33704927
#> 115 YEB 0.05722397 0.81590015
#> 89 SQV 0.34496853 0.32799710
#> 67 DBT 0.82052602 0.47833734
#> 124 ZOI 0.01553935 0.49215515
#> 125 GLS 1.87641137 0.70018877
#> 91 AOW 0.55159848 0.09659654
#> 38 JPH 0.05702578 0.02310634
#> 62 RGO 0.57398460 0.20540972
#> 139 LOA 0.34511367 0.38109741
#> 154 NWH 1.04575037 1.08159278
#> 114 GVV 0.35121926 0.56594478
#> 47 DIL 0.07265622 0.18332102
On this occasion, we have a set of 18 randomly chosen DSAs that again add up to almost 10:
sum(sample_df$`3M Contribution`)
#> [1] 9.975853
And if you want to get 10 such samples, you can get them all in a list, like so:
all_samples <- lapply(1:10, function(x) {
sample_df <- df[sample(nrow(df)),];
sample_df[cumsum(sample_df$`3M Contribution`) < 10,]
})
Now we have a list of 10 samples called all_samples
, and we can get each just by using subscripts. For example, the fifth such sample would be obtained like this:
all_samples[[5]]
#> DSA 3M Contribution 6M Contribution
#> 147 YQZ 0.85928604 0.07089867
#> 84 WUD 0.15243070 0.09202777
#> 103 WSQ 0.27323608 1.79443012
#> 166 QYV 0.86343465 1.62441563
#> 1 EFS 0.68251264 1.21429430
#> 52 YTJ 0.61530262 0.47584261
#> 98 DVY 0.61661083 0.46831908
#> 3 LWK 1.64875363 0.64539744
#> 8 KDV 0.05386368 0.31977784
#> 41 CQE 0.85948052 0.14494440
#> 99 YSU 0.69153090 0.96764012
#> 72 LZZ 0.77170284 0.29414941
#> 170 BRA 0.15377281 0.54707772
#> 95 OAF 0.68996185 0.73363205
#> 69 ICC 0.71197929 0.16501976
#> 82 BMR 0.12875373 0.07509324
#> 169 HWR 0.01473804 0.71425244
#> 124 ZOI 0.01553935 0.49215515
#> 157 SGN 0.06565888 0.29731670
And we can see that the sum is still close to 10%:
sum(all_samples[[5]]$`3M Contribution`)
[1] 9.868549
Grouping time and counting instances by 12 hour bins in R
A possible solution in base R:
# convert the 'timestamp' column to a datetime format
df1$timestamp <- as.POSIXct(strptime(df1$timestamp, format = '%d-%m-%Y %H:%M'))
# create day.of.month variable
df1$day.of.month <- format(df1$timestamp, '%d')
# extract the 12 hour interval as am/pm values
df1$group <- gsub('[0-9: ]+','\\1',format(df1$timestamp, '%r'))
# aggregate
aggregate(. ~ group + day.of.month, df1, length)
which gives:
group day.of.month timestamp
1 am 01 6
2 pm 01 2
3 am 02 1
4 pm 02 1
Another solution using data.table
and and the pm
function of lubridate
:
library(lubridate)
library(data.table)
setDT(df1)[, timestamp := dmy_hm(timestamp)
][, group := pm(timestamp)+1
][, .N, .(day.of.month = day(timestamp),group)]
which gives:
day.of.month group N
1: 1 1 6
2: 1 2 2
3: 2 1 1
4: 2 2 1
Used data:
df1 <- structure(list(timestamp = c("01-12-2015 00:04", "01-12-2015 02:20", "01-12-2015 02:43", "01-12-2015 04:31", "01-12-2015 08:51",
"01-12-2015 11:28", "01-12-2015 20:53", "01-12-2015 21:28", "02-12-2015 00:30", "02-12-2015 20:22")),
.Names = "timestamp", class = "data.frame", row.names = c(NA,-10L))
sql query that groups different items into buckets
An expanded option from what Kerrek described, you can do you grouping based on a case/when
select
case when price >= 0 and price <= 10 then ' 0 - 10'
when price > 10 and price <= 50 then ' 10+ - 50'
when price > 50 and price <= 100 then ' 50+ - 100'
else 'over 100'
end PriceRange,
count(*) as TotalWithinRange
from
YourTable
group by 1
Here, the "group by 1" represents the ordinal column in your select statement... in this case, the case/when as TotalWithinRange.
Related Topics
How to Place +/- Plus Minus Operator in Text Annotation of Plot (Ggplot2)
Robust Standard Errors for Mixed-Effects Models in Lme4 Package of R
Importing an Excel File with Greek Characters into R in The Correct Encoding
What Happens When Prob Argument in Sample Sums to Less/Greater Than 1
Adding Row to a Data Frame with Missing Values
Manually Set Order of Fill Bars in Arbitrary Order Using Ggplot2
How to Add Columnn Titles in a Sankey Chart Networkd3
R: How to Overlay Pie Charts on 'Dots' in a Scatterplot in R
How to Debug Methods from Reference Classes
The Fastest Way to Convert Numeric to Character in R
R: How to Prompt The User for Input from The Console
Standard Deviation on Dataframe Does Not Work
Error in Dev.Off(): Cannot Shut Down Device 1 (The Null Device)