Group/Bin/Bucket Data in R and Get Count Per Bucket and Sum of Values Per Bucket


From the comments, "C2" seems to be a "character" column with % as a suffix. Before creating a group, remove the % using sub and convert the column to "numeric" (as.numeric). The variable "group" is created (transform(df, ...)) using the function cut with the breaks (group buckets/intervals) and labels (the desired group labels) arguments. Once the group variable is created, the sum of "C1" by "group" and the "count" of elements within each "group" can be obtained with aggregate from "base R":

df1 <- transform(df, group=cut(as.numeric(sub('[%]', '', C2)),
                               breaks=c(-Inf, 0.005, 0.010, 0.014, Inf),
                               labels=c('<0.005', 0.005, 0.01, 0.014)))

res <- do.call(data.frame, aggregate(C1 ~ group, df1,
               FUN=function(x) c(Count=length(x), Sum=sum(x))))

dNew <- data.frame(group=levels(df1$group))
merge(res, dNew, all=TRUE)
# group C1.Count C1.Sum
#1 <0.005 2 3491509.6
#2 0.005 NA NA
#3 0.01 2 302997.1
#4 0.014 8 364609.5

Or you can use data.table. setDT converts the data.frame to a data.table. Specify the "grouping" variable with by= and summarize/create the two variables "Count" and "Sum" within list(). .N gives the count of elements within each "group".

library(data.table)
setDT(df1)[, list(Count=.N, Sum=sum(C1)), by=group][]

Or use dplyr. The %>% operator connects the LHS with the RHS and chains the operations together. Use group_by to specify the "group" variable, and then use summarise_each or summarise to get the count and sum of the column of interest. summarise_each is useful when there is more than one column to summarize.

library(dplyr)
df1 %>%
  group_by(group) %>%
  summarise_each(funs(n(), Sum=sum(.)), C1)
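
Note that summarise_each() and funs() are superseded in current dplyr (version 1.0 and later); a minimal equivalent sketch, assuming a recent dplyr is installed:

library(dplyr)
df1 %>%
  group_by(group) %>%
  # .groups = "drop" requires dplyr >= 1.0
  summarise(Count = n(), Sum = sum(C1), .groups = "drop")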

Update

Using the new dataset df

df1 <- transform(df, group=cut(C2, breaks=c(-Inf, 0.005, 0.010, 0.014, Inf),
                               labels=c('<0.005', 0.005, 0.01, 0.014)))

res <- do.call(data.frame, aggregate(cbind(C1, C3) ~ group, df1,
               FUN=function(x) c(Count=length(x), Sum=sum(x))))
res
# group C1.Count C1.Sum C3.Count C3.Sum
#1 <0.005 2 3491509.6 2 91233
#2 0.01 2 302997.1 2 88843
#3 0.014 8 364609.5 8 268809

and you can do the merge as detailed above.

The dplyr approach is the same except for specifying the additional variable:

df1 %>%
  group_by(group) %>%
  summarise_each(funs(n(), Sum=sum(.)), C1, C3)
#Source: local data frame [3 x 5]

# group C1_n C3_n C1_Sum C3_Sum
#1 <0.005 2 2 3491509.6 91233
#2 0.01 2 2 302997.1 88843
#3 0.014 8 8 364609.5 268809
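
For the multi-column case, current dplyr replaces summarise_each() with across(); a sketch assuming dplyr >= 1.0, which should reproduce the per-column counts and sums above:

library(dplyr)
df1 %>%
  group_by(group) %>%
  # across() applies each function in the list to C1 and C3,
  # producing columns named C1_n, C1_Sum, C3_n, C3_Sum
  summarise(across(c(C1, C3), list(n = ~ n(), Sum = sum)), .groups = "drop")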

data

df <-structure(list(C1 = c(49488.01172, 268221.1563, 34775.96094, 
13046.98047, 2121699.75, 71155.09375, 1369809.875, 750, 44943.82813,
85585.04688, 31090.10938, 68550.40625), C2 = c("0.0512%", "0.0128%",
"0.0128%", "0.07241%", "0.00453%", "0.0181%", "0.00453%", "0.2048%",
"0.0362%", "0.0362%", "0.0362%", "0.0181%")), .Names = c("C1",
"C2"), row.names = c(NA, -12L), class = "data.frame")

How to bin data based on values in one column, and count occurrences from another column excluding duplicates in R?

Will this work?

df <- data.frame(CNV=c("1:10405137","1:10405137","1:10405137","1:101161140","1:110028467")
,r_value=c(0.035118621,0.070643341,0.391963719,0.376573375,0.950231679))

> df # minimal example
CNV r_value
1 1:10405137 0.03511862
2 1:10405137 0.07064334
3 1:10405137 0.39196372
4 1:101161140 0.37657337
5 1:110028467 0.95023168

df1 <- transform(df, group=cut(r_value,
breaks=c(0,0.1,0.2, 0.3, 0.4, 0.5,1),
labels=c("<0.1","0.1","0.2", "0.3", "0.4", "0.5<")))

res <- do.call(data.frame,aggregate(r_value~group, df1,
FUN=function(x) c(Count=length(x))))

> res # counts of intervals
group r_value
1 <0.1 2
2 0.3 2
3 0.5< 1

dNew <- data.frame(group=levels(df1$group))
dNew <- merge(res, dNew, all=TRUE)
colnames(dNew) <- c("interval","count")

> dNew # count of CNV by interval
interval count
1 <0.1 2
2 0.1 NA
3 0.2 NA
4 0.3 2
5 0.4 NA
6 0.5< 1
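
The question also asks to count occurrences excluding duplicates; if a CNV that appears several times within the same interval should be counted only once, counting distinct values is a small change. A minimal sketch, assuming that is the intended de-duplication:

# count distinct CNV values per r_value interval (duplicates counted once)
res_unique <- aggregate(CNV ~ group, df1, FUN = function(x) length(unique(x)))
colnames(res_unique) <- c("interval", "distinct_CNV")
# keep empty intervals, as in the count above
merge(res_unique, data.frame(interval = levels(df1$group)), all = TRUE)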

adapted from Group/bin/bucket data in R and get count per bucket and sum of values per bucket

Restructuring values in unequal buckets by aggregating on a column in R

One of the issues was having 'NA' as a character string instead of NA. Here is a solution with dplyr and tidyr:

df <- data.frame(
  Id = c('50', '84'),
  Name = c('david', 'ron'),
  Date_diff = c(0, -16, -4, -1, 0, -2, -11, -12, -168, -8, 16, NA)
)

library(dplyr)
library(tidyr)

df %>%
  mutate(
    group = cut(
      Date_diff,
      breaks = c(-Inf, -13, -7, -4, -2, -1, Inf),
      labels = c('<-12', '-7 to -12', '-4 to -6', '-2 to -3', '-1', '>0')
    ),
    group = if_else(is.na(group), "NA", as.character(group))
  ) %>%
  group_by(Id, Name, group) %>%
  summarise(n = n()) %>%
  mutate(Total = sum(n, na.rm = TRUE)) %>%
  pivot_wider(names_from = group, values_from = n)

Creating buckets based on a target variable

It's not clear how close to 10% you need the values to add to. Here's a solution that should give acceptable results unless you need the values to add up to exactly 10, in which case you may be very limited in the sets that you can "randomly" choose, as per Konrad Rudolph's comments.

We can replicate your data structure like this:

set.seed(3)

df <- data.frame(DSA = apply(replicate(3, sample(LETTERS, 180, TRUE)), 1,
paste, collapse = ""),
x = rexp(180),
y = rexp(180))

df$x <- 100 * df$x/sum(df$x)
df$y <- 100 * df$y/sum(df$y)

names(df) <- c("DSA", "3M Contribution", "6M Contribution")

So now we have:

head(df)
#> DSA 3M Contribution 6M Contribution
#> 1 EFS 0.6825126 1.2142943
#> 2 ZDL 0.0751734 0.1383280
#> 3 LWK 1.6487536 0.6453974
#> 4 GBY 2.1218616 0.2870302
#> 5 DDU 0.3414601 2.6318194
#> 6 ZHT 0.2341914 0.4497800

There are 180 rows, each with a unique 3-letter identifier for DSA, and the 3M Contribution and 6M Contribution columns each add up to 100:

length(unique(df$DSA))
#> [1] 180

sum(df$`3M Contribution`)
#> [1] 100

sum(df$`6M Contribution`)
#> [1] 100

Now that we have the data, we can select a random ordering of the data frame by doing:

sample_df <- df[sample(nrow(df)),]

All this does is put the data frame rows in an arbitrary order. We can now keep only the leading rows whose cumulative 3M Contribution stays (approximately) below 10, like this:

sample_df <- sample_df[cumsum(sample_df$`3M Contribution`) < 10,]

sample_df
#> DSA 3M Contribution 6M Contribution
#> 59 QQQ 0.85167563 0.02718667
#> 43 WJY 0.35287431 1.40795918
#> 165 ISF 0.47397401 0.32810467
#> 114 GVV 0.35121926 0.56594478
#> 42 EHT 1.44584224 0.33704927
#> 71 DJX 0.84489716 0.16338381
#> 2 ZDL 0.07517340 0.13832804
#> 110 PZM 0.98701927 0.07642120
#> 11 JME 0.22959383 2.31503532
#> 153 BER 0.09994387 0.61655857
#> 81 UXU 0.22244869 0.31506248
#> 166 QYV 0.86343465 1.62441563
#> 171 SPH 2.43818827 0.84201677
#> 131 MAB 0.64740835 1.06533067

And we can see that we have a completely random selection of 14 DSAs that add up to (almost) 10:

sum(sample_df$`3M Contribution`)
#> [1] 9.883693

And if we run it again, we get a completely different set

sample_df <- df[sample(nrow(df)),]
sample_df <- sample_df[cumsum(sample_df$`3M Contribution`) < 10,]

sample_df
#> DSA 3M Contribution 6M Contribution
#> 165 ISF 0.47397401 0.32810467
#> 2 ZDL 0.07517340 0.13832804
#> 129 XAN 0.68953299 0.23862892
#> 37 OFV 0.32093896 0.54626344
#> 56 ZSU 0.85837341 0.15694326
#> 42 EHT 1.44584224 0.33704927
#> 115 YEB 0.05722397 0.81590015
#> 89 SQV 0.34496853 0.32799710
#> 67 DBT 0.82052602 0.47833734
#> 124 ZOI 0.01553935 0.49215515
#> 125 GLS 1.87641137 0.70018877
#> 91 AOW 0.55159848 0.09659654
#> 38 JPH 0.05702578 0.02310634
#> 62 RGO 0.57398460 0.20540972
#> 139 LOA 0.34511367 0.38109741
#> 154 NWH 1.04575037 1.08159278
#> 114 GVV 0.35121926 0.56594478
#> 47 DIL 0.07265622 0.18332102

On this occasion, we have a set of 18 randomly chosen DSAs that again add up to almost 10:

sum(sample_df$`3M Contribution`)
#> [1] 9.975853

And if you want to get 10 such samples, you can get them all in a list, like so:

all_samples <- lapply(1:10, function(x) {
sample_df <- df[sample(nrow(df)),];
sample_df[cumsum(sample_df$`3M Contribution`) < 10,]
})

Now we have a list of 10 samples called all_samples, and we can get each just by using subscripts. For example, the fifth such sample would be obtained like this:

all_samples[[5]]
#> DSA 3M Contribution 6M Contribution
#> 147 YQZ 0.85928604 0.07089867
#> 84 WUD 0.15243070 0.09202777
#> 103 WSQ 0.27323608 1.79443012
#> 166 QYV 0.86343465 1.62441563
#> 1 EFS 0.68251264 1.21429430
#> 52 YTJ 0.61530262 0.47584261
#> 98 DVY 0.61661083 0.46831908
#> 3 LWK 1.64875363 0.64539744
#> 8 KDV 0.05386368 0.31977784
#> 41 CQE 0.85948052 0.14494440
#> 99 YSU 0.69153090 0.96764012
#> 72 LZZ 0.77170284 0.29414941
#> 170 BRA 0.15377281 0.54707772
#> 95 OAF 0.68996185 0.73363205
#> 69 ICC 0.71197929 0.16501976
#> 82 BMR 0.12875373 0.07509324
#> 169 HWR 0.01473804 0.71425244
#> 124 ZOI 0.01553935 0.49215515
#> 157 SGN 0.06565888 0.29731670

And we can see that the sum is still close to 10%:

sum(all_samples[[5]]$`3M Contribution`)
#> [1] 9.868549
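
By construction, the cumulative-sum filter keeps each sample's total just under 10, which can be checked for all ten samples at once; a small convenience sketch:

# total 3M Contribution of every sample in the list
sapply(all_samples, function(s) sum(s$`3M Contribution`))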

Grouping time and counting instances by 12 hour bins in R

A possible solution in base R:

# convert the 'timestamp' column to a datetime format
df1$timestamp <- as.POSIXct(strptime(df1$timestamp, format = '%d-%m-%Y %H:%M'))
# create day.of.month variable
df1$day.of.month <- format(df1$timestamp, '%d')
# extract the 12 hour interval as am/pm values
df1$group <- gsub('[0-9: ]+', '', format(df1$timestamp, '%r'))
# aggregate
aggregate(. ~ group + day.of.month, df1, length)

which gives:

  group day.of.month timestamp
1 am 01 6
2 pm 01 2
3 am 02 1
4 pm 02 1

Another solution using data.table and the pm function from lubridate:

library(lubridate)
library(data.table)
setDT(df1)[, timestamp := dmy_hm(timestamp)
][, group := pm(timestamp)+1
][, .N, .(day.of.month = day(timestamp),group)]

which gives:

   day.of.month group N
1: 1 1 6
2: 1 2 2
3: 2 1 1
4: 2 2 1

Used data:

df1 <- structure(list(timestamp = c("01-12-2015 00:04", "01-12-2015 02:20", "01-12-2015 02:43", "01-12-2015 04:31", "01-12-2015 08:51", 
"01-12-2015 11:28", "01-12-2015 20:53", "01-12-2015 21:28", "02-12-2015 00:30", "02-12-2015 20:22")),
.Names = "timestamp", class = "data.frame", row.names = c(NA,-10L))

sql query that groups different items into buckets

Expanding on what Kerrek described, you can do your grouping based on a case/when:

select
  case when price >= 0 and price <= 10 then ' 0 - 10'
       when price > 10 and price <= 50 then ' 10+ - 50'
       when price > 50 and price <= 100 then ' 50+ - 100'
       else 'over 100'
  end PriceRange,
  count(*) as TotalWithinRange
from
  YourTable
group by 1

Here, the "group by 1" refers to the ordinal position of the column in your select statement... in this case, the case/when expression aliased as PriceRange.


