R Data.Table: Subgroup Weighted Percent of Group

This is almost a single step:

# A
widgets[, {
  totwt = .N
  .SD[, .(frac = .N / totwt), by = style]
}, by = color]
# color style frac
# 1: red round 0.36
# 2: red pointy 0.32
# 3: red flat 0.32
# 4: green pointy 0.36
# 5: green flat 0.32
# 6: green round 0.32
# 7: blue flat 0.36
# 8: blue round 0.32
# 9: blue pointy 0.32
# 10: black round 0.36
# 11: black pointy 0.32
# 12: black flat 0.32

# B
widgets[, {
  totwt = sum(weight)
  .SD[, .(frac = sum(weight) / totwt), by = style]
}, by = color]
# color style frac
# 1: red round 0.3466667
# 2: red pointy 0.3466667
# 3: red flat 0.3066667
# 4: green pointy 0.3333333
# 5: green flat 0.3200000
# 6: green round 0.3466667
# 7: blue flat 0.3866667
# 8: blue round 0.2933333
# 9: blue pointy 0.3200000
# 10: black round 0.3733333
# 11: black pointy 0.3333333
# 12: black flat 0.2933333

How it works: construct the denominator for the top-level group (color) first, then drop to the finer grouping (color and style) to tabulate the numerators.
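The widgets table itself is not shown above, so here is a minimal sketch of the same two-step pattern, assuming columns color, style, and weight (the data values are made up, so the fractions will differ from the output above):

```r
library(data.table)

# Hypothetical 'widgets' table -- the original data is not shown,
# so the columns (color, style, weight) are assumed.
set.seed(1)
widgets <- data.table(
  color  = rep(c("red", "green", "blue", "black"), each = 75),
  style  = sample(c("round", "pointy", "flat"), 300, replace = TRUE),
  weight = sample(1:5, 300, replace = TRUE)
)

# Denominator first (per color), then the finer grouping (style):
res <- widgets[, {
  totwt = sum(weight)
  .SD[, .(frac = sum(weight) / totwt), by = style]
}, by = color]

# Sanity check: the fractions within each color sum to 1
res[, sum(frac), by = color]
```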


Alternatives. If styles repeat within each color and the result is only needed for display, try a contingency table:

# A
widgets[, prop.table(table(color, style), 1)]
# style
# color flat pointy round
# black 0.32 0.32 0.36
# blue 0.36 0.32 0.32
# green 0.32 0.36 0.32
# red 0.32 0.32 0.36

# B
widgets[, rep(1L, sum(weight)), by = .(color, style)][, prop.table(table(color, style), 1)]

# style
# color flat pointy round
# black 0.2933333 0.3333333 0.3733333
# blue 0.3866667 0.3200000 0.2933333
# green 0.3200000 0.3333333 0.3466667
# red 0.3066667 0.3466667 0.3466667

For B, this expands the data so that there is one observation for each unit of weight. With large data, such an expansion would be a bad idea, since it costs so much memory. Also, weight has to be a whole number; otherwise sum(weight) is silently truncated when passed to rep() (e.g., rep(1, 2.5) yields only two elements: [1] 1 1).

Calculating the proportion per subgroup with data.table

Using data.table:

df <- read.table(header = TRUE, text = "row country year
1 NLD 2005
2 NLD 2005
3 BLG 2006
4 BLG 2005
5 GER 2005
6 NLD 2007
7 NLD 2005
8 NLD 2008")

setDT(df)[, sum := .N, by = country
        ][, prop := .N, by = c("country", "year")
        ][, prop := prop / sum
        ][, sum := NULL]

row country year prop
1: 1 NLD 2005 0.6
2: 2 NLD 2005 0.6
3: 3 BLG 2006 0.5
4: 4 BLG 2005 0.5
5: 5 GER 2005 1.0
6: 6 NLD 2007 0.2
7: 7 NLD 2005 0.6
8: 8 NLD 2008 0.2
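The temporary sum column can also be avoided by dividing in a second grouped step. A sketch of the same computation, rebuilding the df shown above:

```r
library(data.table)

# Same data as the read.table call above
df <- data.table(row = 1:8,
                 country = c("NLD", "NLD", "BLG", "BLG", "GER", "NLD", "NLD", "NLD"),
                 year = c(2005, 2005, 2006, 2005, 2005, 2007, 2005, 2008))

# Count per (country, year), then divide by the per-country count:
df[, prop := .N, by = .(country, year)][, prop := prop / .N, by = country]
df
```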

Weighted sum of variables by groups with data.table

Final attempt, copying @Roland's excellent answer:

print(dt[, lapply(.SD, function(x, w) sum(x*w), w=w), by=gr][, w := NULL])

Second attempt (still not the most efficient):

Following @Roland's comment, it's indeed faster to do the operation on all columns and then just remove the unwanted ones (as long as the operation itself is not time consuming, which is the case here).

dt[, {lapply(.SD, function(x) sum(x*w))}, by=gr][, w := NULL][]

For some reason, w is not found when the braces {} are omitted; it's not clear why.


Old (inefficient) answer:

(Subsetting can be costly if there are too many groups)

You can do this without .SDcols by dropping the w column from .SD directly as you pass it to lapply:

dt[, lapply(.SD[, -1, with=FALSE], function(x) sum(x*w)), by=gr]
# gr V1 V2 V3 V4
# 1: 1 20 120 220 320
# 2: 2 70 170 270 370

Using .SDcols builds .SD without the w column, so w would not be available for the multiplication within the scope of .SD.
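The original dt is not shown; here is a small hypothetical table that illustrates the idiom (the data and expected sums are made up for this sketch). Within each group, .SD holds w, V1, and V2; dropping its first column removes w from the columns being summed, while w itself is still visible in the grouping scope:

```r
library(data.table)

# Hypothetical small 'dt' -- the original data is not shown
dt <- data.table(gr = c(1, 1, 2, 2),
                 w  = c(1L, 2L, 1L, 3L),
                 V1 = 1:4,
                 V2 = 5:8)

# Drop .SD's first column (w), then weight each remaining column by w:
res <- dt[, lapply(.SD[, -1, with = FALSE], function(x) sum(x * w)), by = gr]
res
#    gr V1 V2
# 1:  1  5 17
# 2:  2 15 31
```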

Summarizing by subgroup percentage in R

Per your comment, if the subgroups are unique you can do

library(dplyr)
group_by(df, group) %>% mutate(percent = value/sum(value))
# group subgroup value percent
# 1 A a 1 0.1250000
# 2 A b 4 0.5000000
# 3 A c 2 0.2500000
# 4 A d 1 0.1250000
# 5 B a 1 0.1666667
# 6 B b 2 0.3333333
# 7 B c 3 0.5000000

Or, to remove the value column and add the percent column at the same time, use transmute:

group_by(df, group) %>% transmute(subgroup, percent = value/sum(value))
# group subgroup percent
# 1 A a 0.1250000
# 2 A b 0.5000000
# 3 A c 0.2500000
# 4 A d 0.1250000
# 5 B a 0.1666667
# 6 B b 0.3333333
# 7 B c 0.5000000
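For reference, the df is not shown above; a data.frame consistent with the printed output (reconstructed from it) would be:

```r
library(dplyr)

# Reconstructed from the output shown above
df <- data.frame(group    = c(rep("A", 4), rep("B", 3)),
                 subgroup = c("a", "b", "c", "d", "a", "b", "c"),
                 value    = c(1, 4, 2, 1, 1, 2, 3))

res <- group_by(df, group) %>% mutate(percent = value / sum(value))
res
```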

Display weighted mean by group in the data.frame

If we use mutate, we can avoid the left_join:

library(dplyr)
df %>%
group_by(education) %>%
mutate(weighted_income = weighted.mean(income, weight))
# obs income education weight weighted_income
# <int> <int> <fctr> <int> <dbl>
#1 1 1000 A 10 1166.667
#2 2 2000 B 1 1583.333
#3 3 1500 B 5 1583.333
#4 4 2000 A 2 1166.667
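Since this page is otherwise data.table-focused, the same per-group weighted mean can be sketched in data.table, rebuilding the data from the output shown above:

```r
library(data.table)

# Data reconstructed from the printed output
df <- data.table(obs = 1:4,
                 income = c(1000L, 2000L, 1500L, 2000L),
                 education = factor(c("A", "B", "B", "A")),
                 weight = c(10L, 1L, 5L, 2L))

# Group-wise weighted mean broadcast back onto every row
df[, weighted_income := weighted.mean(income, weight), by = education]
df
```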

R (data.table) group data by custom range (for example, -18, 18-25, ..., 65+)

@jdharrison is right: cut(...) is the way to go.

library(data.table)
# create sample - you have this already
set.seed(1) # for reproducibility
DT <- data.table(age=sample(15:70,1000,replace=TRUE),
value=rpois(1000,10))

# you start here...
breaks <- c(0,18,25,35,45,65,Inf)
DT[,list(mean=mean(value)),by=list(age=cut(age,breaks=breaks))][order(age)]
# age mean
# 1: (0,18] 10.000000
# 2: (18,25] 9.579365
# 3: (25,35] 10.158192
# 4: (35,45] 9.775510
# 5: (45,65] 9.969697
# 6: (65,Inf] 10.141414

Data.table in R: Fitting a function on a different subgroup than the function's application group

We could use

library(data.table)
dt[, Percentile := ecdf(value[group %in% c("A", "B", "C")])(value), date]

-output

> dt
group date value Percentile
<char> <Date> <num> <num>
1: A 2022-01-08 -0.3260365 0.3333333
2: A 2022-01-09 0.5524619 1.0000000
3: A 2022-01-10 -0.6749438 0.3333333
4: B 2022-01-08 0.2143595 0.6666667
5: B 2022-01-09 0.3107692 0.6666667
6: B 2022-01-10 1.1739663 1.0000000
7: C 2022-01-08 0.6187899 1.0000000
8: C 2022-01-09 -0.1127343 0.3333333
9: C 2022-01-10 0.9170283 0.6666667
10: D 2022-01-08 -0.2232594 0.3333333
11: D 2022-01-09 0.5264481 0.6666667
12: D 2022-01-10 -0.7948444 0.0000000
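The dt can be reconstructed from the values printed above; the key point is that, within each date, the ecdf is fit only on groups A, B, and C, but evaluated on every group's value (including D):

```r
library(data.table)

# dt reconstructed from the printed output above
dt <- data.table(
  group = rep(c("A", "B", "C", "D"), each = 3),
  date  = rep(as.Date(c("2022-01-08", "2022-01-09", "2022-01-10")), times = 4),
  value = c(-0.3260365,  0.5524619, -0.6749438,
             0.2143595,  0.3107692,  1.1739663,
             0.6187899, -0.1127343,  0.9170283,
            -0.2232594,  0.5264481, -0.7948444)
)

# Per date: fit ecdf on groups A-C only, evaluate on all groups
dt[, Percentile := ecdf(value[group %in% c("A", "B", "C")])(value), by = date]
dt
```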

Matching values within group to calculate percent change in R

Based on the description, we can add 'id' as a grouping variable and then calculate 'percent_change' by dividing the deviation of dollars from its group maximum by the dollars column:

library(dplyr)
A %>%
group_by(id, .add = TRUE) %>%
mutate(percent_change = (max(dollars) - dollars)/dollars)

Or maybe:

A %>%
group_by(id, .add = TRUE) %>%
mutate(percent_change = if(n() == 2)
(first(dollars) - last(dollars))/last(dollars) else NA)

-output

# A tibble: 200 × 8
# Groups: retailer_id, store_id, id [123]
week_id retailer_id store_id dollars fill spins_week id percent_change
<chr> <int> <int> <dbl> <dbl> <chr> <chr> <dbl>
1 2021100301 2 167 121818. 0 20211040 1040 0.271
2 2021092601 2 167 123837. 0 20211039 1039 0.309
3 2021091901 2 167 118670. 0 20211038 1038 0.200
4 2021091201 2 167 125060. 0 20211037 1037 0.467
5 2021082901 2 167 108753. 0 20210935 0935 0.163
6 2021082201 2 167 103086. 0 20210934 0934 0.180
7 2021081501 2 167 93883. 0 20210933 0933 0.0649
8 2021080801 2 167 97233. 0 20210832 0832 0.117
9 2021080101 2 167 104718. 0 20210831 0831 0.576
10 2021072501 2 167 106884. 0 20210830 0830 0.187
# … with 190 more rows

Or, as @Greg mentioned in the comments, if we want only the first 4 rows:

A %>% 
mutate(r_num = row_number()) %>%
group_by(id, .add = TRUE) %>%
mutate(percent_change = if(n() == 2)
(first(dollars) - last(dollars))/last(dollars) else NA) %>%
filter(r_num <= 4)

