R Data.Table: Subgroup Weighted Percent of Group

This is almost a single step:

# A
widgets[, {
  totwt = .N
  .SD[, .(frac = .N / totwt), by = style]
}, by = color]
# color style frac
# 1: red round 0.36
# 2: red pointy 0.32
# 3: red flat 0.32
# 4: green pointy 0.36
# 5: green flat 0.32
# 6: green round 0.32
# 7: blue flat 0.36
# 8: blue round 0.32
# 9: blue pointy 0.32
# 10: black round 0.36
# 11: black pointy 0.32
# 12: black flat 0.32

# B
widgets[, {
  totwt = sum(weight)
  .SD[, .(frac = sum(weight) / totwt), by = style]
}, by = color]
# color style frac
# 1: red round 0.3466667
# 2: red pointy 0.3466667
# 3: red flat 0.3066667
# 4: green pointy 0.3333333
# 5: green flat 0.3200000
# 6: green round 0.3466667
# 7: blue flat 0.3866667
# 8: blue round 0.2933333
# 9: blue pointy 0.3200000
# 10: black round 0.3733333
# 11: black pointy 0.3333333
# 12: black flat 0.2933333

How it works: construct the denominator for the top-level group (color) first, then drop to the finer grouping (color and style) to tabulate the numerators.
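The widgets table itself is not shown above, so here is a minimal sketch of the same two-step pattern, assuming columns color, style, and weight (the data values are made up, so the fractions will differ from the output above):

```r
library(data.table)

# Hypothetical 'widgets' table -- the original data is not shown,
# so the columns (color, style, weight) are assumed.
set.seed(1)
widgets <- data.table(
  color  = rep(c("red", "green", "blue", "black"), each = 75),
  style  = sample(c("round", "pointy", "flat"), 300, replace = TRUE),
  weight = sample(1:5, 300, replace = TRUE)
)

# Denominator first (per color), then the finer grouping (style):
res <- widgets[, {
  totwt = sum(weight)
  .SD[, .(frac = sum(weight) / totwt), by = style]
}, by = color]

# Sanity check: the fractions within each color sum to 1
res[, sum(frac), by = color]
```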


Alternatives. If styles repeat within each color and the result is only needed for display, try a contingency table:

# A
widgets[, prop.table(table(color, style), 1)]
# style
# color flat pointy round
# black 0.32 0.32 0.36
# blue 0.36 0.32 0.32
# green 0.32 0.36 0.32
# red 0.32 0.32 0.36

# B
widgets[, rep(1L, sum(weight)), by = .(color, style)][, prop.table(table(color, style), 1)]

# style
# color flat pointy round
# black 0.2933333 0.3333333 0.3733333
# blue 0.3866667 0.3200000 0.2933333
# green 0.3200000 0.3333333 0.3466667
# red 0.3066667 0.3466667 0.3466667

For B, this expands the data so that there is one observation for each unit of weight. With large data, such an expansion would be a bad idea, since it costs so much memory. Also, weight has to be a whole number; otherwise sum(weight) is silently truncated when passed to rep() (e.g., rep(1, 2.5) yields only two elements: [1] 1 1).

Calculating the proportion per subgroup with data.table

Using data.table:

df <- read.table(header = TRUE, text = "row country year
1 NLD 2005
2 NLD 2005
3 BLG 2006
4 BLG 2005
5 GER 2005
6 NLD 2007
7 NLD 2005
8 NLD 2008")

setDT(df)[, sum := .N, by = country
        ][, prop := .N, by = c("country", "year")
        ][, prop := prop / sum
        ][, sum := NULL]

row country year prop
1: 1 NLD 2005 0.6
2: 2 NLD 2005 0.6
3: 3 BLG 2006 0.5
4: 4 BLG 2005 0.5
5: 5 GER 2005 1.0
6: 6 NLD 2007 0.2
7: 7 NLD 2005 0.6
8: 8 NLD 2008 0.2
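The temporary sum column can also be avoided by dividing in a second grouped step. A sketch of the same computation, rebuilding the df shown above:

```r
library(data.table)

# Same data as the read.table call above
df <- data.table(row = 1:8,
                 country = c("NLD", "NLD", "BLG", "BLG", "GER", "NLD", "NLD", "NLD"),
                 year = c(2005, 2005, 2006, 2005, 2005, 2007, 2005, 2008))

# Count per (country, year), then divide by the per-country count:
df[, prop := .N, by = .(country, year)][, prop := prop / .N, by = country]
df
```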

Weighted sum of variables by groups with data.table

Final attempt, copying @Roland's excellent answer:

print(dt[, lapply(.SD, function(x, w) sum(x*w), w=w), by=gr][, w := NULL])

Second attempt (still not the most efficient):

Following @Roland's comment, it's indeed faster to do the operation on all columns and then just remove the unwanted ones (as long as the operation itself is not time consuming, which is the case here).

dt[, {lapply(.SD, function(x) sum(x*w))}, by=gr][, w := NULL][]

For some reason, w is not found when the braces {} are omitted; it's not clear why.


Old (inefficient) answer:

(Subsetting can be costly if there are too many groups)

You can do this without .SDcols by dropping the w column from .SD directly as you pass it to lapply:

dt[, lapply(.SD[, -1, with=FALSE], function(x) sum(x*w)), by=gr]
# gr V1 V2 V3 V4
# 1: 1 20 120 220 320
# 2: 2 70 170 270 370

Using .SDcols builds .SD without the w column, so w would not be available for the multiplication within the scope of .SD.
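The original dt is not shown; here is a small hypothetical table that illustrates the idiom (the data and expected sums are made up for this sketch). Within each group, .SD holds w, V1, and V2; dropping its first column removes w from the columns being summed, while w itself is still visible in the grouping scope:

```r
library(data.table)

# Hypothetical small 'dt' -- the original data is not shown
dt <- data.table(gr = c(1, 1, 2, 2),
                 w  = c(1L, 2L, 1L, 3L),
                 V1 = 1:4,
                 V2 = 5:8)

# Drop .SD's first column (w), then weight each remaining column by w:
res <- dt[, lapply(.SD[, -1, with = FALSE], function(x) sum(x * w)), by = gr]
res
#    gr V1 V2
# 1:  1  5 17
# 2:  2 15 31
```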

Summarizing by subgroup percentage in R

Per your comment, if the subgroups are unique you can do

library(dplyr)
group_by(df, group) %>% mutate(percent = value/sum(value))
# group subgroup value percent
# 1 A a 1 0.1250000
# 2 A b 4 0.5000000
# 3 A c 2 0.2500000
# 4 A d 1 0.1250000
# 5 B a 1 0.1666667
# 6 B b 2 0.3333333
# 7 B c 3 0.5000000

Or, to remove the value column and add the percent column at the same time, use transmute:

group_by(df, group) %>% transmute(subgroup, percent = value/sum(value))
# group subgroup percent
# 1 A a 0.1250000
# 2 A b 0.5000000
# 3 A c 0.2500000
# 4 A d 0.1250000
# 5 B a 0.1666667
# 6 B b 0.3333333
# 7 B c 0.5000000
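For reference, the df is not shown above; a data.frame consistent with the printed output (reconstructed from it) would be:

```r
library(dplyr)

# Reconstructed from the output shown above
df <- data.frame(group    = c(rep("A", 4), rep("B", 3)),
                 subgroup = c("a", "b", "c", "d", "a", "b", "c"),
                 value    = c(1, 4, 2, 1, 1, 2, 3))

res <- group_by(df, group) %>% mutate(percent = value / sum(value))
res
```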

Display weighted mean by group in the data.frame

If we use mutate, we can avoid the left_join:

library(dplyr)
df %>%
group_by(education) %>%
mutate(weighted_income = weighted.mean(income, weight))
# obs income education weight weighted_income
# <int> <int> <fctr> <int> <dbl>
#1 1 1000 A 10 1166.667
#2 2 2000 B 1 1583.333
#3 3 1500 B 5 1583.333
#4 4 2000 A 2 1166.667
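Since this page is otherwise data.table-focused, the same per-group weighted mean can be sketched in data.table, rebuilding the data from the output shown above:

```r
library(data.table)

# Data reconstructed from the printed output
df <- data.table(obs = 1:4,
                 income = c(1000L, 2000L, 1500L, 2000L),
                 education = factor(c("A", "B", "B", "A")),
                 weight = c(10L, 1L, 5L, 2L))

# Group-wise weighted mean broadcast back onto every row
df[, weighted_income := weighted.mean(income, weight), by = education]
df
```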

R (data.table) group data by custom range (for example, -18, 18-25, ..., 65+)

@jdharrison is right: cut(...) is the way to go.

library(data.table)
# create sample - you have this already
set.seed(1) # for reproducibility
DT <- data.table(age=sample(15:70,1000,replace=TRUE),
value=rpois(1000,10))

# you start here...
breaks <- c(0,18,25,35,45,65,Inf)
DT[,list(mean=mean(value)),by=list(age=cut(age,breaks=breaks))][order(age)]
# age mean
# 1: (0,18] 10.000000
# 2: (18,25] 9.579365
# 3: (25,35] 10.158192
# 4: (35,45] 9.775510
# 5: (45,65] 9.969697
# 6: (65,Inf] 10.141414

Data.table in R: Fitting a function on a different subgroup than the function's application group

We could use

library(data.table)
dt[, Percentile := ecdf(value[group %in% c("A", "B", "C")])(value), date]

-output

> dt
group date value Percentile
<char> <Date> <num> <num>
1: A 2022-01-08 -0.3260365 0.3333333
2: A 2022-01-09 0.5524619 1.0000000
3: A 2022-01-10 -0.6749438 0.3333333
4: B 2022-01-08 0.2143595 0.6666667
5: B 2022-01-09 0.3107692 0.6666667
6: B 2022-01-10 1.1739663 1.0000000
7: C 2022-01-08 0.6187899 1.0000000
8: C 2022-01-09 -0.1127343 0.3333333
9: C 2022-01-10 0.9170283 0.6666667
10: D 2022-01-08 -0.2232594 0.3333333
11: D 2022-01-09 0.5264481 0.6666667
12: D 2022-01-10 -0.7948444 0.0000000
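The dt can be reconstructed from the values printed above; the key point is that, within each date, the ecdf is fit only on groups A, B, and C, but evaluated on every group's value (including D):

```r
library(data.table)

# dt reconstructed from the printed output above
dt <- data.table(
  group = rep(c("A", "B", "C", "D"), each = 3),
  date  = rep(as.Date(c("2022-01-08", "2022-01-09", "2022-01-10")), times = 4),
  value = c(-0.3260365,  0.5524619, -0.6749438,
             0.2143595,  0.3107692,  1.1739663,
             0.6187899, -0.1127343,  0.9170283,
            -0.2232594,  0.5264481, -0.7948444)
)

# Per date: fit ecdf on groups A-C only, evaluate on all groups
dt[, Percentile := ecdf(value[group %in% c("A", "B", "C")])(value), by = date]
dt
```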

Matching values within group to calculate percent change in R

Based on the description, we can add 'id' as a grouping variable and then calculate 'percent_change' by dividing the deviation of dollars from its group maximum by the dollars column:

library(dplyr)
A %>%
group_by(id, .add = TRUE) %>%
mutate(percent_change = (max(dollars) - dollars)/dollars)

Or maybe:

A %>%
group_by(id, .add = TRUE) %>%
mutate(percent_change = if(n() == 2)
(first(dollars) - last(dollars))/last(dollars) else NA)

-output

# A tibble: 200 × 8
# Groups: retailer_id, store_id, id [123]
week_id retailer_id store_id dollars fill spins_week id percent_change
<chr> <int> <int> <dbl> <dbl> <chr> <chr> <dbl>
1 2021100301 2 167 121818. 0 20211040 1040 0.271
2 2021092601 2 167 123837. 0 20211039 1039 0.309
3 2021091901 2 167 118670. 0 20211038 1038 0.200
4 2021091201 2 167 125060. 0 20211037 1037 0.467
5 2021082901 2 167 108753. 0 20210935 0935 0.163
6 2021082201 2 167 103086. 0 20210934 0934 0.180
7 2021081501 2 167 93883. 0 20210933 0933 0.0649
8 2021080801 2 167 97233. 0 20210832 0832 0.117
9 2021080101 2 167 104718. 0 20210831 0831 0.576
10 2021072501 2 167 106884. 0 20210830 0830 0.187
# … with 190 more rows

Or, as @Greg mentioned in the comments, if we want only the first 4 rows:

A %>% 
mutate(r_num = row_number()) %>%
group_by(id, .add = TRUE) %>%
mutate(percent_change = if(n() == 2)
(first(dollars) - last(dollars))/last(dollars) else NA) %>%
filter(r_num <= 4)

