R data.table: subgroup weighted percent of group
This is almost a single step:
# A
widgets[,{
totwt = .N
.SD[,.(frac=.N/totwt),by=style]
},by=color]
# color style frac
# 1: red round 0.36
# 2: red pointy 0.32
# 3: red flat 0.32
# 4: green pointy 0.36
# 5: green flat 0.32
# 6: green round 0.32
# 7: blue flat 0.36
# 8: blue round 0.32
# 9: blue pointy 0.32
# 10: black round 0.36
# 11: black pointy 0.32
# 12: black flat 0.32
# B
widgets[,{
totwt = sum(weight)
.SD[,.(frac=sum(weight)/totwt),by=style]
},by=color]
# color style frac
# 1: red round 0.3466667
# 2: red pointy 0.3466667
# 3: red flat 0.3066667
# 4: green pointy 0.3333333
# 5: green flat 0.3200000
# 6: green round 0.3466667
# 7: blue flat 0.3866667
# 8: blue round 0.2933333
# 9: blue pointy 0.3200000
# 10: black round 0.3733333
# 11: black pointy 0.3333333
# 12: black flat 0.2933333
How it works: Construct your denominator for the top-level group (color
) before going to the finer group (color
with style
) to tabulate.
Alternatives. If style
s repeat within each color
and this is only for display purposes, try a table
:
# A
widgets[,
prop.table(table(color,style),1)
]
# style
# color flat pointy round
# black 0.32 0.32 0.36
# blue 0.36 0.32 0.32
# green 0.32 0.36 0.32
# red 0.32 0.32 0.36
# B
widgets[,rep(1L,sum(weight)),by=.(color,style)][,
prop.table(table(color,style),1)
]
# style
# color flat pointy round
# black 0.2933333 0.3333333 0.3733333
# blue 0.3866667 0.3200000 0.2933333
# green 0.3200000 0.3333333 0.3466667
# red 0.3066667 0.3466667 0.3466667
For B, this expands the data so that there is one observation for each unit of weight. With large data, such an expansion would be a bad idea (since it costs so much memory). Also, weight
has to be an integer; otherwise, its sum will be silently truncated to one (e.g., try rep(1,2.5) # [1] 1 1
).
Calculating the proportion per subgroup with data.table
Using data.table
:
df <- read.table(header = T, text = "row country year
1 NLD 2005
2 NLD 2005
3 BLG 2006
4 BLG 2005
5 GER 2005
6 NLD 2007
7 NLD 2005
8 NLD 2008")
setDT(df)[, sum := .N, by = country][, prop := .N, by = c("country", "year")][, prop := prop/sum][, sum := NULL]
row country year prop
1: 1 NLD 2005 0.6
2: 2 NLD 2005 0.6
3: 3 BLG 2006 0.5
4: 4 BLG 2005 0.5
5: 5 GER 2005 1.0
6: 6 NLD 2007 0.2
7: 7 NLD 2005 0.6
8: 8 NLD 2008 0.2
Weighted sum of variables by groups with data.table
Final attempt (copying Roland's answer :))
Copying @Roland's excellent answer:
print(dt[, lapply(.SD, function(x, w) sum(x*w), w=w), by=gr][, w := NULL])
still not the most efficient one: (second attempt)
Following @Roland's comment, it's indeed faster to do the operation on all columns and then just remove the unwanted ones (as long as the operation itself is not time consuming, which is the case here).
dt[, {lapply(.SD, function(x) sum(x*w))}, by=gr][, w := NULL][]
For some reason, w
seems to be not found when I don't use {}
.. No idea why though.
old (inefficient) answer:
(Subsetting can be costly if there are too many groups)
You can do this without using .SDcols
and then removing it while providing it to lapply
as follows:
dt[, lapply(.SD[, -1, with=FALSE], function(x) sum(x*w)), by=gr]
# gr V1 V2 V3 V4
# 1: 1 20 120 220 320
# 2: 2 70 170 270 370
.SDcols
makes .SD
without the w
column. So, it's not possible to multiply with w
as it doesn't exist within the scope of .SD environment then.
Summarizing by subgroup percentage in R
Per your comment, if the subgroups are unique you can do
library(dplyr)
group_by(df, group) %>% mutate(percent = value/sum(value))
# group subgroup value percent
# 1 A a 1 0.1250000
# 2 A b 4 0.5000000
# 3 A c 2 0.2500000
# 4 A d 1 0.1250000
# 5 B a 1 0.1666667
# 6 B b 2 0.3333333
# 7 B c 3 0.5000000
Or to remove the value
column and add the percent
column at the same time, use transmute
group_by(df, group) %>% transmute(subgroup, percent = value/sum(value))
# group subgroup percent
# 1 A a 0.1250000
# 2 A b 0.5000000
# 3 A c 0.2500000
# 4 A d 0.1250000
# 5 B a 0.1666667
# 6 B b 0.3333333
# 7 B c 0.5000000
Display weighted mean by group in the data.frame
If we use mutate
, then we can avoid the left_join
library(dplyr)
df %>%
group_by(education) %>%
mutate(weighted_income = weighted.mean(income, weight))
# obs income education weight weighted_income
# <int> <int> <fctr> <int> <dbl>
#1 1 1000 A 10 1166.667
#2 2 2000 B 1 1583.333
#3 3 1500 B 5 1583.333
#4 4 2000 A 2 1166.667
R (data.table) group data by custom range (for example, -18, 18-25, ..., 65+)
@jdharrison is right: cut(...)
is the way to go.
library(data.table)
# create sample - you have this already
set.seed(1) # for reproducibility
DT <- data.table(age=sample(15:70,1000,replace=TRUE),
value=rpois(1000,10))
# you start here...
breaks <- c(0,18,25,35,45,65,Inf)
DT[,list(mean=mean(value)),by=list(age=cut(age,breaks=breaks))][order(age)]
# age mean
# 1: (0,18] 10.000000
# 2: (18,25] 9.579365
# 3: (25,35] 10.158192
# 4: (35,45] 9.775510
# 5: (45,65] 9.969697
# 6: (65,Inf] 10.141414
Data.table in R: Fitting a function on a different subgroup than the function's application group
We could use
library(data.table)
dt[, Percentile := ecdf(value[group %in% c("A", "B", "C")])(value), date]
-output
> dt
group date value Percentile
<char> <Date> <num> <num>
1: A 2022-01-08 -0.3260365 0.3333333
2: A 2022-01-09 0.5524619 1.0000000
3: A 2022-01-10 -0.6749438 0.3333333
4: B 2022-01-08 0.2143595 0.6666667
5: B 2022-01-09 0.3107692 0.6666667
6: B 2022-01-10 1.1739663 1.0000000
7: C 2022-01-08 0.6187899 1.0000000
8: C 2022-01-09 -0.1127343 0.3333333
9: C 2022-01-10 0.9170283 0.6666667
10: D 2022-01-08 -0.2232594 0.3333333
11: D 2022-01-09 0.5264481 0.6666667
12: D 2022-01-10 -0.7948444 0.0000000
matching values within group to calculate percent change R
Based on the description, we may add 'id' as grouping variable and then calculate the 'percecnt_change' by dividing the deviation of dollars (from the max
value of 'dollars') with the dollars column
library(dplyr)
A %>%
group_by(id, .add = TRUE) %>%
mutate(percent_change = (max(dollars) - dollars)/dollars)
Or may be
A %>%
group_by(id, .add = TRUE) %>%
mutate(percent_change = if(n() == 2)
(first(dollars) - last(dollars))/last(dollars) else NA)
-output
# A tibble: 200 × 8
# Groups: retailer_id, store_id, id [123]
week_id retailer_id store_id dollars fill spins_week id percent_change
<chr> <int> <int> <dbl> <dbl> <chr> <chr> <dbl>
1 2021100301 2 167 121818. 0 20211040 1040 0.271
2 2021092601 2 167 123837. 0 20211039 1039 0.309
3 2021091901 2 167 118670. 0 20211038 1038 0.200
4 2021091201 2 167 125060. 0 20211037 1037 0.467
5 2021082901 2 167 108753. 0 20210935 0935 0.163
6 2021082201 2 167 103086. 0 20210934 0934 0.180
7 2021081501 2 167 93883. 0 20210933 0933 0.0649
8 2021080801 2 167 97233. 0 20210832 0832 0.117
9 2021080101 2 167 104718. 0 20210831 0831 0.576
10 2021072501 2 167 106884. 0 20210830 0830 0.187
# … with 190 more rows
Or as @Greg mentioned in the comments, if we want the first 4 values
A %>%
mutate(r_num = row_number()) %>%
group_by(id, .add = TRUE) %>%
mutate(percent_change = if(n() == 2)
(first(dollars) - last(dollars))/last(dollars) else NA) %>%
filter(r_num <= 4)
Related Topics
Extract Survival Probabilities in Survfit by Groups
Produce an Inset in Each Facet of an R Ggplot While Preserving Colours of the Original Facet Content
Add Missing Xts/Zoo Data with Linear Interpolation in R
Ggplot2:How to Reduce the Width and the Space Between Bars with Geom_Bar
How Does One Merge Dataframes by Row Name Without Adding a "Row.Names" Column
How to Make Discrete Gradient Color Bar with Geom_Contour_Filled
How to Store the Returned Value from a Shiny Module in Reactivevalues
How to Manually Set Geom_Bar Fill Color in Ggplot
Left_Join Two Data Frames and Overwrite
Convert a Netcdf Time Variable to an R Date Object
Hover Image in Plotly R Chart in Shiny App
Shiny: Switching Between Reactive Data Sets with Rhandsontable
Use a Factor Column in "By" and Do Not Drop Empty Factors
Cumulative Count of Unique Values in R
How to Install Rhadoop Packages (Rmr, Rhdfs, Rhbase)
How Does Branch Prediction Affect Performance in R
How to Control Ggplot's Plotting Area Proportions Instead of Fitting Them to Devices in R