Calculate Percentage of Each Category in Each Group in R

Summarizing by subgroup percentage in R

Per your comment, if the subgroups are unique, you can do:

library(dplyr)
group_by(df, group) %>% mutate(percent = value/sum(value))
# group subgroup value percent
# 1 A a 1 0.1250000
# 2 A b 4 0.5000000
# 3 A c 2 0.2500000
# 4 A d 1 0.1250000
# 5 B a 1 0.1666667
# 6 B b 2 0.3333333
# 7 B c 3 0.5000000

Or, to remove the value column and add the percent column at the same time, use transmute:

group_by(df, group) %>% transmute(subgroup, percent = value/sum(value))
# group subgroup percent
# 1 A a 0.1250000
# 2 A b 0.5000000
# 3 A c 0.2500000
# 4 A d 0.1250000
# 5 B a 0.1666667
# 6 B b 0.3333333
# 7 B c 0.5000000
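
For a reproducible version of the above, here is a minimal sketch. The data frame is an assumption: it is rebuilt from the values shown in the printed output, since the original df is not given here. The final step is only a sanity check that the shares within each group add up to 1.

```r
library(dplyr)

# Assumed input, reconstructed from the printed output above
df <- data.frame(
  group    = c("A", "A", "A", "A", "B", "B", "B"),
  subgroup = c("a", "b", "c", "d", "a", "b", "c"),
  value    = c(1, 4, 2, 1, 1, 2, 3)
)

# Share of each subgroup within its group
result <- df %>%
  group_by(group) %>%
  mutate(percent = value / sum(value)) %>%
  ungroup()

# Sanity check: the shares within each group should add up to 1
check <- result %>%
  group_by(group) %>%
  summarise(total = sum(percent))
```

Calling ungroup() at the end is optional, but it avoids surprises if later steps are not meant to run per group.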

Using dplyr function to calculate percentage within groups

library(dplyr)

df %>%
# line below to freeze order of type_n if not ordered factor already
mutate(type_n = forcats::fct_inorder(type_n)) %>%
group_by(type_n) %>%
summarize(n = n(), total = sum(population)) %>%
mutate(new_col = (n / total) %>% scales::percent(decimal.mark = ",", suffix = ""))

# A tibble: 3 x 4
type_n n total new_col
<fct> <int> <int> <chr>
1 small 2 7 28,6
2 medium 2 14 14,3
3 large 3 15 20,0

Percentage of factor levels by group in R

Another solution (with base-R):

prop.table(table(mydata$CNT, mydata$FACTOR), margin = 1)
          1         2
A 0.6000000 0.4000000
B 0.6666667 0.3333333
C 0.5000000 0.5000000
D 1.0000000 0.0000000
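
The margin argument decides what each proportion is relative to. A small sketch, assuming a hypothetical mydata whose counts were chosen so that the row proportions match the table above (the real data is not shown here):

```r
# Hypothetical data, chosen so the row proportions match the table above
mydata <- data.frame(
  CNT    = c(rep("A", 5), rep("B", 3), rep("C", 2), "D"),
  FACTOR = c(1, 1, 1, 2, 2,  1, 1, 2,  1, 2,  1)
)

counts <- table(mydata$CNT, mydata$FACTOR)

by_row  <- prop.table(counts, margin = 1)  # each row sums to 1
by_col  <- prop.table(counts, margin = 2)  # each column sums to 1
overall <- prop.table(counts)              # the whole table sums to 1
```

With margin = 1 the result answers "what share of each CNT falls in each FACTOR level", which is the per-group percentage asked for here.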

R dplyr group by more than 2 variables and calculate relative percentages inside each 1st variable group

The best way to do this is to group_by the variables that you want to use for the new, less specific bucket (origin), and then divide the count by the total count in a mutate:

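library(dplyr)
library(nycflights13) # provides the flights and airlines data sets

flights %>%
  left_join(airlines, by = c('carrier'), na_matches = "never") %>%
  group_by(origin, name, dest, day) %>%
  summarise(count_flights = n()) %>%
  arrange(origin) %>%
  # regroup by the coarser bucket so that sum() runs per origin
  group_by(origin) %>%
  mutate(prop = count_flights/sum(count_flights),
         cumprop = cumsum(prop))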

Finding percentage in a sub-group using group_by and summarise

Try

library(dplyr)
data %>%
group_by(month) %>%
mutate(countT= sum(count)) %>%
group_by(type, .add = TRUE) %>% # the add argument was renamed to .add in dplyr 1.0.0
mutate(per=paste0(round(100*count/countT,2),'%'))

Or, more simply, without creating additional columns:

data %>%
group_by(month) %>%
mutate(per = 100 *count/sum(count)) %>%
ungroup()

We could also use left_join after summarising the sum(count) by 'month'
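
The left_join variant mentioned above could look roughly like this. The data frame is hypothetical (only the column names month, type and count come from the answer), and countT is reused as the name of the monthly total.

```r
library(dplyr)

# Hypothetical data using the column names from the answer
data <- data.frame(
  month = c("Jan", "Jan", "Feb", "Feb"),
  type  = c("x", "y", "x", "y"),
  count = c(10, 30, 5, 15)
)

# One row per month holding that month's total
totals <- data %>%
  group_by(month) %>%
  summarise(countT = sum(count))

# Attach the monthly total to every row, then compute the share
result <- data %>%
  left_join(totals, by = "month") %>%
  mutate(per = 100 * count / countT)
```

This takes one more step than the grouped mutate, but it keeps the totals available as a separate table if they are needed elsewhere.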

Or an option using data.table:

library(data.table)
setkey(setDT(data), month)[data[, list(count=sum(count)), month],
per:= paste0(round(100*count/i.count,2), '%')][]

Percentage group by for multiple columns in R dataframe

You can use the dplyr package.

For one column:

df %>%
group_by(Group) %>%
mutate(A_percent = A / sum(A)) # could use `A` instead of `A_percent`

For several columns at the same time, you can do the following which will overwrite the existing columns as you asked:

df %>%
group_by(Group) %>%
mutate_at(vars(A:D), list(~ . / sum(.))) # funs() is deprecated since dplyr 0.8.0

Note that if you wanted to create new columns instead of overwriting, you could have done:

df %>%
group_by(Group) %>%
mutate_at(vars(A:D), list(percent = ~ . / sum(.))) # funs() is deprecated since dplyr 0.8.0

This would have created new columns with a "_percent" suffix.

If you have many columns, you may want a more powerful way to select the columns to process. Have a look at the list of select helpers you can use in vars(...). You can also simply use numeric indices.
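
In dplyr 1.0.0 and later, across() supersedes scoped verbs such as mutate_at(). A sketch of both variants in that style follows; the data frame is hypothetical, and only the names Group and A to D are taken from the question.

```r
library(dplyr)

# Hypothetical data; Group and A:D follow the column names used above
df <- data.frame(
  Group = c("g1", "g1", "g2", "g2"),
  A = c(1, 3, 2, 2),
  B = c(5, 5, 1, 9),
  C = c(2, 6, 4, 4),
  D = c(1, 1, 3, 1)
)

# Overwrite A:D with their within-group shares
overwritten <- df %>%
  group_by(Group) %>%
  mutate(across(A:D, ~ .x / sum(.x))) %>%
  ungroup()

# Keep the originals and add A_percent, ..., D_percent
added <- df %>%
  group_by(Group) %>%
  mutate(across(A:D, ~ .x / sum(.x), .names = "{.col}_percent")) %>%
  ungroup()
```

The first argument of across() accepts the same select helpers, for example starts_with() or where(is.numeric).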

Finding percentage share grouped over category and month in r

Using dplyr, you can do this:

library(dplyr)

set.seed(123)
filea <- data.frame(
ITEMS = c(rep("a",12),rep("b",12)),
MONTHS = c(seq(1,12),seq(1,12)),
VALUE = c(runif(12,0,50),runif(12,0,100))
)

filea = filea %>% group_by(ITEMS) %>% mutate(Percent_Share = VALUE/sum(VALUE)*100)

The output:

head(filea)
# A tibble: 6 x 4
# Groups: ITEMS [1]
ITEMS MONTHS VALUE Percent_Share
<fct> <int> <dbl> <dbl>
1 a 1 14.4 4.00
2 a 2 39.4 11.0
3 a 3 20.4 5.69
4 a 4 44.2 12.3
5 a 5 47.0 13.1
6 a 6 2.28 0.633

Relative frequencies / proportions with dplyr

Try this:

mtcars %>%
group_by(am, gear) %>%
summarise(n = n()) %>%
mutate(freq = n / sum(n))

# am gear n freq
# 1 0 3 15 0.7894737
# 2 0 4 4 0.2105263
# 3 1 4 8 0.6153846
# 4 1 5 5 0.3846154

From the dplyr vignette:

When you group by multiple variables, each summary peels off one level of the grouping. That makes it easy to progressively roll-up a dataset.

Thus, after the summarise, the last grouping variable specified in group_by, 'gear', is peeled off. In the mutate step, the data is grouped by the remaining grouping variable(s), here 'am'. You may check grouping in each step with groups.

The outcome of the peeling of course depends on the order of the grouping variables in the group_by call. You may wish to do a subsequent group_by(am) to make your code more explicit.
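
A sketch of the more explicit version, assuming dplyr 1.0.0 or later (which introduced the .groups argument of summarise). It first inspects the grouping left after the peel, then drops all grouping and names the denominator group directly.

```r
library(dplyr)

# Inspect what is left after summarise() peels off 'gear'
peeled <- mtcars %>%
  group_by(am, gear) %>%
  summarise(n = n())
remaining <- group_vars(peeled)  # "am"

# Explicit version: drop all grouping, then regroup by the denominator
result <- mtcars %>%
  group_by(am, gear) %>%
  summarise(n = n(), .groups = "drop") %>%
  group_by(am) %>%
  mutate(freq = n / sum(n)) %>%
  ungroup()
```

The result matches the output shown above, but it no longer depends on the order of the variables in the first group_by call.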

For rounding and prettification, please refer to the nice answer by @Tyler Rinker.


