Summarizing by subgroup percentage in R
Per your comment, if the subgroups are unique within each group, you can do:
library(dplyr)
group_by(df, group) %>% mutate(percent = value/sum(value))
# group subgroup value percent
# 1 A a 1 0.1250000
# 2 A b 4 0.5000000
# 3 A c 2 0.2500000
# 4 A d 1 0.1250000
# 5 B a 1 0.1666667
# 6 B b 2 0.3333333
# 7 B c 3 0.5000000
Or, to remove the value column and add the percent column at the same time, use transmute:
group_by(df, group) %>% transmute(subgroup, percent = value/sum(value))
# group subgroup percent
# 1 A a 0.1250000
# 2 A b 0.5000000
# 3 A c 0.2500000
# 4 A d 0.1250000
# 5 B a 0.1666667
# 6 B b 0.3333333
# 7 B c 0.5000000
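For anyone who wants to reproduce this, here is a minimal sketch. The data frame is reconstructed from the printed output above (the original df is not shown, so treat the construction as an assumption), and the final check simply confirms that the shares within each group add up to 1.

```r
library(dplyr)

# Example data reconstructed from the printed output above (assumed, not the asker's actual df)
df <- data.frame(
  group    = c("A", "A", "A", "A", "B", "B", "B"),
  subgroup = c("a", "b", "c", "d", "a", "b", "c"),
  value    = c(1, 4, 2, 1, 1, 2, 3)
)

# Share of each row's value within its group
result <- df %>%
  group_by(group) %>%
  mutate(percent = value / sum(value)) %>%
  ungroup()

# Sanity check: the shares within each group should add up to 1
check <- result %>%
  group_by(group) %>%
  summarise(total = sum(percent))
print(check)
```

If a subgroup could appear more than once within a group, you would probably want to sum value per group and subgroup first and then compute the shares.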
Using dplyr function to calculate percentage within groups
library(dplyr)
df %>%
# line below to freeze order of type_n if not ordered factor already
mutate(type_n = forcats::fct_inorder(type_n)) %>%
group_by(type_n) %>%
summarize(n = n(), total = sum(population)) %>%
mutate(new_col = (n / total) %>% scales::percent(decimal.mark = ",", suffix = ""))
# A tibble: 3 x 4
type_n n total new_col
<fct> <int> <int> <chr>
1 small 2 7 28,6
2 medium 2 14 14,3
3 large 3 15 20,0
Percentage of factor levels by group in R
Another solution (with base-R):
prop.table(table(mydata$CNT, mydata$FACTOR), margin = 1)
1 2
A 0.6000000 0.4000000
B 0.6666667 0.3333333
C 0.5000000 0.5000000
D 1.0000000 0.0000000
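The margin argument decides which totals the counts are divided by. A small sketch with made-up values (only the column names CNT and FACTOR come from the answer; the data itself is hypothetical) shows the difference between row-wise and column-wise proportions:

```r
# Hypothetical data: the values are invented for illustration
mydata <- data.frame(
  CNT    = c("A", "A", "A", "B", "B", "C"),
  FACTOR = c(1, 1, 2, 1, 2, 2)
)

counts <- table(mydata$CNT, mydata$FACTOR)

# margin = 1: divide by row totals, so each CNT level sums to 1
row_shares <- prop.table(counts, margin = 1)

# margin = 2: divide by column totals, so each FACTOR level sums to 1
col_shares <- prop.table(counts, margin = 2)

print(rowSums(row_shares))
print(colSums(col_shares))
```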
R dplyr group by more than 2 variables and calculate relative percentages inside each 1st variable group
The best way to do this is to count at the detailed level first, then group_by the variable that defines the new, less specific bucket (origin), and divide each count by the bucket's total count in a mutate:
flights %>%
left_join(airlines, by = c('carrier'), na_matches = "never") %>%
group_by(origin, name, dest, day) %>%
summarise(count_flights = n()) %>%
arrange(origin) %>%
group_by(origin) %>%
mutate(prop = count_flights/sum(count_flights),
cumprop = cumsum(prop))
Finding percentage in a sub-group using group_by and summarise
Try
library(dplyr)
data %>%
group_by(month) %>%
mutate(countT= sum(count)) %>%
group_by(type, .add = TRUE) %>% # `.add` replaces the deprecated `add` argument (dplyr >= 1.0.0)
mutate(per=paste0(round(100*count/countT,2),'%'))
Or make it simpler, without creating additional columns:
data %>%
group_by(month) %>%
mutate(per = 100 *count/sum(count)) %>%
ungroup()
We could also use left_join after summarising the sum(count) by 'month'.
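A minimal sketch of that left_join route, assuming a data frame with the columns month, type and count used in the code above (the values here are hypothetical):

```r
library(dplyr)

# Hypothetical data with the column names used in the code above
data <- data.frame(
  month = c("Jan", "Jan", "Feb", "Feb"),
  type  = c("x", "y", "x", "y"),
  count = c(10, 30, 5, 15)
)

# Step 1: one row per month holding the monthly total
totals <- data %>%
  group_by(month) %>%
  summarise(countT = sum(count))

# Step 2: attach the total to every row, then compute the percentage
result <- data %>%
  left_join(totals, by = "month") %>%
  mutate(per = 100 * count / countT)

print(result)
```

This gives the same numbers as the grouped mutate; it is mostly useful when you want to keep the monthly totals around as a separate table.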
Or an option using data.table:
library(data.table)
setkey(setDT(data), month)[data[, list(count=sum(count)), month],
per:= paste0(round(100*count/i.count,2), '%')][]
Percentage group by for multiple columns in R dataframe
You can use the dplyr package.
For one column:
df %>%
group_by(Group) %>%
mutate(A_percent = A / sum(A)) # could use `A` instead of `A_percent`
For several columns at the same time, you can do the following which will overwrite the existing columns as you asked:
df %>%
group_by(Group) %>%
mutate_at(vars(A:D), ~ . / sum(.)) # formula syntax; funs() is deprecated since dplyr 0.8.0
Note that if you wanted to create new columns instead of overwriting, you could have done:
df %>%
group_by(Group) %>%
mutate_at(vars(A:D), list(percent = ~ . / sum(.))) # a named list replaces the deprecated funs()
This would have created new columns with a "_percent" suffix.
If you have many columns, you may want a more powerful way to select the columns to process. Have a look at the list of select helpers you can use in vars(...). You can also simply use numerical indexes.
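On dplyr 1.0.0 or later, the same two variants can also be written with across(), which supersedes the mutate_at family. This is a sketch on hypothetical data; only the column names Group and A to D come from the question.

```r
library(dplyr)

# Hypothetical data; only the column names Group and A:D come from the text
df <- data.frame(
  Group = c("g1", "g1", "g2", "g2"),
  A = c(1, 3, 2, 2),
  B = c(2, 2, 1, 3),
  C = c(5, 5, 4, 6),
  D = c(1, 9, 3, 7)
)

# Overwrite A:D with their within-group shares
overwritten <- df %>%
  group_by(Group) %>%
  mutate(across(A:D, ~ .x / sum(.x))) %>%
  ungroup()

# Keep the originals and add A_percent, ..., D_percent instead
added <- df %>%
  group_by(Group) %>%
  mutate(across(A:D, ~ .x / sum(.x), .names = "{.col}_percent")) %>%
  ungroup()

print(names(added))
```

Select helpers such as starts_with() or where(is.numeric) can be used in the first argument of across() in place of A:D.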
Finding percentage share grouped over category and month in r
Using dplyr you can do this:
library(dplyr)
set.seed(123)
filea <- data.frame(
ITEMS = c(rep("a",12),rep("b",12)),
MONTHS = c(seq(1,12),seq(1,12)),
VALUE = c(runif(12,0,50),runif(12,0,100))
)
filea = filea %>% group_by(ITEMS) %>% mutate(Percent_Share = VALUE/sum(VALUE)*100)
The output:
head(filea)
# A tibble: 6 x 4
# Groups: ITEMS [1]
ITEMS MONTHS VALUE Percent_Share
<fct> <int> <dbl> <dbl>
1 a 1 14.4 4.00
2 a 2 39.4 11.0
3 a 3 20.4 5.69
4 a 4 44.2 12.3
5 a 5 47.0 13.1
6 a 6 2.28 0.633
Relative frequencies / proportions with dplyr
Try this:
mtcars %>%
group_by(am, gear) %>%
summarise(n = n()) %>%
mutate(freq = n / sum(n))
# am gear n freq
# 1 0 3 15 0.7894737
# 2 0 4 4 0.2105263
# 3 1 4 8 0.6153846
# 4 1 5 5 0.3846154
From the dplyr vignette:
When you group by multiple variables, each summary peels off one level of the grouping. That makes it easy to progressively roll-up a dataset.
Thus, after the summarise, the last grouping variable specified in group_by, 'gear', is peeled off. In the mutate step, the data is grouped by the remaining grouping variable(s), here 'am'. You may check the grouping in each step with groups().
The outcome of the peeling of course depends on the order of the grouping variables in the group_by call. You may wish to do a subsequent group_by(am) to make your code more explicit.
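To see the peeling in action, here is a short sketch that inspects the grouping after summarise. It assumes dplyr 1.0.0 or later for the .groups argument; "drop_last" is the same behaviour as the default, just stated explicitly.

```r
library(dplyr)

by_am_gear <- mtcars %>%
  group_by(am, gear) %>%
  summarise(n = n(), .groups = "drop_last") # same as the default, stated explicitly

# Only 'am' is left as a grouping variable; 'gear' has been peeled off
print(group_vars(by_am_gear))

# The frequencies are therefore computed within each level of 'am'
freq <- by_am_gear %>%
  mutate(freq = n / sum(n)) %>%
  ungroup()

print(freq)
```

Swapping the order to group_by(gear, am) would leave 'gear' as the remaining group, so the frequencies would then sum to 1 within each gear instead.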
For rounding and prettification, please refer to the nice answer by @Tyler Rinker.