Calculate Percentages of a Binary Variable by Another Variable in R

Calculate percentages of a binary variable BY another variable in R

You could also use data.table:

library(data.table)

setDT(d)[,.(.N,prop=sum(treatment==2)/.N),
by=region]
region N prop
1: A 200 0.5
2: B 200 0.5
3: C 200 0.5
4: D 200 0.5
5: E 200 0.5

Calculate proportion of several binary variables by another variable

You can use tidyr to pivot your data first and then summarize it:

library(tidyr)

tidyr::pivot_longer(my_df, banana:peach,
names_to = "fruit") %>%
dplyr::group_by(gender, fruit) %>%
dplyr::summarize(prop = sum(value) / n())

gender fruit prop
<chr> <chr> <dbl>
1 female apple 0.5
2 female banana 0.625
3 female orange 0.625
4 female peach 0.5
5 female strawberry 0.25
6 male apple 0.75
7 male banana 0.667
8 male orange 0.25
9 male peach 0.583
10 male strawberry 0.333

You can pipe it to arrange if you want to sort by fruit. You can also add the number of observations in the summarize function with n = n().

Frequencies / percentages of multiple binary variables by group with dplyr

After grouping by 'gender', get the 'total' with n(), then loop over the 'var' variables with across, get the mean of logical vector

library(dplyr) # 1.0.0
data %>%
group_by(gender) %>%
summarise(total = n(),
across(starts_with('var'), ~ mean(. == 1)))

calculating counts and percentages of a variable

We may use add_count to create the 'count' column and then get the mean of the binary column by group to return the percent_yes and subtract 100 from the percent_yes to return percent_no

library(dplyr)
df1 %>%
add_count(UserID, name = 'count') %>%
group_by(UserID) %>%
mutate(percent_yes = 100 * mean(substance_use),
percent_no = 100 - percent_yes) %>%
ungroup

-output

# A tibble: 7 × 5
UserID substance_use count percent_yes percent_no
<int> <int> <int> <dbl> <dbl>
1 43124 0 5 40 60
2 43124 1 5 40 60
3 43124 0 5 40 60
4 43124 0 5 40 60
5 43124 1 5 40 60
6 215 1 2 100 0
7 215 1 2 100 0

NOTE: Here, we assumed no missing values in 'substance_use' column

data

df1 <- structure(list(UserID = c(43124L, 43124L, 43124L, 43124L, 43124L, 
215L, 215L), substance_use = c(0L, 1L, 0L, 0L, 1L, 1L, 1L)),
class = "data.frame", row.names = c(NA,
-7L))

Calculating percentage of multiple columns of binary variables and plotting bar graph in r

Since you have only 1/0 values we can take mean of columns to get the percentage of 1's. Use barplot to plot it.

barplot(colMeans(df[-1]) * 100, ylim = c(0, 100), ylab='Percentage',
xlab = 'bins', main = 'Percentage of yes')

Sample Image
data

df <- structure(list(name = c("a", "b", "c"), bin1 = c(1L, 0L, 0L), 
bin2 = c(0L, 1L, 1L)), class = "data.frame", row.names = c(NA, -3L))

How to calculate and display percentages from a binary dataframe

You can group your tibble by CandidateType and divide the Amount of every row by the maximum Amount:

recruitmentDF %>% 
group_by(CandidateType) %>%
mutate(Pct = scales::percent(Amount / max(Amount)))

This returns:

# A tibble: 6 x 4
# Groups: CandidateType [2]
CandidateType Step Amount Pct
<fct> <fct> <int> <chr>
1 External Hiring 304 3.5%
2 Internal Hiring 164 19.8%
3 External Interview 950 10.9%
4 Internal Interview 512 61.8%
5 External Application 8726 100.0%
6 Internal Application 828 100.0%

How to get percentage of categorical variables and overall percent of a single choice

Updated answer

The tricky part of this problem is the difference between row percentages and column percentages that are represented in the data. Since all rows but the total row are column percentages, we will need to process the data twice, first for the the province * variable level of aggregation, and then variable aggregated over province.

new_data <-data.frame(province=c("a","b"),
food=c("yes","no","no","yes","yes","no"),
shelter_type=c("unfinished","permanent","transitional"))
library(dplyr)
library(tidyr)

First we'll generate what ultimately becomes column percentages within a wide format data frame. We use pivot_longer() to create a narrow format tidy data set, create counts, summarise() the counts, and then group_by() variable & value to generate column percentages.

new_data  %>% group_by(province) %>%
pivot_longer(.,c(food,shelter_type),names_to = "variable",
values_to = "value") %>% ungroup() %>%
group_by(province,variable,value) %>%
mutate(count = 1) %>% summarise(.,count = sum(count)) %>% ungroup() %>%
group_by(variable,value) %>%
mutate(pct = count / sum(count)) -> prov_var

Next, we reaggregate the data to create what will become the Total province. We take the original data, convert to narrow format tidy data, and this time group_by() variable & value to calculate the percentages across province.

new_data  %>% group_by(province) %>%
pivot_longer(.,c(food,shelter_type),names_to = "variable",
values_to = "value") %>% ungroup() %>%
group_by(variable,value) %>%
mutate(count = 1) %>% summarise(., count = sum(count)) %>%
mutate(province = "Total",
pct = count / sum(count)) -> tot_var

Finally, we rbind() the data and use tidyr::pivot_wider() to create the wide format data frame as illustrated in the original question.

# now add rows & pivot_wider()
rbind(prov_var,tot_var) %>%
mutate(concat_var = paste(variable,value,sep="_")) %>%
select(-variable,-value,-count) %>%
pivot_wider(id_cols = province,names_from=concat_var,
values_from = pct)

...and the output:

# A tibble: 3 x 6
province food_no food_yes shelter_type_perm… shelter_type_tra… shelter_type_unf…
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 a 0.333 0.667 0.5 0.5 0.5
2 b 0.667 0.333 0.5 0.5 0.5
3 Total 0.5 0.5 0.333 0.333 0.333

Partial solutions with tables::tabular()

Another way to attempt to answer the question is with the tables package. We can generate the column percentages by province as follows.

library(tables)

# replicate column percentages, where "All" is 100

tabular((Factor(province,"Province") + 1) ~
(Factor(food) + Factor(shelter_type)) *
(Percent("col")),data = new_data )

Unfortunately, the row for totals isn't what was requested.

          food            shelter_type                        
no yes permanent transitional unfinished
Province Percent Percent Percent Percent Percent
a 33.33 66.67 50 50 50
b 66.67 33.33 50 50 50
All 100.00 100.00 100 100 100

We can fix the All row by configuring the table with row percentages, but then the data by province doesn't match what was requested.

# replicate row percentages in All row
tabular((Factor(province,"Province") + 1) ~
(Factor(food) + Factor(shelter_type)) *
(Percent("row")),data = new_data )

food shelter_type
no yes permanent transitional unfinished
Province Percent Percent Percent Percent Percent
a 33.33 66.67 33.33 33.33 33.33
b 66.67 33.33 33.33 33.33 33.33
All 50.00 50.00 33.33 33.33 33.33

Correct solution with tabular()

However, if we control the percentages by specifying them on the row dimension of the table instead of the column dimension, we can achieve the desired output.

tabular((Factor(province,"Province")*( colPct = Percent("col")) + 1*(rowPct = Percent("row")))  ~ 
(Factor(food) + Factor(shelter_type)),data = new_data )

...and the output:

                 food        shelter_type                        
Province no yes permanent transitional unfinished
a colPct 33.33 66.67 50.00 50.00 50.00
b colPct 66.67 33.33 50.00 50.00 50.00
All rowPct 50.00 50.00 33.33 33.33 33.33

Original answer

We'll use the dplyr package to summarise the data by province & food, calculate percentages, and then ungroup() to calculate percentage of total responses.

new_data <-data.frame(province=c("a","b"),
food=c("yes","no","no","yes","yes","no"),
shelter_type=c("unfinished","permanent","transitional"))

library(dplyr)

new_data %>% group_by(province,food) %>%
summarise(count_food = n()) %>% group_by(province) %>%
mutate(pct_food = count_food / sum(count_food)) %>%
ungroup(.) %>%
mutate(pct_total = count_food / sum(count_food))

...and the output:

# A tibble: 4 x 5
province food count_food pct_food pct_total
<chr> <chr> <int> <dbl> <dbl>
1 a no 1 0.333 0.167
2 a yes 2 0.667 0.333
3 b no 2 0.667 0.333
4 b yes 1 0.333 0.167
>

Calculating percentages for multiple numeric variables by a group variable

Here is one way to handle your task. You group the data by Tag. Then, you want to do the calculation you described for the four columns (i.e., Long, Medium, short, and Urgent). Your are dividing each value in each group with the sum of the values for each group in mutate_at().

library(dplyr)

group_by(df, Tag) %>%
mutate_at(.vars = vars(Long:Urgent),
.funs = funs(. / sum(., na.rm = TRUE)))

# Tag YPred Long Medium short Urgent
# <dbl> <fctr> <dbl> <dbl> <dbl> <dbl>
# 1 1 L1 0.4225589226 0.150000000 0.151041667 0.02958580
# 2 1 L2 0.2289562290 0.350000000 0.307291667 0.41420118
# 3 1 L3 0.2293771044 0.293055556 0.341145833 0.38461538
# 4 1 L4 0.1186868687 0.206944444 0.190104167 0.14201183
# 5 1 L5 0.0004208754 0.000000000 0.010416667 0.02958580
# 6 2 L1 0.1853046595 0.023611111 0.000000000 0.13017751
# 7 2 L2 0.2693548387 0.152777778 0.111979167 0.10650888
# 8 2 L3 0.3325268817 0.344444444 0.390625000 0.18343195
# 9 2 L4 0.2098566308 0.473611111 0.492187500 0.56804734
#10 2 L5 0.0029569892 0.005555556 0.005208333 0.01183432


Related Topics



Leave a reply



Submit