Calculate Percentages of a Binary Variable by Another Variable in R

Calculate percentages of a binary variable BY another variable in R

One option is data.table, which returns the group size and the proportion of rows with treatment == 2 in a single grouped call:

library(data.table)

setDT(d)[,.(.N,prop=sum(treatment==2)/.N),
         by=region]
   region   N prop
1:      A 200  0.5
2:      B 200  0.5
3:      C 200  0.5
4:      D 200  0.5
5:      E 200  0.5
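
The answer above does not include its input, so here is a minimal reproducible sketch. The table `d` and its values are hypothetical (5 regions, 4 rows each, treatment coded 1 or 2); only the column names follow the answer. It also uses mean() of a logical vector, which should give the same result as sum(...)/.N.

```r
library(data.table)

# Hypothetical data: 5 regions, 4 rows each, treatment coded 1 or 2
d <- data.table(
  region    = rep(LETTERS[1:5], each = 4),
  treatment = rep(c(1, 2), times = 10)
)

# mean() of a logical vector is the share of TRUE values,
# so this is equivalent to sum(treatment == 2) / .N
res <- d[, .(N = .N, prop = mean(treatment == 2)), by = region]
print(res)
```

Multiply prop by 100 if a percentage is wanted rather than a proportion.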

Calculate proportion of several binary variables by another variable

You can use tidyr to pivot your data first and then summarize it:

library(tidyr)

tidyr::pivot_longer(my_df, banana:peach,
                    names_to = "fruit") %>% 
  dplyr::group_by(gender, fruit) %>% 
  dplyr::summarize(prop = sum(value) / n())

   gender fruit       prop
   <chr>  <chr>      <dbl>
 1 female apple      0.5  
 2 female banana     0.625
 3 female orange     0.625
 4 female peach      0.5  
 5 female strawberry 0.25 
 6 male   apple      0.75 
 7 male   banana     0.667
 8 male   orange     0.25 
 9 male   peach      0.583
10 male   strawberry 0.333

You can pipe it to arrange if you want to sort by fruit. You can also add the number of observations in the summarize function with n = n().
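
As a sketch of both suggestions together, assuming a small hypothetical `my_df` (the values are made up; 1 means "yes" and 0 means "no"):

```r
library(dplyr)
library(tidyr)

# Hypothetical data: one row per respondent, one 0/1 column per fruit
my_df <- data.frame(
  gender = c("female", "female", "male", "male"),
  banana = c(1, 0, 1, 1),
  apple  = c(0, 0, 1, 0),
  peach  = c(1, 1, 0, 0)
)

res <- my_df %>%
  pivot_longer(banana:peach, names_to = "fruit") %>%
  group_by(gender, fruit) %>%
  # prop is the share of 1s; n is the number of observations per group
  summarize(prop = sum(value) / n(), n = n(), .groups = "drop") %>%
  # sort by fruit first, then gender
  arrange(fruit, gender)

print(res)
```

Setting .groups = "drop" returns an ungrouped tibble, which avoids the grouping message that summarize() otherwise prints.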

Frequencies / percentages of multiple binary variables by group with dplyr

After grouping by 'gender', get the 'total' with n(), then loop over the 'var' columns with across() and take the mean of the logical vector . == 1, which is the proportion of 1s in each column:

library(dplyr) # 1.0.0
data %>%
  group_by(gender) %>%
  summarise(total = n(),
            across(starts_with('var'), ~ mean(. == 1)))
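
The question's data is not shown here, so the following is a minimal sketch with a hypothetical data frame whose binary columns are named var1 and var2. It illustrates what the grouped across() call returns.

```r
library(dplyr)

# Hypothetical data: two 0/1 columns whose names start with "var"
data <- data.frame(
  gender = c("f", "f", "m", "m", "m"),
  var1   = c(1, 0, 1, 1, 0),
  var2   = c(0, 0, 1, 0, 0)
)

res <- data %>%
  group_by(gender) %>%
  summarise(
    total = n(),                                   # rows per gender
    across(starts_with("var"), ~ mean(.x == 1))    # proportion of 1s per column
  )

print(res)
```

For gender "f" this gives var1 = 0.5 and var2 = 0; for "m" it gives var1 = 2/3 and var2 = 1/3.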

Calculating counts and percentages of a variable

We can use add_count() to create the 'count' column, then take the mean of the binary column by group and multiply by 100 to get percent_yes. Subtracting percent_yes from 100 gives percent_no.

library(dplyr)
df1 %>%
  add_count(UserID, name = 'count') %>%
  group_by(UserID) %>%
  mutate(percent_yes = 100 * mean(substance_use),
         percent_no = 100 - percent_yes) %>%
  ungroup()

Output:

# A tibble: 7 × 5
  UserID substance_use count percent_yes percent_no
   <int>         <int> <int>       <dbl>      <dbl>
1  43124             0     5          40         60
2  43124             1     5          40         60
3  43124             0     5          40         60
4  43124             0     5          40         60
5  43124             1     5          40         60
6    215             1     2         100          0
7    215             1     2         100          0

NOTE: Here, we assumed no missing values in 'substance_use' column

Data:

df1 <- structure(list(UserID = c(43124L, 43124L, 43124L, 43124L, 43124L, 
215L, 215L), substance_use = c(0L, 1L, 0L, 0L, 1L, 1L, 1L)), 
class = "data.frame", row.names = c(NA, 
-7L))
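
If 'substance_use' does contain missing values, mean() returns NA for the affected groups. One possible adjustment, sketched here on a hypothetical data frame `df_na`, is to drop the NAs inside mean() and report how many values were actually observed. Whether missing answers should be excluded from the denominator is an assumption that depends on the analysis.

```r
library(dplyr)

# Hypothetical data with one missing answer for user 1
df_na <- data.frame(
  UserID        = c(1L, 1L, 1L, 2L, 2L),
  substance_use = c(0L, 1L, NA, 1L, 1L)
)

res <- df_na %>%
  group_by(UserID) %>%
  summarise(
    count       = n(),                          # all rows, including NA
    n_observed  = sum(!is.na(substance_use)),   # rows with an answer
    percent_yes = 100 * mean(substance_use, na.rm = TRUE),
    percent_no  = 100 - percent_yes
  )

print(res)
```

Here the percentages are computed over the observed answers only, so user 1 is 50% yes out of 2 observed values, not out of 3 rows.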

Calculating percentage of multiple columns of binary variables and plotting bar graph in r

Since the columns contain only 1/0 values, the mean of each column is the proportion of 1s, and multiplying by 100 gives the percentage. Use barplot() to plot it.

barplot(colMeans(df[-1]) * 100, ylim = c(0, 100), ylab='Percentage',
         xlab = 'bins', main = 'Percentage of yes')

Data:

df <- structure(list(name = c("a", "b", "c"), bin1 = c(1L, 0L, 0L), 
    bin2 = c(0L, 1L, 1L)), class = "data.frame", row.names = c(NA, -3L))
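
To see the values that end up as bar heights, the column means can be computed on their own. This sketch rebuilds the same three-row data frame with data.frame() and leaves out the plotting call.

```r
# Same data as above, written with data.frame()
df <- data.frame(
  name = c("a", "b", "c"),
  bin1 = c(1L, 0L, 0L),
  bin2 = c(0L, 1L, 1L)
)

# df[-1] drops the name column; colMeans() gives the share of 1s per column
pct <- colMeans(df[-1]) * 100
print(round(pct, 1))
```

bin1 has one 1 in three rows (about 33.3%) and bin2 has two (about 66.7%).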

How to calculate and display percentages from a binary dataframe

You can group your tibble by CandidateType and divide each row's Amount by the maximum Amount within its group, so every step is expressed as a share of the group's largest step (here, Application):

recruitmentDF %>% 
  group_by(CandidateType) %>% 
  mutate(Pct = scales::percent(Amount / max(Amount)))

This returns:

# A tibble: 6 x 4
# Groups:   CandidateType [2]
  CandidateType Step        Amount Pct   
  <fct>         <fct>        <int> <chr> 
1 External      Hiring         304 3.5%  
2 Internal      Hiring         164 19.8% 
3 External      Interview      950 10.9% 
4 Internal      Interview      512 61.8% 
5 External      Application   8726 100.0%
6 Internal      Application    828 100.0%
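
The input is not shown in the answer, so this sketch rebuilds `recruitmentDF` from the printed output above (as character columns rather than factors, which is an assumption). It uses round() instead of scales::percent() so that it only depends on dplyr and keeps Pct numeric.

```r
library(dplyr)

# Reconstructed from the printed output; column types are an assumption
recruitmentDF <- data.frame(
  CandidateType = rep(c("External", "Internal"), times = 3),
  Step          = rep(c("Hiring", "Interview", "Application"), each = 2),
  Amount        = c(304L, 164L, 950L, 512L, 8726L, 828L)
)

res <- recruitmentDF %>%
  group_by(CandidateType) %>%
  # share of the largest Amount within each candidate type, in percent
  mutate(Pct = round(100 * Amount / max(Amount), 1)) %>%
  ungroup()

print(res)
```

A numeric Pct column can still be sorted or plotted; formatting it as text with a percent sign is best left to the final display step.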

How to get percentage of categorical variables and overall percent of a single choice

Updated answer

The tricky part of this problem is that the requested table mixes row percentages and column percentages. Since all rows but the total row are column percentages, we need to process the data twice: first at the province * variable level of aggregation, and then with the variable aggregated over province.

new_data <-data.frame(province=c("a","b"),
                      food=c("yes","no","no","yes","yes","no"),
                      shelter_type=c("unfinished","permanent","transitional"))   
library(dplyr)
library(tidyr)

First we'll generate what ultimately becomes column percentages within a wide format data frame. We use pivot_longer() to create a narrow format tidy data set, create counts, summarise() the counts, and then group_by() variable & value to generate column percentages.

new_data  %>% group_by(province) %>%
     pivot_longer(.,c(food,shelter_type),names_to = "variable",
                  values_to = "value") %>% ungroup() %>%
     group_by(province,variable,value) %>% 
     mutate(count = 1) %>% summarise(.,count = sum(count)) %>% ungroup() %>%
     group_by(variable,value) %>% 
     mutate(pct = count / sum(count)) -> prov_var

Next, we reaggregate the data to create what will become the Total province. We take the original data, convert to narrow format tidy data, and this time group_by() variable & value to calculate the percentages across province.

new_data  %>% group_by(province) %>%
     pivot_longer(.,c(food,shelter_type),names_to = "variable",
                  values_to = "value") %>% ungroup() %>%
     group_by(variable,value) %>%  
     mutate(count = 1) %>% summarise(., count = sum(count)) %>% 
     mutate(province = "Total",
            pct = count / sum(count)) -> tot_var

Finally, we rbind() the data and use tidyr::pivot_wider() to create the wide format data frame as illustrated in the original question.

# now add rows & pivot_wider()
rbind(prov_var,tot_var) %>% 
     mutate(concat_var = paste(variable,value,sep="_")) %>% 
     select(-variable,-value,-count) %>% 
     pivot_wider(id_cols = province,names_from=concat_var,
                 values_from = pct)

...and the output:

# A tibble: 3 x 6
  province food_no food_yes shelter_type_perm… shelter_type_tra… shelter_type_unf…
  <chr>      <dbl>    <dbl>              <dbl>             <dbl>             <dbl>
1 a          0.333    0.667              0.5               0.5               0.5  
2 b          0.667    0.333              0.5               0.5               0.5  
3 Total      0.5      0.5                0.333             0.333             0.333
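
As a more compact variant of the same two-pass idea, dplyr::count() can stand in for the mutate(count = 1) plus summarise() step, and bind_rows() for rbind(). This is a sketch and not the original author's code; the data frame is the same as above with the vector recycling written out explicitly.

```r
library(dplyr)
library(tidyr)

new_data <- data.frame(
  province     = rep(c("a", "b"), times = 3),
  food         = c("yes", "no", "no", "yes", "yes", "no"),
  shelter_type = rep(c("unfinished", "permanent", "transitional"), times = 2)
)

# Long format: one row per respondent and variable
long <- new_data %>%
  pivot_longer(c(food, shelter_type), names_to = "variable", values_to = "value")

# Pass 1: column percentages (share of each value that falls in each province)
prov_var <- long %>%
  count(province, variable, value, name = "count") %>%
  group_by(variable, value) %>%
  mutate(pct = count / sum(count)) %>%
  ungroup()

# Pass 2: the Total row (share of each value within its variable)
tot_var <- long %>%
  count(variable, value, name = "count") %>%
  group_by(variable) %>%
  mutate(province = "Total", pct = count / sum(count)) %>%
  ungroup()

res <- bind_rows(prov_var, tot_var) %>%
  mutate(concat_var = paste(variable, value, sep = "_")) %>%
  select(province, concat_var, pct) %>%
  pivot_wider(names_from = concat_var, values_from = pct)

print(res)
```

Ungrouping both pieces before binding avoids combining tibbles that carry different grouping structures.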

Partial solutions with `tables::tabular()`

Another way to attempt to answer the question is with the tables package. We can generate the column percentages by province as follows.

library(tables)

# replicate column percentages, where "All" is 100

tabular((Factor(province,"Province") + 1) ~ 
                (Factor(food) + Factor(shelter_type)) * 
                (Percent("col")),data = new_data )

Unfortunately, the row for totals isn't what was requested.

          food            shelter_type                        
          no      yes     permanent    transitional unfinished
 Province Percent Percent Percent      Percent      Percent   
 a         33.33   66.67   50           50           50       
 b         66.67   33.33   50           50           50       
 All      100.00  100.00  100          100          100

We can fix the All row by configuring the table with row percentages, but then the data by province doesn't match what was requested.

# replicate row percentages in All row
tabular((Factor(province,"Province") + 1) ~ 
                (Factor(food) + Factor(shelter_type)) * 
                (Percent("row")),data = new_data )

          food            shelter_type                        
          no      yes     permanent    transitional unfinished
 Province Percent Percent Percent      Percent      Percent   
 a        33.33   66.67   33.33        33.33        33.33     
 b        66.67   33.33   33.33        33.33        33.33     
 All      50.00   50.00   33.33        33.33        33.33

Correct solution with `tabular()`

However, if we control the percentages by specifying them on the row dimension of the table instead of the column dimension, we can achieve the desired output.

tabular((Factor(province,"Province")*( colPct = Percent("col")) + 1*(rowPct = Percent("row")))  ~ 
                (Factor(food) + Factor(shelter_type)),data = new_data )

...and the output:

                 food        shelter_type                        
 Province        no    yes   permanent    transitional unfinished
 a        colPct 33.33 66.67 50.00        50.00        50.00     
 b        colPct 66.67 33.33 50.00        50.00        50.00     
 All      rowPct 50.00 50.00 33.33        33.33        33.33

Original answer

We'll use the dplyr package to summarise the data by province & food, calculate percentages, and then ungroup() to calculate percentage of total responses.

new_data <-data.frame(province=c("a","b"),
                      food=c("yes","no","no","yes","yes","no"),
                      shelter_type=c("unfinished","permanent","transitional"))

library(dplyr)

new_data %>% group_by(province,food) %>%
     summarise(count_food = n()) %>% group_by(province) %>%
     mutate(pct_food = count_food / sum(count_food)) %>%
     ungroup(.) %>%
     mutate(pct_total = count_food / sum(count_food))

...and the output:

# A tibble: 4 x 5
  province food  count_food pct_food pct_total
  <chr>    <chr>      <int>    <dbl>     <dbl>
1 a        no             1    0.333     0.167
2 a        yes            2    0.667     0.333
3 b        no             2    0.667     0.333
4 b        yes            1    0.333     0.167

Calculating percentages for multiple numeric variables by a group variable

Here is one way to handle your task. Group the data by Tag, then apply the calculation you described to the four columns (Long, Medium, short, and Urgent): each value is divided by the sum of its column within the group.

library(dplyr)

df %>%
  group_by(Tag) %>%
  # across() replaces the older mutate_at()/funs() idiom; funs() is deprecated (dplyr >= 1.0.0)
  mutate(across(Long:Urgent, ~ .x / sum(.x, na.rm = TRUE)))

#     Tag  YPred         Long      Medium       short     Urgent
#   <dbl> <fctr>        <dbl>       <dbl>       <dbl>      <dbl>
# 1     1     L1 0.4225589226 0.150000000 0.151041667 0.02958580
# 2     1     L2 0.2289562290 0.350000000 0.307291667 0.41420118
# 3     1     L3 0.2293771044 0.293055556 0.341145833 0.38461538
# 4     1     L4 0.1186868687 0.206944444 0.190104167 0.14201183
# 5     1     L5 0.0004208754 0.000000000 0.010416667 0.02958580
# 6     2     L1 0.1853046595 0.023611111 0.000000000 0.13017751
# 7     2     L2 0.2693548387 0.152777778 0.111979167 0.10650888
# 8     2     L3 0.3325268817 0.344444444 0.390625000 0.18343195
# 9     2     L4 0.2098566308 0.473611111 0.492187500 0.56804734
#10     2     L5 0.0029569892 0.005555556 0.005208333 0.01183432
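
A quick sanity check for this kind of rescaling is that every rescaled column sums to 1 within each group. The sketch below uses a small hypothetical `df` (made-up counts; only the column names follow the answer) and dplyr's across().

```r
library(dplyr)

# Hypothetical counts; column names follow the answer above
df <- data.frame(
  Tag    = c(1, 1, 2, 2),
  YPred  = c("L1", "L2", "L1", "L2"),
  Long   = c(30, 10, 5, 15),
  Medium = c(2, 6, 1, 3),
  short  = c(4, 4, 0, 8),
  Urgent = c(1, 3, 2, 2)
)

# Divide each value by its column total within the Tag group
shares <- df %>%
  group_by(Tag) %>%
  mutate(across(Long:Urgent, ~ .x / sum(.x, na.rm = TRUE))) %>%
  ungroup()

# Within each Tag, every rescaled column should sum to 1
check <- shares %>%
  group_by(Tag) %>%
  summarise(across(Long:Urgent, sum))

print(check)
```

If a column sums to 0 within a group, the division yields NaN, so that case would need separate handling.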
