Calculate percentages of a binary variable BY another variable in R
You could also use data.table:
library(data.table)
setDT(d)[,.(.N,prop=sum(treatment==2)/.N),
by=region]
region N prop
1: A 200 0.5
2: B 200 0.5
3: C 200 0.5
4: D 200 0.5
5: E 200 0.5
Calculate proportion of several binary variables by another variable
You can use tidyr
to pivot your data first and then summarize it:
library(tidyr)
tidyr::pivot_longer(my_df, banana:peach,
names_to = "fruit") %>%
dplyr::group_by(gender, fruit) %>%
dplyr::summarize(prop = sum(value) / n())
gender fruit prop
<chr> <chr> <dbl>
1 female apple 0.5
2 female banana 0.625
3 female orange 0.625
4 female peach 0.5
5 female strawberry 0.25
6 male apple 0.75
7 male banana 0.667
8 male orange 0.25
9 male peach 0.583
10 male strawberry 0.333
You can pipe it to arrange
if you want to sort by fruit
. You can also add the number of observations in the summarize
function with n = n()
.
Frequencies / percentages of multiple binary variables by group with dplyr
After grouping by 'gender', get the 'total' with n()
, then loop over the 'var' variables with across
, get the mean
of logical vector
library(dplyr) # 1.0.0
data %>%
group_by(gender) %>%
summarise(total = n(),
across(starts_with('var'), ~ mean(. == 1)))
calculating counts and percentages of a variable
We may use add_count
to create the 'count' column and then get the mean
of the binary column by group to return the percent_yes
and subtract 100 from the percent_yes to return percent_no
library(dplyr)
df1 %>%
add_count(UserID, name = 'count') %>%
group_by(UserID) %>%
mutate(percent_yes = 100 * mean(substance_use),
percent_no = 100 - percent_yes) %>%
ungroup
-output
# A tibble: 7 × 5
UserID substance_use count percent_yes percent_no
<int> <int> <int> <dbl> <dbl>
1 43124 0 5 40 60
2 43124 1 5 40 60
3 43124 0 5 40 60
4 43124 0 5 40 60
5 43124 1 5 40 60
6 215 1 2 100 0
7 215 1 2 100 0
NOTE: Here, we assumed no missing values in 'substance_use' column
data
df1 <- structure(list(UserID = c(43124L, 43124L, 43124L, 43124L, 43124L,
215L, 215L), substance_use = c(0L, 1L, 0L, 0L, 1L, 1L, 1L)),
class = "data.frame", row.names = c(NA,
-7L))
Calculating percentage of multiple columns of binary variables and plotting bar graph in r
Since you have only 1/0 values we can take mean of columns to get the percentage of 1's. Use barplot
to plot it.
barplot(colMeans(df[-1]) * 100, ylim = c(0, 100), ylab='Percentage',
xlab = 'bins', main = 'Percentage of yes')
data
df <- structure(list(name = c("a", "b", "c"), bin1 = c(1L, 0L, 0L),
bin2 = c(0L, 1L, 1L)), class = "data.frame", row.names = c(NA, -3L))
How to calculate and display percentages from a binary dataframe
You can group your tibble
by CandidateType
and divide the Amount
of every row by the maximum Amount
:
recruitmentDF %>%
group_by(CandidateType) %>%
mutate(Pct = scales::percent(Amount / max(Amount)))
This returns:
# A tibble: 6 x 4
# Groups: CandidateType [2]
CandidateType Step Amount Pct
<fct> <fct> <int> <chr>
1 External Hiring 304 3.5%
2 Internal Hiring 164 19.8%
3 External Interview 950 10.9%
4 Internal Interview 512 61.8%
5 External Application 8726 100.0%
6 Internal Application 828 100.0%
How to get percentage of categorical variables and overall percent of a single choice
Updated answer
The tricky part of this problem is the difference between row percentages and column percentages that are represented in the data. Since all rows but the total row are column percentages, we will need to process the data twice, first for the the province
* variable
level of aggregation, and then variable
aggregated over province
.
new_data <-data.frame(province=c("a","b"),
food=c("yes","no","no","yes","yes","no"),
shelter_type=c("unfinished","permanent","transitional"))
library(dplyr)
library(tidyr)
First we'll generate what ultimately becomes column percentages within a wide format data frame. We use pivot_longer()
to create a narrow format tidy data set, create counts, summarise()
the counts, and then group_by()
variable & value to generate column percentages.
new_data %>% group_by(province) %>%
pivot_longer(.,c(food,shelter_type),names_to = "variable",
values_to = "value") %>% ungroup() %>%
group_by(province,variable,value) %>%
mutate(count = 1) %>% summarise(.,count = sum(count)) %>% ungroup() %>%
group_by(variable,value) %>%
mutate(pct = count / sum(count)) -> prov_var
Next, we reaggregate the data to create what will become the Total
province. We take the original data, convert to narrow format tidy data, and this time group_by()
variable & value to calculate the percentages across province
.
new_data %>% group_by(province) %>%
pivot_longer(.,c(food,shelter_type),names_to = "variable",
values_to = "value") %>% ungroup() %>%
group_by(variable,value) %>%
mutate(count = 1) %>% summarise(., count = sum(count)) %>%
mutate(province = "Total",
pct = count / sum(count)) -> tot_var
Finally, we rbind()
the data and use tidyr::pivot_wider()
to create the wide format data frame as illustrated in the original question.
# now add rows & pivot_wider()
rbind(prov_var,tot_var) %>%
mutate(concat_var = paste(variable,value,sep="_")) %>%
select(-variable,-value,-count) %>%
pivot_wider(id_cols = province,names_from=concat_var,
values_from = pct)
...and the output:
# A tibble: 3 x 6
province food_no food_yes shelter_type_perm… shelter_type_tra… shelter_type_unf…
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 a 0.333 0.667 0.5 0.5 0.5
2 b 0.667 0.333 0.5 0.5 0.5
3 Total 0.5 0.5 0.333 0.333 0.333
Partial solutions with tables::tabular()
Another way to attempt to answer the question is with the tables
package. We can generate the column percentages by province
as follows.
library(tables)
# replicate column percentages, where "All" is 100
tabular((Factor(province,"Province") + 1) ~
(Factor(food) + Factor(shelter_type)) *
(Percent("col")),data = new_data )
Unfortunately, the row for totals isn't what was requested.
food shelter_type
no yes permanent transitional unfinished
Province Percent Percent Percent Percent Percent
a 33.33 66.67 50 50 50
b 66.67 33.33 50 50 50
All 100.00 100.00 100 100 100
We can fix the All
row by configuring the table with row percentages, but then the data by province doesn't match what was requested.
# replicate row percentages in All row
tabular((Factor(province,"Province") + 1) ~
(Factor(food) + Factor(shelter_type)) *
(Percent("row")),data = new_data )
food shelter_type
no yes permanent transitional unfinished
Province Percent Percent Percent Percent Percent
a 33.33 66.67 33.33 33.33 33.33
b 66.67 33.33 33.33 33.33 33.33
All 50.00 50.00 33.33 33.33 33.33
Correct solution with tabular()
However, if we control the percentages by specifying them on the row dimension of the table instead of the column dimension, we can achieve the desired output.
tabular((Factor(province,"Province")*( colPct = Percent("col")) + 1*(rowPct = Percent("row"))) ~
(Factor(food) + Factor(shelter_type)),data = new_data )
...and the output:
food shelter_type
Province no yes permanent transitional unfinished
a colPct 33.33 66.67 50.00 50.00 50.00
b colPct 66.67 33.33 50.00 50.00 50.00
All rowPct 50.00 50.00 33.33 33.33 33.33
Original answer
We'll use the dplyr
package to summarise the data by province & food, calculate percentages, and then ungroup()
to calculate percentage of total responses.
new_data <-data.frame(province=c("a","b"),
food=c("yes","no","no","yes","yes","no"),
shelter_type=c("unfinished","permanent","transitional"))
library(dplyr)
new_data %>% group_by(province,food) %>%
summarise(count_food = n()) %>% group_by(province) %>%
mutate(pct_food = count_food / sum(count_food)) %>%
ungroup(.) %>%
mutate(pct_total = count_food / sum(count_food))
...and the output:
# A tibble: 4 x 5
province food count_food pct_food pct_total
<chr> <chr> <int> <dbl> <dbl>
1 a no 1 0.333 0.167
2 a yes 2 0.667 0.333
3 b no 2 0.667 0.333
4 b yes 1 0.333 0.167
>
Calculating percentages for multiple numeric variables by a group variable
Here is one way to handle your task. You group the data by Tag
. Then, you want to do the calculation you described for the four columns (i.e., Long
, Medium
, short
, and Urgent
). Your are dividing each value in each group with the sum of the values for each group in mutate_at()
.
library(dplyr)
group_by(df, Tag) %>%
mutate_at(.vars = vars(Long:Urgent),
.funs = funs(. / sum(., na.rm = TRUE)))
# Tag YPred Long Medium short Urgent
# <dbl> <fctr> <dbl> <dbl> <dbl> <dbl>
# 1 1 L1 0.4225589226 0.150000000 0.151041667 0.02958580
# 2 1 L2 0.2289562290 0.350000000 0.307291667 0.41420118
# 3 1 L3 0.2293771044 0.293055556 0.341145833 0.38461538
# 4 1 L4 0.1186868687 0.206944444 0.190104167 0.14201183
# 5 1 L5 0.0004208754 0.000000000 0.010416667 0.02958580
# 6 2 L1 0.1853046595 0.023611111 0.000000000 0.13017751
# 7 2 L2 0.2693548387 0.152777778 0.111979167 0.10650888
# 8 2 L3 0.3325268817 0.344444444 0.390625000 0.18343195
# 9 2 L4 0.2098566308 0.473611111 0.492187500 0.56804734
#10 2 L5 0.0029569892 0.005555556 0.005208333 0.01183432
Related Topics
How to Place Legends at Different Sides of Plot (Bottom and Right Side) with Ggplot2
Why Are the Colors Wrong on This Ggplot
Integrate() in R Gives Terribly Wrong Answer
How to Calculate a Table of Pairwise Counts from Long-Form Data Frame
X^(1/3)' Behaves Differently for Negative Scalar 'X' and Vector 'X' with Negative Values
Splitting String Between Capital and Lowercase Character in R
Row Not Consolidating Duplicates in R When Using Multiple Months in Date Filter
Convert String of Anyformat into Dd-Mm-Yy Hh:Mm:Ss in R
Convert Factor to Date Class for Multiple Columns
Combine/Merge Columns While Avoiding Na
How to Modify Unexported Object in a Package
How to Detect That a Vector Is Subset of Specific Vector
Why Does Dplyr's Filter Drop Na Values from a Factor Variable
How to Get the Second Sub Element of Every Element in a List
In R, Check If String Appears in Row of Dataframe (In Any Column)