Calculate the percentage of NAs in each column using dplyr
First, here is some test data:
a<- c(1,NA,NA,4)
b<- c(NA,2,3,4)
x<- data.frame(a,b)
x
#    a  b
# 1  1 NA
# 2 NA  2
# 3 NA  3
# 4  4  4
Then you can use colMeans(is.na(x)), which returns the proportion of NA values in each column (multiply by 100 for a percentage):
colMeans(is.na(x))
# a b
# 0.50 0.25
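Since the question asks for dplyr and for a percentage, the same calculation can be sketched with summarise() and across(). This is only an illustration, assuming dplyr 1.0.0 or later; the name pct_na is made up for the example and is not part of the original answer.

```r
library(dplyr)

# Same test data as above
x <- data.frame(a = c(1, NA, NA, 4), b = c(NA, 2, 3, 4))

# mean() of a logical vector is the share of TRUE values;
# multiplying by 100 turns the proportion into a percentage
pct_na <- x %>%
  summarise(across(everything(), ~ mean(is.na(.x)) * 100))

pct_na
```

The result is a one-row data frame with one column per original column, which is convenient if further dplyr verbs follow.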
Is there a way to calculate the percentage of NA's in each column of a dataframe, but with the df split into separate groups?
Grouped by 'programme', get the mean of NA elements in the other columns, gather to 'long' format and spread back to 'wide' format:
library(tidyverse)
df %>%
  group_by(programme) %>%
  # formula lambda instead of funs(), which is deprecated since dplyr 0.8.0
  summarise_all(~ mean(is.na(.))) %>%
  gather(variables, val, -programme) %>%
  spread(programme, val)
# A tibble: 3 x 4
# variables A B C
# <chr> <int> <int> <int>
#1 v1 0 1 0
#2 v2 1 0 0
#3 v3 0 0 1
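The question's df is not reproduced here, so the following is a minimal self-contained sketch on hypothetical data (the column values are invented for illustration). It assumes dplyr 1.0.0 or later and tidyr 1.0.0 or later, and uses across(), pivot_longer() and pivot_wider() in place of the older summarise_all(), gather() and spread().

```r
library(dplyr)
library(tidyr)

# Hypothetical data; the original df is not shown in the question
df <- data.frame(
  programme = c("A", "A", "B", "B"),
  v1 = c(1, NA, 3, 4),
  v2 = c(NA, NA, 1, 2)
)

na_share <- df %>%
  group_by(programme) %>%
  # across(everything()) skips the grouping column automatically
  summarise(across(everything(), ~ mean(is.na(.x)))) %>%
  # long format: one row per programme/variable pair
  pivot_longer(-programme, names_to = "variables", values_to = "val") %>%
  # wide format: one column per programme
  pivot_wider(names_from = programme, values_from = val)

na_share
```

Each cell of na_share is the proportion of NA values for that variable within that programme.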
Using dplyr function to calculate percentage within groups
library(dplyr)
df %>%
  # line below to freeze order of type_n if not ordered factor already
  mutate(type_n = forcats::fct_inorder(type_n)) %>%
  group_by(type_n) %>%
  summarize(n = n(), total = sum(population)) %>%
  mutate(new_col = (n / total) %>% scales::percent(decimal.mark = ",", suffix = ""))
# A tibble: 3 x 4
type_n n total new_col
<fct> <int> <int> <chr>
1 small 2 7 28,6
2 medium 2 14 14,3
3 large 3 15 20,0
How to find the percentage of NAs in a data.frame?
x = data.frame(x = c(1, 2, NA, 3), y = c(NA, NA, 4, 5))
For the whole dataframe:
sum(is.na(x))/prod(dim(x))
Or
mean(is.na(x))
For columns:
apply(x, 2, function(col)sum(is.na(col))/length(col))
Or
colMeans(is.na(x))
R: Calculate percentage of missing Values (NA) per day for a Column in a data frame using panel data and remove the days with missing data of over 25%
If you summarize(), you lose a lot of information on the individual days, so mutate() is used here instead. Also, call group_by() before the other dplyr verbs. You can calculate the percentage of NA for each day by dividing the number of NA values by the number of rows for that day. as_tibble() is only used to show the number of rows more clearly; the code works without it too. I added a column "CountDate" so that you know how many times the same day appears in your data frame.
Data %>%
  as_tibble() %>%
  group_by(Date) %>%
  mutate(CountDate = n(), PercNA = sum(is.na(Size)) / n() * 100)
# A tibble: 27 x 5
# Groups: Date [9]
Product Date Size CountDate PercNA
<chr> <chr> <int> <int> <dbl>
1 A 01.09.2018 10 3 0
2 A 02.09.2018 9 3 0
3 A 03.09.2018 NA 3 100
4 A 04.09.2018 3 3 0
5 A 05.09.2018 4 3 0
6 A 11.11.2020 5 3 33.3
7 A 12.11.2020 3 3 0
8 A 13.11.2020 NA 3 33.3
9 A 14.11.2020 6 3 0
10 B 01.09.2018 7 3 0
# ... with 17 more rows
To remove the dates having >25% NA, just filter():
Data %>%
  as_tibble() %>%
  group_by(Date) %>%
  mutate(CountDate = n(), PercNA = sum(is.na(Size)) / n() * 100) %>%
  # keep days with at most 25% NA, i.e. drop days with more than 25%
  filter(PercNA <= 25) %>%
  ungroup()
# A tibble: 18 x 5
Product Date Size CountDate PercNA
<chr> <chr> <int> <int> <dbl>
1 A 01.09.2018 10 3 0
2 A 02.09.2018 9 3 0
3 A 04.09.2018 3 3 0
4 A 05.09.2018 4 3 0
5 A 12.11.2020 3 3 0
6 A 14.11.2020 6 3 0
7 B 01.09.2018 7 3 0
8 B 02.09.2018 4 3 0
9 B 04.09.2018 4 3 0
10 B 05.09.2018 6 3 0
11 B 12.11.2020 4 3 0
12 B 14.11.2020 7 3 0
13 C 01.09.2018 3 3 0
14 C 02.09.2018 4 3 0
15 C 04.09.2018 2 3 0
16 C 05.09.2018 4 3 0
17 C 12.11.2020 7 3 0
18 C 14.11.2020 5 3 0
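If the helper columns are not needed, the per-day condition can also go straight into a grouped filter(). The sketch below runs on a small hypothetical data frame (not the original poster's Data) and treats the 25% threshold as "keep days with at most 25% NA".

```r
library(dplyr)

# Hypothetical panel data: three products observed on two days
Data <- data.frame(
  Product = rep(c("A", "B", "C"), each = 2),
  Date    = rep(c("01.09.2018", "03.09.2018"), times = 3),
  Size    = c(10, NA, 7, 5, 3, 4)
)

kept <- Data %>%
  group_by(Date) %>%
  # the condition is evaluated once per day and applied to all its rows
  filter(mean(is.na(Size)) <= 0.25) %>%
  ungroup()

kept
```

Here one of the three values on 03.09.2018 is missing (about 33%), so all rows for that day are dropped and only the rows for 01.09.2018 remain.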
How to get percentage value of each column across all rows in R
As @camille mentioned in the comments, you need na.rm = TRUE in the rowSums call. To get the percentage of each model within a manufacturer, first count the rows grouped by manufacturer and model, then compute the percentage grouped only by manufacturer. dplyr helps here because summarise removes one layer of grouping, so you just need to add a mutate:
library(dplyr)
library(tidyr)
library(ggplot2)
new_mpg <- mpg %>%
  group_by(manufacturer, model) %>%
  summarise(n = n()) %>%
  mutate(n = n / sum(n)) %>%
  spread(model, n) %>%
  mutate_if(is.integer, as.numeric)
new_mpg[, -1] %>%
  mutate(sum = rowSums(., na.rm = TRUE))
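The "one layer of grouping is removed" behaviour can be checked directly. The sketch below uses the built-in mtcars data instead of mpg, purely to avoid the ggplot2 dependency, and spells out the default with .groups = "drop_last" (an argument available from dplyr 1.0.0, which is an assumption about the installed version).

```r
library(dplyr)

shares <- mtcars %>%
  group_by(cyl, gear) %>%
  # after summarise the result is still grouped, but only by cyl
  summarise(n = n(), .groups = "drop_last") %>%
  # so sum(n) is computed within each cyl, not over the whole table
  mutate(share = n / sum(n))

group_vars(shares)

# the shares add up to 1 inside every cyl group
totals <- shares %>% summarise(total = sum(share))
totals
```

group_vars() reports which grouping variables remain, which is a quick way to confirm what a following mutate() will operate over.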
R: how to total the number of NA in each col of data.frame
You could try:
colSums(is.na(df))
# V1 V2 V3 V4 V5
# 2 4 2 4 4
Data:
set.seed(42)
df <- as.data.frame(matrix(sample(c(NA,0:4), 5*20,replace=TRUE), ncol=5))
Calculate column NA's based on a grouping variable with dplyr
Another dplyr version is to first group_by subject and flag, for each column, whether the group has any NA value. Then group_by column (key), sum those flags to get n, and divide n by the number of unique values of subject to get prop.
library(dplyr)
library(tidyr)
db %>%
  group_by(subject) %>%
  summarise_all(~ any(is.na(.))) %>%
  ungroup() %>%
  select(-subject) %>%
  gather() %>%
  group_by(key) %>%
  summarise(n = sum(value),
            prop = n / n_distinct(db$subject))
# key n prop
# <chr> <int> <dbl>
#1 item_1 2 1
#2 item_2 1 0.5
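The db object is not shown in the answer, so here is a self-contained sketch on hypothetical data chosen to give the same shape of result. It assumes dplyr 1.0.0 or later and tidyr 1.0.0 or later, and swaps in across() and pivot_longer() for summarise_all() and gather().

```r
library(dplyr)
library(tidyr)

# Hypothetical data: two subjects, two items
db <- data.frame(
  subject = c(1, 1, 2, 2),
  item_1  = c(NA, 1, 2, NA),
  item_2  = c(1, 2, NA, 3)
)

res <- db %>%
  group_by(subject) %>%
  # TRUE if the subject has at least one NA in that item
  summarise(across(everything(), ~ any(is.na(.x)))) %>%
  pivot_longer(-subject, names_to = "key", values_to = "value") %>%
  group_by(key) %>%
  # n: subjects with any NA; prop: share of all subjects
  summarise(n = sum(value),
            prop = n / n_distinct(db$subject))

res
```

With this data both subjects have a missing item_1, while only subject 2 has a missing item_2.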
R: Calculate percentage of observations in a column that are below a certain value for panel data
Instead of count, which requires a data.frame/tibble, use sum on a logical vector to get the count: TRUE values are counted as 1 and FALSE as 0.
library(dplyr)
df %>%
  group_by(Product) %>%
  dplyr::summarise(CountDate = n(),
                   SmallSize = sum(Size < 1000000, na.rm = TRUE),
                   .groups = "drop") %>%
  dplyr::mutate(Percent = SmallSize / CountDate)
# A tibble: 3 × 4
Product CountDate SmallSize Percent
<chr> <int> <int> <dbl>
1 A 6 2 0.333
2 B 6 3 0.5
3 C 6 1 0.167
Also, we don't need to create both columns. The proportion can be calculated directly with mean:
df %>%
  group_by(Product) %>%
  dplyr::summarise(Percent = mean(Size < 1000000, na.rm = TRUE))
# A tibble: 3 × 2
Product Percent
<chr> <dbl>
1 A 0.333
2 B 0.5
3 C 0.167
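One detail worth checking on your own data: the two versions above agree only when Size has no missing values. With na.rm = TRUE, mean() drops NA rows from the denominator, whereas n() still counts them. A small sketch on hypothetical data (the column names via_n and via_mean are made up for the example):

```r
library(dplyr)

# Hypothetical data with one missing Size
df <- data.frame(
  Product = c("A", "A", "A", "A"),
  Size    = c(500000, 2000000, NA, 3000000)
)

res <- df %>%
  group_by(Product) %>%
  summarise(
    # NA row stays in the denominator: 1 small value out of 4 rows
    via_n    = sum(Size < 1000000, na.rm = TRUE) / n(),
    # NA row is dropped: 1 small value out of 3 non-missing rows
    via_mean = mean(Size < 1000000, na.rm = TRUE)
  )

res
```

Which denominator is appropriate depends on whether rows with a missing Size should count as observations.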
R dplyr calculating group and column percentages
You can add multiple statements in summarise so you don't have to create temporary objects a and b. To calculate the overall percentage, divide each group's number by the sum of the column.
library(dplyr)
test %>%
  group_by(group) %>%
  summarise(no_resp = sum(resp, na.rm = TRUE),
            all = n_distinct(id),
            resp_rate = round(no_resp / all * 100)) %>%
  mutate(no_resp_perc = no_resp / sum(no_resp) * 100)
# group no_resp all resp_rate no_resp_perc
# <chr> <int> <int> <dbl> <dbl>
#1 A 2 3 67 25
#2 B 2 3 67 25
#3 C 1 2 50 12.5
#4 D 3 4 75 37.5