Calculate Using dplyr, Percentage of NAs in Each Column

Calculate using dplyr, percentage of NAs in each column

First, I created some test data:

a<- c(1,NA,NA,4)
b<- c(NA,2,3,4)
x<- data.frame(a,b)
x
#    a  b
# 1  1 NA
# 2 NA  2
# 3 NA  3
# 4  4  4

Then you can use colMeans(is.na(x)), which gives the proportion of NA values in each column:

colMeans(is.na(x))
#    a    b 
# 0.50 0.25
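
To report these as percentages, multiply by 100. Here is a small sketch that reuses the x data frame from above and also shows a dplyr equivalent. It assumes dplyr 1.0.0 or later, where across() is available; the names pct_base and pct_dplyr are only illustrative.

```r
library(dplyr)

a <- c(1, NA, NA, 4)
b <- c(NA, 2, 3, 4)
x <- data.frame(a, b)

# Base R: proportion of NA per column, scaled to a percentage
pct_base <- colMeans(is.na(x)) * 100
pct_base
#  a  b 
# 50 25 

# dplyr: one-row summary holding the NA percentage of every column
pct_dplyr <- x %>%
  summarise(across(everything(), ~ mean(is.na(.x)) * 100))
pct_dplyr
```

Both give 50 for a and 25 for b; the dplyr version returns a one-row data frame instead of a named vector.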

Is there a way to calculate the percentage of NAs in each column of a data frame, but with the data frame split into separate groups?

Group by 'programme', take the mean of is.na() over every other column to get the proportion of NA values, then gather to 'long' format and spread back to 'wide' format so that each programme becomes a column:

library(tidyverse)
df %>% 
  group_by(programme) %>%
  summarise_all(~ mean(is.na(.))) %>%  # funs() is deprecated; a formula does the same job
  gather(variables, val, -programme) %>% 
  spread(programme, val)
# A tibble: 3 x 4
#   variables     A     B     C
#   <chr>     <dbl> <dbl> <dbl>
#1 v1            0     1     0
#2 v2            1     0     0
#3 v3            0     0     1
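
In current tidyverse releases, summarise_all(), gather() and spread() are superseded. The same pipeline can be sketched with across(), pivot_longer() and pivot_wider(). The df below is hypothetical, since the original data is not shown; it is made up so that each programme has exactly one fully missing column.

```r
library(dplyr)
library(tidyr)

# Hypothetical data: the original df is not shown in the answer
df <- data.frame(
  programme = c("A", "A", "B", "B", "C", "C"),
  v1 = c(1, 2, NA, NA, 5, 6),
  v2 = c(NA, NA, 3, 4, 5, 6),
  v3 = c(1, 2, 3, 4, NA, NA)
)

na_share <- df %>%
  group_by(programme) %>%
  # proportion of NA in every non-grouping column
  summarise(across(everything(), ~ mean(is.na(.x)))) %>%
  # long format: one row per programme/variable pair
  pivot_longer(-programme, names_to = "variables", values_to = "val") %>%
  # wide format: one column per programme
  pivot_wider(names_from = programme, values_from = val)

na_share
```

The result has one row per variable and one column per programme, in the same layout as the output above.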

Using dplyr function to calculate percentage within groups

library(dplyr)

df %>%
  # line below to freeze order of type_n if not ordered factor already
  mutate(type_n = forcats::fct_inorder(type_n)) %>%
  group_by(type_n) %>%
  summarize(n = n(), total = sum(population)) %>%
  mutate(new_col = (n / total) %>% scales::percent(decimal.mark = ",", suffix = ""))

# A tibble: 3 x 4
  type_n     n total new_col
  <fct>  <int> <int> <chr>  
1 small      2     7 28,6   
2 medium     2    14 14,3   
3 large      3    15 20,0

How to find the percentage of NAs in a data.frame?

x = data.frame(x = c(1, 2, NA, 3), y = c(NA, NA, 4, 5))

For the whole dataframe:

sum(is.na(x))/prod(dim(x))

mean(is.na(x))

For columns:

apply(x, 2, function(col)sum(is.na(col))/length(col))

colMeans(is.na(x))
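
For the example data frame above, these expressions can be checked by hand: 3 of the 8 cells are NA, 1 of 4 in column x and 2 of 4 in column y. A quick sketch (overall and per_col are just illustrative names):

```r
x <- data.frame(x = c(1, 2, NA, 3), y = c(NA, NA, 4, 5))

# Whole data frame: is.na(x) is a logical matrix, so mean() is the share of TRUE cells
overall <- mean(is.na(x))
overall
# [1] 0.375

# Per column: column-wise mean of the same logical matrix
per_col <- colMeans(is.na(x))
per_col
#    x    y 
# 0.25 0.50 
```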

R: Calculate percentage of missing Values (NA) per day for a Column in a data frame using panel data and remove the days with missing data of over 25%

If you summarize(), you lose a lot of information about the individual days. Also, call group_by() before the other dplyr verbs. You can calculate the percentage of NA by dividing the number of NA values in Size by the number of rows for each date. as_tibble() is only used to show the number of rows more clearly; the code works without it too. I added a column "CountDate" so that you know how many times the same day appears in your data frame.

Data %>% as_tibble() %>%  
  group_by(Date) %>% 
  mutate(CountDate = n(), PercNA = sum(is.na(Size))/n()*100)

# A tibble: 27 x 5
# Groups:   Date [9]
   Product Date        Size CountDate PercNA
   <chr>   <chr>      <int>     <int>  <dbl>
 1 A       01.09.2018    10         3    0  
 2 A       02.09.2018     9         3    0  
 3 A       03.09.2018    NA         3  100  
 4 A       04.09.2018     3         3    0  
 5 A       05.09.2018     4         3    0  
 6 A       11.11.2020     5         3   33.3
 7 A       12.11.2020     3         3    0  
 8 A       13.11.2020    NA         3   33.3
 9 A       14.11.2020     6         3    0  
10 B       01.09.2018     7         3    0  
# ... with 17 more rows

To remove the dates having >25% NA, just filter():

Data %>% as_tibble() %>%  
  group_by(Date) %>% 
  mutate(CountDate = n(), PercNA = sum(is.na(Size))/n()*100) %>%
  filter(PercNA <= 25) %>%  # keep days with at most 25% NA, i.e. drop those with >25%
  ungroup()

# A tibble: 18 x 5
   Product Date        Size CountDate PercNA
   <chr>   <chr>      <int>     <int>  <dbl>
 1 A       01.09.2018    10         3      0
 2 A       02.09.2018     9         3      0
 3 A       04.09.2018     3         3      0
 4 A       05.09.2018     4         3      0
 5 A       12.11.2020     3         3      0
 6 A       14.11.2020     6         3      0
 7 B       01.09.2018     7         3      0
 8 B       02.09.2018     4         3      0
 9 B       04.09.2018     4         3      0
10 B       05.09.2018     6         3      0
11 B       12.11.2020     4         3      0
12 B       14.11.2020     7         3      0
13 C       01.09.2018     3         3      0
14 C       02.09.2018     4         3      0
15 C       04.09.2018     2         3      0
16 C       05.09.2018     4         3      0
17 C       12.11.2020     7         3      0
18 C       14.11.2020     5         3      0
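
If the helper columns are not needed, the same filter can probably be written directly on the grouped data. This is only a sketch on hypothetical panel data (the original Data is not shown); the column names follow the answer above.

```r
library(dplyr)

# Hypothetical panel data: three products observed on three dates
Data <- data.frame(
  Product = rep(c("A", "B", "C"), each = 3),
  Date    = rep(c("01.09.2018", "02.09.2018", "03.09.2018"), times = 3),
  Size    = c(10, 9, NA, 7, 4, 5, 3, 4, 6)
)

kept <- Data %>%
  group_by(Date) %>%
  # keep a day only if at most 25% of its Size values are missing
  filter(mean(is.na(Size)) <= 0.25) %>%
  ungroup()

kept
```

Here 03.09.2018 has one NA out of three rows (33.3%), so all three of its rows are dropped and six rows remain.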

How to get percentage value of each column across all rows in R

As @camille mentioned in the comments, you need na.rm = TRUE in the rowSums() call. To get the share of each model within its manufacturer, first count the rows grouped by manufacturer and model, then compute the percentage grouped only by manufacturer. dplyr helps here because summarise() drops the last grouping level, so you only need to add a mutate():

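library(dplyr)
library(tidyr)
library(ggplot2)  # provides the mpg dataset
new_mpg <- mpg %>%
  group_by(manufacturer, model) %>%
  summarise(n = n()) %>%      # result is still grouped by manufacturer
  mutate(n = n / sum(n)) %>%  # share of each model within its manufacturer
  spread(model, n) %>% 
  mutate_if(is.integer, as.numeric)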

new_mpg[,-1] %>% 
  mutate(sum = rowSums(., na.rm = TRUE))
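
The grouping behaviour described above can be seen on a smaller case. This sketch uses the built-in mtcars data instead of mpg (an assumption, made only to avoid the ggplot2 dependency) and spells out the dropped grouping level with .groups = "drop_last", which is available in dplyr 1.0.0 and later.

```r
library(dplyr)

shares <- mtcars %>%
  group_by(cyl, gear) %>%
  # after summarise() the result is still grouped by cyl only
  summarise(n = n(), .groups = "drop_last") %>%
  # so sum(n) is the total per cyl, and share sums to 1 within each cyl
  mutate(share = n / sum(n)) %>%
  ungroup()

shares
```

Because the mutate() runs per cyl group, the shares add up to 1 within each value of cyl rather than across the whole table.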

R: how to total the number of NA in each col of data.frame

You could try:

colSums(is.na(df))
#  V1 V2 V3 V4 V5 
#   2  4  2  4  4

data

set.seed(42)
df <- as.data.frame(matrix(sample(c(NA,0:4), 5*20,replace=TRUE), ncol=5))

Calculate column NA's based on a grouping variable with dplyr

Another dplyr version: first group_by subject and flag, for each column, whether that subject has any NA value. Then reshape to long format, group_by column, sum the flags to get n, and divide n by the number of unique subjects to get prop.

library(dplyr)
library(tidyr)

db %>%
  group_by(subject) %>%
  summarise_all(~any(is.na(.))) %>%
  ungroup() %>%
  select(-subject) %>%
  gather() %>%
  group_by(key) %>%
  summarise(n = sum(value), 
            prop = n/n_distinct(db$subject))

#   key       n  prop
#   <chr>  <int> <dbl>
#1 item_1     2   1  
#2 item_2     1   0.5
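
The same idea can be sketched with across() and pivot_longer() instead of summarise_all() and gather(). The db below is hypothetical (the original data is not shown), with two subjects and two item columns.

```r
library(dplyr)
library(tidyr)

# Hypothetical data: two subjects, two item columns
db <- data.frame(
  subject = c(1, 1, 2, 2),
  item_1  = c(NA, 3, 4, NA),
  item_2  = c(1, NA, 2, 5)
)

na_by_item <- db %>%
  group_by(subject) %>%
  # TRUE if the subject has at least one NA in the column
  summarise(across(everything(), ~ any(is.na(.x)))) %>%
  pivot_longer(-subject, names_to = "key", values_to = "value") %>%
  group_by(key) %>%
  # one row per subject and item, so mean(value) is n divided by the number of subjects
  summarise(n = sum(value), prop = mean(value))

na_by_item
```

With this data both subjects have an NA in item_1 (n = 2, prop = 1) and only subject 1 has one in item_2 (n = 1, prop = 0.5).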

R: Calculate percentage of observations in a column that are below a certain value for panel data

Instead of count(), which requires a data.frame/tibble, use sum() on a logical vector to get the count: TRUE values are counted as 1 and FALSE as 0.

library(dplyr)
df %>%
  group_by(Product) %>%
  dplyr::summarise(CountDate = n(),
     SmallSize = sum(Size < 1000000, na.rm = TRUE), .groups = "drop") %>%
  dplyr::mutate(Percent = SmallSize/CountDate)
# A tibble: 3 × 4
  Product CountDate SmallSize Percent
  <chr>       <int>     <int>   <dbl>
1 A               6         2   0.333
2 B               6         3   0.5  
3 C               6         1   0.167

Also, we don't need to create both columns. The proportion can be calculated directly with mean():

df %>%
    group_by(Product) %>%
    dplyr::summarise(Percent = mean(Size < 1000000, na.rm = TRUE))
# A tibble: 3 × 2
  Product Percent
  <chr>     <dbl>
1 A         0.333
2 B         0.5  
3 C         0.167
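
One caveat worth checking: the two versions agree only when Size has no missing values. With na.rm = TRUE, mean() divides by the number of non-missing values, while n() counts every row. A sketch on hypothetical data with one NA for product A:

```r
library(dplyr)

# Hypothetical data: product A has one missing Size
df <- data.frame(
  Product = c("A", "A", "A", "B", "B"),
  Size    = c(500000, NA, 2000000, 100, 3000000)
)

res <- df %>%
  group_by(Product) %>%
  summarise(
    # denominator is all rows in the group, including the NA row
    share_of_rows     = sum(Size < 1000000, na.rm = TRUE) / n(),
    # denominator is only the non-missing Size values
    share_of_observed = mean(Size < 1000000, na.rm = TRUE),
    .groups = "drop"
  )

res
```

For product A the first column is 1/3 and the second is 1/2; for product B, which has no NA, both are 1/2. Which denominator is right depends on whether missing sizes should count as observations.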

R dplyr calculating group and column percentages

You can put multiple statements in summarise(), so you don't have to create the temporary objects a and b. To calculate the overall percentage, divide each group's number by the sum of the column.

library(dplyr)

test %>%
  group_by(group) %>%
  summarise(no_resp = sum(resp, na.rm = TRUE), 
            all = n_distinct(id), 
            resp_rate = round(no_resp/all*100)) %>%
  mutate(no_resp_perc = no_resp/sum(no_resp) * 100)

#  group no_resp   all resp_rate no_resp_perc
#  <chr>   <int> <int>     <dbl>        <dbl>
#1 A           2     3        67         25  
#2 B           2     3        67         25  
#3 C           1     2        50         12.5
#4 D           3     4        75         37.5
