Determine the Number of NA Values in a Column

R: how to total the number of NA in each col of data.frame

You could try:

colSums(is.na(df))
# V1 V2 V3 V4 V5
# 2 4 2 4 4

data

set.seed(42)
df <- as.data.frame(matrix(sample(c(NA,0:4), 5*20,replace=TRUE), ncol=5))
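The default algorithm behind sample() changed in R 3.6.0, so the counts above may not reproduce exactly on every R version. A minimal sketch with a small hand-built data frame makes the result easy to check by eye; toy and its columns are made-up names, not part of the original answer:

```r
# Hypothetical toy data: 'toy' and its columns are illustrative only.
toy <- data.frame(
  a = c(1, NA, 3, NA),
  b = c("x", "y", NA, "z"),
  c = c(TRUE, FALSE, TRUE, TRUE)
)

# is.na() returns a logical matrix with the same shape as the data frame;
# colSums() then counts the TRUE values in each column.
na_per_col <- colSums(is.na(toy))
print(na_per_col)
# a b c
# 2 1 0
```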

How to count number of rows with NA on each column?

We can use the vectorized colSums() on the logical matrix returned by is.na(df1):

colSums(is.na(df1))

Another option is to loop over the columns and sum the NAs in each:

sapply(df1, function(x) sum(is.na(x)))

Or with dplyr

library(dplyr)
df1 %>%
summarise(across(everything(), ~ sum(is.na(.))))
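The two base R variants give the same counts but not the same type: colSums() returns a double vector, while sum() over a logical vector returns integers. This only matters if the results are compared with identical(). A small sketch, using a made-up df1 because the original data is not shown:

```r
# Hypothetical stand-in for df1; the original data frame is not shown.
df1 <- data.frame(x = c(NA, 1, 2), y = c(NA, NA, "a"))

by_colsums <- colSums(is.na(df1))
by_sapply  <- sapply(df1, function(col) sum(is.na(col)))

# Same names and values ...
print(all.equal(by_colsums, by_sapply))
# ... but different storage types, so identical() would report FALSE.
print(typeof(by_colsums))  # "double"
print(typeof(by_sapply))   # "integer"
```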

How to count the number of rows with NA values in specific columns?

Keeping in the tidyverse world (assumed since you wanted to use n_distinct)

library(tidyverse)

##Your data
data <- tibble(ID = c(1,2,3,2,3,4),
neckpain = c('Yes',NA,'Yes',NA,'Yes',NA),
backpain = c(NA,NA,'Yes',NA,'Yes',NA),
kneepain = c(NA,NA,'Yes',NA,'Yes',NA))

## Keep the rows where all of the cherry-picked columns are missing, then count them
nrow(data %>%
rowwise() %>%
mutate(row_total = sum(is.na(neckpain),
is.na(backpain),
is.na(kneepain))) %>%
filter(row_total == 3))

[1] 3

## Or, as noted in the comments, handle all rows at once with rowSums() over columns 2 to 4
nrow(data %>%
mutate(row_total = rowSums(is.na(.[2:4]))) %>%
filter(row_total == 3))
[1] 3
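If pulling in the tidyverse is not an option, the same count can be sketched in base R. This is an alternative to the answer's code, not the author's method; cols, na_per_row, all_missing and any_missing are names chosen here for illustration:

```r
# Same values as the tibble above, as a plain data frame (no packages needed).
data <- data.frame(
  ID       = c(1, 2, 3, 2, 3, 4),
  neckpain = c("Yes", NA, "Yes", NA, "Yes", NA),
  backpain = c(NA, NA, "Yes", NA, "Yes", NA),
  kneepain = c(NA, NA, "Yes", NA, "Yes", NA)
)

cols <- c("neckpain", "backpain", "kneepain")

# Number of NAs per row within the chosen columns.
na_per_row <- rowSums(is.na(data[cols]))

# Rows where every chosen column is NA.
all_missing <- sum(na_per_row == length(cols))
# Rows where at least one chosen column is NA.
any_missing <- sum(na_per_row > 0)

print(all_missing)  # 3
print(any_missing)  # 4
```

Comparing against length(cols) rather than a hard-coded 3 keeps the filter correct if the column list changes.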

Count number of NA's in a Row in Specified Columns R

df$na_count <- rowSums(is.na(df[c('first', 'last', 'address', 'phone', 'state')])) 

df
   first m_initial     last         address    phone state customer na_count
1    Bob         L   Turner 123 Turner Lane 410-3141  Iowa     <NA>        0
2   Will         P Williams 456 Williams Rd 491-2359  <NA>        Y        1
3 Amanda         C    Jones    789 Haggerty     <NA>  <NA>        Y        2
4   Lisa      <NA>    Evans            <NA>     <NA>  <NA>        N        3
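To try this end to end, the data frame can be rebuilt from the printed output. The construction below is an assumption, since the original code that created df is not shown:

```r
# Data frame rebuilt from the printed output (assumed; the original
# construction code is not shown).
df <- data.frame(
  first     = c("Bob", "Will", "Amanda", "Lisa"),
  m_initial = c("L", "P", "C", NA),
  last      = c("Turner", "Williams", "Jones", "Evans"),
  address   = c("123 Turner Lane", "456 Williams Rd", "789 Haggerty", NA),
  phone     = c("410-3141", "491-2359", NA, NA),
  state     = c("Iowa", NA, NA, NA),
  customer  = c(NA, "Y", "Y", "N")
)

# Only these columns count toward na_count; NAs in m_initial and
# customer are ignored.
cols <- c("first", "last", "address", "phone", "state")
df$na_count <- rowSums(is.na(df[cols]))

print(df$na_count)
```

Row 1 has an NA in customer and row 4 has one in m_initial, yet neither is counted, because those columns are not in cols.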

Find out the percentage of missing values in each column in the given dataset

How about this? I think I actually found something similar on here once before, but I'm not seeing it now...

import pandas as pd

# df is assumed to be an existing DataFrame
percent_missing = df.isnull().sum() * 100 / len(df)
missing_value_df = pd.DataFrame({'column_name': df.columns,
                                 'percent_missing': percent_missing})

And if you want the missing percentages sorted, follow the above with:

missing_value_df.sort_values('percent_missing', inplace=True)

As mentioned in the comments, the percent_missing line on its own may be all you need:

percent_missing = df.isnull().sum() * 100 / len(df)
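The snippet above is pandas (Python). For readers working in R, like the rest of this page, a rough equivalent relies on the fact that the mean of a logical vector is the proportion of TRUE values. A minimal sketch on the built-in airquality data; the names percent_missing and missing_value_df are borrowed from the pandas code and are not existing R objects:

```r
# Percentage of missing values per column: the mean of TRUE/FALSE values
# is the proportion of TRUEs.
percent_missing <- colMeans(is.na(airquality)) * 100

# Optional: collect into a data frame and sort ascending, mirroring the
# pandas version.
missing_value_df <- data.frame(
  column_name     = names(percent_missing),
  percent_missing = unname(percent_missing)
)
missing_value_df <- missing_value_df[order(missing_value_df$percent_missing), ]

print(round(percent_missing, 1))
```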

Count NA in multiple columns in R

In the first case, two functions are nested inside a single pipeline step. We either need to wrap that step in {} so that %>% does not also insert the data as the first argument:

library(dplyr)
dt %>%
select(starts_with("V2QE38")) %>%
{colSums(is.na(.))}
V2QE38A V2QE38B V2QE38C V2QE38D
0 0 0 0

or pipe through each function in turn:

dt %>%
select(starts_with("V2QE38")) %>%
is.na %>%
colSums

Output:

V2QE38A V2QE38B V2QE38C V2QE38D 
0 0 0 0

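The issue is that when . appears only inside a nested call, %>% still inserts the data as the first argument. The step is therefore evaluated as colSums(., is.na(.)): colSums() sums the data itself, and is.na(.) is passed along as its second argument (na.rm) instead of being the thing that is summed.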

dt %>%
select(starts_with("V2QE38")) %>%
colSums(.)
V2QE38A V2QE38B V2QE38C V2QE38D
6 1 12 0

which is the same as the OP's output with colSums(is.na(.))
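To avoid relying on how %>% treats a nested . at all, each function can be given its own step. A sketch in base R, assuming R 4.1 or later for the native |> pipe and using a made-up dt, since the OP's data is not shown:

```r
# Hypothetical stand-in for the OP's data; the column prefix follows the
# question, the values are invented.
dt <- data.frame(
  V2QE38A = c(1, NA, 5),
  V2QE38B = c(0, 1, 0),
  other   = c(NA, NA, NA)
)

# Keep the columns whose names start with the prefix, then count NAs.
# Each step takes exactly one input, so no placeholder is needed.
na_counts <- dt[startsWith(names(dt), "V2QE38")] |>
  is.na() |>
  colSums()

print(na_counts)
```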

How to simply count number of rows with NAs - R

tl;dr: row-wise, you'll want sum(!complete.cases(DF)), or, equivalently, sum(apply(DF, 1, anyNA))

There are several ways to look at the number, proportion or position of NA values in a data frame. Most of them start from the logical matrix that is.na() returns, with TRUE for every NA and FALSE everywhere else. For the built-in dataset airquality:

is.na(airquality)

There are 44 NA values in this data set

sum(is.na(airquality))
# [1] 44

You can look at the total number of NA values per row or column:

head(rowSums(is.na(airquality)))
# [1] 0 0 0 0 2 1
colSums(is.na(airquality))
#   Ozone Solar.R    Wind    Temp   Month     Day
#      37       7       0       0       0       0

You can use anyNA() in place of is.na() as well:

# by row
head(apply(airquality, 1, anyNA))
# [1] FALSE FALSE FALSE FALSE TRUE TRUE
sum(apply(airquality, 1, anyNA))
# [1] 42

# by column
head(apply(airquality, 2, anyNA))
# Ozone Solar.R Wind Temp Month Day
# TRUE TRUE FALSE FALSE FALSE FALSE
sum(apply(airquality, 2, anyNA))
# [1] 2

complete.cases() can be used, but only row-wise:

sum(!complete.cases(airquality))
# [1] 42
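The answer mentions proportion and position but only shows counts. A minimal sketch of the other two on the same airquality data; the variable names are illustrative:

```r
incomplete <- !complete.cases(airquality)

# Proportion of rows with at least one NA.
prop_incomplete <- mean(incomplete)

# Row indices of the incomplete rows.
incomplete_rows <- which(incomplete)

# (row, col) position of every individual NA, one matrix row per NA.
na_positions <- which(is.na(airquality), arr.ind = TRUE)

print(round(prop_incomplete, 3))
print(head(incomplete_rows))
print(nrow(na_positions))
```

nrow(na_positions) should match sum(is.na(airquality)), and length(incomplete_rows) should match sum(!complete.cases(airquality)).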

