R: How to total the number of NAs in each column of a data.frame
You could try:
colSums(is.na(df))
# V1 V2 V3 V4 V5
# 2 4 2 4 4
data
set.seed(42)
df <- as.data.frame(matrix(sample(c(NA,0:4), 5*20,replace=TRUE), ncol=5))
How to count number of rows with NA on each column?
We can use the vectorized colSums on a logical matrix (is.na(df1)):
colSums(is.na(df1))
Another option is to loop over the columns with sapply and apply sum to each one:
sapply(df1, function(x) sum(is.na(x)))
Or with dplyr
library(dplyr)
df1 %>%
summarise(across(everything(), ~ sum(is.na(.))))
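As a quick check that the two base R approaches above agree, here is a minimal sketch. The data frame df1 below is a made-up example, not the asker's data, and the names na_colsums and na_sapply are only for illustration.

```r
# Hypothetical example data (not from the original question)
df1 <- data.frame(
  a = c(1, NA, 3, NA),
  b = c("x", "y", NA, "z"),
  c = c(TRUE, FALSE, TRUE, TRUE)
)

# Vectorized: column sums of the logical matrix returned by is.na()
na_colsums <- colSums(is.na(df1))

# Column by column: sum the logical vector for each column
na_sapply <- sapply(df1, function(x) sum(is.na(x)))

print(na_colsums)

# Both approaches give the same counts per column
print(all(na_colsums == na_sapply))
```

colSums returns a double vector and sapply an integer vector here, so the comparison uses == rather than identical().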
How to count the number of rows with NA values in specific columns?
Keeping in the tidyverse world (assumed since you wanted to use n_distinct)
library(tidyverse)
##Your data
data <- tibble(ID = c(1,2,3,2,3,4),
neckpain = c('Yes',NA,'Yes',NA,'Yes',NA),
backpain = c(NA,NA,'Yes',NA,'Yes',NA),
kneepain = c(NA,NA,'Yes',NA,'Yes',NA))
## Count the rows that are missing in all of the cherry-picked columns
nrow(data %>%
rowwise() %>%
mutate(row_total = sum(is.na(neckpain),
is.na(backpain),
is.na(kneepain))) %>%
filter(row_total == 3))
[1] 3
## Or, to do it across a range of columns by position, as noted in the comments
nrow(data %>%
mutate(row_total = rowSums(is.na(.[2:4]))) %>%
filter(row_total == 3))
[1] 3
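If you would rather not load the tidyverse, the same count can probably be had in base R. This is only a sketch: it rebuilds the example data as a plain data.frame, and the helper names pain_cols, na_per_row, n_all_missing and n_any_missing are my own.

```r
# Same example data as above, as a plain data frame
data <- data.frame(
  ID = c(1, 2, 3, 2, 3, 4),
  neckpain = c("Yes", NA, "Yes", NA, "Yes", NA),
  backpain = c(NA, NA, "Yes", NA, "Yes", NA),
  kneepain = c(NA, NA, "Yes", NA, "Yes", NA)
)

# Columns to check (hypothetical helper name)
pain_cols <- c("neckpain", "backpain", "kneepain")

# Number of NAs per row, restricted to the chosen columns
na_per_row <- rowSums(is.na(data[pain_cols]))

# Rows where every chosen column is NA
n_all_missing <- sum(na_per_row == length(pain_cols))

# Rows where at least one chosen column is NA
n_any_missing <- sum(na_per_row > 0)

print(n_all_missing)
print(n_any_missing)
```

Selecting the columns by name rather than by position (2:4) keeps the count correct if the column order changes.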
Count number of NA's in a Row in Specified Columns R
df$na_count <- rowSums(is.na(df[c('first', 'last', 'address', 'phone', 'state')]))
df
   first m_initial     last         address    phone state customer na_count
1    Bob         L   Turner 123 Turner Lane 410-3141  Iowa     <NA>        0
2   Will         P Williams 456 Williams Rd 491-2359  <NA>        Y        1
3 Amanda         C    Jones    789 Haggerty     <NA>  <NA>        Y        2
4   Lisa      <NA>    Evans            <NA>     <NA>  <NA>        N        3
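To try this yourself, here is a sketch that rebuilds the data frame from the printed output above. The reconstruction is an assumption (the original input was not posted), and cols is just an illustrative name.

```r
# Data frame reconstructed from the printed output (an assumption)
df <- data.frame(
  first = c("Bob", "Will", "Amanda", "Lisa"),
  m_initial = c("L", "P", "C", NA),
  last = c("Turner", "Williams", "Jones", "Evans"),
  address = c("123 Turner Lane", "456 Williams Rd", "789 Haggerty", NA),
  phone = c("410-3141", "491-2359", NA, NA),
  state = c("Iowa", NA, NA, NA),
  customer = c(NA, "Y", "Y", "N")
)

# Only these columns contribute to the count
cols <- c("first", "last", "address", "phone", "state")

# Row-wise NA count over the selected columns
df$na_count <- rowSums(is.na(df[cols]))

print(df$na_count)
```

The NAs in m_initial (row 4) and customer (row 1) are not counted, because those columns are not in the selection.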
Find out the percentage of missing values in each column in the given dataset
How about this? I think I actually found something similar on here once before, but I'm not seeing it now...
import pandas as pd

# Assumes df is an existing pandas DataFrame
percent_missing = df.isnull().sum() * 100 / len(df)
missing_value_df = pd.DataFrame({'column_name': df.columns,
                                 'percent_missing': percent_missing})
And if you want the missing percentages sorted, follow the above with:
missing_value_df.sort_values('percent_missing', inplace=True)
As mentioned in the comments, you may also be able to get by with just the first line in my code above, i.e.:
percent_missing = df.isnull().sum() * 100 / len(df)
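Since the rest of this page is about R, here is a rough R counterpart of the pandas snippet. It is a sketch using the built-in airquality dataset as a stand-in for your data, and it reuses the names percent_missing and missing_value_df only to mirror the code above.

```r
# Percentage of missing values per column:
# the mean of a logical vector is the share of TRUE values
percent_missing <- colMeans(is.na(airquality)) * 100

# Collect the result in a data frame, one row per column
missing_value_df <- data.frame(
  column_name = names(percent_missing),
  percent_missing = unname(percent_missing)
)

# Sort ascending by percentage of missing values
missing_value_df <- missing_value_df[order(missing_value_df$percent_missing), ]

print(missing_value_df)
```

As in the pandas version, the first line alone may be all you need.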
Count NA in multiple columns in R
In the first case, the right-hand side of the pipe is a nested call, colSums(is.na(.)). We either need to wrap it in {}
library(dplyr)
dt %>%
select(starts_with("V2QE38")) %>%
{colSums(is.na(.))}
V2QE38A V2QE38B V2QE38C V2QE38D
0 0 0 0
or have another %>%
dt %>%
select(starts_with("V2QE38")) %>%
is.na %>%
colSums
-output
V2QE38A V2QE38B V2QE38C V2QE38D
0 0 0 0
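The issue is that when . appears only inside a nested call, %>% still inserts the left-hand side as the first argument. So colSums(is.na(.)) is evaluated as colSums(., is.na(.)): the columns themselves are summed, and is.na(.) ends up as the second argument (na.rm) instead of being the data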
> dt %>%
select(starts_with("V2QE38")) %>%
colSums(.)
V2QE38A V2QE38B V2QE38C V2QE38D
6 1 12 0
which is the same as the OP's output with colSums(is.na(.))
How to simply count number of rows with NAs - R
tl;dr: row-wise, you'll want sum(!complete.cases(DF)), or, equivalently, sum(apply(DF, 1, anyNA))
There are a number of different ways to look at the number, proportion or position of NA values in a data frame.
Most of these start with the logical matrix that has TRUE for every NA and FALSE everywhere else. For the base dataset airquality:
is.na(airquality)
There are 44 NA values in this data set:
sum(is.na(airquality))
# [1] 44
You can look at the total number of NA values per row or column:
head(rowSums(is.na(airquality)))
# [1] 0 0 0 0 2 1
colSums(is.na(airquality))
# Ozone Solar.R    Wind    Temp   Month     Day
#    37       7       0       0       0       0
You can also use anyNA() with apply() to flag the rows or columns that contain at least one NA:
# by row
head(apply(airquality, 1, anyNA))
# [1] FALSE FALSE FALSE FALSE TRUE TRUE
sum(apply(airquality, 1, anyNA))
# [1] 42
# by column
head(apply(airquality, 2, anyNA))
# Ozone Solar.R Wind Temp Month Day
# TRUE TRUE FALSE FALSE FALSE FALSE
sum(apply(airquality, 2, anyNA))
# [1] 2
complete.cases() can be used, but only row-wise:
sum(!complete.cases(airquality))
# [1] 42
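The proportion and position of NA values mentioned at the start can be obtained along the same lines. A minimal sketch on airquality (the variable names are my own):

```r
# Logical matrix: TRUE wherever airquality has an NA
na_mat <- is.na(airquality)

# Proportion of rows with at least one NA
prop_incomplete_rows <- mean(!complete.cases(airquality))

# Proportion of NA values in each column
prop_na_by_col <- colMeans(na_mat)

# Row and column index of every NA
na_positions <- which(na_mat, arr.ind = TRUE)

print(prop_incomplete_rows)
print(prop_na_by_col)
print(head(na_positions))
```

Taking the mean of a logical vector gives the share of TRUE values, so swapping sum/colSums for mean/colMeans turns the counts above into proportions.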