Handling missing/incomplete data in R -- is there a function to mask but not remove NAs?
Exactly what to do with missing data (which may be flagged as NA if we know it is missing) may well differ from domain to domain.
To take an example related to time series, where you may want to skip, or fill, or interpolate, or interpolate differently: the very useful and popular zoo package has a whole set of functions for NA handling:
zoo::na.approx zoo::na.locf
zoo::na.spline zoo::na.trim
allowing you to approximate (using different algorithms), carry values forward or backward, use spline interpolation, or trim.
Another example would be the numerous missing-data imputation packages on CRAN -- often providing domain-specific solutions. [ So if you call R a DSL, what is this? "Sub-domain specific solutions for domain specific languages" or SDSSFDSL? Quite a mouthful :) ]
But for your specific question: no, I am not aware of a bit-level flag in base R that allows you to mark observations as 'to be excluded'. I presume most R users would resort to functions like na.omit() et al, or use the na.rm=TRUE option you mentioned.
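As a quick sketch of what those zoo functions do (z here is a made-up series with interior and trailing gaps):

```r
library(zoo)

z <- zoo(c(1, NA, NA, 4, 5, NA), order.by = 1:6)

# linear interpolation; by default (na.rm = TRUE) NAs at the ends
# that cannot be interpolated are dropped
na.approx(z)   # 1 2 3 4 5
# last observation carried forward
na.locf(z)     # 1 1 1 4 5 5
# drop leading/trailing NAs only, keeping interior ones
na.trim(z)     # 1 NA NA 4 5
```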
How to remove NAs from categorical data?
In base R we could use complete.cases to get the cases without any NA:
survey_complete <- complete.cases(survey)
survey[survey_complete,]
Output:
> survey[survey_complete,]
Death recover
1 1 0
2 1 0
3 1 0
5 0 1
8 0 1
9 1 0
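For a self-contained illustration, here is a hypothetical survey data frame, made up so that the incomplete rows are 4, 6 and 7 (matching the row names in the output above):

```r
# hypothetical data; rows 4, 6 and 7 each contain an NA
survey <- data.frame(
  Death   = c(1, 1, 1, NA, 0, 1, NA, 0, 1),
  recover = c(0, 0, 0, 1, 1, NA, 0, 1, 0)
)

survey_complete <- complete.cases(survey)  # one TRUE/FALSE per row
survey[survey_complete, ]                  # rows 1, 2, 3, 5, 8, 9 remain
```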
Removing all NAs while retaining the most data possible
Here is one way using the first algorithm that I could think of. The approach is simply to remove, at each iteration, a row or column that has at least one NA and the fewest non-NA values in the matrix (so you lose the fewest cells when removing that row/column). To do this, I build a data frame of the rows and columns with their counts of NA and non-NA values, along with dimension and index. At the moment, ties are resolved by deleting rows before columns and earlier indexes before later ones.
I am not sure that this will give the global maximum (e.g. it only takes one branch at ties), but it should do better than just deleting rows or columns. In this example we retain 210 cells by deleting rows and 74 by deleting columns, but 272 with the new approach. The code could probably also be optimised if you need to use it for much larger matrices or many more NAs.
set.seed(1)
mat <- matrix(sample(x = c(1:10, NA), size = 37 * 21, replace = TRUE), ncol = 21)
# filter rows
prod(dim(mat[apply(mat, 1, function(x) all(!is.na(x))), ]))
#> [1] 210
# filter cols
prod(dim(mat[, apply(mat, 2, function(x) all(!is.na(x)))]))
#> [1] 74
delete_row_col <- function(m) {
  # tabulate NA / non-NA counts for every row and every column
  to_delete <- rbind(
    data.frame(
      dim = "row",
      index = seq_len(nrow(m)),
      nas = rowSums(is.na(m)),
      non_nas = rowSums(!is.na(m)),
      stringsAsFactors = FALSE
    ),
    data.frame(
      dim = "col",
      index = seq_len(ncol(m)),
      nas = colSums(is.na(m)),
      non_nas = colSums(!is.na(m)),
      stringsAsFactors = FALSE
    )
  )
  # candidates: rows/cols containing at least one NA; among them,
  # pick the one(s) costing the fewest non-NA cells to delete
  to_delete <- to_delete[to_delete$nas > 0, ]
  to_delete <- to_delete[to_delete$non_nas == min(to_delete$non_nas), ]
  if (nrow(to_delete) == 0) {
    return(m)
  } else if (to_delete$dim[1] == "row") {
    m <- m[-to_delete$index[1], ]
  } else {
    m <- m[, -to_delete$index[1]]
  }
  return(m)
}
remove_matrix_na <- function(m) {
  # repeatedly delete the cheapest NA-containing row/column
  while (any(is.na(m))) {
    m <- delete_row_col(m)
  }
  return(m)
}
prod(dim(remove_matrix_na(mat)))
#> [1] 272
Created on 2019-07-06 by the reprex package (v0.3.0)
Is there a way to exclude NA while calculating outliers in a data frame but still include rows with NA in the final output?
We can add an | condition with is.na so that the NA rows are not removed:
library(dplyr)
df %>%
  filter((height < (mean(height, na.rm = TRUE) + 3 * sd(height, na.rm = TRUE))) |
           is.na(height))
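A reproducible sketch, with a made-up data frame of fifteen plausible heights, one clear outlier and one NA (df in the question is not shown):

```r
library(dplyr)

# made-up data: 15 ordinary heights, one outlier (500) and one NA
df <- data.frame(height = c(rep(c(168, 169, 170, 171, 172), 3), 500, NA))

res <- df %>%
  filter((height < (mean(height, na.rm = TRUE) + 3 * sd(height, na.rm = TRUE))) |
           is.na(height))

nrow(res)  # 16: the outlier is dropped, the NA row is kept
```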
Can't remove NAs
Obviously you have 'Na' strings that are fake NAs. Replace them with real ones, and then your code should work.
dat <- replace(dat, dat == 'Na', NA)
dat[complete.cases(dat[, c("Point", "Basal", "Short.Saps", "Tall.Saps")]), ]
# Point Basal Short.Saps Tall.Saps
# 4 DEL004 10 21 22
Data:
dat <- structure(list(Point = c("DEL001", "DEL002", "DEL003", "DEL004"
), Basal = c("Na", "Na", "Na", "10"), Short.Saps = c(2L, 1L,
0L, 21L), Tall.Saps = c(0L, 6L, 5L, 22L)), class = "data.frame", row.names = c("1",
"2", "3", "4"))
Remove trailing (last) rows with NAs in all columns
This seems to work with all test cases. The idea is to use a reversed cumsum to filter out the rows at the end of the table that are entirely NA.
library(data.table)
remove_empty_row_last_new <- function(d) {
  d[d[, is.na(rev(cumsum(rev(ifelse(rowSums(!is.na(.SD)) == 0, 1, NA)))))]]
}
d <- data.table(a=c(1,NA,3,NA,5,NA,NA),b=c(1,NA,3,4,5,NA,NA))
remove_empty_row_last_new(d)
#> a b
#> 1: 1 1
#> 2: NA NA
#> 3: 3 3
#> 4: NA 4
#> 5: 5 5
d2 <- data.table(A=c(1,NA,3,NA,5,1 ,NA),B=c(1,NA,3,4,5,NA,NA))
remove_empty_row_last_new(d2)
#> A B
#> 1: 1 1
#> 2: NA NA
#> 3: 3 3
#> 4: NA 4
#> 5: 5 5
#> 6: 1 NA
d3 <- data.table(A=c(1,NA,3,NA,5,NA,NA),B=c(1,NA,3,4,5,1,NA))
remove_empty_row_last_new(d3)
#> A B
#> 1: 1 1
#> 2: NA NA
#> 3: 3 3
#> 4: NA 4
#> 5: 5 5
#> 6: NA 1
d4 <- data.table(A=c(1,2,3,NA,5,NA,NA),B=c(1,2,3,4,5,1,7))
remove_empty_row_last_new(d4)
#> A B
#> 1: 1 1
#> 2: 2 2
#> 3: 3 3
#> 4: NA 4
#> 5: 5 5
#> 6: NA 1
#> 7: NA 7
You'll have to check performance on your real dataset, but this version seems a bit faster:
> microbenchmark::microbenchmark(remove_empty_row_last(d),remove_empty_row_last_new(d))
Unit: microseconds
expr min lq mean median uq max neval cld
remove_empty_row_last(d) 384.701 411.800 468.5251 434.251 483.7515 1004.401 100 b
remove_empty_row_last_new(d) 345.201 359.301 416.1650 382.501 450.5010 1104.401 100 a
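For readers not using data.table, the same reverse-cumsum idea can be sketched in base R (remove_trailing_all_na is a made-up name for this sketch):

```r
remove_trailing_all_na <- function(df) {
  all_na <- rowSums(!is.na(df)) == 0       # TRUE where every column is NA
  keep   <- rev(cumsum(rev(!all_na)) > 0)  # FALSE only for the trailing all-NA run
  df[keep, , drop = FALSE]
}

d <- data.frame(a = c(1, NA, 3, NA, 5, NA, NA),
                b = c(1, NA, 3, 4, 5, NA, NA))
remove_trailing_all_na(d)  # drops only the last two (all-NA) rows
```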
How to handle NAs when doing glm in R (and not remove entire rows)?
Missing information (NA) in variables is quite tricky to handle. First of all, NA values will be omitted: it is not possible to fit a regression model on NA values, so you have to handle them before fitting the glm model.
There are a lot of approaches. The easiest are omitting the rows with NAs (which glm does by default) or imputing the NA values with the most frequent value, the value from the previous row, or the median/mean (these are called single imputation methods); more complex approaches use two or more variables at once to estimate the missing values (multiple imputation methods).
The right solution always depends on the context of the data. Be aware that missing data can introduce error or bias into your results, so it is important to try to reduce this bias in the glm model.
For a bit more information, you can have a look at this post: https://www.kdnuggets.com/2017/09/missing-data-imputation-using-r.html#:~:text=In%20R%2C%20there%20are%20a%20lot%20of%20packages,and%20probably%20a%20gold%20standard%20for%20imputing%20values.
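As a minimal sketch of the simplest of those options, median (single) imputation before fitting (the data here is made up):

```r
set.seed(42)
df <- data.frame(
  y = rbinom(20, 1, 0.5),
  x = c(rnorm(17), NA, NA, NA)  # three missing predictor values
)

# glm's default na.action silently drops the 3 incomplete rows
fit_omit <- glm(y ~ x, data = df, family = binomial)

# single imputation: fill NAs with the median of the observed values
df$x[is.na(df$x)] <- median(df$x, na.rm = TRUE)
fit_imputed <- glm(y ~ x, data = df, family = binomial)

nobs(fit_omit)     # 17
nobs(fit_imputed)  # 20
```

Whether median imputation is appropriate still depends on why the values are missing; for anything beyond a quick fix, a multiple-imputation package is usually the safer choice.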
arithmetic by row with missing observations
You can just use sum in an apply call and weight each vector accordingly.
weights <- c(1, rep(-1, 3), 0, rep(-1, 2), 0, -1, 1)
apply(df.1, 1, function(x) sum(x * weights, na.rm = TRUE))
[1] 79 129 228 268 279
Although it is perhaps quicker to use colSums on the transposed matrix multiplied by these weights:
colSums(t(df.1) * weights, na.rm = TRUE)
[1] 79 129 228 268 279
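Since df.1 is not shown in the question, here is a made-up frame confirming that the two forms agree (the weights vector is the one from above):

```r
set.seed(1)
# made-up 5 x 10 frame with some NAs scattered in
df.1 <- as.data.frame(matrix(sample(c(1:50, NA), 50, replace = TRUE), nrow = 5))
weights <- c(1, rep(-1, 3), 0, rep(-1, 2), 0, -1, 1)

# row-wise weighted sum, ignoring NAs
res_apply <- apply(df.1, 1, function(x) sum(x * weights, na.rm = TRUE))
# equivalent via colSums on the transpose (weights recycle down the rows)
res_colsums <- colSums(t(df.1) * weights, na.rm = TRUE)

all.equal(unname(res_apply), unname(res_colsums))  # TRUE
```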
RandomForest in R reports missing values in object, but vector has zero NAs in it
randomForest can fail due to a few different types of issue in the data. Missing values (NA), values of NaN, Inf or -Inf, and character columns that have not been cast into factors will all cause it to fail, with a variety of error messages.
We can see below some examples of the error messages generated by each of these issues:
my.df <- data.frame(a = 1:26, b=letters, c=(1:26)+rnorm(26))
rf <- randomForest(a ~ ., data=my.df)
# this works without issues, because b=letters is cast into a factor variable by default
my.df$d <- LETTERS # Now we add a character column
rf <- randomForest(a ~ ., data=my.df)
# Error in randomForest.default(m, y, ...) :
# NA/NaN/Inf in foreign function call (arg 1)
# In addition: Warning message:
# In data.matrix(x) : NAs introduced by coercion
rf <- randomForest(d ~ ., data=my.df)
# Error in y - ymean : non-numeric argument to binary operator
# In addition: Warning message:
# In mean.default(y) : argument is not numeric or logical: returning NA
my.df$d <- c(NA, rnorm(25))
rf <- randomForest(a ~ ., data=my.df)
rf <- randomForest(d ~ ., data=my.df)
# Error in na.fail.default(list(a = 1:26, b = 1:26, c = c(3.14586293058335, :
# missing values in object
my.df$d <- c(Inf, rnorm(25))
rf <- randomForest(a ~ ., data=my.df)
rf <- randomForest(d ~ ., data=my.df)
# Error in randomForest.default(m, y, ...) :
# NA/NaN/Inf in foreign function call (arg 1)
Interestingly, the error message you received, which was caused by having a character type in the data frame (see comments), is the error that I see when there is a numeric column containing NA. This suggests either (1) that the errors differ between versions of randomForest, or (2) that the error message depends in more complex ways on the structure of the data. Either way, the advice for anyone receiving errors like these is to check the data for all of the possible issues listed above in order to track down the cause.
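Following that advice, a quick pre-flight check can save time. This is a sketch (check_rf_data is a made-up helper, not part of randomForest) that counts the problem values per column:

```r
# summarise each column for the failure modes above:
# column class, NA count, and NaN/Inf count
check_rf_data <- function(df) {
  data.frame(
    column = names(df),
    class  = vapply(df, function(x) class(x)[1], character(1)),
    n_na   = vapply(df, function(x) sum(is.na(x)), integer(1)),
    n_bad  = vapply(df, function(x)
      if (is.numeric(x)) sum(is.nan(x) | is.infinite(x)) else 0L,
      integer(1)),
    row.names = NULL,
    stringsAsFactors = FALSE
  )
}

check_rf_data(data.frame(a = 1:3, b = c("x", "y", "z"),
                         c = c(Inf, NaN, NA),
                         stringsAsFactors = FALSE))
```

Any character column, or any non-zero n_na / n_bad on a column used in the formula, is a candidate cause for the errors above.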