Handling Missing/Incomplete Data in R -- Is There a Function to Mask but Not Remove NAs?

Handling missing/incomplete data in R--is there function to mask but not remove NAs?

Exactly what to do with missing data -- which may be flagged as NA if we know it is missing -- may well differ from domain to domain.

To take an example related to time series, where you may want to skip, or fill, or interpolate, or interpolate differently, ... note that the (very useful and popular) zoo package has all these functions related to NA handling:

zoo::na.approx   zoo::na.locf
zoo::na.spline   zoo::na.trim

These allow you to approximate (using different algorithms), carry values forward or backward, interpolate with splines, or trim leading/trailing NAs.
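For instance, a minimal sketch on a toy zoo series (the values here are made up):

library(zoo)
z <- zoo(c(1, NA, NA, 4, 5, NA), order.by = 1:6)
na.approx(z)                 # linear interpolation of interior NAs
na.locf(z)                   # last observation carried forward
na.locf(z, fromLast = TRUE)  # next observation carried backward
na.spline(z)                 # cubic spline interpolation
na.trim(z)                   # drop leading/trailing NAs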

Another example would be the numerous missing-data imputation packages on CRAN -- often providing domain-specific solutions. [ So if you call R a DSL, what is this? "Sub-domain specific solutions for domain specific languages" or SDSSFDSL? Quite a mouthful :) ]

But for your specific question: no, I am not aware of a bit-level flag in base R that allows you to mark observations as 'to be excluded'. I presume most R users would resort to functions like na.omit() et al or use the na.rm=TRUE option you mentioned.
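For reference, here is how those defaults behave on a toy vector:

x <- c(1, NA, 3)
sum(x)                # NA: arithmetic propagates missing values by default
sum(x, na.rm = TRUE)  # 4: drop NAs for this one computation
na.omit(x)            # 1 3, with an na.action attribute recording what was dropped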

How to remove NAs from categorical data?

In base R we could use complete.cases to get the cases without NA:

survey_complete <- complete.cases(survey)
survey[survey_complete,]

Output:

> survey[survey_complete,]
  Death recover
1     1       0
2     1       0
3     1       0
5     0       1
8     0       1
9     1       0
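Equivalently, base R's na.omit drops the same incomplete rows in one call, assuming the same survey data frame:

na.omit(survey)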

Removing all NAs while retaining the most data possible

Here is one way, using the first algorithm I could think of. At each iteration it removes the row or column that contains at least one NA and has the fewest non-NA values in the matrix (so you lose the fewest cells when removing it). To do this, I build a data frame of the rows and columns with their counts of NA and non-NA values, along with dimension and index. At the moment, ties are resolved by deleting rows before columns and earlier indexes before later ones.

I am not sure that this will give the global maximum (e.g. it only takes one branch at ties), but it should do better than just deleting rows or columns wholesale. In this example we keep 210 cells by deleting rows and 74 by deleting columns, but 272 with the new approach. The code could also probably be optimised if you need to use it on much larger matrices or with many more NAs.

set.seed(1)
mat <- matrix(sample(x = c(1:10, NA), size = 37 * 21, replace = TRUE), ncol = 21)
# filter rows
prod(dim(mat[apply(mat, 1, function(x) all(!is.na(x))), ]))
#> [1] 210
# filter cols
prod(dim(mat[, apply(mat, 2, function(x) all(!is.na(x)))]))
#> [1] 74

delete_row_col <- function(m) {
  # Tabulate every row and every column with its NA and non-NA counts
  to_delete <- rbind(
    data.frame(
      dim = "row",
      index = seq_len(nrow(m)),
      nas = rowSums(is.na(m)),
      non_nas = rowSums(!is.na(m)),
      stringsAsFactors = FALSE
    ),
    data.frame(
      dim = "col",
      index = seq_len(ncol(m)),
      nas = colSums(is.na(m)),
      non_nas = colSums(!is.na(m)),
      stringsAsFactors = FALSE
    )
  )
  # Candidates must contain at least one NA; among those, keep the
  # rows/columns with the fewest non-NA values (fewest cells lost)
  to_delete <- to_delete[to_delete$nas > 0, ]
  to_delete <- to_delete[to_delete$non_nas == min(to_delete$non_nas), ]

  if (nrow(to_delete) == 0) {
    return(m)
  } else if (to_delete$dim[1] == "row") {
    # Ties resolve to rows before columns, earlier indexes before later
    m <- m[-to_delete$index[1], ]
  } else {
    m <- m[, -to_delete$index[1]]
  }
  return(m)
}

remove_matrix_na <- function(m) {
  # Greedily delete one row or column at a time until no NAs remain
  while (any(is.na(m))) {
    m <- delete_row_col(m)
  }
  return(m)
}

prod(dim(remove_matrix_na(mat)))
#> [1] 272

Created on 2019-07-06 by the reprex package (v0.3.0)

Is there a way to exclude NA while calculating outliers in a data frame but still include rows with NA in the final output?

We can add an | condition with is.na so that the NA rows are not removed:

library(dplyr)
df %>%
  filter((height < (mean(height, na.rm = TRUE) +
                      3 * sd(height, na.rm = TRUE))) | is.na(height))
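For a self-contained check, here is a hypothetical df (not from the question) with one extreme height and one NA; the filter drops the outlier row but keeps the NA row:

library(dplyr)
set.seed(42)
df <- data.frame(height = c(rnorm(19, 170, 5), 500, NA))
df %>%
  filter((height < (mean(height, na.rm = TRUE) +
                      3 * sd(height, na.rm = TRUE))) | is.na(height))
# the 500 row is dropped, the NA row is kept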

Can't remove NAs

You have 'Na' strings that are fake NAs. Replace them with real ones, then your code should work.

dat <- replace(dat, dat == 'Na', NA)
dat[complete.cases(dat[, c("Point", "Basal", "Short.Saps", "Tall.Saps")]), ]
#    Point Basal Short.Saps Tall.Saps
# 4 DEL004    10         21        22

Data:

dat <- structure(list(
  Point = c("DEL001", "DEL002", "DEL003", "DEL004"),
  Basal = c("Na", "Na", "Na", "10"),
  Short.Saps = c(2L, 1L, 0L, 21L),
  Tall.Saps = c(0L, 6L, 5L, 22L)
), class = "data.frame", row.names = c("1", "2", "3", "4"))

Remove trailing (last) rows with NAs in all columns

This seems to work with all test cases.

The idea is to use a reversed cumsum to filter out the NA rows at the end: rows that are entirely NA are marked with 1 and all other rows with NA, and since cumsum propagates NA once it meets one, reversing the vector means only the trailing block of all-NA rows stays numeric; is.na then selects every row up to the last non-empty one.

library(data.table)

remove_empty_row_last_new <- function(d) {
  # .SD is the whole table; mark all-NA rows with 1, others with NA
  d[d[, is.na(rev(cumsum(rev(ifelse(rowSums(!is.na(.SD)) == 0, 1, NA)))))]]
}
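To see why this keeps empty rows in the middle but drops trailing ones, trace the marker vector for a hypothetical six-row table where row 2 is empty mid-data and rows 5-6 are trailing empty rows:

x <- c(NA, 1, NA, NA, 1, 1)    # 1 = all-NA row, NA = row with data
rev(cumsum(rev(x)))            # NA NA NA NA  2  1
is.na(rev(cumsum(rev(x))))     # TRUE TRUE TRUE TRUE FALSE FALSE: keep rows 1-4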

d <- data.table(a=c(1,NA,3,NA,5,NA,NA),b=c(1,NA,3,4,5,NA,NA))
remove_empty_row_last_new(d)
#> a b
#> 1: 1 1
#> 2: NA NA
#> 3: 3 3
#> 4: NA 4
#> 5: 5 5

d2 <- data.table(A=c(1,NA,3,NA,5,1 ,NA),B=c(1,NA,3,4,5,NA,NA))
remove_empty_row_last_new(d2)
#> A B
#> 1: 1 1
#> 2: NA NA
#> 3: 3 3
#> 4: NA 4
#> 5: 5 5
#> 6: 1 NA

d3 <- data.table(A=c(1,NA,3,NA,5,NA,NA),B=c(1,NA,3,4,5,1,NA))
remove_empty_row_last_new(d3)
#> A B
#> 1: 1 1
#> 2: NA NA
#> 3: 3 3
#> 4: NA 4
#> 5: 5 5
#> 6: NA 1

d4 <- data.table(A=c(1,2,3,NA,5,NA,NA),B=c(1,2,3,4,5,1,7))
remove_empty_row_last_new(d4)
#> A B
#> 1: 1 1
#> 2: 2 2
#> 3: 3 3
#> 4: NA 4
#> 5: 5 5
#> 6: NA 1
#> 7: NA 7

You'll have to check performance on your real dataset, but it seems a bit faster:

> microbenchmark::microbenchmark(remove_empty_row_last(d), remove_empty_row_last_new(d))
Unit: microseconds
                         expr     min      lq     mean  median       uq      max neval cld
     remove_empty_row_last(d) 384.701 411.800 468.5251 434.251 483.7515 1004.401   100   b
 remove_empty_row_last_new(d) 345.201 359.301 416.1650 382.501 450.5010 1104.401   100  a

How to handle with NA's when doing glm in R (and not removing entire rows)?

Missing values (NA) in variables are quite tricky to handle. By default they will simply be omitted: it is not possible to fit a regression model using NA values, so you have to handle them before fitting the glm model.

There are a lot of approaches. The easiest are omitting the rows with NAs (what glm does by default) or imputing the NA values with the most frequent value, the value from the previous row, or the median/mean (so-called single imputation methods); more complex approaches use two or more variables at once to estimate the missing values (multiple imputation methods).

The solution will always depend on the context of the data. And be aware that missing data introduces error or bias into your results, so it is important to try to reduce this bias in the glm model.
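As a minimal sketch of the simplest option, single mean imputation before fitting (df, y and x are hypothetical names; packages such as mice provide proper multiple imputation):

# Hypothetical data with missing predictor values
df <- data.frame(y = c(0, 1, 0, 1, 1, 0),
                 x = c(1.2, NA, 3.4, 2.1, NA, 0.7))
# Single imputation: replace NAs in x with the column mean
df$x[is.na(df$x)] <- mean(df$x, na.rm = TRUE)
fit <- glm(y ~ x, family = binomial, data = df)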

For a bit more information, you can for example have a look at this post: https://www.kdnuggets.com/2017/09/missing-data-imputation-using-r.html

Arithmetic by row with missing observations

You can just use sum in an apply call and weight each vector accordingly.
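The question's df.1 is not shown, so a hypothetical 5-row, 10-column numeric data frame with scattered NAs stands in for it here (the printed results below come from the asker's data and will not match this stand-in):

set.seed(123)
df.1 <- as.data.frame(matrix(sample(c(1:100, NA), 50, replace = TRUE), nrow = 5))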

weights <- c(1, rep(-1, 3), 0, rep(-1, 2), 0, -1, 1)

apply(df.1, 1, function(x) sum(x * weights, na.rm = TRUE))
[1] 79 129 228 268 279

Although it is perhaps quicker to use colSums on the transposed matrix multiplied by these weights:

colSums(t(df.1) * weights, na.rm = TRUE)
[1] 79 129 228 268 279

RandomForest in R reports missing values in object, but vector has zero NAs in it

randomForest can fail due to a few different kinds of issues with the data: missing values (NA), values of NaN, Inf or -Inf, and character columns that have not been cast into factors will all fail, with a variety of error messages.

We can see below some examples of the error messages generated by each of these issues:

library(randomForest)

my.df <- data.frame(a = 1:26, b = letters, c = (1:26) + rnorm(26),
                    stringsAsFactors = TRUE)
rf <- randomForest(a ~ ., data = my.df)
# this works without issues, because b = letters is cast into a factor
# (the default before R 4.0; made explicit here with stringsAsFactors = TRUE)

my.df$d <- LETTERS # Now we add a character column
rf <- randomForest(a ~ ., data=my.df)
# Error in randomForest.default(m, y, ...) :
# NA/NaN/Inf in foreign function call (arg 1)
# In addition: Warning message:
# In data.matrix(x) : NAs introduced by coercion

rf <- randomForest(d ~ ., data=my.df)
# Error in y - ymean : non-numeric argument to binary operator
# In addition: Warning message:
# In mean.default(y) : argument is not numeric or logical: returning NA

my.df$d <- c(NA, rnorm(25))
rf <- randomForest(a ~ ., data=my.df)
rf <- randomForest(d ~ ., data=my.df)
# Error in na.fail.default(list(a = 1:26, b = 1:26, c = c(3.14586293058335, :
# missing values in object

my.df$d <- c(Inf, rnorm(25))
rf <- randomForest(a ~ ., data=my.df)
rf <- randomForest(d ~ ., data=my.df)
# Error in randomForest.default(m, y, ...) :
# NA/NaN/Inf in foreign function call (arg 1)

Interestingly, the error message you received, which was caused by having a character type in the data frame (see comments), is the error that I see when there is a numeric column with NA. This suggests either (1) that the errors differ between versions of randomForest, or (2) that the error message depends in more complex ways on the structure of the data. Either way, the advice for anyone receiving errors like these is to look for all of the possible issues with the data listed above in order to track down the cause.
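As a starting point, here is a quick sketch that checks each column of a data frame for the issues listed above (using my.df from the examples, or any other data frame):

sapply(my.df, function(x) {
  c(n_na        = sum(is.na(x)),                # missing values (includes NaN)
    n_nonfinite = if (is.numeric(x)) sum(is.nan(x) | is.infinite(x)) else 0,
    is_char     = is.character(x))              # uncast character column
})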


