Error in na.fail.default: missing values in object - but no missing values

tl;dr you have to apply na.exclude() (or whatever na.action you choose) to the whole data frame at once, so that the remaining observations stay matched up across variables.

set.seed(101)
tot_nochc <- runif(10, 1, 15)
cor_partner <- factor(c(1, 1, 0, 1, 0, 0, 0, 0, 1, 0))
age <- runif(10, 18, 75)
agecu <- age^3
day <- factor(c(1, 2, 2, 3, 3, NA, NA, 4, 4, 4))
## use data.frame() -- *DON'T* cbind() first
dt <- data.frame(tot_nochc, cor_partner, agecu, day)
## DON'T attach(dt) ...

Now try:

library(nlme)
corpart.lme.1 <- lme(tot_nochc ~ cor_partner + agecu + cor_partner*agecu,
                     random = ~ cor_partner + agecu + cor_partner*agecu | day,
                     data = dt,
                     na.action = na.exclude)

We get convergence errors and warnings, but I think that's because we're using a tiny made-up data set without enough information in it, not because of any inherent problem with the code.
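To see concretely what the tl;dr means, we can run na.exclude() on dt itself (a quick check using only the objects built above): rows 6 and 7, where day is NA, are dropped from every column together, so the remaining observations stay matched up across variables.

## whole-frame NA handling: any row with an NA is dropped from all columns at once
dt_complete <- na.exclude(dt)
nrow(dt_complete)  # 8 -- rows 6 and 7 (NA in day) are gone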

Random Forest Error in na.fail.default: missing values in object

Assuming train() is from caret, you can specify a function to handle NAs with the na.action argument. The default is na.fail; a very common choice is na.omit. The randomForest package also provides na.roughfix, which will "impute missing values by median/mode."

mod_rf <-
  train(left_school ~ job_title + gender + marital_status +
          age_at_enrollment + monthly_wage + educational_qualification +
          cityD + cityC + cityB + cityA +
          duration_in_program,       # equation (outcome and everything else)
        data = train_data,           # training data
        method = "ranger",           # random forest (ranger is much faster than rf)
        metric = "ROC",              # area under the curve
        trControl = control_conditions,
        tuneGrid = tune_mtry,
        na.action = na.omit)
mod_rf
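If dropping incomplete rows loses too much data, imputation is the other route. Here is a minimal sketch using na.roughfix from the randomForest package mentioned above (this assumes train_data contains only numeric and factor columns, which is what na.roughfix handles):

library(randomForest)
## median/mode imputation: numeric columns get their column median,
## factor columns get their most frequent level
train_data_fixed <- na.roughfix(train_data)

You can then pass train_data_fixed to train() and drop the na.action argument, since no NAs remain for na.fail to complain about.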

Why do I get Error in na.fail.default(list(doc.class = c(3L, 1L...missing values in object

The problem lies in this part of the code:

tdm <- as.data.frame(inspect(tdm))
weightedtdm <- as.data.frame(inspect(weightedtdm))

dim(weightedtdm) # returns rows and columns
[1] 10 10

inspect() is never the way to create a data.frame out of a tdm: it only gives you the first 10 rows and 10 columns, not all the data from the tdm.

You need to use:

tdm <- as.data.frame(as.matrix(tdm))
weightedtdm <- as.data.frame(as.matrix(weightedtdm))

dim(weightedtdm)
[1] 993 9243

Here you can see the enormous difference between the two approaches.

If you use the first (inspect-based) weightedtdm, then running weightedTDMtrain$doc.class <- merged$Class[which(merged$train_test == "train")] leaves 700 NA values in every column except doc.class. This is why train() returns the error message.

Using the second version works, and train() will start to run (slowly, because of the repeated cross-validation).

Error in na.fail.default(list(Fe = c(568L, 437L, 599L, 1016L, 670L, 1951L, : missing values in object

So here is a first pass at an answer; without peeking at a subsample of your data, this is the gist I can give.

First, we can set up some fake data and add some missing cases to illustrate the first issue.

# Some fake data
dat <- mtcars

# Now let's add some missing data
dat[sample(x = 1:nrow(dat), size = 5),
    sample(x = 1:ncol(dat), size = 5)] <- NA

Now when we look at our data, we see we have some missing values (the function counts the missing cells per column, then turns the output into a data frame for easier presentation):

as.data.frame(lapply(dat, function(x) sum(is.na(x))))

#> mpg cyl disp hp drat wt qsec vs am gear carb
#> 1 0 5 0 0 5 5 0 5 5 0 0

If we know that we can't have missing values, we can then use the complete.cases function to only keep those rows that do not have any NAs.

dat_complete <- dat[complete.cases(dat),]

This removes some records (in my case, 5):

nrow(dat) - nrow(dat_complete)
#> [1] 5

Aside: if dropping missing data is problematic (e.g. discarding these rows can bias your estimates, perhaps because the data are not missing completely at random, or because of a known instrument malfunction), there are many methods for imputation and joint estimation of the missing values.
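For instance, here is a minimal sketch of multiple imputation with the mice package (my choice of package is an assumption; any imputation method appropriate to your missingness mechanism would do):

library(mice)
## build m = 5 imputed versions of dat via chained equations,
## then extract the first completed data set
imp <- mice(dat, m = 5, printFlag = FALSE)
dat_imputed <- complete(imp, 1)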

The second problem deals with splits that do not contain all of the factor levels.

For example, if I add a factor to my data set and generate the partitions, it is important to check whether the training data capture every level of that factor.

# Add a factor column (the later letters are deliberately rare)
library(caret)
dat_complete$my_factor <- factor(sample(x = letters[1:4],
                                        size = nrow(dat_complete),
                                        prob = c(.7, .2, .15, .05),
                                        replace = TRUE))

pls_Fe <- createDataPartition(dat_complete$mpg, p = 0.7, list = FALSE)
training <- dat_complete[pls_Fe,]
testing <- dat_complete[-pls_Fe,]

When I look at my testing and training data, I see that while my testing data contain a "c", my training data do not.

table(training$my_factor)
#> a b c
#> 14 6 0
table(testing$my_factor)
#> a b c
#> 5 1 1

A model cannot reliably predict on a factor level it has never seen (generally speaking, of course; there are methods that try to handle this).
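A minimal sketch of the failure, using the split above (droplevels() strips the unused "c" level from the training factor, so the fitted model genuinely never sees it):

fit <- lm(mpg ~ my_factor, data = droplevels(training))
predict(fit, newdata = testing)
#> Error in model.frame.default(...) : factor my_factor has new levels c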

How to fix this? You can always convert the factors to numbers if that makes sense in context (this adds bias, but allows your model to work). You might need to change your split (rather than 70% training, try 60% and see if you capture some of the low-incidence levels). If the predictors aren't relevant, then remove them from consideration. Additionally, given this is PLS, you could also try one-hot encoding: this splits a factor column into separate indicator columns, one per level, each holding a 1 or 0 for whether that level is present (set fullRank = TRUE if you want the usual n-1 dummy coding instead). Do this on the full data set, but only if you know that you will always observe all the levels (i.e. the value can only take on one of those categories).

dumz <- dummyVars(~ my_factor, data = dat_complete)

dat_dummies <- cbind(dat_complete, predict(dumz, dat_complete))

dat_dummies <- dat_dummies[, names(dat_dummies) != "my_factor"]

pls_Fe <- createDataPartition(dat_dummies$mpg, p = 0.7, list = FALSE)

training <- dat_dummies[pls_Fe,]
testing <- dat_dummies[-pls_Fe,]


