How to Deal with NAs in Residuals in a Regression in R

How do I deal with NAs in residuals in a regression in R?

I just found this after googling around a bit more: calling resid() on an lm fitted with na.action = na.exclude is the way to go. With na.exclude, the returned residual vector is padded with NA at the positions of the dropped rows, so it has the same length as the original data.
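A minimal sketch of the difference between na.omit and na.exclude (toy data assumed here):

```r
# Toy data with one NA in the predictor
df <- data.frame(x = c(1, 2, NA, 4, 5),
                 y = c(2.1, 3.9, 6.2, 8.1, 9.8))

fit_omit    <- lm(y ~ x, data = df, na.action = na.omit)
fit_exclude <- lm(y ~ x, data = df, na.action = na.exclude)

length(resid(fit_omit))      # 4: the incomplete row is simply gone
length(resid(fit_exclude))   # 5: an NA is padded in at position 3
df$res <- resid(fit_exclude) # lines up with the original rows
```

Both calls fit the same coefficients; only the shape of the returned residual vector differs.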

Bind residuals to input dataset with missing values

"[<-"(df, !is.na(df$X), "res", residuals(lm(Y ~ X,data=df,na.action=na.omit)))

will do the trick.
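The "[<-" call is just the functional form of a subset-assignment; a more readable equivalent, under the same assumption that only X contains NAs, might look like this (toy data assumed):

```r
df <- data.frame(X = c(1, 2, NA, 4), Y = c(1.1, 2.2, 2.9, 4.3))  # toy data

# Readable equivalent of the "[<-" one-liner:
df$res <- NA
df$res[!is.na(df$X)] <- residuals(lm(Y ~ X, data = df, na.action = na.omit))

# Or let na.exclude do the padding for you:
df$res2 <- residuals(lm(Y ~ X, data = df, na.action = na.exclude))
all.equal(df$res, unname(df$res2))  # TRUE, including the NA position
```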

How does R handle NA values vs. deleted values in regressions?

The regression omits any NA values before fitting (i.e., it deletes every row that contains an NA in any of the predictor variables or in the outcome variable). You can check this by comparing the degrees of freedom and other statistics of the two models.

Here's a toy example:

head(mtcars)

# check the data set size (all non-missings)
dim(mtcars) # has 32 rows

# Introduce some missings
set.seed(5)
mtcars[sample(1:nrow(mtcars), 5), sample(1:ncol(mtcars), 5)] <- NA

head(mtcars)

# Create an alternative where all missings are omitted
mtcars_NA_omit <- na.omit(mtcars)

# Check the data set size again
dim(mtcars_NA_omit) # Now only has 27 rows

# Now compare some simple linear regressions
summary(lm(mpg ~ cyl + hp + am + gear, data = mtcars))
summary(lm(mpg ~ cyl + hp + am + gear, data = mtcars_NA_omit))

Comparing the two summaries, you can see that they are identical, with the one exception that the first model's output carries a note that 5 observations have been deleted due to missingness, which is exactly what we did manually in our mtcars_NA_omit example.

# First, original model

Call:
lm(formula = mpg ~ cyl + hp + am + gear, data = mtcars)

Residuals:
    Min      1Q  Median      3Q     Max
-5.0835 -1.7594 -0.2023  1.4313  5.6948

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 29.64284    7.02359   4.220 0.000352 ***
cyl         -1.04494    0.83565  -1.250 0.224275
hp          -0.03913    0.01918  -2.040 0.053525 .
am           4.02895    1.90342   2.117 0.045832 *
gear         0.31413    1.48881   0.211 0.834833
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.947 on 22 degrees of freedom
(5 observations deleted due to missingness)
Multiple R-squared: 0.7998, Adjusted R-squared: 0.7635
F-statistic: 21.98 on 4 and 22 DF, p-value: 2.023e-07

# Second model where we dropped missings manually

Call:
lm(formula = mpg ~ cyl + hp + am + gear, data = mtcars_NA_omit)

Residuals:
    Min      1Q  Median      3Q     Max
-5.0835 -1.7594 -0.2023  1.4313  5.6948

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 29.64284    7.02359   4.220 0.000352 ***
cyl         -1.04494    0.83565  -1.250 0.224275
hp          -0.03913    0.01918  -2.040 0.053525 .
am           4.02895    1.90342   2.117 0.045832 *
gear         0.31413    1.48881   0.211 0.834833
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.947 on 22 degrees of freedom
Multiple R-squared: 0.7998, Adjusted R-squared: 0.7635
F-statistic: 21.98 on 4 and 22 DF, p-value: 2.023e-07

R: Replacement has [x] rows, data has [y] - residuals from a linear model in new variable

When NAs are present, broom::augment generates a column .rownames, which maps back to row numbers in the original data frame.

So we can join by adding row numbers to the original data. Note that .rownames is of type character, so we have to convert either .rownames or the row numbers so that the types match.

Putting all that together, something like this should work. I assume you want the standardized residuals (.std.resid); if not, use .resid.

library(dplyr)
library(broom)

data.frame.use %>%
  lm(FCNA ~ Age + Gender, data = .) %>%
  augment() %>%
  select(.rownames, .std.resid) %>%
  right_join(mutate(data.frame.use, row = as.character(row_number())),
             by = c(".rownames" = "row"))
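If all you need is the residual column, na.exclude achieves the same row alignment without a join. A sketch, using a made-up stand-in for data.frame.use (the question's actual data is not shown):

```r
set.seed(1)
# Hypothetical stand-in for the question's data frame:
data.frame.use <- data.frame(FCNA   = rnorm(10),
                             Age    = c(NA, round(runif(9, 20, 60))),
                             Gender = rep(c("m", "f"), 5))

fit <- lm(FCNA ~ Age + Gender, data = data.frame.use, na.action = na.exclude)
data.frame.use$.resid     <- residuals(fit)  # NA where the row was dropped
data.frame.use$.std.resid <- rstandard(fit)  # standardized, also NA-padded
```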

Filling NA using linear regression in R

Here is an example that uses lm to predict the missing values in R.
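The answer assumes a data frame df with columns time, var2, and a var1 that contains NAs. A hypothetical reconstruction consistent with the printed output below (six missing values, and var1 = 2/3 * var2) would be:

```r
# Hypothetical stand-in for df; the original data is not shown
df <- data.frame(time = 15:26,
                 var2 = seq(30.60, by = 16.65, length.out = 12))
df$var1 <- ifelse(df$time %% 2 == 0, NA, round(df$var2 * 2 / 3, 1))
```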

library(dplyr)

# Construct linear model based on non-NA pairs
df2 <- df %>% filter(!is.na(var1))

fit <- lm(var1 ~ var2, data = df2)

# See the result
summary(fit)

# Call:
# lm(formula = var1 ~ var2, data = df2)
#
# Residuals:
#          1          2          3          4          5          6
#  8.627e-15 -2.388e-15  1.546e-16 -9.658e-15 -2.322e-15  5.587e-15
#
# Coefficients:
#              Estimate Std. Error   t value Pr(>|t|)
# (Intercept) 2.321e-14  5.619e-15 4.130e+00   0.0145 *
# var2        6.667e-01  4.411e-17 1.511e+16   <2e-16 ***
# ---
# Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
# Residual standard error: 7.246e-15 on 4 degrees of freedom
# Multiple R-squared: 1, Adjusted R-squared: 1
# F-statistic: 2.284e+32 on 1 and 4 DF, p-value: < 2.2e-16
#
# Warning message:
# In summary.lm(fit) : essentially perfect fit: summary may be unreliable

# Use fit to predict the value
df3 <- df %>%
  mutate(pred = predict(fit, .)) %>%
  # Replace NA with pred in var1
  mutate(var1 = ifelse(is.na(var1), pred, var1))

# See the result
df3 %>% as.data.frame()

# time var1 var2 pred
# 1 15 20.4 30.60 20.4
# 2 16 31.5 47.25 31.5
# 3 17 42.6 63.90 42.6
# 4 18 53.7 80.55 53.7
# 5 19 64.8 97.20 64.8
# 6 20 75.9 113.85 75.9
# 7 21 87.0 130.50 87.0
# 8 22 98.1 147.15 98.1
# 9 23 109.2 163.80 109.2
# 10 24 120.3 180.45 120.3
# 11 25 131.4 197.10 131.4
# 12 26 142.5 213.75 142.5

Calculating correlation between residuals of linear regression with NAs and independent variable in R

You can get the actual data used to fit the model from model$model, and from there the p column:

cor(residuals(model), model$model$p)

Alternatively, is.na(mydf$p1) will tell you which rows in mydf have an NA in column p1:

cor(residuals(model), mydf$p[!is.na(mydf$p1)])

In general, is.na(x) tells us whether elements in x are NA or not:

> is.na(c(1,2,NA,4,NA,6))
[1] FALSE FALSE TRUE FALSE TRUE FALSE
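Both forms extract the same values, as a self-contained sketch shows (mydf, p1, and the formula here are made-up stand-ins for the question's objects):

```r
set.seed(42)
mydf  <- data.frame(p1 = c(rnorm(8), NA, NA), y = rnorm(10))
model <- lm(y ~ p1, data = mydf)  # the default na.action drops the NA rows

r1 <- cor(residuals(model), model$model$p1)
r2 <- cor(residuals(model), mydf$p1[!is.na(mydf$p1)])
all.equal(r1, r2)  # TRUE: same rows either way
```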

