How do I deal with NAs in residuals in a regression in R?
I just found this googling around a bit deeper. The resid
function on a lm
with na.action=na.exclude
is the way to go.
Bind residuals to input dataset with missing values
"[<-"(df, !is.na(df$X), "res", residuals(lm(Y ~ X,data=df,na.action=na.omit)))
will do the trick.
how does R handle NA values vs deleted values with regressions
The regression would omit any NA values prior to doing the analysis (i.e. deleting any row that contains a missing NA
in any of the predictor variables or the outcome variable). You can check this by comparing the degrees of freedom and other statistics for both models.
Here's a toy example:
head(mtcars)
# check the data set size (all non-missings)
dim(mtcars) # has 32 rows
# Introduce some missings
set.seed(5)
mtcars[sample(1:nrow(mtcars), 5), sample(1:ncol(mtcars), 5)] <- NA
head(mtcars)
# Create an alternative where all missings are omitted
mtcars_NA_omit <- na.omit(mtcars)
# Check the data set size again
dim(mtcars_NA_omit) # Now only has 27 rows
# Now compare some simple linear regressions
summary(lm(mpg ~ cyl + hp + am + gear, data = mtcars))
summary(lm(mpg ~ cyl + hp + am + gear, data = mtcars_NA_omit))
Comparing the two summaries you can see that they are identical, with the one exception that for the first model, there's a warning message that 5 csaes have been dropped due to missingness, which is exactly what we did manually in our mtcars_NA_omit
example.
# First, original model
Call:
lm(formula = mpg ~ cyl + hp + am + gear, data = mtcars)
Residuals:
Min 1Q Median 3Q Max
-5.0835 -1.7594 -0.2023 1.4313 5.6948
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 29.64284 7.02359 4.220 0.000352 ***
cyl -1.04494 0.83565 -1.250 0.224275
hp -0.03913 0.01918 -2.040 0.053525 .
am 4.02895 1.90342 2.117 0.045832 *
gear 0.31413 1.48881 0.211 0.834833
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 2.947 on 22 degrees of freedom
(5 observations deleted due to missingness)
Multiple R-squared: 0.7998, Adjusted R-squared: 0.7635
F-statistic: 21.98 on 4 and 22 DF, p-value: 2.023e-07
# Second model where we dropped missings manually
Call:
lm(formula = mpg ~ cyl + hp + am + gear, data = mtcars_NA_omit)
Residuals:
Min 1Q Median 3Q Max
-5.0835 -1.7594 -0.2023 1.4313 5.6948
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 29.64284 7.02359 4.220 0.000352 ***
cyl -1.04494 0.83565 -1.250 0.224275
hp -0.03913 0.01918 -2.040 0.053525 .
am 4.02895 1.90342 2.117 0.045832 *
gear 0.31413 1.48881 0.211 0.834833
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 2.947 on 22 degrees of freedom
Multiple R-squared: 0.7998, Adjusted R-squared: 0.7635
F-statistic: 21.98 on 4 and 22 DF, p-value: 2.023e-07
R: Replacement has [x] rows, data has [y] - residuals from a linear model in new variable
When NA are present, the function broom::augment
generates a column .rownames
, which maps to row numbers in the original data frame.
So we can join by adding row numbers to the original data. Note that .rownames
is of type character, so we have to alter either .rownames
or the row numbers so as the types match.
Putting all that together, something like this should work. I assume you want standardized residuals .std.resid
. If not, use .resid
.
library(dplyr)
library(broom)
data.frame.use %>%
lm(FCNA ~ Age + Gender, data = .) %>%
augment() %>%
select(.rownames, .std.resid) %>%
right_join(mutate(data.frame.use, row = as.character(row_number())),
by = c(".rownames" = "row"))
Filling NA using linear regression in R
Here is an example using lm
to predict values in R.
library(dplyr)
# Construct linear model based on non-NA pairs
df2 <- df %>% filter(!is.na(var1))
fit <- lm(var1 ~ var2, data = df2)
# See the result
summary(fit)
# Call:
# lm(formula = var1 ~ var2, data = df2)
#
# Residuals:
# 1 2 3 4 5 6
# 8.627e-15 -2.388e-15 1.546e-16 -9.658e-15 -2.322e-15 5.587e-15
#
# Coefficients:
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) 2.321e-14 5.619e-15 4.130e+00 0.0145 *
# var2 6.667e-01 4.411e-17 1.511e+16 <2e-16 ***
# ---
# Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
# Residual standard error: 7.246e-15 on 4 degrees of freedom
# Multiple R-squared: 1, Adjusted R-squared: 1
# F-statistic: 2.284e+32 on 1 and 4 DF, p-value: < 2.2e-16
#
# Warning message:
# In summary.lm(fit) : essentially perfect fit: summary may be unreliable
# Use fit to predict the value
df3 <- df %>%
mutate(pred = predict(fit, .)) %>%
# Replace NA with pred in var1
mutate(var1 = ifelse(is.na(var1), pred, var1))
# See the result
df3 %>% as.data.frame()
# time var1 var2 pred
# 1 15 20.4 30.60 20.4
# 2 16 31.5 47.25 31.5
# 3 17 42.6 63.90 42.6
# 4 18 53.7 80.55 53.7
# 5 19 64.8 97.20 64.8
# 6 20 75.9 113.85 75.9
# 7 21 87.0 130.50 87.0
# 8 22 98.1 147.15 98.1
# 9 23 109.2 163.80 109.2
# 10 24 120.3 180.45 120.3
# 11 25 131.4 197.10 131.4
# 12 26 142.5 213.75 142.5
Calculating correlation between residuals of linear regression with NAs and independent variable in R
You can get the actual data used to fit the model from model$model
, and from there the p
column:
cor(residuals(model), model$model$p)
Alternatively, is.na(mydf$p1)
will tell you which rows in mydf
have an NA
in column p1
:
cor(residuals(model), mydf$p[!is.na(mydf$p1)])
In general, is.na(x)
tells us whether elements in x
are NA
or not:
> is.na(c(1,2,NA,4,NA,6))
[1] FALSE FALSE TRUE FALSE TRUE FALSE
Related Topics
Row-Wise Sort Then Concatenate Across Specific Columns of Data Frame
How to Add a Table to a Ggplot
Why am I Losing Categorical Data in My Regression Summary
R: What Are Operators Like %In% Called and How to Learn About Them
Can You Specify Different Geoms for Different Facets in a Ggplot
Ggplot: Colour Points by Groups Based on User Defined Colours
Paste All Combinations of a Vector in R
How to Align Multiple Ggplot2 Plots and Add Shadows Over All of Them
Return Index of the Smallest Value in a Vector
Shiny Renderui Selectinput Returned Null
Messy Plot When Plotting Predictions of a Polynomial Regression Using Lm() in R
What Does "Not Run" Mean in R Help Pages
Using Data.Table I and J Arguments in Functions
Adding Ellipses to a Principal Component Analysis (Pca) Plot
Buffer (Geo)Spatial Points in R with Gbuffer
Two-Way Density Plot Combined with One Way Density Plot with Selected Regions in R