predict.lm in R Fails to Recognize newdata

Predict Warning on newdata

apl$grp is a vector, but predict requires the newdata argument to be a data frame.* This data frame must contain columns with the same names as the predictor variables used to fit the model (though it can contain other columns as well). So, the following code should work:

predict(mdl, newdata = apl)

You can use predict rather than predict.lm. mdl is an object of class lm, which causes predict to "dispatch" the predict.lm method automatically.
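
As a minimal sketch of both points, here is the same idea on the built-in mtcars data rather than the asker's apl data (the names fit and nd are illustrative assumptions, not from the question):

```r
# Fit on a data frame so the formula refers to plain column names
fit <- lm(mpg ~ wt, data = mtcars)

# newdata is a data frame whose column name matches the predictor
nd <- data.frame(wt = c(2.5, 3.0, 3.5))

p_generic <- predict(fit, newdata = nd)     # generic, dispatches on class "lm"
p_method  <- predict.lm(fit, newdata = nd)  # calling the method directly

class(fit)
identical(p_generic, p_method)
```

Both calls should return the same three predictions, which is why there is normally no reason to spell out predict.lm.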


* Strictly speaking, since this is an lm model, the predict "method" that gets dispatched is predict.lm and that method requires that newdata be a data frame. predict.glm also requires a data frame. However, there are some predict methods that can take other types of arguments. For example:

  • The randomForest package has a predict method for randomForest models that can take a data frame or matrix as the newdata argument.
  • The glmnet package has a predict method for glmnet models that requires a matrix, although the argument is called newx rather than newdata in that case.

R: predict.lm() not recognizing an object

Predict expects newdata to have the same column names (to match the formula in reg.len). You're changing it to "x" in your newdata specification, which isn't part of the formula.

dat <- data.frame(y=rnorm(50),lg.std.len=sample(10:15,50,replace=TRUE))
reg.len <- lm(y ~ lg.std.len,data=dat)

newx <- seq(0.6, 1.4, 0.01)
prd.len <- predict(reg.len, newdata=data.frame(lg.std.len=newx),
interval="confidence", level=0.90, type="response")

The key part is newdata=data.frame(lg.std.len=newx)
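
If you want to check the names before calling predict, one possible sketch is to compare the predictors in the model terms with the columns of newdata. The objects needed, bad and good are hypothetical names introduced for illustration:

```r
set.seed(1)
dat <- data.frame(y = rnorm(50),
                  lg.std.len = sample(10:15, 50, replace = TRUE))
reg.len <- lm(y ~ lg.std.len, data = dat)

# Predictor names the model expects (response dropped from the terms)
needed <- all.vars(delete.response(terms(reg.len)))

bad  <- data.frame(x = seq(0.6, 1.4, 0.01))           # wrong column name
good <- data.frame(lg.std.len = seq(0.6, 1.4, 0.01))  # matches the formula

setdiff(needed, names(bad))   # names missing from 'bad'
setdiff(needed, names(good))  # character(0): nothing missing
```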

Why am I getting an error in predict.lm: variable lengths differ?

In lm, do not use $ in the formula when you also pass the data= argument.

fit1 <- lm(y ~ train$X1 + X2, data=train)  ## predict will fail
predict(fit1, newdata=test)
# Error in model.frame.default(Terms, newdata, na.action = na.action, xlev = object$xlevels) :
#   variable lengths differ (found for 'X2')

fit2 <- lm(y ~ X1 + X2, data=train) ## predict will work
predict(fit2, newdata=test)

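Reason: If you write e.g. train$X1 in the formula, that term is tied to the train object, so predict keeps using the old values even when you supply newdata=. Unless newdata happens to have the same number of rows as train, you get this error. If it does have the same number of rows, there is no error, but the predictions are silently based on the old X1 values.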


Data:

n <- 60
set.seed(42)
dat <- data.frame(X1=rnorm(n), X2=rnorm(n))
dat <- transform(dat, y=1 + X1 + rnorm(n))
train <- dat[1:20, ]
test <- dat[21:n, ]
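
The silent case is arguably worse than the error. Here is a small sketch of it, reusing the data above; nd_a and nd_b are hypothetical 20-row newdata sets that differ only in X1:

```r
set.seed(42)
n <- 60
dat <- data.frame(X1 = rnorm(n), X2 = rnorm(n))
dat <- transform(dat, y = 1 + X1 + rnorm(n))
train <- dat[1:20, ]
test  <- dat[21:n, ]

fit1 <- lm(y ~ train$X1 + X2, data = train)

# Same number of rows as train, so predict() raises no length error
nd_a <- test[1:20, ]
nd_b <- transform(nd_a, X1 = X1 + 100)  # wildly different X1 values

p_a <- predict(fit1, newdata = nd_a)
p_b <- predict(fit1, newdata = nd_b)

# If X1 from newdata were used, these would differ
identical(p_a, p_b)
```

The comparison is expected to come out TRUE, because the train$X1 term is evaluated from the train object and the X1 column of newdata is never consulted.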

Predict.lm for Multiple Regression; trouble with new.data

The error probably stems from subsetting the data inside the formula in the lm() command. It's the predict() command that actually throws the error. Let's have an example:

# Data
trees<-structure(list(Index = 1:31, DBH = c(8.3, 8.6, 8.8, 10.5, 10.7,
10.8, 11, 11, 11.1, 11.2, 11.3, 11.4, 11.4, 11.7, 12, 12.9, 12.9,
13.3, 13.7, 13.8, 14, 14.2, 14.5, 16, 16.3, 17.3, 17.5, 17.9,
18, 18, 20.6), Height = c(70L, 65L, 63L, 72L, 81L, 83L, 66L,
75L, 80L, 75L, 79L, 76L, 76L, 69L, 75L, 74L, 85L, 86L, 71L, 64L,
78L, 80L, 74L, 72L, 77L, 81L, 82L, 80L, 80L, 80L, 87L), Merch.Vol. = c(10.3,
10.3, 10.2, 16.4, 18.8, 19.7, 15.6, 18.2, 22.6, 19.9, 24.2, 21,
21.4, 21.3, 19.1, 22.2, 33.8, 27.4, 25.7, 24.9, 34.5, 31.7, 36.3,
38.3, 42.6, 55.4, 55.7, 58.3, 51.5, 51, 77)), .Names = c("Index",
"DBH", "Height", "Merch.Vol"), class = "data.frame", row.names = c("1",
"2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13",
"14", "15", "16", "17", "18", "19", "20", "21", "22", "23", "24",
"25", "26", "27", "28", "29", "30", "31"))

# This gives an error
g = c(3, 19, 5)
mbreg = lm(Merch.Vol[-g]~DBH[-g]+Height[-g], data=trees)
p2 = predict(mbreg,trees[g,2:3])

# This will work
# Notice that the object trees2 will contain the new, sampled dataset
# The model is then fitted on the dataset trees2
g = c(3, 19, 5)
trees2<-trees[-g,]
mbreg = lm(Merch.Vol~DBH+Height, data=trees2)
p2 = predict(mbreg,trees[g,2:3])

Subsetting (or sampling) the data into a new object before fitting the model using it will remove the error. You might want to change your code example to:

g = sample(2:31,3);g
trees2<-trees[-g,]
mbreg = lm(Merch.Vol~DBH+Height, data=trees2)
p2 = predict(mbreg,trees[g,2:3])
MAPE[2] = MAPE[2] + sum(abs((trees$Merch.Vol[g]-p2)/trees$Merch.Vol[g]))/3
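
Another option, sketched here as an alternative rather than as what the question used, is the subset argument of lm(), which keeps the formula free of indexing. The example assumes the built-in datasets::trees (columns Girth, Height, Volume) instead of the renamed data above:

```r
tr <- datasets::trees   # built-in data: Girth, Height, Volume
g <- c(3, 19, 5)        # rows held out for prediction

# Option 1: subset into a new object, then fit
fit_new_object <- lm(Volume ~ Girth + Height, data = tr[-g, ])

# Option 2: let lm() drop the rows via its subset argument
fit_subset_arg <- lm(Volume ~ Girth + Height, data = tr, subset = -g)

all.equal(coef(fit_new_object), coef(fit_subset_arg))

# Either fit predicts cleanly on the held-out rows
p <- predict(fit_subset_arg, newdata = tr[g, c("Girth", "Height")])
length(p)
```

In both fits the formula contains only plain column names, so predict() can find them in newdata.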

In addition, I'd suggest not using the attach command here at all. An alternative is the data argument in the call to lm(). This argument tells lm() to look up the variables mentioned in the formula in the named object (see the example above, and also ?lm in R).

You mention that after attaching the data you still can't call Merch.Vol directly. If you look closely at the column names, you'll probably notice that the correct column name is actually Merch.Vol., with an extra dot at the end. The dollar ($) operator uses partial matching: even if you don't have a column called D in your data, trees$D will return the values from the DBH column. That's why trees$Merch.Vol also works, even though the column name is not typed exactly right.
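
A tiny sketch of the partial matching, on a hypothetical two-row data frame with the trailing-dot column name:

```r
df <- data.frame(DBH = c(8.3, 8.6), Merch.Vol. = c(10.3, 10.3))
names(df)

df$D               # partial match on "DBH"
df$Merch.Vol       # partial match on "Merch.Vol."
df[["Merch.Vol"]]  # NULL: [[ matches names exactly by default

"Merch.Vol" %in% names(df)  # the exact name does not exist
```

A formula with data= looks variables up by exact name, so Merch.Vol would not be found there even though df$Merch.Vol appears to work.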

Predict() - Maybe I'm not understanding it

First, you want to use

model <- lm(Total ~ Coupon, data=df)

not model <-lm(df$Total ~ df$Coupon, data=df).

Second, by saying lm(Total ~ Coupon), you are fitting a model that uses Total as the response variable, with Coupon as the predictor. That is, your model is of the form Total = a + b*Coupon, with a and b the coefficients to be estimated. Note that the response goes on the left side of the ~, and the predictor(s) on the right.

Because of this, when you ask R to give you predicted values for the model, you have to provide a set of new predictor values, i.e. new values of Coupon, not Total.

Third, judging by your specification of newdata, it looks like you're actually after a model to fit Coupon as a function of Total, not the other way around. To do this:

model <- lm(Coupon ~ Total, data=df)
new.df <- data.frame(Total=c(79037022, 83100656, 104299800))
predict(model, new.df)
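
To see the difference between the two directions, here is a rough sketch on simulated data. The data frame below is made up purely for illustration and is not the asker's data; model_total, model_coupon and new_total are hypothetical names:

```r
set.seed(1)
# Simulated stand-in data, NOT the data from the question
df <- data.frame(Coupon = runif(30, 1, 10))
df$Total <- 5e6 + 1e7 * df$Coupon + rnorm(30, sd = 1e6)

model_total  <- lm(Total ~ Coupon, data = df)  # needs new Coupon values
model_coupon <- lm(Coupon ~ Total, data = df)  # needs new Total values

new_total <- data.frame(Total = c(79037022, 83100656, 104299800))

# Works: newdata holds the predictor of this model
p <- predict(model_coupon, new_total)

# Fails: this model's predictor (Coupon) is missing from newdata
res <- tryCatch(predict(model_total, new_total),
                error = function(e) conditionMessage(e))
res
```

The second call should stop with a message that Coupon cannot be found, because newdata only carries the response of that model.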

Getting Warning: 'newdata' had 1 row but variables found have 32 rows on predict.lm

The problem is a mismatch between the variable names used when fitting the model and the names in your newdata. It has nothing to do with using vectors versus data frames.

When you fit a model with the lm function and then use predict to make predictions, predict looks for the same variable names in your newdata. In your first case the name x does not match mtcars$wt, and hence you get the warning.

Here is an illustration of what I mean.

This is what you did when you didn't get a warning:

a <- mtcars$mpg
x <- mtcars$wt

#here you use x as a name
fitCar <- lm(a ~ x)
#here you use x again as a name in newdata.
predict(fitCar, data.frame(x = mean(x)), interval = "confidence")

       fit      lwr      upr
1 20.09062 18.99098 21.19027

See that in this case you fit your model using the name x and also predict using the name x in your newdata. This way you get no warnings and it is what you expect.

Let's see what happens when I change the name to something else when I fit the model:

a <- mtcars$mpg
#name it b this time
b <- mtcars$wt

fitCar <- lm(a ~ b)
#here I am using name x as previously
predict(fitCar, data.frame(x = mean(x)), interval = "confidence")

        fit       lwr      upr
1 23.282611 21.988668 24.57655
2 21.919770 20.752751 23.08679
3 24.885952 23.383008 26.38890
4 20.102650 19.003004 21.20230
5 18.900144 17.771469 20.02882
Warning message:
'newdata' had 1 row but variables found have 32 rows

The only thing I changed was the name used when fitting the model, from x to b, while still predicting with the name x in newdata. As you can see, I got the same warning as in your question.
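
One way to sidestep the naming issue altogether, sketched here as a suggestion rather than as what the question did, is to fit with data= and plain column names. The second half captures the warning from the mismatched case so it can be inspected (nd, ci, fit_b and w are hypothetical names):

```r
# Fit with column names taken from the data frame
fitCar <- lm(mpg ~ wt, data = mtcars)
nd <- data.frame(wt = mean(mtcars$wt))
ci <- predict(fitCar, newdata = nd, interval = "confidence")
ci
nrow(ci)  # one row, as requested

# Mismatched names: the model knows 'b', newdata only has 'x'
a <- mtcars$mpg
b <- mtcars$wt
fit_b <- lm(a ~ b)
w <- tryCatch(predict(fit_b, data.frame(x = 3)),
              warning = function(w) conditionMessage(w))
w
```

With the matching name the result has a single row. With the mismatched name, predict falls back to the 32 values of b it finds in the workspace and warns about the row count.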

Hope this is clear now!

predict() in lmer not recognizing id variable

You haven't specified a value for the random effects in your prediction data frame. To get population-level predictions (ignoring the random effects), use

predict(m4, newdata=newdata, re.form=NA)

You're getting a weird error message because you have a package loaded (probably dplyr) which has an id() function defined: if you didn't, you would get the more interpretable error message

Error in eval(predvars, data, env) : object 'id' not found


