multinomial logistic regression in R: multinom in nnet package result different from mlogit in mlogit package?
The difference here is the difference between the Wald $z$ test (what you calculated in `pv`) and the likelihood ratio (LR) test (what is returned by `summary(mlogit.model)`). The Wald test is computationally simpler, but in general has less desirable properties (e.g., its confidence intervals are not scale-invariant). To perform LR tests on your `nnet` model coefficients, you can load the `car` and `lmtest` packages and call `Anova(test)` (though you'll have to do a little more work for the single-df tests).
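As a hedged sketch of that workflow (the iris data and model below are purely illustrative, not from the question):

```r
# Sketch: likelihood-ratio tests for a nnet::multinom fit via car::Anova
# (iris stands in for your data; substitute your own model)
library(nnet)
library(car)

fit <- multinom(Species ~ Sepal.Length + Sepal.Width, data = iris, trace = FALSE)
Anova(fit)  # Type II likelihood-ratio chi-square test for each predictor
```

Each row of the `Anova` output drops one predictor and compares the log-likelihoods of the full and reduced models.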
Multiple imputation and mlogit for a multinomial regression
Answer
You are using `with.mids` incorrectly, and thus both lines of code are wrong; the `multinom` line just doesn't give an error. If you want to apply multiple functions to the imputed datasets, you're better off using something like `lapply`:
analyses <- lapply(seq_len(df4$m), function(i) {
  data.i <- complete(df4, i)
  data.idx <- dfidx(data = data.i, choice = "vax", shape = "wide")
  mlogit(vax ~ 1 | age + var1 + var2,
         data = data.idx,
         reflevel = "1",
         nests = list(type1 = c("1", "2"), type2 = c("3", "4"), type3 = c("5", "6")))
})
test <- list(call = "", call1 = df4$call, nmis = df4$nmis, analyses = analyses)
oldClass(test) <- c("mira", "matrix")
summary(pool(test))
How with.mids works
When you apply `with` to a `mids` object (i.e., the output of `mice::mice`), you are actually calling `with.mids`. If you use `getAnywhere(with.mids)` (or just type `mice:::with.mids`), you'll find that it does a couple of things:
- It loops over all imputed datasets.
- It uses `complete` to get one dataset.
- It runs the expression with the dataset as the environment.
The third step is the problem. For functions that use formulas (like `lm`, `glm`, and `multinom`), the formula is evaluated within a given environment. If the variables are not in the current environment (but rather in, e.g., a data frame), you can specify where to look for them by setting the `data` argument.
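A minimal sketch of that distinction (the variable names here are illustrative, not from the question):

```r
# A formula can find its variables in the calling environment...
x <- rnorm(20)
y <- 2 * x + rnorm(20)
fit1 <- lm(y ~ x)            # no data argument: variables come from the environment

# ...or in a data frame supplied via the data argument
d <- data.frame(a = x, b = y)
fit2 <- lm(b ~ a, data = d)  # variables come from d, not the environment
```

Both calls fit the same model here, which is exactly why the mistake in the question goes unnoticed: supplying `data = df` silently overrides the imputed dataset that `with.mids` placed in the environment.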
The problems
This is where both your problems derive from:
1. In your `multinom` call, you set the `data` argument to `df`. Hence, you are actually running your `multinom` on the original `df`, NOT the imputed dataset!
2. In your `dfidx` call, you are again filling in `data` directly. This is also wrong. However, leaving it empty also gives an error, because `with.mids` doesn't fill in the `data` argument, but only the environment. That isn't sufficient for you.
Fixing multinom
The solution for your `multinom` line is simple: just don't specify `data`:
multinomtest <- with(df4, multinom(vax ~ age + var1 + var2, model = TRUE))
summary(pool(multinomtest))
As you will see, this yields very different results! But it is important to realise that these are the results you are actually after.
Fixing dfidx (and mlogit)
We cannot do this with `with.mids`, since it uses the imputed dataset as the environment, but you want to use the modified dataset (after `dfidx`) as your environment. So, we have to write our own code. You could do this with any looping function, e.g. `lapply`:
analyses <- lapply(seq_len(df4$m), function(i) {
  data.i <- complete(df4, i)
  data.idx <- dfidx(data = data.i, choice = "vax", shape = "wide")
  mlogit(vax ~ 1 | age + var1 + var2, data = data.idx, reflevel = "1",
         nests = list(type1 = c("1", "2"), type2 = c("3", "4"), type3 = c("5", "6")))
})
From there, all we have to do is construct something that looks like a `mira` object, so that we can still use `pool`:
test <- list(call = "", call1 = df4$call, nmis = df4$nmis, analyses = analyses)
oldClass(test) <- c("mira", "matrix")
summary(pool(test))
R: how to format my data for multinomial logit?
Your data doesn't lend itself well to estimation with an MNL model unless we make more assumptions. In general, since all your variables are individual-specific and do not vary across alternatives (types), the model cannot be identified. All of your individual-specific characteristics will drop out unless we treat them as alternative-specific. By the sound of it, each professional program carries meaning in and of itself. In that case, we could estimate the MNL model using constants only, where the constant captures everything about the program that makes an individual choose it.
library(mlogit)
df <- data.frame(ID = seq(1, 10),
                 type = c(2, 3, 4, 2, 1, 1, 4, 1, 3, 2),
                 age = c(28, 31, 12, 1, 49, 80, 36, 53, 22, 10),
                 dum1 = c(1, 0, 0, 0, 0, 1, 0, 1, 1, 0),
                 dum2 = c(1, 0, 1, 1, 0, 0, 1, 0, 1, 0))
Now, just to be on the safe side, I create dummy variables for each of the programs: `type_1` refers to program 1, `type_2` to program 2, etc.
qqch <- data.frame(ID = rep(df$ID, each = 4),
                   choice = rep(1:4, 10))
# merge both data frames
df2 <- dplyr::left_join(qqch, df, by = "ID")
# recode type to 1 if the row's alternative was chosen, 0 otherwise
df2$type <- ifelse(df2$type == df2$choice, 1, 0)
# Add alternative specific variables (here only constants)
df2$type_1 <- ifelse(df2$choice == 1, 1, 0)
df2$type_2 <- ifelse(df2$choice == 2, 1, 0)
df2$type_3 <- ifelse(df2$choice == 3, 1, 0)
df2$type_4 <- ifelse(df2$choice == 4, 1, 0)
# format for mlogit
df3 <- mlogit.data(df2, choice = "type", shape = "long", alt.var = "choice")
head(df3)
Now we can run the model. I include the dummies for each of the alternatives, keeping alternative 4 as my reference level. Only J-1 constants are identified, where J is the number of alternatives. In the second half of the formula (after the pipe operator), I make sure that I remove all alternative-specific constants that the model would have created, and I add your individual-specific variables, treating them as alternative-specific. Note that this only makes sense if your alternatives (programs) carry meaning and are not generic.
model <- mlogit(type ~ type_1 + type_2 + type_3 | -1 + age + dum1 + dum2,
                reflevel = 4, data = df3)
summary(model)
multinomial mixed logit model mlogit r-package
The `rpar` argument accepts only alternative-specific variables. There is no need to specify the person-specific id in the model formula; this is handled by including `id.var = something` in the `mlogit.data` command. For example, if you had an alternative-specific covariate `acov`, you could allow random slopes for `acov` across a panel:
library(mlogit)
N <- 200
dat <- data.frame(personID = as.factor(sample(1:4, N, replace = TRUE)),
                  decision = as.factor(sample(c("Q", "U", "other"), N, replace = TRUE)),
                  syllable = as.factor(sample(1:4, N, replace = TRUE)),
                  acov.Q = rnorm(N), acov.U = rnorm(N), acov.other = rnorm(N))
dataMod <- mlogit.data(dat, shape = "wide", choice = "decision",
                       id.var = "personID", varying = 4:6)
mlogit(decision ~ acov | syllable, rpar = c(acov = "n"), panel = TRUE, data = dataMod)
It seems you are trying to fit a model with a random, person-specific intercept for each alternative (not random slopes). Unfortunately, I don't think you can do so in `mlogit` (but see this post). One option that would work to fit random intercepts in the absence of alternative-specific covariates is `MCMCglmm`.
library(MCMCglmm)
priors <- list(R = list(fix = 1, V = 0.5 * diag(2), n = 2),
               G = list(G1 = list(V = diag(2), n = 2)))
m <- MCMCglmm(decision ~ -1 + trait + syllable,
              random = ~ idh(trait):personID,
              rcov = ~ us(trait):units,
              prior = priors,
              nitt = 30000, thin = 20, burnin = 10000,
              family = "categorical",
              data = dat)
Relevant issues are prior selection, convergence of the Markov chains, etc. Florian Jaeger's lab's blog has a short tutorial on multinomial models via `MCMCglmm` that you might find helpful, in addition to the `MCMCglmm` documentation.
Huge difference in result of vglm() and multinomial() for mlogit
The gap is due to two factors: (1) the `multinomial()` family in `VGAM` chooses the reference to be the last level of the response factor by default, while `multinom()` in `nnet` always uses the first level as the reference; (2) the species categories in the iris data can be separated linearly, which leads to very large coefficients and huge standard errors. Where exactly the numeric optimization stops once the log-likelihood virtually doesn't change further can differ somewhat across implementations, but that is practically irrelevant.
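To see the separation effect itself, here is a small sketch on the iris data (the model below is illustrative; any linearly separating predictors would do):

```r
# Sketch: the iris species are (near-)linearly separable in the petal
# measurements, which inflates the multinom coefficients and standard errors
library(nnet)
sep.fit <- multinom(Species ~ Petal.Length + Petal.Width, data = iris,
                    trace = FALSE)
summary(sep.fit)$standard.errors  # note the inflated standard errors
```
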
As an example without separation, consider a school-choice regression model based on data from the German Socio-Economic Panel (1994-2002) in the `AER` package:
data("GSOEP9402", package = "AER")
library("nnet")
library("VGAM")
m1 <- multinom(school ~ meducation + memployment + log(income) + log(size),
               data = GSOEP9402)
m2 <- vglm(school ~ meducation + memployment + log(income) + log(size),
           data = GSOEP9402, family = multinomial(refLevel = 1))
Then, both models lead to essentially the same coefficients:
coef(m1)
## (Intercept) meducation memploymentparttime memploymentnone
## Realschule -6.366449 0.3232377 0.4422277 0.7322972
## Gymnasium -22.476933 0.6664295 0.8964440 1.0581122
## log(income) log(size)
## Realschule 0.3877988 -1.297537
## Gymnasium 1.5347946 -1.757441
coef(m2, matrix = TRUE)
## log(mu[,2]/mu[,1]) log(mu[,3]/mu[,1])
## (Intercept) -6.3666257 -22.4778081
## meducation 0.3232500 0.6664550
## memploymentparttime 0.4422720 0.8964986
## memploymentnone 0.7323156 1.0581625
## log(income) 0.3877985 1.5348495
## log(size) -1.2975203 -1.7574912