The formula-data interface is a critical advantage of the R software. It provides a practical way to describe the model to be estimated and to store data. However, the usual interface is not flexible enough to deal correctly with random utility models. Therefore, mlogit provides tools to construct richer data.frames and formulas.
mlogit is loaded using library("mlogit").
It comes with several data sets that we’ll use to illustrate the features of the library. Data sets used for multinomial logit estimation concern individuals who make one choice, or a sequence of choices, of one alternative among a set of mutually exclusive alternatives. The determinants of these choices are covariates that can depend on both the alternative and the choice situation, only on the alternative, or only on the choice situation.
To illustrate this typology of the covariates, consider the case of repeated choice of destinations for vacations by families:
The unit of observation is therefore the choice situation, and it is also the individual if there is only one choice situation per individual observed, which is often the case.
Such data therefore have a specific structure that can be characterized by three indexes: the alternative, the choice situation and the individual. These three indexes will be denoted alt, chid and id. Note that the distinction between chid and id is only relevant if we have repeated observations for the same individual.
Data sets can have two different shapes: a wide shape (one row for each choice situation) or a long shape (one row for each alternative and, therefore, as many rows as there are alternatives for each choice situation).
mlogit deals with both formats. It provides a mlogit.data function that takes a data.frame as its first argument and returns a data.frame in “long” format with some supplementary information about the structure of the data.
Train[1] is an example of a wide data set:
## id choiceid choice price_A time_A change_A comfort_A price_B
## 1 1 1 A 2400 150 0 1 4000
## 2 1 2 A 2400 150 0 1 3200
## 3 1 3 A 2400 115 0 1 4000
## time_B change_B comfort_B
## 1 150 0 1
## 2 130 0 1
## 3 115 0 0
This data set contains data about a stated preference survey in the Netherlands. Each individual has responded to several (up to 16) scenarios. For every scenario, two train trips are proposed to the user, with different combinations of four attributes: price (the price in cents of guilders), time (travel time in minutes), change (the number of changes) and comfort (the class of comfort, 0, 1 or 2, 0 being the most comfortable class).
This “wide” format is suitable for storing choice situation (or individual) specific variables because, in this case, they are stored only once in the data. However, it is cumbersome for alternative specific variables because there are as many columns for each such variable as there are alternatives.
For such a wide data set, the shape argument of mlogit.data is mandatory, as its default value is "long". The alternative specific variables are indicated with the varying argument, which is a numeric vector that indicates their positions in the data frame. This argument is then passed to stats::reshape, which coerces the original data.frame to “long” format. Further arguments may be passed to reshape. For example, as the names of the variables are of the form price_A, one must add sep = "_" (the default value being "."). The choice argument is also mandatory because the response has to be transformed into a logical value in the long format. To take the panel dimension into account, one has to add an argument id.var, which is the name of the individual index.
Tr <- mlogit.data(Train, shape = "wide", choice = "choice",
varying = 4:11, sep = "_", id.var = "id",
opposite = c("price", "comfort", "time", "change"))
Note the use of the opposite argument for the 4 covariates: we expect negative coefficients for all of them, so taking the opposite of the covariates will lead to expected positive coefficients.
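The reshaping step that mlogit.data delegates to stats::reshape can be sketched in base R on the three rows printed above. This is only an illustration under stated assumptions: the small train_wide data frame is typed in by hand from that output, timevar = "alt" and idvar = "choiceid" are choices made here (the first is needed because reshape's default time variable name, time, would collide with the time covariate), and the last two steps mimic, rather than reproduce, what the choice and opposite arguments do.

```r
# First three rows of Train, typed in from the printed output (illustrative only)
train_wide <- data.frame(
  id = c(1, 1, 1), choiceid = 1:3, choice = c("A", "A", "A"),
  price_A = c(2400, 2400, 2400), time_A = c(150, 150, 115),
  change_A = c(0, 0, 0), comfort_A = c(1, 1, 1),
  price_B = c(4000, 3200, 4000), time_B = c(150, 130, 115),
  change_B = c(0, 0, 0), comfort_B = c(1, 1, 0),
  stringsAsFactors = FALSE
)

# Columns 4 to 11 hold the alternative specific variables; their names are
# split on "_" into a variable name (price, ...) and an alternative (A or B)
train_long <- reshape(train_wide, direction = "long", varying = 4:11,
                      sep = "_", timevar = "alt", idvar = "choiceid")

# One row per alternative within each choice situation
train_long <- train_long[order(train_long$choiceid, train_long$alt), ]

# The response becomes a logical: TRUE on the row of the chosen alternative
train_long$choice <- as.character(train_long$choice) == train_long$alt

# Mimic the opposite argument by changing the sign of the four covariates
covs <- c("price", "time", "change", "comfort")
train_long[covs] <- -train_long[covs]

head(train_long, 3)
```

The result has two rows per choice situation, which is the layout mlogit.data returns, but without the index information that mlogit.data adds.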
We next convert price and time to more meaningful units, euros and hours respectively (the guilder-euro conversion factor used is 2.20371):
## id choiceid choice alt price time change comfort chid
## 1.A 1 1 TRUE A -52.88904 -2.5 0 -1 1
## 1.B 1 1 FALSE B -88.14840 -2.5 0 -1 1
## 2.A 1 2 TRUE A -52.88904 -2.5 0 -1 2
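The conversion behind these values can be sketched as follows. The sketch assumes that price (in cents of guilders) is divided by 100 and multiplied by 2.20371, and that time (in minutes) is divided by 60, since this reproduces the printed rows. The tr_head data frame is a hypothetical stand-in holding the sign-reversed values of the first three rows; with the real data, the same two assignments would be applied to Tr$price and Tr$time.

```r
# Stand-in for the first three rows of Tr before conversion
# (values already sign-reversed by the opposite argument)
tr_head <- data.frame(price = c(-2400, -4000, -2400),
                      time = c(-150, -150, -150),
                      row.names = c("1.A", "1.B", "2.A"))

# Cents of guilders to euros, using the factor quoted in the text (assumed
# to be applied by multiplication, which matches the printed output)
tr_head$price <- tr_head$price / 100 * 2.20371

# Minutes to hours
tr_head$time <- tr_head$time / 60

tr_head
```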
An index attribute is added to the data, which contains the three relevant indexes: chid is the choice situation index, alt the alternative index and id the individual index. This attribute is a data.frame that can be extracted using the index function.
## chid alt id
## 1.A 1 A 1
## 1.B 1 B 1
## 2.A 2 A 1
ModeCanada[2] is an example of a data set in long format. It presents the choice of individuals for a transport mode for the Ontario-Quebec corridor:
## case alt choice dist cost ivt ovt freq income urban noalt
## 1 1 train 0 83 28.25 50 66 4 45 0 2
## 2 1 car 1 83 15.77 61 0 0 45 0 2
## 3 2 train 0 83 28.25 50 66 4 25 0 2
## 4 2 car 1 83 15.77 61 0 0 25 0 2
## 5 3 train 0 83 28.25 50 66 4 70 0 2
## 6 3 car 1 83 15.77 61 0 0 70 0 2
There are four transport modes (air, train, bus and car) and most of the variables are alternative specific (cost for monetary cost, ivt for in-vehicle time, ovt for out-of-vehicle time, freq for frequency). The only choice situation specific variables are dist (the distance of the trip), income (household income), urban (a dummy for trips which have a large city at the origin or the destination) and noalt, the number of available alternatives. The advantage of this shape is that there are far fewer columns than in the wide format, the caveat being that the values of dist, income and urban are repeated once per available alternative (up to four times).
For data in “long” format, the shape and the choice arguments are no longer mandatory.
To replicate published results later in the text, we’ll use only a subset of the choice situations, namely those for which the 4 alternatives are available. This can be done using the subset function with the subset argument set to noalt == 4 while estimating the model. This can also be done within mlogit.data, using the subset argument.
The information about the structure of the data can be explicitly indicated using choice situation and alternative indexes (respectively case and alt in this data set) or, in part, guessed by the mlogit.data function. Here, after subsetting, we have 2779 choice situations with 4 alternatives, and the rows are ordered first by choice situation and then by alternative (train, air, bus and car in this order).
The first way to read this data frame correctly is to ignore the two index variables completely. In this case, the only supplementary argument to provide is alt.levels, a character vector that contains the names of the alternatives in their order of appearance (here c("train", "air", "bus", "car")).
Note that this can only be used if the data set is “balanced”, which means that the same set of alternatives is available for all choice situations. It is also possible to provide an argument alt.var, which indicates the name of the variable that contains the alternatives.
The name of the variable that contains the information about the choice situations can be indicated using the chid.var argument:
MC <- mlogit.data(ModeCanada, subset = noalt == 4, chid.var = "case",
alt.levels = c("train", "air", "bus", "car"))
Both the alternative and the choice situation variables can also be provided, and dropped from the data frame using the drop.index argument:
MC <- mlogit.data(ModeCanada, subset = noalt == 4, chid.var = "case",
alt.var = "alt", drop.index = TRUE)
head(MC)
## choice dist cost ivt ovt freq income urban noalt
## 109.train 0 377 58.25 215 74 4 45 0 4
## 109.air 1 377 142.80 56 85 9 45 0 4
## 109.bus 0 377 27.52 301 63 8 45 0 4
## 109.car 0 377 71.63 262 0 0 45 0 4
## 110.train 0 377 58.25 215 74 4 70 0 4
## 110.air 1 377 142.80 56 85 9 70 0 4
Standard formulas are not very practical to describe random utility models, as these models may use different sets of covariates. Actually, working with random utility models, one has to consider at most four sets of covariates:
1. alternative and choice situation specific covariates $x_{ij}$ with a generic coefficient $\beta$, and alternative specific covariates $t_j$ with a generic coefficient $\nu$,
2. choice situation specific covariates $z_i$ with alternative specific coefficients $\gamma_j$,
3. alternative and choice situation specific covariates $w_{ij}$ with alternative specific coefficients $\delta_j$,
4. choice situation specific covariates $v_i$ that influence the variance of the errors.
The first three sets of covariates enter the observable part of the utility, which can be written, for alternative j:
$$V_{ij} = \alpha_j + \beta x_{ij} + \nu t_j + \gamma_j z_i + \delta_j w_{ij}.$$
As the absolute value of utility is irrelevant, only utility differences are useful to model the choice of one alternative. For two alternatives j and k, we obtain:
$$V_{ij} - V_{ik} = (\alpha_j - \alpha_k) + \beta (x_{ij} - x_{ik}) + (\gamma_j - \gamma_k) z_i + (\delta_j w_{ij} - \delta_k w_{ik}) + \nu (t_j - t_k).$$
It is clear from the previous expression that coefficients of choice situation specific variables (the intercept being one of those) should be alternative specific, otherwise they would disappear in the differentiation. Moreover, only differences of these coefficients are relevant and can be identified. For example, with three alternatives 1, 2 and 3, the three coefficients $\gamma_1$, $\gamma_2$, $\gamma_3$ associated with a choice situation specific variable cannot be identified, but only two linear combinations of them. Therefore, one has to choose a normalization, and the simplest one is just to set $\gamma_1 = 0$.
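The identification argument can be checked numerically. The sketch below assumes the standard multinomial logit probabilities, exp(V_ij) divided by the sum of exp(V_ik) over all alternatives k, and uses made-up values for the covariate and the three coefficients; logit_prob is a hypothetical helper, not a function of mlogit. Shifting all three coefficients by the same constant, here so that the first one equals 0, leaves the choice probabilities unchanged.

```r
# Multinomial logit probabilities for a vector of utilities (hypothetical helper)
logit_prob <- function(v) exp(v) / sum(exp(v))

z <- 2                        # made-up choice situation specific covariate
gamma <- c(0.5, -0.2, 1.1)    # made-up alternative specific coefficients

p_raw <- logit_prob(gamma * z)
p_norm <- logit_prob((gamma - gamma[1]) * z)  # normalization: first coefficient set to 0

# Only the differences with respect to the first coefficient matter,
# so both sets of probabilities agree
all.equal(p_raw, p_norm)
```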
Coefficients for alternative and choice situation specific variables may (or may not) be alternative specific. For example, transport time is alternative specific, but 10 minutes in public transport may not have the same impact on utility as 10 minutes in a car. In this case, alternative specific coefficients are relevant. Monetary cost is also alternative specific, but here one can consider that one dollar is one dollar, whether it is spent on the use of a car or on public transport. In this case, a generic coefficient is relevant.
The treatment of alternative specific variables doesn’t differ much from that of the alternative and choice situation specific variables with a generic coefficient. However, if some of these variables are introduced, the ν parameter can only be estimated in a model without intercepts, to avoid perfect multicollinearity.
Individual-related heteroscedasticity (see Swait and Louviere 1993) can be addressed by writing the utility of choosing j for individual i as $U_{ij} = V_{ij} + \sigma_i \epsilon_{ij}$, where $\epsilon$ has a variance that doesn’t depend on i and j and $\sigma_i^2 = f(v_i)$ is a parametric function of some individual-specific covariates. Note that this specification induces choice situation heteroscedasticity, also denoted scale heterogeneity.[3] As the overall scale of utility is irrelevant, the utility can also be written as $U_{ij}^* = U_{ij} / \sigma_i = V_{ij} / \sigma_i + \epsilon_{ij}$, i.e., with homoscedastic errors. If $V_{ij}$ is a linear combination of covariates, the associated coefficients are then divided by $\sigma_i$.
A logit model with only choice situation specific variables is sometimes called a multinomial logit model, one with only alternative specific variables a conditional logit model, and one with both kinds of variables a mixed logit model. This is seriously misleading: conditional logit model is also the name of a logit model for longitudinal data in the statistical literature, and mixed logit is one of the names of a logit model with random parameters. Therefore, in what follows, we’ll use the name multinomial logit model for the model we’ve just described, whatever the nature of the explanatory variables used.
The mlogit package provides objects of class mFormula which are built upon Formula objects provided by the Formula package.[4] The Formula package provides richer formulas, which accept multiple responses (a feature not used here) and multiple sets of covariates. In particular, it has specific model.frame and model.matrix methods which can be used with one or several sets of covariates.
To illustrate the use of mFormula objects, we use again the ModeCanada data set and consider three sets of covariates that will be indicated in a three-part formula, which refers to the first three items of the four-point list at the start of this section.
- cost (monetary cost) is an alternative specific covariate with a generic coefficient (part 1),
- income and urban are choice situation specific covariates (part 2),
- ivt (in-vehicle travel time) is alternative specific and alternative specific coefficients are expected (part 3).
Some parts of the formula may be omitted when there is no ambiguity. For example, the following sets of formulas are identical:
f2 <- mFormula(choice ~ cost + ivt | income + urban)
f2 <- mFormula(choice ~ cost + ivt | income + urban | 0)
f4 <- mFormula(choice ~ cost + ivt)
f4 <- mFormula(choice ~ cost + ivt | 1)
f4 <- mFormula(choice ~ cost + ivt | 1 | 0)
By default, an intercept is added to the model; it can be removed by using + 0 or - 1 in the second part.
model.frame and model.matrix methods are provided for mFormula objects. The latter is of particular interest, as illustrated in the following example:
## air:(intercept) bus:(intercept) car:(intercept) cost
## 109.train 0 0 0 58.25
## 109.air 1 0 0 142.80
## 109.bus 0 1 0 27.52
## 109.car 0 0 1 71.63
## air:income bus:income car:income train:ivt air:ivt
## 109.train 0 0 0 215 0
## 109.air 45 0 0 0 56
## 109.bus 0 45 0 0 0
## 109.car 0 0 45 0 0
## bus:ivt car:ivt
## 109.train 0 0
## 109.air 0 0
## 109.bus 301 0
## 109.car 0 262
The model matrix contains J − 1 columns for every choice situation specific variable (income and the intercept), which means that the coefficient associated with the first alternative (train) is set to 0. It contains only one column for cost because we want a generic coefficient for this variable. It contains J columns for ivt, because it is an alternative specific variable for which we want alternative specific coefficients.
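The layout of these columns can be reproduced by hand in base R for the choice situation printed above. This is a sketch of the structure only, not the code used by the model.matrix method: the data frame cs is typed in from the earlier output for case 109, and the dummy matrix D and the way the column names are built are illustrative choices.

```r
# Choice situation 109, typed in from the printed ModeCanada rows
cs <- data.frame(
  alt = c("train", "air", "bus", "car"),
  cost = c(58.25, 142.80, 27.52, 71.63),
  ivt = c(215, 56, 301, 262),
  income = 45,
  stringsAsFactors = FALSE
)
alts <- cs$alt  # alternatives in order of appearance; the first one is the reference

# One dummy column per alternative
D <- sapply(alts, function(a) as.numeric(cs$alt == a))
rownames(D) <- paste("109", alts, sep = ".")

# J - 1 columns for the intercept and for income (reference alternative dropped)
intercepts <- D[, -1]
colnames(intercepts) <- paste0(alts[-1], ":(intercept)")
income <- D[, -1] * cs$income
colnames(income) <- paste0(alts[-1], ":income")

# J columns for ivt (alternative specific coefficients)
ivt <- D * cs$ivt
colnames(ivt) <- paste0(alts, ":ivt")

# A single column for cost (generic coefficient)
X <- cbind(intercepts, cost = cs$cost, income, ivt)
X
```

With J = 4 alternatives this gives 3 + 1 + 3 + 4 = 11 columns, the same count as in the printed model matrix.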
As for all models estimated by maximum likelihood, three testing procedures may be applied to test hypotheses about models fitted using mlogit. The set of hypotheses tested defines two models: the unconstrained model, which doesn’t take these hypotheses into account, and the constrained model, which imposes them.
This in turn defines three testing principles: the Wald test, based only on the unconstrained model, the Lagrange multiplier test (or score test), based only on the constrained model, and the likelihood ratio test, based on the comparison of both models.
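For reference, the three statistics can be written in their textbook form. The notation is an assumption, since the text does not fix symbols: $\hat\theta_u$ and $\hat\theta_c$ are the unconstrained and constrained estimates, $\ln L$ is the log-likelihood, $g$ its gradient, $\hat V$ the estimated covariance matrix of the estimator, and the $q$ hypotheses are taken to be linear, $R\theta = r$.

```latex
\begin{aligned}
\mathrm{LR}   &= 2\left[\ln L(\hat\theta_u) - \ln L(\hat\theta_c)\right], \\
\mathrm{Wald} &= (R\hat\theta_u - r)^\top
                 \left[R\,\hat V(\hat\theta_u)\,R^\top\right]^{-1}
                 (R\hat\theta_u - r), \\
\mathrm{LM}   &= g(\hat\theta_c)^\top\,\hat V(\hat\theta_c)\,g(\hat\theta_c).
\end{aligned}
```

Under the hypotheses tested, each statistic is asymptotically distributed as a $\chi^2$ with $q$ degrees of freedom. The Wald statistic only involves $\hat\theta_u$ and the score statistic only $\hat\theta_c$, which is why each requires the estimation of a single model.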
Two of these three tests are implemented in the lmtest package (Zeileis and Hothorn 2002): waldtest and lrtest. The Wald test is also implemented as linearHypothesis in package car (Fox and Weisberg 2010), with a fairly different syntax. We provide special methods of waldtest and lrtest for mlogit objects and we also provide a function for the Lagrange multiplier (or score) test called scoretest.
We’ll see later that the score test is especially useful for mlogit objects when one is interested in extending the basic multinomial logit model because, in this case, the unconstrained model may be difficult to estimate. For the presentation of further tests, we provide a convenient statpval function which extracts the statistic and the p-value from the objects returned by the testing functions, which can be either of class anova or htest.
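A helper of this kind could look like the sketch below. It is one possible implementation, not necessarily the one used in the text: it assumes that anova objects carry the statistic and the p-value in their second row, in columns named Chisq and Pr(>Chisq), and that htest objects store them in their statistic and p.value elements, as the objects returned by base R tests such as chisq.test do.

```r
statpval <- function(x) {
  if (inherits(x, "anova")) {
    # Assumed layout: the second row compares the two models
    result <- as.matrix(x)[2, c("Chisq", "Pr(>Chisq)")]
  } else if (inherits(x, "htest")) {
    result <- c(x$statistic, x$p.value)
  } else {
    stop("x must be of class 'anova' or 'htest'")
  }
  names(result) <- c("stat", "p-value")
  round(result, 3)
}

# Usage on an htest object from base R (made-up counts)
ht <- chisq.test(matrix(c(20, 15, 30, 35), nrow = 2))
statpval(ht)
```

Returning a named vector of length two keeps the output of the different testing functions comparable, whatever class they return.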
[1] Used by Ben-Akiva, Bolduc, and Bradley (1993) and Meijer and Rouwendal (2006).
[2] Used in particular by Forinash and Koppelman (1993), Bhat (1995), Koppelman and Wen (1998) and Koppelman and Wen (2000).
[3] This kind of heteroscedasticity shouldn’t be confused with alternative heteroscedasticity ($\sigma_j^2 \neq \sigma_k^2$), which is introduced in the heteroskedastic logit model described in the vignette on relaxing the iid hypothesis.
[4] See Zeileis and Croissant (2010) for a description of the Formula package.