The utility for alternative l is written as: Ul = Vl + ϵl where Vl is a function of some observable covariates and unknown parameters to be estimated, and ϵl is a random deviate which contains all the unobserved determinants of the utility. Alternative l is therefore chosen if ϵj < (Vl − Vj) + ϵl ∀ j ≠ l and the probability of choosing this alternative is then:
P(ϵ1 < Vl − V1 + ϵl, ϵ2 < Vl − V2 + ϵl, ..., ϵJ < Vl − VJ + ϵl).
Denoting F−l the cumulative density function of all the ϵs except ϵl, this probability is:
Note that this probability is conditional on the value of ϵl. The unconditional probability (which depends only on β and on the value of the observed explanatory variables) is obtained by integrating out the conditional probability using the marginal density of ϵl, denoted fl:
The conditional probability is an integral of dimension J − 1 and the computation of the unconditional probability adds on more dimension of integration.
The multinomial logit model (McFadden 1974) is a special case of the model developed in the previous section. It is based on three hypothesis.
The first hypothesis is the independence of the errors. In this case, the univariate distribution of the errors can be used, which leads to the following conditional and unconditional probabilities:
which means that the conditional probability is the product of J − 1 univariate cumulative density functions and the evaluation of only a one-dimensional integral is required to compute the unconditional probability.
The second hypothesis is that each ϵ follows a Gumbel distribution, whose density and probability functions are respectively:
where μ is the location parameter and θ the scale parameter. The first two moments of the Gumbel distribution are E(z) = μ + θγ, where γ is the Euler-Mascheroni constant ( ≈ 0.577) and $\mbox{V}(z)=\frac{\pi^2}{6}\theta^2$. The mean of ϵj is not identified if Vj contains an intercept. We can then, without loss of generality suppose that μj = 0, ∀j. Moreover, the overall scale of utility is not identified. Therefore, only J − 1 scale parameters may be identified, and a natural choice of normalization is to impose that one of the θj is equal to 1.
The last hypothesis is that the errors are identically distributed. As the location parameter is not identified for any error term, this hypothesis is essentially an homoscedasticity hypothesis, which means that the scale parameter of the Gumbel distribution is the same for all the alternatives. As one of them has been previously set to 1, we can therefore suppose that, without loss of generality, θj = 1, ∀j ∈ 1...J. The conditional and unconditional probabilities then further simplify to:
The probabilities have then very simple, closed forms, which correspond to the logit transformation of the deterministic part of the utility.
If we consider the probabilities of choice for two alternatives l and m, we have $P_l=\frac{e^{V_l}}{\sum_j e^{V_j}}$ and $P_m=\frac{e^{V_m}}{\sum_j e^{V_j}}$. The ratio of these two probabilities is:
$$ \frac{P_l}{P_m}=\frac{e^{V_l}}{e^{V_m}}=e^{V_l-V_m}. $$
This probability ratio for the two alternatives depends only on the characteristics of these two alternatives and not on those of other alternatives. This is called the IIA property (for independence of irrelevant alternatives). IIA relies on the hypothesis that the errors are identical and independent. It is not a problem by itself and may even be considered as a useful feature for a well specified model. However, this hypothesis may be in practice violated, especially if some important variables are omitted.
In a linear model, the coefficients are the marginal effects of the explanatory variables on the explained variable. This is not the case for the multinomial logit model. However, meaningful results can be obtained using relevant transformations of the coefficients.
The marginal effects are the derivatives of the probabilities with respect to the covariates, which can be be choice situation-specific (zi) or alternative specific (xij):
$$ \begin{array}{rcl} \displaystyle \frac{\partial P_{il}}{\partial z_{i}}&=&P_{il}\left(\beta_l-\sum_j P_{ij}\beta_j\right) \\ \displaystyle \frac{\partial P_{il}}{\partial x_{il}}&=&\gamma P_{il}(1-P_{il})\\ \displaystyle \frac{\partial P_{il}}{\partial x_{ik}}&=&-\gamma P_{il}P_{ik}. \end{array} $$
For a choice situation specific variable, the sign of the marginal effect is not necessarily the sign of the coefficient. Actually, the sign of the marginal effect is given by (βl − ∑jPijβj), which is positive if the coefficient for alternative l is greater than a weighted average of the coefficients for all the alternatives, the weights being the probabilities of choosing the alternatives. In this case, the sign of the marginal effect can be established with no ambiguity only for the alternatives with the lowest and the greatest coefficients.
For an alternative-specific variable, the sign of the coefficient can be directly interpreted. The marginal effect is obtained by multiplying the coefficient by the product of two probabilities which is at most 0.25. The rule of thumb is therefore to divide the coefficient by 4 in order to have an upper bound of the marginal effect.
Note that the last equation can be rewritten: $\frac{\mbox{d} P_{il} / P_{il}}{\mbox{d}x_{ik}} = -\gamma P_{ik}$. Therefore, when a characteristic of alternative k changes, the relative change of the probabilities for every alternatives except k are the same, which is a consequence of the IIA property.
Coefficients are marginal utilities, which cannot be interpreted. However, ratios of coefficients are marginal rates of substitution. For example, if the observable part of utility is: V = βo + β1x1 + βx2 + βx3, join variations of x1 and x2 which ensure the same level of utility are such that: dV = β1dx1 + β2dx2 = 0 so that:
$$ - \frac{dx_2}{dx_1}\mid_{dV = 0} = \frac{\beta_1}{\beta_2}. $$
For example, if x2 is transport cost (in $), x1 transport time (in hours), β1 = 1.5 and β2 = 0.2, $\frac{\beta_1}{\beta_2}=30$ is the marginal rate of substitution of time in terms of $ and the value of 30 means that to reduce the travel time of one hour, the individual is willing to pay at most 30$ more. Stated more simply, time value is 30$ per hour.
Consumer’s surplus has a very simple expression for multinomial logit models, which was first derived by Small and Rosen (1981). The level of utility attained by an individual is Uj = Vj + ϵj, j being the chosen alternative. The expected utility, from the searcher’s point of view is then: E(maxjUj), where the expectation is taken over the values of all the error terms. Its expression is simply, up to an additive unknown constant, the log of the denominator of the logit probabilities, often called the “log-sum”:
$$ \mbox{E}(U)=\ln \sum_{j=1}^Je^{V_j}+C. $$
If the marginal utility of income (α) is known and constant, the expected surplus is simply $\frac{\mbox{E}(U)}{\alpha}$.
Random utility models are fitted using the mlogit
function. Basically, only two arguments are mandatory,
formula
and data
, if a
mlogit.data
object (and not an ordinary
data.frame
) is provided.
We first use the ModeCanada
data set, which was already
coerced to a mlogit.data
object (called MC
) in
the previous section. The same model can then be estimated using as
data
argument this mlogit.data
object:
library("mlogit")
data("ModeCanada", package = "mlogit")
MC <- mlogit.data(ModeCanada, subset = noalt == 4, chid.var = "case",
alt.var = "alt", drop.index = TRUE)
ml.MC1 <- mlogit(choice ~ cost + freq + ovt | income | ivt, MC)
or a data.frame
. In this latter case, further arguments
that will be passed to mlogit.data
should be indicated:
ml.MC1b <- mlogit(choice ~ cost + freq + ovt | income | ivt, ModeCanada,
subset = noalt == 4, alt.var = "alt", chid.var = "case")
mlogit
provides two further useful arguments:
reflevel
indicates which alternative is the “reference”
alternative, i.e., the one for which the coefficients of choice
situation specific covariates are set to 0,alt.subset
indicates a subset of alternatives on which
the estimation has to be performed; in this case, only the lines that
correspond to the selected alternatives are used and all the choice
situations where not selected alternatives has been chosen are
removed.We estimate the model on the subset of three alternatives (we exclude
bus
whose market share is negligible in our sample) and we
set car
as the reference alternative. Moreover, we use a
total transport time variable computed as the sum of the in vehicle and
the out of vehicle time variables.
MC$time <- with(MC, ivt + ovt)
ml.MC1 <- mlogit(choice ~ cost + freq | income | time, MC,
alt.subset = c("car", "train", "air"), reflevel = "car")
The main results of the model are computed and displayed using the
summary
method:
##
## Call:
## mlogit(formula = choice ~ cost + freq | income | time, data = MC,
## alt.subset = c("car", "train", "air"), reflevel = "car",
## method = "nr")
##
## Frequencies of alternatives:
## car train air
## 0.45757 0.16721 0.37523
##
## nr method
## 6 iterations, 0h:0m:0s
## g'(-H)^-1g = 6.94E-06
## successive function values within tolerance limits
##
## Coefficients :
## Estimate Std. Error z-value Pr(>|z|)
## train:(intercept) -0.97034440 0.26513065 -3.6599 0.0002523 ***
## air:(intercept) -1.89856552 0.68414300 -2.7751 0.0055185 **
## cost -0.02849715 0.00655909 -4.3447 1.395e-05 ***
## freq 0.07402902 0.00473270 15.6420 < 2.2e-16 ***
## train:income -0.00646892 0.00310366 -2.0843 0.0371342 *
## air:income 0.02824632 0.00365435 7.7295 1.088e-14 ***
## car:time -0.01402405 0.00138047 -10.1589 < 2.2e-16 ***
## train:time -0.01096877 0.00081834 -13.4036 < 2.2e-16 ***
## air:time -0.01755120 0.00399181 -4.3968 1.099e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Log-Likelihood: -1951.3
## McFadden R^2: 0.31221
## Likelihood ratio test : chisq = 1771.6 (p.value = < 2.22e-16)
The frequencies of the different alternatives in the sample are first indicated. Next, some information about the optimization are displayed: the Newton-Ralphson method (with analytic gradient and hessian) is used, as it is the most efficient method for this simple model for which the log-likelihood function is concave. Note that very few iterations and computing time are required to estimate this model. Then the usual table of coefficients is displayed followed by some goodness of fit measures: the value of the log-likelihood function, which is compared to the value when only intercepts are introduced, which leads to the computation of the McFadden R2 and to the likelihood ratio test.
The fitted
method can be used either to obtain the
probability of actual choices (type = "outcome"
) or the
probabilities for all the alternatives
(type = "probabilities"
).
## 109 110 111 112 113 114
## 0.1909475 0.3399941 0.1470527 0.3399941 0.3399941 0.2440011
## car train air
## 109 0.4206404 0.3884120 0.1909475
## 110 0.3696476 0.2903582 0.3399941
## 111 0.4296769 0.4232704 0.1470527
## 112 0.3696476 0.2903582 0.3399941
Note that the log-likelihood is the sum of the log of the fitted outcome probabilities and that, as the model contains intercepts, the average fitted probabilities for every alternative equals the market shares of the alternatives in the sample.
## [1] -1951.344
## 'log Lik.' -1951.344 (df=9)
## car train air
## 0.4575659 0.1672084 0.3752257
Predictions can be made using the predict
method. If no
data is provided, predictions are made for the sample mean values of the
covariates.
## car train air
## 0.5066362 0.2116876 0.2816761
Assume, for example, that we wish to predict the effect of a
reduction of train transport time of 20%. We first create a new
data.frame
simply by multiplying train transport time by
0.8 and then using the predict
method with this new
data.frame
.
NMC <- MC
NMC[index(NMC)$alt == "train", "time"] <- 0.8 *
NMC[index(NMC)$alt == "train", "time"]
Oprob <- fitted(ml.MC1, type = "probabilities")
Nprob <- predict(ml.MC1, newdata = NMC)
rbind(old = apply(Oprob, 2, mean), new = apply(Nprob, 2, mean))
## car train air
## old 0.4575659 0.1672084 0.3752257
## new 0.4044736 0.2635801 0.3319462
If, for the first individuals in the sample, we compute the ratio of the probabilities of the air and the car mode, we obtain:
## 109 110 111 112 113 114
## 0.4539448 0.9197791 0.3422401 0.9197791 0.9197791 0.6021092
## 109 110 111 112 113 114
## 0.4539448 0.9197791 0.3422401 0.9197791 0.9197791 0.6021092
which is an illustration of the IIA property. If train time changes, it changes the probabilities of choosing air and car, but not their ratio.
We next compute the surplus for individuals of the sample induced by train time reduction. This requires the computation of the log-sum term (also called inclusive value or inclusive utility) for every choice situation, which is:
$$ \mbox{iv}_i = \ln \sum_{j = 1} ^ J e^{\beta^\top x_{ij}}. $$
For this purpose, we use the logsum
function, which
works on a vector of coefficients
and a
model.matrix
. The basic use of logsum
consists
on providing as unique argument (called coef
) a
mlogit
object. In this case, the model.matrix
and the coef
are extracted from the same model.
To compute the log-sum after train time reduction, we must provide a
model.matrix
which is not the one corresponding to the
fitted model. This can be done using the X
argument which
is a matrix or an object from which a model.matrix
can be
extracted. This can also be done by filling the data
argument (a data.frame
or an object from which a
data.frame
can be extracted using a
model.frame
method), and eventually the
formula
argument (a formula
or an object for
which the formula
method can be applied). If no formula is
provided but if data
is a mlogit.data
object,
the formula is extracted from it.
Surplus variation is then computed as the difference of the log-sums divided by the opposite of the cost coefficient which can be interpreted as the marginal utility of income:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.5852 2.8439 3.8998 4.6971 5.8437 31.3912
Consumer’s surplus variation range from 0.6 to 31 Canadian $, with a median value of about 4$.
Marginal effects are computed using the effects
method.
By default, they are computed at the sample mean, but a
data
argument can be provided. The variation of the
probability and of the covariate can be either absolute or relative.
This is indicated with the type
argument which is a
combination of two a
(as absolute) and r
(as
relative) characters. For example, type = "ar"
means that
what is measured is an absolute variation of the probability for a
relative variation of the covariate.
## car train air
## -0.1822177 -0.1509079 0.3331256
The results indicate that, for a 100% increase of income, the
probability of choosing air
increases by 33 points of
percentage, as the probabilities of choosing car
and
train
decrease by 18 and 15 points of percentage.
For an alternative specific covariate, a matrix of marginal effects is displayed.
## car train air
## car -0.9131273 0.9376923 0.9376923
## train 0.3358005 -1.2505014 0.3358005
## air 1.2316679 1.2316679 -3.1409703
The cell in the lth row and the cth column indicates the
change of the probability of choosing alternative c when the cost of alternative l changes. As
type = "rr"
, elasticities are computed. For example, a 10%
change of train cost increases the probabilities of choosing car and air
by 3.36%. Note that the relative changes of the probabilities of
choosing one of these two modes are equal, which is a consequence of the
IIA property.
Finally, in order to compute travel time valuation, we divide the coefficients of travel times (in minutes) by the coefficient of monetary cost (in $).
## car:time train:time air:time
## 29.52728 23.09447 36.95360
The value of travel time ranges from 23 for train to 37 Canadian $ per hour for plane.
The second example is a data set used by Fowlie (2010), called NOx
. She
analyzed the effect of an emissions trading program (the NOx budget
program which seeks to reduce the emission of nitrogen oxides) on the
behavior of producers. More precisely, coal electricity plant managers
may adopt one out of fifteen different technologies in order to comply
to the emissions defined by the program. Some of them require high
investment (the capital cost is kcost
) and are very
efficient to reduce emissions, some other require much less investment
but are less efficient and the operating cost (denoted
vcost
) is then higher, especially because pollution permits
must be purchased to offset emissions exceeding their allocation.
The focus of the paper is on the effects of the regulatory environment on manager’s behavior. Some firms are deregulated, whereas other are either regulated or public. Rate of returns is applied for regulated firms, which means that they perceive a “fair” rate of return on their investment. Public firms also enjoy significant cost of capital advantages. Therefore, the main hypothesis of the paper is that public and regulated firms will adopt much more capitalistic intensive technologies than deregulated and public ones, which means that the coefficient of capital cost should take a higher negative value for deregulated firms. Capital cost is interacted with the age of the plant (measured as a deviation from the sample mean age), as firms should weight capital costs more heavily for older plants, as they have less time to recover these costs.
Multinomial logit models are estimated for the three subsamples
defined by the regulatory environment. The 15 technologies are not
available for every plants, the sample is therefore restricted to
available technologies, using the available
covariate.
Three technology dummies are introduced: post
for
post-combustion polution control technology, cm
for
combustion modification technology and lnb
for low NOx
burners technology.
A last model is estimated for the whole sample, but the parameters
are allowed to be proportional to each other. The scedasticity function
is described in the fourth part of the formula, it contains here only
one covariate, env
. Note also that for the last model, the
author introduced a specific capital cost coefficient for deregulated
firms.1
data("NOx", package = "mlogit")
NOx$kdereg <- with(NOx, kcost * (env == "deregulated"))
NOxml <- mlogit.data(NOx, chid.var = "chid", alt.var = "alt",
id.var = "id")
ml.pub <- mlogit(choice ~ post + cm + lnb + vcost + kcost + kcost:age |
- 1, subset = available & env == "public", data = NOxml)
ml.reg <- update(ml.pub, subset = available & env == "regulated")
ml.dereg <- update(ml.pub, subset = available & env == "deregulated")
ml.pool <- mlogit(choice ~ post + cm + lnb + vcost + kcost + kcost:age +
kdereg | - 1 | 0 | env, subset = available == 1, data = NOxml,
method = "bhhh")
library("texreg")
htmlreg(list(Public = ml.pub, Deregulated = ml.dereg, Regulated = ml.reg,
Pooled = ml.pool), caption = "Environmental compliance choices.",
omit.coef = "(post)|(cm)|(lnb)", float.pos = "hbt", label = "tab:nox")
Public | Deregulated | Regulated | Pooled | |
---|---|---|---|---|
vcost | -1.56*** | -0.19*** | -0.28*** | -0.31*** |
(0.36) | (0.06) | (0.06) | (0.04) | |
kcost | 0.04 | -0.06** | 0.01 | 0.01 |
(0.11) | (0.02) | (0.03) | (0.02) | |
kcost:age | -0.08 | -0.04** | -0.02* | -0.02*** |
(0.04) | (0.01) | (0.01) | (0.01) | |
kdereg | -0.07*** | |||
(0.01) | ||||
sig.envderegulated | 0.32** | |||
(0.12) | ||||
sig.envpublic | -0.33*** | |||
(0.08) | ||||
AIC | 168.92 | 690.15 | 731.48 | 1634.22 |
Log Likelihood | -78.46 | -339.07 | -359.74 | -808.11 |
Num. obs. | 113 | 227 | 292 | 632 |
K | 15 | 15 | 15 | 15 |
***p < 0.001; **p < 0.01; *p < 0.05 |
Results are presented in the preceeding table, using the
texreg
package (Leifeld
2013). Coefficients are very different on the sub-samples defined
by the regulatory environment. Note in particular that the capital cost
coefficient is positive and insignificant for public and regulated
firms, as it is significantly negative for deregulated firms. Errors
seems to have significant larger variance for deregulated firms and
lower ones for public firms compared to regulated firms. The hypothesis
that the coefficients (except the kcost
one) are identical
up to a multiplicative scalar can be performed using a likelihood ratio
test:
## 'log Lik.' 61.67179 (df=6)
## 'log Lik.' 6.377284e-10 (df=6)
The hypothesis is strongly rejected.
Note the use of the method
argument, set to
bhhh
. mlogit
use its own optimisation
functions, but borrows its syntax from package maxLik
(Toomet and Henningsen 2010). The default method
is bfgs
, except for the basic model, for which it is
nr
. As the default algorithm failed to converged, we use
here bhhh
.↩︎