par(family="serif", las=1, bty="l",
cex.axis=1, cex.lab=1, cex.main=1,
xaxs="i", yaxs="i", mar = c(5, 5, 3, 1))
library(rstan)
rstan_options(auto_write = TRUE) # Cache compiled Stan programs
options(mc.cores = parallel::detectCores()) # Parallelize chains
parallel:::setDefaultClusterOptions(setup_strategy = "sequential")

Discrete Choice Modeling
Analyzing observed choices from a finite set is a common task in many fields. Survey data, for example, often takes the form of individual respondents answering questions by choosing from a set of possible answers. Similarly, sales data is often derived from the purchasing choices of individual consumers in a market. Moreover, choice observations are not limited to humans: some ecological analyses study animal behavior by observing choices in consumption, movement, and more.
Modeling observed choices has long been a topic of interest in psychology, marketing (Schwarz, Chapman, and Feit 2020), and economics (Train 2009), each of which introduces its own emphasis, terminology, and notation. In this chapter I present choice modeling from a more general probabilistic modeling perspective. This perspective helps to contextualize conventional modeling techniques while also clarifying their full generality.
We will first consider the general analysis of choice as decision making under uncertainty, and how decision making behavior can change with the possible alternatives. Then we will review some of the more common discrete choice models with a particular focus on the underlying assumptions and their consequences. Along the way I will demonstrate the more theoretical insights with explicit examples.
1 Utility Maximization Models
If we want to treat choices as the result of a decision making process then we need to specify who is making decisions, what decisions they can make, and how they arrive at any particular decision.
1.1 Agents, Alternatives, and Choice Sets
In this chapter I will refer to any entity responsible for making a decision as an agent. Depending on the application, agents can be individuals, such as people or organisms, or even collectives, like companies and communities. Generally we will be interested in not one single agent but rather an ensemble of N agents.
A choice set is a collection of J alternatives from which each agent can select their preferred option. We will assume that the choice set is both exclusive and exhaustive so that each agent always chooses one, but only one, preferred alternative at a time.
The choice set is not always fixed; in many settings the possible alternatives vary from one decision to the next. In this case I will use choice set to refer to the complete set of all possible alternatives and choice subset to refer to any particular subset of available options in a given decision.
1.2 Utilities
If decisions between the alternatives in a choice set are made purposefully then we can model those decisions by quantifying each agent’s preference for the alternatives. The quantification of individual preferences into a one-dimensional real number, taking into account all relevant benefits and costs, is known as a utility. Here I will denote the utility agent n gains by choosing alternative j by u_{nj} \in \mathbb{R}.
Utilities are not generally universal but rather can be influenced by the context of a particular decision. For example, utilities can in theory depend on the number of alternatives in each choice subset, the number of choice subsets considered, the order in which those choice subsets are presented to a given agent, and so on. Because of this we have to be especially careful when extrapolating utilities from one context to another.
In particular the utilities agents allocate, and the decisions that those utilities motivate, can differ between when an agent is asked to make a hypothetical choice and when they make an actual choice. These inconsistent utilities complicate analyses that attempt to infer consistent behavior from multiple sources. For instance, the outcomes of surveys and focus groups, which propose hypothetical consumer decisions, may be inconsistent with sales data, which is derived from actual consumer decisions. In marketing and economics these varying behaviors are known as stated preferences and revealed preferences, respectively.
1.3 Choice Probabilities
When the utilities that an agent assigns to each alternative are known exactly, the resulting decisions become completely deterministic. Agent n will choose alternative j if u_{nj} > u_{nj'} for all j' \ne j. Because utilities are real-valued, exact ties are exceedingly improbable and can be safely ignored.
Utilities, however, are rarely known exactly in practical applications. The understanding of utilities might, for example, be limited by incomplete or untrustworthy elicitation of the agents, varying or otherwise inconsistent preferences, or any number of other unknown or unquantifiable factors.
Regardless of why utilities are uncertain in a given analysis, we can always consistently model that uncertainty with a probability distribution over all of the utilities. Probabilistic utility models are sometimes referred to as behavioral models.
Because the utilities are all real-valued, we can always specify a probabilistic utility model with a joint probability density function, p( u_{11}, \ldots, u_{1J}, \ldots, u_{N1}, \ldots, u_{NJ} ). In applications where agents make decisions independently from each other, we can model the utilities for each agent with an independent product distribution, p( u_{11}, \ldots, u_{1J}, \ldots, u_{N1}, \ldots, u_{NJ} ) = \prod_{n = 1}^{N} p( u_{n1}, \ldots, u_{nJ} ).
With uncertain utilities, decisions are no longer deterministic but rather probabilistic. Agent n will choose alternative j for any configuration of the utilities that falls into the rectangular subset \mathsf{u}_{nj} = \bigcap_{j' \ne j} \{ u_{nj} > u_{nj'} \}. The probability that agent n chooses alternative j is given by the probability allocated to this subset, \begin{align*} q_{nj} &= \pi( \mathsf{u}_{nj} ) \\ &= \int \prod_{j' = 1}^{J} \mathrm{d} u_{nj'} \, p( u_{n1}, \ldots, u_{nJ} ) \, I_{ \mathsf{u}_{nj} } ( u_{n1}, \ldots, u_{nJ} ). \end{align*}
These alternative probabilities then define a discrete choice model for an observed choice y made by agent n, p( y ) = \mathrm{categorical} ( y \mid q_{n1}, \ldots, q_{nJ} ) = q_{n y}.
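To make this construction concrete, here is a minimal R sketch that estimates choice probabilities by brute force, assuming a hypothetical independent normal utility model: we simulate utility configurations and record how often each alternative attains the maximum.

# Monte Carlo estimate of the choice probabilities for a single agent,
# assuming a hypothetical independent normal utility model
set.seed(5838294)

S <- 100000                  # Number of simulated utility configurations
mus <- c(1.0, 0.5, 0.0)      # Hypothetical utility means
sigmas <- c(1.0, 1.0, 1.5)   # Hypothetical utility standard deviations

# Simulate utilities, one column per alternative
u <- sapply(1:3, function(j) rnorm(S, mus[j], sigmas[j]))

# Each simulated decision selects the alternative with the largest utility
choices <- apply(u, 1, which.max)

# Empirical choice frequencies estimate the exact probabilities q_{nj}
print(table(choices) / S)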
Many presentations of discrete choice models implicitly assume that each agent chooses from a given choice set only once. In this case each observed choice can be uniquely labeled by the agent who made it, y_{n}. When agents can make multiple choices, however, this labeling convention becomes ungainly.
To avoid any potential confusion in this chapter, I will label individual choice observations with a separate index m \in \{1, \ldots, M\}. We can then denote the agent making that choice as n_{m}, so that the discrete choice model becomes p( y_{m} ) = \mathrm{categorical} ( y_{m} \mid q_{n_{m}1}, \ldots, q_{n_{m}J} ) = q_{n_{m} y_{m}}. This notation also allows us to explicitly denote varying choice subsets, for example p( y_{m} ) = \mathrm{categorical} ( y_{m} \mid q_{n_{m}1}, \ldots, q_{n_{m}J_{m}} ) = q_{n_{m} y_{m}}.
1.4 Deriving Choice Probabilities From Cumulative Distribution Functions
Because the joint subset \mathsf{u}_{nj} is defined by the intersection of component subsets, we can write its indicator function as a product of component indicator functions, I_{ \mathsf{u}_{nj} } ( u_{n1}, \ldots, u_{nJ} ) = \prod_{j' \ne j} I_{ \{ u_{nj} > u_{nj'} \} } ( u_{nj}, u_{nj'} ). This then allows us to substantially simplify the choice probabilities, \begin{align*} q_{nj} &= \int \left[ \prod_{j' = 1}^{J} \mathrm{d} u_{nj'} \right] \, p( u_{n1}, \ldots, u_{nJ} ) \, I_{ \mathsf{u}_{nj} } ( u_{n1}, \ldots, u_{nJ} ) \\ &= \int \left[ \prod_{j' = 1}^{J} \mathrm{d} u_{nj'} \right] \, p( u_{n1}, \ldots, u_{nJ} ) \, \prod_{j' \ne j} I_{ \{ u_{nj} > u_{nj'} \} } ( u_{nj}, u_{nj'} ) \\ &= \int \mathrm{d} u_{nj} \, \left[ \prod_{j' \ne j} \int \mathrm{d} u_{nj'} \, I_{ \{ u_{nj} > u_{nj'} \} } ( u_{nj}, u_{nj'} ) \right] \, p( u_{n1}, \ldots, u_{nJ} ) \\ &= \int_{-\infty}^{+\infty} \mathrm{d} u_{nj} \, \left[ \prod_{j' \ne j} \int_{-\infty}^{ u_{nj} } \mathrm{d} u_{nj'} \, \right] \, p( u_{n1}, \ldots, u_{nJ} ). \end{align*} Conveniently, this integral can always be written in terms of the cumulative distribution function of the given probabilistic utility model.
The joint probability density function p( x_{1}, \ldots, x_{J} ) defines the joint cumulative distribution function \Pi( x_{1}, \ldots, x_{J} ) = \left[ \prod_{j = 1}^{J} \int_{-\infty}^{ x_{j} } \mathrm{d} x'_{j} \, \right] \, p( x'_{1}, \ldots, x'_{J} ). Taking a partial derivative with respect to the jth input gives \frac{ \partial \Pi }{ \partial x_{j} } ( x_{1}, \ldots, x_{J} ) = \left[ \prod_{j' \ne j} \int_{-\infty}^{ x_{j'} } \mathrm{d} x'_{j'} \, \right] \, p( x'_{1}, \ldots, x_{j}, \ldots, x'_{J} ).
If we set all of the inputs to u_{nj} then this is exactly the integrand defining the choice probabilities! Consequently we can write the choice probabilities as \begin{align*} q_{nj} &= \int_{-\infty}^{+\infty} \mathrm{d} u_{nj} \, \left[ \prod_{j' \ne j} \int_{-\infty}^{ u_{nj} } \mathrm{d} u_{nj'} \, \right] \, p( u_{n1}, \ldots, u_{nJ} ) \\ &= \int_{-\infty}^{+\infty} \mathrm{d} u_{nj} \, \frac{ \partial \Pi }{ \partial x_{j} } ( u_{nj}, \ldots, u_{nj} ), \end{align*} or, a bit more compactly, q_{nj} = \int_{-\infty}^{+\infty} \mathrm{d} u \, \frac{ \partial \Pi }{ \partial x_{j} } ( u, \ldots, u ).
In addition to potentially simplifying calculations, these formulas also allow us to define a probabilistic utility model not in terms of a joint probability density function but rather with a joint cumulative distribution function.
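To see the cumulative distribution function strategy in action, here is a sketch that evaluates the one-dimensional integral numerically with R’s integrate function, again assuming a hypothetical independent normal utility model so that the joint cumulative distribution function factorizes into pnorm terms.

# Choice probabilities from the cumulative distribution function formula,
# assuming hypothetical independent normal utility models
mus <- c(1.0, 0.5, 0.0)      # Hypothetical utility means
sigmas <- c(1.0, 1.0, 1.5)   # Hypothetical utility standard deviations

# Partial derivative of the joint CDF evaluated along the diagonal:
# the jth marginal density times the cumulative probabilities of the others
integrand <- function(u, j) {
  f <- dnorm(u, mus[j], sigmas[j])
  for (jp in setdiff(1:3, j))
    f <- f * pnorm(u, mus[jp], sigmas[jp])
  f
}

# One one-dimensional integral per alternative
q <- sapply(1:3, function(j) integrate(integrand, -Inf, +Inf, j=j)$value)
print(q)       # Should agree with the Monte Carlo estimates above
print(sum(q))  # Should be numerically close to one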
2 Substitution Patterns
Once we can derive choice probabilities, we can consider how choice probabilities change when the behavior of one or more alternatives changes.
For example, if the jth alternative is replaced with a worse option then we would expect that q_{nj} will decrease. How, however, should the excess probability be redistributed amongst the other alternatives? Do some alternatives benefit more than others, or do they all increase in a similar manner?
If q_{nj} is taken all the way to zero, then the jth item is effectively removed from the choice set. In many applications a critical question is how choice probabilities change when alternatives are removed from or added to the choice set or, more generally, how choice probabilities vary across different choice subsets.
All of these behaviors are examples of substitution patterns. Without the right substitution patterns, a discrete choice model will not be able to make inferences and predictions that generalize beyond a particular choice subset. Consequently, understanding the substitution patterns implied by a given probabilistic utility model is critical.
Quantifying substitution patterns through the behavior of choice probabilities directly is complicated by the summation constraint \sum_{j = 1}^{J} q_{nj} = 1. Any change in the choice probability for one alternative always results in a corresponding change to at least one other choice probability.
Instead substitution patterns are often better quantified by proportional changes to choice probabilities. In particular, ratios of choice probabilities are often much more interpretable.
2.1 Independence from Irrelevant Alternatives
One prototypical substitution pattern arises when the relative preference between two alternatives is unaffected by changes in preferences of other alternatives.
For example, if the j_{3}th choice probability q_{nj_{3}} decreases then we are left with an excess of probability that has to be redistributed amongst the other alternatives. If q_{nj_{1}} increases more than q_{nj_{2}} then the relative preference for alternative j_{1} over alternative j_{2} will increase.
The relative preference for alternatives j_{1} and j_{2} will be unaffected by a change in preference for alternative j_{3} if, and only if, any increase/decrease in q_{nj_{1}} is complemented by the same proportional increase/decrease in q_{nj_{2}}.
In other words the ratio of choice probabilities, \frac{ q_{nj_{1}} }{ q_{nj_{2}} }, must be independent of q_{nj_{3}}. In this case the behavior of alternative j_{3} is irrelevant to the relative behavior of alternatives j_{1} and j_{2}.
A discrete choice model exhibits independence from irrelevant alternatives when the ratio of choice probabilities for every pair of alternatives is invariant to changes in any other choice probabilities. Independence from irrelevant alternatives results in very particular substitution patterns that can be both convenient and overly restrictive.
2.1.1 Removing An Alternative From The Choice Set
Independence from irrelevant alternatives, for example, implies that the ratio of choice probabilities between each remaining pair of alternatives will be the same even if the choice probability for another alternative goes to zero and we eliminate that alternative from the choice set entirely. This behavior allows us to derive the choice probabilities for any choice subset.
To demonstrate, let’s consider four alternatives with the choice probability ratios \begin{align*} \frac{ q_{n1} }{ q_{n2} } &= r_{12} \quad\quad\quad \frac{ q_{n1} }{ q_{n3} } = r_{13} \quad\quad\quad \frac{ q_{n1} }{ q_{n4} } = r_{14} \\ \frac{ q_{n2} }{ q_{n3} } &= r_{23} \quad\quad\quad \frac{ q_{n2} }{ q_{n4} } = r_{24} \quad\quad\quad \frac{ q_{n3} }{ q_{n4} } = r_{34}. \end{align*} Under independence from irrelevant alternatives, the ratios of choice probabilities will be the same if we remove the second alternative from the choice set, \frac{ q'_{n1} }{ q'_{n3} } = r_{13} \quad\quad\quad \frac{ q'_{n1} }{ q'_{n4} } = r_{14} \quad\quad\quad \frac{ q'_{n3} }{ q'_{n4} } = r_{34}.
Along with the summation constraint q'_{n1} + q'_{n3} + q'_{n4} = 1 these ratios are enough to completely determine the choice probabilities for this choice subset.
Using the first two ratios we can isolate the first choice probability, \begin{align*} q'_{n1} + q'_{n3} + q'_{n4} &= 1 \\ q'_{n1} + \frac{1}{ r_{13} } q'_{n1} + \frac{1}{ r_{14} } q'_{n1} &= 1 \\ q'_{n1} \left( 1 + \frac{1}{ r_{13} } + \frac{1}{ r_{14} } \right) &= 1 \\ q'_{n1} &= \frac{ 1 }{ 1 + \frac{1}{ r_{13} } + \frac{1}{ r_{14} } }. \end{align*}
Then we can isolate the next choice probability, \begin{align*} q'_{n1} + q'_{n3} + q'_{n4} &= 1 \\ \frac{ 1 }{ 1 + \frac{1}{ r_{13} } + \frac{1}{ r_{14} } } + q'_{n3} + \frac{1}{ r_{34} } \, q'_{n3} &= 1 \\ q'_{n3} \left(1 + \frac{1}{ r_{34} } \right) &= 1 - \frac{ 1 }{ 1 + \frac{1}{ r_{13} } + \frac{1}{ r_{14} } } \\ q'_{n3} &= \frac{1}{1 + \frac{1}{ r_{34} }} \left( 1 - \frac{ 1 }{ 1 + \frac{1}{ r_{13} } + \frac{1}{ r_{14} } } \right). \end{align*}
The final choice probability follows from the summation constraint, \begin{align*} q'_{n1} + q'_{n3} + q'_{n4} &= 1 \\ q'_{n4} &= 1 - q'_{n1} - q'_{n3} \\ &= 1 - \frac{ 1 }{ 1 + \frac{1}{ r_{13} } + \frac{1}{ r_{14} } } - \frac{1}{1 + \frac{1}{ r_{34} }} \left( 1 - \frac{ 1 }{ 1 + \frac{1}{ r_{13} } + \frac{1}{ r_{14} } } \right). \end{align*}
All of this is to say that, under independence from irrelevant alternatives, we can reconstruct choice probabilities for any choice subset from the choice probabilities of the full choice set without having to rederive them from a probabilistic utility model. In practice this can also facilitate the aggregation of inferences from observations that span many different choice subsets.
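In fact these formulas reduce to a simple renormalization: under independence from irrelevant alternatives the choice probabilities for a subset are the full choice probabilities restricted to that subset and rescaled to sum to one. A quick R sketch with hypothetical choice probabilities verifies that the pairwise ratios are preserved.

# Subset choice probabilities under independence from irrelevant
# alternatives, using hypothetical full choice set probabilities
q <- c(0.4, 0.3, 0.2, 0.1)

# Remove the second alternative and renormalize the rest
subset <- c(1, 3, 4)
q_prime <- q[subset] / sum(q[subset])

# Pairwise ratios within the subset are unchanged
print(c(q[1] / q[3], q_prime[1] / q_prime[2]))  # r_{13}
print(c(q[3] / q[4], q_prime[2] / q_prime[3]))  # r_{34}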
2.1.2 Adding An Alternative To The Choice Set
Independence from irrelevant alternatives also constrains the choice probabilities when we add new alternatives to the choice set. To completely determine the updated choice probabilities, we just need to specify the ratios between the choice probability of the new alternative and the existing J choice probabilities. While this might sound burdensome, it is a much less demanding task than having to specify all { J + 1 \choose 2} = \frac{ (J + 1) \, J }{ 2 } possible ratios across the expanded choice set.
To illustrate this behavior let’s consider an initial choice subset with two alternatives and the choice probability ratio \frac{ q_{n1} }{ q_{n2} } = r_{12}. Assuming independence from irrelevant alternatives, the choice probabilities after adding a new, third alternative would satisfy the ratios \frac{ q'_{n1} }{ q'_{n2} } = r_{12} \quad\quad\quad \frac{ q'_{n1} }{ q'_{n3} } = r_{13} \quad\quad\quad \frac{ q'_{n2} }{ q'_{n3} } = r_{23}. Once we provide r_{13} and r_{23} we can derive all three choice probabilities using calculations similar to those in the previous section.
Starting with the first alternative we have \begin{align*} q'_{n1} + q'_{n2} + q'_{n3} &= 1 \\ q'_{n1} + \frac{1}{ r_{12} } q'_{n1} + \frac{1}{ r_{13} } q'_{n1} &= 1 \\ q'_{n1} \left( 1 + \frac{1}{ r_{12} } + \frac{1}{ r_{13} } \right) &= 1 \\ q'_{n1} &= \frac{ 1 }{ 1 + \frac{1}{ r_{12} } + \frac{1}{ r_{13} } }. \end{align*}
At this point we can isolate the next choice probability, \begin{align*} q'_{n1} + q'_{n2} + q'_{n3} &= 1 \\ \frac{ 1 }{ 1 + \frac{1}{ r_{12} } + \frac{1}{ r_{13} } } + q'_{n2} + \frac{1}{ r_{23} } \, q'_{n2} &= 1 \\ q'_{n2} \left(1 + \frac{1}{ r_{23} } \right) &= 1 - \frac{ 1 }{ 1 + \frac{1}{ r_{12} } + \frac{1}{ r_{13} } } \\ q'_{n2} &= \frac{1}{1 + \frac{1}{ r_{23} }} \left( 1 - \frac{ 1 }{ 1 + \frac{1}{ r_{12} } + \frac{1}{ r_{13} } } \right). \end{align*}
Then we can use the summation constraint to derive the final choice probability, \begin{align*} q'_{n1} + q'_{n2} + q'_{n3} &= 1 \\ q'_{n3} &= 1 - q'_{n1} - q'_{n2} \\ &= 1 - \frac{ 1 }{ 1 + \frac{1}{ r_{12} } + \frac{1}{ r_{13} } } - \frac{1}{1 + \frac{1}{ r_{23} }} \left( 1 - \frac{ 1 }{ 1 + \frac{1}{ r_{12} } + \frac{1}{ r_{13} } } \right). \end{align*}
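A quick numeric check of these formulas in R, with hypothetical ratios. Note that r_{23} is not actually free here: preserving the initial ratio r_{12} requires r_{23} = r_{13} / r_{12}.

# Expanding a two-alternative choice set under independence from
# irrelevant alternatives, with hypothetical choice probability ratios
r12 <- 2.0        # Initial ratio q_{n1} / q_{n2}
r13 <- 4.0        # Specified ratio q'_{n1} / q'_{n3}
r23 <- r13 / r12  # Implied ratio q'_{n2} / q'_{n3}

# Derived choice probabilities over the expanded choice set
q1 <- 1 / (1 + 1 / r12 + 1 / r13)
q2 <- (1 - q1) / (1 + 1 / r23)
q3 <- 1 - q1 - q2

# Verify the ratio constraints and the summation constraint
print(c(q1 / q2, q1 / q3, q2 / q3))  # Should recover r12, r13, r23
print(q1 + q2 + q3)                  # Should equal one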
2.1.3 The Blue Bus/Red Bus Problem
While independence from irrelevant alternatives facilitates calculations involving the reduction or expansion of choice subsets, it also implies behaviors that can be less desirable in some applications.
Consider, for example, analyzing discrete choice data for two commuter alternatives, driving a car versus taking a blue bus, with the choice probabilities q_{n1} = \frac{r}{1 + r} \quad\quad\quad q_{n2} = \frac{1}{1 + r} and the corresponding ratio \frac{ q_{n1} }{ q_{n2} } = r.
At this point we introduce a third alternative: the same bus as before only with a shiny new coat of red paint. If the aesthetics have negligible impact on commuter behavior then the two bus options in the expanded choice set should have the same probability, \frac{ q'_{n2} }{ q'_{n3} } = 1, and the ratio of choice probabilities between driving a car and taking a new bus should be the same as the ratio between driving a car and taking a blue bus, \frac{ q'_{n1} }{ q'_{n2} } = \frac{ q'_{n1} }{ q'_{n3} }. Assuming independence from irrelevant alternatives, the latter two ratios are completely determined by the initial behavior, \frac{ q'_{n1} }{ q'_{n2} } = \frac{ q'_{n1} }{ q'_{n3} } = r.
The only choice probabilities that satisfy all three of these ratio constraints while also still summing to one are q'_{n1} = \frac{r}{r + 2} \quad\quad q'_{n2} = \frac{1}{r + 2} \quad\quad q'_{n3} = \frac{1}{r + 2}. Under independence from irrelevant alternatives, the introduction of an equivalent bus cannibalizes demand not just from the existing bus option but also from the existing car option!
In some applications this might be the desired behavior. For instance this would be reasonable if an additional bus reduced wait times or opened up more seats.
On the other hand, if the two buses offered the exact same commuter experience then the new bus would not offer car commuters anything new to entice them towards public transportation. In this case we would expect the introduction of an equivalent bus option to cannibalize from only the existing bus option and not the existing car option, a behavior that independence from irrelevant alternatives cannot accommodate.
Independence from irrelevant alternatives can also cause problems when we remove one alternative from an initial choice set. Consider, for example, an initial choice set consisting of the driving, blue bus, and red bus commuting alternatives with the choice probabilities q_{n1} = \frac{r}{r + 1} \quad\quad q_{n2} = \frac{1/2}{r + 1} \quad\quad q_{n3} = \frac{1/2}{r + 1} and the ratios \frac{ q_{n1} }{ q_{n2} } = 2 \, r \quad\quad\quad \frac{ q_{n1} }{ q_{n3} } = 2 \, r \quad\quad\quad \frac{ q_{n2} }{ q_{n3} } = 1.
Under independence from irrelevant alternatives, removing the red bus option results in the choice probabilities q'_{n1} = \frac{2 \, r}{2 \, r + 1}, q'_{n2} = \frac{1}{2 \, r + 1}. In order to maintain the ratio \frac{ q'_{n1} }{ q'_{n2} } = \frac{ q_{n1} }{ q_{n2} } = 2 \, r the probability for the red bus has to be reallocated to both of the remaining commuting alternatives, with some red bus riders choosing to start driving instead of taking the blue bus.
This might make sense if the removal of the red bus degrades the experience of taking the blue bus. If, on the other hand, the experience of taking the blue bus is unchanged then we would expect that red bus riders would all move to the equivalent blue bus alternative. In this case we would expect to see q'_{n2} = q_{n2} + q_{n3} = \frac{1}{r + 1} with q'_{n1} = 1 - q'_{n2} = \frac{r}{r + 1}. This implies the new choice probability ratio \frac{ q'_{n1} }{ q'_{n2} } = r \ne \frac{ q_{n1} }{ q_{n2} }, which violates independence from irrelevant alternatives!
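The tension is easy to see with concrete numbers; here is a small R sketch assuming r = 2, so that two-thirds of commuters initially drive.

# Blue bus/red bus substitution patterns, assuming a hypothetical
# initial car-to-bus choice probability ratio of r = 2
r <- 2

# Initial choice probabilities for car and blue bus
q_car  <- r / (1 + r)  # 2/3 of commuters drive
q_blue <- 1 / (1 + r)  # 1/3 take the blue bus

# Adding a red bus under independence from irrelevant alternatives
# cannibalizes demand from the car as well as the blue bus
q_iia <- c(car=r, blue=1, red=1) / (r + 2)
print(q_iia)      # 0.50, 0.25, 0.25

# If the buses were perfect substitutes we would instead expect the
# red bus to draw riders only from the blue bus
q_perfect <- c(car=q_car, blue=q_blue / 2, red=q_blue / 2)
print(q_perfect)  # 0.67, 0.17, 0.17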
Ultimately, models that exhibit independence from irrelevant alternatives perform poorly when certain pairs of alternatives are better substitutes for each other than other alternatives. One way to avoid these awkward behaviors in practice is to exclude any alternatives that are not sufficiently distinct from the other alternatives. Another is to consider models that don’t necessarily exhibit independence from irrelevant alternatives.
2.2 Sensitivities and Elasticities
Another way to quantify substitution patterns is to examine infinitesimal changes to the choice probabilities that arise from infinitesimal changes to the probabilistic utility model.
To formalize this we’ll need to consider a discrete choice model defined by a parametric utility model configured with a real-valued variable z. This variable might, for example, represent some unknown model configuration or some external covariate that is known exactly.
In this case the probabilistic utility model is defined by an entire family of joint probability density functions, p( u_{n1}, \ldots, u_{nJ} \mid z ), or, equivalently, a family of joint cumulative distribution functions, \Pi( u_{n1}, \ldots, u_{nJ} \mid z ). These families then give parametric choice probabilities, q_{n1}(z), \ldots q_{nJ}(z).
If this dependence is smooth then we can quantify how each choice probability varies under infinitesimal changes to z with the derivatives \frac{ \partial q_{nj} }{ \partial z }(z). Because they quantify how much each choice probability is influenced by changes in z, these derivatives are also known as sensitivities.
One immediate complication with interpreting sensitivities, however, is that they are always coupled together. Indeed if we differentiate the summation constraint then we see that the sensitivities across all alternatives in a choice set must always sum to zero, \begin{align*} \sum_{j = 1}^{J} q_{nj} &= 1 \\ \frac{ \partial }{ \partial z } \Bigg( \sum_{j = 1}^{J} q_{nj} \Bigg) &= \frac{ \partial }{ \partial z } \Bigg( 1 \Bigg) \\ \sum_{j = 1}^{J} \frac{ \partial q_{nj} }{ \partial z }(z) &= 0. \end{align*}
As we saw in the previous section, ratios of two choice probabilities are often easier to interpret. The influence of z on the ratio of any two choice probabilities is given by \begin{align*} \frac{ \partial }{ \partial z } \left( \frac{ q_{nj} }{ q_{nj'} } \right) &= \frac{ 1 }{ q_{nj'} } \, \frac{ \partial q_{nj} }{ \partial z }(z) - \frac{ q_{nj} }{ q_{nj'}^{2} } \, \frac{ \partial q_{nj'} }{ \partial z }(z) \\ &= \frac{z}{z} \, \frac{ q_{nj} }{ q_{nj} } \frac{ 1 }{ q_{nj'} } \, \frac{ \partial q_{nj} }{ \partial z }(z) - \frac{z}{z} \, \frac{ q_{nj} }{ q_{nj'}^{2} } \, \frac{ \partial q_{nj'} }{ \partial z }(z) \\ &= \frac{1}{z} \frac{ q_{nj} }{ q_{nj'} } \left[ \frac{z}{ q_{nj} } \, \frac{ \partial q_{nj} }{ \partial z }(z) - \frac{ z }{ q_{nj'} } \, \frac{ \partial q_{nj'} }{ \partial z }(z) \right]. \end{align*}
The normalized sensitivities that arise in this calculation are known as elasticities, \epsilon_{nj}(z) = \frac{ z }{ q_{nj} } \frac{ \partial q_{nj} }{ \partial z }(z). Besides being independent of the choice of units for z, elasticities immediately identify invariances of both choice probabilities and ratios of choice probabilities.
For example, if z \ne 0 then a vanishing elasticity implies a vanishing sensitivity, \begin{align*} 0 &= \epsilon_{nj}(z) \\ 0 &= \frac{ z }{ q_{nj} } \frac{ \partial q_{nj} }{ \partial z }(z) \\ &= \frac{ \partial q_{nj} }{ \partial z }(z). \end{align*} This, in turn, implies that the choice probability q_{nj} is unaffected by changes in z.
On the other hand, if z \ne 0 and two elasticities are the same then the corresponding ratio of choice probabilities is independent of z, \begin{align*} \epsilon_{nj}(z) &= \epsilon_{nj'}(z) \\ 0 &= \epsilon_{nj}(z) - \epsilon_{nj'}(z) \\ 0 &= \frac{1}{z} \frac{ q_{nj} }{ q_{nj'} } \Bigg[ \epsilon_{nj}(z) - \epsilon_{nj'}(z) \Bigg] \\ 0 &= \frac{1}{z} \frac{ q_{nj} }{ q_{nj'} } \left[ \frac{z}{ q_{nj} } \, \frac{ \partial q_{nj} }{ \partial z }(z) - \frac{ z }{ q_{nj'} } \, \frac{ \partial q_{nj'} }{ \partial z }(z) \right] \\ 0 &= \frac{ \partial }{ \partial z } \left( \frac{ q_{nj} }{ q_{nj'} } \right). \end{align*}
In particular we can use elasticities to identify discrete choice models that exhibit independence from irrelevant alternatives. A discrete choice model will exhibit this property if and only if the elasticities for all of the alternatives in the choice set are the same, \epsilon_{nj}(z) = \epsilon_{nj'}(z) for all possible values of z.
2.3 Perfect Replacements
In some applications the most relevant feature of a discrete choice model is how the behavior of alternatives can change without affecting specific choice probabilities. In other words, we might be interested in what hypothetical alternatives define perfect replacements for a given alternative.
Consider, for example, a probabilistic utility model with not one but two configuration variables p( u_{n1}, \ldots, u_{nJ} \mid z_{1}, z_{2} ), or, equivalently, \Pi( u_{n1}, \ldots, u_{nJ} \mid z_{1}, z_{2} ).
The infinitesimal change in z_{1} needed to keep a choice probability q_{nj} constant under an infinitesimal change in z_{2} is determined by the total derivative \frac{ \mathrm{d} z_{1} }{ \mathrm{d} z_{2} } along the surface implicitly defined by the constraint q_{nj}( z_{1}, z_{2} ) = \text{constant}. We can compute this constrained derivative by differentiating the constraint with respect to a parameter t that coordinates motion along the constraint surface, \begin{align*} \frac{ \mathrm{d} }{ \mathrm{d} t } \big( q_{nj}( z_{1}(t), z_{2}(t)) \big) &= \frac{ \mathrm{d} }{ \mathrm{d} t } \left( \mathrm{const} \right) \\ \frac{ \partial q_{nj} }{ \partial z_{1} } \, \frac{ \mathrm{d} z_{1} }{ \mathrm{d} t } + \frac{ \partial q_{nj} }{ \partial z_{2} } \, \frac{ \mathrm{d} z_{2} }{ \mathrm{d} t } &= 0 \\ \frac{ \partial q_{nj} }{ \partial z_{1} } \, \frac{ \mathrm{d} z_{1} }{ \mathrm{d} t } &= - \frac{ \partial q_{nj} }{ \partial z_{2} } \, \frac{ \mathrm{d} z_{2} }{ \mathrm{d} t } \\ \frac{ \mathrm{d} z_{1} }{ \mathrm{d} t } \left( \frac{ \mathrm{d} z_{2} }{ \mathrm{d} t } \right)^{-1} &= - \frac{ \partial q_{nj} }{ \partial z_{2} } \, \left( \frac{ \partial q_{nj} }{ \partial z_{1} } \right)^{-1} \\ \frac{ \mathrm{d} z_{1} }{ \mathrm{d} z_{2} } &= - \frac{ \partial q_{nj} }{ \partial z_{2} } \, \left( \frac{ \partial q_{nj} }{ \partial z_{1} } \right)^{-1}. \end{align*}
For example, if both partial derivatives are constant, \begin{align*} \frac{ \partial q_{nj} }{ \partial z_{1} } &= \gamma_{1} \\ \frac{ \partial q_{nj} }{ \partial z_{2} } &= \gamma_{2}, \end{align*} then so too is the total derivative, \frac{ \mathrm{d} z_{1} }{ \mathrm{d} z_{2} } = - \frac{ \gamma_{2} }{ \gamma_{1} }. In this case the utility model configurations ( z_{1}, z_{2} ) and \left( z_{1} + \delta , z_{2} - \frac{ \gamma_{2} }{ \gamma_{1} } \, \delta \right) result in exactly the same choice probability for the jth alternative, q_{nj}.
Intuitively we can interpret these two model configurations as defining two distinct jth alternatives that are equivalent from the perspective of the agents.
2.4 Expected Substitution Patterns
When a probabilistic utility model is configured by unknown model configuration variables, p ( u_{n1}, \ldots, u_{nJ} \mid \theta ) or equivalently \Pi ( u_{n1}, \ldots, u_{nJ} \mid \theta ), then the resulting choice probabilities, and substitution patterns, will also be unknown. In practice we need to infer the consistent behaviors of \theta from observed data and then propagate the inferential uncertainty to the derived choice behaviors.
Bayesian inference uses the observations \tilde{y}_{1}, \ldots, \tilde{y}_{M} to construct a posterior distribution for \theta, \begin{align*} p ( \theta \mid \tilde{y}_{1}, \ldots, \tilde{y}_{M} ) &\propto p( \tilde{y}_{1}, \ldots, \tilde{y}_{M} \mid \theta ) \, p( \theta ) \\ &\propto \left[ \prod_{m = 1}^{M} \mathrm{categorical} ( \tilde{y}_{m} \mid q_{n_{m}1}( \theta ), \ldots, q_{n_{m}J}( \theta ) ) \right] p ( \theta ). \end{align*} This posterior distribution for \theta can then be pushed forward to an induced posterior distribution for the choice probabilities, p ( q_{n1}, \ldots, q_{nJ} \mid \tilde{y}_{1}, \ldots, \tilde{y}_{M} ), quantifying our uncertainty about agent behaviors.
We can use this induced posterior distribution to study the range of substitution patterns consistent with the observed data. For example in practice we might use posterior samples of the model configuration variables, \tilde{\theta}_{s} \sim p ( \theta \mid \tilde{y}_{1}, \ldots, \tilde{y}_{M} ), to derive posterior choice samples, \tilde{q}_{n1, s}, \ldots, \tilde{q}_{nJ, s} = q_{n1}( \tilde{\theta}_{s} ), \ldots, q_{nJ} ( \tilde{\theta}_{s} ), and then investigate the varying substitution patterns across the samples.
That said, we can also average over the consistent model configurations to construct point estimates of the choice probabilities. More formally we might consider the posterior expectation value of each choice probability, \mathbb{E} [ q_{nj} ] = \int \mathrm{d} q_{nj} \, p ( q_{n1}, \ldots, q_{nJ} \mid \tilde{y}_{1}, \ldots, \tilde{y}_{M} ) \, q_{nj}. Interestingly the expected choice probabilities define a valid simplex, \begin{align*} \sum_{j = 1}^{J} \mathbb{E} \Bigg[ q_{nj} \Bigg] &= \mathbb{E} \Bigg[ \sum_{j = 1}^{J} q_{nj} \Bigg] \\ &= \mathbb{E} [ 1 ] \\ &= 1, \end{align*} which allows them to define expected substitution patterns.
Expected choice probabilities, however, will in general exhibit different substitution patterns than the substitution patterns from any individual choice probability configuration. For example, even if every value of \theta results in choice probabilities that exhibit independence from irrelevant alternatives, the posterior expected choice probabilities typically will not. That said, the deviations between these behaviors will not always be substantial.
Ultimately the most productive use of posterior inferences is to first derive whatever behavior might be of interest, such as ratios of choice probabilities, and only then propagate posterior uncertainties. Using posterior expectations of individual model configuration variables to derive behaviors of interest not only loses information but also is not guaranteed to preserve all of the nuances in the resulting behaviors.
2.5 Inferring Substitution Patterns
In general different configurations of a parametric utility model will exhibit different substitution patterns. This flexibility can be useful when we don’t know what substitution patterns are appropriate in a given application. To really leverage that flexibility, however, we need to be able to infer the appropriate substitution patterns from observed data.
Unfortunately inferring substitution patterns is not always straightforward. In particular, we cannot infer substitution patterns from observations drawn from a single choice set. The only way to discriminate between different substitution patterns is to have observations drawn from multiple choice subsets.
The more choice subsets we observe, the more precisely we will be able to infer the substitution patterns. Exactly how many choice subsets we need to observe for inferences to be well-behaved depends on the flexibility of the specific discrete choice model we are using.
Bayesian analysis of discrete choice data is particularly useful here because it quantifies exactly how much, or how little, the observed choice subsets inform the substitution patterns. Once we quantify these uncertainties, no matter how large or complex they might be, we can propagate them into inferences and predictions for decision behavior in new choice subsets.
3 Independent Utility Models
The construction of a discrete choice model is greatly simplified if we assume that the utilities for each alternative are independent of each other, p( u_{n1}, \ldots, u_{nJ} ) = \prod_{j = 1}^{J} p_{j}( u_{nj} ). Beyond just independence, we might also assume that the individual utility models are also the same, p( u_{n1}, \ldots, u_{nJ} ) = \prod_{j = 1}^{J} p( u_{nj} ). In this case the joint model is completely determined by a single, univariate utility model with potentially varying configurations.
Under these assumptions the joint cumulative distribution function also simplifies into a product of univariate cumulative distribution functions, \Pi ( u_{n1}, \ldots, u_{nJ} ) = \prod_{j = 1}^{J} \Pi( u_{nj} ). The better behaved \Pi( x ) is under products, the easier \Pi( u_{n1}, \ldots, u_{nJ} ) will be to differentiate, and then integrate, as we derive choice probabilities.
One way to motivate a useful component utility model is to consider extremes. Let’s say that the utilities that ultimately drive agent choice are largely determined by the best case behavior over a wide range of possibilities. In this case the model for each p( u_{nj} ) would be derived from the maximum utility considered.
The behavior of the maximum of an ensemble of probabilistic behaviors is the subject of extreme value theory in statistics (de Carvalho et al. 2026). One of the key insights from extreme value theory is the Fisher-Tippett-Gnedenko extreme value theorem, which states that over a sufficiently large ensemble only a few coherent extreme behaviors can arise. All of these asymptotic behaviors are quantified with a corresponding family of probability density functions.
Unfortunately, most of the models motivated by extreme value theory are sufficiently complex that we cannot derive choice probabilities in closed form; for some details see Appendix A.2. Fortunately, there is one exceptional subset of extreme value theory models that do allow for convenient analytic results: the Gumbel model.
3.1 The Independent Gumbel Model
One of the emergent behaviors given by the extreme value theorem is defined by the Gumbel family of probability density functions over the real line, \mathrm{Gumbel}( x \mid \mu, \sigma ) = \frac{1}{\sigma} \exp \left( - \frac{ x - \mu }{ \sigma } - e^{ - \frac{ x - \mu }{ \sigma } } \right). Each probability density function in the Gumbel family is unimodal, with the location parameter \mu roughly characterizing the location of the peak and the scale parameter \sigma roughly characterizing the width of the peak (Figure 1).
One nice feature of the Gumbel family is that the cumulative distribution functions are particularly well-behaved, \Pi_{\mathrm{Gumbel}}( x \mid \mu, \sigma ) = \exp \left( - e^{ - \frac{ x - \mu }{ \sigma } } \right). For a derivation of these cumulative distribution functions see Appendix A.1.
If we model the utilities that an agent assigns to each alternative with independent Gumbel models configured with separate location parameters and a common scale parameter, \theta = ( \mu_{n1}, \ldots, \mu_{nJ}, \sigma_{n}), then the joint utility model becomes p( u_{n1}, \ldots, u_{nJ} \mid \theta ) = \prod_{j = 1}^{J} \mathrm{Gumbel}( u_{nj} \mid \mu_{nj}, \sigma_{n} ). I will refer to each \mu_{nj} as the baseline utility for each alternative.
In this case the joint cumulative distribution function simplifies into an exponential of the sum of contributions from each alternative, \begin{align*} \Pi( u_{n1}, \ldots, u_{nJ} \mid \theta) &= \prod_{j = 1}^{J} \Pi_{\mathrm{Gumbel}}( u_{nj} \mid \mu_{nj}, \sigma_{n} ), \\ &= \prod_{j = 1}^{J} \exp \left( - e^{ - \frac{ u_{nj} - \mu_{nj} }{ \sigma_{n} } } \right) \\ &= \exp \left( - \sum_{j = 1}^{J} e^{ - \frac{ u_{nj} - \mu_{nj} }{ \sigma_{n} } } \right). \end{align*}
Fortunately, all of the derivatives and integrals needed to derive the choice probabilities from this model can be performed analytically. I will sequester the actual calculations to Appendix A.1, referring here only to the results: \begin{align*} q_{nj} ( \theta ) &= \int_{-\infty}^{+\infty} \mathrm{d} u \, \frac{ \partial \Pi }{ \partial x_{j} } ( u, \ldots, u \mid \theta ) \\ &= \frac{ \exp \left( \frac{ \mu_{nj} }{ \sigma_{n} } \right) }{ \sum_{j' = 1}^{J} \exp \left( \frac{ \mu_{nj'} }{ \sigma_{n} } \right) }. \end{align*}
The larger \sigma_{n} is, the more each \mu_{nj} / \sigma_{n} will shrink towards zero. This, in turn, pushes the choice probabilities towards more uniform configurations. On the other hand, smaller values of \sigma_{n} amplify the baseline utilities, pushing the choice probabilities towards more heterogeneous configurations.
Conveniently these choice probabilities can be readily calculated with the softmax function, \mathrm{softmax}(x_{1}, \ldots, x_{J}) = \left( \frac{ e^{x_{1}} }{ \sum_{j' = 1}^{J} e^{x_{j'}} }, \ldots, \frac{ e^{x_{J}} }{ \sum_{j' = 1}^{J} e^{x_{j'}} } \right). All we have to do is apply the softmax function to the scaled baseline utilities, (q_{n1}, \ldots, q_{nJ} ) = \mathrm{softmax} \left( \frac{ \mu_{n1} }{ \sigma_{n} }, \ldots, \frac{ \mu_{nJ} }{ \sigma_{n} } \right)! If we introduce an explicit variable for the direct inputs to the softmax function, \nu_{nj} = \frac{ \mu_{nj} }{ \sigma_{n} }, then we can write the choice probabilities in the even more compact form, (q_{n1}, \ldots, q_{nJ} ) = \mathrm{softmax} \left( \nu_{n1}, \ldots, \nu_{nJ} \right).
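As a sanity check, here is a sketch of this computation in R: a numerically stabilized softmax function compared against a brute-force Monte Carlo simulation of independent Gumbel utilities, with hypothetical baseline utilities and scale.

# Numerically stabilized softmax; subtracting the maximum input exploits
# the translation invariance discussed in Section 3.2
softmax <- function(x) {
  ex <- exp(x - max(x))
  ex / sum(ex)
}

# Hypothetical baseline utilities and common scale for one agent
mus <- c(0.0, 1.0, -0.5)
sigma <- 1.5

# Analytic choice probabilities
q <- softmax(mus / sigma)

# Monte Carlo check: sample Gumbel utilities via the inverse CDF,
#   u = mu - sigma * log(-log(U)) with U uniform on (0, 1),
# and record how often each alternative attains the maximum
set.seed(2958476)
S <- 100000
choices <- sapply(1:S, function(s)
  which.max(mus - sigma * log(-log(runif(3)))))

print(rbind('analytic'=q, 'monte carlo'=as.numeric(table(choices)) / S))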
The softmax function is also known as the inverse multilogit function, with the multilogit of the choice probabilities being given by the scaled baseline utilities. Moreover, the name of the multilogit function is often shortened to just the logit function, which should not be confused with the less general, one-dimensional logit function \mathrm{logit}(p) = \log \frac{p}{1 - p}.
Because of this sequence of terminological jumps, the independent Gumbel discrete choice model is often referred to as just the logit model. Personally I find this confusing, not only because of the sloppiness with the multilogit and logit functions but also because it focuses on a derived behavior and not the initial assumptions that we actually make. In this chapter I will stick with “independent Gumbel model”, but beware when comparing to the more conventional literature in some fields.
3.2 Redundancies of the Independent Gumbel Model
With the softmax function, the independent Gumbel discrete choice model is particularly straightforward to implement in practice. At the same time the structure of the softmax function also provides insight into the inherent redundancies of this model.
3.2.1 Translation Redundancy
As I discuss in my general Taylor modeling chapter and die fairness case study, the softmax function is not injective: many distinct inputs map to the same output probabilities. Specifically, any translation of the inputs leaves the output unchanged.
From a discrete choice perspective, any translation of the scaled baseline utilities does not affect the derived choice probabilities, \begin{align*} (q_{n1}, \ldots, q_{nJ} ) &= \mathrm{softmax} \left( \nu_{n1}, \ldots, \nu_{nJ} \right) \\ &= \mathrm{softmax} \left( \nu_{n1} + \alpha, \ldots, \nu_{nJ} + \alpha \right). \end{align*} An immediate consequence of this redundancy is that, no matter how precisely we infer the choice probabilities, we will only ever be able to learn the baseline utilities up to translations. This results in problematic likelihood functions that concentrate not around a point in the model configuration space but rather around entire planes.
In order to avoid these inferential degeneracies we need to eliminate this redundancy from the model. This requires some way of obstructing translations of the baseline utilities.
One particularly convenient way to accomplish this is to parameterize the model in terms of relative baseline utilities, \delta_{nj} = \mu_{nj} - \mu_{nj'} for any distinguished alternative j'. If \delta_{nj} > 0 then the baseline utility of the jth alternative is larger than the baseline utility of the distinguished j'th alternative, and vice versa. Critically, the relative baseline utility of the distinguished alternative is fixed to zero, \delta_{nj'} = \mu_{nj'} - \mu_{nj'} = 0, eliminating a degree of freedom from the model.
Because of the translation invariance of the softmax function, we can always write the choice probabilities directly in terms of these relative baseline utilities: \begin{align*} (q_{n1}, \ldots, q_{nJ} ) &= \mathrm{softmax} \left( \frac{ \mu_{n1} }{ \sigma_{n} }, \ldots, \frac{ \mu_{nJ} }{ \sigma_{n} } \right) \\ &= \mathrm{softmax} \left( \frac{ \mu_{n1} }{ \sigma_{n} } - \frac{ \mu_{nj'} }{ \sigma_{n} }, \ldots, \frac{ \mu_{nJ} }{ \sigma_{n} } - \frac{ \mu_{nj'} }{ \sigma_{n} } \right) \\ &= \mathrm{softmax} \left( \frac{ \mu_{n1} - \mu_{nj'} }{ \sigma_{n} }, \ldots, \frac{ \mu_{nJ} - \mu_{nj'} }{ \sigma_{n} } \right) \\ &= \mathrm{softmax} \left( \frac{ \delta_{n1} }{ \sigma_{n} }, \ldots, \frac{ \delta_{nJ} }{ \sigma_{n} } \right). \end{align*} We cannot translate the relative baseline utilities, however, without moving the distinguished input \delta_{nj'} away from zero.
More formally, fixing one of the inputs to zero makes the softmax function bijective, with each configuration of the scaled, relative baseline utilities corresponding to a unique configuration of the output choice probabilities.
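We can verify this bijectivity directly: once the distinguished input is fixed to zero, the remaining inputs can be recovered from the choice probabilities through logarithms of probability ratios. A sketch in R, reusing the softmax function defined above with hypothetical values.

# Recovering scaled, relative baseline utilities from choice probabilities
# when the distinguished input is fixed to zero
anchor <- 2
delta_scaled <- c(0.8, 0.0, -1.2)  # Hypothetical inputs; anchor fixed to zero

q <- softmax(delta_scaled)

# Inverse map: log ratios against the anchored choice probability
print(log(q / q[anchor]))          # Recovers 0.8, 0.0, -1.2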
3.2.2 Scale Redundancy
Although we have resolved the inherent redundancy of the softmax function, we are not quite out of the woods yet. The inputs to the softmax function are themselves redundant.
The inputs to the softmax function are all ratios of baseline utilities \mu_{nj} and the agent scale \sigma_{n}, \nu_{nj} = \frac{ \mu_{nj} }{ \sigma_{n} }. Scaling the baseline utilities and the agent scale in the same way doesn’t change these ratios, \begin{align*} \nu'_{nj} &= \frac{ \mu'_{nj} }{ \sigma'_{n} } \\ &= \frac{ \alpha \, \mu_{nj} }{ \alpha \, \sigma_{n} } \\ &= \frac{ \mu_{nj} }{ \sigma_{n} } \\ &= \nu_{nj}. \end{align*}
With the inputs to the softmax function unchanged, the derived choice probabilities are the same, \begin{align*} (q'_{n1}, \ldots, q'_{nJ} ) &= \mathrm{softmax} \left( \frac{ \mu'_{n1} }{ \sigma'_{n} }, \ldots, \frac{ \mu'_{nJ} }{ \sigma'_{n} } \right) \\ &= \mathrm{softmax} \left( \frac{ \alpha \, \mu_{n1} }{ \alpha \, \sigma_{n} }, \ldots, \frac{ \alpha \, \mu_{nJ} }{ \alpha \, \sigma_{n} } \right) \\ &= \mathrm{softmax} \left( \frac{ \mu_{n1} }{ \sigma_{n} }, \ldots, \frac{ \mu_{nJ} }{ \sigma_{n} } \right) \\ &= (q_{n1}, \ldots, q_{nJ} ). \end{align*} Consequently inferences of the choice probabilities cannot discriminate between \mu_{nj} and \sigma_{n}.
If we’re considering only a single agent, then we can eliminate this scale redundancy by directly parameterizing the model in terms of scaled baseline utilities \eta_{nj} = \frac{ \mu_{nj} }{ \sigma_{n} }. When we want to analyze choice observations across multiple agents, however, we have to account for the fact that \sigma_{n} can vary from one agent to another.
In this case we have to distinguish a particular agent scale, \sigma_{n'}, and use it to define scaled baseline utilities \zeta_{nj} = \frac{ \mu_{nj} }{ \sigma_{n'} }, and relative scales, \tau_{n} = \frac{ \sigma_{n} }{ \sigma_{n'} }. By construction the relative scale of the distinguished agent is fixed to one, \tau_{n'} = 1, eliminating a degree of freedom from the model.
We can now write the choice probabilities for any particular agent as \begin{align*} (q_{n1}, \ldots, q_{nJ} ) &= \mathrm{softmax} \left( \frac{ \mu_{n1} }{ \sigma_{n} }, \ldots, \frac{ \mu_{nJ} }{ \sigma_{n} } \right) \\ &= \mathrm{softmax} \left( \frac{ \mu_{n1} / \sigma_{n'} }{ \sigma_{n} / \sigma_{n'} }, \ldots, \frac{ \mu_{nJ} / \sigma_{n'} }{ \sigma_{n} / \sigma_{n'} } \right) \\ &= \mathrm{softmax} \left( \frac{ \zeta_{n1} }{ \tau_{n} }, \ldots, \frac{ \zeta_{nJ} }{ \tau_{n} } \right). \end{align*}
3.2.3 Resolving Both Redundancies Without Abusing Notation
To completely eliminate redundancies from the independent Gumbel discrete choice model we need to integrate both of these strategies at the same time.
Given a distinguished alternative j' and agent n' we can parameterize the model in terms of scaled, relative baseline utilities \omega_{nj} = \frac{ \delta_{nj} }{ \sigma_{n'} } and relative scales \tau_{n} = \frac{ \sigma_{n} }{ \sigma_{n'} }. The choice probabilities are then given by (q_{n1}, \ldots, q_{nJ} ) = \mathrm{softmax} \left( \frac{ \omega_{n1} }{ \tau_{n} }, \ldots, \frac{ \omega_{nJ} }{ \tau_{n} } \right).
Which alternative and agent we distinguish does not affect the behavior or implementation of the choice probabilities. It does, however, change the interpretation and inferential behavior of the model configuration variables we use.
The change in interpretation is especially important in Bayesian analyses, as it changes the form of the prior model. For example, an independent prior model over the baseline utilities does not correspond to an independent prior model over the scaled, relative baseline utilities and vice versa. In practice eliciting a prior model for the relative model configuration variables is more straightforward because we can contextualize our domain expertise through explicit comparisons to the reference alternative and agent.
Many presentations of discrete choice models, especially those that limit inferences to point estimates, overload variable names and use \mu_{nj} and \sigma_{n} to refer to both the absolute and relative model configuration variables at the same time. Redundancies are managed by fixing one \mu_{nj} to 0 and one \sigma_{n} to 1, while largely ignoring any change in interpretation.
To avoid any ambiguity, and clarify the robust implementation of discrete choice models, I will be as pedantic as possible with my notation in this chapter. In particular, I will use as many Greek letters as are needed to ensure that different transformations of the absolute model configurations are always denoted with unique symbols. For the independent Gumbel model I will largely use the absolute variables \mu_{nj} and \sigma_{n} when discussing theoretical properties and the relative variables \omega_{nj} and \tau_{n} when discussing implementations.
3.3 Substitution Patterns of the Independent Gumbel Model
The analytic form of the choice probabilities in the independent Gumbel model makes the substitution patterns particularly straightforward to investigate.
Because the normalization that couples all of the choice probabilities together cancels, ratios of choice probabilities simplify substantially, \begin{align*} \frac{ q_{nj_{1}} }{ q_{nj_{2}} } &= \frac{ \exp \left( \frac{ \mu_{nj_{1}} }{ \sigma_{n} } \right) }{ \sum_{j' = 1}^{J} \exp \left( \frac{ \mu_{nj'} }{ \sigma_{n} } \right) } \frac{ \sum_{j' = 1}^{J} \exp \left( \frac{ \mu_{nj'} }{ \sigma_{n} } \right) }{ \exp \left( \frac{ \mu_{nj_{2}} }{ \sigma_{n} } \right) } \\ &= \frac{ \exp \left( \frac{ \mu_{nj_{1}} }{ \sigma_{n} } \right) }{ \exp \left( \frac{ \mu_{nj_{2}} }{ \sigma_{n} } \right) } \\ &= \exp \left( \frac{ \mu_{nj_{1}} }{ \sigma_{n} } - \frac{ \mu_{nj_{2}} }{ \sigma_{n} } \right). \end{align*}
Regardless of the two alternatives we consider, this ratio will only ever depend on the two corresponding baseline utilities. In other words the ratio is independent of changes to any of the other baseline utilities. Decreasing \mu_{nj_{3}} may increase both q_{nj_{1}} and q_{nj_{2}}, but in exactly the same proportion so that the ratio between them remains the same.
In other words the independent Gumbel model exhibits independence from irrelevant alternatives, and all of the consequences of that property that we discussed in Section 2.1. This makes the independent Gumbel model relatively exceptional, as most discrete choice models do not exhibit independence from irrelevant alternatives.
We can also demonstrate independence from irrelevant alternatives using elasticities. The sensitivity of the j_{1}th choice probability with respect to the j_{3}th baseline utility, for j_{1} \ne j_{3}, is given by \begin{align*} \frac{ \partial q_{nj_{1}} }{ \partial \mu_{nj_{3}} } &= \frac{ \partial }{ \partial \mu_{nj_{3}} } \left[ \frac{ \exp \left( \frac{ \mu_{nj_{1}} }{ \sigma_{n} } \right) } { \sum_{j' = 1}^{J} \exp \left( \frac{ \mu_{nj'} }{ \sigma_{n} } \right) } \right] \\ &= \exp \left( \frac{ \mu_{nj_{1}}}{ \sigma_{n} } \right) \frac{ - \frac{1}{\sigma_{n} } \, \exp \left( \frac{ \mu_{nj_{3}} }{ \sigma_{n} } \right) }{ \left( \sum_{j' = 1}^{J} \exp \left( \frac{ \mu_{nj'} }{ \sigma_{n} } \right) \right)^{2} } \\ &= -\frac{1}{\sigma_{n} } \, \frac{ \exp \left( \frac{ \mu_{nj_{1}} }{ \sigma_{n} } \right) }{ \sum_{j' = 1}^{J} \exp \left( \frac{ \mu_{nj'} }{ \sigma_{n} } \right) } \frac{ \exp \left( \frac{ \mu_{nj_{3}} }{ \sigma_{n} } \right) }{ \sum_{j' = 1}^{J} \exp \left( \frac{ \mu_{nj'} }{ \sigma_{n} } \right) } \\ &= -\frac{1}{\sigma_{n} } \, q_{nj_{1}} \, q_{nj_{3}}. \end{align*} The corresponding elasticity is then \begin{align*} \frac{ \mu_{nj_{3}} }{ q_{nj_{1}} } \, \frac{ \partial q_{nj_{1}} }{ \partial \mu_{nj_{3}} } &= \frac{ \mu_{nj_{3}} }{ q_{nj_{1}} } \, \left( -\frac{1}{\sigma_{n} } \, q_{nj_{1}} \, q_{nj_{3}} \right) \\ &= -\frac{ \mu_{nj_{3}} }{ \sigma_{n} } \, q_{nj_{3}}. \end{align*} Notice that this elasticity is the same for all alternatives j_{1} \ne j_{3}!
Because the elasticities are all the same, differences in elasticities cancel, and ratios of choice probabilities are invariant, \begin{align*} \frac{ \partial }{ \partial \mu_{nj_{3}} } \left( \frac{ q_{nj_{1}} }{ q_{nj_{2}} } \right) &= \frac{1}{\mu_{nj_{3}}} \frac{ q_{nj_{1}} }{ q_{nj_{2}} } \bigg[ -\frac{ \mu_{nj_{3}} }{ \sigma_{n} } \, q_{nj_{3}} + \frac{ \mu_{nj_{3}} }{ \sigma_{n} } \, q_{nj_{3}} \bigg] \\ &= \frac{1}{\mu_{nj_{3}}} \frac{ q_{nj_{1}} }{ q_{nj_{2}} } \cdot 0 \\ &= 0 \end{align*} for any distinct j_{1}, j_{2}, and j_{3}. This is consistent with the invariance of ratios of choice probabilities that we identified above from direct inspection.
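These analytic sensitivities and elasticities are straightforward to corroborate with finite differences; here is a sketch reusing the softmax function and the hypothetical configuration from before.

# Finite difference check of the independent Gumbel sensitivities,
#   dq_{n j1} / dmu_{n j3} = -(1 / sigma) q_{n j1} q_{n j3},  j1 != j3,
# using a hypothetical configuration
mus <- c(0.0, 1.0, -0.5)
sigma <- 1.5
q <- softmax(mus / sigma)

eps <- 1e-6
mus_plus <- mus
mus_plus[3] <- mus_plus[3] + eps
q_plus <- softmax(mus_plus / sigma)

# Numerical and analytic sensitivities for j1 = 1, 2 with j3 = 3
print((q_plus[1:2] - q[1:2]) / eps)
print(-(1 / sigma) * q[1:2] * q[3])

# The elasticities with respect to mu_{n3} are the same for every
# alternative, consistent with independence from irrelevant alternatives
print((mus[3] / q[1:2]) * (-(1 / sigma) * q[1:2] * q[3]))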
3.4 Demonstration
Having worked through all of the theory, let’s now see how we can implement the independent Gumbel model into a practical Bayesian analysis.
3.4.1 Setup
As always we begin by setting up our local R environment.
util <- new.env()
source('mcmc_analysis_tools_rstan.R', local=util)
source('mcmc_visualization_tools.R', local=util)

3.4.2 Basics
Let’s start with the basic implementation of the independent Gumbel model.
3.4.2.1 Simulate Data
First and foremost, we’ll need some data to fit. Here let’s consider many observations, each of which pair one of six agents with a fixed choice set of nine alternatives.
N <- 6
J <- 9
M <- 1000

simu_ig1.stan
data {
int<lower=1> M; // Number of observations
int<lower=1> N; // Number of agents
int<lower=1> J; // Number of alternatives in choice set
}
generated quantities {
vector[J] mu; // Baseline utilities
array[N] real<lower=0> sigma; // Scales
array[M] int<lower=1, upper=N> agent; // Observed agent
array[M] int<lower=1, upper=J> choice; // Observed choice
// Sample true data generating process behavior
for (j in 1:J) mu[j] = normal_rng(0, 1);
for (n in 1:N) sigma[n] = gamma_rng(31.0, 15.5);
// Simulate observations
for (m in 1:M) {
int n = categorical_rng(uniform_simplex(N));
agent[m] = n;
choice[m] = categorical_logit_rng(log_softmax(mu / sigma[n]));
}
}

simu <- stan(file="stan_programs/simu_ig1.stan",
algorithm="Fixed_param", seed=8438338,
data=list('N' = N, 'J' = J, 'M' = M),
warmup=0, iter=1, chains=1, refresh=0)
simu_fit <- extract(simu)

data <- list('N' = N, 'J' = J, 'M' = M,
'agent' = simu_fit$agent[1,],
'choice' = simu_fit$choice[1,])
mu_true <- simu_fit$mu[1,]
sigma_true <- simu_fit$sigma[1,]

3.4.2.2 Explore Data
There are a few natural ways that we can summarize observations over a fixed choice set. For example, we can aggregate all of the observed choices into a single histogram.
par(mfrow=c(1, 1), mar = c(5, 5, 2, 1))
util$plot_line_hist(data$choice, 0.5, data$J + 0.5, 1,
xlab='Chosen Alternative')

We can also stratify by agents, histogramming just the observed choices made by one agent at a time.
par(mfrow=c(2, 3), mar = c(5, 5, 2, 1))
for (n in 1:data$N)
util$plot_line_hist(data$choice[data$agent == n],
0.5, data$J + 0.5, 1,
xlab='Chosen Alternative',
main=paste('Agent', n))

3.4.2.3 Redundant Implementation
With these simulated data we can now draw inferences from an independent Gumbel discrete choice model. As we discussed in Section 3.2, implementing this model with the absolute baseline utilities and agent scales should lead to problems.
The prior model here is completely arbitrary.
ig1.stan
data {
int<lower=1> M; // Number of observations
int<lower=1> N; // Number of agents
int<lower=1> J; // Number of alternatives in choice set
// Individual observations
array[M] int<lower=1, upper=N> agent;
array[M] int<lower=1, upper=J> choice;
}
parameters {
vector[J] mu; // Baseline utilities
array[N] real<lower=0> sigma; // Scales
}
transformed parameters {
// Alternative log probabilities
array[N] vector[J] log_probs;
for (n in 1:N)
log_probs[n] = log_softmax(mu / sigma[n]);
}
model {
// Prior model
target += normal_lpdf(mu | 0, 100);
target += normal_lpdf(sigma | 0, 100);
// Observational model
for (m in 1:M) {
int n = agent[m];
target += log_probs[n][choice[m]];
}
}
generated quantities {
array[M] int<lower=1, upper=J> choice_pred;
for (m in 1:M) {
int n = agent[m];
choice_pred[m] = categorical_logit_rng(log_probs[n]);
}
}

fit <- stan(file='stan_programs/ig1.stan',
data=data, seed=8438338,
warmup=1000, iter=2024, refresh=0)

Interestingly Hamiltonian Monte Carlo doesn’t seem to have any trouble exploring the posterior distribution over the absolute model configurations.
diagnostics <- util$extract_hmc_diagnostics(fit)
util$check_all_hmc_diagnostics(diagnostics)

All Hamiltonian Monte Carlo diagnostics are consistent with reliable
Markov chain Monte Carlo.
samples <- util$extract_expectand_vals(fit)
base_samples <- util$filter_expectands(samples,
c('mu', 'sigma'),
TRUE)
util$check_all_expectand_diagnostics(base_samples)

All expectands checked appear to be behaving well enough for reliable
Markov chain Monte Carlo estimation.
That said, we don’t actually learn much about the absolute baseline utilities. The arbitrary prior model is doing a lot of work here to keep our inferences contained to a finite neighborhood.
par(mfrow=c(1, 1), mar = c(5, 5, 2, 1))
plot(samples[['mu[1]']], samples[['mu[2]']],
cex=1, pch=16, col=util$c_dark,
xlim=c(-250, 250), xlab='mu[1]',
ylim=c(-250, 250), ylab='mu[2]')
thetas <- seq(0, 2 * pi, 0.01)
lines(232 * cos(thetas), 232 * sin(thetas),
lwd=2, col=util$c_light)

Similarly, our inferences for the agent scales are largely informed by the prior model.
par(mfrow=c(1, 1), mar = c(5, 5, 2, 1))
plot(samples[['sigma[2]']], samples[['sigma[3]']],
cex=1, pch=16, col=util$c_dark,
xlim=c(0, 300), xlab='sigma[2]',
ylim=c(0, 300), ylab='sigma[3]')
thetas <- seq(0, 0.5 * pi, 0.01)
lines(257 * cos(thetas), 257 * sin(thetas),
lwd=2, col=util$c_light)

3.4.2.4 Non-Redundant Implementation
We should be able to do better by implementing the independent Gumbel model with scaled, relative baseline utilities and relative agent scales. This, however, requires distinguishing an anchor alternative and anchor agent.
In theory the choice of anchors should not matter, but in practice the more we observe the anchors the more well-behaved our inferences for the relative model configurations will be. This, in turn, allows our posterior computation to be as fast as possible.
Because the observations are relatively uniform across agents, I will anchor the first agent for simplicity.
table(data$agent)
1 2 3 4 5 6
169 178 155 150 183 165
The observed choices, however, are much more heterogeneous. Here I will anchor the alternative that has been chosen the most often.
table(data$choice)
1 2 3 4 5 6 7 8 9
40 108 64 105 208 109 71 154 141
data$anchor <- 5To avoid redundancy, we directly model only the J - 1 unanchored alternatives. The add_anchor function inserts the missing zero back into the relative baseline utilities.
ig2.stan
functions {
// Insert a zero into x_free at index anchor
vector add_anchor(vector x_free, int anchor) {
int J = size(x_free) + 1;
if (anchor == 1)
return append_row(0, x_free);
else if (anchor == J)
return append_row(x_free, 0);
else
return append_row(x_free[1:(anchor - 1)],
append_row(0, x_free[anchor:(J - 1)]));
}
}
data {
int<lower=1> M; // Number of observations
int<lower=1> N; // Number of agents
int<lower=1> J; // Number of alternatives in choice set
// Individual observations
array[M] int<lower=1, upper=N> agent;
array[M] int<lower=1, upper=J> choice;
// Anchoring configuration
int<lower=1, upper=J> anchor;
}
parameters {
// Relative baseline utilities for non-anchor alternatives
vector[J - 1] omega_free;
// Relative scales for non-anchor agents
array[N - 1] real<lower=0> tau_free;
}
transformed parameters {
// Relative baseline utilities for all alternatives
vector[J] omega = add_anchor(omega_free, anchor);
// Relative scales for all agents
array[N] real<lower=0> tau = append_array({1}, tau_free);
// Alternative log probabilities
array[N] vector[J] log_probs;
for (n in 1:N)
log_probs[n] = log_softmax(omega / tau[n]);
}
model {
// Prior model
target += normal_lpdf(omega_free | 0, 10);
target += normal_lpdf(tau_free | 0, 10);
// Observational model
for (m in 1:M) {
int n = agent[m];
target += log_probs[n][choice[m]];
}
}
generated quantities {
array[M] int<lower=1, upper=J> choice_pred;
for (m in 1:M) {
int n = agent[m];
choice_pred[m] = categorical_logit_rng(log_probs[n]);
}
}Note that the prior model for the relative model configurations is once again arbitrary, but it is not consistent with the prior model for the absolute model configurations that we used previously. In practical applications it is often more straightforward to elicit domain expertise about the relative model configurations directly.
fit <- stan(file='stan_programs/ig2.stan',
data=data, seed=8438338,
warmup=1000, iter=2024, refresh=0)None of the diagnostics suggest inaccurate posterior quantification.
diagnostics <- util$extract_hmc_diagnostics(fit)
util$check_all_hmc_diagnostics(diagnostics) All Hamiltonian Monte Carlo diagnostics are consistent with reliable
Markov chain Monte Carlo.
samples <- util$extract_expectand_vals(fit)
base_samples <- util$filter_expectands(samples,
c('omega_free',
'tau_free'),
TRUE)
util$check_all_expectand_diagnostics(base_samples)All expectands checked appear to be behaving well enough for reliable
Markov chain Monte Carlo estimation.
Free of redundancies, the likelihood function for the relative model configurations strongly informs our posterior inferences.
par(mfrow=c(2, 1), mar = c(5, 5, 2, 1))
plot(samples[['omega[1]']], samples[['omega[2]']],
cex=1, pch=16, col=util$c_dark,
xlim=c(-25, 25), xlab='omega[1]',
ylim=c(-25, 25), ylab='omega[2]')
thetas <- seq(0, 2 * pi, 0.01)
lines(23.2 * cos(thetas), 23.2 * sin(thetas),
lwd=2, col=util$c_light)
plot(samples[['tau[2]']], samples[['tau[3]']],
cex=1, pch=16, col=util$c_dark,
xlim=c(0, 30), xlab='tau[2]',
ylim=c(0, 30), ylab='tau[3]')
thetas <- seq(0, 0.5 * pi, 0.01)
lines(25.7 * cos(thetas), 25.7 * sin(thetas),
lwd=2, col=util$c_light)Because we’re working with simulated data, we know that the independent Gumbel model adequately captures the relevant features of the true data generating process. In practice we would have to rely on posterior retrodictive checks to identify any inadequacies in our modeling assumptions.
Conveniently, the histograms that we used to explore the data also make for productive visual retrodictive checks. Here we don’t see any signs of model inadequacy.
par(mfrow=c(1, 1), mar=c(5, 5, 1, 1))
util$plot_hist_quantiles(samples, 'choice_pred',
0.5, data$J + 0.5, 1,
baseline_values=data$choice,
xlab='Chosen Alternative')par(mfrow=c(2, 3), mar=c(5, 5, 1, 1))
for (n in 1:data$N) {
names <- sapply(which(data$agent == n),
function(n) paste0('choice_pred[', n, ']'))
filtered_samples <- util$filter_expectands(samples, names)
util$plot_hist_quantiles(
filtered_samples, 'choice_pred',
0.5, data$J + 0.5, 1,
baseline_values=data$choice[data$agent == n],
xlab='Chosen Alternative',
main=paste('Agent', n)
)
}With an adequate model we can be confident that our posterior inferences are faithfully quantifying the behavior of the underlying data generating process, including the relative baseline utilities for each alternative and the relative scales for each agent.
par(mfrow=c(1, 1), mar = c(5, 4, 2, 1))
omega_true <- (mu_true - mu_true[data$anchor]) / sigma_true[1]
names <- sapply(1:data$J, function(j) paste0('omega[', j, ']'))
util$plot_disc_pushforward_quantiles(samples, names,
baseline_values=omega_true,
baseline_col=util$c_mid_teal,
xlab="Alternative",
ylab="Relative Baseline Utility")par(mfrow=c(1, 1), mar = c(5, 4, 2, 1))
tau_true <- sigma_true / sigma_true[1]
names <- sapply(1:data$N, function(n) paste0('tau[', n, ']'))
util$plot_disc_pushforward_quantiles(samples, names,
baseline_values=tau_true,
baseline_col=util$c_mid_teal,
xlab="Agent",
ylab="Relative Scale")3.4.2.5 Aggregated Implementation
When agents choose from the same choice set multiple times, we query the same sets of choice probabilities repeatedly. Our previous implementation, however, recalculates those choice probabilities for every single observation, unnecessarily wasting computation.
By aggregating observations for each agent-choice set pairing, we can collapse the repeated categorical models into multinomial models. This allows us to compute the choice probabilities only once, making the implementation as efficient as possible.
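Before computing those counts, here is a toy sanity check (with assumed numbers, not from the analysis) that the two formulations agree: the multinomial log probability differs from the sum of the per-observation categorical log probabilities by only the multinomial coefficient, which is constant with respect to the model configuration and hence irrelevant for posterior inference.
q <- c(0.2, 0.5, 0.3)            # Assumed choice probabilities
obs <- c(1, 2, 2, 3, 2, 1)       # Toy individual choices
counts <- tabulate(obs, nbins=3) # Aggregated counts
sum(log(q[obs]))                 # Sum of categorical log probabilities
dmultinom(counts, prob=q, log=TRUE) -
  lfactorial(sum(counts)) + sum(lfactorial(counts)) # Matches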
choice_counts <- matrix(NA, data$N, data$J)
for (n in 1:data$N)
choice_counts[n,] <- hist(data$choice[data$agent == n],
seq(0.5, data$J + 0.5, 1),
plot=FALSE)$counts
data$choice_counts <- choice_countsig3.stan
functions {
// Insert a zero into x_free at index anchor
vector add_anchor(vector x_free, int anchor) {
int J = size(x_free) + 1;
if (anchor == 1)
return append_row(0, x_free);
else if (anchor == J)
return append_row(x_free, 0);
else
return append_row(x_free[1:(anchor - 1)],
append_row(0, x_free[anchor:(J - 1)]));
}
}
data {
int<lower=1> N; // Number of agents
int<lower=1> J; // Number of alternatives in choice set
// Aggregated observations
array[N, J] int<lower=0> choice_counts;
// Anchoring configuration
int<lower=1, upper=J> anchor;
}
parameters {
// Relative baseline utilities for non-anchor alternatives
vector[J - 1] omega_free;
// Relative scales for non-anchor agents
array[N - 1] real<lower=0> tau_free;
}
transformed parameters {
// Relative baseline utilities for all alternatives
vector[J] omega = add_anchor(omega_free, anchor);
// Relative scales for all agents
array[N] real<lower=0> tau = append_array({1}, tau_free);
// Alternative log probabilities
array[N] vector[J] log_probs;
for (n in 1:N)
log_probs[n] = log_softmax(omega / tau[n]);
}
model {
// Prior model
target += normal_lpdf(omega_free | 0, 10);
target += normal_lpdf(tau_free | 0, 10);
// Observational model
for (n in 1:N)
target += multinomial_logit_lpmf(choice_counts[n] | log_probs[n]);
}
generated quantities {
array[N, J] int<lower=0> choice_counts_pred;
array[J] int<lower=0> agg_choice_counts_pred = rep_array(0, J);
for (n in 1:N) {
choice_counts_pred[n]
= multinomial_logit_rng(log_probs[n],
sum(choice_counts[n]));
for (j in 1:J) {
agg_choice_counts_pred[j] += choice_counts_pred[n, j];
}
}
}fit <- stan(file='stan_programs/ig3.stan',
data=data, seed=8438338,
warmup=1000, iter=2024, refresh=0)The only diagnostic warnings that we see here are a few \hat{\xi} warnings. These indicate heavy-tailed posteriors, and potentially unreliable expectation values, but don’t cast any doubt on the overall posterior quantification.
diagnostics <- util$extract_hmc_diagnostics(fit)
util$check_all_hmc_diagnostics(diagnostics) All Hamiltonian Monte Carlo diagnostics are consistent with reliable
Markov chain Monte Carlo.
samples <- util$extract_expectand_vals(fit)
base_samples <- util$filter_expectands(samples,
c('omega_free',
'tau_free'),
TRUE)
util$check_all_expectand_diagnostics(base_samples)tau_free[5]:
Chain 4: Right tail hat{xi} (0.391) exceeds 0.25.
Large tail hat{xi}s suggest that the expectand might not be
sufficiently integrable.
The posterior retrodictive checks behave identically to those we derived from ig2.stan. Note that, because we are working with aggregate counts and not individual observations, we have to visualize the histograms slightly differently here.
par(mfrow=c(1, 1), mar=c(5, 5, 1, 1))
obs_counts <- colSums(data$choice_counts)
names <- sapply(1:data$J,
function(j)
paste0('agg_choice_counts_pred[', j, ']'))
util$plot_disc_pushforward_quantiles(samples, names,
baseline_values=obs_counts,
xlab="Alternative",
ylab="Choice Counts")par(mfrow=c(2, 3), mar=c(5, 5, 1, 1))
for (n in 1:data$N) {
obs_counts <- data$choice_counts[n,]
names <- sapply(1:data$J,
function(j)
paste0('choice_counts_pred[', n, ',', j, ']'))
util$plot_disc_pushforward_quantiles(samples, names,
baseline_values=obs_counts,
xlab="Alternative",
ylab="Choice Counts",
main=paste('Agent', n))
}Similarly, posterior inferences for the relative model configurations are equivalent to those from the unaggregated model.
par(mfrow=c(2, 1), mar = c(5, 4, 2, 1))
omega_true <- (mu_true - mu_true[data$anchor]) / sigma_true[1]
names <- sapply(1:data$J,
function(j) paste0('omega[', j, ']'))
util$plot_disc_pushforward_quantiles(
samples, names,
baseline_values=omega_true,
baseline_col=util$c_mid_teal,
xlab="Alternative",
ylab="Relative Baseline Utility"
)
tau_true <- sigma_true / sigma_true[1]
names <- sapply(1:data$N,
function(n) paste0('tau[', n, ']'))
util$plot_disc_pushforward_quantiles(
samples, names,
baseline_values=tau_true,
baseline_col=util$c_mid_teal,
xlab="Agent",
ylab="Relative Scale"
)3.4.3 Independent Gumbel Substitution Patterns
Now that we’re comfortable with implementing an independent Gumbel model over a single choice set, we can explore the substitution patterns it manifests across multiple choice subsets.
3.4.3.1 Simulate Data
To infer substitution patterns we need to consider observations across different choice subsets. In practice this requires defining a global choice set of all of the possible alternatives, and then choice subsets from which we’ll simulate observations.
Here let’s take a choice set consisting of 10 alternatives, each of which are contiguously indexed.
J <- 10We can then define choice subsets by the indices of the included alternatives (Figure 2). Our first choice subset includes all but the eighth alternative. The second choice subset removes the fifth alternative, while the third choice subset adds back in both missing alternatives. In particular we can think of the second and third choice subsets as the result of removing and adding, respectively, an alternative from the first choice subset.
subset_alts <-
list(c(1, 2, 3, 4, 5, 6, 7, 9, 10), # Nominal
c(1, 2, 3, 4, 6, 7, 9, 10), # Remove alternative 5
c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)) # Add alternative 8
S <- length(subset_alts)Perhaps the most difficult task when implementing discrete choice models over multiple choice subsets is keeping track of the indices. Within a choice subset we will need to index the included alternatives contiguously. This, however, means that the local indexing will not be the same as the global indexing of the full choice set!
Fortunately, the subset_alts arrays already map from local to global indexing. Even more, it’s not too hard to convert this to an inverse map from global to local indexing.
to_subset_idx <-
lapply(subset_alts,
function(alts) sapply(1:J,
function(j) match(j, alts)))The other awkward implementation issue concerns the encoding of subset_alts. Because each choice subset is a different size, subset_alts has the structure of a ragged array which we cannot implement directly in the Stan modeling language.
Here I’ll concatenate all of the choice subset information into a single monolithic array (Figure 3). In order to retrieve the information for a particular subset, we use the start and end indices to segment the monolithic array.
(Figure 3: (a) Ragged arrays cannot be implemented directly in the Stan modeling language. (b) That said, we can always implement a ragged array by concatenating the individual component arrays together into one long array and keeping track of the indices where each component starts and ends. In particular we can retrieve any component array by appropriately subsetting this concatenated array.)
SJ <- 0
alts <- c()
subset_start <- c()
subset_end <- c()
for (s in 1:S) {
subset_start <- c(subset_start, SJ + 1)
SJ <- SJ + length(subset_alts[[s]])
subset_end <- c(subset_end, SJ)
alts <- c(alts, subset_alts[[s]])
}With the organization of the choice subsets handled we can specify the final details. Here we’ll consider the same six agents that we did previously.
N <- 6To tease out any subtle substitution patterns, however, we’ll simulate many more choice observations.
M <- 15000Now we are finally ready to simulate some data. Note that the observed choices are defined in terms of the local indexing within choice subset.
simu_ig2.stan
data {
int<lower=1> M; // Number of observations
int<lower=1> N; // Number of agents
int<lower=1> J; // Number of alternatives in choice set
// Choice subset configurations
int<lower=1> S;
int<lower=1> SJ;
array[S] int<lower=1, upper=SJ> subset_start;
array[S] int<lower=1, upper=SJ> subset_end;
array[SJ] int<lower=1, upper=J> alts;
}
generated quantities {
vector[J] mu; // Baseline utilities
array[N] real<lower=0> sigma; // Scales
array[M] int<lower=1, upper=S> subset; // Observed choice subset
array[M] int<lower=1, upper=N> agent; // Observed agent
array[M] int<lower=1, upper=J> choice; // Observed choice
// Sample true data generating process behavior
for (j in 1:J) mu[j] = normal_rng(0, 1);
for (n in 1:N) sigma[n] = gamma_rng(31.0, 15.5);
// Simulate observations
for (m in 1:M) {
int n = categorical_rng(uniform_simplex(N));
int s = categorical_rng(uniform_simplex(S));
int JJ = subset_end[s] - subset_start[s] + 1;
array[JJ] int subset_alts
= alts[subset_start[s]:subset_end[s]];
vector[JJ] subset_log_probs
= log_softmax(mu[subset_alts] / sigma[n]);
agent[m] = n;
subset[m] = s;
choice[m] = categorical_logit_rng(subset_log_probs);
}
}simu <- stan(file="stan_programs/simu_ig2.stan",
algorithm="Fixed_param", seed=8438338,
data=list('N' = N, 'J' = J, 'M' = M,
'S' = S, 'SJ' = SJ,
'subset_start' = subset_start,
'subset_end' = subset_end,
'alts' = alts),
warmup=0, iter=1, chains=1, refresh=0)
simu_fit <- extract(simu)mu_true <- simu_fit$mu[1,]
sigma_true <- simu_fit$sigma[1,]For maximum efficiency we’ll use the aggregated model implementation, which requires calculating the observed counts for the alternatives in each choice subset.
data <- list()
for (s in 1:S) {
JJ <- subset_end[s] - subset_start[s] + 1
choice_counts <- matrix(NA, N, JJ)
for (n in 1:N)
choice_counts[n,] <- hist(simu_fit$choice[ simu_fit$subset == s
& simu_fit$agent == n],
seq(0.5, JJ + 0.5, 1),
plot=FALSE)$counts
data[[s]] <- list('N' = N, 'J' = JJ,
'choice_counts' = choice_counts)
}3.4.3.2 Explore Data
In theory we could aggregate all of the observed choices into a histogram with respect to the global choice set alternatives. That said, this would convolve all of the choice subsets together, making interpretation tricky at best. Because we are considering only a few choice subsets here, we can just as easily examine separate histograms for each choice subset.
par(mfrow=c(3, 1), mar = c(5, 5, 2, 1))
for (s in 1:S) {
util$plot_disc_vals(colSums(data[[s]]$choice_counts),
xlab='Chosen Alternative (Local Indexing)',
main=paste('Choice Subset', s))
}We could also stratify each of these choice subset histograms by the corresponding agent if useful.
3.4.3.3 Individual Posterior Quantification
Let’s start by fitting the observations of each choice subset separately.
In theory we can set different alternative anchors for each choice subset, but different anchors make the resulting relative baseline utilities awkward to compare. To make comparisons more straightforward, we’ll anchor the same alternative in all three choice subsets so that the relative baseline utilities are compatible. Any alternative that appears in all three choice subsets is a potential candidate.
common_alts <-
Reduce(intersect,
lapply(1:S,
function(s) alts[subset_start[s]:subset_end[s]]))
for (alt in common_alts) {
print(paste0('Alternative ', alt, ':'))
for (s in 1:S) {
MM <- sum(data[[s]]$choice_counts[,to_subset_idx[[s]][alt]])
print(paste0(' ', MM, ' observations in choice subset ', s))
}
}[1] "Alternative 1:"
[1] " 212 observations in choice subset 1"
[1] " 261 observations in choice subset 2"
[1] " 180 observations in choice subset 3"
[1] "Alternative 2:"
[1] " 528 observations in choice subset 1"
[1] " 741 observations in choice subset 2"
[1] " 444 observations in choice subset 3"
[1] "Alternative 3:"
[1] " 393 observations in choice subset 1"
[1] " 516 observations in choice subset 2"
[1] " 306 observations in choice subset 3"
[1] "Alternative 4:"
[1] " 480 observations in choice subset 1"
[1] " 722 observations in choice subset 2"
[1] " 426 observations in choice subset 3"
[1] "Alternative 6:"
[1] " 612 observations in choice subset 1"
[1] " 854 observations in choice subset 2"
[1] " 510 observations in choice subset 3"
[1] "Alternative 7:"
[1] " 451 observations in choice subset 1"
[1] " 525 observations in choice subset 2"
[1] " 351 observations in choice subset 3"
[1] "Alternative 9:"
[1] " 783 observations in choice subset 1"
[1] " 986 observations in choice subset 2"
[1] " 612 observations in choice subset 3"
[1] "Alternative 10:"
[1] " 414 observations in choice subset 1"
[1] " 467 observations in choice subset 2"
[1] " 318 observations in choice subset 3"
We’ll anchor the ninth alternative, which has the most observations amongst those that appear in all of the choice subsets.
anchor <- 9That said, we have to be careful to translate this global alternative index to the local indices within each choice subset.
for (s in 1:S)
data[[s]]$anchor <- to_subset_idx[[s]][anchor]Now we can quantify separate posterior distributions for each of the data sets.
samples <- list()
for (s in 1:S) {
fit <- stan(file='stan_programs/ig3.stan',
data=data[[s]], seed=8438338,
warmup=1000, iter=2024, refresh=0)
cat(paste('Analyzing choice subset', s, ':\n'))
diagnostics <- util$extract_hmc_diagnostics(fit)
util$check_all_hmc_diagnostics(diagnostics)
cat('\n')
samples[[s]] <- util$extract_expectand_vals(fit)
base_samples <- util$filter_expectands(samples[[s]],
c('omega_free',
'tau_free'),
TRUE)
util$check_all_expectand_diagnostics(base_samples)
cat('\n\n')
}Analyzing choice subset 1 :
All Hamiltonian Monte Carlo diagnostics are consistent with reliable
Markov chain Monte Carlo.
All expectands checked appear to be behaving well enough for reliable
Markov chain Monte Carlo estimation.
Analyzing choice subset 2 :
All Hamiltonian Monte Carlo diagnostics are consistent with reliable
Markov chain Monte Carlo.
All expectands checked appear to be behaving well enough for reliable
Markov chain Monte Carlo estimation.
Analyzing choice subset 3 :
All Hamiltonian Monte Carlo diagnostics are consistent with reliable
Markov chain Monte Carlo.
All expectands checked appear to be behaving well enough for reliable
Markov chain Monte Carlo estimation.
3.4.3.4 Inferential Comparisons
For demonstration purposes let’s focus on the substitution patterns between the third and tenth alternatives by studying how the corresponding choice probabilities vary across the three choice subsets. Because the choice probabilities will vary across agents, we’ll also need to pick a specific agent.
n <- 5
j1 <- 3
j2 <- 10While log choice probabilities are more useful computationally, the untransformed choice probabilities are a bit easier to interpret.
j1_probs <- list()
j2_probs <- list()
for (s in 1:S) {
var_repl <- list('x'=paste0('log_probs[', n, ',',
to_subset_idx[[s]][j1], ']'))
j1_probs[[s]] <-
util$eval_expectand_pushforward(samples[[s]],
function(x) exp(x),
var_repl)
var_repl <- list('x'=paste0('log_probs[', n, ',',
to_subset_idx[[s]][j2], ']'))
j2_probs[[s]] <-
util$eval_expectand_pushforward(samples[[s]],
function(x) exp(x),
var_repl)
}The baseline behavior is defined by the choice probabilities q_{5,3} and q_{5,10} in the nominal choice subset.
par(mfrow=c(2, 1), mar = c(5, 4, 2, 1))
B <- 50
flim <- c(0.025, 0.125)
display_name <- paste('Probability of Alternative', j1)
util$plot_expectand_pushforward(j1_probs[[1]], B,
flim=flim, ylim=c(0, 100),
display_name=display_name,
col=util$c_light)
text(0.075, 75, 'Choice\nSubset 1', col=util$c_light)
display_name <- paste('Probability of Alternative', j2)
util$plot_expectand_pushforward(j2_probs[[1]], B,
flim=flim,
ylim=c(0, 100),
display_name=display_name,
col=util$c_light)
text(0.08, 85, 'Choice\nSubset 1', col=util$c_light)Moving from the first to the second choice subset requires removing an alternative. The probability initially allocated to that alternative then has to be redistributed amongst the remaining alternatives, increasing their choice probabilities.
par(mfrow=c(2, 1), mar = c(5, 4, 2, 1))
B <- 50
flim <- c(0.025, 0.125)
display_name <- paste('Probability of Alternative', j1)
util$plot_expectand_pushforward(j1_probs[[1]], B,
flim=flim, ylim=c(0, 100),
display_name=display_name,
col=util$c_light)
text(0.075, 75, 'Choice\nSubset 1', col=util$c_light)
util$plot_expectand_pushforward(j1_probs[[2]], B,
flim=flim,
col=util$c_mid,
border="#BBBBBB88",
add=TRUE)
text(0.11, 60, 'Choice\nSubset 2', col=util$c_mid)
display_name <- paste('Probability of Alternative', j2)
util$plot_expectand_pushforward(j2_probs[[1]], B,
flim=flim,
ylim=c(0, 100),
display_name=display_name,
col=util$c_light)
text(0.08, 85, 'Choice\nSubset 1', col=util$c_light)
util$plot_expectand_pushforward(j2_probs[[2]], B,
flim=flim,
col=util$c_mid,
border="#BBBBBB88",
add=TRUE)
text(0.10, 60, 'Choice\nSubset 2', col=util$c_mid)At the same time, moving from the first to the third choice subset requires adding a new alternative. The probability allocated to this new alternative has to be siphoned from the probability allocated to the initial alternatives. Consequently both q_{5,3} and q_{5,10} decrease.
par(mfrow=c(2, 1), mar = c(5, 4, 2, 1))
B <- 50
flim <- c(0.025, 0.125)
display_name <- paste('Probability of Alternative', j1)
util$plot_expectand_pushforward(j1_probs[[1]], B,
flim=flim, ylim=c(0, 100),
display_name=display_name,
col=util$c_light)
text(0.075, 75, 'Choice\nSubset 1', col=util$c_light)
util$plot_expectand_pushforward(j1_probs[[2]], B,
flim=flim,
col=util$c_mid,
border="#BBBBBB88",
add=TRUE)
text(0.11, 60, 'Choice\nSubset 2', col=util$c_mid)
util$plot_expectand_pushforward(j1_probs[[3]], B,
flim=flim,
col=util$c_dark,
border="#BBBBBB88",
add=TRUE)
text(0.0425, 75, 'Choice\nSubset 3', col=util$c_dark)
display_name <- paste('Probability of Alternative', j2)
util$plot_expectand_pushforward(j2_probs[[1]], B,
flim=flim,
ylim=c(0, 100),
display_name=display_name,
col=util$c_light)
text(0.08, 90, 'Choice\nSubset 1', col=util$c_light)
util$plot_expectand_pushforward(j2_probs[[2]], B,
flim=flim,
col=util$c_mid,
border="#BBBBBB88",
add=TRUE)
text(0.10, 60, 'Choice\nSubset 2', col=util$c_mid)
util$plot_expectand_pushforward(j2_probs[[3]], B,
flim=flim,
col=util$c_dark,
border="#BBBBBB88",
add=TRUE)
text(0.045, 75, 'Choice\nSubset 3', col=util$c_dark)Because the independent Gumbel model exhibits independence from irrelevant alternatives, however, the ratios of choice probabilities should be consistent across any choice subset.
ratios <- list()
for (s in 1:S) {
name1 <- paste0('log_probs[', n, ',',
to_subset_idx[[s]][j1], ']')
name2 <- paste0('log_probs[', n, ',',
to_subset_idx[[s]][j2], ']')
ratios[[s]] <-
util$eval_expectand_pushforward(samples[[s]],
function(x1, x2) exp(x1 - x2),
list('x1'=name1, 'x2'=name2))
}Indeed the posterior distributions for q_{5,3} / q_{5,10} are consistent across all three of our choice subsets here. Note that they are not exactly the same because each posterior distribution is informed by a different set of observations.
par(mfrow=c(1, 1), mar = c(5, 4, 2, 1))
B <- 50
flim <- c(0.5, 1.75)
display_name <- paste(' Probability of Alternative', j1, '\n/',
'Probability of Alternative', j2)
util$plot_expectand_pushforward(ratios[[1]],
B, flim=flim,
ylim=c(0, 6),
display_name=display_name,
col=util$c_light)
util$plot_expectand_pushforward(ratios[[2]],
B, flim=flim,
col=util$c_mid,
border="#BBBBBB88",
add=TRUE)
util$plot_expectand_pushforward(ratios[[3]],
B, flim=flim,
col=util$c_dark,
border="#BBBBBB88",
add=TRUE)3.4.3.5 Joint Posterior Quantification
By modeling the appropriate substitution patterns we don’t have to limit ourselves to separate fits for each choice subset. We can equally well analyze all of the data at the same time using a single, joint discrete choice model.
ig_joint.stan
functions {
// Insert a zero into x_free at index anchor
vector add_anchor(vector x_free, int anchor) {
int J = size(x_free) + 1;
if (anchor == 1)
return append_row(0, x_free);
else if (anchor == J)
return append_row(x_free, 0);
else
return append_row(x_free[1:(anchor - 1)],
append_row(0, x_free[anchor:(J - 1)]));
}
}
data {
int<lower=1> N; // Number of agents
int<lower=1> J; // Number of alternatives in choice set
// Choice subset configurations
int<lower=1> S;
int<lower=1> SJ;
array[S] int<lower=1, upper=SJ> subset_start;
array[S] int<lower=1, upper=SJ> subset_end;
array[SJ] int<lower=1, upper=J> alts;
// Aggregated observations
array[N, SJ] int<lower=0> choice_counts;
// Anchoring configuration
int<lower=1, upper=J> anchor;
// Prediction configuration
int<lower=1, upper=N> n_pred;
int<lower=1> J_pred;
array[J_pred] int<lower=1, upper=J> pred_alts;
}
parameters {
// Relative baseline utilities for non-anchor alternatives
vector[J - 1] omega_free;
// Relative scales for non-anchor agents
array[N - 1] real<lower=0> tau_free;
}
transformed parameters {
// Relative baseline utilities for all alternatives
vector[J] omega = add_anchor(omega_free, anchor);
// Relative scales for all agents
array[N] real<lower=0> tau = append_array({1}, tau_free);
// Alternative log probabilities
array[N] vector[SJ] log_probs;
for (s in 1:S) {
// Extract choice subset configuration
int JJ = subset_end[s] - subset_start[s] + 1;
array[JJ] int subset_alts
= alts[subset_start[s]:subset_end[s]];
vector[JJ] subset_omega = omega[subset_alts];
for (n in 1:N)
log_probs[n][subset_start[s]:subset_end[s]]
= log_softmax( subset_omega / tau[n] );
}
}
model {
// Prior model
target += normal_lpdf(omega_free | 0, 10);
target += normal_lpdf(tau_free | 0, 10);
// Observational model
for (s in 1:S) {
int JJ = subset_end[s] - subset_start[s] + 1;
for (n in 1:N) {
array[JJ] int subset_counts
= choice_counts[n, subset_start[s]:subset_end[s]];
vector[JJ] subset_log_probs
= log_probs[n][subset_start[s]:subset_end[s]];
target += multinomial_logit_lpmf(subset_counts |
subset_log_probs);
}
}
}
generated quantities {
array[N, SJ] int<lower=0> choice_counts_pred;
array[SJ] int<lower=0> agg_choice_counts_pred
= rep_array(0, SJ);
vector[J_pred] log_prob_pred;
for (s in 1:S) {
int JJ = subset_end[s] - subset_start[s] + 1;
for (n in 1:N) {
int MM
= sum(choice_counts[n, subset_start[s]:subset_end[s]]);
vector[JJ] subset_log_probs
= log_probs[n][subset_start[s]:subset_end[s]];
choice_counts_pred[n, subset_start[s]:subset_end[s]]
= multinomial_logit_rng(subset_log_probs, MM);
for (j in subset_start[s]:subset_end[s])
agg_choice_counts_pred[j] += choice_counts_pred[n, j];
}
}
{
vector[J_pred] pred_omega = omega[pred_alts];
log_prob_pred = log_softmax( pred_omega / tau[n_pred] );
}
}To keep the organization consistent I will store these choice counts using the same ragged array pattern that I used for the choice subset alternatives.
choice_counts <- matrix(NA, N, SJ)
for (s in 1:S) {
for (n in 1:N) {
choice_counts[n, subset_start[s]:subset_end[s]] <-
data[[s]]$choice_counts[n,]
}
}joint_data <- list('N'=N, 'J'=J,
'S'=S, 'SJ'=SJ,
'subset_start'=subset_start,
'subset_end'=subset_end,
'alts'=alts,
'choice_counts'=choice_counts,
'anchor'=anchor)For demonstration purposes we’ll specifically look at predictions for the fifth agent choosing from the full choice set. That said, we could just as easily make predictions for any agent choosing from any choice subset.
joint_data$n_pred <- 5
joint_data$J_pred <- J
joint_data$pred_alts <- 1:JNow we let Hamiltonian Monte Carlo go to work.
fit <- stan(file='stan_programs/ig_joint.stan',
data=joint_data, seed=8438338,
warmup=1000, iter=2024, refresh=0)The diagnostics are clean, giving us confidence in the accuracy of our posterior quantification.
diagnostics <- util$extract_hmc_diagnostics(fit)
util$check_all_hmc_diagnostics(diagnostics) All Hamiltonian Monte Carlo diagnostics are consistent with reliable
Markov chain Monte Carlo.
joint_samples <- util$extract_expectand_vals(fit)
base_samples <- util$filter_expectands(joint_samples,
c('omega_free',
'tau_free'),
TRUE)
util$check_all_expectand_diagnostics(base_samples)All expectands checked appear to be behaving well enough for reliable
Markov chain Monte Carlo estimation.
There is no appreciable retrodictive tension in any of the choice subsets.
par(mfrow=c(3, 1), mar=c(5, 5, 1, 1))
for (s in 1:S) {
agg_choice_counts_obs <-
colSums(joint_data$choice_counts[,subset_start[s]:subset_end[s]])
names <- sapply(subset_start[s]:subset_end[s],
function(j)
paste0('agg_choice_counts_pred[', j, ']'))
util$plot_disc_pushforward_quantiles(
joint_samples, names,
baseline_values=agg_choice_counts_obs,
xlab="Alternative (Local Indexing)",
ylab="Choice Counts",
main=paste("Choice Subset", s)
)
}Posterior inferences of the relative baseline utilities drawn from the joint model are consistent with posterior inferences drawn from each of the individual models. By leveraging all of the data at the same time, however, the joint inferences enjoy much smaller uncertainties.
par(mfrow=c(2, 1), mar = c(5, 4, 2, 1))
B <- 50
flim <- c(-1.25, -0.25)
display_name <- paste0('omega[', j1, ']')
name <- paste0('omega[', to_subset_idx[[1]][j1], ']')
util$plot_expectand_pushforward(samples[[1]][[name]],
B, flim=flim,
ylim=c(0, 10),
display_name=display_name,
col=util$c_light)
name <- paste0('omega[', to_subset_idx[[2]][j1], ']')
util$plot_expectand_pushforward(samples[[2]][[name]],
B, flim=flim,
col=util$c_light,
border="#BBBBBB88",
add=TRUE)
name <- paste0('omega[', to_subset_idx[[3]][j1], ']')
util$plot_expectand_pushforward(samples[[3]][[name]],
B, flim=flim,
col=util$c_light,
border="#BBBBBB88",
add=TRUE)
name <- paste0('omega[', j1, ']')
util$plot_expectand_pushforward(joint_samples[[name]],
B, flim=flim,
col=util$c_dark,
border="#BBBBBB88",
add=TRUE)
flim <- c(-1.25, -0.25)
display_name <- paste0('omega[', j2, ']')
name <- paste0('omega[', to_subset_idx[[1]][j2], ']')
util$plot_expectand_pushforward(samples[[1]][[name]],
B, flim=flim,
ylim=c(0, 10),
display_name=display_name,
col=util$c_light)
name <- paste0('omega[', to_subset_idx[[2]][j2], ']')
util$plot_expectand_pushforward(samples[[2]][[name]],
B, flim=flim,
col=util$c_light,
border="#BBBBBB88",
add=TRUE)
name <- paste0('omega[', to_subset_idx[[3]][j2], ']')
util$plot_expectand_pushforward(samples[[3]][[name]],
B, flim=flim,
col=util$c_light,
border="#BBBBBB88",
add=TRUE)
name <- paste0('omega[', j2, ']')
util$plot_expectand_pushforward(joint_samples[[name]],
B, flim=flim,
col=util$c_dark,
border="#BBBBBB88",
add=TRUE)We can then use these precise inferences to make predictions that generalize to any choice subset.
par(mfrow=c(1, 1), mar = c(5, 4, 2, 1))
names <- sapply(1:joint_data$J_pred,
function(j) paste0('log_prob_pred[', j, ']'))
util$plot_disc_pushforward_quantiles(joint_samples, names,
xlab="Alternative",
xticklabs=joint_data$pred_alts,
ylab="Log Choice Probability")4 Coupled Utility Models
One of the most productive ways to enable more sophisticated substitution patterns in discrete choice models, in particular to transcend independence from irrelevant alternatives, is to couple the components of the probabilistic utility model together. The more coupled the utilities of two alternatives are, the better substitutes those two alternatives are for each other relative to the other alternatives. In this section we’ll review some of the more common discrete choice models that are derived from coupled utility models.
4.1 Nested Gumbel Model
The nested Gumbel model partitions a choice set into subsets known as nests, with alternatives in the same nest more strongly coupled together than those across different nests.
More formally, consider partitioning a given choice set into K subsets. We denote each subset as B_{k} so that if j \in B_{k} then the jth alternative is in the kth nest. When implementing these models in practice, it will also be useful to be able to quickly refer to the nest containing the jth alternative. Here I will use the notation k(j).
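In code a nest partition is conveniently encoded as a lookup array, which makes evaluating k(j) trivial. For example (a toy sketch with assumed values, mirroring the encoding that we’ll use in the simulations below):
nests <- c(1, 1, 2, 2, 2) # nests[j] is the nest k(j) containing alternative j
nests[4]                  # k(4) = 2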
Given a partition of the choice set, the nested Gumbel model is most easily defined not with a joint probability density function but rather with a joint cumulative distribution function, \Pi( u_{n1}, \ldots, u_{nJ} ) = \exp \left( - \sum_{k = 1}^{K} \left( \sum_{j \in B_{k}} e^{ - \frac{1}{\lambda_{k}} \frac{ u_{nj} - \mu_{nj} }{ \sigma_{n} } } \right)^{ \lambda_{k} } \right). The \lambda_{k} are positive variables that quantify how strongly the utilities within each nest are coupled together, with larger values corresponding to weaker couplings.
When \lambda_{k} = 1 the behavior of the nested Gumbel model in the kth nest reduces to the behavior of the independent Gumbel model, at least up to a scale parameter. Values of \lambda_{k} larger than one can sometimes, but not always, correspond to self-consistent model configurations. To simplify the use of this model here, I will consider only 0 \le \lambda_{k} \le 1.
In theory we could differentiate this joint cumulative distribution function with respect to each argument to derive a corresponding joint probability density function. The results, however, are not particularly insightful and not needed to derive choice probabilities. Consequently we will not consider the nested Gumbel probability density function any further.
4.1.1 Nested Gumbel As A Generalization of Independent Gumbel
We can also derive the nested Gumbel model as a generalization of the independent Gumbel model. Recall that the cumulative distribution function is given by \Pi( u_{n1}, \ldots, u_{nJ}) = \exp \left( - \sum_{j = 1}^{J} e^{ - \frac{ u_{nj} - \mu_{nj} }{ \sigma_{n} } } \right). Given a nesting structure, we can always split the sum in the exponential up into a sum over nests and then a sum over the alternatives in each nest, \Pi( u_{n1}, \ldots, u_{nJ} ) = \exp \left( - \sum_{k = 1}^{K} \left( \sum_{j \in B_{k}} e^{ - \frac{ u_{nj} - \mu_{nj} }{ \sigma_{n} } } \right) \right).
At this point we embrace a vector space perspective (Horn and Johnson 1985) and recognize that the inner sums can be interpreted as 1-norms over each nest, \Vert x_{1}, \ldots, x_{I} \Vert_{1} = \sum_{i = 1}^{I} x_{i}. The immediate generalization of 1-norms are p-norms, \Vert x_{1}, \ldots, x_{I} \Vert_{p} = \left( \sum_{i = 1}^{I} x_{i}^{p} \right)^{\frac{1}{p}}. When p = 1, all of the elements contribute to the norm equally. As p increases, however, the norm becomes dominated more and more by the one element with the largest value.
Allowing p to vary across nests then gives the generalized utility model \Pi( u_{n1}, \ldots, u_{nJ} ) = \exp \left( - \sum_{k = 1}^{K} \left( \sum_{j \in B_{k}} e^{ - p_{k} \, \frac{ u_{nj} - \mu_{nj} }{ \sigma_{n} } } \right)^{ \frac{1}{p_{k}} } \right). Defining \lambda_{k} = \frac{1}{p_{k}} immediately gives the nested Gumbel model.
Even if we don’t derive the nested Gumbel model in this way, the connection to p-norms provides intuition for the behavior in each nest. When \lambda_{k} = 1, for example, all of the alternatives in the nest contribute equally. As \lambda_{k} decreases, however, the one alternative with the largest utility residual u_{nj} - \mu_{nj} begins to dominate (Figure 4).
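A minimal numerical sketch of this intuition: as p grows, equivalently as \lambda_{k} = 1 / p shrinks, the p-norm is increasingly dominated by the largest element.
p_norm <- function(x, p) sum(x^p)^(1 / p)
x <- c(1, 2, 5)
p_norm(x, 1)   # 8; every element contributes equally
p_norm(x, 4)   # Approximately 5.03; dominated by the largest element
p_norm(x, 100) # Essentially max(x) = 5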
4.1.2 Choice Probabilities
Following our general construction, the choice probabilities are given by differentiating the joint cumulative distribution function, evaluating all arguments on the target utility, and then integrating. After a few pages of calculations, which I have reserved for Appendix A.3, this gives \begin{align*} q_{nj} &= \int_{-\infty}^{+\infty} \mathrm{d} u \, \frac{ \partial \Pi }{ \partial x_{j} } ( u, \ldots, u ) \\ &= \frac{ e^{ \frac{1}{ \lambda_{k(j)} } \frac{ \mu_{nj} }{ \sigma_{n} } } \, \left( \sum_{j' \in B_{k(j)}} e^{ \frac{1}{ \lambda_{k(j)} } \frac{ \mu_{nj'} }{ \sigma_{n} } } \right)^{ \lambda_{k(j)} - 1 } }{ \sum_{k' = 1}^{K} \left( \sum_{j' \in B_{k'}} e^{ \frac{1}{ \lambda_{k'} } \frac{ \mu_{nj'} }{ \sigma_{n} } } \right)^{ \lambda_{k'} } }. \end{align*}
4.1.2.1 Conditional and Marginal Choice Probabilities
The choice probabilities become substantially more interpretable with a little bit of mathematical manipulation. In particular, we can manipulate them into a product of two softmax functions, \begin{align*} q_{nj} &= \frac{ e^{ \frac{1}{ \lambda_{k(j)} } \frac{ \mu_{nj} }{ \sigma_{n} } } \, \left( \sum_{j' \in B_{k(j)}} e^{ \frac{1}{ \lambda_{k(j)} } \frac{ \mu_{nj'} }{ \sigma_{n} } } \right)^{ \lambda_{k(j)} - 1 } }{ \sum_{k' = 1}^{K} \left( \sum_{j' \in B_{k'}} e^{ \frac{1}{ \lambda_{k'} } \frac{ \mu_{nj'} }{ \sigma_{n} } } \right)^{ \lambda_{k'} } } \\ &= \frac{ e^{ \frac{1}{ \lambda_{k(j)} } \frac{ \mu_{nj} }{ \sigma_{n} } } \, }{ \sum_{j' \in B_{k(j)}} e^{ \frac{1}{ \lambda_{k(j)} } \frac{ \mu_{nj'} }{ \sigma_{n} } } } \cdot \frac{ \left( \sum_{j' \in B_{k(j)}} e^{ \frac{1}{ \lambda_{k(j)} } \frac{ \mu_{nj'} }{ \sigma_{n} } } \right)^{ \lambda_{k(j)} } }{ \sum_{k' = 1}^{K} \left( \sum_{j' \in B_{k'}} e^{ \frac{1}{ \lambda_{k'} } \frac{ \mu_{nj'} }{ \sigma_{n} } } \right)^{ \lambda_{k'} } }. \end{align*}
The first term, q_{ nj \mid k(j) } = \frac{ e^{ \frac{1}{ \lambda_{k(j)} } \frac{ \mu_{nj} }{ \sigma_{n} } } \, }{ \sum_{j' \in B_{k(j)}} e^{ \frac{1}{ \lambda_{k(j)} } \frac{ \mu_{nj'} }{ \sigma_{n} } } } quantifies the conditional probability of choosing the jth alternative from the k(j)th nest, while the second term q_{ nk(j) } = \frac{ \left( \sum_{j' \in B_{k(j)}} e^{ \frac{1}{ \lambda_{k(j)} } \frac{ \mu_{nj'} }{ \sigma_{n} } } \right)^{ \lambda_{k(j)} } }{ \sum_{k' = 1}^{K} \left( \sum_{j' \in B_{k'}} e^{ \frac{1}{ \lambda_{k'} } \frac{ \mu_{nj'} }{ \sigma_{n} } } \right)^{ \lambda_{k'} } } quantifies the marginal probability of choosing the k(j)th nest from the partition. In other words, the nesting structure of the nested Gumbel model induces a natural conditional decomposition of the choice probabilities, q_{nj} = q_{ nj \mid k(j) } \, q_{ nk(j) }.
4.1.2.1.1 Efficient Calculation
Using the softmax function we can calculate all of the conditional probabilities for the alternatives within a particular nest, ( j_{1}, \ldots, j_{J_{k}} ) \in B_{k}, at the same time, (q_{nj_{1} \mid k}, \ldots, q_{nj_{J_{k}} \mid k} ) = \mathrm{softmax} \left( \frac{1}{ \lambda_{k} } \frac{ \mu_{nj_{1}} }{ \sigma_{n} } , \ldots, \frac{1}{ \lambda_{k} } \frac{ \mu_{nj_{J_{k}}} }{ \sigma_{n} } \right).
With a little bit of work we can also calculate all of the marginal probabilities for the nests with a single softmax function evaluation, \begin{align*} (q_{n1}, &\ldots, q_{nK} ) \\ &= \mathrm{softmax} \left( \lambda_{1} \, \log \sum_{j' \in B_{1}} e^{ \frac{1}{ \lambda_{1} } \frac{ \mu_{nj'} }{ \sigma_{n} } }, \right. \\ & \hspace{23mm} \ldots, \\ & \hspace{21.5mm} \left. \lambda_{K} \, \log \sum_{j' \in B_{K}} e^{ \frac{1}{ \lambda_{K} } \frac{ \mu_{nj'} }{ \sigma_{n} } } \quad \right). \end{align*}
The marginal probabilities simplify even further with the use of a log-sum-exp function, \text{log-sum-exp}(x_{1}, \ldots, x_{L}) = \log \sum_{l = 1}^{L} e^{x_{l}}. This allows us to write (q_{n1}, \ldots, q_{nK} ) = \mathrm{softmax} ( \lambda_{1} \, C_{1}, \ldots, \lambda_{K} \, C_{K} ) with C_{k} = \text{log-sum-exp} \left( \frac{1}{ \lambda_{k} } \frac{ \mu_{nj_{1}} }{ \sigma_{n} }, \ldots, \frac{1}{ \lambda_{k} } \frac{ \mu_{nj_{J_{k}}} }{ \sigma_{n} } \right)
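To make this construction concrete, here is a minimal R sketch (an assumed helper, not from the source) that evaluates the nested Gumbel choice probabilities for a single agent by combining the within-nest softmax with the marginal softmax over the \lambda_{k} \, C_{k}.
nested_gumbel_probs <- function(mu, sigma, lambda, nests) {
  K <- max(nests)
  # Numerically stable log-sum-exp
  lse <- function(x) { m <- max(x); m + log(sum(exp(x - m))) }
  # Within-nest normalizations, C_k
  C <- sapply(1:K, function(k) lse(mu[nests == k] / (lambda[k] * sigma)))
  # Marginal nest log probabilities, log softmax(lambda_1 C_1, ..., lambda_K C_K)
  log_q_nest <- lambda * C - lse(lambda * C)
  # Conditional log probabilities within each nest, then combine
  log_q <- rep(NA, length(mu))
  for (k in 1:K) {
    idx <- nests == k
    log_q[idx] <- mu[idx] / (lambda[k] * sigma) - C[k] + log_q_nest[k]
  }
  exp(log_q)
}
For example, with toy values the outputs form a proper simplex.
q <- nested_gumbel_probs(mu=c(0.3, -0.2, 1.1, 0.5, -0.7), sigma=1,
                         lambda=c(0.5, 0.8), nests=c(1, 1, 1, 2, 2))
sum(q) # 1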
4.1.2.2 Redundancies
The form of the conditional choice probabilities within each nest is identical to the choice probabilities that we would derive from the independent Gumbel model over that nest. This immediately indicates that the baseline utilities within each nest are subject to their own translation redundancy.
Independently translating the baseline utilities within each nest, however, does not result in the same marginal probabilities. Instead there is only a global translation redundancy across all of the baseline utilities. Writing \nu_{nj} = \frac{ \mu_{nj} }{ \sigma_{n} }, we have \begin{align*} q_{ nk(j) } ( \nu_{n1}, \ldots, \nu_{nJ} ) &= \frac{ \left( \sum_{j' \in B_{k(j)}} e^{ \frac{ \nu_{nj'} }{ \lambda_{k(j)} } } \right)^{ \lambda_{k(j)} } }{ \sum_{k' = 1}^{K} \left( \sum_{j' \in B_{k'}} e^{ \frac{ \nu_{nj'} }{ \lambda_{k'} } } \right)^{ \lambda_{k'} } } \\ &= \frac{ e^{\alpha} }{ e^{\alpha} } \frac{ \left( \sum_{j' \in B_{k(j)}} e^{ \frac{ \nu_{nj'} }{ \lambda_{k(j)} } } \right)^{ \lambda_{k(j)} } }{ \sum_{k' = 1}^{K} \left( \sum_{j' \in B_{k'}} e^{ \frac{ \nu_{nj'} }{ \lambda_{k'} } } \right)^{ \lambda_{k'} } } \\ &= \frac{ \left( \sum_{j' \in B_{k(j)}} e^{ \frac{ \alpha }{ \lambda_{k(j)} } } e^{ \frac{ \nu_{nj'} }{ \lambda_{k(j)} } } \right)^{ \lambda_{k(j)} } }{ \sum_{k' = 1}^{K} e^{ \alpha } \left( \sum_{j' \in B_{k'}} e^{ \frac{ \alpha }{ \lambda_{k'} } } e^{ \frac{ \nu_{nj'} }{ \lambda_{k'} } } \right)^{ \lambda_{k'} } } \\ &= \frac{ \left( \sum_{j' \in B_{k(j)}} e^{ \frac{ \nu_{nj'} + \alpha }{ \lambda_{k(j)} } } \right)^{ \lambda_{k(j)} } }{ \sum_{k' = 1}^{K} e^{ \alpha } \left( \sum_{j' \in B_{k'}} e^{ \frac{ \nu_{nj'} + \alpha }{ \lambda_{k'} } } \right)^{ \lambda_{k'} } } \\ &= q_{nk(j)} ( \nu_{n1} + \alpha, \ldots, \nu_{nJ} + \alpha) \end{align*}
All of this is to say that the same relative baseline utility and relative scale approach that we considered in Section 3.2 will eliminate the redundancies of the nested Gumbel model as well!
4.1.3 Substitution Patterns
The nested Gumbel model couples the alternative utilities within each nest together, making alternatives within a nest better substitutes for each other than alternatives in other nests. These substitution patterns can be quantified in various ways.
4.1.3.1 Ratios of Choice Probabilities
In a nested Gumbel model the ratios of choice probabilities for any two alternatives is given by \begin{align*} \frac{ q_{nj_{1}} }{ q_{nj_{2}} } &= \frac{ q_{ nj_{1} \mid k(j_{1}) } \, q_{ nk(j_{1}) } }{ q_{ nj_{2} \mid k(j_{2}) } \, q_{ nk(j_{2}) } } \\ &= \frac{ q_{ nj_{1} \mid k(j_{1}) } }{ q_{ nj_{2} \mid k(j_{2}) } } \, \frac{ q_{ nk(j_{1}) } }{ q_{ nk(j_{2}) } }. \end{align*}
If two alternatives are in the same nest, k(j_{1}) = k(j_{2}) = k, then the marginal probabilities cancel, \begin{align*} \frac{ q_{nj_{1}} }{ q_{nj_{2}} } &= \frac{ q_{ nj_{1} \mid k(j_{1}) } \, q_{ nk(j_{1}) } } { q_{ nj_{2} \mid k(j_{2}) } \, q_{ nk(j_{2}) } } \\ &= \frac{ q_{ nj_{1} \mid k } \, q_{ nk } } { q_{ nj_{2} \mid k } \, q_{ nk } } \\ &= \frac{ q_{ nj_{1} \mid k } } { q_{ nj_{2} \mid k } }, \end{align*} leaving only the ratio of conditional probabilities.
This ratio, however, depends on only the two corresponding baseline utilities, \begin{align*} \frac{ q_{nj_{1}} }{ q_{nj_{2}} } &= \frac{ q_{ nj_{1} \mid k } }{ q_{ nj_{2} \mid k } } \\ &= \frac{ e^{ \frac{ 1 }{ \lambda_{k} } \frac{ \mu_{nj_{1}} }{ \sigma_{n} } } \, }{ \sum_{j' \in B_{k}} e^{ \frac{ 1 }{ \lambda_{k} } \frac{ \mu_{nj'} }{ \sigma_{n} } } } \frac{ \sum_{j' \in B_{k}} e^{ \frac{ 1 }{ \lambda_{k} } \frac{ \mu_{nj'} }{ \sigma_{n} } } }{ e^{ \frac{ 1 }{ \lambda_{k} } \frac{ \mu_{nj_{2}} }{ \sigma_{n} } } \, } \\ &= \exp \left( \frac{ 1 }{ \lambda_{k} } \frac{ \mu_{nj_{1}} - \mu_{nj_{2}} }{ \sigma_{n} } \right). \end{align*} In other words the nested Gumbel model exhibits independence of irrelevant alternatives within each nest.
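We can confirm this numerically with the nested_gumbel_probs sketch from above (assumed toy values); within a nest the ratio depends on only the two baseline utilities, the nest coupling, and the common scale.
mu <- c(0.3, -0.2, 1.1, 0.5, -0.7)
q <- nested_gumbel_probs(mu, sigma=1, lambda=c(0.5, 0.8),
                         nests=c(1, 1, 1, 2, 2))
q[1] / q[2]                # Within-nest ratio of choice probabilities
exp((mu[1] - mu[2]) / 0.5) # Matches the analytic result for lambda_1 = 0.5, sigma = 1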
When comparing alternatives across nests, however, the marginal probabilities no longer cancel. Instead we have \begin{align*} \frac{ q_{nj_{1}} }{ q_{nj_{2}} } &= \frac{ q_{ nj_{1} \mid k(j_{1}) } }{ q_{ nj_{2} \mid k(j_{2}) } } \, \frac{ q_{ nk(j_{1}) } }{ q_{ nk(j_{2}) } } \\ &= \frac{ e^{ \frac{ 1 }{ \lambda_{k(j_{1})} } \frac{ \mu_{nj_{1}} }{ \sigma_{n} } } \, }{ e^{ \frac{ 1 }{ \lambda_{k(j_{2})} } \frac{ \mu_{nj_{2}} }{ \sigma_{n} } } \, } \frac{ \left( \sum_{j' \in B_{k(j_{1})}} e^{ \frac{ 1 }{ \lambda_{k(j_{1})} } \frac{ \mu_{nj'} }{ \sigma_{n} } } \right)^{ \lambda_{k(j_{1})} - 1 } }{ \left( \sum_{j' \in B_{k(j_{2})}} e^{ \frac{ 1 }{ \lambda_{k(j_{2})} } \frac{ \mu_{nj'} }{ \sigma_{n} } } \right)^{ \lambda_{k(j_{2})} - 1 } }. \end{align*} The ratio of the choice probabilities now depends on all of the baseline utilities across the two nests. Whenever the behavior of any alternative in either of the nests changes, so too will the substitution pattern between j_{1} and j_{2}.
In general the coupling of the nested Gumbel model allows alternatives in one nest to be better substitutes for each other than any alternative in another nest. If both \lambda_{k(j_{1})} and \lambda_{k(j_{2})} are strictly less than one and if \sum_{j' \in B_{k(j_{1})}} e^{ \frac{ 1 }{ \lambda_{k(j_{1})} } \frac{ \mu_{nj'} }{ \sigma_{n} } } decreases relative to \sum_{j' \in B_{k(j_{2})}} e^{ \frac{ 1 }{ \lambda_{k(j_{2})} } \frac{ \mu_{nj'} }{ \sigma_{n} } } then the choice probability q_{nj_{1}} will increase relative to q_{nj_{2}}, and vice versa.
In other words any choice probability removed from one alternative within a nest is preferentially transferred to the choice probabilities of the other alternatives in that same nest. Taking this to the limit, removing an alternative from B_{k(j_{1})} but keeping everything else fixed decreases \sum_{j' \in B_{k(j_{1})}} e^{ \frac{ 1 }{ \lambda_{k(j_{1})} } \frac{ \mu_{nj'} }{ \sigma_{n} } } relative to the corresponding sum for the k(j_{2}) nest. This in turn increases q_{nj_{1}} relative to q_{nj_{2}} (Figure 5).
4.1.3.2 Elasticities
The substitution patterns are also mirrored in the behavior of the elasticities. That said, the analytic form of the elasticities becomes substantially more ungainly. After a long calculation (Appendix A.1) we have \begin{align*} \epsilon_{njj'} &= \frac{ \mu_{nj'} }{ q_{nj} } \frac{ \partial q_{nj} }{ \partial \mu_{nj'} } \\ &= \frac{ \mu_{nj'} }{ q_{nj} } \frac{ \partial }{ \partial \mu_{nj'} } \left[ \frac{ e^{ \frac{ 1 }{ \lambda_{k(j)} } \frac{ \mu_{nj} } { \sigma_{n} } } \, \left( \sum_{j'' \in B_{k(j)} } e^{ \frac{ 1 }{ \lambda_{k(j)} } \frac{ \mu_{nj''} }{ \sigma_{n} } } \right)^{ \lambda_{k(j)} - 1 } }{ \sum_{k' = 1}^{K} \left( \sum_{j'' \in B_{k'}} e^{ \frac{ 1 }{ \lambda_{k'} } \frac{ \mu_{nj''} }{ \sigma_{n} } } \right)^{ \lambda_{k'} } } \right] \\ &= \Bigg( \quad \delta_{j'j} \, \frac{1}{ \lambda_{k(j)} } \\ &\hspace{7mm} + \delta_{k(j') k(j) } \, \frac{\lambda_{k(j)} - 1}{ \lambda_{k(j)} } q_{ nj' \mid k(j) } \\ &\hspace{7mm} - q_{nj'} \, \sum_{j'' \in B_{k(j')}} e^{ \frac{ 1 }{ \lambda_{k(j')} } \frac{ \mu_{nj''} }{ \sigma_{n} } } \hspace{28mm} \Bigg) \frac{ \mu_{nj'} }{ \sigma_{n} }, \end{align*} where \delta denotes the Kronecker delta function, \delta_{ii'} = \left\{ \begin{array}{ll} 0, & i \ne i' \\ 1, & i = i' \end{array} \right. .
In particular we can identify when independence of irrelevant alternatives arises by comparing the elasticities defined by three distinct alternatives, j_{1}, j_{2}, and j', in different nesting scenarios.
For example if j_{1} and j' are in different nests, k(j_{1}) \ne k(j'), then only the third term of the elasticity remains, \epsilon_{nj_{1}j'} = \frac{ \mu_{nj'} \, q_{nj'} }{ \sigma_{n} } \sum_{j'' \in B_{k(j')}} e^{ \frac{ 1 }{ \lambda_{k(j')} } \frac{ \mu_{nj''} }{ \sigma_{n} } }, which is independent of j_{1} and its nest. Consequently if j_{2} is also in a different nest than j', but not necessarily in a different nest than j_{1}, k(j_{1}) \ne k(j') \text{ and } k(j_{2}) \ne k(j'), then the elasticities are equal, \epsilon_{nj_{1}j'} = \epsilon_{nj_{2}j'}. Consequently the ratio q_{nj_{1}} / q_{nj_{2}} will be invariant to changes in the behavior of the j'th alternative.
At the same time if all three alternatives are in the same nest, k( j_{1} ) = k( j_{2} ) = k( j' ) \equiv k then the elasticities become \begin{align*} \epsilon_{nj_{1}j'} &= \frac{ \mu_{nj'} }{ \sigma_{n} } \left( q_{ nj' \mid k } \, \frac{\lambda_{k} - 1}{ \lambda_{k} } + q_{nj'} \, \sum_{j'' \in B_{k}} e^{ \frac{ 1 }{ \lambda_{k} } \frac{ \mu_{nj''} }{ \sigma_{n} } } \right) \\ \epsilon_{nj_{2}j'} &= \frac{ \mu_{nj'} }{ \sigma_{n} } \left( q_{ nj' \mid k } \, \frac{\lambda_{k} - 1}{ \lambda_{k} } + q_{nj'} \, \sum_{j'' \in B_{k}} e^{ \frac{ 1 }{ \lambda_{k} } \frac{ \mu_{nj''} }{ \sigma_{n} } } \right). \end{align*} The two elasticities are exactly equal once again, and the ratio q_{nj_{1}} / q_{nj_{2}} will be invariant to changes in the behavior of the j'th alternative.
This behavior changes, however, when j' is in the same nest as j_{1} or j_{2}. For example if k( j' ) = k( j_{1} ) \ne k( j_{2} ) then \epsilon_{nj_{1}j'} = \frac{ \mu_{nj'} }{ \sigma_{n} } \left( q_{ nj' \mid k(j_{1}) } \, \frac{\lambda_{k(j_{1})} - 1}{ \lambda_{k(j_{1})} } + q_{nj'} \, \sum_{j'' \in B_{k(j')}} e^{ \frac{ 1 }{ \lambda_{k(j')} } \frac{ \mu_{nj''} }{ \sigma_{n} } } \right) but \epsilon_{nj_{2}j'} = \frac{ \mu_{nj'} }{ \sigma_{n} } \left( q_{nj'} \, \sum_{j'' \in B_{k(j')}} e^{ \frac{ 1 }{ \lambda_{k(j')} } \frac{ \mu_{nj''} }{ \sigma_{n} } } \right). In this case the elasticities are not equal and changes in the behavior of the j'th alternative will influence q_{nj_{1}} / q_{nj_{2}}.
Altogether these results imply that the nested Gumbel model exhibits independence of irrelevant alternatives within each individual nest. Any changes outside of the nests containing any alternatives being considered do not change the corresponding substitution patterns.
4.2 Generalized Nested Gumbel Model
Conveniently the nested Gumbel model can be generalized without compromising analytic tractability by removing the constraint that the nests must be disjoint.
To allow nests to overlap we introduce for each alternative a set of parameters that quantify the partial association to each nest, ( \alpha_{j1}, \ldots, \alpha_{jK} ), that form a simplex, 0 \le \alpha_{jk} \le 1, \quad \sum_{k = 1}^{K} \alpha_{jk} = 1. The possible configurations of these partial association simplices are not arbitrary but rather are restricted by the structure of the nests. In particular \alpha_{jk} can be non-zero if and only if j \in B_{k}.
With partial associations we can generalize the probabilistic utility model to the joint cumulative distribution function, \Pi( u_{n1}, \ldots, u_{nJ} ) = \exp \left( - \sum_{k = 1}^{K} \left( \sum_{j \in B_{k}} \left( \alpha_{jk} \, e^{ -\frac{ u_{nj} - \mu_{nj} }{ \sigma_{n} } } \right)^{\frac{1}{\lambda_{k}}} \right)^{ \lambda_{k} } \right). If for each alternative j we have \alpha_{jk} = 1 for one, and only one, k then the nests become disjoint and this model reduces to the nested Gumbel model.
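For example, a valid configuration (toy values, not from the source) for J = 3 alternatives and K = 2 overlapping nests might associate the second alternative partially with both nests:
alpha <- rbind(c(1.0, 0.0),  # Alternative 1 fully in nest 1
               c(0.4, 0.6),  # Alternative 2 split across both nests
               c(0.0, 1.0))  # Alternative 3 fully in nest 2
rowSums(alpha)               # Each row is a simplex summing to one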
Deriving the choice probabilities becomes more straightforward if we first define the subset of nests that contain a given alternative, A_{j}. In other words k \in A_{j} if and only if j \in B_{k}.
After a long calculation (Appendix A.3), this organization allows us to write the choice probabilities as \begin{align*} q_{nj} &= \int_{-\infty}^{+\infty} \mathrm{d} u_{nj} \, \frac{ \partial \Pi }{ \partial x_{j} } ( u_{nj}, \ldots, u_{nj} ) \\ &= \frac{ \sum_{k \in A_{j} } \left( \sum_{j' \in B_{k}} \left( \alpha_{j'k} \, e^{ \frac{ \mu_{nj'} }{ \sigma_{n} } } \right)^{\frac{1}{\lambda_{k}}} \right)^{ \lambda_{k} - 1 } \, \left( \alpha_{jk} \, e^{ \frac{ \mu_{nj} }{ \sigma_{n} } } \right)^{\frac{1}{\lambda_{k}}} \, }{ \sum_{k = 1}^{K} \left( \sum_{j' \in B_{k}} \left( \alpha_{j'k} \, e^{ \frac{ \mu_{nj'} }{ \sigma_{n} } } \right)^{\frac{1}{\lambda_{k}}} \right)^{ \lambda_{k} } }. \end{align*}
Similar to the choice probabilities in the nested Gumbel model, the choice probabilities in the generalized nested Gumbel model admit a convenient conditional decomposition, \begin{align*} q_{nj} &= \frac{ \sum_{k \in A_{j} } \left( \sum_{j' \in B_{k}} \left( \alpha_{j'k} \, e^{ \frac{ \mu_{nj'} }{ \sigma_{n} } } \right)^{\frac{1}{\lambda_{k}}} \right)^{ \lambda_{k} - 1 } \, \left( \alpha_{jk} \, e^{ \frac{ \mu_{nj} }{ \sigma_{n} } } \right)^{\frac{1}{\lambda_{k}}} \, }{ \sum_{k = 1}^{K} \left( \sum_{j' \in B_{k}} \left( \alpha_{j'k} \, e^{ \frac{ \mu_{nj'} }{ \sigma_{n} } } \right)^{\frac{1}{\lambda_{k}}} \right)^{ \lambda_{k} } } \\ &= \sum_{k \in A_{j} } \frac{ \left( \sum_{j' \in B_{k}} \left( \alpha_{j'k} \, e^{ \frac{ \mu_{nj'} }{ \sigma_{n} } } \right)^{\frac{1}{\lambda_{k}}} \right)^{ \lambda_{k} - 1 } \, \left( \alpha_{jk} \, e^{ \frac{ \mu_{nj} }{ \sigma_{n} } } \right)^{\frac{1}{\lambda_{k}}} \, }{ \sum_{k' = 1}^{K} \left( \sum_{j' \in B_{k'}} \left( \alpha_{j'k'} \, e^{ \frac{ \mu_{nj'} }{ \sigma_{n} } } \right)^{\frac{1}{\lambda_{k'}}} \right)^{ \lambda_{k'} } } \\ &= \sum_{k \in A_{j} } \frac{ \left( \alpha_{jk} \, e^{ \frac{ \mu_{nj} }{ \sigma_{n} } } \right)^{\frac{1}{\lambda_{k}}} }{ \sum_{j' \in B_{k}} \left( \alpha_{j'k} \, e^{ \frac{ \mu_{nj'} }{ \sigma_{n} } } \right)^{\frac{1}{\lambda_{k}}} } \frac{ \left( \sum_{j' \in B_{k}} \left( \alpha_{j'k} \, e^{ \frac{ \mu_{nj'} }{ \sigma_{n} } } \right)^{\frac{1}{\lambda_{k}}} \right)^{ \lambda_{k} } }{ \sum_{k' = 1}^{K} \left( \sum_{j' \in B_{k'}} \left( \alpha_{j'k'} \, e^{ \frac{ \mu_{nj'} }{ \sigma_{n} } } \right)^{\frac{1}{\lambda_{k'}}} \right)^{ \lambda_{k'} } } \\ &\equiv \sum_{k \in A_{j} } q_{nj \mid k } \, q_{nk}. \end{align*}
Here q_{nj \mid k } = \frac{ \left( \alpha_{jk} \, e^{ \frac{ \mu_{nj} }{ \sigma_{n} } } \right)^{\frac{1}{\lambda_{k}}} }{ \sum_{j' \in B_{k}} \left( \alpha_{j'k} \, e^{ \frac{ \mu_{nj'} }{ \sigma_{n} } } \right)^{\frac{1}{\lambda_{k}}} } quantifies the conditional probability of each alternative within the kth nest, while q_{nk} = \frac{ \left( \sum_{j' \in B_{k}} \left( \alpha_{j'k} \, e^{ \frac{ \mu_{nj'} }{ \sigma_{n} } } \right)^{\frac{1}{\lambda_{k}}} \right)^{ \lambda_{k} } }{ \sum_{k' = 1}^{K} \left( \sum_{j' \in B_{k'}} \left( \alpha_{j'k'} \, e^{ \frac{ \mu_{nj'} }{ \sigma_{n} } } \right)^{\frac{1}{\lambda_{k'}}} \right)^{ \lambda_{k'} } } quantifies the marginal probability of each nest.
With some careful manipulation we can also write these probabilities as the output of softmax functions, which is particularly useful for practical implementations. For example \begin{align*} q_{nj \mid k } &= \frac{ \left( \alpha_{jk} \, e^{ \frac{ \mu_{nj} }{ \sigma_{n} } } \right)^{\frac{1}{\lambda_{k}}} }{ \sum_{j' \in B_{k}} \left( \alpha_{j'k} \, e^{ \frac{ \mu_{nj'} }{ \sigma_{n} } } \right)^{\frac{1}{\lambda_{k}}} } \\ &= \frac{ \exp \left( \frac{1}{\lambda_{k}} \, \left( \log \alpha_{jk} + \frac{ \mu_{nj} }{ \sigma_{n} } \right) \right) }{ \sum_{j' \in B_{k}} \exp \left( \frac{1}{\lambda_{k}} \, \left( \log \alpha_{j'k} + \frac{ \mu_{nj'} }{ \sigma_{n} } \right) \right) } \end{align*} with (q_{n1 \mid k}, \ldots, q_{nJ \mid k} ) = \mathrm{softmax} \left( \frac{1}{\lambda_{k}} \, \left( \log \alpha_{1k} + \frac{ \mu_{n1} }{ \sigma_{n} } \right), \ldots, \frac{1}{\lambda_{k}} \, \left( \log \alpha_{Jk} + \frac{ \mu_{nJ} }{ \sigma_{n} } \right) \right).
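As a small sketch (an assumed helper, not from the source), the conditional probabilities within one nest follow directly from this softmax form, with \log \alpha acting as an offset to the scaled baseline utilities.
gen_nested_cond_probs <- function(mu, sigma, lambda_k, alpha_k) {
  # alpha_k[j] is the partial association of alternative j to this nest;
  # zero for alternatives outside the nest
  in_nest <- alpha_k > 0
  z <- (log(alpha_k[in_nest]) + mu[in_nest] / sigma) / lambda_k
  q <- rep(0, length(mu))
  q[in_nest] <- exp(z - max(z)) / sum(exp(z - max(z))) # Stable softmax
  q
}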
The substitution patterns exhibited by these generalized choice probabilities are, unsurprisingly, more complicated than those of the nested Gumbel model. Specifically the summation over all nests that contain a given alternative immediately obstructs the cancellation we would need for the ratios of choice probabilities to be independent of the behavior of other alternatives. I will leave the exploration of these substitution patterns as an exercise for the enthusiastic reader…
4.3 The Probit Model
Another popular discrete choice modeling technique with coupled utilities is the probit model which assumes a multivariate normal utility model for each agent, p( u_{n1}, \ldots, u_{nJ} ) = \text{multinormal} \left( u_{n1}, \ldots, u_{nJ} \mid \boldsymbol{\mu}_{n}, \boldsymbol{\Sigma}_{n} \right).
In order to calculate the choice probabilities in this model, q_{nj} = \pi \left( \, \bigcap_{j' \ne j} \{ u_{nj} > u_{nj'} \} \, \right), we have to be able to integrate the multivariate normal probability density function over these non-rectangular subsets.
Unfortunately these integrals are not available in closed form. At the same time, numerical integration is either too inaccurate or too expensive to be productive when we are working with more than a few alternatives.
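One brute-force workaround is naive Monte Carlo: simulate latent utility configurations and count how often each alternative dominates the rest. A minimal sketch, assuming the MASS package for multivariate normal sampling; probit_probs_mc is a hypothetical helper introduced only for this illustration.
library(MASS)
# Monte Carlo estimate of the probit choice probabilities
probit_probs_mc <- function(mu, Sigma, S=100000) {
  u <- mvrnorm(S, mu=mu, Sigma=Sigma)  # S draws of the latent utilities
  idx <- max.col(u)                    # Dominating alternative per draw
  tabulate(idx, nbins=length(mu)) / S
}
Sigma <- matrix(c(1.0, 0.5, 0.0,
                  0.5, 1.0, 0.0,
                  0.0, 0.0, 1.0), nrow=3)
probit_probs_mc(mu=c(0.0, 0.5, -0.5), Sigma=Sigma)
The Monte Carlo standard error decays only as 1 / sqrt(S), however, which quickly becomes prohibitive once these estimates have to feed into a likelihood function that is evaluated many times.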
Because of these challenges the probit model is typically implemented not by calculating the choice probabilities directly but rather by explicitly modeling the latent utilities. Specifically, the utilities for each agent u_{n1}, \ldots, u_{nJ} are elevated to parameters with the observational model p ( \tilde{y}_{m} \mid u_{n(m)1}, \ldots, u_{n(m)J} ) = \left\{ \begin{array}{ll} 1, & \bigcap_{j' \ne \tilde{y}_{m}} \{ u_{n(m)\tilde{y}_{m}} > u_{n(m)j'} \} \\ 0, & \text{else}. \end{array} \right.
This observational model, however, presents some immediate practical challenges. In particular, the likelihood functions, and hence the posterior distributions, derived from it will always be discontinuous, with some configurations of the utilities equally consistent with any observed data and others completely disallowed.
Because of this discontinuous behavior, the gradients of the derived likelihood functions are not informative. This, in turn, obstructs not only optimization-based maximum likelihood estimators but also scalable Markov chain Monte Carlo tools like Hamiltonian Monte Carlo.
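To see the discontinuity concretely, consider two alternatives and an observed choice of the first. In the following R sketch, with the hypothetical name indicator_lik, the likelihood is an indicator function of the latent utilities: piecewise constant with a single jump, so its gradient vanishes almost everywhere.
# Indicator likelihood for choosing alternative 1 out of two
indicator_lik <- function(u1, u2) as.numeric(u1 > u2)
u1_grid <- seq(-1, 1, by=0.25)
indicator_lik(u1_grid, 0)  # Flat at 0, then flat at 1, with all of the
                           # variation concentrated at the jump u1 = 0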
Historically, probit models have largely been limited to Bayesian analyses using Markov chain Monte Carlo with relatively inefficient Gibbs samplers. Because the methodology required to implement and diagnose these tools efficiently is quite distinct from that of Hamiltonian Monte Carlo, I will not consider probit models further in this chapter.
4.4 Demonstration
Let’s put all of this theory into context by analyzing data simulated from a model with more sophisticated substitution patterns than the one we used in Section 3.5.
4.4.1 Basics
As we did previously, let’s start by analyzing data arising from a single, fixed choice set.
4.4.1.1 Simulate Data
In this case we’ll use a nested Gumbel model for our simulation truth. This means configuring the nests and the intra-nest couplings within the given choice set.
K <- 3
nests <- c(1, 1, 1, 2, 2, 2, 2, 3, 3, 3)
lambda_true <- c(0.65, 0.4, 0.8)
The baseline utilities and agent scales will behave similarly to before.
To implement the nested Gumbel model we need to be able to sum over all of the alternatives in each nest. This, in turn, requires knowing which alternatives are in each nest. There are a few different ways to do this, but here I’ll use the num_which_int and which_int functions to dynamically look up the appropriate alternatives.
simu_ng1.stan
functions {
// Return the number of elements in vals that equal i
int num_which_int(array[] int vals, int i) {
int N = num_elements(vals);
int NN = 0;
for (n in 1:N) {
if (vals[n] == i) {
NN += 1;
}
}
return NN;
}
// Return the elements in vals that equal i
array[] int which_int(array[] int vals, int i) {
int N = num_elements(vals);
array[N] int selected_idxs;
int NN = 0;
for (n in 1:N) {
if (vals[n] == i) {
NN += 1;
selected_idxs[NN] = n;
}
}
return selected_idxs[1:NN];
}
}
data {
int<lower=1> M; // Number of observations
int<lower=1> N; // Number of agents
int<lower=1> J; // Number of alternatives in choice set
// Nesting configuration
int<lower=1, upper=J> K;
array[J] int<lower=1, upper=K> nests;
// Intra-nest coupling parameters
vector<lower=0, upper=1>[K] lambda;
}
generated quantities {
vector[J] mu; // Baseline utilities
array[N] real<lower=0> sigma; // Scales
array[M] int<lower=1, upper=N> agent; // Observed agent
array[M] int<lower=1, upper=J> choice; // Observed choice
array[N] vector[J] log_probs;
// Sample true data generating process behavior
for (j in 1:J) mu[j] = normal_rng(0, 1);
for (n in 1:N) sigma[n] = gamma_rng(31.0, 15.5);
// Compute log choice probabilities
for (n in 1:N) {
vector[K] nest_lses;
vector[K] log_marg_probs;
// Compute marginal probability for each nest
for (k in 1:K) {
int JK = num_which_int(nests, k);
array[JK] int nest_alts = which_int(nests, k);
nest_lses[k] = log_sum_exp( mu[nest_alts]
/ (sigma[n] * lambda[k]) );
}
log_marg_probs = log_softmax( lambda .* nest_lses );
// Compute joint probability for each alternative in each nest
for (k in 1:K) {
int JK = num_which_int(nests, k);
array[JK] int nest_alts = which_int(nests, k);
log_probs[n][nest_alts] // Joint Probability
= log_marg_probs[k] // Marginal Probability
+ mu[nest_alts] / (sigma[n] * lambda[k]) // Conditional
- nest_lses[k]; // Probability
}
}
// Simulate observations
for (m in 1:M) {
int n = categorical_rng(uniform_simplex(N));
agent[m] = n;
choice[m] = categorical_logit_rng(log_probs[n]);
}
}
simu <- stan(file="stan_programs/simu_ng1.stan",
algorithm="Fixed_param", seed=8438338,
data=list('N' = N, 'J' = J, 'M' = M,
'K' = K, 'nests' = nests,
'lambda' = lambda_true),
warmup=0, iter=1, chains=1, refresh=0)
simu_fit <- extract(simu)
Once the individual choices have been simulated we can aggregate them across agent-alternative pairs again.
mu_true <- simu_fit$mu[1,]
sigma_true <- simu_fit$sigma[1,]
log_prob_true <- simu_fit$log_probs[1,,]
data <- list('N' = N, 'J' = J,
'K' = K, 'nests' = nests,
'agent' = simu_fit$agent[1,],
'choice' = simu_fit$choice[1,])
choice_counts <- matrix(NA, data$N, data$J)
for (n in 1:data$N)
choice_counts[n,] <- hist(data$choice[data$agent == n],
seq(0.5, data$J + 0.5, 1),
plot=FALSE)$counts
data$choice_counts <- choice_counts
4.4.1.2 Explore Data
The aggregate histogram of choices across all agents still proves informative.
par(mfrow=c(1, 1), mar = c(5, 5, 2, 1))
agg_choice_counts <- colSums(data$choice_counts)
util$plot_disc_vals(agg_choice_counts,
xlab='Chosen Alternative',
ylab='Counts')
Now, however, we can also stratify these histograms by nest to summarize the behavior within each nest.
par(mfrow=c(3, 1), mar = c(5, 5, 2, 1))
for (k in 1:data$K) {
alts <- which(data$nests == k)
nest_alt_choice_counts <- agg_choice_counts[alts]
util$plot_disc_vals(nest_alt_choice_counts,
xlab='Chosen Alternative',
xticklabs=alts,
ylab='Counts',
main=paste0('Nest ', k))
}
At the same time, we can summarize the relative abundance of choices across nests by aggregating counts over all of the alternatives in each nest. Here we see that the third nest contains the most attractive alternatives.
par(mfrow=c(1, 1), mar = c(5, 5, 2, 1))
nest_choice_counts <-
sapply(1:K,
function(k)
sum(agg_choice_counts[which(data$nests == k)]))
util$plot_disc_vals(nest_choice_counts,
xlab='Chosen Nest',
ylab='Counts')
4.4.1.3 Independent Gumbel Model
Although we know that the true data generating process couples the baseline utilities, let’s see what happens when we assume the simpler independent Gumbel model.
Once again we’ll choose the anchor alternative based on the total number of observed choices.
colSums(data$choice_counts)
 [1]  360 1812 1050  498 3845  770  278 3009 2360 1018
data$anchor <- 5
# Posterior quantification
fit <- stan(file='stan_programs/ig3.stan',
data=data, seed=8438338,
warmup=1000, iter=2024, refresh=0)
All of the diagnostics are clean, indicating that our posterior computation is reasonably accurate.
diagnostics <- util$extract_hmc_diagnostics(fit)
util$check_all_hmc_diagnostics(diagnostics)
  All Hamiltonian Monte Carlo diagnostics are consistent with reliable
Markov chain Monte Carlo.
samples <- util$extract_expectand_vals(fit)
base_samples <- util$filter_expectands(samples,
c('omega_free',
'tau_free'),
TRUE)
util$check_all_expectand_diagnostics(base_samples)
All expectands checked appear to be behaving well enough for reliable
Markov chain Monte Carlo estimation.
Moreover, despite the oversimplified assumptions, the aggregate posterior retrodictive behavior is excellent.
par(mfrow=c(1, 1), mar=c(5, 5, 1, 1))
names <- sapply(1:data$J,
function(j)
paste0('agg_choice_counts_pred[', j, ']'))
util$plot_disc_pushforward_quantiles(
samples, names,
baseline_values=agg_choice_counts,
xlab="Alternative",
ylab="Choice Counts"
)
The inferred log choice probabilities are also well-behaved, matching the true behaviors from the simulation.
par(mfrow=c(1, 1), mar = c(5, 4, 2, 1))
names <- sapply(1:data$J,
function(j)
paste0('log_probs[1,', j, ']'))
util$plot_disc_pushforward_quantiles(
samples, names,
baseline=log_prob_true[1,],
baseline_col=util$c_mid_teal,
xlab="Alternative",
ylab="Log Probability"
)
Inferences for the baseline utilities, however, are less satisfying, as they concentrate pretty far away from the true values.
par(mfrow=c(1, 1), mar = c(5, 4, 2, 1))
omega_true <- (mu_true - mu_true[data$anchor]) / sigma_true[1]
names <- sapply(1:data$J, function(j) paste0('omega[', j, ']'))
util$plot_disc_pushforward_quantiles(samples, names,
baseline_values=omega_true,
baseline_col=util$c_mid_teal,
xlab="Alternative",
ylab="Relative Baseline Utility")Regardless of the behavior of the true data generating process, the independent Gumbel model is perfectly adequate for capturing the choice probabilities within a fixed choice set. The limitations of the independent Gumbel model will manifest only when we consider observations across multiple choice subsets at the same time. We’ll investigate that in Section 4.4.2.
4.4.1.4 Nested Gumbel Model
Before trying to fit data from multiple choice subsets, let’s spend some time implementing the nested Gumbel model for this single choice set. Here we’ll also use the num_which_int and which_int functions to make the choice probability calculations a bit cleaner.
ng1.stan
functions {
// Insert a zero into x_free at index anchor
vector add_anchor(vector x_free, int anchor) {
int J = size(x_free) + 1;
if (anchor == 1)
return append_row(0, x_free);
else if (anchor == J)
return append_row(x_free, 0);
else
return append_row(x_free[1:(anchor - 1)],
append_row(0, x_free[anchor:(J - 1)]));
}
// Return the number of elements in vals that equal i
int num_which_int(array[] int vals, int i) {
int N = num_elements(vals);
int NN = 0;
for (n in 1:N) {
if (vals[n] == i) {
NN += 1;
}
}
return NN;
}
// Return the elements in vals that equal i
array[] int which_int(array[] int vals, int i) {
int N = num_elements(vals);
array[N] int selected_idxs;
int NN = 0;
for (n in 1:N) {
if (vals[n] == i) {
NN += 1;
selected_idxs[NN] = n;
}
}
return selected_idxs[1:NN];
}
}
data {
int<lower=1> N; // Number of agents
int<lower=1> J; // Number of alternatives in choice set
// Aggregated observations
array[N, J] int<lower=0> choice_counts;
// Nesting configuration
int<lower=1, upper=J> K;
array[J] int<lower=1, upper=K> nests;
// Anchor configuration
int<lower=1, upper=J> anchor;
}
parameters {
// Relative baseline utilities for non-anchor alternatives
vector[J - 1] omega_free;
// Relative scales for non-anchor agents
array[N - 1] real<lower=0> tau_free;
// Intra-nest coupling parameters
vector<lower=0, upper=1>[K] lambda;
}
transformed parameters {
// Relative baseline utilities for all alternatives
vector[J] omega = add_anchor(omega_free, anchor);
// Relative scales for all agents
array[N] real<lower=0> tau = append_array({1}, tau_free);
// Alternative log probabilities
array[N] vector[J] log_probs;
for (n in 1:N) {
vector[K] nest_lses;
vector[K] log_marg_probs;
// Compute marginal probability for each nest
for (k in 1:K) {
int JK = num_which_int(nests, k);
array[JK] int nest_alts = which_int(nests, k);
nest_lses[k] = log_sum_exp( omega[nest_alts]
/ (tau[n] * lambda[k]) );
}
log_marg_probs = log_softmax( lambda .* nest_lses );
// Compute joint probability for each alternative in each nest
for (k in 1:K) {
int JK = num_which_int(nests, k);
array[JK] int nest_alts = which_int(nests, k);
log_probs[n][nest_alts] // Joint Probability
= log_marg_probs[k] // Marginal Probability
+ omega[nest_alts] / (tau[n] * lambda[k]) // Conditional
- nest_lses[k]; // Probability
}
}
}
model {
// Prior model
target += normal_lpdf(omega_free | 0, 10);
target += normal_lpdf(tau_free | 0, 10);
target += beta_lpdf(lambda | 1, 1);
// Observational model
for (n in 1:N)
target += multinomial_logit_lpmf(choice_counts[n] | log_probs[n]);
}
generated quantities {
array[N, J] int<lower=0> choice_counts_pred;
array[J] int<lower=0> agg_choice_counts_pred = rep_array(0, J);
array[K] int<lower=0> nest_choice_counts_pred = rep_array(0, K);
for (n in 1:N) {
int MM = sum(choice_counts[n]);
choice_counts_pred[n]
= multinomial_logit_rng(log_probs[n], MM);
for (j in 1:J) {
agg_choice_counts_pred[j] += choice_counts_pred[n, j];
}
}
for (k in 1:K) {
int JK = num_which_int(nests, k);
array[JK] int nest_alts = which_int(nests, k);
nest_choice_counts_pred[k]
= sum(agg_choice_counts_pred[nest_alts]);
}
}
fit <- stan(file='stan_programs/ng1.stan',
data=data, seed=8438338,
warmup=1000, iter=2024, refresh=0)
Despite the fact that we’re now using the correct model, the posterior computation has faltered. Specifically, we now see divergences that indicate stunted exploration.
diagnostics1 <- util$extract_hmc_diagnostics(fit)
util$check_all_hmc_diagnostics(diagnostics1)
  Chain 2: 27 of 1024 transitions (2.6%) diverged.
Chain 3: 1 of 1024 transitions (0.1%) diverged.
Divergent Hamiltonian transitions result from unstable numerical
trajectories. These instabilities are often due to degenerate target
geometry, especially "pinches". If there are only a small number of
divergences then running with adapt_delta larger than 0.801 may reduce
the instabilities at the cost of more expensive Hamiltonian
transitions.
samples1 <- util$extract_expectand_vals(fit)
base_samples <- util$filter_expectands(samples1,
c('omega_free',
'tau_free',
'lambda'),
TRUE)
util$check_all_expectand_diagnostics(base_samples)
All expectands checked appear to be behaving well enough for reliable
Markov chain Monte Carlo estimation.
To root out the underlying problem we’ll need to find where the divergences are concentrating, which means examining scatter plots of pairs of parameters. We can try to brute force our way through all of the possible scatter plots, but let’s take a second to consider the structure of the model and motivate particularly suspect pairs.
Going through the mathematical structure of the nested Gumbel model, we see that the relative baseline utilities within each nest always appear together with the corresponding intra-nest coupling parameter, \exp \left( \frac{ 1 }{ \lambda_{k(j)} } \frac{ \omega_{j} }{ \tau_{n} } \right), where k(j) denotes the nest containing alternative j. If the data mostly inform the ratio \frac{ \omega_{j} }{ \lambda_{k(j)} }, and not \omega_{j} and \lambda_{k(j)} individually, then any inferences for these parameters will exhibit a problematic degeneracy. Consequently let’s look at these pairs first.
lambda_names <- sapply(1:data$K,
function(k) paste0('lambda[', k, ']'))
transforms <- setNames(as.list(rep(2, data$K)), lambda_names)
for (k in 1:data$K) {
  alt_names <- sapply(which(data$nests == k),
function(j) paste0('omega[', j, ']'))
util$plot_div_pairs(alt_names, lambda_names[k],
samples1, diagnostics1,
transforms=transforms)
}
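To build some intuition for this suspected ridge, note that the conditional choice probabilities within a nest depend on \omega_{j} and \lambda_{k(j)} only through their ratio, so jointly scaling both leaves the conditional probabilities unchanged; only the nest marginal probabilities can break the degeneracy. A minimal R sketch with hypothetical inputs, taking \tau_{n} = 1 for simplicity:
softmax <- function(x) exp(x - max(x)) / sum(exp(x - max(x)))
omega <- c(0.3, -0.8, 1.1)  # Relative baseline utilities within one nest
lambda <- 0.5               # Intra-nest coupling
s <- 2.5                    # Joint scaling along the suspected ridge
softmax(omega / lambda)              # The conditional probabilities are
softmax((s * omega) / (s * lambda))  # identical along omega / lambda = const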