Mixture Modeling

Author

Michael Betancourt

Published

October 2024

Sometimes a single probabilistic model just isn’t enough. When a behavior of interest might be consistent with multiple probabilistic models we need to incorporate each of those possibilities into the final model. If we don’t know which of those possible probabilistic is responsible for a particular instance of that behavior then the only way to built a consistent joint model is with mixture modeling.

Mixture modeling allows us to, for example, account for undesired contamination in observations, such as an irreducible background that overlaps with a desired signal. By mixing together probabilistic models for the signal and background events we can learn not only their individual behaviors but also their relative prevalence across the observed data.

In this chapter we’ll review the mathematical foundations and general implementation of mixture models, including the potential for frustrating inferential behaviors. I will illustrate these concepts with a series of demonstrative analyses.

1 Implementing Mixture Models

From a mathematical perspective mixing multiple probabilistic models together is relatively straightforward. We have to do a bit of work, however, to derive an implementation that performs well in practice.

1.1 Categorical Implementations

Consider an mathematical space X and a collection of K component probability distributions specified with the probability density functions, p_{k}(x), each providing a different model for a behavior of interest. To simplify the initial presentation we will assume that each component probability distribution is fixed so that there no model configuration variables to consider.

The most straightforward way to model a behavior that can arise from any of these component models is to introduce a categorical variable that labels the responsible component, z \in \{ 1, \ldots, k, \ldots, K \}, and a corresponding probabilistic model, p(z = k \mid \lambda_{1}, \ldots, \lambda_{K}) = \lambda_{k}, where the categorical probabilities \lambda_{k} from a simplex configuration, \begin{align*} 0 \le \lambda_{k} &\le 1 \\ \sum_{k = 1}^{K} \lambda_{k} &= 1. \end{align*}

Each instance of this behavior is then modeled with a value x \in X, an assignment z, and the joint model \begin{align*} p(x, z \mid \lambda_{1}, \ldots, \lambda_{K}) &= p(x \mid z) \, p(z \mid \lambda_{1}, \ldots, \lambda_{K}) \\ &= p_{z}(x) \, \lambda_{z}. \end{align*}

If we know the responsible component model then the component assignment variable z will be known. If we do not know z, however, then all we can do is infer it from any particular value x. Assuming that the component probabilities and component models are known the consistent component assignment are given by an application of Bayes’ Theorem, \begin{align*} p(z \mid \tilde{x}, \lambda_{1}, \ldots, \lambda_{K}) &= \frac{ p(\tilde{x}, z \mid \lambda_{1}, \ldots, \lambda_{K}) }{ \sum_{k = 1}^{K} p(\tilde{x}, k \mid \lambda_{1}, \ldots, \lambda_{K}) } \\ &= \frac{ p_{z}(\tilde{x}) \, \lambda_{z}} { \sum_{k = 1}^{K} p_{k}(\tilde{x}) \, \lambda_{k} }. \end{align*}

When modeling multiple, independent instances of a behavior we have to consider possible heterogeneity in the categorical variables.

If all of the categorical variables are modeled with the same component probabilities then the joint model becomes \begin{align*} p(x_{1}, z_{1}, \ldots, x_{N}, z_{N} \mid \lambda_{1}, \ldots, \lambda_{K}) &= \prod_{n = 1}^{N} p(x_{n} \mid z_{n}) \, p(z_{n} \mid \lambda_{1}, \ldots, \lambda_{K}) \\ &= \prod_{n = 1}^{N} p_{z_{n}}(x_{n}) \, \lambda_{z_{n}}. \end{align*} On the other hand if the component probabilities can vary across observations then we need a separate simplex configuration for each observation, \boldsymbol{\lambda}_{n} = ( \lambda_{n, 1}, \ldots, \lambda_{n, K}), and the joint model becomes \begin{align*} p(x_{1}, z_{1}, \ldots, x_{N}, z_{N} \mid \boldsymbol{\lambda}_{1}, \ldots, \boldsymbol{\lambda}_{N}) &= \prod_{n = 1}^{N} p(x_{n} \mid z_{n}) \, p(z_{n} \mid \boldsymbol{\lambda}_{n}) \\ &= \prod_{n = 1}^{N} p_{z_{n}}(x_{n}) \, \lambda_{n, z_{n}}. \end{align*}

In either case inferences of the individual z_{n} depend on only the value of the corresponding x_{n}, p(z_{n} \mid \tilde{x}_{n}, \lambda_{1}, \ldots, \lambda_{K}) = \frac{ p_{z_{n}}(\tilde{x}_{n}) \, \lambda_{z_{n}}} { \sum_{k = 1}^{K} p_{k}(\tilde{x}_{n}) \, \lambda_{k} } or p(z_{n} \mid \tilde{x}_{n}, \lambda_{n, 1}, \ldots, \lambda_{n, K}) = \frac{ p_{z_{n}}(\tilde{x}_{n}) \, \lambda_{n, z_{n}}} { \sum_{k = 1}^{K} p_{k}(\tilde{x}_{n}) \, \lambda_{n, k} }, again assuming fixed component probabilities and component probability distributions.

1.2 Marginal Implementations

In practice unknown categorical variables can quickly become a nuisance. Not only do they introduce a new variable that we have to infer for each observation but also those variables are discrete which limits the available computational tools. Specifically we cannot use gradient-based methods like Hamiltonian Monte Carlo to explore posterior distributions over the component assignments and component probabilities, let alone any configuration parameters for the component models, at the same time.

Fortunately marginalizing unknown categorical assignments out of the joint model is straightforward.

1.2.1 Single Observation

For a single instance marginalizing an unknown component assignment requires summing over the K possible values, \begin{align*} p(x \mid \lambda_{1}, \ldots, \lambda_{K}) &= \sum_{k = 1}^{K} p(x, z = k \mid \lambda_{1}, \ldots, \lambda_{K}) \\ &= \sum_{k = 1}^{K} p(x \mid z = k) \, p(z = k \mid \lambda_{1}, \ldots, \lambda_{K}) \\ &= \sum_{k = 1}^{K} p_{k}(x) \, \lambda_{k}, \end{align*} which is often written as \begin{equation*} p(x \mid \lambda_{1}, \ldots, \lambda_{K}) = \sum_{k = 1}^{K} \lambda_{k} \, p_{k}(x). \end{equation*} In other words all we have to do to eliminate a categorical variable is add together all of the the component probability density functions weighted by the component probabilities.

1.2.2 Multiple Homogeneous Observations

Given multiple, independent instances with the same component probabilities we can marginalize each component individually, \begin{align*} p(x_{1}, \ldots, x_{N} \mid \lambda_{1}, \ldots, \lambda_{K}) &= \prod_{n = 1}^{N} p(x_{n} \mid \lambda_{1}, \ldots, \lambda_{K}) \\ &= \prod_{n = 1}^{N} \sum_{k = 1}^{K} p_{k}(x_{n}) \, \lambda_{k} \\ &= \prod_{n = 1}^{N} \sum_{k = 1}^{K} \lambda_{k} \, p_{k}(x_{n}). \end{align*}

1.2.3 Multiple Heterogeneous Observations

Accounting for heterogeneity in the component probabilities is a bit more cumbersome, \begin{align*} p(x_{1}, \ldots, x_{N} \mid \boldsymbol{\lambda}_{1}, \ldots, \boldsymbol{\lambda}_{N}) &= \prod_{n = 1}^{N} p(x_{n} \mid \boldsymbol{\lambda}_{n}) \\ &= \prod_{n = 1}^{N} \sum_{k = 1}^{K} p_{k}(x_{n}) \, \lambda_{n, k}. \end{align*}

Interestingly we can sometimes simplify the heterogeneous model even further by marginalizing out not only the individual categorical variables but also the individual categorical probabilities!

Specifically if the individual categorical probabilities are independent and identically distributed then any consistent probabilistic model for their behavior will be of the form p(\boldsymbol{\lambda}_{n}, \ldots, \boldsymbol{\lambda}_{n}) = \prod_{n = 1}^{N} p(\boldsymbol{\lambda}_{n}). In this case the expectation value of any component probability is the same for all observations, \begin{align*} \int \mathrm{d} \boldsymbol{\lambda}_{1} \cdots \mathrm{d} \boldsymbol{\lambda}_{N} p(\boldsymbol{\lambda}_{1}, \ldots, \boldsymbol{\lambda}_{N}) \, \lambda_{n, k} &= \int \mathrm{d} \boldsymbol{\lambda}_{1} \cdots \mathrm{d} \boldsymbol{\lambda}_{N} \bigg[ \prod_{n = 1}^{N} p(\boldsymbol{\lambda}_{n}) \bigg] \, \lambda_{n, k} \\ &= \prod_{n' \ne 1}^{N} \int \mathrm{d} \boldsymbol{\lambda}_{n'} p(\boldsymbol{\lambda}_{n}) \cdot \int \mathrm{d} \boldsymbol{\lambda}_{n} \, p(\boldsymbol{\lambda}_{n}) \, \lambda_{n, k} \\ &= \prod_{n' \ne 1}^{N} 1 \cdot \int \mathrm{d} \boldsymbol{\lambda}_{n} \, p(\boldsymbol{\lambda}_{n}) \, \lambda_{n, k} \\ &= \int \mathrm{d} \boldsymbol{\lambda}_{n} \, p(\boldsymbol{\lambda}_{n}) \, \lambda_{n, k} \\ &\equiv \omega_{k}. \end{align*}

Moreover these individual expectation values form their own simplex configuration: the bounds on each \lambda_{n, k} imply that 0 \le \omega_{k} \le 1 while \begin{align*} \sum_{k = 1}^{K} \omega_{k} &= \sum_{k = 1}^{K} \int \mathrm{d} \boldsymbol{\lambda}_{n} \, p(\boldsymbol{\lambda}_{n}) \, \lambda_{n, k} \\ &= \int \mathrm{d} \boldsymbol{\lambda}_{n} \, p(\boldsymbol{\lambda}_{n}) \, \sum_{k = 1}^{K} \lambda_{n, k} \\ &= \int \mathrm{d} \boldsymbol{\lambda}_{n} \, p(\boldsymbol{\lambda}_{n}) \, 1 \\ &= 1. \end{align*}

Under these assumptions the marginal mixture model is given by \begin{align*} p(x_{1}, \ldots, x_{N}) &= \int \mathrm{d} \boldsymbol{\lambda}_{1} \cdots \mathrm{d} \boldsymbol{\lambda}_{N} p(x_{1}, \ldots, x_{N}, \boldsymbol{\lambda}_{1}, \ldots, \boldsymbol{\lambda}_{N}) \\ &= \int \mathrm{d} \boldsymbol{\lambda}_{1} \cdots \mathrm{d} \boldsymbol{\lambda}_{N} \bigg[ \prod_{n = 1}^{N} \sum_{k = 1}^{K} p_{k}(x_{n}) \, \lambda_{n, k} \bigg] \prod_{n = 1}^{N} p(\boldsymbol{\lambda}_{n}) \\ &= \prod_{n = 1}^{N} \int \mathrm{d} \boldsymbol{\lambda}_{n} \bigg[ \sum_{k = 1}^{K} p_{k}(x_{n}) \, \lambda_{n, k} \bigg] p(\boldsymbol{\lambda}_{n}) \\ &= \prod_{n = 1}^{N} \sum_{k = 1}^{K} \bigg[ p_{k}(x_{n}) \int \mathrm{d} \boldsymbol{\lambda}_{n} \, p(\boldsymbol{\lambda}_{n}) \, \lambda_{n, k} \bigg] \\ &= \prod_{n = 1}^{N} \sum_{k = 1}^{K} \bigg[ p_{k}(x_{n}) \, \omega_{k} \bigg]. \end{align*}

In other words the heterogeneous model reduces to the same form as the homogeneous mixture model, only with the categorical probabilities \boldsymbol{\omega} = (\omega_{1}, \ldots, \omega_{K})! Critically the new categorical probabilities \omega_{k} can no longer be interpreted as the probability of any individual instance of the target behavior arising from a particular component model. Instead they capture the proportion of the ensemble of instances that arises from each component model.

1.3 Numerically Stable Marginal Implementations

Most probabilistic programming languages, for example the Stan modeling language, work not with probability density functions p(x) but rather log probability density functions \log \circ \, p (x). Consequently in order to implement a mixture model in these languages we need to evaluate \begin{align*} \log \circ \, p (x_{1}, \ldots, x_{N} \mid \lambda_{1}, \ldots, \lambda_{K}) &= \log \bigg( \prod_{n = 1}^{N} \sum_{k = 1}^{K} \lambda_{k} \, p_{k}(x_{n}) \bigg) \\ &= \sum_{n = 1}^{N} \log \bigg( \sum_{k = 1}^{K} \lambda_{k} \, p_{k}(x_{n}) \bigg) \\ &= \sum_{n = 1}^{N} \log \bigg( \sum_{k = 1}^{K} \lambda_{k} \, \exp( \log \circ \, p_{k} (x_{n}) ) \bigg) \\ &= \sum_{n = 1}^{N} \log \bigg( \sum_{k = 1}^{K} \exp( \log(\lambda_{k}) + \log \circ \, p_{k} (y_{n}) ). \bigg) \end{align*} In other words for each observation we need to exponentiate a term for each component, evaluate the sum of the exponentials, and then apply the natural logarithm function.

This composite operation is often referred to as the “log sum exp” function, \text{log-sum-exp}(v_{1}, \ldots, v_{K}) = \log \bigg( \sum_{k = 1}^{K} \exp (v_{k}) \bigg). Implementing this operation on computers is complicated by the limitations of floating point arithmetic. In particular the intermediate terms \exp(v_{k}) are prone to overflowing to floating point infinity and corrupting the calculation before the final logarithm can calm things down.

Fortunately the properties of logarithms and exponentials allow us to implement the log sum exp operation in a variety of ways. Specifically we can always factor out any one component without affecting the final value, \begin{align*} \text{log-sum-exp}(v_{1}, \ldots, v_{K}) &= \log \bigg( \sum_{k = 1}^{K} \exp (v_{k}) \bigg) \\ &= \log \bigg( \exp (v_{k'}) + \sum_{k \ne k'} \exp (v_{k}) \bigg) \\ &= \log \bigg( \exp (v_{k'}) \bigg) \bigg(1 + \sum_{k \ne k'} \frac{ \exp (v_{k}) }{ \exp (v_{k'}) } \bigg) \\ &= \log \bigg( \exp (v_{k'}) \bigg) \bigg(1 + \sum_{k \ne k'} \exp (v_{k} - v_{k'}) \bigg) \\ &= \log \exp (v_{k'}) + \log \bigg(1 + \sum_{k \ne k'} \exp (v_{k} - v_{k'}) \bigg) \\ &= v_{k'} + \log \bigg(1 + \sum_{k \ne k'} \exp (v_{k} - v_{k'}) \bigg). \end{align*}

If we factor out the largest component value then v_{k'} \ge v_{k \ne k'}, the input to each exponential will always be less than or equal to one, and the calculation will never encounter floating point overflow. Conveniently most computational libraries provide log sum exp function implementations that automatically factor out the largest value and ensure stable numerical computation. Because of this we need to worry about numerical stability only when implementing operations like these ourselves.

For example in Stan we can avoid any numerical issues by using the built-in log_sum_exp function.

// Loop over observations
for (n in 1:N) {
  target += log_sum_exp(log(lambda[1]) + foo1_lpdf(x[n]),
                        log(lambda[2]) + foo2_lpdf(x[n]),
                        log(lambda[3]) + foo3_lpdf(x[n]),
                        log(lambda[4]) + foo4_lpdf(x[n]));
}

Conveniently this function is also vectorized so we can call it with a single vector argument.

// Loop over observations
for (n in 1:N) {
  vector[4] lpds = [ foo1_lpdf(x[n]), foo2_lpdf(x[n]),
                     foo3_lpdf(x[n]), foo4_lpdf(x[n]) ]';
  target += log_sum_exp(log(lambda) + lpds);
}

When working with only two components there is also a log_mix function that incorporates the mixture probability directly.

log_mix(theta, foo1_lpdf(x[n]), foo2_lpdf(x[n]))
=
log_sum_exp(log(theta)     + foo1_lpdf(x[n]),
            log(1 - theta) + foo2_lpdf(x[n]))

All of this said there is one additional numerical concern relevant to the implementation of mixture models. If any categorical probability vanishes, \lambda_{k} = 0, then the input to the log sum exp function will overflow to negative infinity, \begin{align*} v_{k'} &= \log(\lambda_{k'}) + \log \circ \, p_{k'} (x) \\ &= \log(0) + \log \circ \, p_{k'} (x) \\ &= -\infty + \log \circ \, p_{k'} (x) \\ &= -\infty. \end{align*} At the same time negative infinities also arise if any of the component probability density functions vanish, \begin{align*} v_{k'} &= \log(\lambda_{k'}) + \log \circ \, p_{k'} (x) \\ &= \log(\lambda_{k'}) + \log(0) \\ &= \log(\lambda_{k'}) -\infty \\ &= -\infty. \end{align*}

Formally any negative infinite input results in a vanishing exponential output \exp(v_{k}) = \exp(-\infty) = 0 and no contribution to the intermediate sum. \sum_{k = 1}^{K} \exp(v_{k}) = \sum_{k \ne k'} \exp (v_{k}), Consequently we have, at least in theory, the identity \begin{align*} \text{log-sum-exp}&(v_{1}, \ldots, v_{k - 1}, v_{k} = -\infty, v_{k + 1}, \ldots, v_{K}) \\ &= \text{log-sum-exp}(v_{1}, \ldots, v_{k - 1}, v_{k + 1}, \ldots, v_{K}). \end{align*} In practice, however, the negative infinity can wreak havoc on the floating point calculations.

To avoid potential numerical complications entirely we need to drop the offending component before trying to evaluate the inputs to the log sum exp function. Specifically we compute v_{k} = \log(\lambda_{k}) + \log \circ \, p_{k} (x) for only the components with \lambda_{k} > 0 and p_{k}(x) > 0, and then evaluate the log sum exp function on only these well-behaved inputs.

1.4 Sampling From Mixture Models

We’ve done a good bit of work to remove the component assignments from mixture models, but they are not without their uses. Indeed they make sampling from a mixture model particularly straightforward.

In particular the unmarginalized joint model p(x, z \mid \lambda_{1}, \ldots, \lambda_{K}) = p(x \mid z) \, p(z \mid \lambda_{1}, \ldots, \lambda_{K}) immediately motivates an ancestral sampling strategy for generating samples of x and z. We first select a component by sampling a categorical variable (Figure 1 (a)) \tilde{z} \sim p(z \mid \lambda_{1}, \ldots, \lambda_{K}) and then we sample from the corresponding component probability distribution (Figure 1 (b)) \tilde{x} \sim p(x \mid \tilde{z}) = p_{\tilde{z}}(y).

 

(a)
(b)
Figure 1: Sampling from a mixture model is a straightforward, two-step process. (a) We first sample a component \tilde{z} given the component probabilities p(z = k) = \lambda_{k} and then (b) sample a value \tilde{x} from the corresponding component probability distribution p_{\tilde{z}}(x).

Together the pair \{ \tilde{x}, \tilde{z} \} defines a sample from the unmarginalized model, \{ \tilde{x}, \tilde{z} \} \sim p(x, z \mid \lambda_{1}, \ldots, \lambda_{K}), while the single value \tilde{x} defines a sample from the marginalized model, \tilde{x} \sim p(y \mid \lambda_{1}, \ldots, \lambda_{K}).

Implementing this two-step sampling procedure in the Stan Modeling Language is straightforward.

// Loop over observations
for (n in 1:N) {
  // Sample a component
  int<lower=1, upper=K> z = categorical_rng(lambda);
  // Sample an observation
  if (z == 1) {
    x[n] = foo1_rng();
  } else if (z == 2) {
    x[n] = foo2_rng();
  } else if (z == 3) {
    x[n] = foo3_rng();
  } ...
}

Given values for the component probabilities and component probability distributions we can also conditionally sample component assignments for any value using the posterior distribution that we derived in Section 1.1, \tilde{z} \sim \frac{ p_{z}(\tilde{x}) \, \lambda_{z} } { \sum_{k = 1}^{K} p_{k}(\tilde{x}) \, \lambda_{k} }.

For example we could sample marginal posterior assignments using the generated quantities block in a Stan program.

// Loop over observations
for (n in 1:N) {
  vector[4] log_ps =   log(lambda[1]) + foo1_lpdf(x[n])
                     + log(lambda[2]) + foo2_lpdf(x[n])
                     + log(lambda[3]) + foo3_lpdf(x[n])
                     + log(lambda[4]) + foo4_lpdf(x[n]);
  simplex[4] ps = softmax(log_ps);
  z[n] = categorical_rng(ps);
}

Equivalently we can use the built-int categorical_logit_rng function which automatically applies the softmax function to the input arguments.

// Loop over observations
for (n in 1:N) {
  vector[4] log_ps =   log(lambda[1]) + foo1_lpdf(x[n])
                     + log(lambda[2]) + foo2_lpdf(x[n])
                     + log(lambda[3]) + foo3_lpdf(x[n])
                     + log(lambda[4]) + foo4_lpdf(x[n]);
  z[n] = categorical_log_rng(log_ps);
}

2 Notable Mixture Models

Mixture modeling is a very general modeling technique but there are a few special cases worth particular discussion.

2.1 Inflation Models

Inflation models are a special case of mixture modeling where one or more of the component probability distributions concentrate on single values. These singular probability distributions are defined by the allocations \delta_{x_{\mathrm{inf}}}( \mathsf{x} ) = \left\{ \begin{array}{rl} 1, & x_{\mathrm{inf}} \in \mathsf{x}, \\ 0, & x_{\mathrm{inf}} \notin \mathsf{x} \end{array} \right. . They are also known as Dirac probability distributions, or simply Dirac distributions for short.

While this definition is straightforward, the implementation of inflation models requires some care, especially when the Dirac probability distribution is not compatible with the natural reference measure on the ambient space and we cannot construct well-defined probability density functions.

In theory inflation models can contain any number of component probability distributions but in this section I will consider only two, one baseline component and one inflated component, to simplify the presentation.

2.1.1 Discrete Inflation Models

On discrete spaces any Dirac probability distribution can be specified with a probability density function relative to the counting measure, in other words a probability mass function, \delta_{x_{\mathrm{inf}}}(x) = \left\{ \begin{array}{rl} 1, & x = x_{\mathrm{inf}}, \\ 0, & x \ne x_{\mathrm{inf}} \end{array} \right. . To allows us to implement a corresponding inflation model using probability mass functions.

For example we can take a baseline probability distribution specified by the probability mass function p_{\mathrm{baseline}}(x) and inflate the value x = x_{\mathrm{inf}} by mixing it with a Dirac probability mass function, p(x) = \lambda \, p_{\mathrm{baseline}}(x) + (1 - \lambda) \, \delta_{x_{\mathrm{inf}}}(x), or equivalently \begin{align*} \log \circ \, p (x_{\mathrm{inf}}) &= \text{log-sum-exp}\big( \log(\lambda) + \log \circ \, p_{\mathrm{baseline}} (x_{\mathrm{inf}}), \\ & \hphantom{=\text{log-sum-exp}\big(\;} \log(1 - \lambda) + \log \circ \, \delta_{x_{\mathrm{inf}}} (x_{\mathrm{inf}}) \; \big). \end{align*}

Because the Dirac component allocates zero probability to so many elements, however, we have to take care when evaluating the logarithm of this mixed probability mass function. If x = x_{\mathrm{inf}} then \delta_{x_{\mathrm{inf}}}(x = x_{\mathrm{inf}}) = 1 and \log \circ \, \delta_{x_{\mathrm{inf}}} (x = x_{\mathrm{inf}}) = 0. At this point evaluating the log probability mass function is straightforward, \begin{align*} \log \circ \, p (x_{\mathrm{inf}}) &= \text{log-sum-exp}\big( \log(\lambda) + \log \circ \, p_{\mathrm{baseline}} (x_{\mathrm{inf}}), \\ & \hphantom{=\text{log-sum-exp}\big(\;} \log(1 - \lambda) + \log \circ \, \delta_{x_{\mathrm{inf}}} (x_{\mathrm{inf}}) \; \big) \\ &= \text{log-sum-exp}\big( \log(\lambda) + \log \circ \, p_{\mathrm{baseline}} (x_{\mathrm{inf}}) ), \\ & \hphantom{=\text{log-sum-exp}\big(\;} \log(1 - \lambda) + \log(1) \; \big) \\ &= \text{log-sum-exp}\big( \log(\lambda) + \log \circ \, p_{\mathrm{baseline}} (x_{\mathrm{inf}}) ), \\ & \hphantom{=\text{log-sum-exp}\big(\;} \log(1 - \lambda) \; \big). \end{align*}

On the other hand if x \ne x_{\mathrm{inf}} then \delta_{x_{\mathrm{inf}}}(x \ne x_{\mathrm{inf}}) = 0 and \log \circ \, \delta_{x_{\mathrm{inf}}} (x \ne x_{\mathrm{inf}}) = -\infty. To avoid numerical issues arising from the negative infinity we need to follow the procedure introduced in Section 1.3, first analytically accounting for the zero, \begin{align*} p(x) &= \lambda \, p_{\mathrm{baseline}}(x) + (1 - \lambda) \, \delta_{x_{\mathrm{inf}}}(x) \\ &= \lambda \, p_{\mathrm{baseline}}(x) + (1 - \lambda) \, 0 \\ &= \lambda \, p_{\mathrm{baseline}}(x), \end{align*} and only then taking the natural logarithm, \begin{align*} \log \circ \, p (y) &= \log \big( \lambda \, p_{\mathrm{baseline}}(x) \big) \\ &= \log (\lambda) + \log \circ \, p_{\mathrm{baseline}} (x). \end{align*}

In other words to ensure stable numerical calculations in practice we have to invoke conditional statements when implementing the discrete inflation model. For example in Stan we might write

// Loop over observations
for (n in 1:N) {
  // Check equality with inflated value
  if (x[n] == x_inf) {
    target += log_sum_exp(log(lambda) + log_p_baseline(x[n]),
                          log(1 - lambda));
  } else {
    target += log(lambda) + log_p_baseline(x[n]);
  }
}

2.1.2 Continuous Inflation Models

Unfortunately Dirac probability distributions are less well-behaved on continuous spaces. The problem is that Dirac probability distributions are not absolutely continuous with respect to the natural reference measures, such as the Lebesgue measure on a given real space. Consequently we cannot define an appropriate probability density function.

A common heuristic for denoting a continuous inflation model uses the Dirac delta function \delta(x). Recall that the Dirac delta function is defined by the integral action \int \mathrm{d} x \, \delta(x) f(x) = f(0) and doesn’t actually have a well-defined point-wise output.

Using a Dirac delta function we can heuristically write a continuous inflation model with a single inflated value x_{\mathrm{inf}} as p(x) = \lambda \, p_{\mathrm{baseline}}(x) + (1 - \lambda) \, \delta(x - x_{\mathrm{inf}}) but, because we cannot evaluate \delta(x - x_{\mathrm{inf}}), we actually cannot evaluate p(x).

In order to formally implement a continuous inflation model we need to break the ambient space X into two pieces, once containing the inflated value x_{\mathrm{inf}} and one containing everything else X \setminus x_{\mathrm{inf}}. We can then treat \{ x_{\mathrm{inf}}\} as a discrete space equipped with a counting reference measure and and X \setminus x_{\mathrm{inf}} as a continuous space equipped with, for example, a Lebesgue reference measure.

If \pi_{\mathrm{baseline}} is absolutely continuous with respect to a Lebesgue reference measure on X then the probability allocated to any singular inflated value will be zero \pi_{\mathrm{baseline}}( \{ x_{\mathrm{inf}} \} ) = 0. Consequently the baseline component will effectively define the same probability distribution on X and X \setminus x_{\mathrm{inf}}, and we can represent both with the same probability density function. In other words for x \ne x_{\mathrm{inf}} the inflation model can be specified by the probability density function p(x) = \lambda \, p_{\mathrm{baseline}}(x).

On the other hand for x = x_{\mathrm{inf}} we have to treat the mixture probability density function as a probability mass function, \begin{align*} p( x_{\mathrm{inf}} ) &= \pi( \{ x_{\mathrm{inf}} \} ) \\ &= \lambda \, \pi_{\mathrm{baseline}}( \{ x_{\mathrm{inf}} \} ) + (1 - \lambda) \, \delta_{x_{\mathrm{inf}}}( \{ x_{\mathrm{inf}} \} ) \\ &= \lambda \, 0 + (1 - \lambda) \, 1 \\ &= (1 - \lambda). \end{align*}

When working with N independent instances it’s convenient to first separate the values into the N_{\mathrm{inf}} instances that exactly equal the inflated value, v_{1}, \ldots, v_{N_{\mathrm{inf}}}, and the remaining N - N_{\mathrm{inf}} observations that don’t, w_{1}, \ldots, w_{N - N_{\mathrm{inf}}}. This allows us to simplify the joint probability density function into the form \begin{align*} p(x_{1}, \ldots, x_{N}) &= \prod_{n = 1}^{N} p(x_{n}) \\ &= \prod_{n = 1}^{N_{\mathrm{inf}}} p(v_{n}) \, \prod_{n = 1}^{N - N_{\mathrm{inf}}} p(w_{n}) \\ &= \prod_{n = 1}^{N_{\mathrm{inf}}} (1 - \lambda) \, \prod_{n = 1}^{N - N_{\mathrm{inf}}} \lambda \, p_{\mathrm{baseline}}(w_{n}) \\ &= (1 - \lambda)^{N_{\mathrm{inf}}} \, \lambda^{N - N_{\mathrm{inf}}} \, \prod_{n = 1}^{N - N_{\mathrm{inf}}} p_{\mathrm{baseline}}(w_{n}) \\ &\propto \mathrm{Binomial}(N_{\mathrm{inf}} \mid N, 1 - \lambda) \, \prod_{n = 1}^{N - N_{\mathrm{inf}}} p_{\mathrm{baseline}}(w_{n}). \end{align*}

In other words the continuous inflation model completely decouples into a Binomial model for the number of inflated values and a baseline model for the non-inflated values! Moreover, because these models are independent of each other they can be implemented together or separately without impacting any inferences.

The intuition here is that when evaluating the mixture model on the inflated value the contribution of the inflating component is always infinitely larger than the contribution from any other components. If we encounter x_{\mathrm{inf}} then we know that it has to be associated with the inflated component, and if we observe any other value then we know that it has to be associated with another component. Because of this lack of ambiguity we can always separate the inflated and non-inflated values and model them separately without compromising the consistency of our inferences.

2.2 Categorical and Multinomial Mixture Models

Categorical probability distributions are relatively simple objects, but mixtures of categorical probability distributions exhibit a unique property that can be of use in some circumstances.

Recall that every categorical probability distribution over a set of K unstructured elements is defined by K atomic probabilities \mathrm{categorical}( \{ k \} \mid \mathbf{q}) = q_{k}. Conveniently we can also write this as a mixture model where every component probability distribution is a Dirac probability distribution, \mathrm{categorical}( \mathsf{z} \mid \mathbf{q} ) = \sum_{k = 1}^{K} q_{k} \, \delta_{ \{ k \} }( \mathsf{z} ), or equivalently a mixture probability mass function, \mathrm{categorical}( z \mid \mathbf{q}) = \sum_{k = 1}^{K} q_{k} \, \delta_{ k }( z ), where the Dirac probability distribution and probability mass function behave as in Section 2.1.1.

Interestingly the mixture of two different categorical probability distributions is always another categorical probability distribution, \begin{align*} \lambda \, \mathrm{categorical}( z \mid \mathbf{q}_{1}) & \\ + (1 - \lambda) \, \mathrm{categorical}( z \mid \mathbf{q}_{2}) &= \lambda \, \sum_{k = 1}^{K} q_{1k} \, \delta_{ k }( z ) + (1 - \lambda) \, \sum_{k = 1}^{K} q_{2k} \, \delta_{ k }( z ) \\ &= \sum_{k = 1}^{K} \left[ \lambda \, q_{1k} + (1 - \lambda) \, q_{2k} \right] \, \delta_{ k }( z ) \\ &= \mathrm{categorical}( z \mid \lambda \, \mathbf{q}_{1} + (1 - \lambda) \, \mathbf{q}_{2}). \end{align*} Indeed the mixture of any number of categorical probability distributions defines another categorical probability distribution, \begin{align*} \sum_{j = 1}^{J} \lambda_{j} \mathrm{categorical}( z \mid \mathbf{q}_{j} ) &= \sum_{j = 1}^{J} \lambda_{j} \, \sum_{k = 1}^{K} q_{jk} \, \delta_{ k }( z ) \\ &= \sum_{k = 1}^{K} \left[ \sum_{j = 1}^{J} \lambda_{j} \, q_{jk} \right] \, \delta_{ k }( z ) \\ &= \mathrm{categorical}( z \mid \textstyle \sum_{j = 1}^{J} \lambda_{j} \, \mathbf{q}_{j} ). \end{align*} In other words mixing categorical probability distributions is equivalent to mixing the corresponding simplex configurations together.

That said the closure of categorical probability distributions under mixtures, and the corresponding duality between probability distribution mixtures and simplex configuration mixtures, is exceptional. Consider, for example, the multinomial probability distribution over the total counts over an ensemble of identical and independent categorical instances, \mathrm{multinomial}( n_{1}, \ldots, n_{K} \mid \mathbf{q}) = N! \prod_{k = 1}^{K} \frac{ q_{k}^{n_{k}} }{ n_{k}! }. While mixing simplex configurations together is equivalent to mixing the corresponding categorical probability distributions together, \begin{align*} \mathrm{categorical} & ( z \mid \textstyle \sum_{j = 1}^{J} \lambda_{j} \, \mathbf{q}_{j} ) \\ &= \sum_{j = 1}^{J} \lambda_{j} \mathrm{categorical}( z \mid \mathbf{q}_{j} ), \end{align*} it is not equivalent to mixing the corresponding multinomial probability distributions together (Figure 2) \begin{align*} \mathrm{multinomial} & ( n_{1}, \ldots, n_{K} \mid \textstyle \sum_{j = 1}^{J} \lambda_{j} \, \mathbf{q}_{j} ) \\ &\ne \sum_{j = 1}^{J} \lambda_{j} \mathrm{multinomial}( n_{1}, \ldots, n_{K} \mid \mathbf{q}_{j})! \end{align*}

Figure 2: Mixing categorical probability distributions is equivalent to mixing the defining simplex configurations, but this property does not generalize to other probability distributions derived from categorical probability distributions, such as the multinomial distribution. For example a mixture of the two binomial probability distributions \mathrm{binominal}(n \mid N, p_{1}) and \mathrm{binominal}(n \mid N, p_{2}) is not the same as the binomial distribution defined by the mixture of p_{1} and p_{2} even though \lambda \, \mathrm{Bernoulli}(p_{1}) + (1 - \lambda) \, \mathrm{Bernoulli}(p_{2}) is the same as \mathrm{Bernoulli}(\lambda \, p_{1} + (1 - \lambda) \, p_{2}).

The left-hand side \mathrm{multinomial}( n_{1}, \ldots, n_{K} \mid \textstyle \sum_{j = 1}^{J} \lambda_{j} \, \mathbf{q}_{j} ) corresponds to heterogeneous component categorical probability distributions. Specifically it implies that a proportion \lambda_{j} of the N component categorical probability distributions are configured by \mathbf{q}_{j}. These heterogeneous contributions to the counts wash out the detail of each component model, always resulting in a uni-modal probability distribution.

On the other side \sum_{j = 1}^{J} \lambda_{j} \mathrm{multinomial}( n_{1}, \ldots, n_{K} \mid \mathbf{q}_{j}) corresponds to homogeneous component categorical probability distributions. For each simplex configuration \mathbf{q}_{j} the corresponding multinomial distribution concentrates around the count values q_{jk} \, N. The mixture model overlays these distinct peaks on top of each other, resulting in a multi-modal probability distribution.

In general we have to take care to avoid misinterpreting models defined by mixtures of parameters as probabilistic mixture models.

2.3 Continuous Mixture Models

So far we have considered mixture models with only a finite number of components. These models are defined by the joint probability density function p(x, z) = p(x \mid z) \, p(z) with z \in Z = \{ 1, \ldots, k, \ldots, K \}, or equivalently the marginal probability density function \begin{align*} p(x) &= \sum_{k = 1}^{K} p(x \mid k) \, p(k) \\ &= \sum_{k = 1}^{K} p_{k}(x) \, \lambda_{k} \\ &= \sum_{k = 1}^{K} \lambda_{k} \, p_{k}(x). \end{align*}

We can immediately extend this construction to mixture models with a countably infinite number of components, p(x, z) = p(x \mid z) \, p(z) with z \in Z = \{ 1, \ldots, k, \ldots, \}, or equivalently \begin{align*} p(x) &= \sum_{k = 1}^{\infty} p(x \mid k) \, p(k) \\ &= \sum_{k = 1}^{\infty} p_{k}(x) \, \lambda_{k} \\ &= \sum_{k = 1}^{\infty} \lambda_{k} \, p_{k}(x). \end{align*}

These models are referred to as discrete mixture models because the categorical assignment variables take values in discrete spaces. To be more precise they can also be referred to as finite discrete mixture models and infinite discrete mixture models, respectively.

Mathematics, however, allows us to take this construction even further: by allowing the assignment variable to take values in an arbitrary space we can build mixture models from an arbitrary number of components. For example if Z = \mathbb{R} then the joint model p(x, z) = p(x \mid z) \, p(z). can be interpreted as a mixture model with an uncountably infinite number of component models, each of which is specified by the conditional probability density function p(x \mid z) and weighted by not a probability but rather a probability density p(z).

In this case the marginal mixture model is defined not by summation but rather integration, \begin{align*} p(x) &= \int \mathrm{d} z \, p(z) \, p(x \mid z) \\ &= \int \mathrm{d} z \, p(x \mid z) \, p(z). \end{align*}

Because the assignment variables now take values in a continuous space mixture models of this form are, unsurprisingly, referred to as continuous mixture models.

From a practical perspective continuous mixture models are a bit of a double-edged sword. On one hand the integration needed to evaluate the marginal model p(x) can be implemented analytically only in exceptional circumstances. On the other hand if X is also a continuous space then the joint model p(x, z) will be defined over a completely continuous product space and we can use gradient-based tools like Hamiltonian Monte Carlo to characterize it directly.

3 Bayesian Mixture Models

When working with probabilistic models defined over products spaces we have a variety of ways that we can employ mixture modeling. For example we can use mixtures to model the behavior of the entire product space. At the same time we can use mixtures to model the behavior of individual components or even products of specific components.

This flexibility becomes especially useful in Bayesian modeling where our models are defined over a product of an observational space and a model configuration space. In particular we can apply mixture modeling techniques to both the prior model p( \theta ) and observational model p(y \mid \theta).

3.1 Mixture Prior Models

Applied prior modeling is often constrained by our ability to express our intricate domain expertise with the mathematically-convenient, but relatively simple, probability density functions to which we might be limited in practice. That said when we don’t have access to sufficiently flexible families of probability density functions we can often engineer some of the desired flexibility by combining the available probability density functions into mixture prior models, p(\theta) = \sum_{k = 1}^{K} \lambda_{k} \, p_{k}(\theta).

For example let’s say that we want to build a principled prior model for a behavior modeled by the parameter \theta, but we have access to only normal probability density functions. In this case a unimodal normal prior model is sufficient only if we want to suppress parameter configurations outside of a single interval (Figure 3).

Figure 3: A single normal prior model is useful for quantifying domain expertise that is consistent with a connected interval of parameter configurations.

If our domain expertise is consistent with parameter configurations that concentrate within multiple disjoint intervals, however, then no single normal prior model will be sufficient. We can build a normal prior model that captures any one interval, but only at the expense of the others (Figure 4 (a)). At the same time a normal prior model that expands to contain all of the consistent intervals will end up including the inconsistent behaviors in between (Figure 4 (b)). On the other hand a mixture of normal prior models can accommodate the disjoint intervals without issue (Figure 4 (c)).

(a)
(b)
(c)
Figure 4: A single normal prior model is less useful for quantifying domain expertise that is consistent with multiple, disconnected intervals of parameter configurations. The best we can do is engineer a normal prior model (a) that concentrates on one of the intervals but neglects the others (b) or spans all of the intervals but includes all of the inconsistent configurations between them. (c) Multiple normal prior models, however, can be combined into a mixture prior model that readily accommodates the disjoint domain expertise.

Mixture models can also be useful for engineering more sophisticated unimodal behaviors. For example we can use a mixture of normal prior models to approximate more leptokurtic or “heavy-tailed” behaviors (Figure 5 (a)), platykurtic or “light-tailed” behaviors (Figure 5 (b)), and even asymmetric behaviors (Figure 5 (c)).

(a)
(b)
(c)
Figure 5: Mixture prior models can be used to engineer more sophisticated behavior from simpler component models. For example a mixture of normal prior models can be more (a) heavy-tailed, (b) light-tailed, or (c) asymmetric than any single normal prior model.

All of this said, multimodal domain expertise is not particularly common in practice. Unimodal domain expertise that cannot be captured by a normal prior model is more common but, with the benefit of expressive probabilistic programming languages and contemporary probabilistic computation tools, we can usually implement sufficiently flexible probability density functions directly. Prior mixture modeling was much more useful in times past when Bayesian analyses were limited to calculations that could be made analytically or with limited computational resources.

Continuous mixture prior models, however, still play a critical role in the modeling of domain expertise subject to certain ignorance constraints, as we’ll learn about in the chapter on hierarchical modeling.

3.2 Mixture Observational Models

One of the most powerful uses of mixtures models in Bayesian analyses is engineering observational models that combine multiple data generating processes together, p(y \mid \theta) = \sum_{k = 1}^{K} \lambda_{k} \, p_{k}(y \mid \theta), when we don’t know which might be responsible for any given observation. These mixture observational models allow us to, for example, model measurements that have been contaminated by undesired but unavoidable behaviors, a situation that arises all too often in practice.

For example if p_{s}(y \mid \theta_{s}) models a desired signal and p_{b}(y \mid \theta_{b}) models an irreducible background then the mixture observational model p(y \mid \theta_{s}, \theta_{b}) = \lambda_{s} \, p_{s}(y \mid \theta_{s}) + (1 - \lambda_{s}) \, p_{b}(y \mid \theta_{b}) models their overlap, allowing us to probabilistically separate the signal from the background.

Similarly consider a normal observational model \text{normal}(y \mid, \mu, \sigma) where the measurement variability \sigma can vary unpredictably from measurement to measurement. If we can quantify the distribution of measurement variabilities then we can model this data generating behavior with a continuous mixture model, for example \begin{align*} p(y \mid \mu, \tau) &= \int_{0}^{\infty} \mathrm{d} z \, \text{normal} \bigg(y \mid \mu, \sqrt{z} \bigg) \, \text{exponential} \bigg( z \mid \frac{1}{2 \tau^{2}} \bigg) \\ &= \text{Laplace}(y \mid \mu, \tau). \end{align*}

While mathematically convenient, taking exponentially distributed squared measurement variabilities is a strong assumption that we usually don’t have the domain expertise to confidently motivate. Instead in practice this Laplace observational model is often used more heuristically to accommodate extreme “outliers” that contaminate the real-valued observations of interest.

4 Mixture Observational Model Inferences

In practice mixture observational models are typically used in applications where the neither the component observational models nor their probabilities are known precisely. Consequently the inferential challenge is to infer all of these behaviors from the observed data at the same time.

Theoretically we can always infer the component observational model behaviors jointly with the unknown component assignments, \begin{align*} p(\mathbf{z}, \boldsymbol{\lambda}, \boldsymbol{\theta} \mid \tilde{\mathbf{y}} ) &\propto p(\tilde{\mathbf{y}}, \mathbf{z}, \boldsymbol{\lambda}, \boldsymbol{\theta} ) \\ &\propto \bigg[ p(\tilde{\mathbf{y}} \mid \mathbf{z}, \boldsymbol{\theta} ) \, p( \mathbf{z} \mid \boldsymbol{\lambda} ) \bigg] \, p( \boldsymbol{\lambda} ) \, p( \boldsymbol{\theta} ) \\ &\propto \prod_{n = 1}^{N} \bigg[ p_{z_{n}}(\tilde{y}_{n} \mid \theta_{z_{n}} ) \, \lambda_{z_{n}} \bigg] \, p( \boldsymbol{\lambda} ) \, p( \boldsymbol{\theta} ). \end{align*}

That said in practice it is almost always easier to infer the component probabilities and component model configurations directly from the marginalized model, \begin{align*} p(\boldsymbol{\lambda}, \boldsymbol{\theta} \mid \tilde{\mathbf{y}} ) &\propto p(\tilde{\mathbf{y}}, \boldsymbol{\lambda}, \boldsymbol{\theta} ) \\ &\propto \bigg[ p(\tilde{\mathbf{y}} \mid \boldsymbol{\lambda}, \boldsymbol{\theta} ) \, \bigg] \, p( \boldsymbol{\lambda} ) \, p( \boldsymbol{\theta} ) \\ &\propto \prod_{n = 1}^{N} \bigg[ \sum_{k = 1}^{K} p_{k}(\tilde{y}_{n} \mid \theta_{k} ) \, \lambda_{k} \bigg] \, p( \boldsymbol{\lambda} ) \, p( \boldsymbol{\theta} ). \end{align*} When needed any component assignment can always be recovered with an application of Bayes’ Theorem, p(z_{n} = k \mid \tilde{y}_{n}, \boldsymbol{\lambda}, \boldsymbol{\theta}) = \frac{ p_{k}(\tilde{y}_{n} \mid \theta_{k}) \, \lambda_{k} }{ \sum_{k' = 1}^{K} p_{k'}(\tilde{y}_{n} \mid \theta_{k}) \, \lambda_{k} }.

From a mathematical perspective the behavior of mixture observational model inferences is relatively straightforward. For example if the configurations of the component observational models are fixed then any observations that are more consistent with a particular component model, \frac { \mathrm{d} \pi_{k , \theta_{k}} } { \mathrm{d} \pi_{k', \theta_{k'}} } (\tilde{y}_{n}) = \frac { p_{k}(\tilde{y}_{n} \mid \theta_{k}) } { p_{k'}(\tilde{y}_{n} \mid \theta_{k'}) } > 1 for all k' \ne k, will push the posterior distribution to concentrate on larger values of \lambda_{k} and smaller values of the remaining \lambda_{k'}.

More generally the posterior distribution will have to account for the interactions between the component probabilities \boldsymbol{\lambda} and the component model configuration parameters \boldsymbol{\theta}. When the component observational models are not too redundant, with each component model responsible for a mostly unique set of behaviors, then then these interactions will manifest as non-degenerate posterior distribution that are straightforward to quantify with tools like Hamiltonian Monte Carlo.

If the component observational models exhibit substantial redundancy, however, then the resulting posterior inferences will be much less pleasant. The problem is that the mixture observational model will be able to contort itself in a variety of different ways, all of which are consistent with the observed data. This yields complex, and often multi-modal, degeneracies that frustrate accurate posterior computation. Recall that multi-modal degeneracies are particularly obstructive as Markov chains with poorly chosen initializations can be trapped by even small modes (Figure 6).

Figure 6: Multi-modal posterior distributions are especially frustrating for Markov chain Monte Carlo. No matter how little target probability is allocated to the neighborhood of a mode it can still attract Markov chains for arbitrarily long times if the Markov chains are initialized too close to its basin of attraction.

Redundancy in a mixture observational model is at its highest when two or more of the component observational models are the same. In this case we can permute the corresponding component indices without affecting any of the model evaluations. If K_{r} component models are the same then the likelihood function will always exhibit K_{r}! distinct modes, one for each permutation of the redundant indices (Figure 7).

 

(a)

 

 

(b)
(c)

 

Figure 7: Mixture observational models with K equal component observational models are formally non-identified – every model configuration is accompanied by K! - 1 other model configurations that define exactly the same observational model. These redundant model configurations manifest as distinct modes in the likelihood function; for example a mixture observational model with two identical normal component models always results in (a) a bimodal likelihood function where (b) every model configuration in one mode is paired with (c) a model configuration in another mode where the component models are swapped.

That said non-redundant mixture observational models are not guaranteed to be well-behaved either. Degeneracies can also arise when component observational models are mathematically distinct but contain similar qualitative behaviors.

Consider, for example, a mixture observational model over positive integers that inflates a baseline Poisson model, \mathrm{Poisson}(y \mid \lambda), with a Dirac probability distribution concentrating at zero, \delta_{0}(y).

If \lambda is far from zero then these two component model capture distinct behaviors, with the former concentrating on values above zero and the latter on values exactly at zero (Figure 8 (a)). When \lambda is close to zero, however, the two component models will both concentrate at zero (Figure 8 (b)) and observations of y = 0 will be able to distinguish between them.

 

(a)
(b)
Figure 8: Discrete inflation observational models are prone to degeneracies when the baseline model can contort itself to mimic the inflation model. (a) When \lambda is much larger than zero a Poisson model is not easily confused with a Dirac model concentrating at zero. (b) If \lambda is close to zero, however, then the two component observational models are nearly indistinguishable.

In my opinion the most productive approach to mixture observational modeling is to treat each component as a model for a distinct data generating process. Any domain expertise that distinguishes the possible data generating behaviors from each other can, and should, be directly incorporated into either the structure of the component observational models or the prior model to avoid as much inferential degeneracy as possible.

5 Demonstrations

Enough of the generality. In this section we’ll demonstrate the basic mixture modeling techniques presented in this chapter with a sequence of demonstrative Bayesian applications using mixture observational models.

5.1 Setup

Always we start by setting up our local environment.

par(family="serif", las=1, bty="l",
    cex.axis=1, cex.lab=1, cex.main=1,
    xaxs="i", yaxs="i", mar = c(5, 5, 3, 1))

library(rstan)
rstan_options(auto_write = TRUE)            # Cache compiled Stan programs
options(mc.cores = parallel::detectCores()) # Parallelize chains
parallel:::setDefaultClusterOptions(setup_strategy = "sequential")
util <- new.env()
source('mcmc_analysis_tools_rstan.R', local=util)
source('mcmc_visualization_tools.R', local=util)

5.2 Separating Signal and Background

For our first exercise let’s consider a simplified version of a particle physics experiment where we observe individual energy depositions in a detector. The only problem is that each depositions can come from either an irreducible background or a signal of interest, and we are always ignorant to which source is responsible for any given deposition. Fortunately we can model this overlap of signal and background sources with a mixture observational model.

We’ll model the background with a steeply falling exponential probability density function and the signal with a Cauchy probability density function that can be derived as an approximation to certain physical processes. The precise configuration of the signal and background models in this example is somewhat arbitrary, especially without units. That said the high background probability and peaked signal is emulative of actual particle physics experiments.

mu_signal <- 45
sigma_signal <- 5
beta_back <- 20
lambda <- 0.95

xs <- seq(0, 100, 1)
ys <- lambda * dexp(xs, 1 / beta_back) +
      (1 - lambda) * dcauchy(xs, mu_signal, sigma_signal)

par(mfrow=c(1, 1), mar=c(5, 5, 3, 1))

plot(xs, ys, lwd=2, type="l", col=util$c_dark,
     main="Observational Model",
     xlab="y", ylab="Probability Density")

Simulating data from this model is straightforward using the techniques introduced in Section 1.4.

simu\_signal\_background.stan
data {
  int<lower=1> N;      // Number of observations
}

transformed data {
  real mu_signal =  45;                 // Signal location
  real<lower=0> sigma_signal = 5;       // Signal scale
  real beta_back = 20;                  // Background rate
  real<lower=0, upper=1> lambda = 0.95; // Background probability
}

generated quantities {
  array[N] real<lower=0> y = rep_array(-1, N);

  for (n in 1:N) {
    if (bernoulli_rng(lambda)) {
      y[n] = exponential_rng(1 / beta_back);
    } else {
      // Truncate signal to positive values
      while (y[n] < 0) {
        y[n] = cauchy_rng(mu_signal, sigma_signal);
      }
    }
  }
}
N <- 500

simu <- stan(file="stan_programs/simu_signal_background.stan",
             algorithm="Fixed_param",
             data=list("N" = N), seed=8438338,
             warmup=0, iter=1, chains=1, refresh=0)

data <- list("N" = N,
             "y" = extract(simu)$y[1,])

Because of the overwhelming background contribution it’s hard to make out the meager signal by eye.

par(mfrow=c(1, 1), mar=c(5, 5, 1, 1))

util$plot_line_hist(data$y, 0, 125, 5, xlab="y", prob=TRUE)
Warning in check_bin_containment(bin_min, bin_max, values): 1 value (0.2%) fell
above the binning.

Perhaps we can use Bayesian inference to tease out the signal?

Note that the prior model here is a bit arbitrary due to the nature of the exercise. In a practical analysis we would want to take care to elicit at least some basic domain expertise about the behaviors of each component observational model. This is especially true if the component models have any potential for redundant behaviors.

signal\_background1.stan
data {
  // Signal and background observations
  int<lower=1> N;
  array[N] real<lower=0> y;
}

parameters {
  real mu_signal;                // Signal location
  real<lower=0> sigma_signal;    // Signal scale
  real<lower=0> beta_back;       // Background scale
  real<lower=0, upper=1> lambda; // Background probability
}

model {
  // Prior model
  mu_signal ~ normal(50, 50 / 2.32);   // 0 <~ mu_signal    <~ 100
  sigma_signal ~ normal(0, 25 / 2.57); // 0 <~ sigma_signal <~  25
  beta_back ~ normal(0, 50 / 2.57);    // 0 <~ beta_back    <~  50
  // Implicit uniform prior density function for lambda

  // Observational model
  for (n in 1:N) {
    target += log_mix(lambda,
                      exponential_lpdf(y[n] | 1 / beta_back),
                      cauchy_lpdf(y[n] | mu_signal, sigma_signal));
  }
}

generated quantities {
  array[N] real<lower=0> y_pred = rep_array(-1, N);

  for (n in 1:N) {
    if (bernoulli_rng(lambda)) {
      y_pred[n] = exponential_rng(1 / beta_back);
    } else {
      while (y_pred[n] < 0) {
        y_pred[n] = cauchy_rng(mu_signal, sigma_signal);
      }
    }
  }
}
fit <- stan(file="stan_programs/signal_background1.stan",
            data=data, seed=8438338,
            warmup=1000, iter=2024, refresh=0)

The diagnostics exhibit some mild \hat{\xi} warnings, but nothing too concerning so long as we’re not interested in calculating the expectation value of mu_signal.

diagnostics <- util$extract_hmc_diagnostics(fit)
util$check_all_hmc_diagnostics(diagnostics)
  All Hamiltonian Monte Carlo diagnostics are consistent with reliable
Markov chain Monte Carlo.
samples <- util$extract_expectand_vals(fit)
base_samples <- util$filter_expectands(samples,
                                       c('mu_signal', 'sigma_signal',
                                         'beta_back', 'lambda'))
util$check_all_expectand_diagnostics(base_samples)
mu_signal:
  Chain 1: Right tail hat{xi} (0.253) exceeds 0.25.
  Chain 3: Right tail hat{xi} (0.423) exceeds 0.25.
  Chain 4: Right tail hat{xi} (0.378) exceeds 0.25.


Large tail hat{xi}s suggest that the expectand might not be
sufficiently integrable.

Even better there are no indications of retrodictive tension that would suggest inadequacies in our model.

par(mfrow=c(1, 1), mar=c(5, 5, 1, 1))

util$plot_hist_quantiles(samples, 'y_pred', 0, 125, 5,
                         baseline_values=data$y, xlab="y")
Warning in check_bin_containment(bin_min, bin_max, collapsed_values,
"predictive value"): 6981 predictive values (0.3%) fell above the binning.
Warning in check_bin_containment(bin_min, bin_max, baseline_values, "observed
value"): 1 observed value (0.2%) fell above the binning.

Our posterior inferences are not too bad, especially considering the relatively small number of observations and overwhelming background contribution. Note also the heavy tails of the marginal posterior distribution for mu_signal, consistent with the \hat{\xi} warnings.

par(mfrow=c(2, 2), mar=c(5, 5, 1, 1))

util$plot_expectand_pushforward(samples[['mu_signal']], 25,
                                display_name="mu_signal",
                                baseline=mu_signal,
                                baseline_col=util$c_mid_teal)

util$plot_expectand_pushforward(samples[['sigma_signal']], 25,
                                display_name="sigma_signal",
                                baseline=sigma_signal,
                                baseline_col=util$c_mid_teal)

util$plot_expectand_pushforward(samples[['beta_back']], 25,
                                display_name="beta_back",
                                baseline=beta_back,
                                baseline_col=util$c_mid_teal)

util$plot_expectand_pushforward(samples[['lambda']], 25,
                                display_name="lambda",
                                baseline=lambda,
                                baseline_col=util$c_mid_teal)

We can also directly visualize the inferred behaviors of the signal model, the background model, and their combinations.

library(colormap)
nom_colors <- c("#DCBCBC", "#C79999", "#B97C7C",
                "#A25050", "#8F2727", "#7C0000")
line_colors <- colormap(colormap=nom_colors, nshades=20)

cs <- c(1, 2, 3, 4)
ss <- c(1, 250, 500, 750, 1000)

plot_signal_realizations <- function() {
  n <- 1
  for (c in cs) {
    for (s in ss) {
      ms <- samples[['mu_signal']][c, s]
      ss <- samples[['sigma_signal']][c, s]
      l <- samples[['lambda']][c, s]

      ys <- (1 - l) * dcauchy(xs, ms, ss)
      lines(xs, ys, lwd=2, col=line_colors[n])
      n <- n + 1
    }
  }
  ys <- (1 - lambda) * dcauchy(xs, mu_signal, sigma_signal)
  lines(xs, ys, lwd=4, col="white")
  lines(xs, ys, lwd=2, col=util$c_dark_teal)
}

plot_back_realizations <- function() {
  n <- 1
  for (c in cs) {
    for (s in ss) {
      bb <- samples[['beta_back']][c, s]
      l <- samples[['lambda']][c, s]

      ys <- l * dexp(xs, 1 / bb)
      lines(xs, ys, lwd=2, col=line_colors[n])
      n <- n + 1
    }
  }
  ys <- lambda * dexp(xs, 1 / beta_back)
  lines(xs, ys, lwd=4, col="white")
  lines(xs, ys, lwd=2, col=util$c_dark_teal)
}

plot_sum_realizations <- function() {
  n <- 1
  for (c in cs) {
    for (s in ss) {
      ms <- samples[['mu_signal']][c, s]
      ss <- samples[['sigma_signal']][c, s]
      bb <- samples[['beta_back']][c, s]
      l <- samples[['lambda']][c, s]

      ys <- l * dexp(xs, 1 / bb) + (1 - l) * dcauchy(xs, ms, ss)
      lines(xs, ys, lwd=2, col=line_colors[n])
      n <- n + 1
    }
  }
  ys <- (1 - lambda) * dcauchy(xs, mu_signal, sigma_signal) +
        lambda * dexp(xs, 1 / beta_back)
  lines(xs, ys, lwd=4, col="white")
  lines(xs, ys, lwd=2, col=util$c_dark_teal)
}

While the posterior uncertainties are large we do seem to be able to resolve the underlying signal through the overwhelming background!

par(mfrow=c(1, 3), mar=c(5, 5, 2, 1))

xs <- seq(0, 100, 0.5)

plot(NULL, main="Weighted Signal",
     xlab="y", ylab="Probability Density",
     xlim=range(xs), ylim=c(0, 0.05))
plot_signal_realizations()

plot(NULL, main="Weighted Background",
     xlab="y", ylab="Probability Density",
     xlim=range(xs), ylim=c(0, 0.05))
plot_back_realizations()

plot(NULL, main="Mixture",
     xlab="y", ylab="Probability Density",
     xlim=range(xs), ylim=c(0, 0.05))
plot_sum_realizations()

In many scientific applications the relative proportion of signal and background, and hence the component probabilities, are more relevant than the probability that any particular observation arose from either source. Fortunately when the latter is of interest computing posterior assignment probabilities and simulating assignments is straightforward.

signal\_background2.stan
data {
  // Signal and background observations
  int<lower=1> N;
  array[N] real<lower=0> y;
}

parameters {
  real mu_signal;                // Signal location
  real<lower=0> sigma_signal;    // Signal scale
  real<lower=0> beta_back;       // Background scale
  real<lower=0, upper=1> lambda; // Background probability
}

model {
  // Prior model
  mu_signal ~ normal(50, 50 / 2.32);   // 0 <~ mu_signal    <~ 100
  sigma_signal ~ normal(0, 25 / 2.57); // 0 <~ sigma_signal <~  25
  beta_back ~ normal(0, 50 / 2.57);    // 0 <~ beta_back    <~  50
  // Implicit uniform prior density function for lambda

  // Observational model
  for (n in 1:N) {
    target += log_mix(lambda,
                      exponential_lpdf(y[n] | 1 / beta_back),
                      cauchy_lpdf(y[n] | mu_signal, sigma_signal));
  }
}

generated quantities {
  array[N] real<lower=0, upper=1> p;
  array[N] int<lower=0, upper=1> z_pred;

  for (n in 1:N) {
    vector[2] xs = [   log(lambda)
                     + exponential_lpdf(y[n] | 1 / beta_back),
                       log(1 - lambda)
                     + cauchy_lpdf(y[n] | mu_signal, sigma_signal) ]';
    p[n] = softmax(xs)[1];
    z_pred[n] = bernoulli_rng(p[n]);
  }
}
fit <- stan(file="stan_programs/signal_background2.stan",
            data=data, seed=8438338,
            warmup=1000, iter=2024, refresh=0)

Because the modified generated quantities block consumes pseudo-random numbers differently we need to double check the computational diagnostics. We do see a step size adaptation warning which suggests that the adaptation has changed in one of the Markov chains, but on its own this isn’t too problematic.

diagnostics <- util$extract_hmc_diagnostics(fit)
util$check_all_hmc_diagnostics(diagnostics)
  Chain 1: Average proxy acceptance statistic (0.676)
           is smaller than 90% of the target (0.801).

  A small average proxy acceptance statistic indicates that the
adaptation of the numerical integrator step size failed to converge.
This is often due to discontinuous or imprecise gradients.
samples <- util$extract_expectand_vals(fit)
base_samples <- util$filter_expectands(samples,
                                       c('mu_signal', 'sigma_signal',
                                         'beta_back', 'lambda'))
util$check_all_expectand_diagnostics(base_samples)
mu_signal:
  Chain 1: Right tail hat{xi} (0.516) exceeds 0.25.
  Chain 3: Right tail hat{xi} (0.420) exceeds 0.25.
  Chain 4: Both left and right tail hat{xi}s (0.252, 0.471) exceed 0.25.


Large tail hat{xi}s suggest that the expectand might not be
sufficiently integrable.

Consequently we can move on to analyzing the new behaviors.

plot_assignment <- function(idxs) {
  for (idx in idxs) {
    name <- paste0('p[', idx, ']')
    util$plot_expectand_pushforward(samples[[name]], 25,
                                    flim=c(-0.03, 1.03),
                                    display_name='Background Probability',
                                    main=paste("Observation", idx))
  }

  for (idx in idxs) {
    name <- paste0('z_pred[', idx, ']')
    zs <- c(samples[[name]], recursive=TRUE)
    util$plot_line_hist(zs, -0.03, 1.03, 0.02,
                        col=util$c_dark,
                        xlab="Background Assignment")
    abline(v=mean(zs), lwd=2, col=util$c_mid)
  }
}

Away from the peak of the signal the observations are strongly associated with the background model, with large individual background probabilities and a predominance of z = 1 samples.

par(mfrow=c(2, 4), mar=c(5, 5, 2, 1))

idxs <- which(60 < data$y)
plot_assignment(idxs[1:4])

Right at its peak the signal model has more of an influence, although the background model still dominates given its much higher base rate.

par(mfrow=c(2, 4), mar=c(5, 5, 2, 1))

idxs <-  which(45 < data$y & data$y < 55)
plot_assignment(idxs[1:4])

In general the component probabilities are always more informative than the sampled assignments. Consequently they are usually the best way to quantify the behavior of individual observations.

5.3 Zero-Inflated Poisson Model

To demonstrate discrete inflation models let’s next consider a zero-inflated Poisson observational model, often affectionately referred to as a “ZIP”. Once again we begin by simulating data from a particular configuration of the model.

simu\_zip.stan
data {
  int<lower=1> N;   // Number of observations
  real<lower=0> mu; // Poisson intensity
  real<lower=0, upper=1> lambda; // Main component probability
}

generated quantities {
  // Initialize predictive variables with inflated value
  array[N] int<lower=0> y = rep_array(0, N);

  for (n in 1:N) {
    // If we sample the non-inflating component then replace initial
    // value with a Poisson sample
    if (bernoulli_rng(lambda)) {
      y[n] = poisson_rng(mu);
    }
  }
}

Let’s start by simulating data from a configuration where the two component models are well-separated.

N <- 100
mu_true <- 7.5
lambda_true <- 0.8

simu <- stan(file="stan_programs/simu_zip.stan",
             algorithm="Fixed_param",
             data=list("N" = N,
                       "mu" = mu_true,
                       "lambda" = lambda_true),
             seed=8438338,
             warmup=0, iter=1, chains=1, refresh=0)

data <- list("N" = N,
             "y" = extract(simu)$y[1,])

Unsurprisingly we see two clear peaks in the observed data, one concentrating entirely at zero and one scattered across larger values.

par(mfrow=c(1, 1), mar=c(5, 5, 1, 1))

util$plot_line_hist(data$y,
                    bin_min=-0.5, bin_max=14.5, bin_delta=1,
                    xlab="y")

Because the data so clearly separate into two peaks we might hope that inferences will be straightforward.

zip1.stan
data {
  int<lower=1> N;          // Number of observations
  array[N] int<lower=0> y; // Positive integer observations
}

parameters {
  real<lower=0> mu;              // Poisson intensity
  real<lower=0, upper=1> lambda; // Main component probability
}

model {
  // Prior model
  mu ~ normal(0, 15 / 2.57); // 0 <~ mu <~ 15
  // Implicit uniform prior density function for lambda

  // Observational model
  for (n in 1:N) {
    if (y[n] == 0) {
      target += log_mix(lambda, poisson_lpmf(y[n] | mu), 0);
    } else {
      target += log(lambda) + poisson_lpmf(y[n] | mu);
    }
  }
}

generated quantities {
  // Initialize predictive variables with inflated value
  array[N] int<lower=0> y_pred = rep_array(0, N);

  for (n in 1:N) {
    // If we sample the non-inflating component then replace initial
    // value with a Poisson sample
    if (bernoulli_rng(lambda)) {
      y_pred[n] = poisson_rng(mu);
    }
  }
}
fit <- stan(file="stan_programs/zip1.stan",
            data=data, seed=8438338,
            warmup=1000, iter=2024, refresh=0)

Our first good sign is that there are no diagnostics warnings.

diagnostics <- util$extract_hmc_diagnostics(fit)
util$check_all_hmc_diagnostics(diagnostics)
  All Hamiltonian Monte Carlo diagnostics are consistent with reliable
Markov chain Monte Carlo.
samples <- util$extract_expectand_vals(fit)
base_samples <- util$filter_expectands(samples, c('mu', 'lambda'))
util$check_all_expectand_diagnostics(base_samples)
All expectands checked appear to be behaving well enough for reliable
Markov chain Monte Carlo estimation.

A retrodictive check with a histogram summary statistic also looks good, although retrodictive checks will always be pretty well-behaved when we’re fitting data simulated from the same model!

par(mfrow=c(1, 1), mar=c(5, 5, 1, 1))

util$plot_hist_quantiles(samples, 'y_pred',
                         bin_min=-0.5, bin_max=18.5, bin_delta=1,
                         baseline_values=data$y, xlab="y")
Warning in check_bin_containment(bin_min, bin_max, collapsed_values,
"predictive value"): 99 predictive values (0.0%) fell above the binning.

More importantly the posterior inferences are able to identify the true model configuration pretty precisely.

par(mfrow=c(1, 2), mar=c(5, 5, 1, 1))

util$plot_expectand_pushforward(samples[['mu']], 25,
                                display_name="mu",
                                baseline=mu_true,
                                baseline_col=util$c_mid_teal)

util$plot_expectand_pushforward(samples[['lambda']], 25,
                                display_name="lambda",
                                baseline=lambda_true,
                                baseline_col=util$c_mid_teal)

To make things harder let’s try again but with much stronger zero inflation.

N <- 100
mu_true <- 7.5
lambda_true <- 0.01

simu <- stan(file="stan_programs/simu_zip.stan",
             algorithm="Fixed_param",
             data=list("N" = N,
                       "mu" = mu_true,
                       "lambda" = lambda_true),
             seed=8438338,
             warmup=0, iter=1, chains=1, refresh=0)

Indeed the inflation is so strong that the simulated data is comprised entirely of zeros.

data <- list("N" = N,
             "y" = extract(simu)$y[1,])

table(data$y)

  0 
100 

What can we learn about the Poisson component model in this case?

fit <- stan(file="stan_programs/zip1.stan",
            data=data, seed=8438338,
            warmup=1000, iter=2024, refresh=0)

Unfortunately the diagnostics indicate some weak computational problems.

diagnostics <- util$extract_hmc_diagnostics(fit)
util$check_all_hmc_diagnostics(diagnostics)
  Chain 1: 8 of 1024 transitions (0.8%) diverged.

  Chain 2: 4 of 1024 transitions (0.4%) diverged.

  Chain 4: 2 of 1024 transitions (0.2%) diverged.

  Divergent Hamiltonian transitions result from unstable numerical
trajectories.  These instabilities are often due to degenerate target
geometry, especially "pinches".  If there are only a small number of
divergences then running with adept_delta larger than 0.801 may reduce
the instabilities at the cost of more expensive Hamiltonian
transitions.
samples <- util$extract_expectand_vals(fit)
base_samples <- util$filter_expectands(samples, c('mu', 'lambda'))
util$check_all_expectand_diagnostics(base_samples)
lambda:
  Chain 1: Right tail hat{xi} (0.899) exceeds 0.25.
  Chain 2: Right tail hat{xi} (0.892) exceeds 0.25.
  Chain 3: Right tail hat{xi} (1.036) exceeds 0.25.
  Chain 4: Right tail hat{xi} (1.222) exceeds 0.25.


Large tail hat{xi}s suggest that the expectand might not be
sufficiently integrable.

Following best practices we’ll follow up on the divergences by examining some relevant pair plots. Fortunately here there are only two parameters and hence one possible pair plot to consider.

util$plot_div_pairs(x_names="mu", y_names="lambda",
                    samples, diagnostics)

One immediate take away from this plot is the extreme posterior uncertainties. The zero-inflated Poisson observational model can accommodate zeros in two distinct ways. Firstly it can push \lambda to zero, turning off the baseline Poisson component but also leaving the intensity parameter \mu uninformed beyond the prior model. Secondly it can push \mu to zero so that the two component models become redundant, in which case \lambda becomes uninformed beyond the prior model. Ultimately the zero-inflated Poisson model with our initial, diffuse prior model is just too flexible to yield well-behaved inferences.

One way to avoid these strong uncertainties, and the resulting strain on the posterior computation, is to constrain the mixture observational model with additional domain expertise. For example any information on the strength of the inflation can inform a more concentrated prior model for \lambda. Similarly any information on the precise value of \mu may be able to suppress configurations where the two component models overlap.

Here let’s assume that our domain expertise is inconsistent with values of \mu below one. The only problem is that we now need a prior model for \mu that suppresses values both above 15 and below 1. Multiple families of probability density functions are applicable here, including the log normal, gamma, and inverse gamma families.

All of these families feature slightly different tail behaviors that have different advantages and disadvantages. For this analysis we’ll go with the inverse gamma family as it more heavily suppresses the smaller values of \mu where we know the component models become redundant.

To inform a particular inverse gamma configuration we can use Stan’s algebraic solver to find the configuration matching our desired tail behaviors.

prior\_tune.stan
functions {
  // Differences between inverse gamma tail
  // probabilities and target probabilities
  vector tail_delta(vector y, vector theta,
                    array[] real x_r, array[] int x_i) {
    vector[2] deltas;
    deltas[1] = inv_gamma_cdf(theta[1] | exp(y[1]), exp(y[2])) - 0.01;
    deltas[2] = 1 - inv_gamma_cdf(theta[2] | exp(y[1]), exp(y[2])) - 0.01;
    return deltas;
  }
}

data {
  real<lower=0>     y_low;
  real<lower=y_low> y_high;
}

transformed data {
  // Initial guess at inverse gamma parameters
  vector[2] y_guess = [log(2), log(5)]';
  // Target quantile
  vector[2] theta = [y_low, y_high]';
  vector[2] y;
  array[0] real x_r;
  array[0] int x_i;

  // Find inverse Gamma density parameters that ensure 
  // 1% probability below y_low and 1% probability above y_high
  y = algebra_solver(tail_delta, y_guess, theta, x_r, x_i);

  print("alpha = ", exp(y[1]));
  print("beta = ", exp(y[2]));
}

generated quantities {
  real alpha = exp(y[1]);
  real beta = exp(y[2]);
}
stan(file='stan_programs/prior_tune.stan',
     data=list("y_low" = 1, "y_high" = 15),
     iter=1, warmup=0, chains=1,
     seed=4838282, algorithm="Fixed_param")
alpha = 3.48681
beta = 9.21604

SAMPLING FOR MODEL 'anon_model' NOW (CHAIN 1).
Chain 1: Iteration: 1 / 1 [100%]  (Sampling)
Chain 1: 
Chain 1:  Elapsed Time: 0 seconds (Warm-up)
Chain 1:                0 seconds (Sampling)
Chain 1:                0 seconds (Total)
Chain 1: 
Inference for Stan model: anon_model.
1 chains, each with iter=1; warmup=0; thin=1; 
post-warmup draws per chain=1, total post-warmup draws=1.

      mean se_mean sd 2.5%  25%  50%  75% 97.5% n_eff Rhat
alpha 3.49      NA NA 3.49 3.49 3.49 3.49  3.49     0  NaN
beta  9.22      NA NA 9.22 9.22 9.22 9.22  9.22     0  NaN
lp__  0.00      NA NA 0.00 0.00 0.00 0.00  0.00     0  NaN

Samples were drawn using (diag_e) at Wed Oct  9 22:45:03 2024.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at 
convergence, Rhat=1).

With a more informative prior model in hand let’s try again.

zip2.stan
data {
  int<lower=1> N;          // Number of observations
  array[N] int<lower=0> y; // Positive integer observations
}

parameters {
  real<lower=0> mu;              // Poisson intensity
  real<lower=0, upper=1> lambda; // Main component probability
}

model {
  // Prior model
  mu ~ inv_gamma(3.5, 9.0); // 1   <~   mu   <~ 15
  // Implicit uniform prior density function for lambda

  // Observational model
  for (n in 1:N) {
    if (y[n] == 0) {
      target += log_mix(lambda, poisson_lpmf(y[n] | mu), 0);
    } else {
      target += log(lambda) + poisson_lpmf(y[n] | mu);
    }
  }
}

generated quantities {
  // Initialize predictive variables with inflated value
  array[N] int<lower=0> y_pred = rep_array(0, N);

  for (n in 1:N) {
    // If we sample the non-inflating component then replace initial
    // value with a Poisson sample
    if (bernoulli_rng(lambda)) {
      y_pred[n] = poisson_rng(mu);
    }
  }
}
fit <- stan(file="stan_programs/zip2.stan",
            data=data, seed=8438338,
            warmup=1000, iter=2024, refresh=0)

There is a lone \hat{\xi} warning, but more importantly the divergences are gone.

diagnostics <- util$extract_hmc_diagnostics(fit)
util$check_all_hmc_diagnostics(diagnostics)
  All Hamiltonian Monte Carlo diagnostics are consistent with reliable
Markov chain Monte Carlo.
samples <- util$extract_expectand_vals(fit)
base_samples <- util$filter_expectands(samples, c('mu', 'lambda'))
util$check_all_expectand_diagnostics(base_samples)
mu:
  Chain 1: Right tail hat{xi} (0.310) exceeds 0.25.


Large tail hat{xi}s suggest that the expectand might not be
sufficiently integrable.

Taking a quick look at the one relevant pair plot we see that the stronger prior model entirely suppresses the ridge where \mu is small and \lambda is poorly informed.

util$plot_div_pairs(x_names="mu", y_names="lambda",
                    samples, diagnostics)

That said while the posterior geometry is more well-behaved the inferences still leave much to desired. In particular with this more informative prior model the observations no longer inform \mu.

par(mfrow=c(1, 2), mar=c(5, 5, 1, 1))

util$plot_expectand_pushforward(samples[['mu']], 25,
                                display_name="mu",
                                baseline=mu_true,
                                baseline_col=util$c_mid_teal)

util$plot_expectand_pushforward(samples[['lambda']], 25,
                                display_name="lambda",
                                baseline=lambda_true,
                                baseline_col=util$c_mid_teal)

5.4 Zero/One-Inflated Beta Model

Now let’s explore what happens when we try to inflate a continuous baseline observational model. Continuous inflation models are often useful for modeling contamination in a data generating process, such as data-entry errors or unexpected/corrupted outcomes that have been coded with default values. For example data entry software that fills in all entries with zeros before allowing a user to overwrite that default value will give excess zeros if the user fails to enter all observed values. Similar missing observations that are coded with zero will also result in an excess of zeros.

Here let’s inflate both zero and one values in baseline beta observational model, giving what is known as a zero/one-inflated beta model. While less conventional than “ZIP” for a zero-inflated Poisson model, the short-hand “ZOIB” for this model is advocated by a small but passionate group.

Simulating data from a continuous inflation observational model proceeds exactly the same as for a discrete mixture model, and indeed any mixture model.

simu\_zoib.stan
data {
  int<lower=1> N; // Number of observations
}

transformed data {
  real alpha = 3;
  real beta = 2;
  simplex[3] lambda = [0.75, 0.15, 0.10]';
}

generated quantities {
  // Initialize predictive variables with inflated value
  array[N] real<lower=0, upper=1> y;

  for (n in 1:N) {
    int z = categorical_rng(lambda);

    if (z == 1) {
      y[n] = beta_rng(alpha, beta);
    } else if (z == 2) {
      y[n] = 0;
    } else {
      y[n] = 1;
    }
  }
}
N <- 100

simu <- stan(file="stan_programs/simu_zoib.stan",
             algorithm="Fixed_param",
             data=list("N" = N), seed=8438338,
             warmup=0, iter=1, chains=1, refresh=0)

data <- list("N" = N,
             "y" = extract(simu)$y[1,])

Because of the finite binning histogram visualizations of the data can make it difficult to distinguish between inflation near zero and one and inflation exactly at zero and one.

par(mfrow=c(1, 1), mar=c(5, 5, 1, 1))

util$plot_line_hist(data$y,
                    bin_min=0, bin_max=1.05, bin_delta=0.05,
                    xlab="y")

Empirical cumulative distribution functions, are better suited to identifying continuous inflation, which manifests as distinct jumps in the cumulative probabilities.

plot_ecdf <- function(vals, delta=0.01, xlab="") {
  N <- length(vals)
  ordered_vals <- sort(vals)

  xs <- c(ordered_vals[1] - delta,
          rep(ordered_vals, each=2),
          ordered_vals[N] + delta)

  ecdf_counts <- rep(0:N, each=2)

  plot(xs, ecdf_counts, type="l", lwd="2", col="black",
       xlab=xlab,
       ylim=c(0, N), ylab="ECDF (Counts)")
}
plot_ecdf(data$y, xlab="y")

As discussed in Section 2.1.2 the implementation of a continuous inflation model requires treating the inflation values as a discrete space separate from the other values. There are two ways to handle this.

Firstly we can model the inflated and non-inflated values jointly.

zoib1.stan
data {
  int<lower=1> N;                    // Number of observations
  array[N] real<lower=0, upper=1> y; // Unit-interval valued observations
}

transformed data {
  int<lower=0> N_zero = 0;
  int<lower=0> N_one = 0;
  int<lower=0> N_else = N;

  for (n in 1:N) {
    if (y[n] == 0) N_zero += 1;
    if (y[n] == 1) N_one  += 1;
  }

  N_else -= N_one + N_zero;
}

parameters {
  real<lower=0> alpha; // Beta shape
  real<lower=0>  beta; // Beta scale
  simplex[3]   lambda; // Component probabilities
}

model {
  // Prior model
  alpha ~ normal(0, 10 / 2.57); // 0 <~ alpha <~ 10
  beta ~ normal(0, 10 / 2.57);  // 0 <~ beta  <~ 10
  // Implicit uniform prior density function for lambda

  // Observational model
  target += multinomial_lpmf({N_else, N_zero, N_one} | lambda);

  for (n in 1:N) {
    if (0 < y[n] && y[n] < 1) {
      target += beta_lpdf(y[n] | alpha, beta);
    }
  }
}

generated quantities {
  array[N] real<lower=0, upper=1> y_pred;

  for (n in 1:N) {
    int z = categorical_rng(lambda);

    if (z == 1) {
      y_pred[n] = beta_rng(alpha, beta);
    } else if (z == 2) {
      y_pred[n] = 0;
    } else {
      y_pred[n] = 1;
    }
  }
}
fit <- stan(file="stan_programs/zoib1.stan",
            data=data, seed=8438338,
            warmup=1000, iter=2024, refresh=0)

The diagnostics show no problems.

diagnostics <- util$extract_hmc_diagnostics(fit)
util$check_all_hmc_diagnostics(diagnostics)
  All Hamiltonian Monte Carlo diagnostics are consistent with reliable
Markov chain Monte Carlo.
samples1 <- util$extract_expectand_vals(fit)
base_samples <- util$filter_expectands(samples1,
                                       c('alpha', 'beta', 'lambda'),
                                       check_arrays=TRUE)
util$check_all_expectand_diagnostics(base_samples)
All expectands checked appear to be behaving well enough for reliable
Markov chain Monte Carlo estimation.

Nor does the posterior retrodictive check.

par(mfrow=c(1, 1), mar=c(5, 5, 1, 1))

util$plot_hist_quantiles(samples1, 'y_pred',
                         bin_min=0, bin_max=1.05, bin_delta=0.05,
                         baseline_values=data$y, xlab="y")

Moreover our posterior inferences are both precise and accurate to the simulation data generating process. For instance the inferences are consistent with the true configuration of the baseline beta model used to simulate the data.

par(mfrow=c(1, 2), mar=c(5, 5, 1, 1))

util$plot_expectand_pushforward(samples1[['alpha']], 25,
                                display_name="alpha",
                                baseline=3,
                                baseline_col=util$c_mid_teal)

util$plot_expectand_pushforward(samples1[['beta']], 25,
                                display_name="beta",
                                baseline=2,
                                baseline_col=util$c_mid_teal)

Because we have three component models we can directly visualize the simplex of component probabilities. This offers a much more complete picture of our inferences than trying to visualize the marginal inferences for each component probability independently and neglecting the simplex constraints.

to_plot_coordinates <- function(q, C) {
  c(C * (q[2] - q[1]), q[3])
}

plot_simplex_border <- function(label_cex, C, center_label=TRUE) {
  lines( c(-C, 0), c(0, 1), lwd=3)
  lines( c(+C, 0), c(0, 1), lwd=3)
  lines( c(-C, +C), c(0, 0), lwd=3)

  text_delta <- 0.05
  text( 0, 1 + text_delta, "(0, 0, 1)", cex=label_cex)
  text(-C - text_delta, -text_delta, "(1, 0, 0)", cex=label_cex)
  text(+C + text_delta, -text_delta, "(0, 1, 0)", cex=label_cex)

  tick_delta <- 0.025
  lines( c(0, 0), c(0, tick_delta), lwd=3)
  text(0, 0 - text_delta, "(1/2, 1/2, 0)", cex=label_cex)

  lines( c(+C * 0.5, +C * 0.5 - tick_delta * 0.5 * sqrt(3)),
         c(0.5, 0.5 - tick_delta * 0.5), lwd=3)
  text(C * 0.5 + text_delta * 0.5 * sqrt(3) + 2.5 * text_delta,
       0.5 + text_delta * 0.5, "(0, 1/2, 1/2)", cex=label_cex)

  lines( c(-C * 0.5, -C * 0.5 + tick_delta * 0.5 * sqrt(3)),
         c(0.5, 0.5 - tick_delta * 0.5), lwd=3)
  text(-C * 0.5 - text_delta * 0.5 * sqrt(3) - 2.5 * text_delta,
       0.5 + text_delta * 0.5, "(1/2, 0, 1/2)", cex=label_cex)

  points(0, 1/3, col="white", pch=16, cex=1.5)
  points(0, 1/3, col="black", pch=16, cex=1)
  if (center_label)
    text(0, 1/3 - 1.5 * text_delta, "(1/3, 1/3, 1/3)", cex=label_cex)
}

plot_simplex_samples <- function(q1, q2, q3, label_cex=1,
                                 main="", baseline=NULL) {
  N <- 200
  C <- 1 / sqrt(3)

  plot(NULL, xlab="", ylab="", xaxt="n", yaxt="n", frame.plot=F,
       xlim=c(-(C + 0.2), +(C + 0.2)), ylim=c(-0.1, 1.1))
  plot_simplex_border(label_cex, C, FALSE)
  title(main)

  N <- min(length(q1), length(q2), length(q3))
  for (n in 1:N) {
    xy <- to_plot_coordinates(c(q1[n], q2[n], q3[n]), C)
    points(xy[1], xy[2], col="#8F272710", pch=16, cex=1.0)
  }
  if (!is.null(baseline)) {
    xy <- to_plot_coordinates(baseline, C)
    points(xy[1], xy[2], col="white", pch=16, cex=1.5)
    points(xy[1], xy[2], col=util$c_dark_teal, pch=16, cex=1)
  }
}

With this visualization we can clearly see the posterior inferences for the component probabilities concentrating around the true value.

par(mfrow=c(1, 1), mar=c(0, 0, 2, 0))

plot_simplex_samples(c(samples1[['lambda[1]']], recursive=TRUE),
                     c(samples1[['lambda[2]']], recursive=TRUE),
                     c(samples1[['lambda[3]']], recursive=TRUE),
                     main="lambda", baseline=c(0.75, 0.15, 0.10))

While the joint model works well we can also fit the inflated and non-inflated observations independently of each other.

data$N_zero <- sum(data$y == 0)
data$N_one  <- sum(data$y == 1)

data$y_else <- data$y[data$y != 0 & data$y != 1]
data$N_else <- length(data$y_else)
zoib2a.stan
data {
  // Number of non-zero/one observations
  int<lower=1> N_else;
  // Non-zero/one observations
  array[N_else] real<lower=0, upper=1> y_else;
}

parameters {
  real<lower=0> alpha; // Beta shape
  real<lower=0>  beta; // Beta scale
}

model {
  // Prior model
  alpha ~ normal(0, 10 / 2.57); // 0 <~ alpha <~ 10
  beta ~ normal(0, 10 / 2.57);  // 0 <~ beta  <~ 10

  // Observational model
  target += beta_lpdf(y_else | alpha, beta);
}
fit <- stan(file="stan_programs/zoib2a.stan",
            data=data, seed=8438338,
            warmup=1000, iter=2024, refresh=0)

diagnostics <- util$extract_hmc_diagnostics(fit)
util$check_all_hmc_diagnostics(diagnostics)
  All Hamiltonian Monte Carlo diagnostics are consistent with reliable
Markov chain Monte Carlo.
samples2a <- util$extract_expectand_vals(fit)
util$check_all_expectand_diagnostics(samples2a)
All expectands checked appear to be behaving well enough for reliable
Markov chain Monte Carlo estimation.
zoib2b.stan
data {
  int<lower=1> N_zero; // Number of zero observations
  int<lower=1> N_one;  // Number of one observations
  int<lower=1> N_else; // Number of non-zero/one observations
}

parameters {
  simplex[3] lambda; // Component probabilities
}

model {
  // Prior model
  // Implicit uniform prior density function for lambda

  // Observational model
  target += multinomial_lpmf({N_else, N_zero, N_one} | lambda);
}
fit <- stan(file="stan_programs/zoib2b.stan",
            data=data, seed=8438338,
            warmup=1000, iter=2024, refresh=0)

diagnostics <- util$extract_hmc_diagnostics(fit)
util$check_all_hmc_diagnostics(diagnostics)
  All Hamiltonian Monte Carlo diagnostics are consistent with reliable
Markov chain Monte Carlo.
samples2b <- util$extract_expectand_vals(fit)
util$check_all_expectand_diagnostics(samples2b)
All expectands checked appear to be behaving well enough for reliable
Markov chain Monte Carlo estimation.

Critically the posterior inferences derived in the two approaches are consistent up to the expected Markov chain Monte Carlo variation.

par(mfrow=c(1, 2), mar=c(5, 5, 3, 1))

util$plot_expectand_pushforward(samples1[['alpha']],
                                25, flim=c(1.5, 6.5),
                                display_name="alpha")
text(2.25, 0.5, "Joint", col=util$c_dark)

util$plot_expectand_pushforward(samples2a[['alpha']],
                                25, flim=c(1.5, 6.5),
                                col=util$c_mid,
                                border="#DDDDDD88",
                                add=TRUE)
text(3.5, 0.05, "Separate", col=util$c_mid)

util$plot_expectand_pushforward(samples1[['beta']],
                                25, flim=c(1, 3.5),
                                display_name="beta",)
text(2.5, 1.3, "Joint", col=util$c_dark)

util$plot_expectand_pushforward(samples2a[['beta']],
                                25, flim=c(1, 3.5),
                                col=util$c_mid,
                                border="#DDDDDD88",
                                add=TRUE)
text(2, 0.1, "Separate", col=util$c_mid)

par(mfrow=c(1, 2), mar=c(0, 0, 2, 0))

plot_simplex_samples(c(samples1[['lambda[1]']], recursive=TRUE),
                     c(samples1[['lambda[2]']], recursive=TRUE),
                     c(samples1[['lambda[3]']], recursive=TRUE),
                     main="Joint")

plot_simplex_samples(c(samples2b[['lambda[1]']], recursive=TRUE),
                     c(samples2b[['lambda[2]']], recursive=TRUE),
                     c(samples2b[['lambda[3]']], recursive=TRUE),
                     main="Separate")

If the inflated values are modeling some undesired contamination and we are interested in only the behavior of the continuous baseline observational model then we can always fit the baseline model to the non-inflated values and ignore the inflated values entirely. Similarly if we are interested in only the prevalence of inflated values then we can fit the component probabilities to the counts directly, ignoring the precise value of the continuous observations.

5.5 Redundant Mixture Model

To avoid any ambiguity let me emphasize that I do not recommend using redundant mixture models in applied practice. Without some way of distinguishing the component probability distributions mixture models are inherently prone to degenerate inferences that impede meaningful insights.

That said, in this section we will explore a redundant mixture of normal observational models to see just how problematic redundancy can be in mixture models. If you’re not tempted to use redundant mixture models in your own analyses then please feel free to skim if not skip this section!

At the very least the simulation of data is straightforward.

simu\_normal\_mix.stan
data {
  int<lower=1> N;      // Number of observations
}

transformed data {
  int K = 3;                                    // Number of components
  array[K] real mu = {-4, 1, 3};                // Component locations
  array[K] real<lower=0> sigma = {2, 0.5, 0.5}; // Component scales
  simplex[K] lambda = [0.3, 0.5, 0.2]';         // Component probabilities

}

generated quantities {
  array[N] real y;

  for (n in 1:N) {
    int z = categorical_rng(lambda);
    y[n] = normal_rng(mu[z], sigma[z]);
  }
}
N <- 500

simu <- stan(file="stan_programs/simu_normal_mix.stan",
             algorithm="Fixed_param",
             data=list("N" = N), seed=8438338,
             warmup=0, iter=1, chains=1, refresh=0)

data <- list("N" = N,
             "y" = extract(simu)$y[1,])

A histogram of the data clearly shows at least two peaks, with the possibility of the second peak maybe separating into two more peaks.

par(mfrow=c(1, 1), mar=c(5, 5, 1, 1))

util$plot_line_hist(data$y, -11, 5, 0.5, xlab="y")

5.5.1 Unknown Component Probabilities

If we know the true configuration of the component observational models, so that all we have to infer are the component probabilities, then the components will not actually be redundant.

normal\_mix1.stan
data {
  int<lower=1> N;  // Number of observations
  array[N] real y; // Observations
}

transformed data {
  int K = 3;                                    // Number of components
  array[K] real mu = {-4, 1, 3};                // Component locations
  array[K] real<lower=0> sigma = {2, 0.5, 0.5}; // Component scales
}

parameters {
  simplex[K] lambda; // Component probabilities
}

model {
  // Prior model
  // Implicit uniform prior density function for lambda

  // Observational model
  for (n in 1:N) {
    vector[K] lpds;
    for (k in 1:K) {
      lpds[k] = log(lambda[k]) + normal_lpdf(y[n] | mu[k], sigma[k]);
    }
    target += log_sum_exp(lpds);
  }
}

generated quantities {
  array[N] real y_pred;

  for (n in 1:N) {
    int z = categorical_rng(lambda);
    y_pred[n] = normal_rng(mu[z], sigma[z]);
  }
}
fit <- stan(file="stan_programs/normal_mix1.stan",
            data=data, seed=8438338,
            warmup=1000, iter=2024, refresh=0)

In this case the computation is clean.

diagnostics <- util$extract_hmc_diagnostics(fit)
util$check_all_hmc_diagnostics(diagnostics)
  All Hamiltonian Monte Carlo diagnostics are consistent with reliable
Markov chain Monte Carlo.
samples <- util$extract_expectand_vals(fit)
base_samples <- util$filter_expectands(samples,
                                       c('lambda'),
                                       check_arrays=TRUE)
util$check_all_expectand_diagnostics(base_samples)
All expectands checked appear to be behaving well enough for reliable
Markov chain Monte Carlo estimation.

The posterior retrodictive check shows not only shows signs of model inadequacy but also that our model infers three distinct peaks from the observed data.

par(mfrow=c(1, 1), mar=c(5, 5, 1, 1))

util$plot_hist_quantiles(samples, 'y_pred', -12, 6, 0.5,
                         baseline_values=data$y, xlab="y")
Warning in check_bin_containment(bin_min, bin_max, collapsed_values,
"predictive value"): 12 predictive values (0.0%) fell below the binning.

Moreover posterior inferences for the component probabilities are consistent with the true values used to simulate the data.

par(mfrow=c(1, 1), mar=c(0, 0, 0, 0))

lambda_true <- c(0.3, 0.5, 0.2)
plot_simplex_samples(c(samples[['lambda[1]']], recursive=TRUE),
                     c(samples[['lambda[2]']], recursive=TRUE),
                     c(samples[['lambda[3]']], recursive=TRUE),
                     baseline=lambda_true)

5.5.2 Unknown Component Probabilities and Locations

Let’s now complicate the situation by leaving the component scales fixed but trying to infer the component locations. Because the component scales are not all equal to each other the resulting mixture model is not completely redundant. That said the components are all pretty similar, and the latter two components are exactly redundant with each other.

normal\_mix2a.stan
data {
  int<lower=1> N;  // Number of observations
  array[N] real y; // Observations
}

transformed data {
  int K = 3;                                    // Number of components
  array[K] real<lower=0> sigma = {2, 0.5, 0.5}; // Component scales
}

parameters {
  array[K] real mu;  // Component locations
  simplex[K] lambda; // Component probabilities
}

model {
  // Prior model
  mu ~ normal(0, 10 / 2.32); // -10 <~ mu[k] <~ +10
  // Implicit uniform prior density function for lambda

  // Observational model
  for (n in 1:N) {
    vector[K] lpds;
    for (k in 1:K) {
      lpds[k] = log(lambda[k]) + normal_lpdf(y[n] | mu[k], sigma[k]);
    }
    target += log_sum_exp(lpds);
  }
}

generated quantities {
  array[N] real y_pred;

  for (n in 1:N) {
    int z = categorical_rng(lambda);
    y_pred[n] = normal_rng(mu[z], sigma[z]);
  }
}
fit <- stan(file="stan_programs/normal_mix2a.stan",
            data=data, seed=8438338,
            warmup=1000, iter=2024, refresh=0)

The preponderance of split \hat{R} warnings hints at a multi-modal posterior distribution.

diagnostics <- util$extract_hmc_diagnostics(fit)
util$check_all_hmc_diagnostics(diagnostics)
  All Hamiltonian Monte Carlo diagnostics are consistent with reliable
Markov chain Monte Carlo.
samples <- util$extract_expectand_vals(fit)
base_samples <- util$filter_expectands(samples,
                                       c('mu', 'lambda'),
                                       check_arrays=TRUE)
util$check_all_expectand_diagnostics(base_samples)
mu[1]:
  Split hat{R} (12.002) exceeds 1.1.

mu[2]:
  Split hat{R} (47.876) exceeds 1.1.

mu[3]:
  Split hat{R} (18.290) exceeds 1.1.

lambda[1]:
  Split hat{R} (5.779) exceeds 1.1.

lambda[2]:
  Split hat{R} (7.297) exceeds 1.1.

lambda[3]:
  Split hat{R} (5.198) exceeds 1.1.


Split Rhat larger than 1.1 suggests that at least one of the Markov
chains has not reached an equilibrium.

Indeed a selection of pair plots suggest at least three distinct posterior modes, although accurately counting modes based on two-dimensional projections is always tricky. Moreover there could easily be more modes that our Markov chains missed.

names <- sapply(1:3, function(k) paste0('mu[', k, ']'))
util$plot_div_pairs(names, names, samples, diagnostics)

names <- sapply(1:3, function(k) paste0('lambda[', k, ']'))
util$plot_div_pairs(names, names, samples, diagnostics)

Unsurprisingly the individual Markov chains are each confined to a single mode. Because the first, second, and third Markov chains appear to have fallen into distinct modes we’ll focus on those going forwards.

util$plot_pairs_by_chain(samples[['mu[1]']], 'mu[1]',
                         samples[['mu[2]']], 'mu[2]')