par(family="serif", las=1, bty="l",
cex.axis=1, cex.lab=1, cex.main=1,
xaxs="i", yaxs="i", mar = c(5, 5, 3, 1))
library(rstan)
rstan_options(auto_write = TRUE) # Cache compiled Stan programs
options(mc.cores = parallel::detectCores()) # Parallelize chains
parallel:::setDefaultClusterOptions(setup_strategy = "sequential")
Hidden Markov Models
In the mixture modeling chapter we first considered how to model data that could have arisen from one of a number of component data generating processes. There we derived mixture observational models under the assumption that the activation of a component data generating process, and hence the behavior of the component assignment variables, is independent across individual observations.
Mixture observational models have broad applicability in practice, but they are by no means universal. In order to generalize this modeling technique we need to consider joint mixture observational models that couple the component assignment variables together. These generalizations, however, also introduce an immediate practical challenge: how, if at all, can we efficiently marginalize the component assignment variables?
From this perspective hidden Markov models are an exceptional generalization of mixture observational models. On one hand they allow for more sophisticated behavior, letting the contribution of a component data generating process persist across multiple observations before another data generating process takes over. On the other hand their mathematical structure is just cooperative enough to allow for practical marginalization.
In this chapter I will introduce hidden Markov models, carefully derive the algorithms needed for efficient marginalization, and demonstrate their robust implementation in Stan.
1 Joint Mixture Observational Models
Consider I component observations, y_{1}, \ldots, y_{i}, \ldots, y_{I}, each of which can arise from one of K component observational models, p(y_{i} \mid i, k, \theta) = p_{i, k}( y_{i} \mid \theta).
Note that in general the y_{i}, and hence the precise form of the p(y_{i} \mid i, k, \theta), do not have to be homogeneous. For example some y_{i} might be discrete while others are continuous. Moreover the dimension of each y_{i} might vary, with some observations featuring more sub-components than others.
To quantify which component observational model is responsible for each component observation we introduce I categorical assignment variables, z_{i} \in \{ 1, \ldots, K \}. These assignment variables are also known as states, as in the state of the overall data generating process for each observation.
When the states z_{i} are unknown, equivalently latent or even hidden, we need to incorporate them into the observational model so that we can infer their behavior. This requires a joint model of the form (Figure 1) \begin{align*} p(z_{1}, \ldots, z_{I}, y_{1}, \ldots, y_{I}, \theta) &= p(y_{1}, \ldots, y_{I} \mid z_{1}, \ldots, z_{I}, \theta) \, p(z_{1}, \ldots, z_{I} \mid \theta) \, p( \theta ) \\ &= \prod_{i = 1}^{I} p_{i, z_{i}}(y_{i} \mid \theta) \, p(z_{1}, \ldots, z_{I} \mid \theta) \, p( \theta ) \\ &= \prod_{i = 1}^{I} p(y_{i} \mid i, z_{i}, \theta) \, p(z_{1}, \ldots, z_{I} \mid \theta) \, p( \theta ). \end{align*}
To make the equations in this chapter a bit less overwhelming let’s take the opportunity to introduce a more compact notation for sequences of observations and latent states, \begin{align*} y_{i_{1}:i_{2}} &= ( y_{i_{1}}, y_{i_{1} + 1}, \ldots, y_{i_{2} - 1}, y_{i_{2}} ) \\ z_{i_{1}:i_{2}} &= ( z_{i_{1}}, z_{i_{1} + 1}, \ldots, z_{i_{2} - 1}, z_{i_{2}} ). \end{align*} This allows us to write the general joint model as p(z_{1:I}, y_{1:I}, \theta) = \prod_{i = 1}^{I} p(y_{i} \mid i, z_{i}, \theta) \, p(z_{1:I} \mid \theta) \, p( \theta ).
In the mixture modeling chapter we derived mixture observational models by assuming that all of the latent states were independent and identically distributed (Figure 2), p(z_{1:I} \mid \theta) = \prod_{i = 1}^{I} p(z_{i} \mid \theta); I will refer to this as a basic mixture observational model. With this assumption we were able to marginalize out each discrete latent state individually. This, in turn, allowed us to use high performance computational tools like Hamiltonian Monte Carlo.
An arbitrary joint mixture observational model, however, features K^{I} distinct configurations of the latent states. Consequently direct summation of the latent states, p(y_{1:I} \mid \theta) = \sum_{z_{1}} \cdots \sum_{z_{I}} p(z_{1:I}, y_{1:I} \mid \theta), is typically infeasible. There is a key exception, however, where the coupling between the latent states allows for efficient marginalization while still being rich enough to be useful for practical applications.
To motivate this coupling let’s first assume that the latent states are ordered so that we have a well-defined notion of neighboring states. For example the observations might be indexed in the temporal order in which they were made, organized along a one-dimensional spatial direction, and the like.
Next we will assume that each latent state is conditionally dependent on only its preceding neighbor (Figure 3), p(z_{1:I} \mid \theta) = p(z_{1}) \, \prod_{i = 2}^{I} p(z_{i} \mid z_{i - 1}, \theta). This particular dependency structure arises so often in both theoretical and applied probability theory that it has been given its own adjective: Markovian.
Under this Markovian assumption the joint model becomes p(z_{1:I}, y_{1:I}, \theta) = p(z_{1:I}, y_{1:I} \mid \theta) \, p( \theta ) where \begin{align*} p(z_{1:I}, y_{1:I} \mid \theta) &= \prod_{i = 1}^{I} p(y_{i} \mid i, z_{i}, \theta) \, p(z_{1}, \ldots, z_{I} \mid \theta) \\ &= \left[ \prod_{i = 1}^{I} p(y_{i} \mid i, z_{i}, \theta) \right] \left[ p(z_{1}) \, \prod_{i = 2}^{I} p(z_{i} \mid z_{i - 1}, \theta) \right] \\ &= p(z_{1}) \, p(y_{1} \mid 1, z_{1}, \theta) \prod_{i = 2}^{I} p(z_{i} \mid z_{i - 1}, \theta) \, p(y_{i} \mid i, z_{i}, \theta). \end{align*}
Because it assumes “Markovian” dependencies for a sequence of “hidden” states this model is known as a hidden Markov model. That said, it is not unreasonable to refer to it as a Markov mixture observational model if we want to emphasize the assumed structure of the observations as well as that of the latent states.
In the context of a hidden Markov model the conditional probabilities p(z_{i} \mid z_{i - 1}, \theta) are referred to as transition probabilities. A hidden Markov model with the same transition probabilities between all pairs of states, so that p(z_{i} = k' \mid z_{i - 1} = k, \theta) depends on only k and k' but not i, is known as a homogeneous hidden Markov model. Heterogeneous, non-homogeneous, and general hidden Markov models are all terms used to emphasize varying transition probabilities.
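To make these conventions concrete, here is a small R illustration (a side calculation, not part of the analyses below) using the homogeneous transition matrix and initial state probabilities from the simulation in Section 7.2.1. Each row of the transition matrix collects the probabilities of transitioning out of one state and so sums to one, and repeated matrix-vector products propagate the initial state probabilities into marginal state probabilities at later indices.

# Homogeneous transition matrix from the simulation in Section 7.2.1;
# row k' holds the probabilities p(z_i = k | z_{i - 1} = k').
Gamma <- matrix(c(0.60, 0.30, 0.10,
                  0.40, 0.50, 0.10,
                  0.05, 0.05, 0.90),
                nrow=3, byrow=TRUE)
rowSums(Gamma)  # Each row sums to one

# Initial state probabilities p(z_1 = k)
rho <- c(0.8, 0.2, 0.0)

# Implied marginal state probabilities p(z_2 = k) and p(z_3 = k)
p_z2 <- as.vector(t(Gamma) %*% rho)
p_z3 <- as.vector(t(Gamma) %*% p_z2)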
2 The Marginal Hidden Markov Model
In theory the joint observational model p(z_{1:I}, y_{1:I} \mid \theta) is sufficient to infer which latent states z_{1:I} and model configurations \theta are consistent with any observed data. The discreteness of the latent states, however, frustrates practical implementation of those inferences.
Fortunately the Markovian structure of hidden Markov models allows us to efficiently marginalize out the discrete latent states. That said actually deriving an explicit marginalization algorithm from that structure is a pretty onerous task.
In this section I will motivate and then carefully derive the strategy for efficiently marginalizing the discrete hidden states of a hidden Markov model. The derivation requires some pretty brutal calculations and readers interested in only the final result can jump straight to Section 2.3 without missing anything.
2.1 Looking For A Pattern
First and foremost let’s motivate the basic approach to efficient marginalization by carefully working out the direct summations for I = 1, I = 2, and I = 3, and then looking for any useful patterns.
When I = 1 we have p(z_{1}, y_{1} \mid \theta) = p(z_{1}) \, p(y_{1} \mid 1, z_{1}, \theta). Summing over the initial latent state gives \begin{align*} p(y_{1} \mid \theta) &= \sum_{z_{1}} p(z_{1}, y_{1} \mid \theta) \\ &= \sum_{z_{1}} p(y_{1} \mid 1, z_{1}, \theta) \, p(z_{1}). \end{align*} Note the similarity to the marginal form of a basic mixture observational model, p(y \mid \theta) = \sum_{z} p( y \mid z, \theta) \, p(z).
For I = 2 we have p(z_{1}, z_{2}, y_{1}, y_{2} \mid \theta) = p(z_{1}) \, p(y_{1} \mid 1, z_{1}, \theta) \, p(z_{2} \mid z_{1}, \theta) \, p(y_{2} \mid 2, z_{2}, \theta), or, equivalently, p(z_{1}, z_{2}, y_{1}, y_{2} \mid \theta) = p(z_{1}, y_{1} \mid \theta) \, p(z_{2} \mid z_{1}, \theta) \, p(y_{2} \mid 2, z_{2}, \theta). Summing over the initial latent state gives \begin{align*} p(z_{2}, y_{1}, y_{2} \mid \theta) &= \sum_{z_{1}} p(z_{1}, z_{2}, y_{1}, y_{2} \mid \theta) \\ &= \sum_{z_{1}} p(z_{1}, y_{1} \mid \theta) \, p(z_{2} \mid z_{1}, \theta) \, p(y_{2} \mid 2, z_{2}, \theta) \\ &= p(y_{2} \mid 2, z_{2}, \theta) \sum_{z_{1}} p(z_{1}, y_{1} \mid \theta) \, p(z_{2} \mid z_{1}, \theta). \end{align*}
We can then marginalize out both of the latent states by summing this result over z_{2}, \begin{align*} p(y_{1}, y_{2} \mid \theta) &= \sum_{z_{2}} p(z_{2}, y_{1}, y_{2} \mid \theta) \\ &= \sum_{z_{2}} p(y_{2} \mid 2, z_{2}, \theta) \bigg[ \sum_{z_{1}} p(z_{1}, y_{1} \mid \theta) \, p(z_{2} \mid z_{1}, \theta) \bigg] \\ &= \sum_{z_{2}} p(y_{2} \mid 2, z_{2}, \theta) \, p(z_{2}, y_{1} \mid \theta). \end{align*}
Interestingly this once again has the same form as a basic mixture observational model. Now, however, the component weights are informed by the previous observation, p(z_{2}, y_{1} \mid \theta) = \sum_{z_{1}} p(z_{1}, y_{1} \mid \theta) \, p(z_{2} \mid z_{1}, \theta), and are proportional to the conditional component probabilities p(z_{2} \mid y_{1}, \theta).
Finally let’s consider I = 3 where the joint model is given by \begin{align*} p(z_{1:3}, y_{1:3} \mid \theta) &= p(z_{1}) \, p(y_{1} \mid 1, z_{1}, \theta) \, \prod_{i = 2}^{3} p(z_{i} \mid z_{i - 1}, \theta) \, p(y_{i} \mid i, z_{i}, \theta) \\ &= p(z_{1}, y_{1} \mid \theta) \, \prod_{i = 2}^{3} p(z_{i} \mid z_{i - 1}, \theta) \, p(y_{i} \mid i, z_{i}, \theta). \end{align*}
Here summing over the initial latent state gives \begin{align*} p(z_{2:3}, y_{1:3} \mid \theta) &= \sum_{z_{1}} p(z_{1:3}, y_{1:3} \mid \theta) \\ &= \sum_{z_{1}} p(z_{1}, y_{1} \mid \theta) \, \prod_{i = 2}^{3} p(z_{i} \mid z_{i - 1}, \theta) \, p(y_{i} \mid i, z_{i}, \theta) \\ &= \;\, \bigg[ p(y_{3} \mid 3, z_{3}, \theta) \, p(z_{3} \mid z_{2}, \theta) \bigg] \\ &\quad \cdot \bigg[ p(y_{2} \mid 2, z_{2}, \theta) \, \sum_{z_{1}} p(z_{1}, y_{1} \mid \theta) \, p(z_{2} \mid z_{1}, \theta) \bigg]. \end{align*}
The second term might look familiar; it’s exactly p(z_{2}, y_{1}, y_{2} \mid \theta) which we just derived for the I = 2 case! Consequently we can write \begin{align*} p(z_{2:3}, y_{1:3} \mid \theta) &= \; \bigg[ p(y_{3} \mid 3, z_{3}, \theta) \, p(z_{3} \mid z_{2}, \theta) \bigg] \\ &\quad \cdot \bigg[ p(y_{2} \mid 2, z_{2}, \theta) \, \sum_{z_{1}} p(z_{1}, y_{1} \mid \theta) \, p(z_{2} \mid z_{1}, \theta) \bigg] \\ &= \; \bigg[ p(y_{3} \mid 3, z_{3}, \theta) \, p(z_{3} \mid z_{2}, \theta) \bigg] \\ &\quad \cdot \bigg[ p(z_{2}, y_{1}, y_{2} \mid \theta) \bigg] \\ &= p(y_{3} \mid 3, z_{3}, \theta) \, p(z_{2}, y_{1}, y_{2} \mid \theta) \, p(z_{3} \mid z_{2}, \theta). \end{align*}
Now we can sum over z_{2} to give \begin{align*} p(z_{3}, y_{1:3} \mid \theta) &= \sum_{z_{2}} p(z_{2:3}, y_{1:3} \mid \theta) \\ &= \sum_{z_{2}} p(y_{3} \mid 3, z_{3}, \theta) \, p(z_{2}, y_{1}, y_{2} \mid \theta) \, p(z_{3} \mid z_{2}, \theta) \\ &= p(y_{3} \mid 3, z_{3}, \theta) \, \sum_{z_{2}} p(z_{2}, y_{1}, y_{2} \mid \theta) \, p(z_{3} \mid z_{2}, \theta). \end{align*} Notice how we derive p(z_{3}, y_{1:3} \mid \theta) from p(z_{2}, y_{1}, y_{2} \mid \theta) in exactly the same way that we derived p(z_{2}, y_{1}, y_{2} \mid \theta) from p(z_{1}, y_{1} \mid \theta).
Lastly we sum over z_{3} to give the full marginal probability density function, \begin{align*} p(y_{1:3} \mid \theta) &= \sum_{z_{3}} p(z_{3}, y_{1:3} \mid \theta) \\ &= \sum_{z_{3}} p(y_{3} \mid 3, z_{3}, \theta) \, \bigg[ \sum_{z_{2}} p(z_{2}, y_{1}, y_{2} \mid \theta) \, p(z_{3} \mid z_{2}, \theta) \bigg] \\ &= \sum_{z_{3}} p(y_{3} \mid 3, z_{3}, \theta) \, p(z_{3}, y_{1:2} \mid \theta ). \end{align*}
The final marginal model continues to mirror a basic mixture observational model, in this case with derived component weights p(z_{3}, y_{1:2} \mid \theta ) = \sum_{z_{2}} p(z_{2}, y_{1}, y_{2} \mid \theta) \, p(z_{3} \mid z_{2}, \theta) that are proportional to the conditional component probabilities p(z_{3} \mid y_{1:2}, \theta ).
Let’s summarize what we have seen in these first few cases.
Instead of deriving p(y_{1:3} \mid \theta) by summing p(z_{1:3}, y_{1:3} \mid \theta) over z_{1}, z_{2}, and z_{3} at the same time, we can evaluate it by emulating a basic mixture observational model and summing p(z_{3}, y_{1:3} \mid \theta) over just z_{3}.
Conveniently p(z_{3}, y_{1:3} \mid \theta) can be derived directly from p(z_{2}, y_{1:2} \mid \theta), and that term can be derived from p(z_{1}, y_{1} \mid \theta) which in turn can be derived from the initial state probabilities p(z_{1}), \begin{align*} p(z_{1}, y_{1} \;\; \mid \theta) &= p(y_{1} \mid 1, z_{1}, \theta) \, p(z_{1}), \\ p(z_{2}, y_{1:2} \mid \theta) &= p(y_{2} \mid 2, z_{2}, \theta) \sum_{z_{1}} p(z_{1}, y_{1} \mid \theta) \, p(z_{2} \mid z_{1}, \theta), \\ p(z_{3}, y_{1:3} \mid \theta) &= p(y_{3} \mid 3, z_{3}, \theta) \, \sum_{z_{2}} p(z_{2}, y_{1:2} \mid \theta) \, p(z_{3} \mid z_{2}, \theta). \end{align*}
At this point the important question is whether or not this recursive pattern over the first three cases holds more generally, with p(z_{i}, y_{1:i} \mid \theta ) = p(y_{i} \mid i, z_{i}, \theta) \, \bigg[ \sum_{z_{i - 1}} p(z_{i - 1}, y_{1:(i - 1)} \mid \theta) \, p(z_{i} \mid z_{i - 1}, \theta) \bigg] for any 1 < i \le I. If it does then starting with the initial state probabilities p(z_{1}) we can iteratively compute p(z_{I}, y_{1:I} \mid \theta) and then sum over z_{I} to give the marginal model p(y_{1:I} \mid \theta).
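Before verifying this pattern algebraically we can at least sanity check it numerically. The following R sketch uses an arbitrary, made-up model configuration (none of the models considered below) to compare a brute-force summation over all K^{I} latent state configurations against the conjectured recursion for a small problem; the two evaluations of the marginal density agree up to floating point error.

# Small, arbitrary configuration for a numerical check
set.seed(5849586)

I <- 4  # Number of latent states
K <- 3  # Number of component observational models

rho <- c(0.5, 0.3, 0.2)                # p(z_1 = k)
Gamma <- matrix(runif(K * K), nrow=K)  # Row k' holds p(z_i = k | z_{i - 1} = k')
Gamma <- Gamma / rowSums(Gamma)
omega <- matrix(runif(I * K), nrow=I)  # omega[i, k] = p(y_i | i, z_i = k, theta)

# Brute-force marginalization over all K^I latent state configurations
configs <- as.matrix(expand.grid(rep(list(1:K), I)))
brute <- sum(apply(configs, 1, function(z) {
  p <- rho[z[1]] * omega[1, z[1]]
  for (i in 2:I)
    p <- p * Gamma[z[i - 1], z[i]] * omega[i, z[i]]
  p
}))

# Conjectured recursion followed by a final sum over z_I
alpha <- omega[1,] * rho
for (i in 2:I)
  alpha <- omega[i,] * as.vector(t(Gamma) %*% alpha)
recursive <- sum(alpha)

brute - recursive  # Zero up to floating point error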
2.2 Verifying The Pattern
While not exactly straightforward, with enough mathematical effort we can derive a general relationship between p(z_{i}, y_{1:i} \mid \theta) and p(z_{i - 1}, y_{1:(i - 1)} \mid \theta) using conditional probability theory and the conditional structure of the joint hidden Markov model. The exact properties we need are pretty onerous to work out, and I will not attempt to do so in this chapter. Instead I will shamelessly defer to Bishop (2006) which reviews the necessary properties and strategies for deriving them.
Ultimately we want to reduce p(z_{i}, y_{1:i} \mid \theta) to an object that depends on only the first i - 1 latent states and observations. To that end let’s peel off the terms that depend on z_{i} and y_{i}, \begin{align*} p(z_{i}, y_{1:i} \mid \theta) &= p(z_{i}, y_{i}, y_{1:(i - 1)} \mid \theta) \\ &= p(y_{1:(i - 1)} \mid z_{i}, y_{i}, \theta) \, p(z_{i}, y_{i} \mid \theta) \\ &= p(y_{1:(i - 1)} \mid z_{i}, y_{i}, \theta) \, p(y_{i} \mid i, z_{i}, \theta) \, p(z_{i} \mid \theta). \end{align*}
Because of the Markovian structure of the hidden Markov model the first term can be written as p(y_{1:(i - 1)} \mid z_{i}, y_{i}, \theta) = p(y_{1:(i - 1)} \mid z_{i}, \theta). Consequently we have \begin{align*} p(z_{i}, y_{1:i} \mid \theta) &= p(y_{1:(i - 1)} \mid z_{i}, y_{i}, \theta) \, p(y_{i} \mid i, z_{i}, \theta) \, p(z_{i} \mid \theta) \\ &= p(y_{1:(i - 1)} \mid z_{i}, \theta) \, p(y_{i} \mid i, z_{i}, \theta) \, p(z_{i} \mid \theta) \\ &= p(y_{i} \mid i, z_{i}, \theta) \bigg[ p(y_{1:(i - 1)} \mid z_{i}, \theta) \, p(z_{i} \mid \theta) \bigg] \\ &= p(y_{i} \mid i, z_{i}, \theta) \, p(z_{i}, y_{1:(i - 1)} \mid \theta). \end{align*}
The second term here is a bit awkward, but it can be simplified by writing it as a marginal probability density function, \begin{align*} p(z_{i}, y_{1:(i - 1)} \mid \theta) &= \sum_{z_{i - 1}} p(z_{i - 1}, z_{i}, y_{1:(i - 1)} \mid \theta) \\ &= \sum_{z_{i - 1}} p(z_{i}, y_{1:(i - 1)} \mid z_{i - 1}, \theta) \, p(z_{i - 1} \mid \theta). \end{align*} We can further simplify the first term in the sum to p(z_{i}, y_{1:(i - 1)} \mid z_{i - 1}, \theta) = p(z_{i} \mid z_{i - 1}, \theta) \, p(y_{1:(i - 1)} \mid z_{i - 1}, \theta); intuitively conditioning on z_{i - 1} blocks any coupling between the following latent state z_{i} and the previous observations y_{1:(i - 1)}.
This allows us to write \begin{align*} p(z_{i}, y_{1:(i - 1)} \mid \theta) &= \sum_{z_{i - 1}} p(z_{i}, y_{1:(i - 1)} \mid z_{i - 1}, \theta) \, p(z_{i - 1} \mid \theta) \\ &= \sum_{z_{i - 1}} p(z_{i} \mid z_{i - 1}, \theta) \, p(y_{1:(i - 1)} \mid z_{i - 1}, \theta) \, p(z_{i - 1} \mid \theta) \\ &= \sum_{z_{i - 1}} p(z_{i} \mid z_{i - 1}, \theta) \, \bigg[ p(y_{1:(i - 1)} \mid z_{i - 1}, \theta) \, p(z_{i - 1} \mid \theta) \bigg] \\ &= \sum_{z_{i - 1}} p(z_{i} \mid z_{i - 1}, \theta) \, p(z_{i - 1}, y_{1:(i - 1)} \mid \theta). \end{align*}
Substituting this back into the original equation for p(z_{i}, y_{1:i} \mid \theta) finally gives \begin{align*} p(z_{i}, y_{1:i} \mid \theta) &= p(y_{i} \mid i, z_{i}, \theta) \, p(z_{i}, y_{1:(i - 1)} \mid \theta) \\ &= p(y_{i} \mid i, z_{i}, \theta) \, \sum_{z_{i - 1}} p(z_{i} \mid z_{i - 1}, \theta) \, p(z_{i - 1}, y_{1:(i - 1)} \mid \theta), \end{align*} which is exactly the recursion equation that generalizes the pattern we found in Section 2.1.
2.3 The Forward Algorithm
Implementing this recursive calculation given observations \tilde{y}_{1}, \ldots, \tilde{y}_{i}, \ldots, \tilde{y}_{I} is made a bit more straightforward with the introduction of some more compact notation.
First let’s denote the initial state probabilities, observational model outputs, and transition probabilities as \begin{align*} \rho_{k} &= p( z_{1} = k ) \\ \omega_{i, k} &= p( \tilde{y}_{i} \mid i, z_{i} = k, \theta) \\ \Gamma_{i, k'k} &= p(z_{i} = k \mid z_{i - 1} = k', \theta). \end{align*} Then we can define \alpha_{i, k} = p(z_{i} = k, \tilde{y}_{1:i} \mid \theta), with the initialization \alpha_{1, k} = \omega_{1, k} \, \rho_{k} and the recursion for 1 < i \le I becoming \begin{align*} \alpha_{i, k} &= p(\tilde{y}_{i} \mid i, z_{i} = k, \theta) \sum_{k' = 1}^{K} p(z_{i} = k \mid z_{i - 1} = k', \theta) \, \alpha_{i - 1, k'} \\ &= \omega_{i, k} \sum_{k' = 1}^{K} \Gamma_{i, k'k} \, \alpha_{i - 1, k'}. \end{align*}
Conveniently we can also write this as a matrix equation, \boldsymbol{\alpha}_{i} = \boldsymbol{\omega}_{i} \circ \left( \boldsymbol{\Gamma}_{i}^{T} \cdot \boldsymbol{\alpha}_{i - 1} \right), where \circ denotes the Hadamard, or element-wise, product. Alternatively we can use only matrix-vector products, \boldsymbol{\alpha}_{i} = \mathrm{diag}(\boldsymbol{\omega}_{i}) \cdot \left( \boldsymbol{\Gamma}_{i}^{T} \cdot \boldsymbol{\alpha}_{i - 1} \right), although we have to be careful to not actually evaluate \mathrm{diag}(\boldsymbol{\omega}_{i}) as a dense matrix and waste precious computational resources.
In this matrix notation the evaluation of the marginal observational model is given by \begin{align*} \boldsymbol{\alpha}_{1} &= \boldsymbol{\omega}_{1} \circ \boldsymbol{\rho} \\ & \ldots \\ \boldsymbol{\alpha}_{i} &= \boldsymbol{\omega}_{i} \circ \left( \boldsymbol{\Gamma}_{i}^{T} \cdot \boldsymbol{\alpha}_{i - 1} \right) \\ & \ldots \\ p(\tilde{y}_{1:I} \mid \theta) &= \mathbf{1}^{T} \cdot \boldsymbol{\alpha}_{I}. \end{align*} This procedure is also known as the forward algorithm as it iterates through the latent states in sequential order.
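As a reference point, here is a minimal R sketch of the forward algorithm, implemented as a hypothetical helper function forward_marginal and restricted to a homogeneous transition matrix. It works directly on the probability scale and so is vulnerable to numerical underflow for long sequences, which is exactly the problem that the log-sum-exp and rescaling implementations compared in Section 7.2 are designed to avoid.

# Minimal forward algorithm sketch (assumes I >= 2 and a homogeneous
# transition matrix).
#   rho:   K-vector of initial state probabilities p(z_1 = k)
#   Gamma: K x K transition matrix; row k' holds p(z_i = k | z_{i - 1} = k')
#   omega: I x K matrix with omega[i, k] = p(y_i | i, z_i = k, theta)
forward_marginal <- function(rho, Gamma, omega) {
  I <- nrow(omega)
  alpha <- omega[1,] * rho                              # alpha_1
  for (i in 2:I)
    alpha <- omega[i,] * as.vector(t(Gamma) %*% alpha)  # alpha_i
  sum(alpha)                                            # p(y_{1:I} | theta)
}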
The cost of each iteration of the forward algorithm is dominated by a matrix-vector product which requires \mathcal{O}(K^{2}) operations. Consequently the cost of the entire forward algorithm scales as \mathcal{O}(I \, K^{2}). For all but the smallest values of K and I this is drastically faster than the \mathcal{O}(K^{I}) scaling that would be required of a brute force summation.
That said for sufficiently large K and/or I the cost of implementing the forward algorithm can still be prohibitive. In particular increasing the number of possible component states K increases the cost much faster than increasing the number of latent states I.
2.4 Alternative Conventions
In the previous derivation of the forward algorithm I have made a few choices of notation that are not universal, and hence might clash with other references.
For example many references like Bishop (2006) consider an initial latent state z_{0} that is not paired with a corresponding observation (Figure 4). One awkward consequence of this convention is that there are different numbers of latent states and observations, which can be a nuisance to manage in practical implementations. Indexing is already hard enough!
The convention used in this chapter is more general in the sense that the alternative convention can be recovered by setting \omega_{1, k} = p( \tilde{y}_{1} \mid 1, z_{1} = k, \theta) = 1 for all k.
Another potential divergence in notation concerns the transition matrix \boldsymbol{\Gamma}. Some references like Bishop (2006) define the elements of the transition matrix not as \Gamma_{i, k'k} = p(z_{i} = k \mid z_{i - 1} = k', \theta) but rather as the transpose, \Gamma_{i, kk'} = p(z_{i} = k \mid z_{i - 1} = k', \theta). Neither convention has any substantial practical advantage over the other; we just have to make sure that we’re using one convention consistently.
3 Inferring Latent State Behavior
After marginalization we can still recover inferences for the latent states with an application of Bayes’ Theorem. Specifically the joint posterior density function for all of the latent states is given, up to normalization, by partially evaluating the joint model, p( z_{1:I} \mid \tilde{y}_{1:I}, \theta) \propto p( z_{1:I}, \tilde{y}_{1:I}, \theta). When I is large, however, this joint posterior density function will be much too ungainly to manipulate or communicate directly.
If we are interested in the behavior of the latent states then we will need to be able to isolate meaningful, low-dimensional summaries of these inferences.
As in Section 2 we will make heavy use of some of the properties of the hidden Markov model detailed in Bishop (2006). Moreover, readers interested in only the final results can safely jump directly to the end of each section.
3.1 Marginal Posterior Probabilities For Individual Latent States
In some applications the marginal posterior probabilities for a single latent state, p( z_{i} \mid y_{1:I}, \theta), are useful. We can derive these marginal posterior probabilities with an application of Bayes’ Theorem, p( z_{i} \mid y_{1:I}, \theta) \propto p( y_{1:I} \mid z_{i}, \theta) \, p(z_{i} \mid \theta).
Fortunately deriving the terms on the right hand side of this equation is relatively straightforward if we take advantage of the structure of the joint model. For example the first term simplifies to p( y_{1:I} \mid z_{i}, \theta) = p( y_{1:i} \mid z_{i}, \theta) \, p( y_{(i + 1):I} \mid z_{i}, \theta). In words conditioning on z_{i} breaks any coupling between the preceding observations and the following observations.
Substituting this result gives \begin{align*} p( z_{i} \mid y_{1:I}, \theta) &\propto p( y_{1:I} \mid z_{i}, \theta) \, p(z_{i} \mid \theta) \\ &\propto p( y_{1:i} \mid z_{i}, \theta) \, p( y_{(i + 1):I} \mid z_{i}, \theta) \, p(z_{i} \mid \theta) \\ &\propto \bigg[ p( y_{1:i} \mid z_{i}, \theta) \, p(z_{i} \mid \theta) \bigg] \, p( y_{(i + 1):I} \mid z_{i}, \theta) \\ &\propto p( z_{i}, y_{1:i} \mid \theta) \, p( y_{(i + 1):I} \mid z_{i}, \theta). \end{align*}
As we saw in Section 2, when we run the forward algorithm for a given observation we compute p( z_{i}, \tilde{y}_{1:i} \mid \theta) for each i. Consequently in order to compute the marginal posterior probabilities for each latent state we just need to save these intermediate values and then compute p( \tilde{y}_{(i + 1):I} \mid z_{i}, \theta ).
Conveniently this term can be evaluated with a recursion of its own. To parallel the notation of the forward algorithm let’s denote \beta_{i, z_{i}} = p (\tilde{y}_{(i + 1):I} \mid z_{i}, \theta ). In order to compute a particular \beta_{i, z_{i}} we need to introduce the latent state z_{i + 1} and then decompose, \begin{align*} \beta_{i, z_{i}} &= p (\tilde{y}_{(i + 1):I} \mid z_{i}, \theta) \\ &= \sum_{z_{i + 1}} p (\tilde{y}_{(i + 1):I}, z_{i + 1} \mid z_{i}, \theta ) \\ &= \sum_{z_{i + 1}} p (\tilde{y}_{(i + 1):I} \mid z_{i}, z_{i + 1}, \theta ) \, p(z_{i + 1} \mid z_{i}, \theta). \end{align*}
Now the first term in each summand simplifies due to, you guessed it, the structure of the joint model, p (y_{(i + 1):I} \mid z_{i}, z_{i + 1}, \theta ) = p (y_{(i + 1):I} \mid z_{i + 1}, \theta ). Once we know z_{i + 1} the following observations y_{(i + 1):I} do not depend on any of the previous latent states.
Substituting this simplification gives \begin{align*} \beta_{i, z_{i}} &= \sum_{z_{i + 1}} p (\tilde{y}_{(i + 1):I} \mid z_{i}, z_{i + 1}, \theta ) \, p(z_{i + 1} \mid z_{i}, \theta) \\ &= \sum_{z_{i + 1}} p (\tilde{y}_{(i + 1):I} \mid z_{i + 1}, \theta) \, p(z_{i + 1} \mid z_{i}, \theta) \\ &= \sum_{z_{i + 1}} p(\tilde{y}_{i + 1} \mid i + 1, z_{i + 1}, \theta) \, p(\tilde{y}_{(i + 2):I} \mid z_{i + 1}, \theta) \, p(z_{i + 1} \mid z_{i}, \theta). \end{align*}
That middle term, however, is just the \beta for the subsequent latent state, p (\tilde{y}_{(i + 2):I} \mid z_{i + 1}, \theta) = \beta_{i + 1, z_{i + 1}}!
Consequently we have a backwards recursion formula for the \beta_{i, z_{i}}, \begin{align*} \beta_{i, z_{i}} &= \sum_{z_{i + 1}} p(\tilde{y}_{i + 1} \mid i + 1, z_{i + 1}, \theta) \, p(\tilde{y}_{(i + 2):I} \mid z_{i + 1}, \theta) \, p(z_{i + 1} \mid z_{i}, \theta) \\ &= \sum_{z_{i + 1}} p(\tilde{y}_{i + 1} \mid i + 1, z_{i + 1}, \theta ) \, \beta_{i + 1, z_{i + 1}} \, p(z_{i + 1} \mid z_{i}, \theta) \\ &= \sum_{z_{i + 1}} \omega_{i + 1, z_{i + 1} } \, \beta_{i + 1, z_{i + 1}} \, \Gamma_{i + 1, z_{i} z_{i + 1}}, \end{align*} which we can also write in matrix notation, \boldsymbol{\beta}_{i} = \boldsymbol{\Gamma}_{i + 1} \cdot ( \boldsymbol{\omega}_{i + 1} \circ \boldsymbol{\beta}_{i + 1} ) = \boldsymbol{\Gamma}_{i + 1} \cdot ( \mathrm{diag} ( \boldsymbol{\omega}_{i + 1} ) \cdot \boldsymbol{\beta}_{i + 1} ).
The only remaining issue is how to compute \beta_{I, z_{I}} and initialize the backwards recursion. At the last step we have \begin{align*} p( y_{I} \mid z_{I - 1}, \theta ) &= \sum_{z_{I}} p( y_{I}, z_{I} \mid z_{I - 1}, \theta ) \\ &= \sum_{z_{I}} p( y_{I} \mid z_{I}, z_{I - 1} , \theta) \, p( z_{I} \mid z_{I - 1}, \theta ). \end{align*} Recall, however, that conditioning on z_{I} blocks any coupling between z_{I - 1} and y_{I}. Consequently p( y_{I} \mid z_{I}, z_{I - 1} , \theta) = p( y_{I} \mid I, z_{I}, \theta) and \begin{align*} p( y_{I} \mid z_{I - 1}, \theta ) &= \sum_{z_{I}} p( y_{I} \mid z_{I}, z_{I - 1}, \theta) \, p( z_{I} \mid z_{I - 1}, \theta ) \\ &= \sum_{z_{I}} p( y_{I} \mid I, z_{I}, \theta ) \, p( z_{I} \mid z_{I - 1}, \theta ). \end{align*}
Comparing this to the backwards recursion formula \beta_{I - 1, z_{I - 1}} = \sum_{z_{I}} p(\tilde{y}_{I} \mid I, z_{I}, \theta ) \, \beta_{I, z_{I}} \, p(z_{I} \mid z_{I - 1}, \theta) we see that we will get consistent results if and only if we set \beta_{I, z_{I}} = 1 for all values of z_{I}.
Putting everything together we can evaluate the marginal posterior probabilities for the configuration of any latent state by first scanning forward through the latent states, \begin{align*} \boldsymbol{\alpha}_{1} &= \boldsymbol{\omega}_{1} \circ \boldsymbol{\rho} \\ & \ldots \\ \boldsymbol{\alpha}_{i} &= \boldsymbol{\omega}_{i} \circ \left( \boldsymbol{\Gamma}^{T}_{i} \cdot \boldsymbol{\alpha}_{i - 1} \right) \\ & \ldots, \end{align*} scanning backwards through the latent states, \begin{align*} \boldsymbol{\beta}_{I} &= \mathbf{1} \\ & \ldots \\ \boldsymbol{\beta}_{i - 1} &= \boldsymbol{\Gamma}_{i} \cdot ( \boldsymbol{\omega}_{i} \circ \boldsymbol{\beta}_{i} ), \end{align*} and then finally multiplying and normalizing, p( z_{i} = k \mid \tilde{y}_{1:I}, \theta) = \frac{ \alpha_{i, k} \, \beta_{i, k} } { \sum_{k' = 1}^{K} \alpha_{i, k'} \, \beta_{i, k'} }.
This combined procedure is known as the forward-backward algorithm. Comparing operations we can see that the forward-backward algorithm exhibits the same computational scaling as the forward algorithm alone; once we’ve run the forward algorithm, running the backward algorithm through all of the latent states only doubles the computational cost.
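For completeness, here is a minimal R sketch of the combined procedure as a hypothetical helper function forward_backward, under the same assumptions as the forward algorithm sketch above and again working on the probability scale rather than the numerically safer log scale. It returns an I x K matrix whose (i, k) element is the marginal posterior probability p( z_{i} = k \mid \tilde{y}_{1:I}, \theta).

# Minimal forward-backward sketch (assumes I >= 2 and a homogeneous
# transition matrix); rho, Gamma, and omega are as in the forward
# algorithm sketch above.
forward_backward <- function(rho, Gamma, omega) {
  I <- nrow(omega)
  K <- ncol(omega)

  # Forward scan: alpha[i, k] = p(z_i = k, y_{1:i} | theta)
  alpha <- matrix(0, nrow=I, ncol=K)
  alpha[1,] <- omega[1,] * rho
  for (i in 2:I)
    alpha[i,] <- omega[i,] * as.vector(t(Gamma) %*% alpha[i - 1,])

  # Backward scan: beta[i, k] = p(y_{(i + 1):I} | z_i = k, theta)
  beta <- matrix(0, nrow=I, ncol=K)
  beta[I,] <- 1
  for (i in (I - 1):1)
    beta[i,] <- as.vector(Gamma %*% (omega[i + 1,] * beta[i + 1,]))

  # Multiply and normalize each row
  p <- alpha * beta
  p / rowSums(p)
}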
7 Demonstrations
To anchor all of this abstract math into a more practical context let’s work through some implementations of hidden Markov models that demonstrate not only the basic features but also some of the more sophisticated directions that we can take the methodology.
7.1 Setup
First and foremost we have to set up our local R environment.
util <- new.env()
source('mcmc_analysis_tools_rstan.R', local=util)
source('mcmc_visualization_tools.R', local=util)
7.2 Comparing and Contrasting Implementations
Let’s start with a relatively simple exercise that compares some of the basic implementation strategies.
7.2.1 Simulate Data
We’ll simulate data assuming K = 3 component observational models across I = 100 states.

I <- 100
K <- 3
To avoid too much complexity we’ll assume a homogeneous transition matrix between all of the neighboring latent states and the same component observational models at each latent state.
simu_data.stan

data {
  // Number of latent states
  int<lower=1> I;
}

transformed data {
  // True model configuration
  int<lower=1> K = 3;
  array[K] real mu = {-3, 2, 5};
  array[K] real<lower=0> sigma = {1.0, 1.5, 0.75};

  simplex[K] rho = [ 0.8, 0.2, 0.0 ]';
  matrix[K, K] Gamma = [ [ 0.60, 0.30, 0.10 ],
                         [ 0.40, 0.50, 0.10 ],
                         [ 0.05, 0.05, 0.90 ] ];
}

generated quantities {
  array[I] int<lower=1, upper=K> z; // Simulated latent states
  array[I] real y;                  // Simulated observations

  // Initial state
  z[1] = categorical_rng(rho);
  y[1] = normal_rng(mu[z[1]], sigma[z[1]]);

  for (i in 2:I) {
    // Following state
    vector[K] lambda = Gamma[z[i - 1],]';
    z[i] = categorical_rng(lambda);

    // Observation
    y[i] = normal_rng(mu[z[i]], sigma[z[i]]);
  }
}
simu <- stan(file="stan_programs/simu_data.stan",
             algorithm="Fixed_param", seed=194838,
             data=list("I" = I), iter=1, chains=1, refresh=0)

simu_samples <- util$extract_expectand_vals(simu)

y_names <- sapply(1:I, function(i) paste0('y[', i, ']'))
y <- sapply(y_names, function(name) simu_samples[[name]][1,1])

z_names <- sapply(1:I, function(i) paste0('z[', i, ']'))
z <- sapply(z_names, function(name) simu_samples[[name]][1,1])

data <- list("I" = I, "y" = y, "K" = K)
7.2.2 Exploratory Data Analysis
There are a few summary statistics that are particularly natural to the structure assumed by a hidden Markov model.
A histogram of all observations collapses the latent dynamics, allowing us to focus on the aggregate behavior of the component observational models. This summary is particularly informative when the component observational models are the same for all latent states.
par(mfrow=c(1, 1))
util$plot_line_hist(data$y, -8, 8, 1, xlab="y")
We can also isolate the latent dynamics as much as possible by plotting the observations against the corresponding latent state indices. This is often referred to as an empirical trajectory.
par(mfrow=c(1, 1))
plot(1, type="n",
     xlab="Latent State", xlim=c(0.5, data$I + 0.5),
     ylab="y", ylim=c(-8, 8))

for (i in 1:data$I) {
  lines(c(i - 0.5, i + 0.5), rep(data$y[i], 2),
        col="black", lwd=2)
}
7.2.6 Implementation Consistency Check
To properly compare the outputs of these three model implementations we really need to visualize them next to each other.
As expected, or at least hoped for, the posterior retrodictive performance for the empirical trajectories is consistent across all three model implementations.
par(mfrow=c(3, 1))
names <- sapply(1:data$I, function(i) paste0('y_pred[', i, ']'))

util$plot_disc_pushforward_quantiles(samples1, names,
                                     baseline_values=data$y,
                                     xlab="Latent State", ylab="y",
                                     main="Log-Sum-Exp")

util$plot_disc_pushforward_quantiles(samples2, names,
                                     baseline_values=data$y,
                                     xlab="Latent State", ylab="y",
                                     main="Rescaling")

util$plot_disc_pushforward_quantiles(samples3, names,
                                     baseline_values=data$y,
                                     xlab="Latent State", ylab="y",
                                     main="Built-In")
So too is the inferred behavior of the latent states.
par(mfrow=c(3, 1))
names <- sapply(1:data$I, function(i) paste0('z[', i, ']'))

util$plot_disc_pushforward_quantiles(samples1, names,
                                     xlab="Latent State", ylab="z",
                                     main="Log-Sum-Exp")

util$plot_disc_pushforward_quantiles(samples2, names,
                                     xlab="Latent State", ylab="z",
                                     main="Rescaling")

util$plot_disc_pushforward_quantiles(samples3, names,
                                     xlab="Latent State", ylab="z",
                                     main="Built-In")
The consistency continues into the inferences for the dynamical parameters.
par(mfrow=c(3, 1))
names <- sapply(1:data$K, function(kk) paste0('rho[', kk, ']'))

util$plot_disc_pushforward_quantiles(samples1, names,
                                     xlab="Component",
                                     ylab="rho",
                                     display_ylim=c(0, 1),
                                     main="Log-Sum-Exp")

util$plot_disc_pushforward_quantiles(samples2, names,
                                     xlab="Component",
                                     ylab="rho",
                                     display_ylim=c(0, 1),
                                     main="Rescaling")

util$plot_disc_pushforward_quantiles(samples3, names,
                                     xlab="Component",
                                     ylab="rho",
                                     display_ylim=c(0, 1),
                                     main="Built-In")
par(mfrow=c(3, 3))
for (k in 1:data$K) {
  names <- sapply(1:data$K,
                  function(kk) paste0('gamma[', k, ',', kk, ']'))

  util$plot_disc_pushforward_quantiles(samples1, names,
                                       xlab="Component",
                                       ylab=paste("Row", k, "of Gamma"),
                                       main="Log-Sum-Exp")

  util$plot_disc_pushforward_quantiles(samples2, names,
                                       xlab="Component",
                                       ylab=paste("Row", k, "of Gamma"),
                                       main="Rescaling")

  util$plot_disc_pushforward_quantiles(samples3, names,
                                       xlab="Component",
                                       ylab=paste("Row", k, "of Gamma"),
                                       main="Built-In")
}
Finally the inferred behavior of the component observational models is the same across all three model implementations.
par(mfrow=c(3, 2))
for (k in 1:data$K) {
  name <- paste0('mu[', k, ']')
  util$plot_expectand_pushforward(samples1[[name]],
                                  75, flim=c(-10, 10),
                                  display_name=name)
  util$plot_expectand_pushforward(samples2[[name]],
                                  75, flim=c(-10, 10),
                                  col=util$c_mid,
                                  border="#DDDDDD88",
                                  add=TRUE)
  util$plot_expectand_pushforward(samples3[[name]],
                                  75, flim=c(-10, 10),
                                  col=util$c_light,
                                  border="#DDDDDD88",
                                  add=TRUE)

  xs <- seq(-10, 10, 0.1)
  ys <- dnorm(xs, 0, 10 / 2.32);
  lines(xs, ys, lwd=2, col="white")
  lines(xs, ys, lwd=1, col=util$c_mid_teal)

  name <- paste0('sigma[', k, ']')
  util$plot_expectand_pushforward(samples1[[name]],
                                  50, flim=c(0, 5),
                                  display_name=name)
  util$plot_expectand_pushforward(samples2[[name]],
                                  50, flim=c(0, 5),
                                  col=util$c_mid,
                                  border="#DDDDDD88",
                                  add=TRUE)
  util$plot_expectand_pushforward(samples3[[name]],
                                  50, flim=c(0, 5),
                                  col=util$c_light,
                                  border="#DDDDDD88",
                                  add=TRUE)

  xs <- seq(0, 5, 0.1)
  ys <- 2 * dnorm(xs, 0, 5 / 2.57);
  lines(xs, ys, lwd=2, col="white")
  lines(xs, ys, lwd=1, col=util$c_mid_teal)
}
7.2.7 Static/Dynamic Mixture Model Comparison
Finally to see if there is any advantage to modeling the dynamics of the latent states let’s compare the hidden Markov model inferences to inferences from a static mixture model.
static_mixture.stan
data {
  int<lower=0> K; // Number of component observational models
  int<lower=0> I; // Number of latent states

  array[I] real y; // Observations at each step
}

parameters {
  // Component observational model configurations
  ordered[K] mu;
  array[K] real<lower=0> sigma;

  // Homogeneous mixture probabilities
  simplex[K] lambda;
}

model {
  // Prior model
  target += normal_lpdf(mu | 0, 10 / 2.32);   // -10 <~ mu[k] <~ +10
  target += normal_lpdf(sigma | 0, 5 / 2.57); //   0 <~ sigma[k] <~ 5
  target += dirichlet_lpdf(lambda | rep_vector(1, K));

  // Observational model
  for (n in 1:I) {
    array[K] real lpds;
    for (k in 1:K) {
      lpds[k] = log(lambda[k])
                + normal_lpdf(y[n] | mu[k], sigma[k]);
    }
    target += log_sum_exp(lpds);
  }
}

generated quantities {
  array[I] real y_pred; // Posterior predictive observations

  for (i in 1:I) {
    int z = categorical_rng(lambda);
    y_pred[i] = normal_rng(mu[z], sigma[z]);
  }
}
fit <- stan(file="stan_programs/static_mixture.stan",
            data=data, seed=4938483,
            warmup=1000, iter=2024, refresh=0)
The computational diagnostics are clean.
diagnostics <- util$extract_hmc_diagnostics(fit)
util$check_all_hmc_diagnostics(diagnostics)
All Hamiltonian Monte Carlo diagnostics are consistent with reliable
Markov chain Monte Carlo.
samples4 <- util$extract_expectand_vals(fit)
base_samples <- util$filter_expectands(samples4,
                                       c('mu', 'sigma',
                                         'lambda'),
                                       check_arrays=TRUE)
util$check_all_expectand_diagnostics(base_samples)
All expectands checked appear to be behaving well enough for reliable
Markov chain Monte Carlo estimation.
We also see excellent posterior retrodictive performance with respect to the histogram summary statistic. When the component observational models are homogeneous across the latent states static mixture models often perform reasonably well in aggregate even though they ignore the latent dynamics.
par(mfrow=c(1, 1))
util$plot_hist_quantiles(samples4, 'y_pred', -8, 8, 1,
                         baseline_values=data$y, xlab='y')
Warning in check_bin_containment(bin_min, bin_max, collapsed_values,
"predictive value"): 48 predictive values (0.0%) fell below the binning.
Warning in check_bin_containment(bin_min, bin_max, collapsed_values,
"predictive value"): 647 predictive values (0.2%) fell above the binning.
Inferences for the location and scale of the first component observational model are nearly identical between the dynamic and static models, but the inferences for the second and third component observational model configurations are more strongly informed with the hidden Markov model.
par(mfrow=c(3, 2))
mu_lims <- list(c(-4, -2), c(-5, 5), c(3.5, 6))
mu_ylims <- list(c(0, 2.75), c(0, 1.25), c(0, 4))

sigma_lims <- list(c(0.25, 1.75), c(0, 4.5), c(0.25, 2))
sigma_ylims <- list(c(0, 3.5), c(0, 1.5), c(0, 5.5))

for (k in 1:data$K) {
  name <- paste0('mu[', k, ']')
  util$plot_expectand_pushforward(samples3[[name]], 50,
                                  display_name=name,
                                  flim=mu_lims[[k]],
                                  ylim=mu_ylims[[k]])
  util$plot_expectand_pushforward(samples4[[name]], 50,
                                  display_name=name,
                                  flim=mu_lims[[k]],
                                  ylim=mu_ylims[[k]],
                                  col=util$c_mid,
                                  border="#DDDDDD88",
                                  add=TRUE)

  name <- paste0('sigma[', k, ']')
  util$plot_expectand_pushforward(samples3[[name]], 50,
                                  display_name=name,
                                  flim=sigma_lims[[k]],
                                  ylim=sigma_ylims[[k]])
  util$plot_expectand_pushforward(samples4[[name]], 50,
                                  display_name=name,
                                  flim=sigma_lims[[k]],
                                  ylim=sigma_ylims[[k]],
                                  col=util$c_mid,
                                  border="#DDDDDD88",
                                  add=TRUE)
}
Warning in util$plot_expectand_pushforward(samples4[[name]], 50, display_name =
name, : 1 value (0.0%) fell below the histogram binning.
Warning in util$plot_expectand_pushforward(samples4[[name]], 50, display_name =
name, : 1 value (0.0%) fell above the histogram binning.
Even when a static mixture model adequately captures the aggregate behavior of the observed data the ignorance of any latent dynamics generally results in worse inferential uncertainties.
7.3 Modeling Unobserved Latent States
Now let’s explore some of the ways that we can accommodate latent states that have been only partially observed.
For this exercise we’ll use a data set where there are fewer observations than latent states. Consequently at least some of the latent states have to be missing complementary data.
data <- read_rdump("data/miss.data.R")
cat(sprintf('%i latent states', data$I))
100 latent states
cat(sprintf('%i observational model components', data$K))
3 observational model components
cat(sprintf('%i observations', data$N))
60 observations
If we look at the empirical trajectory we can see two prominent gaps in the observations.
par(mfrow=c(1, 1))
plot(1, type="n",
     xlab="Latent State", xlim=c(0.5, data$I + 0.5),
     ylab="y", ylim=c(-8, 8))

for (miss_win in list(c(31, 55), c(76, 90))) {
  polygon(c(miss_win[1] - 0.5, miss_win[2] + 0.5,
            miss_win[2] + 0.5, miss_win[1] - 0.5),
          c(-8, -8, 8, 8), col="#DDDDDD", border=NA)
}

for (n in 1:data$N) {
  lines(c(data$state[n] - 0.5, data$state[n] + 0.5),
        rep(data$y[n], 2),
        col="black", lwd=2)
}