par(family="serif", las=1, bty="l",
cex.axis=1, cex.lab=1, cex.main=1,
xaxs="i", yaxs="i", mar = c(5, 5, 3, 1))
library(rstan)
rstan_options(auto_write = TRUE) # Cache compiled Stan programs
options(mc.cores = parallel::detectCores()) # Parallelize chains
parallel:::setDefaultClusterOptions(setup_strategy = "sequential")
Hidden Markov Models
In the mixture modeling chapter we first considered how to model data that could have arisen from one of a number of component data generating processes. There we derived mixture observational models under the assumption that the activation of a component data generating process, and hence the behavior of the component assignment variables, is independent across individual observations.
Mixture observational models have broad applicability in practice, but they are by no means universal. In order to generalize this modeling technique we need to consider joint mixture observational models that couple the component assignment variables together. These generalizations, however, also introduce an immediate practical challenge: how, if at all, can we efficiently marginalize the component assignment variables?
From this perspective hidden Markov models are an exceptional generalization of mixture observational models. On one hand they allow for more sophisticated behavior, with the contribution of a component data generating process persisting across multiple observations before another data generating process takes over. On the other hand their mathematical structure is just cooperative enough to allow for practical marginalization.
In this chapter I will introduce hidden Markov models, carefully derive the algorithms needed for efficient marginalization, and demonstrate their robust implementation in Stan.
1 Joint Mixture Observational Models
Consider I component observations, y_{1}, \ldots, y_{i}, \ldots, y_{I}, each of which can arise from one of K component observational models, p(y_{i} \mid i, k, \theta) = p_{i, k}( y_{i} \mid \theta).
Note that in general the y_{i}, and hence the precise form of the p(y_{i} \mid i, k, \theta), do not have to be homogeneous. For example some y_{i} might be discrete while others are continuous. Moreover the dimension of each y_{i} might vary, with some observations featuring more sub-components than others.
To quantify which component observational model is responsible for each component observation we introduce I categorical assignment variables, z_{i} \in \{ 1, \ldots, K \}. These assignment variables are also known as states, as in the state of the overall data generating process for each observation.
When the states z_{i} are unknown, equivalently latent or even hidden, we need to incorporate them into the observational model so that we can infer their behavior. This requires a joint model of the form (Figure 1) \begin{align*} p(z_{1}, \ldots, z_{I}, y_{1}, \ldots, y_{I}, \theta) &= p(y_{1}, \ldots, y_{I} \mid z_{1}, \ldots, z_{I}, \theta) \, p(z_{1}, \ldots, z_{I} \mid \theta) \, p( \theta ) \\ &= \prod_{i = 1}^{I} p_{i, z_{i}}(y_{i} \mid \theta) \, p(z_{1}, \ldots, z_{I} \mid \theta) \, p( \theta ) \\ &= \prod_{i = 1}^{I} p(y_{i} \mid i, z_{i}, \theta) \, p(z_{1}, \ldots, z_{I} \mid \theta) \, p( \theta ). \end{align*}
To make the equations in this chapter a bit less overwhelming let’s take the opportunity to introduce a more compact notation for sequences of observations and latent states, \begin{align*} y_{i_{1}:i_{2}} &= ( y_{i_{1}}, y_{i_{1} + 1}, \ldots, y_{i_{2} - 1}, y_{i_{2}} ) \\ z_{i_{1}:i_{2}} &= ( z_{i_{1}}, z_{i_{1} + 1}, \ldots, z_{i_{2} - 1}, z_{i_{2}} ). \end{align*} This allows us to write the general joint model as p(z_{1:I}, y_{1:I}, \theta) = \prod_{i = 1}^{I} p(y_{i} \mid i, z_{i}, \theta) \, p(z_{1:I} \mid \theta) \, p( \theta ).
In the mixture modeling chapter we derived mixture observational models by assuming that all of the latent states were independent and identically distributed (Figure 2), p(z_{1:I} \mid \theta) = \prod_{i = 1}^{I} p(z_{i} \mid \theta); I will refer to this as a basic mixture observational model. With this assumption we were able to marginalize out each discrete latent state individually. This, in turn, allowed us to use high performance computational tools like Hamiltonian Monte Carlo.
An arbitrary joint mixture observational model, however, features K^{I} distinct configurations of the latent states. Consequently direct summation of the latent states, p(y_{1:I} \mid \theta) = \sum_{z_{1}} \cdots \sum_{z_{I}} p(z_{1:I}, y_{1:I} \mid \theta), is typically infeasible. There is a key exception, however, where the coupling between the latent states allows for efficient marginalization while still being rich enough to be useful for practical applications.
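To see why direct summation is typically infeasible, consider it written out as code. The following Python sketch is a hypothetical illustration, not part of the chapter's R/Stan setup; the function and argument names are my own. It marginalizes an arbitrary joint mixture observational model by explicit enumeration of the latent state configurations:

```python
import itertools

def direct_marginal(y, K, p_states, component_pdf):
    """Evaluate p(y_{1:I} | theta) by summing the joint density over
    every one of the K**I latent state configurations.

    p_states      : callable returning p(z_1, ..., z_I | theta) for a
                    tuple of states -- an arbitrary joint distribution
    component_pdf : callable returning p(y_i | i, z_i = k, theta)
    """
    total = 0.0
    for z in itertools.product(range(K), repeat=len(y)):
        p = p_states(z)  # Joint probability of this state configuration
        for i, y_i in enumerate(y):
            p *= component_pdf(y_i, i, z[i])
        total += p
    return total
```

Because the loop visits every one of the K^{I} configurations this is viable for only a handful of observations; with K = 3 and I = 100, for example, it would have to visit roughly 5 \times 10^{47} terms.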
To motivate this coupling let’s first assume that the latent states are ordered so that we have a well-defined notion of neighboring states. For example the observations might be indexed in the temporal order in which they were made, organized along a one-dimensional spatial direction, and the like.
Next we will assume that each latent state is conditionally dependent on only its preceding neighbor (Figure 3), p(z_{1:I} \mid \theta) = p(z_{1}) \, \prod_{i = 2}^{I} p(z_{i} \mid z_{i - 1}, \theta). This particular dependency structure arises so often in both theoretical and applied probability theory that it has been given its own adjective: Markovian.
Under this Markovian assumption the joint model becomes p(z_{1:I}, y_{1:I}, \theta) = p(z_{1:I}, y_{1:I} \mid \theta) \, p( \theta ) where \begin{align*} p(z_{1:I}, y_{1:I} \mid \theta) &= \prod_{i = 1}^{I} p(y_{i} \mid i, z_{i}, \theta) \, p(z_{1}, \ldots, z_{I} \mid \theta) \\ &= \left[ \prod_{i = 1}^{I} p(y_{i} \mid i, z_{i}, \theta) \right] \left[ p(z_{1}) \, \prod_{i = 2}^{I} p(z_{i} \mid z_{i - 1}, \theta) \right] \\ &= p(z_{1}) \, p(y_{1} \mid 1, z_{1}, \theta) \prod_{i = 2}^{I} p(z_{i} \mid z_{i - 1}, \theta) \, p(y_{i} \mid i, z_{i}, \theta). \end{align*}
Because it assumes “Markovian” dependencies for a sequence of “hidden” states this model is known as a hidden Markov model. That said, it is not unreasonable to refer to it as a Markov mixture observational model if we want to emphasize the assumed structure of the observations as well as that of the latent states.
In the context of a hidden Markov model the conditional probabilities p(z_{i} \mid z_{i - 1}, \theta) are referred to as transition probabilities. A hidden Markov model with the same transition probabilities between all pairs of states, so that p(z_{i} = k' \mid z_{i - 1} = k, \theta) depends on only k and k' but not i, is known as a homogeneous hidden Markov model. Heterogeneous, non-homogeneous, and general hidden Markov models are all terms used for emphasizing varying transition probabilities.
2 The Marginal Hidden Markov Model
In theory the joint observational model p(z_{1:I}, y_{1:I} \mid \theta) is sufficient to infer which latent states z_{1:I} and model configurations \theta are consistent with any observed data. The discreteness of the latent states, however, frustrates practical implementation of those inferences.
Fortunately the Markovian structure of hidden Markov models allows us to efficiently marginalize out the discrete latent states. That said actually deriving an explicit marginalization algorithm from that structure is a pretty onerous task.
In this section I will motivate and then carefully derive the strategy for efficiently marginalizing the discrete hidden states of a hidden Markov model. The derivation requires some pretty brutal calculations and readers interested in only the final result can jump straight to Section 2.3 without missing anything.
2.1 Looking For A Pattern
First and foremost let’s motivate the basic approach to efficient marginalization by carefully working out the direct summations for I = 1, I = 2, and I = 3 and then looking for any useful patterns.
When I = 1 we have p(z_{1}, y_{1} \mid \theta) = p(z_{1}) \, p(y_{1} \mid 1, z_{1}, \theta). Summing over the initial latent state gives \begin{align*} p(y_{1} \mid \theta) &= \sum_{z_{1}} p(z_{1}, y_{1} \mid \theta) \\ &= \sum_{z_{1}} p(y_{1} \mid 1, z_{1}, \theta) \, p(z_{1}). \end{align*} Note the similarity to the marginal form of a basic mixture observational model, p(y \mid \theta) = \sum_{z} p( y \mid z, \theta) \, p(z).
For I = 2 we have p(z_{1}, z_{2}, y_{1}, y_{2} \mid \theta) = p(z_{1}) \, p(y_{1} \mid 1, z_{1}, \theta) \, p(z_{2} \mid z_{1}, \theta) \, p(y_{2} \mid 2, z_{2}, \theta), or, equivalently, p(z_{1}, z_{2}, y_{1}, y_{2} \mid \theta) = p(z_{1}, y_{1} \mid \theta) \, p(z_{2} \mid z_{1}, \theta) \, p(y_{2} \mid 2, z_{2}, \theta). Summing over the initial latent state gives \begin{align*} p(z_{2}, y_{1}, y_{2} \mid \theta) &= \sum_{z_{1}} p(z_{1}, z_{2}, y_{1}, y_{2} \mid \theta) \\ &= \sum_{z_{1}} p(z_{1}, y_{1} \mid \theta) \, p(z_{2} \mid z_{1}, \theta) \, p(y_{2} \mid 2, z_{2}, \theta) \\ &= p(y_{2} \mid 2, z_{2}, \theta) \sum_{z_{1}} p(z_{1}, y_{1} \mid \theta) \, p(z_{2} \mid z_{1}, \theta). \end{align*}
We can then marginalize out both of the latent states by summing this result over z_{2}, \begin{align*} p(y_{1}, y_{2} \mid \theta) &= \sum_{z_{2}} p(z_{2}, y_{1}, y_{2} \mid \theta) \\ &= \sum_{z_{2}} p(y_{2} \mid 2, z_{2}, \theta) \bigg[ \sum_{z_{1}} p(z_{1}, y_{1} \mid \theta) \, p(z_{2} \mid z_{1}, \theta) \bigg] \\ &= \sum_{z_{2}} p(y_{2} \mid 2, z_{2}, \theta) \, p(z_{2}, y_{1} \mid \theta). \end{align*}
Interestingly this once again has the same form as a basic mixture observational model. Now, however, the component weights are informed by the previous observation, p(z_{2}, y_{1} \mid \theta) = \sum_{z_{1}} p(z_{1}, y_{1} \mid \theta) \, p(z_{2} \mid z_{1}, \theta).
Finally let’s consider I = 3 where the joint model is given by \begin{align*} p(z_{1:3}, y_{1:3} \mid \theta) &= p(z_{1}) \, p(y_{1} \mid 1, z_{1}, \theta) \, \prod_{i = 2}^{3} p(z_{i} \mid z_{i - 1}, \theta) \, p(y_{i} \mid i, z_{i}, \theta) \\ &= p(z_{1}, y_{1} \mid \theta) \, \prod_{i = 2}^{3} p(z_{i} \mid z_{i - 1}, \theta) \, p(y_{i} \mid i, z_{i}, \theta). \end{align*}
Here summing over the initial latent state gives \begin{align*} p(z_{2:3}, y_{1:3} \mid \theta) &= \sum_{z_{1}} p(z_{1:3}, y_{1:3} \mid \theta) \\ &= \sum_{z_{1}} p(z_{1}, y_{1} \mid \theta) \, \prod_{i = 2}^{3} p(z_{i} \mid z_{i - 1}, \theta) \, p(y_{i} \mid i, z_{i}, \theta) \\ &= \;\, \bigg[ p(y_{3} \mid 3, z_{3}, \theta) \, p(z_{3} \mid z_{2}, \theta) \bigg] \\ &\quad \cdot \bigg[ p(y_{2} \mid 2, z_{2}, \theta) \, \sum_{z_{1}} p(z_{1}, y_{1} \mid \theta) \, p(z_{2} \mid z_{1}, \theta) \bigg]. \end{align*}
The second term might look familiar; it’s exactly p(z_{2}, y_{1}, y_{2} \mid \theta) which we just derived for the I = 2 case! Consequently we can write \begin{align*} p(z_{2:3}, y_{1:3} \mid \theta) &= \; \bigg[ p(y_{3} \mid 3, z_{3}, \theta) \, p(z_{3} \mid z_{2}, \theta) \bigg] \\ &\quad \cdot \bigg[ p(y_{2} \mid 2, z_{2}, \theta) \, \sum_{z_{1}} p(z_{1}, y_{1} \mid \theta) \, p(z_{2} \mid z_{1}, \theta) \bigg] \\ &= \; \bigg[ p(y_{3} \mid 3, z_{3}, \theta) \, p(z_{3} \mid z_{2}, \theta) \bigg] \\ &\quad \cdot \bigg[ p(z_{2}, y_{1}, y_{2} \mid \theta) \bigg] \\ &= p(y_{3} \mid 3, z_{3}, \theta) \, p(z_{2}, y_{1}, y_{2} \mid \theta) \, p(z_{3} \mid z_{2}, \theta). \end{align*}
Now we can sum over z_{2} to give \begin{align*} p(z_{3}, y_{1:3} \mid \theta) &= \sum_{z_{2}} p(z_{2:3}, y_{1:3} \mid \theta) \\ &= \sum_{z_{2}} p(y_{3} \mid 3, z_{3}, \theta) \, p(z_{2}, y_{1}, y_{2} \mid \theta) \, p(z_{3} \mid z_{2}, \theta) \\ &= p(y_{3} \mid 3, z_{3}, \theta) \, \sum_{z_{2}} p(z_{2}, y_{1}, y_{2} \mid \theta) \, p(z_{3} \mid z_{2}, \theta). \end{align*} Notice how we derive p(z_{3}, y_{1:3} \mid \theta) from p(z_{2}, y_{1}, y_{2} \mid \theta) in exactly the same way that we derived p(z_{2}, y_{1}, y_{2} \mid \theta) from p(z_{1}, y_{1} \mid \theta).
Lastly we sum over z_{3} to give the full marginal probability density function, \begin{align*} p(y_{1:3} \mid \theta) &= \sum_{z_{3}} p(z_{3}, y_{1:3} \mid \theta) \\ &= \sum_{z_{3}} p(y_{3} \mid 3, z_{3}, \theta) \, \bigg[ \sum_{z_{2}} p(z_{2}, y_{1}, y_{2} \mid \theta) \, p(z_{3} \mid z_{2}, \theta) \bigg] \\ &= \sum_{z_{3}} p(y_{3} \mid 3, z_{3}, \theta) \, p(z_{3}, y_{1:2} \mid \theta ). \end{align*}
The final marginal model continues to mirror a basic mixture observational model, in this case with the derived component weights p(z_{3}, y_{1:2} \mid \theta ) = \sum_{z_{2}} p(z_{2}, y_{1}, y_{2} \mid \theta) \, p(z_{3} \mid z_{2}, \theta).
Let’s summarize what we have seen in these first few cases.
Instead of deriving p(y_{1:3} \mid \theta) by summing p(z_{1:3}, y_{1:3} \mid \theta) over z_{1}, z_{2}, and z_{3} at the same time, we can evaluate it by emulating a basic mixture observational model and summing p(z_{3}, y_{1:3} \mid \theta) over just z_{3}.
Conveniently p(z_{3}, y_{1:3} \mid \theta) can be derived directly from p(z_{2}, y_{1:2} \mid \theta), and that term can be derived from p(z_{1}, y_{1} \mid \theta) which in turn can be derived from the initial state probabilities p(z_{1}), \begin{align*} p(z_{1}, y_{1} \;\; \mid \theta) &= p(y_{1} \mid 1, z_{1}, \theta) \, p(z_{1}), \\ p(z_{2}, y_{1:2} \mid \theta) &= p(y_{2} \mid 2, z_{2}, \theta) \sum_{z_{1}} p(z_{1}, y_{1} \mid \theta) \, p(z_{2} \mid z_{1}, \theta), \\ p(z_{3}, y_{1:3} \mid \theta) &= p(y_{3} \mid 3, z_{3}, \theta) \, \sum_{z_{2}} p(z_{2}, y_{1:2} \mid \theta) \, p(z_{3} \mid z_{2}, \theta). \end{align*}
At this point the important question is whether or not this recursive pattern over the first three cases holds more generally, with p(z_{i}, y_{1:i} \mid \theta ) = p(y_{i} \mid i, z_{i}, \theta) \, \bigg[ \sum_{z_{i - 1}} p(z_{i - 1}, y_{1:(i - 1)} \mid \theta) \, p(z_{i} \mid z_{i - 1}, \theta) \bigg] for any 1 < i \le I. If it does then starting with the initial state probabilities p(z_{1}) we can iteratively compute p(z_{I}, y_{1:I} \mid \theta) and then sum over z_{I} to give the marginal model p(y_{1:I} \mid \theta).
2.2 Verifying The Pattern
While not exactly straightforward, with enough mathematical effort we can derive a general relationship between p(z_{i}, y_{1:i} \mid \theta) and p(z_{i - 1}, y_{1:(i - 1)} \mid \theta) using conditional probability theory and the conditional structure of the joint hidden Markov model. The exact properties we need are pretty onerous to work out, and I will not attempt to do so in this chapter. Instead I will shamelessly defer to Bishop (2006) which reviews the necessary properties and strategies for deriving them.
Ultimately we want to reduce p(z_{i}, y_{1:i} \mid \theta) to an object that depends on only the first i - 1 latent states and observations. To that end let’s peel off the terms that depend on z_{i} and y_{i}, \begin{align*} p(z_{i}, y_{1:i} \mid \theta) &= p(z_{i}, y_{i}, y_{1:(i - 1)} \mid \theta) \\ &= p(y_{1:(i - 1)} \mid z_{i}, y_{i}, \theta) \, p(z_{i}, y_{i} \mid \theta) \\ &= p(y_{1:(i - 1)} \mid z_{i}, y_{i}, \theta) \, p(y_{i} \mid i, z_{i}, \theta) \, p(z_{i} \mid \theta). \end{align*}
Because of the Markovian structure of the hidden Markov model the first term can be written as p(y_{1:(i - 1)} \mid z_{i}, y_{i}, \theta) = p(y_{1:(i - 1)} \mid z_{i}, \theta). Consequently we have \begin{align*} p(z_{i}, y_{1:i} \mid \theta) &= p(y_{1:(i - 1)} \mid z_{i}, y_{i}, \theta) \, p(y_{i} \mid i, z_{i}, \theta) \, p(z_{i} \mid \theta) \\ &= p(y_{1:(i - 1)} \mid z_{i}, \theta) \, p(y_{i} \mid i, z_{i}, \theta) \, p(z_{i} \mid \theta) \\ &= p(y_{i} \mid i, z_{i}, \theta) \bigg[ p(y_{1:(i - 1)} \mid z_{i}, \theta) \, p(z_{i} \mid \theta) \bigg] \\ &= p(y_{i} \mid i, z_{i}, \theta) \, p(z_{i}, y_{1:(i - 1)} \mid \theta). \end{align*}
The second term here is a bit awkward, but it can be simplified by writing it as a marginal probability density function, \begin{align*} p(z_{i}, y_{1:(i - 1)} \mid \theta) &= \sum_{z_{i - 1}} p(z_{i - 1}, z_{i}, y_{1:(i - 1)} \mid \theta) \\ &= \sum_{z_{i - 1}} p(z_{i}, y_{1:(i - 1)} \mid z_{i - 1}, \theta) \, p(z_{i - 1} \mid \theta). \end{align*} We can further simplify the first term in the sum to p(z_{i}, y_{1:(i - 1)} \mid z_{i - 1}, \theta) = p(z_{i} \mid z_{i - 1}, \theta) \, p(y_{1:(i - 1)} \mid z_{i - 1}, \theta); intuitively conditioning on z_{i - 1} blocks any coupling between the following latent state z_{i} and the previous observations y_{1:(i - 1)}.
This allows us to write \begin{align*} p(z_{i}, y_{1:(i - 1)} \mid \theta) &= \sum_{z_{i - 1}} p(z_{i}, y_{1:(i - 1)} \mid z_{i - 1}, \theta) \, p(z_{i - 1} \mid \theta) \\ &= \sum_{z_{i - 1}} p(z_{i} \mid z_{i - 1}, \theta) \, p(y_{1:(i - 1)} \mid z_{i - 1}, \theta) \, p(z_{i - 1} \mid \theta) \\ &= \sum_{z_{i - 1}} p(z_{i} \mid z_{i - 1}, \theta) \, \bigg[ p(y_{1:(i - 1)} \mid z_{i - 1}, \theta) \, p(z_{i - 1} \mid \theta) \bigg] \\ &= \sum_{z_{i - 1}} p(z_{i} \mid z_{i - 1}, \theta) \, p(z_{i - 1}, y_{1:(i - 1)} \mid \theta). \end{align*}
Substituting this back into the original equation for p(z_{i}, y_{1:i} \mid \theta) finally gives \begin{align*} p(z_{i}, y_{1:i} \mid \theta) &= p(y_{i} \mid i, z_{i}, \theta) \, p(z_{i}, y_{1:(i - 1)} \mid \theta) \\ &= p(y_{i} \mid i, z_{i}, \theta) \, \sum_{z_{i - 1}} p(z_{i} \mid z_{i - 1}, \theta) \, p(z_{i - 1}, y_{1:(i - 1)} \mid \theta), \end{align*} which is exactly the recursion equation that generalizes the pattern we found in the Section 2.1.
2.3 The Forward Algorithm
Implementing this recursive calculation given observations \tilde{y}_{1}, \ldots, \tilde{y}_{i}, \ldots, \tilde{y}_{I} is made a bit more straightforward with the introduction of some more compact notation.
First let’s denote the initial state probabilities, observational model outputs, and transition probabilities as \begin{align*} \rho_{k} &= p( z_{1} = k ) \\ \omega_{i, k} &= p( \tilde{y}_{i} \mid i, z_{i} = k, \theta) \\ \Gamma_{i, k'k} &= p(z_{i} = k \mid z_{i - 1} = k', \theta). \end{align*} Then we can define \alpha_{i, k} = p(z_{i} = k, \tilde{y}_{1:i} \mid \theta), with the initial element \alpha_{1, k} = \omega_{1, k} \, \rho_{k} and the recursion for 1 < i \le I becoming \begin{align*} \alpha_{i, k} &= p(\tilde{y}_{i} \mid i, z_{i} = k, \theta) \sum_{k' = 1}^{K} p(z_{i} = k \mid z_{i - 1} = k', \theta) \, \alpha_{i - 1, k'} \\ &= \omega_{i, k} \sum_{k' = 1}^{K} \Gamma_{i, k'k} \, \alpha_{i - 1, k'}. \end{align*}
Conveniently we can also write this as a matrix equation, \boldsymbol{\alpha}_{i} = \boldsymbol{\omega}_{i} \circ \left( \boldsymbol{\Gamma}_{i}^{T} \cdot \boldsymbol{\alpha}_{i - 1} \right), where \circ denotes the Hadamard, or element-wise, product. Alternatively we can use only matrix-vector products, \boldsymbol{\alpha}_{i} = \mathrm{diag}(\boldsymbol{\omega}_{i}) \cdot \left( \boldsymbol{\Gamma}_{i}^{T} \cdot \boldsymbol{\alpha}_{i - 1} \right), although we have to be careful to not actually evaluate \mathrm{diag}(\boldsymbol{\omega}_{i}) as a dense matrix and waste precious computational resources.
In this matrix notation the evaluation of the marginal observational model is given by \begin{align*} \boldsymbol{\alpha}_{1} &= \boldsymbol{\omega}_{1} \circ \boldsymbol{\rho} \\ & \ldots \\ \boldsymbol{\alpha}_{i} &= \boldsymbol{\omega}_{i} \circ \left( \boldsymbol{\Gamma}_{i}^{T} \cdot \boldsymbol{\alpha}_{i - 1} \right) \\ & \ldots \\ p(\tilde{y}_{1:I} \mid \theta) &= \mathbf{1}^{T} \cdot \boldsymbol{\alpha}_{I}. \end{align*} This procedure is also known as the forward algorithm as it iterates through the latent states in sequential order.
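In practice the \alpha_{i} accumulate products of many small probability densities and quickly underflow double precision floating point arithmetic, so implementations typically run the recursion on the log scale, replacing each inner sum with a log-sum-exp. Here is a minimal Python sketch of that strategy; the function names are my own, and for simplicity it assumes a single homogeneous transition matrix rather than the general \boldsymbol{\Gamma}_{i}.

```python
import numpy as np

def logsumexp(x, axis=None):
    # Numerically stable log(sum(exp(x)))
    m = np.max(x, axis=axis, keepdims=True)
    s = m + np.log(np.sum(np.exp(x - m), axis=axis, keepdims=True))
    return np.squeeze(s, axis=axis) if axis is not None else s.item()

def forward_algorithm(log_omega, log_rho, log_Gamma):
    """Evaluate log p(y_{1:I} | theta) for a homogeneous hidden Markov model.

    log_omega : (I, K) array, log p(y_i | i, z_i = k, theta)
    log_rho   : (K,)   array, log p(z_1 = k)
    log_Gamma : (K, K) array, [k', k] = log p(z_i = k | z_{i-1} = k')
    """
    I, _ = log_omega.shape
    # alpha_1 = omega_1 * rho, on the log scale
    log_alpha = log_omega[0] + log_rho
    for i in range(1, I):
        # alpha_i = omega_i * (Gamma^T . alpha_{i-1}), with the inner
        # sum over the previous state computed as a row-wise log-sum-exp
        log_alpha = log_omega[i] + logsumexp(log_Gamma.T + log_alpha, axis=1)
    # p(y_{1:I} | theta) = 1^T . alpha_I
    return logsumexp(log_alpha)
```

On the probability scale each iteration is exactly the matrix equation above; working on the log scale trades the Hadamard product for an addition and the matrix-vector product for a row-wise log-sum-exp.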
The cost of each iteration of the forward algorithm is dominated by a matrix-vector product which requires \mathcal{O}(K^{2}) operations. Consequently the cost of the entire forward algorithm scales as \mathcal{O}(I \, K^{2}). For all but the smallest values of K and I this is drastically faster than the \mathcal{O}(K^{I}) scaling that would be required of a brute force summation.
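The difference in scaling is easy to appreciate with a quick tally for the K = 3, I = 100 configuration simulated in the demonstrations below; these Python counts are only rough proxies for the actual floating point work.

```python
K, I = 3, 100

forward_cost = I * K ** 2  # One K x K matrix-vector product per iteration
brute_cost = K ** I        # One term per latent state configuration

print(forward_cost)         # 900
print(f"{brute_cost:.1e}")  # 5.2e+47
```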
That said, for sufficiently large K and/or I the cost of the forward algorithm can still be prohibitive. In particular, increasing the number of possible states K increases the cost much faster than increasing the number of latent states I.
2.4 Alternative Conventions
In the previous derivation of the forward algorithm I have made a few choices of notation that are not universal, and hence might clash with other references.
For example many references like Bishop (2006) consider an initial latent state z_{0} that is not paired with a corresponding observation (Figure 4). One awkward consequence of this convention is that there are different numbers of latent states and observations, which can be tedious to manage in practical implementations. Indexing is already hard enough!
The convention used in this chapter is more general in the sense that the alternative convention can be recovered by setting \omega_{1, k} = p( \tilde{y}_{1} \mid 1, z_{1} = k, \theta) = 1 for all k.
Another potential divergence in notation concerns the transition matrix \boldsymbol{\Gamma}. Some references like Bishop (2006) define the elements of the transition matrix not as \Gamma_{i, k'k} = p(z_{i} = k \mid z_{i - 1} = k', \theta) but rather the transpose \Gamma_{i, kk'} = p(z_{i} = k \mid z_{i - 1} = k', \theta). Neither convention has any substantial practical advantage over the other; we just have to make sure that we’re using one convention consistently.
3 Inferring Latent State Behavior
After marginalization we can still recover inferences for the latent states with an application of Bayes’ Theorem. Specifically the joint posterior density function for all of the latent states is given, up to normalization, by partially evaluating the joint model, p( z_{1:I} \mid \tilde{y}_{1:I}, \theta) \propto p( z_{1:I}, \tilde{y}_{1:I}, \theta). When I is large, however, this joint posterior density function will be much too ungainly to manipulate or communicate directly.
If we are interested in the behavior of the latent states then we will need to be able to isolate meaningful, low-dimensional summaries of these inferences.
As in Section 2 we will make heavy use of some of the properties of the hidden Markov model detailed in Bishop (2006). Moreover, readers interested in only the final results can safely jump directly to the end of each section.
3.1 Marginal Posterior Probabilities For Individual Latent States
In some applications the marginal posterior probabilities for a single latent state, p( z_{i} \mid y_{1:I}, \theta), are useful. We can derive these marginal posterior probabilities with an application of Bayes’ Theorem, p( z_{i} \mid y_{1:I}, \theta) \propto p( y_{1:I} \mid z_{i}, \theta) \, p(z_{i} \mid \theta).
Fortunately deriving the terms on the right hand side of this equation is relatively straightforward if we take advantage of the structure of the joint model. For example the first term simplifies to p( y_{1:I} \mid z_{i}, \theta) = p( y_{1:i} \mid z_{i}, \theta) \, p( y_{(i + 1):I} \mid z_{i}, \theta). In words conditioning on z_{i} breaks any coupling between the preceding observations and the following observations.
Substituting this result gives \begin{align*} p( z_{i} \mid y_{1:I}, \theta) &\propto p( y_{1:I} \mid z_{i}, \theta) \, p(z_{i} \mid \theta) \\ &\propto p( y_{1:i} \mid z_{i}, \theta) \, p( y_{(i + 1):I} \mid z_{i}, \theta) \, p(z_{i} \mid \theta) \\ &\propto \bigg[ p( y_{1:i} \mid z_{i}, \theta) \, p(z_{i} \mid \theta) \bigg] \, p( y_{(i + 1):I} \mid z_{i}, \theta) \\ &\propto p( z_{i}, y_{1:i} \mid \theta) \, p( y_{(i + 1):I} \mid z_{i}, \theta). \end{align*}
As we saw in Section 2, when we run the forward algorithm for a given observation we compute p( z_{i}, \tilde{y}_{1:i} \mid \theta) for each i. Consequently in order to compute the marginal posterior probabilities for each latent state we just need to save these intermediate values and then compute p( \tilde{y}_{(i + 1):I} \mid z_{i}, \theta ).
Conveniently this term can be evaluated with a recursion of its own. To parallel the notation of the forward algorithm let’s denote \beta_{i, z_{i}} = p (\tilde{y}_{(i + 1):I} \mid z_{i}, \theta ). In order to compute a particular \beta_{i, z_{i}} we need to introduce the latent state z_{i + 1} and then decompose, \begin{align*} \beta_{i, z_{i}} &= p (\tilde{y}_{(i + 1):I} \mid z_{i}, \theta) \\ &= \sum_{z_{i + 1}} p (\tilde{y}_{(i + 1):I}, z_{i + 1} \mid z_{i}, \theta ) \\ &= \sum_{z_{i + 1}} p (\tilde{y}_{(i + 1):I} \mid z_{i}, z_{i + 1}, \theta ) \, p(z_{i + 1} \mid z_{i}, \theta). \end{align*}
Now the first term in each summand simplifies due to, you guessed it, the structure of the joint model, p (y_{(i + 1):I} \mid z_{i}, z_{i + 1}, \theta ) = p (y_{(i + 1):I} \mid z_{i + 1}, \theta ). Once we know z_{i + 1} the following observations y_{(i + 1):I} do not depend on any of the previous latent states.
Substituting this simplification gives \begin{align*} \beta_{i, z_{i}} &= \sum_{z_{i + 1}} p (\tilde{y}_{(i + 1):I} \mid z_{i}, z_{i + 1}, \theta ) \, p(z_{i + 1} \mid z_{i}, \theta) \\ &= \sum_{z_{i + 1}} p (\tilde{y}_{(i + 1):I} \mid z_{i + 1}, \theta) \, p(z_{i + 1} \mid z_{i}, \theta) \\ &= \sum_{z_{i + 1}} p(\tilde{y}_{i + 1} \mid i + 1, z_{i + 1}, \theta) \, p(\tilde{y}_{(i + 2):I} \mid z_{i + 1}, \theta) \, p(z_{i + 1} \mid z_{i}, \theta). \end{align*}
That middle term, however, is just the \beta for the subsequent latent state, p (\tilde{y}_{(i + 2):I} \mid z_{i + 1}, \theta) = \beta_{i + 1, z_{i + 1}}!
Consequently we have a backwards recursion formula for the \beta_{i, z_{i}}, \begin{align*} \beta_{i, z_{i}} &= \sum_{z_{i + 1}} p(\tilde{y}_{i + 1} \mid i + 1, z_{i + 1}, \theta) \, p(\tilde{y}_{(i + 2):I} \mid z_{i + 1}, \theta) \, p(z_{i + 1} \mid z_{i}, \theta) \\ &= \sum_{z_{i + 1}} p(\tilde{y}_{i + 1} \mid i + 1, z_{i + 1}, \theta ) \, \beta_{i + 1, z_{i + 1}} \, p(z_{i + 1} \mid z_{i}, \theta) \\ &= \sum_{z_{i + 1}} \omega_{i + 1, z_{i + 1} } \, \beta_{i + 1, z_{i + 1}} \, \Gamma_{i + 1, z_{i} z_{i + 1}}, \end{align*} which we can also write in matrix notation, \boldsymbol{\beta}_{i} = \boldsymbol{\Gamma}_{i + 1} \cdot ( \boldsymbol{\omega}_{i + 1} \circ \boldsymbol{\beta}_{i + 1} ) = \boldsymbol{\Gamma}_{i + 1} \cdot ( \mathrm{diag} ( \boldsymbol{\omega}_{i + 1} ) \cdot \boldsymbol{\beta}_{i + 1} ).
The only remaining issue is how to compute \beta_{I, z_{I}} and initialize the backwards recursion. At the last step we have \begin{align*} p( y_{I} \mid z_{I - 1}, \theta ) &= \sum_{z_{I}} p( y_{I}, z_{I} \mid z_{I - 1}, \theta ) \\ &= \sum_{z_{I}} p( y_{I} \mid z_{I}, z_{I - 1} , \theta) \, p( z_{I} \mid z_{I - 1}, \theta ). \end{align*} Recall, however, that conditioning on z_{I} blocks any coupling between z_{I - 1} and y_{I}. Consequently p( y_{I} \mid z_{I}, z_{I - 1} , \theta) = p( y_{I} \mid I, z_{I}, \theta) and \begin{align*} p( y_{I} \mid z_{I - 1}, \theta ) &= \sum_{z_{I}} p( y_{I} \mid z_{I}, z_{I - 1}, \theta) \, p( z_{I} \mid z_{I - 1}, \theta ) \\ &= \sum_{z_{I}} p( y_{I} \mid I, z_{I}, \theta ) \, p( z_{I} \mid z_{I - 1}, \theta ). \end{align*}
Comparing this to the backwards recursion formula \beta_{I - 1, z_{I - 1}} = \sum_{z_{I}} p(\tilde{y}_{I} \mid I, z_{I}, \theta ) \, \beta_{I, z_{I}} \, p(z_{I} \mid z_{I - 1}, \theta) we see that we will get consistent results if and only if we set \beta_{I, z_{I}} = 1 for all values of z_{I}.
Putting everything together we can evaluate the marginal posterior probabilities for the configuration of any latent state by first scanning forward through the latent states, \begin{align*} \boldsymbol{\alpha}_{1} &= \boldsymbol{\omega}_{1} \circ \boldsymbol{\rho} \\ & \ldots \\ \boldsymbol{\alpha}_{i} &= \boldsymbol{\omega}_{i} \circ \left( \boldsymbol{\Gamma}^{T}_{i} \cdot \boldsymbol{\alpha}_{i - 1} \right) \\ & \ldots, \end{align*} scanning backwards through the latent states, \begin{align*} \boldsymbol{\beta}_{I} &= \mathbf{1} \\ & \ldots \\ \boldsymbol{\beta}_{i - 1} &= \boldsymbol{\Gamma}_{i} \cdot ( \boldsymbol{\omega}_{i} \circ \boldsymbol{\beta}_{i} ), \end{align*} and then finally multiplying and normalizing, p( z_{i} = k \mid \tilde{y}_{1:I}, \theta) = \frac{ \alpha_{i, k} \, \beta_{i, k} } { \sum_{k' = 1}^{K} \alpha_{i, k'} \, \beta_{i, k'} }.
This combined procedure is known as the forward-backward algorithm. Comparing operations we can see that the forward-backward algorithm exhibits the same computational scaling as the forward algorithm alone; once we’ve run the forward algorithm, running the backward algorithm through all of the latent states only doubles the computational cost.
7 Demonstrations
To anchor all of this abstract math into a more practical context let’s work through some implementations of hidden Markov models that demonstrate not only the basic features but also some of the more sophisticated directions that we can take the methodology.
7.1 Setup
First and foremost we have to set up our local R environment.
util <- new.env()
source('mcmc_analysis_tools_rstan.R', local=util)
source('mcmc_visualization_tools.R', local=util)
7.2 Comparing and Contrasting Implementations
Let’s start with a relatively simple exercise that compares some of the basic implementation strategies.
7.2.1 Simulate Data
We’ll simulate data assuming K = 3 component observational models across I = 100 states.

I <- 100
K <- 3
To avoid too much complexity we’ll assume a homogeneous transition matrix between all of the neighboring latent states and the same component observational models at each latent state.
simu_data.stan

data {
  // Number of latent states
  int<lower=1> I;
}

transformed data {
  // True model configuration
  int<lower=1> K = 3;
  array[K] real mu = {-3, 2, 5};
  array[K] real<lower=0> sigma = {1.0, 1.5, 0.75};

  simplex[K] rho = [ 0.8, 0.2, 0.0 ]';
  matrix[K, K] Gamma = [ [ 0.60, 0.30, 0.10 ],
                         [ 0.40, 0.50, 0.10 ],
                         [ 0.05, 0.05, 0.90 ] ];
}

generated quantities {
  array[I] int<lower=1, upper=K> z; // Simulated latent states
  array[I] real y;                  // Simulated observations

  // Initial state
  z[1] = categorical_rng(rho);
  y[1] = normal_rng(mu[z[1]], sigma[z[1]]);

  for (i in 2:I) {
    // Following state
    vector[K] lambda = Gamma[z[i - 1],]';
    z[i] = categorical_rng(lambda);

    // Observation
    y[i] = normal_rng(mu[z[i]], sigma[z[i]]);
  }
}
simu <- stan(file="stan_programs/simu_data.stan",
             algorithm="Fixed_param", seed=194838,
             data=list("I" = I), iter=1, chains=1, refresh=0)

simu_samples <- util$extract_expectand_vals(simu)

y_names <- sapply(1:I, function(i) paste0('y[', i, ']'))
y <- sapply(y_names, function(name) simu_samples[[name]][1, 1])

z_names <- sapply(1:I, function(i) paste0('z[', i, ']'))
z <- sapply(z_names, function(name) simu_samples[[name]][1, 1])

data <- list("I" = I, "y" = y, "K" = K)
7.2.2 Exploratory Data Analysis
There are a few summary statistics that are particularly natural to the structure assumed by a hidden Markov model.
A histogram of all observations collapses the latent dynamics, allowing us to focus on the aggregate behavior of the component observational models. This summary is particularly informative when the component observational models are the same for all latent states.
par(mfrow=c(1, 1))
util$plot_line_hist(data$y, -8, 8, 1, xlab="y")
We can also isolate the latent dynamics as much as possible by plotting the observations against the corresponding latent state indices. This is often referred to as an empirical trajectory.
par(mfrow=c(1, 1))
plot(1, type="n",
xlab="Latent State", xlim=c(0.5, data$I + 0.5),
ylab="y", ylim=c(-8, 8))
for (i in 1:data$I) {
lines(c(i - 0.5, i + 0.5), rep(data$y[i], 2),
col="black", lwd=2)
}
7.2.6 Implementation Consistency Check
To properly compare the outputs of these three model implementations we really need to visualize them next to each other.
As expected, or at least as we hoped, the posterior retrodictive performance for the empirical trajectories is consistent across all three model implementations.
par(mfrow=c(3, 1))
names <- sapply(1:data$I, function(i) paste0('y_pred[', i, ']'))

util$plot_disc_pushforward_quantiles(samples1, names,
                                     baseline_values=data$y,
                                     xlab="Latent State", ylab="y",
                                     main="Log-Sum-Exp")

util$plot_disc_pushforward_quantiles(samples2, names,
                                     baseline_values=data$y,
                                     xlab="Latent State", ylab="y",
                                     main="Rescaling")

util$plot_disc_pushforward_quantiles(samples3, names,
                                     baseline_values=data$y,
                                     xlab="Latent State", ylab="y",
                                     main="Built-In")
So too is the inferred behavior of the latent states.
par(mfrow=c(3, 1))

names <- sapply(1:data$I, function(i) paste0('z[', i, ']'))

util$plot_disc_pushforward_quantiles(samples1, names,
                                     xlab="Latent State", ylab="z",
                                     main="Log-Sum-Exp")
util$plot_disc_pushforward_quantiles(samples2, names,
                                     xlab="Latent State", ylab="z",
                                     main="Rescaling")
util$plot_disc_pushforward_quantiles(samples3, names,
                                     xlab="Latent State", ylab="z",
                                     main="Built-In")
The consistency continues into the inferences for the dynamical parameters.
par(mfrow=c(3, 1))

names <- sapply(1:data$K, function(kk) paste0('rho[', kk, ']'))

util$plot_disc_pushforward_quantiles(samples1, names,
                                     xlab="Component",
                                     ylab="rho",
                                     display_ylim=c(0, 1),
                                     main="Log-Sum-Exp")
util$plot_disc_pushforward_quantiles(samples2, names,
                                     xlab="Component",
                                     ylab="rho",
                                     display_ylim=c(0, 1),
                                     main="Rescaling")
util$plot_disc_pushforward_quantiles(samples3, names,
                                     xlab="Component",
                                     ylab="rho",
                                     display_ylim=c(0, 1),
                                     main="Built-In")
par(mfrow=c(3, 3))

for (k in 1:data$K) {
  names <- sapply(1:data$K,
                  function(kk) paste0('gamma[', k, ',', kk, ']'))

  util$plot_disc_pushforward_quantiles(samples1, names,
                                       xlab="Component",
                                       ylab=paste("Row", k, "of Gamma"),
                                       main="Log-Sum-Exp")
  util$plot_disc_pushforward_quantiles(samples2, names,
                                       xlab="Component",
                                       ylab=paste("Row", k, "of Gamma"),
                                       main="Rescaling")
  util$plot_disc_pushforward_quantiles(samples3, names,
                                       xlab="Component",
                                       ylab=paste("Row", k, "of Gamma"),
                                       main="Built-In")
}
Finally the inferred behavior of the component observational models is the same across all three model implementations.
par(mfrow=c(3, 2))

for (k in 1:data$K) {
  name <- paste0('mu[', k, ']')
  util$plot_expectand_pushforward(samples1[[name]],
                                  75, flim=c(-10, 10),
                                  display_name=name)
  util$plot_expectand_pushforward(samples2[[name]],
                                  75, flim=c(-10, 10),
                                  col=util$c_mid,
                                  border="#DDDDDD88",
                                  add=TRUE)
  util$plot_expectand_pushforward(samples3[[name]],
                                  75, flim=c(-10, 10),
                                  col=util$c_light,
                                  border="#DDDDDD88",
                                  add=TRUE)

  xs <- seq(-10, 10, 0.1)
  ys <- dnorm(xs, 0, 10 / 2.32)
  lines(xs, ys, lwd=2, col="white")
  lines(xs, ys, lwd=1, col=util$c_mid_teal)

  name <- paste0('sigma[', k, ']')
  util$plot_expectand_pushforward(samples1[[name]],
                                  50, flim=c(0, 5),
                                  display_name=name)
  util$plot_expectand_pushforward(samples2[[name]],
                                  50, flim=c(0, 5),
                                  col=util$c_mid,
                                  border="#DDDDDD88",
                                  add=TRUE)
  util$plot_expectand_pushforward(samples3[[name]],
                                  50, flim=c(0, 5),
                                  col=util$c_light,
                                  border="#DDDDDD88",
                                  add=TRUE)

  xs <- seq(0, 5, 0.1)
  ys <- 2 * dnorm(xs, 0, 5 / 2.57)
  lines(xs, ys, lwd=2, col="white")
  lines(xs, ys, lwd=1, col=util$c_mid_teal)
}
7.2.7 Static/Dynamic Mixture Model Comparison
Finally, to see whether there is any advantage to modeling the dynamics of the latent states, let’s compare the hidden Markov model inferences to inferences from a static mixture model.
static_mixture.stan
data {
int<lower=0> K; // Number of component observational models
int<lower=0> I; // Number of latent states
array[I] real y; // Observations at each step
}
parameters {
// Component observational model configurations
ordered[K] mu;
array[K] real<lower=0> sigma;
// Homogeneous mixture probabilities
simplex[K] lambda;
}
model {
// Prior model
target += normal_lpdf(mu | 0, 10 / 2.32); // -10 <~ mu[k] <~ +10
target += normal_lpdf(sigma | 0, 5 / 2.57); // 0 <~ sigma[k] <~ 5
target += dirichlet_lpdf(lambda | rep_vector(1, K));
// Observational model
for (n in 1:I) {
array[K] real lpds;
for (k in 1:K) {
lpds[k] = log(lambda[k])
+ normal_lpdf(y[n] | mu[k], sigma[k]);
    }
    target += log_sum_exp(lpds);
  }
}
generated quantities {
array[I] real y_pred; // Posterior predictive observations
for (i in 1:I) {
int z = categorical_rng(lambda);
y_pred[i] = normal_rng(mu[z], sigma[z]);
  }
}
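Outside of Stan the same marginalization is straightforward to sketch directly. This standalone Python snippet, with hypothetical parameter values and mock observations of my own invention, evaluates a static mixture marginal log likelihood with the same log-sum-exp over components used in the model block above:

```python
import numpy as np

def normal_logpdf(y, m, s):
    # Log density of a normal distribution, vectorized over inputs
    return -0.5 * ((y - m) / s) ** 2 - np.log(s) - 0.5 * np.log(2 * np.pi)

# Hypothetical configuration of a K = 3 static mixture model
lam = np.array([0.5, 0.3, 0.2])    # Mixture probabilities
mu = np.array([-3.0, 0.0, 4.0])    # Component locations
sigma = np.array([1.0, 0.5, 2.0])  # Component scales

y = np.array([-2.8, 0.1, 3.5, 4.2])  # Mock observations

# log(lambda[k]) + normal_lpdf(y[n] | mu[k], sigma[k]) for all (n, k)
lpds = np.log(lam) + normal_logpdf(y[:, None], mu, sigma)

# Stable log-sum-exp over components, as in the Stan model block
m = lpds.max(axis=1, keepdims=True)
log_lik = (m[:, 0] + np.log(np.exp(lpds - m).sum(axis=1))).sum()

print(log_lik)
```

Subtracting the per-observation maximum before exponentiating is exactly what `log_sum_exp` does internally to avoid underflow.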
fit <- stan(file="stan_programs/static_mixture.stan",
            data=data, seed=4938483,
            warmup=1000, iter=2024, refresh=0)
The computational diagnostics are clean.
diagnostics <- util$extract_hmc_diagnostics(fit)
util$check_all_hmc_diagnostics(diagnostics)
All Hamiltonian Monte Carlo diagnostics are consistent with reliable
Markov chain Monte Carlo.
samples4 <- util$extract_expectand_vals(fit)
base_samples <- util$filter_expectands(samples4,
                                       c('mu', 'sigma',
                                         'lambda'),
                                       check_arrays=TRUE)
util$check_all_expectand_diagnostics(base_samples)
All expectands checked appear to be behaving well enough for reliable
Markov chain Monte Carlo estimation.
We also see excellent posterior retrodictive performance with respect to the histogram summary statistic. When the component observational models are homogeneous across the latent states static mixture models often perform reasonably well in aggregate even though they ignore the latent dynamics.
par(mfrow=c(1, 1))
util$plot_hist_quantiles(samples4, 'y_pred', -8, 8, 1,
                         baseline_values=data$y, xlab='y')
Warning in check_bin_containment(bin_min, bin_max, collapsed_values,
"predictive value"): 48 predictive values (0.0%) fell below the binning.
Warning in check_bin_containment(bin_min, bin_max, collapsed_values,
"predictive value"): 647 predictive values (0.2%) fell above the binning.
Inferences for the location and scale of the first component observational model are nearly identical between the dynamic and static models, but the inferences for the second and third component observational model configurations are more strongly informed with the hidden Markov model.
par(mfrow=c(3, 2))

mu_lims <- list(c(-4, -2), c(-5, 5), c(3.5, 6))
mu_ylims <- list(c(0, 2.75), c(0, 1.25), c(0, 4))

sigma_lims <- list(c(0.25, 1.75), c(0, 4.5), c(0.25, 2))
sigma_ylims <- list(c(0, 3.5), c(0, 1.5), c(0, 5.5))

for (k in 1:data$K) {
  name <- paste0('mu[', k, ']')
  util$plot_expectand_pushforward(samples3[[name]], 50,
                                  display_name=name,
                                  flim=mu_lims[[k]],
                                  ylim=mu_ylims[[k]])
  util$plot_expectand_pushforward(samples4[[name]], 50,
                                  display_name=name,
                                  flim=mu_lims[[k]],
                                  ylim=mu_ylims[[k]],
                                  col=util$c_mid,
                                  border="#DDDDDD88",
                                  add=TRUE)

  name <- paste0('sigma[', k, ']')
  util$plot_expectand_pushforward(samples3[[name]], 50,
                                  display_name=name,
                                  flim=sigma_lims[[k]],
                                  ylim=sigma_ylims[[k]])
  util$plot_expectand_pushforward(samples4[[name]], 50,
                                  display_name=name,
                                  flim=sigma_lims[[k]],
                                  ylim=sigma_ylims[[k]],
                                  col=util$c_mid,
                                  border="#DDDDDD88",
                                  add=TRUE)
}
}
Warning in util$plot_expectand_pushforward(samples4[[name]], 50, display_name =
name, : 1 value (0.0%) fell below the histogram binning.
Warning in util$plot_expectand_pushforward(samples4[[name]], 50, display_name =
name, : 1 value (0.0%) fell above the histogram binning.
Even when a static mixture model adequately captures the aggregate behavior of the observed data the ignorance of any latent dynamics generally results in worse inferential uncertainties.
7.3 Modeling Unobserved Latent States
Now let’s explore some of the ways that we can accommodate latent states that have been only partially observed.
For this exercise we’ll use a data set where there are fewer observations than latent states. Consequently at least some of the latent states have to be missing complementary data.
data <- read_rdump("data/miss.data.R")

cat(sprintf('%i latent states', data$I))
100 latent states
cat(sprintf('%i observational model components', data$K))
3 observational model components
cat(sprintf('%i observations', data$N))
60 observations
If we look at the empirical trajectory we can see two prominent gaps in the observations.
par(mfrow=c(1, 1))
plot(1, type="n",
xlab="Latent State", xlim=c(0.5, data$I + 0.5),
ylab="y", ylim=c(-8, 8))
for (miss_win in list(c(31, 55), c(76, 90))) {
  polygon(c(miss_win[1] - 0.5, miss_win[2] + 0.5,
            miss_win[2] + 0.5, miss_win[1] - 0.5),
          c(-8, -8, 8, 8), col="#DDDDDD", border=NA)
}
for (n in 1:data$N) {
lines(c(data$state[n] - 0.5, data$state[n] + 0.5),
rep(data$y[n], 2),
col="black", lwd=2)
}
The most straightforward way to accommodate unobserved latent states is to replace the log observational model outputs with zeroes whenever a latent state is unobserved. In particular this allows us to continue using the built-in hidden Markov model functions in Stan.
For convenience this Stan program fills in the unobserved latent states with proxy observations and then constructs a binary variable indicating whether or not each latent state is observed. This just makes it easier to work sequentially through the latent states.
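To see why zeroing the log observational model outputs marginalizes the unobserved latent states, consider a standalone Python sketch with hypothetical toy values: a brute-force sum of the joint density over all latent trajectories, skipping the emission factor for the missing observation, matches a forward pass in which the missing step contributes omega = 1, i.e. log_omega = 0.

```python
import itertools
import math
import numpy as np

K = 2
rho = np.array([0.6, 0.4])     # Initial state probabilities
Gamma = np.array([[0.9, 0.1],  # Transition matrix; rows sum to one
                  [0.2, 0.8]])
mu, sigma = np.array([-1.0, 2.0]), np.array([0.5, 1.0])

def normal_pdf(y, m, s):
    return math.exp(-0.5 * ((y - m) / s) ** 2) / (s * math.sqrt(2 * math.pi))

# Three latent states; the middle observation is missing
ys = [0.3, None, 1.7]

# Brute force: sum the joint density over all latent trajectories,
# dropping the emission factor for the unobserved step
brute = 0.0
for z in itertools.product(range(K), repeat=3):
    p = rho[z[0]]
    for i in range(1, 3):
        p *= Gamma[z[i - 1], z[i]]
    for i, y in enumerate(ys):
        if y is not None:
            p *= normal_pdf(y, mu[z[i]], sigma[z[i]])
    brute += p

# Forward algorithm: an unobserved step contributes omega = 1,
# exactly as setting log_omega = 0 does in the Stan program
alpha = rho.copy()
for i, y in enumerate(ys):
    if i > 0:
        alpha = Gamma.T @ alpha
    if y is not None:
        omega = np.array([normal_pdf(y, mu[k], sigma[k]) for k in range(K)])
        alpha = omega * alpha
marginal = alpha.sum()

print(abs(brute - marginal) < 1e-12)
```

Summing a trajectory probability over a latent state with no emission factor attached is exactly marginalizing that state out, which is all the zeroed log_omega column asks the forward algorithm to do.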
hmm_builtin_partial.stan
data {
int<lower=0> K; // Number of component observational models
int<lower=0> I; // Number of latent states
int<lower=1, upper=I> N; // Number of observations
array[N] int<lower=1, upper=I> state; // Observed latent states
array[N] real y; // Observations
}
transformed data {
// Surrogate value for unobserved latent states
real miss_val = -25.0;
// Buffer observations with surrogate
// values for unobserved latent states
array[I] real y_buff = rep_array(miss_val, I);
y_buff[state] = y;
// Binary variable indicating whether or
// not each latent state is observed
array[I] int<lower=0, upper=1> obs_flag = rep_array(0, I);
  obs_flag[state] = rep_array(1, N);
}
parameters {
// Component observational model configurations
ordered[K] mu;
array[K] real<lower=0> sigma;
// Latent state dynamics
simplex[K] rho;
array[K] simplex[K] gamma;
}
model {
// Prior model
target += normal_lpdf(mu | 0, 10 / 2.32); // -10 <~ mu[k] <~ +10
target += normal_lpdf(sigma | 0, 5 / 2.57); // 0 <~ sigma[k] <~ 5
target += dirichlet_lpdf(rho | rep_vector(1, K));
for (k in 1:K)
target += dirichlet_lpdf(gamma[k] | rep_vector(1, K));
// Observational model
  { // Construct transition matrix
matrix[K, K] Gamma;
for (k in 1:K) Gamma[k,] = gamma[k]';
// Construct component observational model evaluations
matrix[K, I] log_omega = rep_matrix(0, K, I);
for (i in 1:I) {
if (obs_flag[i])
for (k in 1:K)
log_omega[k, i] = normal_lpdf(y_buff[i] | mu[k], sigma[k]);
}
// Marginal observational model
target += hmm_marginal(log_omega, Gamma, rho);
}
}
generated quantities {
array[I] int z; // Posterior latent states
array[N] real y_pred; // Posterior predictive observations
  { // Construct transition matrix
matrix[K, K] Gamma;
for (k in 1:K) Gamma[k,] = gamma[k]';
// Construct component observational model evaluations
matrix[K, I] log_omega = rep_matrix(0, K, I);
for (i in 1:I) {
if (obs_flag[i])
for (k in 1:K)
log_omega[k, i] = normal_lpdf(y_buff[i] | mu[k], sigma[k]);
}
z = hmm_latent_rng(log_omega, Gamma, rho);
array[I] real y_all_pred = normal_rng(mu[z], sigma[z]);
y_pred = y_all_pred[state];
  }
}
fit <- stan(file="stan_programs/hmm_builtin_partial.stan",
            data=data, seed=4938483,
            warmup=1000, iter=2024, refresh=0)
There are some indications of mild autocorrelations but no indications of inaccurate computation.
diagnostics <- util$extract_hmc_diagnostics(fit)
util$check_all_hmc_diagnostics(diagnostics)
All Hamiltonian Monte Carlo diagnostics are consistent with reliable
Markov chain Monte Carlo.
samples1 <- util$extract_expectand_vals(fit)
base_samples <- util$filter_expectands(samples1,
                                       c('mu', 'sigma',
                                         'rho', 'gamma'),
                                       check_arrays=TRUE)
util$check_all_expectand_diagnostics(base_samples)
mu[2]:
Chain 1: hat{ESS} (73.717) is smaller than desired (100).
Chain 2: hat{ESS} (74.010) is smaller than desired (100).
Chain 3: hat{ESS} (52.720) is smaller than desired (100).
mu[3]:
Chain 1: hat{ESS} (94.008) is smaller than desired (100).
gamma[1,1]:
Chain 1: hat{ESS} (82.813) is smaller than desired (100).
Chain 3: hat{ESS} (52.942) is smaller than desired (100).
gamma[1,2]:
Chain 1: Right tail hat{xi} (0.280) exceeds 0.25.
Chain 1: hat{ESS} (84.315) is smaller than desired (100).
Chain 3: hat{ESS} (52.267) is smaller than desired (100).
gamma[2,3]:
Chain 3: hat{ESS} (97.845) is smaller than desired (100).
Large tail hat{xi}s suggest that the expectand might not be
sufficiently integrable.
Small empirical effective sample sizes result in imprecise Markov chain
Monte Carlo estimators.
The posterior retrodictive performance is solid.
par(mfrow=c(1, 1))
util$plot_hist_quantiles(samples1, 'y_pred', -10, 12, 2,
                         baseline_values=data$y, xlab='y')
Warning in check_bin_containment(bin_min, bin_max, collapsed_values,
"predictive value"): 8 predictive values (0.0%) fell below the binning.
Warning in check_bin_containment(bin_min, bin_max, collapsed_values,
"predictive value"): 4 predictive values (0.0%) fell above the binning.
par(mfrow=c(1, 1))

names <- sapply(1:data$N, function(n) paste0('y_pred[', n, ']'))
util$plot_disc_pushforward_quantiles(samples1, names,
                                     baseline_values=data$y,
                                     xlab="Observed Latent State",
                                     ylab="y",
                                     xticklabs=data$state)
Unsurprisingly inferences for the latent states become increasingly uncertain within the unobserved gaps.
par(mfrow=c(1, 1))

names <- sapply(1:data$I, function(i) paste0('z[', i, ']'))
util$plot_disc_pushforward_quantiles(samples1, names,
                                     xlab="Latent State", ylab="z")
abline(v=30.5, lwd=2, lty=3, col="#DDDDDD")
abline(v=55.5, lwd=2, lty=3, col="#DDDDDD")
abline(v=75.5, lwd=2, lty=3, col="#DDDDDD")
abline(v=90.5, lwd=2, lty=3, col="#DDDDDD")
The other way that we can take unobserved latent states into account is to remove them and update the transition matrices between the observed latent states. Because this results in a non-homogeneous hidden Markov model we can no longer use the built-in Stan hidden Markov model functions. Fortunately this non-homogeneous model is not too difficult to implement by hand.
There are a few ways that we could implement the heterogeneous transition matrices. For example we could pre-compute and then store every transition matrix, but this would introduce a substantial memory burden. A more effective approach is to compute each transition matrix only when it is needed. That said, the matrix-matrix products required can be expensive, especially across long gaps of unobserved latent states.
In this Stan program I don’t compute the collapsed transition matrices directly but rather iteratively compute their action on the \boldsymbol{\alpha}_{i} and \boldsymbol{\beta}_{i} in the forward and backward algorithms. This requires only matrix-vector products which are less expensive than matrix-matrix products, especially if K is large.
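To illustrate the trade-off, here is a standalone Python sketch with hypothetical values: applying the transposed transition matrix to a forward vector delta times reproduces the explicit matrix power, at O(K^2) cost per application instead of the O(K^3) cost of each matrix-matrix product.

```python
import numpy as np

K = 3
rng = np.random.default_rng(8675309)

# A random transition matrix whose rows sum to one,
# and an arbitrary forward vector
Gamma = rng.dirichlet(np.ones(K), size=K)
alpha = rng.dirichlet(np.ones(K))
delta = 5  # Gap between consecutive observed latent states

# Explicit collapsed transition matrix via a matrix power
jumped_power = np.linalg.matrix_power(Gamma.T, delta) @ alpha

# Iterated matrix-vector products; same action on alpha,
# but each step costs only O(K^2)
jumped_iter = alpha.copy()
for _ in range(delta):
    jumped_iter = Gamma.T @ jumped_iter

print(np.allclose(jumped_power, jumped_iter))
```

Because the transposed transition matrix maps probability vectors to probability vectors, the jumped forward vector still sums to one, which also makes for a convenient sanity check.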
hmm_rescaled_partial.stan
data {
int<lower=0> K; // Number of component observational models
int<lower=0> I; // Number of latent states
int<lower=1, upper=I> N; // Number of observations
array[N] int<lower=1, upper=I> state; // Observed latent states
array[N] real y; // Observations
}
parameters {
// Component observational model configurations
ordered[K] mu;
array[K] real<lower=0> sigma;
// Latent state dynamics
simplex[K] rho;
array[K] simplex[K] gamma;
}
model {
// Prior model
target += normal_lpdf(mu | 0, 10 / 2.32); // -10 <~ mu[k] <~ +10
target += normal_lpdf(sigma | 0, 5 / 2.57); // 0 <~ sigma[k] <~ 5
target += dirichlet_lpdf(rho | rep_vector(1, K));
for (k in 1:K)
target += dirichlet_lpdf(gamma[k] | rep_vector(1, K));
// Observational model
  { // Construct baseline transition matrix
matrix[K, K] Gamma;
for (k in 1:K) Gamma[k,] = gamma[k]';
// Forward algorithm, jumping across any gaps of unobserved
// latent states with a heterogeneous transition matrix
real norm;
real log_norm = 0;
vector[K] alpha = rho;
norm = max(alpha);
log_norm += log(norm);
alpha /= norm;
for (n in 1:N) {
vector[K] omega;
for (k in 1:K)
omega[k] = exp(normal_lpdf(y[n] | mu[k], sigma[k]));
if (n > 1) {
int delta = state[n] - state[n - 1];
for (d in 1:delta)
alpha = Gamma' * alpha;
alpha = omega .* alpha;
      } else {
alpha = omega .* (Gamma' * alpha);
}
norm = max(alpha);
log_norm += log(norm);
alpha /= norm;
}
// Marginal observational model
    target += log(sum(alpha)) + log_norm;
}
}
generated quantities {
array[N] int z; // Posterior latent states
array[N] real y_pred; // Posterior predictive observations
  { // Construct baseline transition matrix
matrix[K, K] Gamma;
for (k in 1:K) Gamma[k,] = gamma[k]';
// Forward algorithm, jumping across any gaps of unobserved
// latent states with a heterogeneous transition matrix
array[N] vector[K] alpha;
for (n in 1:N) {
vector[K] omega;
for (k in 1:K)
omega[k] = exp(normal_lpdf(y[n] | mu[k], sigma[k]));
if (n > 1) {
int delta = state[n] - state[n - 1];
vector[K] a = alpha[n - 1];
for (d in 1:delta)
a = Gamma' * a;
alpha[n] = omega .* a;
      } else {
alpha[n] = omega .* (Gamma' * rho);
}
alpha[n] /= max(alpha[n]);
}
// Sample final latent state
vector[K] r = alpha[N];
vector[K] lambda = r / sum(r);
z[N] = categorical_rng(lambda);
y_pred[N] = normal_rng(mu[z[N]], sigma[z[N]]);
// Sample latent states while running the backward algorithm,
// once again jumping across any gaps of unobserved latent
// states with a heterogeneous transition matrix
vector[K] beta = ones_vector(K);
    for (rn in 2:N) {
      int n = N + 1 - rn;
      int z_prev = z[n + 1];

      vector[K] omega_prev;
      for (k in 1:K)
        omega_prev[k] = exp(normal_lpdf(y[n + 1] | mu[k], sigma[k]));

      // Collapsed transition matrix spanning the gap between
      // observed latent states n and n + 1
      matrix[K, K] Gamma_local
        = matrix_power(Gamma, state[n + 1] - state[n]);

      r =   beta[z_prev] * omega_prev[z_prev]
          * ( alpha[n] .* Gamma_local[,z_prev] );
      lambda = r / sum(r);
      z[n] = categorical_rng(lambda);
      y_pred[n] = normal_rng(mu[z[n]], sigma[z[n]]);

      beta = Gamma_local * (omega_prev .* beta);
      beta /= max(beta);
    }
  }
}
fit <- stan(file="stan_programs/hmm_rescaled_partial.stan",
            data=data, seed=4938483,
            warmup=1000, iter=2024, refresh=0)
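To build confidence in the rescaling strategy itself we can check, in a standalone Python sketch with mock values, that accumulating the logs of the max-rescalings recovers exactly the log marginal computed without any rescaling:

```python
import numpy as np

rng = np.random.default_rng(194838)
K, N = 3, 40

Gamma = rng.dirichlet(np.ones(K), size=K)  # Transition matrix
rho = rng.dirichlet(np.ones(K))            # Initial state probabilities
omegas = rng.uniform(1e-8, 1e-2, size=(N, K))  # Mock emission densities

# Unscaled forward algorithm; prone to underflow for long sequences
alpha = rho * omegas[0]
for n in range(1, N):
    alpha = omegas[n] * (Gamma.T @ alpha)
log_marginal_direct = np.log(alpha.sum())

# Rescaled forward algorithm: divide by the running max
# and accumulate its log separately
alpha = rho * omegas[0]
log_norm = 0.0
for n in range(1, N):
    alpha = omegas[n] * (Gamma.T @ alpha)
    norm = alpha.max()
    log_norm += np.log(norm)
    alpha /= norm
log_marginal_rescaled = np.log(alpha.sum()) + log_norm

print(np.isclose(log_marginal_direct, log_marginal_rescaled))
```

This toy sequence is short enough that the unscaled recursion does not underflow, which is what makes the direct comparison possible; for longer sequences only the rescaled version survives in floating point.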
We see the same hints of moderate autocorrelations in the computational diagnostics, but nothing that suggests we shouldn’t trust our posterior computation.
diagnostics <- util$extract_hmc_diagnostics(fit)
util$check_all_hmc_diagnostics(diagnostics)
All Hamiltonian Monte Carlo diagnostics are consistent with reliable
Markov chain Monte Carlo.
samples2 <- util$extract_expectand_vals(fit)
base_samples <- util$filter_expectands(samples2,
                                       c('mu', 'sigma',
                                         'rho', 'gamma'),
                                       check_arrays=TRUE)
util$check_all_expectand_diagnostics(base_samples)
mu[1]:
Chain 1: hat{ESS} (90.239) is smaller than desired (100).
Chain 4: hat{ESS} (76.016) is smaller than desired (100).
mu[2]:
Chain 1: hat{ESS} (50.050) is smaller than desired (100).
Chain 3: hat{ESS} (95.651) is smaller than desired (100).
Chain 4: hat{ESS} (52.991) is smaller than desired (100).
mu[3]:
Chain 4: hat{ESS} (80.004) is smaller than desired (100).
sigma[3]:
Chain 4: hat{ESS} (85.254) is smaller than desired (100).
gamma[1,1]:
Chain 1: hat{ESS} (41.724) is smaller than desired (100).
Chain 4: hat{ESS} (38.617) is smaller than desired (100).
gamma[1,2]:
Chain 4: Right tail hat{xi} (0.322) exceeds 0.25.
Chain 1: hat{ESS} (37.214) is smaller than desired (100).
Chain 4: hat{ESS} (36.432) is smaller than desired (100).
gamma[2,3]:
Chain 1: hat{ESS} (87.830) is smaller than desired (100).
Chain 4: hat{ESS} (67.000) is smaller than desired (100).
Large tail hat{xi}s suggest that the expectand might not be
sufficiently integrable.
Small empirical effective sample sizes result in imprecise Markov chain
Monte Carlo estimators.
We don’t see any indications of poor retrodictive performance.
par(mfrow=c(1, 1))
util$plot_hist_quantiles(samples2, 'y_pred', -10, 12, 2,
                         baseline_values=data$y, xlab='y')
Warning in check_bin_containment(bin_min, bin_max, collapsed_values,
"predictive value"): 9 predictive values (0.0%) fell below the binning.
Warning in check_bin_containment(bin_min, bin_max, collapsed_values,
"predictive value"): 10 predictive values (0.0%) fell above the binning.
par(mfrow=c(1, 1))

names <- sapply(1:data$N, function(n) paste0('y_pred[', n, ']'))
util$plot_disc_pushforward_quantiles(samples2, names,
                                     baseline_values=data$y,
                                     xlab="Observed Latent State",
                                     ylab="y",
                                     xticklabs=data$state)
Either way we approach this problem our posterior inferences are consistent with each other. For example the inferred behavior for the observed latent states is equivalent.
par(mfrow=c(2, 1))

names <- sapply(data$state, function(i) paste0('z[', i, ']'))
util$plot_disc_pushforward_quantiles(samples1, names,
                                     xlab="Observed Latent State",
                                     ylab="z",
                                     xticklabs=data$state,
                                     main="Expanded Model")
abline(v=30.5, lwd=2, lty=3, col="#DDDDDD")
abline(v=50.5, lwd=2, lty=3, col="#DDDDDD")

names <- sapply(1:data$N, function(n) paste0('z[', n, ']'))
util$plot_disc_pushforward_quantiles(samples2, names,
                                     xlab="Observed Latent State",
                                     ylab="z",
                                     xticklabs=data$state,
                                     main="Collapsed Model")
abline(v=30.5, lwd=2, lty=3, col="#DDDDDD")
abline(v=50.5, lwd=2, lty=3, col="#DDDDDD")
So too is the inferred behavior for the base transition matrix.
par(mfrow=c(3, 2))

for (k in 1:data$K) {
  names <- sapply(1:data$K,
                  function(kk) paste0('gamma[', k, ',', kk, ']'))
  ylab <- paste("Row", k, "of Gamma")

  util$plot_disc_pushforward_quantiles(samples1, names,
                                       xlab="Component", ylab=ylab,
                                       main="Expanded Model")
  util$plot_disc_pushforward_quantiles(samples2, names,
                                       xlab="Component", ylab=ylab,
                                       main="Collapsed Model")
}
As well as the component observational model configurations.
par(mfrow=c(2, 2))

names <- sapply(1:data$K, function(k) paste0('mu[', k, ']'))

util$plot_disc_pushforward_quantiles(samples1, names,
                                     xlab="Component", ylab="mu",
                                     main="Expanded Model")
util$plot_disc_pushforward_quantiles(samples2, names,
                                     xlab="Component", ylab="mu",
                                     main="Collapsed Model")

names <- sapply(1:data$K, function(k) paste0('sigma[', k, ']'))

util$plot_disc_pushforward_quantiles(samples1, names,
                                     xlab="Component", ylab="sigma",
                                     main="Expanded Model")
util$plot_disc_pushforward_quantiles(samples2, names,
                                     xlab="Component", ylab="sigma",
                                     main="Collapsed Model")
8 Conclusion
Hidden Markov models generalize mixture observational modeling by allowing for persistent dynamics in the behavior of the data generating process. What really makes hidden Markov models so powerful in practice is that the forward and backward algorithms allow us to efficiently marginalize out the component assignment variables. With care in how exactly we define the latent states and the component observational model outputs these marginalization algorithms allow hidden Markov models to be used in a wide range of sophisticated applications.
Appendix: State-Space Models
The construction of hidden Markov models is actually quite a bit more general than what we have considered in this chapter (Cappé, Moulines, and Rydén (2005)). In particular the mathematical results generalize pretty immediately to any mixture observational model, including not only discrete mixture models but also continuous mixture models. This is useful if, for example, we want to model the evolution of a continuous latent state such as the position and velocity of a moving object.
Most references reserve the term hidden Markov model for finite latent states, using the term state space model to refer to models with the same conditional structure but more general latent states. That said, this convention is not universal, and it never hurts to use redundant language, such as “discrete hidden Markov model” or even “finite hidden Markov model”, when there is any possibility of confusion.
Generalizing the hidden Markov model construction presented in this chapter uses the same joint model, p(z_{1:I}, y_{1:I} \mid \theta) = p(z_{1}) \, p(y_{1} \mid 1, z_{1}, \theta) \prod_{i = 2}^{I} p(z_{i} \mid z_{i - 1}, \theta) \, p(y_{i} \mid i, z_{i}, \theta ). In order to marginalize the latent states, however, we have to replace summations with more general expectations. For example in order to accommodate a real-valued latent state we need to replace the marginal model p(y_{1:I} \mid \theta) = \sum_{z_{1}} \cdots \sum_{z_{I}} p(z_{1:I}, y_{1:I} \mid \theta) with p(y_{1:I} \mid \theta) = \int \mathrm{d} z_{1} \cdots \mathrm{d} z_{I} \, p(z_{1:I}, y_{1:I} \mid \theta).
In this case the forward algorithm becomes p(z_{i}, y_{1:i} \mid \theta) = p(y_{i} \mid i, z_{i}, \theta) \, \int \mathrm{d} z_{i - 1} \, p(z_{i} \mid z_{i - 1}, \theta) \, p(z_{i - 1}, y_{1:(i - 1)} \mid \theta) while the backward algorithm becomes p(y_{(i + 1):I} \mid z_{i}, \theta) = \int \mathrm{d} z_{i + 1} \, p(y_{i + 1} \mid i + 1, z_{i + 1}, \theta) \, p(y_{(i + 2):I} \mid z_{i + 1}, \theta) \, p(z_{i + 1} \mid z_{i}, \theta) with p( z_{i} \mid y_{1:I}, \theta) \propto p( z_{i}, y_{1:i} \mid \theta) \, p( y_{(i + 1):I} \mid z_{i}, \theta).
The practical implementation of these more general forward and backward algorithms is not always feasible. In addition to needing all of the integrals to admit analytic solutions we also need the intermediate conditional probability density functions to all fall into the same family of probability density functions so that we have to keep track of only finite-dimensional parameters and not arbitrary, infinite-dimensional functions.
One exceptional circumstance where this implementation is feasible is when the observational and transition probability density functions are all multivariate normal. In this case each p(z_{i}, y_{1:i} \mid \theta), p(y_{(i + 1):I} \mid z_{i}, \theta), and p( z_{i} \mid y_{1:I}, \theta) also becomes a multivariate normal probability density function that we can completely characterize with location and covariance parameters. The forward and backward algorithms then reduce to recursive updates of these parameters.
Even better, the recursive updates all become linear operations on these parameters; consequently this model is known as a linear state space model. In this case the forward algorithm reduces to the celebrated Kalman filter.
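As a standalone sketch of this reduction, consider a hypothetical one-dimensional linear-Gaussian state space model: each forward density is characterized by just a mean and a variance, and the forward algorithm collapses into the familiar Kalman predict and update steps.

```python
# Hypothetical one-dimensional linear-Gaussian state space model:
#   z[i] = a * z[i - 1] + N(0, q),    y[i] = z[i] + N(0, r)
a, q, r = 0.9, 0.5, 1.0
m, p = 0.0, 2.0             # Mean and variance of the initial latent state

ys = [0.4, 1.1, 0.7, -0.2]  # Mock observations

for y in ys:
    # Predict: push the forward density through the transition model
    m_pred = a * m
    p_pred = a * a * p + q
    # Update: condition the predicted density on the new observation
    gain = p_pred / (p_pred + r)
    m = m_pred + gain * (y - m_pred)
    p = (1.0 - gain) * p_pred

print(m, p)  # Filtered mean and variance after the last observation
```

Every step is a linear (or affine) map of the mean and variance, which is exactly why no function-valued bookkeeping is needed in the linear-Gaussian case.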
Acknowledgements
A very special thanks to everyone supporting me on Patreon: Adam Fleischhacker, Alejandro Navarro-Martínez, Alessandro Varacca, Alex D, Alexander Noll, Amit, Andrea Serafino, Andrew Mascioli, Andrew Rouillard, Ara Winter, Ari Holtzman, Austin Rochford, Aviv Keshet, Avraham Adler, Ben Matthews, Ben Swallow, Benoit Essiambre, boot, Bradley Kolb, Brendan Galdo, Bryan Chang, Brynjolfur Gauti Jónsson, Cameron Smith, Canaan Breiss, Cat Shark, Cathy Oliveri, Charles Naylor, Chase Dwelle, Chris Jones, Christina Van Heer, Christopher Mehrvarzi, Colin Carroll, Colin McAuliffe, Damien Mannion, dan mackinlay, Dan W Joyce, Dan Waxman, Dan Weitzenfeld, Daniel Edward Marthaler, Daniel Saunders, Danny Van Nest, Darshan Pandit, Darthmaluus , David Burdelski, David Wurtz, Doug Rivers, Dr. Jobo, Dr. Omri Har Shemesh, Dylan Maher, Dylan Spielman, Ed Cashin, Edgar Merkle, Eli Witus, Eric LaMotte, Ero Carrera, Eugene O’Friel, Felipe González, Fergus Chadwick, Finn Lindgren, Francesco Corona, Geoff Rollins, Granville Matheson, Gregor Gorjanc, Guilherme Marthe, Håkan Johansson, Hamed Bastan-Hagh, haubur, Hector Munoz, Henri Wallen, hs, Hugo Botha, Ian, Ian Costley, idontgetoutmuch, Ignacio Vera, Ilaria Prosdocimi, Isaac Vock, Isidor Belic, jacob pine, Jair Andrade, James C, James Hodgson, James Wade, Janek Berger, Jarrett Byrnes, Jason Martin, Jason Pekos, Jason Wong, jd, Jeff Burnett, Jeff Dotson, Jeff Helzner, Jeffrey Erlich, Jerry Lin , Jessica Graves, Joe Sloan, John Flournoy, Jonathan H. 
Morgan, Jonathon Vallejo, Joran Jongerling, Josh Knecht, Joshua Miller, JU, Julian Lee, June, Justin Bois, Kádár András, Karim Naguib, Karim Osman, Kristian Gårdhus Wichmann, Lars Barquist, lizzie , LOU ODETTE, Luís F, Mads Christian Hansen, Marek Kwiatkowski, Mariana Carmona, Mark Donoghoe, Markus P., Márton Vaitkus, Matthew, Matthew Kay, Matthew Mulvahill, Matthieu LEROY, Mattia Arsendi, Matěj, Maurits van der Meer, Max, Michael Colaresi, Michael DeWitt, Michael Dillon, Michael Lerner, Mick Cooney, Mike Lawrence, MisterMentat, N Sanders, N.S. , Name, Nathaniel Burbank, Nicholas Clark, Nicholas Cowie, Nick S, Nikita Karetnikov, Octavio Medina, Ole Rogeberg, Oliver Crook, Olivier Ma, Patrick Kelley, Patrick Boehnke, Pau Pereira Batlle, Peter Johnson, Pieter van den Berg , ptr, quasar, Ramiro Barrantes Reynolds, Raúl Peralta Lozada, Ravin Kumar, Rémi , Rex, Riccardo Fusaroli, Richard Nerland, Robert Frost, Robert Goldman, Robert kohn, Robin Taylor, Ryan Gan, Ryan Grossman, Ryan Kelly, Sean Wilson, Seth Axen, shira, Simon Duane, Simon Lilburn, Simon Steiger, Simone Sebben, sssz, Stefan Lorenz, Stephen Lienhard, Steve Harris, Steven Forrest, Stew Watts, Stone Chen, Susan Holmes, Svilup, Tate Tunstall, Tatsuo Okubo, Teresa Ortiz, Theodore Dasher, Thomas Kealy, Thomas Siegert, Thomas Vladeck, Tobychev , Tony Wuersch, Tyler Burch, Virginia Fisher, Vitalie Spinu, Vladimir Markov, Wil Yegelwel, Will Farr, Will Lowe, Will Wen, woejozney, yolhaj, yureq, Zach A, and Zhengchen Cai.
References
License
A repository containing all of the files used to generate this chapter is available on GitHub.
The code in this case study is copyrighted by Michael Betancourt and licensed under the new BSD (3-clause) license:
https://opensource.org/licenses/BSD-3-Clause
The text and figures in this chapter are copyrighted by Michael Betancourt and licensed under the CC BY-NC 4.0 license:
Original Computing Environment
writeLines(readLines(file.path(Sys.getenv("HOME"), ".R/Makevars")))
CC=clang
CXXFLAGS=-O3 -mtune=native -march=native -Wno-unused-variable -Wno-unused-function -Wno-macro-redefined -Wno-unneeded-internal-declaration
CXX=clang++ -arch x86_64 -ftemplate-depth-256
CXX14FLAGS=-O3 -mtune=native -march=native -Wno-unused-variable -Wno-unused-function -Wno-macro-redefined -Wno-unneeded-internal-declaration -Wno-unknown-pragmas
CXX14=clang++ -arch x86_64 -ftemplate-depth-256
sessionInfo()
R version 4.3.2 (2023-10-31)
Platform: x86_64-apple-darwin20 (64-bit)
Running under: macOS Sonoma 14.7.5
Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/4.3-x86_64/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.3-x86_64/Resources/lib/libRlapack.dylib; LAPACK version 3.11.0
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
time zone: America/New_York
tzcode source: internal
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] colormap_0.1.4 rstan_2.32.6 StanHeaders_2.32.7
loaded via a namespace (and not attached):
[1] gtable_0.3.4 jsonlite_1.8.8 compiler_4.3.2 Rcpp_1.0.11
[5] parallel_4.3.2 gridExtra_2.3 scales_1.3.0 yaml_2.3.8
[9] fastmap_1.1.1 ggplot2_3.4.4 R6_2.5.1 curl_5.2.0
[13] knitr_1.45 htmlwidgets_1.6.4 tibble_3.2.1 munsell_0.5.0
[17] pillar_1.9.0 rlang_1.1.2 utf8_1.2.4 V8_4.4.1
[21] inline_0.3.19 xfun_0.41 RcppParallel_5.1.7 cli_3.6.2
[25] magrittr_2.0.3 digest_0.6.33 grid_4.3.2 lifecycle_1.0.4
[29] vctrs_0.6.5 evaluate_0.23 glue_1.6.2 QuickJSR_1.0.8
[33] codetools_0.2-19 stats4_4.3.2 pkgbuild_1.4.3 fansi_1.0.6
[37] colorspace_2.1-0 rmarkdown_2.25 matrixStats_1.2.0 tools_4.3.2
[41] loo_2.6.0 pkgconfig_2.0.3 htmltools_0.5.7