Hidden Markov Models

Author

Michael Betancourt

Published

June 2025

In the mixture modeling chapter we first considered how to model data that could have arisen from one of a number of component data generating processes. There we derived mixture observational models under the assumption that the activation of a component data generating process, and hence the behavior of the component assignment variables, is independent across individual observations.

Mixture observational models have broad applicability in practice, but they are by no means universal. In order to generalize this modeling technique we need to consider joint mixture observational models that couple the component assignment variables together. These generalizations, however, also introduce an immediate practical challenge: how, if at all, can we efficiently marginalize the component assignment variables?

From this perspective hidden Markov models are an exceptional generalization of mixture observational models. On one hand they allow for more sophisticated behavior by allowing the contribution of a component data generating process to persist across multiple observations before another data generating process takes over. On the other hand their mathematical structure is just cooperative enough to allow for practical marginalization.

In this chapter I will introduce hidden Markov models, carefully derive the algorithms needed for efficient marginalization, and demonstrate their robust implementation in Stan.

1 Joint Mixture Observational Models

Consider I component observations, y_{1}, \ldots, y_{i}, \ldots, y_{I}, each of which can arise from one of K component observational models, p(y_{i} \mid i, k, \theta) = p_{i, k}( y_{i} \mid \theta).

Note that in general the y_{i}, and hence the precise form of the p(y_{i} \mid i, k, \theta), do not have to be homogeneous. For example some y_{i} might be discrete while others are continuous. Moreover the dimension of each y_{i} might vary, with some observations featuring more sub-components than others.

To quantify which component observational model is responsible for each component observation we introduce I categorical assignment variables, z_{i} \in \{ 1, \ldots, K \}. These assignment variables are also known as states, as in the state of the overall data generating process for each observation.

When the states z_{i} are unknown, equivalently latent or even hidden, we need to incorporate them into the observational model so that we can infer their behavior. This requires a joint model of the form (Figure 1) \begin{align*} p(z_{1}, \ldots, z_{I}, y_{1}, \ldots, y_{I}, \theta) &= p(y_{1}, \ldots, y_{I} \mid z_{1}, \ldots, z_{I}, \theta) \, p(z_{1}, \ldots, z_{I} \mid \theta) \, p( \theta ) \\ &= \prod_{i = 1}^{I} p_{i, z_{i}}(y_{i} \mid \theta) \, p(z_{1}, \ldots, z_{I} \mid \theta) \, p( \theta ) \\ &= \prod_{i = 1}^{I} p(y_{i} \mid i, z_{i}, \theta) \, p(z_{1}, \ldots, z_{I} \mid \theta) \, p( \theta ). \end{align*}

Figure 1: In general the behavior of the latent assignment variables for each mixture observational model, or latent states, are all coupled together.

To make the equations in this chapter a bit less overwhelming let’s take the opportunity to introduce a more compact notation for sequences of observations and latent states, \begin{align*} y_{i_{1}:i_{2}} &= ( y_{i_{1}}, y_{i_{1} + 1}, \ldots, y_{i_{2} - 1}, y_{i_{2}} ) \\ z_{i_{1}:i_{2}} &= ( z_{i_{1}}, z_{i_{1} + 1}, \ldots, z_{i_{2} - 1}, z_{i_{2}} ). \end{align*} This allows us to write the general joint model as p(z_{1:I}, y_{1:I}, \theta) = \prod_{i = 1}^{I} p(y_{i} \mid i, z_{i}, \theta) \, p(z_{1:I} \mid \theta) \, p( \theta ).

In the mixture modeling chapter we derived mixture observational models by assuming that all of the latent states were independent and identically distributed (Figure 2), p(z_{1:I} \mid \theta) = \prod_{i = 1}^{I} p(z_{i} \mid \theta); I will refer to this as a basic mixture observational model. With this assumption we were able to marginalize out each discrete latent state individually. This, in turn, allowed us to use high performance computational tools like Hamiltonian Monte Carlo.

Figure 2: Basic mixture observational models assume that the latent states are independent.

An arbitrary joint mixture observational model, however, features K^{I} distinct configurations of the latent states. Consequently direct summation of the latent states, p(y_{1:I} \mid \theta) = \sum_{z_{1}} \cdots \sum_{z_{I}} p(z_{1:I}, y_{1:I} \mid \theta), is typically infeasible. There is a key exception, however, where the coupling between the latent states allows for efficient marginalization while still being rich enough to be useful for practical applications.
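To get a feel for the combinatorics, here is what direct summation looks like in a pure-Python sketch for a tiny hypothetical example with K = 2 and I = 3; all of the numerical inputs, including the density evaluations stored in `omega`, are invented for illustration. The loop visits every one of the K^I = 8 latent state configurations, and that count grows exponentially with I.

```python
import itertools

# Hypothetical inputs: K = 2 latent states, I = 3 observations.
rho = [0.6, 0.4]                              # rho[k]      = p(z_1 = k)
Gamma = [[0.7, 0.3],                          # Gamma[j][k] = p(z_i = k | z_{i-1} = j)
         [0.2, 0.8]]
omega = [[0.9, 0.2],                          # omega[i][k] = p(y_i | i, z_i = k, theta),
         [0.5, 0.5],                          # already evaluated at the observed y_i
         [0.1, 0.7]]

I, K = len(omega), len(rho)

# Direct marginalization: sum the joint density over all K**I latent
# state configurations.
marginal = 0.0
for z in itertools.product(range(K), repeat=I):
    p = rho[z[0]] * omega[0][z[0]]
    for i in range(1, I):
        p *= Gamma[z[i - 1]][z[i]] * omega[i][z[i]]
    marginal += p

print(marginal)
```

Even at K = 10 and I = 100 this enumeration would require 10^100 terms, which is why we need the structured alternative developed below.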

To motivate this coupling let’s first assume that the latent states are ordered so that we have a well-defined notion of neighboring states. For example the observations might be indexed in the temporal order in which they were made, organized along a one-dimensional spatial direction, and the like.

Next we will assume that each latent state is conditionally dependent on only its preceding neighbor (Figure 3), p(z_{1:I} \mid \theta) = p(z_{1}) \, \prod_{i = 2}^{I} p(z_{i} \mid z_{i - 1}, \theta). This particular dependency structure arises so often in both theoretical and applied probability theory that it has been given its own adjective: Markovian.

Figure 3: In a Markov model for the latent states the behavior of each latent state depends on only the behavior of its preceding neighbor.

Under this Markovian assumption the joint model becomes p(z_{1:I}, y_{1:I}, \theta) = p(z_{1:I}, y_{1:I} \mid \theta) \, p( \theta ) where \begin{align*} p(z_{1:I}, y_{1:I} \mid \theta) &= \prod_{i = 1}^{I} p(y_{i} \mid i, z_{i}, \theta) \, p(z_{1:I} \mid \theta) \\ &= \left[ \prod_{i = 1}^{I} p(y_{i} \mid i, z_{i}, \theta) \right] \left[ p(z_{1}) \, \prod_{i = 2}^{I} p(z_{i} \mid z_{i - 1}, \theta) \right] \\ &= p(z_{1}) \, p(y_{1} \mid 1, z_{1}, \theta) \prod_{i = 2}^{I} p(z_{i} \mid z_{i - 1}, \theta) \, p(y_{i} \mid i, z_{i}, \theta). \end{align*}

Because it assumes “Markovian” dependencies for a sequence of “hidden” states this model is known as a hidden Markov model. That said, it is not unreasonable to refer to it as a Markov mixture observational model if we want to emphasize the assumed structure of the observations as well as that of the latent states.

In the context of a hidden Markov model the conditional probabilities p(z_{i} \mid z_{i - 1}, \theta) are referred to as transition probabilities. A hidden Markov model with the same transition probabilities between all pairs of states, so that p(z_{i} = k' \mid z_{i - 1} = k, \theta) depends on only k and k' but not i, is known as a homogeneous hidden Markov model. Heterogeneous, non-homogeneous, and general hidden Markov models are all terms used for emphasizing varying transition probabilities.
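In code, a homogeneous hidden Markov model needs only a single K × K transition matrix. With the indexing convention used in this chapter its rows are indexed by the previous state, so each row must be a simplex: the probabilities of transitioning out of any given state have to sum to one. A minimal sketch with made-up numbers:

```python
# A hypothetical homogeneous transition matrix for K = 3 states, with
# the convention Gamma[j][k] = p(z_i = k | z_{i-1} = j) for every i.
Gamma = [[0.80, 0.15, 0.05],
         [0.10, 0.70, 0.20],
         [0.25, 0.25, 0.50]]

# Each row is a simplex: transition probabilities out of a state sum to one.
row_sums = [sum(row) for row in Gamma]
print(row_sums)
```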

2 The Marginal Hidden Markov Model

In theory the joint observational model p(z_{1:I}, y_{1:I} \mid \theta) is sufficient to infer which latent states z_{1:I} and model configurations \theta are consistent with any observed data. The discreteness of the latent states, however, frustrates practical implementation of those inferences.

Fortunately the Markovian structure of hidden Markov models allows us to efficiently marginalize out the discrete latent states. That said actually deriving an explicit marginalization algorithm from that structure is a pretty onerous task.

In this section I will motivate and then carefully derive the strategy for efficiently marginalizing the discrete hidden states of a hidden Markov model. The derivation requires some pretty brutal calculations and readers interested in only the final result can jump straight to Section 2.3 without missing anything.

2.1 Looking For A Pattern

First and foremost let’s motivate the basic approach to efficient marginalization by carefully working out the direct summations for I = 1, I = 2, and then I = 3 and then looking for any useful patterns.

When I = 1 we have p(z_{1}, y_{1} \mid \theta) = p(z_{1}) \, p(y_{1} \mid 1, z_{1}, \theta). Summing over the initial latent state gives \begin{align*} p(y_{1} \mid \theta) &= \sum_{z_{1}} p(z_{1}, y_{1} \mid \theta) \\ &= \sum_{z_{1}} p(y_{1} \mid 1, z_{1}, \theta) \, p(z_{1}). \end{align*} Note the similarity to the marginal form of a basic mixture observational model, p(y \mid \theta) = \sum_{z} p( y \mid z, \theta) \, p(z).

For I = 2 we have p(z_{1}, z_{2}, y_{1}, y_{2} \mid \theta) = p(z_{1}) \, p(y_{1} \mid 1, z_{1}, \theta) \, p(z_{2} \mid z_{1}, \theta) \, p(y_{2} \mid 2, z_{2}, \theta), or, equivalently, p(z_{1}, z_{2}, y_{1}, y_{2} \mid \theta) = p(z_{1}, y_{1} \mid \theta) \, p(z_{2} \mid z_{1}, \theta) \, p(y_{2} \mid 2, z_{2}, \theta). Summing over the initial latent state gives \begin{align*} p(z_{2}, y_{1}, y_{2} \mid \theta) &= \sum_{z_{1}} p(z_{1}, z_{2}, y_{1}, y_{2} \mid \theta) \\ &= \sum_{z_{1}} p(z_{1}, y_{1} \mid \theta) \, p(z_{2} \mid z_{1}, \theta) \, p(y_{2} \mid 2, z_{2}, \theta) \\ &= p(y_{2} \mid 2, z_{2}, \theta) \sum_{z_{1}} p(z_{1}, y_{1} \mid \theta) \, p(z_{2} \mid z_{1}, \theta). \end{align*}

We can then marginalize out both of the latent states by summing this result over z_{2}, \begin{align*} p(y_{1}, y_{2} \mid \theta) &= \sum_{z_{2}} p(z_{2}, y_{1}, y_{2} \mid \theta) \\ &= \sum_{z_{2}} p(y_{2} \mid 2, z_{2}, \theta) \bigg[ \sum_{z_{1}} p(z_{1}, y_{1} \mid \theta) \, p(z_{2} \mid z_{1}, \theta) \bigg] \\ &= \sum_{z_{2}} p(y_{2} \mid 2, z_{2}, \theta) \, p(z_{2} \mid y_{1}, \theta). \end{align*}

Interestingly, this once again has the same form as a basic mixture observational model. Now, however, the component probabilities are informed by the previous observation, p(z_{2} \mid y_{1}, \theta) = \sum_{z_{1}} p(z_{1}, y_{1} \mid \theta) \, p(z_{2} \mid z_{1}, \theta).

Finally let’s consider I = 3 where the joint model is given by \begin{align*} p(z_{1:3}, y_{1:3} \mid \theta) &= p(z_{1}) \, p(y_{1} \mid 1, z_{1}, \theta) \, \prod_{i = 2}^{3} p(z_{i} \mid z_{i - 1}, \theta) \, p(y_{i} \mid i, z_{i}, \theta) \\ &= p(z_{1}, y_{1} \mid \theta) \, \prod_{i = 2}^{3} p(z_{i} \mid z_{i - 1}, \theta) \, p(y_{i} \mid i, z_{i}, \theta). \end{align*}

Here summing over the initial latent state gives \begin{align*} p(z_{2:3}, y_{1:3} \mid \theta) &= \sum_{z_{1}} p(z_{1:3}, y_{1:3} \mid \theta) \\ &= \sum_{z_{1}} p(z_{1}, y_{1} \mid \theta) \, \prod_{i = 2}^{3} p(z_{i} \mid z_{i - 1}, \theta) \, p(y_{i} \mid i, z_{i}, \theta) \\ &= \;\, \bigg[ p(y_{3} \mid 3, z_{3}, \theta) \, p(z_{3} \mid z_{2}, \theta) \bigg] \\ &\quad \cdot \bigg[ p(y_{2} \mid 2, z_{2}, \theta) \, \sum_{z_{1}} p(z_{1}, y_{1} \mid \theta) \, p(z_{2} \mid z_{1}, \theta) \bigg]. \end{align*}

The second term might look familiar; it’s exactly p(z_{2}, y_{1}, y_{2} \mid \theta) which we just derived for the I = 2 case! Consequently we can write \begin{align*} p(z_{2:3}, y_{1:3} \mid \theta) &= \; \bigg[ p(y_{3} \mid 3, z_{3}, \theta) \, p(z_{3} \mid z_{2}, \theta) \bigg] \\ &\quad \cdot \bigg[ p(y_{2} \mid 2, z_{2}, \theta) \, \sum_{z_{1}} p(z_{1}, y_{1} \mid \theta) \, p(z_{2} \mid z_{1}, \theta) \bigg] \\ &= \; \bigg[ p(y_{3} \mid 3, z_{3}, \theta) \, p(z_{3} \mid z_{2}, \theta) \bigg] \\ &\quad \cdot \bigg[ p(z_{2}, y_{1}, y_{2} \mid \theta) \bigg] \\ &= p(y_{3} \mid 3, z_{3}, \theta) \, p(z_{2}, y_{1}, y_{2} \mid \theta) \, p(z_{3} \mid z_{2}, \theta). \end{align*}

Now we can sum over z_{2} to give \begin{align*} p(z_{3}, y_{1:3} \mid \theta) &= \sum_{z_{2}} p(z_{2:3}, y_{1:3} \mid \theta) \\ &= \sum_{z_{2}} p(y_{3} \mid 3, z_{3}, \theta) \, p(z_{2}, y_{1}, y_{2} \mid \theta) \, p(z_{3} \mid z_{2}, \theta) \\ &= p(y_{3} \mid 3, z_{3}, \theta) \, \sum_{z_{2}} p(z_{2}, y_{1}, y_{2} \mid \theta) \, p(z_{3} \mid z_{2}, \theta). \end{align*} Notice how we derive p(z_{3}, y_{1:3} \mid \theta) from p(z_{2}, y_{1}, y_{2} \mid \theta) in exactly the same way that we derived p(z_{2}, y_{1}, y_{2} \mid \theta) from p(z_{1}, y_{1} \mid \theta).

Lastly we sum over z_{3} to give the full marginal probability density function, \begin{align*} p(y_{1:3} \mid \theta) &= \sum_{z_{3}} p(z_{3}, y_{1:3} \mid \theta) \\ &= \sum_{z_{3}} p(y_{3} \mid 3, z_{3}, \theta) \, \bigg[ \sum_{z_{2}} p(z_{2}, y_{1}, y_{2} \mid \theta) \, p(z_{3} \mid z_{2}, \theta) \bigg] \\ &= \sum_{z_{3}} p(y_{3} \mid 3, z_{3}, \theta) \, p(z_{3} \mid y_{1:2}, \theta ). \end{align*}

The final marginal model continues to mirror a basic mixture observational model, in this case with derived component probabilities p(z_{3} \mid y_{1:2}, \theta ) = \sum_{z_{2}} p(z_{2}, y_{1}, y_{2} \mid \theta) \, p(z_{3} \mid z_{2}, \theta).

Let’s summarize what we have seen in these first few cases.

Instead of deriving p(y_{1:3} \mid \theta) by summing p(z_{1:3}, y_{1:3} \mid \theta) over z_{1}, z_{2}, and z_{3} at the same time, we can evaluate it by emulating a basic mixture observational model and summing p(z_{3}, y_{1:3} \mid \theta) over just z_{3}.

Conveniently p(z_{3}, y_{1:3} \mid \theta) can be derived directly from p(z_{2}, y_{1:2} \mid \theta), and that term can be derived from p(z_{1}, y_{1} \mid \theta) which in turn can be derived from the initial state probabilities p(z_{1}), \begin{align*} p(z_{1}, y_{1} \;\; \mid \theta) &= p(y_{1} \mid 1, z_{1}, \theta) \, p(z_{1}), \\ p(z_{2}, y_{1:2} \mid \theta) &= p(y_{2} \mid 2, z_{2}, \theta) \sum_{z_{1}} p(z_{1}, y_{1} \mid \theta) \, p(z_{2} \mid z_{1}, \theta), \\ p(z_{3}, y_{1:3} \mid \theta) &= p(y_{3} \mid 3, z_{3}, \theta) \, \sum_{z_{2}} p(z_{2}, y_{1:2} \mid \theta) \, p(z_{3} \mid z_{2}, \theta). \end{align*}
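As a sanity check, the three recursive updates above can be evaluated numerically and compared against the brute-force sum over all K^3 latent state configurations. All numbers here are hypothetical placeholders; in particular `omega[i][k]` stands in for the density evaluation p(y_i \mid i, z_i = k, \theta) at some fixed observed values.

```python
import itertools

# Hypothetical K = 2, I = 3 inputs.
rho = [0.6, 0.4]                              # rho[k]      = p(z_1 = k)
Gamma = [[0.7, 0.3], [0.2, 0.8]]              # Gamma[j][k] = p(z_i = k | z_{i-1} = j)
omega = [[0.9, 0.2], [0.5, 0.5], [0.1, 0.7]]  # omega[i][k] = p(y_i | i, z_i = k, theta)

K = 2

# The three recursive updates written out explicitly.
# p(z_1, y_1 | theta)
a1 = [omega[0][k] * rho[k] for k in range(K)]
# p(z_2, y_{1:2} | theta)
a2 = [omega[1][k] * sum(a1[j] * Gamma[j][k] for j in range(K)) for k in range(K)]
# p(z_3, y_{1:3} | theta)
a3 = [omega[2][k] * sum(a2[j] * Gamma[j][k] for j in range(K)) for k in range(K)]

# Summing out the last state gives p(y_{1:3} | theta).
recursive = sum(a3)

# Brute-force sum over all 2**3 latent state configurations.
brute = sum(
    rho[z1] * omega[0][z1]
    * Gamma[z1][z2] * omega[1][z2]
    * Gamma[z2][z3] * omega[2][z3]
    for z1, z2, z3 in itertools.product(range(K), repeat=3)
)

print(recursive, brute)
```

The two evaluations agree to floating point precision, exactly as the recursion promises.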

At this point the important question is whether or not this recursive pattern over the first three cases holds more generally, with p(z_{i}, y_{1:i} \mid \theta ) = p(y_{i} \mid i, z_{i}, \theta) \, \bigg[ \sum_{z_{i - 1}} p(z_{i - 1}, y_{1:(i - 1)} \mid \theta) \, p(z_{i} \mid z_{i - 1}, \theta) \bigg] for any 1 < i \le I. If it does then starting with the initial state probabilities p(z_{1}) we can iteratively compute p(z_{I}, y_{1:I} \mid \theta) and then sum over z_{I} to give the marginal model p(y_{1:I} \mid \theta).

2.2 Verifying The Pattern

While not exactly straightforward, with enough mathematical effort we can derive a general relationship between p(z_{i}, y_{1:i} \mid \theta) and p(z_{i - 1}, y_{1:(i - 1)} \mid \theta) using conditional probability theory and the conditional structure of the joint hidden Markov model. The exact properties we need are pretty onerous to work out, and I will not attempt to do so in this chapter. Instead I will shamelessly defer to Bishop (2006) which reviews the necessary properties and strategies for deriving them.

Ultimately we want to reduce p(z_{i}, y_{1:i} \mid \theta) to an object that depends on only the first i - 1 latent states and observations. To that end let’s peel off the terms that depend on z_{i} and y_{i}, \begin{align*} p(z_{i}, y_{1:i} \mid \theta) &= p(z_{i}, y_{i}, y_{1:(i - 1)} \mid \theta) \\ &= p(y_{1:(i - 1)} \mid z_{i}, y_{i}, \theta) \, p(z_{i}, y_{i} \mid \theta) \\ &= p(y_{1:(i - 1)} \mid z_{i}, y_{i}, \theta) \, p(y_{i} \mid i, z_{i}, \theta) \, p(z_{i} \mid \theta). \end{align*}

Because of the Markovian structure of the hidden Markov model the first term can be written as p(y_{1:(i - 1)} \mid z_{i}, y_{i}, \theta) = p(y_{1:(i - 1)} \mid z_{i}, \theta). Consequently we have \begin{align*} p(z_{i}, y_{1:i} \mid \theta) &= p(y_{1:(i - 1)} \mid z_{i}, y_{i}, \theta) \, p(y_{i} \mid i, z_{i}, \theta) \, p(z_{i} \mid \theta) \\ &= p(y_{1:(i - 1)} \mid z_{i}, \theta) \, p(y_{i} \mid i, z_{i}, \theta) \, p(z_{i} \mid \theta) \\ &= p(y_{i} \mid i, z_{i}, \theta) \bigg[ p(y_{1:(i - 1)} \mid z_{i}, \theta) \, p(z_{i} \mid \theta) \bigg] \\ &= p(y_{i} \mid i, z_{i}, \theta) \, p(z_{i}, y_{1:(i - 1)} \mid \theta). \end{align*}

The second term here is a bit awkward, but it can be simplified by writing it as a marginal probability density function, \begin{align*} p(z_{i}, y_{1:(i - 1)} \mid \theta) &= \sum_{z_{i - 1}} p(z_{i - 1}, z_{i}, y_{1:(i - 1)} \mid \theta) \\ &= \sum_{z_{i - 1}} p(z_{i}, y_{1:(i - 1)} \mid z_{i - 1}, \theta) \, p(z_{i - 1} \mid \theta). \end{align*} We can further simplify the first term in the sum to p(z_{i}, y_{1:(i - 1)} \mid z_{i - 1}, \theta) = p(z_{i} \mid z_{i - 1}, \theta) \, p(y_{1:(i - 1)} \mid z_{i - 1}, \theta); intuitively conditioning on z_{i - 1} blocks any coupling between the following latent state z_{i} and the previous observations y_{1:(i - 1)}.

This allows us to write \begin{align*} p(z_{i}, y_{1:(i - 1)} \mid \theta) &= \sum_{z_{i - 1}} p(z_{i}, y_{1:(i - 1)} \mid z_{i - 1}, \theta) \, p(z_{i - 1} \mid \theta) \\ &= \sum_{z_{i - 1}} p(z_{i} \mid z_{i - 1}, \theta) \, p(y_{1:(i - 1)} \mid z_{i - 1}, \theta) \, p(z_{i - 1} \mid \theta) \\ &= \sum_{z_{i - 1}} p(z_{i} \mid z_{i - 1}, \theta) \, \bigg[ p(y_{1:(i - 1)} \mid z_{i - 1}, \theta) \, p(z_{i - 1} \mid \theta) \bigg] \\ &= \sum_{z_{i - 1}} p(z_{i} \mid z_{i - 1}, \theta) \, p(z_{i - 1}, y_{1:(i - 1)} \mid \theta). \end{align*}

Substituting this back into the original equation for p(z_{i}, y_{1:i} \mid \theta) finally gives \begin{align*} p(z_{i}, y_{1:i} \mid \theta) &= p(y_{i} \mid i, z_{i}, \theta) \, p(z_{i}, y_{1:(i - 1)} \mid \theta) \\ &= p(y_{i} \mid i, z_{i}, \theta) \, \sum_{z_{i - 1}} p(z_{i} \mid z_{i - 1}, \theta) \, p(z_{i - 1}, y_{1:(i - 1)} \mid \theta), \end{align*} which is exactly the recursion equation that generalizes the pattern we found in the Section 2.1.

2.3 The Forward Algorithm

Implementing this recursive calculation given observations \tilde{y}_{1}, \ldots, \tilde{y}_{i}, \ldots, \tilde{y}_{I} is made a bit more straightforward with the introduction of some more compact notation.

First let’s denote the initial state probabilities, observational model outputs, and transition probabilities as \begin{align*} \rho_{k} &= p( z_{1} = k ) \\ \omega_{i, k} &= p( \tilde{y}_{i} \mid i, z_{i} = k, \theta) \\ \Gamma_{i, k'k} &= p(z_{i} = k \mid z_{i - 1} = k', \theta). \end{align*} Then we can define \alpha_{i, k} = p(z_{i} = k, \tilde{y}_{1:i} \mid \theta), with the initialization \alpha_{1, k} = \omega_{1, k} \, \rho_{k} and the recursion \begin{align*} \alpha_{i, k} &= p(\tilde{y}_{i} \mid i, z_{i} = k, \theta) \sum_{k' = 1}^{K} p(z_{i} = k \mid z_{i - 1} = k', \theta) \, \alpha_{i - 1, k'} \\ &= \omega_{i, k} \sum_{k' = 1}^{K} \Gamma_{i, k'k} \, \alpha_{i - 1, k'}. \end{align*}

Conveniently we can also write this as a matrix equation, \boldsymbol{\alpha}_{i} = \boldsymbol{\omega}_{i} \circ \left( \boldsymbol{\Gamma}_{i}^{T} \cdot \boldsymbol{\alpha}_{i - 1} \right), where \circ denotes the Hadamard, or element-wise, product. Alternatively we can use only matrix-vector products, \boldsymbol{\alpha}_{i} = \mathrm{diag}(\boldsymbol{\omega}_{i}) \cdot \left( \boldsymbol{\Gamma}_{i}^{T} \cdot \boldsymbol{\alpha}_{i - 1} \right), although we have to be careful to not actually evaluate \mathrm{diag}(\boldsymbol{\omega}_{i}) as a dense matrix and waste precious computational resources.

In this matrix notation the evaluation of the marginal observational model is given by \begin{align*} \boldsymbol{\alpha}_{1} &= \boldsymbol{\omega}_{1} \circ \boldsymbol{\rho} \\ & \ldots \\ \boldsymbol{\alpha}_{i} &= \boldsymbol{\omega}_{i} \circ \left( \boldsymbol{\Gamma}_{i}^{T} \cdot \boldsymbol{\alpha}_{i - 1} \right) \\ & \ldots \\ p(\tilde{y}_{1:I} \mid \theta) &= \mathbf{1}^{T} \cdot \boldsymbol{\alpha}_{I}. \end{align*} This procedure is also known as the forward algorithm as it iterates through the latent states in sequential order.
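To make the iteration concrete, here is a minimal pure-Python sketch of the forward algorithm. All of the numerical inputs are hypothetical; `omega[i][k]` again stands in for the density evaluation p(\tilde{y}_{i} \mid i, z_{i} = k, \theta).

```python
# A minimal sketch of the forward algorithm for a possibly
# heterogeneous hidden Markov model.
def forward(rho, Gammas, omega):
    """Return all alpha vectors for the forward algorithm.

    rho:    initial state probabilities, rho[k] = p(z_1 = k).
    Gammas: one transition matrix per transition, so
            len(Gammas) = len(omega) - 1, with
            Gammas[i][j][k] = p(next state = k | current state = j).
    omega:  omega[i][k] = p(y_i | i, z_i = k, theta).
    """
    K = len(rho)
    # Initialization: alpha_1 = omega_1 o rho.
    alphas = [[omega[0][k] * rho[k] for k in range(K)]]
    # Recursion: alpha_i = omega_i o (Gamma_i^T . alpha_{i-1}).
    for i in range(1, len(omega)):
        prev = alphas[-1]
        alphas.append([omega[i][k]
                       * sum(Gammas[i - 1][j][k] * prev[j] for j in range(K))
                       for k in range(K)])
    return alphas

# Hypothetical K = 2, I = 3 example, homogeneous transitions.
rho = [0.6, 0.4]
Gamma = [[0.7, 0.3], [0.2, 0.8]]
omega = [[0.9, 0.2], [0.5, 0.5], [0.1, 0.7]]

alphas = forward(rho, [Gamma, Gamma], omega)
marginal = sum(alphas[-1])   # p(y_{1:I} | theta) = 1^T . alpha_I
print(marginal)
```

Note that the element-wise products are written out directly, so nothing like a dense \mathrm{diag}(\boldsymbol{\omega}_{i}) matrix is ever materialized.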

The cost of each iteration of the forward algorithm is dominated by a matrix-vector product which requires \mathcal{O}(K^{2}) operations. Consequently the cost of the entire forward algorithm scales as \mathcal{O}(I \, K^{2}). For all but the smallest values of K and I this is drastically faster than the \mathcal{O}(K^{I}) scaling that would be required of a brute force summation.

That said for sufficiently large K and/or I the cost of implementing the forward algorithm can still be prohibitive. In particular increasing the number of component states K increases the cost much faster than increasing the number of observations I.

2.4 Alternative Conventions

In the previous derivation of the forward algorithm I have made a few choices of notation that are not universal, and hence might clash with other references.

For example many references like Bishop (2006) consider an initial latent state z_{0} that is not paired with a corresponding observation (Figure 4). One awkward consequence of this convention is that the number of latent states differs from the number of observations, which can be a headache to manage in practical implementations. Indexing is already hard enough!

Figure 4: One common convention for hidden Markov models assumes that the initial latent state is not complemented with any observations.

The convention used in this chapter is more general in the sense that the alternative convention can be recovered by setting \omega_{1, k} = p( \tilde{y}_{1} \mid 1, z_{1} = k, \theta) = 1 for all k.

Another potential divergence in notation concerns the transition matrix \boldsymbol{\Gamma}. Some references like Bishop (2006) define the elements of the transition matrix not as \Gamma_{i, k'k} = p(z_{i} = k \mid z_{i - 1} = k', \theta) but rather the transpose \Gamma_{i, kk'} = p(z_{i} = k \mid z_{i - 1} = k', \theta). Neither convention has any substantial practical advantage over the other; we just have to make sure that we’re using one convention consistently.

3 Inferring Latent State Behavior

After marginalization we can still recover inferences for the latent states with an application of Bayes’ Theorem. Specifically the joint posterior density function for all of the latent states is given, up to normalization, by partially evaluating the joint model, p( z_{1:I} \mid \tilde{y}_{1:I}, \theta) \propto p( z_{1:I}, \tilde{y}_{1:I} \mid \theta). When I is large, however, this joint posterior density function will be much too ungainly to manipulate or communicate directly.

If we are interested in the behavior of the latent states then we will need to be able to isolate meaningful, low-dimensional summaries of these inferences.

As in Section 2 we will make heavy use of some of the properties of the hidden Markov model detailed in Bishop (2006). Moreover, readers interested in only the final results can safely jump directly to the end of each section.

3.1 Marginal Posterior Probabilities For Individual Latent States

In some applications the marginal posterior probabilities for a single latent state, p( z_{i} \mid y_{1:I}, \theta), are useful. We can derive these marginal posterior probabilities with an application of Bayes’ Theorem, p( z_{i} \mid y_{1:I}, \theta) \propto p( y_{1:I} \mid z_{i}, \theta) \, p(z_{i} \mid \theta).

Fortunately deriving the terms on the right hand side of this equation is relatively straightforward if we take advantage of the structure of the joint model. For example the first term simplifies to p( y_{1:I} \mid z_{i}, \theta) = p( y_{1:i} \mid z_{i}, \theta) \, p( y_{(i + 1):I} \mid z_{i}, \theta). In words conditioning on z_{i} breaks any coupling between the preceding observations and the following observations.

Substituting this result gives \begin{align*} p( z_{i} \mid y_{1:I}, \theta) &\propto p( y_{1:I} \mid z_{i}, \theta) \, p(z_{i} \mid \theta) \\ &\propto p( y_{1:i} \mid z_{i}, \theta) \, p( y_{(i + 1):I} \mid z_{i}, \theta) \, p(z_{i} \mid \theta) \\ &\propto \bigg[ p( y_{1:i} \mid z_{i}, \theta) \, p(z_{i} \mid \theta) \bigg] \, p( y_{(i + 1):I} \mid z_{i}, \theta) \\ &\propto p( z_{i}, y_{1:i} \mid \theta) \, p( y_{(i + 1):I} \mid z_{i}, \theta). \end{align*}

As we saw in Section 2, when we run the forward algorithm for a given observation we compute p( z_{i}, \tilde{y}_{1:i} \mid \theta) for each i. Consequently in order to compute the marginal posterior probabilities for each latent state we just need to save these intermediate values and then compute p( \tilde{y}_{(i + 1):I} \mid z_{i}, \theta ).

Conveniently this term can be evaluated with a recursion of its own. To parallel the notation of the forward algorithm let’s denote \beta_{i, z_{i}} = p (\tilde{y}_{(i + 1):I} \mid z_{i}, \theta ). In order to compute a particular \beta_{i, z_{i}} we need to introduce the latent state z_{i + 1} and then decompose, \begin{align*} \beta_{i, z_{i}} &= p (\tilde{y}_{(i + 1):I} \mid z_{i}, \theta) \\ &= \sum_{z_{i + 1}} p (\tilde{y}_{(i + 1):I}, z_{i + 1} \mid z_{i}, \theta ) \\ &= \sum_{z_{i + 1}} p (\tilde{y}_{(i + 1):I} \mid z_{i}, z_{i + 1}, \theta ) \, p(z_{i + 1} \mid z_{i}, \theta). \end{align*}

Now the first term in each summand simplifies due to, you guessed it, the structure of the joint model, p (y_{(i + 1):I} \mid z_{i}, z_{i + 1}, \theta ) = p (y_{(i + 1):I} \mid z_{i + 1}, \theta ). Once we know z_{i + 1} the following observations y_{(i + 1):I} do not depend on any of the previous latent states.

Substituting this simplification gives \begin{align*} \beta_{i, z_{i}} &= \sum_{z_{i + 1}} p (\tilde{y}_{(i + 1):I} \mid z_{i}, z_{i + 1}, \theta ) \, p(z_{i + 1} \mid z_{i}, \theta) \\ &= \sum_{z_{i + 1}} p (\tilde{y}_{(i + 1):I} \mid z_{i + 1}, \theta) \, p(z_{i + 1} \mid z_{i}, \theta) \\ &= \sum_{z_{i + 1}} p(\tilde{y}_{i + 1} \mid i + 1, z_{i + 1}, \theta) \, p(\tilde{y}_{(i + 2):I} \mid z_{i + 1}, \theta) \, p(z_{i + 1} \mid z_{i}, \theta). \end{align*}

That middle term, however, is just the \beta for the subsequent latent state, p (\tilde{y}_{(i + 2):I} \mid z_{i + 1}, \theta) = \beta_{i + 1, z_{i + 1}}!

Consequently we have a backwards recursion formula for the \beta_{i, z_{i}}, \begin{align*} \beta_{i, z_{i}} &= \sum_{z_{i + 1}} p(\tilde{y}_{i + 1} \mid i + 1, z_{i + 1}, \theta) \, p(\tilde{y}_{(i + 2):I} \mid z_{i + 1}, \theta) \, p(z_{i + 1} \mid z_{i}, \theta) \\ &= \sum_{z_{i + 1}} p(\tilde{y}_{i + 1} \mid i + 1, z_{i + 1}, \theta ) \, \beta_{i + 1, z_{i + 1}} \, p(z_{i + 1} \mid z_{i}, \theta) \\ &= \sum_{z_{i + 1}} \omega_{i + 1, z_{i + 1} } \, \beta_{i + 1, z_{i + 1}} \, \Gamma_{i + 1, z_{i} z_{i + 1}}, \end{align*} which we can also write in matrix notation, \boldsymbol{\beta}_{i} = \boldsymbol{\Gamma}_{i + 1} \cdot ( \boldsymbol{\omega}_{i + 1} \circ \boldsymbol{\beta}_{i + 1} ) = \boldsymbol{\Gamma}_{i + 1} \cdot ( \mathrm{diag} ( \boldsymbol{\omega}_{i + 1} ) \cdot \boldsymbol{\beta}_{i + 1} ).

The only remaining issue is how to compute \beta_{I, z_{I}} and initialize the backwards recursion. At the last step we have \begin{align*} p( y_{I} \mid z_{I - 1}, \theta ) &= \sum_{z_{I}} p( y_{I}, z_{I} \mid z_{I - 1}, \theta ) \\ &= \sum_{z_{I}} p( y_{I} \mid z_{I}, z_{I - 1} , \theta) \, p( z_{I} \mid z_{I - 1}, \theta ). \end{align*} Recall, however, that conditioning on z_{I} blocks any coupling between z_{I - 1} and y_{I}. Consequently p( y_{I} \mid z_{I}, z_{I - 1} , \theta) = p( y_{I} \mid I, z_{I}, \theta) and \begin{align*} p( y_{I} \mid z_{I - 1}, \theta ) &= \sum_{z_{I}} p( y_{I} \mid z_{I}, z_{I - 1}, \theta) \, p( z_{I} \mid z_{I - 1}, \theta ) \\ &= \sum_{z_{I}} p( y_{I} \mid I, z_{I}, \theta ) \, p( z_{I} \mid z_{I - 1}, \theta ). \end{align*}

Comparing this to the backwards recursion formula \beta_{I - 1, z_{I - 1}} = \sum_{z_{I}} p(\tilde{y}_{I} \mid I, z_{I}, \theta ) \, \beta_{I, z_{I}} \, p(z_{I} \mid z_{I - 1}, \theta) we see that we will get consistent results if and only if we set \beta_{I, z_{I}} = 1 for all values of z_{I}.

Putting everything together we can evaluate the marginal posterior probabilities for the configuration of any latent state by first scanning forward through the latent states, \begin{align*} \boldsymbol{\alpha}_{1} &= \boldsymbol{\omega}_{1} \circ \boldsymbol{\rho} \\ & \ldots \\ \boldsymbol{\alpha}_{i} &= \boldsymbol{\omega}_{i} \circ \left( \boldsymbol{\Gamma}^{T}_{i} \cdot \boldsymbol{\alpha}_{i - 1} \right) \\ & \ldots, \end{align*} scanning backwards through the latent states, \begin{align*} \boldsymbol{\beta}_{I} &= \mathbf{1} \\ & \ldots \\ \boldsymbol{\beta}_{i - 1} &= \boldsymbol{\Gamma}_{i} \cdot ( \boldsymbol{\omega}_{i} \circ \boldsymbol{\beta}_{i} ), \end{align*} and then finally multiplying and normalizing, p( z_{i} = k \mid \tilde{y}_{1:I}, \theta) = \frac{ \alpha_{i, k} \, \beta_{i, k} } { \sum_{k' = 1}^{K} \alpha_{i, k'} \, \beta_{i, k'} }.
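The full forward-backward pass can be sketched in a few lines of pure Python. As before all numerical inputs are hypothetical, with `omega[i][k]` standing in for the density evaluations p(\tilde{y}_{i} \mid i, z_{i} = k, \theta).

```python
# A sketch of the forward-backward algorithm, returning the marginal
# posterior probabilities p(z_i = k | y_{1:I}, theta) for every i.
def forward_backward(rho, Gammas, omega):
    I, K = len(omega), len(rho)
    # Forward pass: alpha_{i,k} = p(z_i = k, y_{1:i} | theta).
    alphas = [[omega[0][k] * rho[k] for k in range(K)]]
    for i in range(1, I):
        prev = alphas[-1]
        alphas.append([omega[i][k]
                       * sum(Gammas[i - 1][j][k] * prev[j] for j in range(K))
                       for k in range(K)])
    # Backward pass: beta_{i,k} = p(y_{(i+1):I} | z_i = k, theta),
    # initialized with beta_{I,k} = 1 for all k.
    betas = [[1.0] * K]
    for i in range(I - 1, 0, -1):
        nxt = betas[0]
        betas.insert(0, [sum(Gammas[i - 1][j][k] * omega[i][k] * nxt[k]
                             for k in range(K))
                         for j in range(K)])
    # Multiply and normalize each alpha_i o beta_i.
    posts = []
    for a, b in zip(alphas, betas):
        w = [ak * bk for ak, bk in zip(a, b)]
        total = sum(w)
        posts.append([wk / total for wk in w])
    return posts

# Hypothetical K = 2, I = 3 example with homogeneous transitions.
rho = [0.6, 0.4]
Gamma = [[0.7, 0.3], [0.2, 0.8]]
omega = [[0.9, 0.2], [0.5, 0.5], [0.1, 0.7]]

posts = forward_backward(rho, [Gamma, Gamma], omega)
print(posts)
```

Each normalizing constant \sum_{k} \alpha_{i, k} \beta_{i, k} also recovers the marginal p(\tilde{y}_{1:I} \mid \theta), which makes for a convenient internal consistency check.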

This combined procedure is known as the forward-backward algorithm. Comparing operations we can see that the forward-backward algorithm exhibits the same computational scaling as the forward algorithm alone. Once we’ve run the forward algorithm, running the backward algorithm through all of the latent states will only double the computational cost.

3.2 Sampling Hidden States

The main limitation of marginal posterior probabilities for individual latent states is that, by construction, they project out any information about the couplings between different latent states. In particular sampling latent state configurations from individual marginal posterior probabilities will not in general give a realistic trajectory through the latent states. Without realistic latent state trajectories we cannot properly simulate joint observations.

We can always construct more useful posterior summaries that are sensitive to multiple latent states if we can generate exact samples from the joint latent state posterior distribution. In practice this requires an ancestral sampling procedure where we sample one latent state at a time given all of the previously sampled latent states.

Ancestral sampling then raises the question of the sequence in which we should sample the latent states. For example we might naively consider trying to sample the latent states in order, \begin{align*} \tilde{z}_{1} &\sim p( z_{1} \mid \tilde{y}_{1:I}, \theta ) \\ \tilde{z}_{2} &\sim p( z_{2} \mid \tilde{z}_{1}, \tilde{y}_{1:I}, \theta ) \\ &\ldots \\ \tilde{z}_{i} &\sim p( z_{i} \mid \tilde{z}_{i - 1}, \tilde{y}_{1:I}, \theta ). \end{align*} The immediate issue with this approach, however, is that each latent state needs to be informed by not just the preceding observations but all of the observations. In particular we won’t be able to generate proper posterior samples without first accumulating the needed information from subsequent latent states.

This suggests that we’ll need to sweep through the latent states at least twice, once to aggregate information from all of the observations and then a second time to sample latent states. Inspired by the structure of the forward-backward algorithm let’s consider sweeping forward first and then sampling latent states in reverse.

In order to sample posterior latent states in reverse we’ll need to construct the conditional posterior distributions \begin{align*} p( z_{i - 1} \mid z_{i}, y_{1:I}, \theta ) &\propto p( z_{i - 1}, z_{i} \mid y_{1:I}, \theta ) \\ &\propto p( y_{1:I} \mid z_{i - 1}, z_{i}, \theta ) \, p( z_{i - 1}, z_{i} \mid \theta ) \\ &\propto p( y_{1:I} \mid z_{i - 1}, z_{i}, \theta ) \, p( z_{i} \mid z_{i - 1}, \theta ) \, p( z_{i - 1} \mid \theta ). \end{align*}

We can simplify this by exploiting one last property of the joint model, p( y_{1:I} \mid z_{i - 1}, z_{i}, \theta ) = p( y_{1:(i - 1)} \mid z_{i - 1}, \theta ) \, p( y_{i} \mid i, z_{i}, \theta ) \, p( y_{(i + 1):I} \mid z_{i}, \theta ). This allows us to write the reverse conditional posterior probabilities as \begin{align*} p( z_{i - 1} \mid z_{i}, y_{1:I}, \theta ) &\propto \;\, p( y_{1:I} \mid z_{i - 1}, z_{i}, \theta ) \\ &\quad \cdot p( z_{i} \mid z_{i - 1}, \theta ) \, p( z_{i - 1} \mid \theta ) \\ &\propto \;\, p( y_{1:(i - 1)} \mid z_{i - 1}, \theta ) \, p( y_{i} \mid i, z_{i}, \theta ) \\ &\quad \cdot p( y_{(i + 1):I} \mid z_{i}, \theta ) \, p( z_{i} \mid z_{i - 1}, \theta ) \, p( z_{i - 1} \mid \theta ) \\ &\propto \;\, \bigg[ p( y_{1:(i - 1)} \mid z_{i - 1}, \theta ) \, p( z_{i - 1} \mid \theta ) \bigg] \\ &\quad \cdot p( y_{i} \mid i, z_{i}, \theta ) \, p( y_{(i + 1):I} \mid z_{i}, \theta ) \, p( z_{i} \mid z_{i - 1}, \theta ) \\ &\propto \;\, p( y_{1:(i - 1)}, z_{i - 1} \mid \theta ) \, p( y_{i} \mid i, z_{i}, \theta ) \\ &\quad \cdot p( y_{(i + 1):I} \mid z_{i}, \theta ) \, p( z_{i} \mid z_{i - 1}, \theta ). \end{align*}

In terms of the forward-backward algorithm notation this becomes \begin{align*} p( z_{i - 1} \mid z_{i}, \tilde{y}_{1:I}, \theta ) &\propto \;\, p( \tilde{y}_{1:(i - 1)}, z_{i - 1} \mid \theta ) \, p( \tilde{y}_{i} \mid i, z_{i}, \theta ) \, \\ &\quad \cdot p( \tilde{y}_{(i + 1):I} \mid z_{i}, \theta ) \, p( z_{i} \mid z_{i - 1}, \theta ) \\ &\propto \;\, \alpha_{i - 1, z_{i - 1}} \, \omega_{i, z_{i}} \\ &\quad \cdot \beta_{i, z_{i}} \, \Gamma_{i, z_{i} z_{i - 1}}, \end{align*} or in matrix notation, \mathbf{q}_{i - 1} \propto \omega_{i, z_{i}} \, \beta_{i, z_{i}} \, \left( \boldsymbol{\alpha}_{i - 1} \, \circ \boldsymbol{\gamma}_{i, z_{i}} \right), where \boldsymbol{\gamma}_{i, z_{i}} is the z_{i}th column of the transition matrix \Gamma_{i}.

Critically, once we’ve run the forward algorithm to completion and the backward algorithm back to the active latent state, the conditional posterior probabilities p( z_{i - 1} \mid z_{i}, \tilde{y}_{1:I}, \theta ) are straightforward to compute. After completing the forward algorithm we can sample a final latent state, \begin{align*} \mathbf{r} &= \boldsymbol{\alpha}_{I} \\ \boldsymbol{\lambda} &= \frac{ \mathbf{r} }{ \sum_{k} r_{k} } \\ \tilde{z}_{I} &\sim \text{categorical} \left( \boldsymbol{\lambda} \right). \end{align*} At this point we start running the backward algorithm, using each new \boldsymbol{\beta}_{i} and the previously calculated \boldsymbol{\alpha}_{i - 1} to sample a new latent state \tilde{z}_{i - 1} given the previously sampled latent state \tilde{z}_{i}, \begin{align*} \mathbf{r} &= \omega_{i, z_{i}} \, \beta_{i, z_{i}} \, \left( \boldsymbol{\alpha}_{i - 1} \, \circ \boldsymbol{\gamma}_{i, z_{i}} \right) \\ \boldsymbol{\lambda} &= \frac{ \mathbf{r} }{ \sum_{k} r_{k} } \\ \tilde{z}_{i - 1} &\sim \text{categorical} \left( \boldsymbol{\lambda} \right). \end{align*}

Compared to evaluating \boldsymbol{\alpha}_{i - 1} and \boldsymbol{\beta}_{i} the cost of multiplying and normalizing is negligible. Consequently sampling joint latent states introduces very little additional cost if we’re already running the forward-backward algorithm.
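This forward-filter, backward-sample strategy is compact enough to sketch directly. Here is a minimal illustration in Python with NumPy, separate from the Stan implementations used later; the function names are assumptions, states are zero-indexed arrays, and the sketch uses the convention \alpha_{1} = \boldsymbol{\omega}_{1} \circ \boldsymbol{\rho} for the initial state.

```python
import numpy as np

def forward(rho, Gamma, omega):
    # omega[i, k] = p(y_i | i, z_i = k), rho[k] = initial state probability.
    # Returns the forward probabilities alpha[i, k] = p(y_{1:i}, z_i = k).
    I, K = omega.shape
    alpha = np.empty((I, K))
    alpha[0] = omega[0] * rho
    for i in range(1, I):
        alpha[i] = omega[i] * (Gamma.T @ alpha[i - 1])
    return alpha

def backward_sample(alpha, Gamma, rng):
    # Sample a joint latent state sequence from the posterior by sweeping
    # backwards.  The factors omega_{i, z_i} and beta_{i, z_i} are constant
    # in z_{i - 1} and cancel under normalization, leaving
    # q_{i - 1} proportional to alpha_{i - 1} * Gamma[:, z_i].
    I, K = alpha.shape
    z = np.empty(I, dtype=int)
    z[I - 1] = rng.choice(K, p=alpha[I - 1] / alpha[I - 1].sum())
    for i in range(I - 1, 0, -1):
        q = alpha[i - 1] * Gamma[:, z[i]]
        z[i - 1] = rng.choice(K, p=q / q.sum())
    return z
```

Note that because \omega_{i, z_{i}} and \beta_{i, z_{i}} do not depend on z_{i - 1} they drop out of the normalized conditional probabilities, so the sampling step needs only \boldsymbol{\alpha}_{i - 1} and the relevant column of the transition matrix.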

4 Robust Implementations of Hidden Markov Models

As elegant as the forward and forward-backward algorithms are they are unfortunately not quite as straightforward to implement in practice as we might hope. The problem is that the forward and backward recursions are multiplicative; each \boldsymbol{\alpha}_{i} and \boldsymbol{\beta}_{i} is given by the product of many terms and is consequently prone to numerical underflow on computers.

One way to avoid numerical issues is to work on the log scale with the log-sum-exp function. For example instead of computing \alpha_{i, k} = \omega_{i, k} \sum_{k' = 1}^{K} \Gamma_{i, k'k} \, \alpha_{i - 1, k'} we could compute \begin{align*} \log(\alpha_{i, k}) &= \log \left( \omega_{i, k} \sum_{k' = 1}^{K} \Gamma_{i, k'k} \, \alpha_{i - 1, k'} \right) \\ &= \log \left( \omega_{i, k} \right) + \log \left( \sum_{k' = 1}^{K} \Gamma_{i, k'k} \, \alpha_{i - 1, k'} \right) \\ &= \log \left( \omega_{i, k} \right) + \log \left( \sum_{k' = 1}^{K} \exp \bigg( \log(\Gamma_{i, k'k}) + \log(\alpha_{i - 1, k'}) \bigg) \right) \\ &= \;\;\;\, \log \left( \omega_{i, k} \right) \\ &\quad + \text{log-sum-exp} \bigg( \log(\Gamma_{i, 1k}) + \log(\alpha_{i - 1, 1}), \\ & \hphantom{=\text{log-sum-exp} \bigg(} \quad\; \ldots, \\ & \hphantom{=\text{log-sum-exp} \bigg(} \quad\; \log(\Gamma_{i, Kk}) + \log(\alpha_{i - 1, K}) \bigg) \end{align*}

Similarly in the backwards pass we could compute \begin{align*} \log( \beta_{i, k} ) &= \log \left( \sum_{k' = 1}^{K} \omega_{i + 1, k'} \, \beta_{i + 1, k'} \, \Gamma_{i + 1, kk'} \right) \\ &= \log \left( \sum_{k' = 1}^{K} \exp \bigg( \log( \omega_{i + 1, k' } ) + \log( \beta_{i + 1, k'} ) + \log( \Gamma_{i + 1, kk'}) \bigg) \right) \\ &= \text{log-sum-exp} \bigg( \log( \omega_{i + 1, 1 } ) + \log( \beta_{i + 1, 1} ) + \log( \Gamma_{i + 1, k1}), \\ & \hphantom{=\text{log-sum-exp}} \quad\; \ldots, \\ & \hphantom{=\text{log-sum-exp}} \quad\; \log( \omega_{i + 1, K } ) + \log( \beta_{i + 1, K} ) + \log( \Gamma_{i + 1, kK}) \bigg) \end{align*}
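To make the log-scale recursion concrete, here is a minimal sketch of the forward pass in Python with NumPy, independent of the Stan implementations below; the helper and function names are illustrative assumptions.

```python
import numpy as np

def log_sum_exp(a):
    # Numerically stable log(sum(exp(a))) for a one-dimensional array.
    m = np.max(a)
    return m + np.log(np.sum(np.exp(a - m)))

def log_forward(log_rho, log_Gamma, log_omega):
    # log_alpha[i, k] = log p(y_{1:i}, z_i = k), computed entirely on the
    # log scale so that small probabilities never underflow to zero.
    I, K = log_omega.shape
    log_alpha = np.empty((I, K))
    log_alpha[0] = log_omega[0] + log_rho
    for i in range(1, I):
        for k in range(K):
            log_alpha[i, k] = log_omega[i, k] + log_sum_exp(
                log_Gamma[:, k] + log_alpha[i - 1])
    return log_alpha
```

Even when every log observational density is shifted far into the tails, where a direct product of probabilities would underflow to zero, the log-scale recursion stays finite.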

While numerically stable, the log-sum-exp function is unfortunately expensive. Introducing a log-sum-exp function at each iteration of the forward-backward algorithm can substantially increase computational cost.

Another way that we can avoid numerical underflow is to repeatedly rescale the \boldsymbol{\alpha}_{i} and \boldsymbol{\beta}_{i} at each iteration and then correct for the rescaling whenever we use them in a calculation.

Consider, for example, the scaled forward recursion formula \begin{align*} \boldsymbol{a}_{i} &= \boldsymbol{\omega}_{i} \circ \left( \boldsymbol{\Gamma}_{i}^{T} \cdot \boldsymbol{\alpha'}_{i - 1} \right) \\ \nu_{i} &= \max ( \boldsymbol{a}_{i} ) \\ \boldsymbol{\alpha'}_{i} &= \frac{ \boldsymbol{a}_{i} }{ \nu_{i} }. \end{align*}

After i iterations the unscaled \boldsymbol{\alpha}_{i} and scaled \boldsymbol{\alpha'}_{i} will be proportional to each other, \boldsymbol{\alpha}_{i} = \left( \prod_{i' = 1}^{i} \nu_{i'} \right) \, \boldsymbol{\alpha'}_{i}. Consequently we can always recover \boldsymbol{\alpha}_{i} from the more numerically stable \boldsymbol{\alpha'}_{i}.

That said the product \prod_{i' = 1}^{i} \nu_{i'} will itself be prone to numerical underflow. Fortunately we can avoid this by working on the log scale, \boldsymbol{\alpha}_{i} = \exp \left( \sum_{i' = 1}^{i} \log(\nu_{i'}) \right) \, \boldsymbol{\alpha'}_{i}. Note that this requires evaluating only one logarithm function at each forward iteration and a single exponential function after all I iterations, much less expensive than repeated log-sum-exp evaluations.
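A sketch of this scaled forward recursion, again in Python with NumPy under the same illustrative conventions, with the running log normalization tracked alongside the scaled forward probabilities:

```python
import numpy as np

def scaled_forward(rho, Gamma, omega):
    # Forward recursion with dynamic rescaling: each a_i is divided by its
    # largest element nu_i, and the accumulated log(nu_i) let us recover
    # the unscaled alpha_i = exp(log_norms[i]) * alpha_scaled[i] -- and the
    # log marginal likelihood -- without underflow.
    I, K = omega.shape
    alpha_scaled = np.empty((I, K))
    log_norms = np.empty(I)
    running_log_norm = 0.0
    a = omega[0] * rho
    for i in range(I):
        if i > 0:
            a = omega[i] * (Gamma.T @ alpha_scaled[i - 1])
        nu = np.max(a)
        alpha_scaled[i] = a / nu
        running_log_norm += np.log(nu)
        log_norms[i] = running_log_norm
    return alpha_scaled, log_norms
```

The log marginal likelihood is then log_norms[-1] + np.log(alpha_scaled[-1].sum()), requiring only one logarithm per iteration rather than a log-sum-exp per state per iteration.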

Dynamic rescaling is also useful for ensuring numerical stability when computing each \boldsymbol{\beta}_{i} in the backward recursion. Here, however, we don’t even have to keep track of the running normalization so long as we always normalize the marginal and conditional latent probabilities at each iteration.

The log-sum-exp implementation of hidden Markov models is popular in many Stan tutorials, but the dedicated hidden Markov model functions in the Stan modeling language are implemented using dynamic rescaling, which was shown to be just as robust but substantially more efficient. We will implement both approaches in Section 7.

5 Hidden Markov Modeling Techniques

The basic construction and implementation of hidden Markov models is more general than it might first appear. With some care we can incorporate a diversity of interesting structure into hidden Markov models.

5.1 Modeling The Initial State

When marginalizing out the latent states all of the latent state probabilities are derived from the transition matrices except for the probabilities of the initial latent state. These have to be modeled and inferred along with the transition matrices and the configuration of the component observational models.

In some applications we might use hidden Markov models to capture the dynamics of a system evolving from an explicit initialization. Here we can use our domain expertise about that initialization to directly inform an appropriate prior model for the initial state probabilities. Absent any readily available domain expertise a uniform prior model over the simplex of possible initial state probabilities is not unreasonable, especially as a starting default to be improved upon as needed.

For some applications, however, we might be interested in modeling dynamics that have evolved for so long that the initialization is at best ambiguous. In these applications we need the initial state probabilities to model equilibrium behavior. To formally define equilibrium dynamics we’ll need to take a little detour into linear algebra.

The elements of each transition matrix are defined by the individual transition probabilities between each pair of latent states, \Gamma_{i, k'k} = p(z_{i} = k \mid z_{i - 1} = k', \theta). In particular the elements of \boldsymbol{\Gamma}_{i} are all bounded between zero and one, 0 \le \Gamma_{i, k'k} \le 1, and the elements in each row sum to one, \sum_{k = 1}^{K} \Gamma_{i, k'k} = \sum_{k = 1}^{K} p(z_{i} = k \mid z_{i - 1} = k', \theta) = 1. Matrices with these properties are more generally known as row stochastic matrices or right stochastic matrices.

These properties endow the transposes of stochastic matrices with particularly interesting eigenstructure (Papoulis and Pillai (2002)). For example the magnitudes of the eigenvalues of any right stochastic matrix are always bounded by one, and at least one left eigenvector will saturate this bound, \mathbf{v}^{T} \cdot \boldsymbol{S} = 1 \cdot \mathbf{v}^{T} = \mathbf{v}^{T}, or equivalently \boldsymbol{S}^{T} \cdot \mathbf{v} = 1 \cdot \mathbf{v} = \mathbf{v}. Consequently the repeated application of the transpose of a fixed right stochastic matrix to any vector \mathbf{u} will always converge to one of those leading eigenvectors, \lim_{i \rightarrow \infty} \left( \boldsymbol{S}^{T} \right)^{i} \cdot \mathbf{u} = \mathbf{v}.

If the elements of \boldsymbol{S}^m are all non-zero for some finite m \in \mathbb{N} then the leading eigenvector, and hence this limiting behavior, will be unique. Moreover the elements of this unique leading eigenvector will be real-valued and positive so that they can be normalized into probabilities.

All of this linear algebra implies that any latent dynamics modeled with well-behaved, homogeneous transition probabilities will eventually converge to some stable, equilibrium behavior regardless of the precise initial state. The equilibrium probabilities of the latent states are completely determined, up to normalization, by the leading left eigenvector of the transition matrix.

In practice we can model a system already in equilibrium by deriving the initial state probabilities from the eigendecomposition of the transition matrix. If the configuration of the transition matrix is uncertain then so too will be its eigendecomposition, and hence the derived initial state probabilities.

Our ability to implement these calculations in practice, however, will depend on the available linear algebraic tools. For example Stan currently implements eigendecompositions for only general complex matrices and symmetric real-valued matrices, but not stochastic matrices. In theory we can use a general eigensolver, but in practice identifying an eigenvector with exactly unit eigenvalue and removing any complex number artifacts is at best awkward for floating point arithmetic.
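Outside of Stan the calculation is routine with a general eigensolver. A hedged NumPy sketch, assuming a homogeneous transition matrix whose leading eigenvector is unique, which also illustrates the awkwardness of selecting the unit eigenvalue and stripping complex artifacts:

```python
import numpy as np

def equilibrium_probs(Gamma):
    # Leading left eigenvector of a row stochastic matrix, i.e. the
    # eigenvector of Gamma^T with eigenvalue one, normalized into
    # equilibrium probabilities.  Assumes the leading eigenvector is
    # unique, e.g. all elements of some power of Gamma are non-zero.
    eigvals, eigvecs = np.linalg.eig(Gamma.T)
    k = np.argmin(np.abs(eigvals - 1))  # eigenvalue closest to one
    v = np.real(eigvecs[:, k])          # strip complex number artifacts
    return v / v.sum()
```

Normalizing by the sum also fixes the arbitrary sign that a general eigensolver can attach to the eigenvector.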

Finally it’s worth emphasizing that equilibrium is a well-defined concept for only homogeneous transition matrices. If the transition probabilities change between different pairs of neighboring latent states then the realized dynamics will always depend on the behavior of the initial latent states.

5.2 Modeling Multiple Latent Sequences

Some systems are best modeled not with a single sequence of latent states, z_{1}, \ldots, z_{i}, \ldots, z_{I}, but rather multiple sequences of latent states (Figure 5), \begin{align*} z^{1} &= ( z^{1}_{1}, \ldots, z^{1}_{i}, \ldots, z^{1}_{I}) \\ z^{2} &= ( z^{2}_{1}, \ldots, z^{2}_{i}, \ldots, z^{2}_{I}), \end{align*} with coupled dynamics p(z^{m}_{i} \mid z^{1}_{i - 1}, \ldots, z^{M}_{i - 1}).

Figure 5: Hidden Markov models can accommodate multiple sequences of latent states. (a) In general the latent states evolve together but in some applications they might (b) evolve independently of each other.

In theory we could derive generalized forward and forward-backward algorithms for multiple latent states. Alternatively we can combine multiple latent states together into a single joint latent state and use the standard forward and forward-backward algorithms.

One systematic way to combine two finite latent states together requires some more linear algebra, this time in the form of an operation known as a Kronecker product, matrix direct product, or tensor product (Curtis (1993)). The tensor product maps M latent states z^{m}_{i} into a single joint state consisting of all possible M-tuples of the component state elements, z_{i} = z^{1}_{i} \otimes \ldots \otimes z^{M}_{i} = \otimes_{m = 1}^{M} z^{m}_{i}.

For example the tensor product of the two latent states z^{1}_{i} \in (1, 2, 3) and z^{2}_{i} \in (1, 2) is given by the ordered pairs z_{i} = z^{1}_{i} \otimes z^{2}_{i} \in (1, 1), (1, 2), (2, 1), (2, 2), (3, 1), (3, 2). When working with these joint states in practice it is often helpful to relabel them with sequential integers, \begin{align*} (1, 1) &\rightarrow 1 \\ (1, 2) &\rightarrow 2 \\ (2, 1) &\rightarrow 3 \\ (2, 2) &\rightarrow 4 \\ (3, 1) &\rightarrow 5 \\ (3, 2) &\rightarrow 6. \end{align*} In general if each component state features K_{m} elements then the tensor product will feature \prod_{m = 1}^{M} K_{m} total elements.
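This relabeling is easy to automate. A minimal Python sketch, using the standard library's itertools.product to enumerate the tuples with the last component varying fastest, matching the ordering above (the function name is an assumption):

```python
import itertools

def joint_states(Ks):
    # Map every tuple of component states, with component m taking values
    # in 1..Ks[m], to a sequential integer label starting at one.
    states = itertools.product(*[range(1, K + 1) for K in Ks])
    return {state: label for label, state in enumerate(states, start=1)}
```

For the example above, joint_states([3, 2]) maps (1, 1) to 1, (1, 2) to 2, and so on up to (3, 2) mapping to 6.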

Implementing a hidden Markov model with respect to this joint latent state then requires an expanded transition matrix. Sometimes it is more productive to interpret the transition matrix with respect to the joint latent state, \Gamma_{i, k'k} = p(z_{i} = k \mid z_{i - 1} = k', \theta), and sometimes it is more useful to reason about the transition matrix in terms of the component latent states, \begin{align*} &\Gamma_{i, k'_{1} \cdots k'_{M} k_{1} \cdots k_{M}} \\ &\quad\quad = p(z^{1}_{i} = k_{1}, \ldots, z^{M}_{i} = k_{M} \mid z^{1}_{i - 1} = k'_{1}, \ldots, z^{M}_{i - 1} = k'_{M}, \theta). \end{align*}

When the component states evolve independently of each other, \begin{align*} p(z^{1}_{i} = k_{1}, \ldots, z^{M}_{i} = k_{M} &\mid z^{1}_{i - 1} = k'_{1}, \ldots, z^{M}_{i - 1} = k'_{M}, \theta) \\ &= \prod_{m = 1}^{M} p(z^{m}_{i} = k_{m} \mid z^{m}_{i - 1} = k'_{m}, \theta), \end{align*} we can write the joint transition matrix in terms of component transition matrices. More formally if \Gamma^{m}_{i, k'k} = p(z^{m}_{i} = k \mid z^{m}_{i - 1} = k', \theta) then the joint transition matrix for the joint latent states is given by the tensor product \boldsymbol{\Gamma}_{i} = \boldsymbol{\Gamma}^{1}_{i} \otimes \ldots \otimes \boldsymbol{\Gamma}^{M}_{i} = \otimes_{m = 1}^{M} \boldsymbol{\Gamma}^{m}_{i}.

For example consider two component latent states with only two elements each, z^{1}_{i} \in (1, 2) and z^{2}_{i} \in (1, 2). Their tensor product is given by four ordered pairs z_{i} = z^{1}_{i} \otimes z^{2}_{i} \in (1, 1), (1, 2), (2, 1), (2, 2), which we can encode as \begin{align*} (1, 1) &\rightarrow 1 \\ (1, 2) &\rightarrow 2 \\ (2, 1) &\rightarrow 3 \\ (2, 2) &\rightarrow 4. \end{align*}

If each of these component latent states evolves independently of the other with the transition matrices \begin{align*} \Gamma^{1}_{i, k'k} &= p(z^{1}_{i} = k \mid z^{1}_{i - 1} = k', \theta) \\ \Gamma^{2}_{i, k'k} &= p(z^{2}_{i} = k \mid z^{2}_{i - 1} = k', \theta) \end{align*} then we can write the joint transition matrix as \begin{align*} \boldsymbol{\Gamma}_{i} &= \boldsymbol{\Gamma}^{1}_{i} \otimes \boldsymbol{\Gamma}^{2}_{i} \\ &= \begin{pmatrix} \Gamma^{1}_{i, 11} \, \boldsymbol{\Gamma}^{2}_{i} & \Gamma^{1}_{i, 12} \, \boldsymbol{\Gamma}^{2}_{i} \\ \Gamma^{1}_{i, 21} \, \boldsymbol{\Gamma}^{2}_{i} & \Gamma^{1}_{i, 22} \, \boldsymbol{\Gamma}^{2}_{i} \end{pmatrix} \\ &= \begin{pmatrix} \Gamma^{1}_{i, 11} \, \Gamma^{2}_{i, 11} & \Gamma^{1}_{i, 11} \, \Gamma^{2}_{i, 12} & \Gamma^{1}_{i, 12} \, \Gamma^{2}_{i, 11} & \Gamma^{1}_{i, 12} \, \Gamma^{2}_{i, 12} \\ \Gamma^{1}_{i, 11} \, \Gamma^{2}_{i, 21} & \Gamma^{1}_{i, 11} \, \Gamma^{2}_{i, 22} & \Gamma^{1}_{i, 12} \, \Gamma^{2}_{i, 21} & \Gamma^{1}_{i, 12} \, \Gamma^{2}_{i, 22} \\ \Gamma^{1}_{i, 21} \, \Gamma^{2}_{i, 11} & \Gamma^{1}_{i, 21} \, \Gamma^{2}_{i, 12} & \Gamma^{1}_{i, 22} \, \Gamma^{2}_{i, 11} & \Gamma^{1}_{i, 22} \, \Gamma^{2}_{i, 12} \\ \Gamma^{1}_{i, 21} \, \Gamma^{2}_{i, 21} & \Gamma^{1}_{i, 21} \, \Gamma^{2}_{i, 22} & \Gamma^{1}_{i, 22} \, \Gamma^{2}_{i, 21} & \Gamma^{1}_{i, 22} \, \Gamma^{2}_{i, 22} \end{pmatrix}. \end{align*}
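Numerically this assembly is exactly a Kronecker product. A NumPy sketch, with an illustrative function name, assuming component-wise independent dynamics:

```python
import numpy as np

def joint_transition(Gammas):
    # Tensor (Kronecker) product of independent component transition
    # matrices; the result is again row stochastic, with entry
    # [(k1', k2', ...), (k1, k2, ...)] equal to the product of the
    # component transition probabilities.
    G = np.eye(1)
    for Gamma_m in Gammas:
        G = np.kron(G, Gamma_m)
    return G
```

The row and column ordering produced by np.kron matches the sequential relabeling used above, with later components varying fastest.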

Note that the tensor product of vectors and matrices is not strictly commutative. Changing the order of the components changes the organization of the ordered pairs which in turn permutes the elements of the joint states and joint transition matrix. In practice this just means that we have to be careful to organize everything consistently.

More generally we have to reason through all \left( \prod_{m = 1}^{M} K_{m} \right)^{2} of the possible couplings between the neighboring component states, \begin{align*} p(z_{i} &= k \mid z_{i - 1} = k', \theta) \\ &\equiv p(z^{1}_{i} = k_{1}, \ldots, z^{M}_{i} = k_{M} \mid z^{1}_{i - 1} = k'_{1}, \ldots, z^{M}_{i - 1} = k'_{M}, \theta). \end{align*} Any lack of conditional dependence, with the evolution of any component latent states depending on only some but not all of the component latent states, makes this task much more manageable.

5.3 Modeling Partially Observed Latent States

Unobserved latent states are not uncommon in hidden Markov models, especially when the models span many latent states. How best to model partially observed latent states depends on why the observations are incomplete.

For example if the probability that a particular latent state is unobserved is independent of any of the other model configuration variables then we can model partial observations with a straightforward marginalization. Formally if the ith latent state is unobserved then we can model the remaining observations with \begin{align*} p(z_{1:I}, &y_{1:(i - 1)}, y_{(i + 1):I} \mid \theta) \\ &= \int \mathrm{d} y_{i} \, p(z_{1:I}, y_{1:I} \mid \theta) \\ &= \int \mathrm{d} y_{i} \, p(z_{1} \mid \theta) \, p(y_{1} \mid 1, z_{1}, \theta) \prod_{i' = 2}^{I} p(z_{i'} \mid z_{i' - 1}, \theta) \, p(y_{i'} \mid i', z_{i'}, \theta ) \\ &= \quad p(z_{1} \mid \theta) \, p(y_{1} \mid 1, z_{1}, \theta) \\ &\quad \cdot \left[ \prod_{i' = 2}^{i - 1} p(z_{i'} \mid z_{i' - 1}, \theta) \, p(y_{i'} \mid i', z_{i'}, \theta ) \right] \\ &\quad \cdot p(z_{i} \mid z_{i - 1}, \theta) \, \int \mathrm{d} y_{i} \, p(y_{i} \mid i, z_{i}, \theta ) \\ &\quad \cdot \left[ \prod_{i' = i + 1}^{I} p(z_{i'} \mid z_{i' - 1}, \theta) \, p(y_{i'} \mid i', z_{i'}, \theta ) \right] \\ &=\quad p(z_{1} \mid \theta) \, p(y_{1} \mid 1, z_{1}, \theta) \\ &\quad \cdot \left[ \prod_{i' = 2}^{i - 1} p(z_{i'} \mid z_{i' - 1}, \theta) \, p(y_{i'} \mid i', z_{i'}, \theta ) \right] \\ &\quad \cdot p(z_{i} \mid z_{i - 1}, \theta) \, \\ &\quad \cdot \left[ \prod_{i' = i + 1}^{I} p(z_{i'} \mid z_{i' - 1}, \theta) \, p(y_{i'} \mid i', z_{i'}, \theta ) \right]. \end{align*}

Conveniently this is equivalent to setting the corresponding observational model evaluations to one, p(\tilde{y}_{i} \mid i, z_{i}, \theta ) = 1, or equivalently the log observational model evaluations to zero, \log \circ p(\tilde{y}_{i} \mid i, z_{i}, \theta ) = 0, for each z_{i}. In particular we can account for any unobserved latent states without having to modify the forward and forward-backward algorithms by simply replacing the component observational model evaluations with ones wherever necessary.
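A small sketch of this bookkeeping in Python with NumPy, using discrete component observational models so that the marginalization can be checked by direct summation; all names are illustrative.

```python
import numpy as np

def forward_marginal(rho, Gamma, omegas):
    # Marginal likelihood p(y_{1:I}) via the forward algorithm.  A row of
    # ones in omegas -- the observational model evaluations for one state
    # -- marginalizes out that state's unobserved data.
    alpha = omegas[0] * rho
    for omega in omegas[1:]:
        alpha = omega * (Gamma.T @ alpha)
    return alpha.sum()
```

With discrete observations and emission probabilities that sum to one over the possible outcomes, replacing a state's observational model evaluations with ones gives exactly the likelihood summed over that state's missing observation.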

If we are not interested in inferring the behavior of unobserved latent states then we can also model the partially observed data by marginalizing out the unobserved latent states entirely. For example we can model a gap of observations along the latent states z_{i}, z_{i + 1}, \ldots, z_{i + \delta} by removing them and then replacing the transition matrix \boldsymbol{\Gamma}_{i} with the matrix product (Figure 6) \prod_{i' = i}^{i + \delta + 1} \boldsymbol{\Gamma}_{i'}. After re-indexing the remaining latent states we can run the forward and forward-backward algorithms using the collapsed transition matrix. Really the only downside to this approach is that the collapsed hidden Markov model will be non-homogeneous even if the full hidden Markov model is homogeneous.
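A sketch of this collapsing with a non-homogeneous forward pass, in Python with NumPy and illustrative names; the check below confirms that collapsing a gap reproduces the same marginal likelihood as keeping the gap states with unit observational model evaluations.

```python
import numpy as np

def marginal_likelihood(rho, Gammas, omegas):
    # Non-homogeneous forward algorithm: Gammas[i] carries the transition
    # from state i + 1 into state i + 2 (one-indexed states).
    alpha = omegas[0] * rho
    for Gamma, omega in zip(Gammas, omegas[1:]):
        alpha = omega * (Gamma.T @ alpha)
    return alpha.sum()

def collapse(Gammas):
    # Matrix product of the transitions spanning an unobserved gap; the
    # product of row stochastic matrices is again row stochastic.
    G = np.eye(Gammas[0].shape[0])
    for Gamma in Gammas:
        G = G @ Gamma
    return G
```

Because row stochastic transition matrices compose by ordinary matrix multiplication, the collapsed matrix propagates the latent state probabilities across the entire gap in a single step.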

Figure 6: (a) Gaps of latent states without any observations can be (b) collapsed into a single transition given by the product of the transitions between the observed states.

Modeling more complicated mechanisms for unobserved latent states is, unsurprisingly, more complicated. A general approach that can be useful in practice is to introduce an auxiliary sequence of known, binary states w_{1}, \ldots, w_{i}, \ldots, w_{I} that indicates whether or not a latent state is complemented with any observations. For example we might define w_{i} = 0 when there are no observations at the ith state, and w_{i} = 1 when there are.

When w_{i} = 1 the component observational models behave as they would without any missingness, p(y_{i} \mid i, z_{i}, w_{i} = 1, \theta ) = p(y_{i} \mid i, z_{i}, \theta ), and when w_{i} = 0 they all return one, p(y_{i} \mid i, z_{i}, w_{i} = 0, \theta ) = 1. Because the w_{i} are known the implementation of the forward and forward-backward algorithms proceeds as before with these modified observational model outputs for each i.

The mechanism for which latent states are observed and which are not is then encoded in the structure of the transition probabilities for these auxiliary states. In general the transition from any w_{i - 1} to any w_{i} can depend on not only i and w_{i - 1} but also z_{i - 1} as well as any other model configuration variables, \Omega_{i, j'jk} = p(w_{i} = j' \mid w_{i - 1} = j, z_{i - 1} = k, \theta). Specifically the coupling of w_{i} to z_{i - 1} and \theta allows our inferences to account for any selection bias in partially observed data.

6 Degeneracies of Hidden Markov Models

As discussed in the mixture modeling chapter basic mixture observational models are vulnerable to degenerate inferences when the component observational models include redundant behaviors. By coupling the latent states together hidden Markov models can amplify these problems if we are not careful.

For example the redundancy of hidden Markov models with fully exchangeable component observational models at each state prevents data from unambiguously informing individual latent states. Consequently the latent state dynamics will be wildly uncertain.

That said the coupling between neighboring latent states can actually reduce uncertainties for less redundant component observational models.

For example if one latent state is complemented by only weakly informative observations then inferences for the corresponding component probabilities using that local data alone will be poor. More informative observations at the previous and following latent states, however, can provide less uncertain inferences for those neighboring component probabilities. When the behavior of the transition matrices is also well informed these neighboring inferences will inform the behavior of the intermediate state far beyond what the local observations could do on their own.

In general the more rigid the latent state dynamics are the more the constraints from each local observation will propagate across the entire sequence of latent states. We can take advantage of this rigidity, however, only when we have strong inferences for the transition probabilities.

Without informative observations at enough pairs of neighboring states we may not be able to learn much about the behavior of the transition matrices. This is especially problematic for non-homogeneous hidden Markov models, which are particularly data hungry.

7 Demonstrations

To anchor all of this abstract math into a more practical context let’s work through some implementations of hidden Markov models that demonstrate not only the basic features but also some of the more sophisticated directions that we can take the methodology.

7.1 Setup

First and foremost we have to set up our local R environment.

par(family="serif", las=1, bty="l",
    cex.axis=1, cex.lab=1, cex.main=1,
    xaxs="i", yaxs="i", mar = c(5, 5, 3, 1))

library(rstan)
rstan_options(auto_write = TRUE)            # Cache compiled Stan programs
options(mc.cores = parallel::detectCores()) # Parallelize chains
parallel:::setDefaultClusterOptions(setup_strategy = "sequential")
util <- new.env()
source('mcmc_analysis_tools_rstan.R', local=util)
source('mcmc_visualization_tools.R', local=util)

7.2 Comparing and Contrasting Implementations

Let’s start with a relatively simple exercise that compares some of the basic implementation strategies.

7.2.1 Simulate Data

We’ll simulate data assuming K = 3 component observational models across I = 100 states.

I <- 100
K <- 3

To avoid too much complexity we’ll assume a homogeneous transition matrix between all of the neighboring latent states and the same component observational models at each latent state.

simu_data.stan
data {
  // Number of latent states
  int<lower=1> I;
}

transformed data {
  // True model configuration
  int<lower=1> K = 3;
  array[K] real mu = {-3, 2, 5};
  array[K] real<lower=0> sigma = {1.0, 1.5, 0.75};

  simplex[K] rho = [0.8, 0.2, 0.0 ]';

  matrix[K, K] Gamma = [ [ 0.60, 0.30, 0.10 ],
                         [ 0.40, 0.50, 0.10 ],
                         [ 0.05, 0.05, 0.90 ] ];
}

generated quantities {
  array[I] int<lower=1, upper=K> z; // Simulated latent states
  array[I] real y;                  // Simulated observations

  //  Initial state
  z[1] = categorical_rng(rho);
  y[1] = normal_rng(mu[z[1]], sigma[z[1]]);

  for (i in 2:I) {
    // Following state
    vector[K] lambda = Gamma[z[i - 1],]';
    z[i] = categorical_rng(lambda);

    // Observation
    y[i] = normal_rng(mu[z[i]], sigma[z[i]]);
  }
}
simu <- stan(file="stan_programs/simu_data.stan",
             algorithm="Fixed_param", seed=194838,
             data=list("I" = I), iter=1, chains=1, refresh=0)
simu_samples <- util$extract_expectand_vals(simu)

y_names <- sapply(1:I, function(i) paste0('y[', i, ']'))
y <- sapply(y_names, function(name) simu_samples[[name]][1,1])

z_names <- sapply(1:I, function(i) paste0('z[', i, ']'))
z <- sapply(z_names, function(name) simu_samples[[name]][1,1])
data <- list("I" = I, "y" = y, "K" = K)

7.2.2 Exploratory Data Analysis

There are a few summary statistics that are particularly natural to the structure assumed by a hidden Markov model.

A histogram of all observations collapses the latent dynamics, allowing us to focus on the aggregate behavior of the component observational models. This summary is particularly informative when the component observational models are the same for all latent states.

par(mfrow=c(1, 1))

util$plot_line_hist(data$y, -8, 8, 1, xlab="y")

We can also isolate the latent dynamics as much as possible by plotting the observations against the corresponding latent state indices. This is often referred to as an empirical trajectory.

par(mfrow=c(1, 1))

plot(1, type="n",
     xlab="Latent State", xlim=c(0.5, data$I + 0.5),
     ylab="y", ylim=c(-8, 8))

for (i in 1:data$I) {
  lines(c(i - 0.5, i + 0.5), rep(data$y[i], 2),
        col="black", lwd=2)
}

7.2.3 Hidden Markov Model Log-Sum-Exp Implementation

Let’s first consider a hidden Markov model using the log-sum-exp implementation of the forward and forward-backward algorithms.

hmm_log_sum_exp.stan
data {
  int<lower=0> K;  // Number of component observational models
  int<lower=0> I;  // Number of latent states
  array[I] real y; // Observations of each latent state
}

parameters {
  // Component observational model configurations
  ordered[K] mu;
  array[K] real<lower=0> sigma;

  // Latent state dynamics
  simplex[K] rho;
  array[K] simplex[K] gamma;
}

model {
  // Prior model
  target += normal_lpdf(mu | 0, 10 / 2.32);   // -10 <~ mu[k] <~ +10
  target += normal_lpdf(sigma | 0, 5 / 2.57); // 0 <~ sigma[k] <~ 5

  target += dirichlet_lpdf(rho | rep_vector(1, K));

  for (k in 1:K)
    target += dirichlet_lpdf(gamma[k] | rep_vector(1, K));

  // Observational model
  {
    // Construct transition matrix
    matrix[K, K] log_Gamma;
    for (k in 1:K) log_Gamma[k,] = log(gamma[k]');

    // Forward algorithm
    vector[K] log_alpha = log(rho);

    for (i in 1:I) {
      vector[K] log_alpha_prev = log_alpha;
      for (k in 1:K) {
        real log_lambda = log_sum_exp(log_Gamma[,k] + log_alpha_prev);
        log_alpha[k] =   log_lambda
                       + normal_lpdf(y[i] | mu[k], sigma[k]);
      }
    }

    // Marginal observational model
    target += log_sum_exp(log_alpha);
  }
}

generated quantities {
  // Marginal latent state posterior probabilities
  array[I] simplex[K] z_prob;

  array[I] int z;       // Posterior latent states
  array[I] real y_pred; // Posterior predictive observations

  {
    // Construct transition matrix
    matrix[K, K] log_Gamma;
    for (k in 1:K) log_Gamma[k,] = log(gamma[k]');

    // Forward algorithm
    array[I] vector[K] log_alpha;

    for (i in 1:I) {
      for (k in 1:K) {
        real log_lambda;
        if (i > 1)
          log_lambda = log_sum_exp(log_Gamma[,k] + log_alpha[i - 1]);
        else
          log_lambda = log_sum_exp(log_Gamma[,k] + log(rho));

        log_alpha[i][k] =   log_lambda
                          + normal_lpdf(y[i] | mu[k], sigma[k]);
      }
    }

    // Sample final latent state
    vector[K] lambda = softmax(log_alpha[I]);
    z[I] = categorical_rng(lambda);
    y_pred[I] = normal_rng(mu[z[I]], sigma[z[I]]);

    // Sample latent states while running the backward algorithm
    vector[K] log_beta = zeros_vector(K);

    z_prob[I] = softmax( log_alpha[I] + log_beta );

    for (ri in 2:I) {
      int i = I + 1 - ri;
      int z_prev = z[i + 1];

      vector[K] log_beta_prev = log_beta;
      vector[K] log_omega_prev;
      for (k in 1:K)
        log_omega_prev[k] = normal_lpdf(y[i + 1] | mu[k], sigma[k]);

      lambda = softmax(   log_omega_prev[z_prev] + log_beta[z_prev]
                        + log_alpha[i] + log_Gamma[,z_prev]   );

      z[i] = categorical_rng(lambda);
      y_pred[i] = normal_rng(mu[z[i]],
                             sigma[z[i]]);

      for(k in 1:K) {
        log_beta[k] = log_sum_exp(  log_omega_prev
                                  + log_beta_prev
                                  + log_Gamma[k,]');
      }

      z_prob[i] = softmax( log_alpha[i] + log_beta );
    }
  }
}
fit <- stan(file="stan_programs/hmm_log_sum_exp.stan",
            data=data, seed=4938483,
            warmup=1000, iter=2024, refresh=0)

Hamiltonian Monte Carlo doesn’t appear to have any issues quantifying the posterior distribution.

diagnostics <- util$extract_hmc_diagnostics(fit)
util$check_all_hmc_diagnostics(diagnostics)
  All Hamiltonian Monte Carlo diagnostics are consistent with reliable
Markov chain Monte Carlo.
samples1 <- util$extract_expectand_vals(fit)
base_samples <- util$filter_expectands(samples1,
                                       c('mu', 'sigma',
                                         'rho', 'gamma'),
                                       check_arrays=TRUE)
util$check_all_expectand_diagnostics(base_samples)
All expectands checked appear to be behaving well enough for reliable
Markov chain Monte Carlo estimation.

Having modeled latent dynamics we can consider not only the histogram summary statistic but also the empirical trajectory. In both cases there is no appreciable retrodictive tension that would indicate any inadequacy of our modeling assumptions.

par(mfrow=c(1, 1))

util$plot_hist_quantiles(samples1, 'y_pred', -8, 8, 1,
                         baseline_values=data$y, xlab='y')
Warning in check_bin_containment(bin_min, bin_max, collapsed_values,
"predictive value"): 8 predictive values (0.0%) fell below the binning.
Warning in check_bin_containment(bin_min, bin_max, collapsed_values,
"predictive value"): 75 predictive values (0.0%) fell above the binning.

par(mfrow=c(1, 1))

names <- sapply(1:data$I, function(i) paste0('y_pred[', i, ']'))
util$plot_disc_pushforward_quantiles(samples1, names,
                                     baseline_values=data$y,
                                     xlab="Latent State", ylab="y")

Consequently we can have some confidence in the faithfulness of our posterior inferences.

This includes the inferred behavior of the component observational models.

par(mfrow=c(3, 2))

for (k in 1:data$K) {
  name <- paste0('mu[', k, ']')
  util$plot_expectand_pushforward(samples1[[name]], 75,
                                  display_name=name, flim=c(-10, 10))

  xs <- seq(-10, 10, 0.1)
  ys <- dnorm(xs, 0, 10 / 2.32);
  lines(xs, ys, lwd=2, col="white")
  lines(xs, ys, lwd=1, col=util$c_mid_teal)

  name <- paste0('sigma[', k, ']')
  util$plot_expectand_pushforward(samples1[[name]], 50,
                                  display_name=name, flim=c(0, 5))

  xs <- seq(0, 5, 0.1)
  ys <- 2 * dnorm(xs, 0, 5 / 2.57);
  lines(xs, ys, lwd=2, col="white")
  lines(xs, ys, lwd=1, col=util$c_mid_teal)
}

We can also make inferences about the behavior of the latent states.

par(mfrow=c(1, 1))

names <- sapply(1:data$I, function(i) paste0('z[', i, ']'))
util$plot_disc_pushforward_quantiles(samples1, names,
                                     xlab="Latent State", ylab="z")

Alternatively we can look at the inferred initial state and transition probabilities that drive the evolution of the latent states.

par(mfrow=c(1, 1))

names <- sapply(1:data$K, function(kk) paste0('rho[', kk, ']'))
util$plot_disc_pushforward_quantiles(samples1, names,
                                     xlab="Component",
                                     ylab="rho")

par(mfrow=c(3, 1))

for (k in 1:data$K) {
  names <- sapply(1:data$K,
                  function(kk) paste0('gamma[', k, ',', kk, ']'))
  util$plot_disc_pushforward_quantiles(samples1, names,
                                       xlab="Component",
                                       ylab=paste("Row", k, "of Gamma"))
}

7.2.4 Hidden Markov Model Dynamic Rescaling Implementation

We can also implement hidden Markov models with dynamic rescaling.
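Before looking at the Stan program, the rescaling idea can be sketched in a few lines of Python, with all numbers invented for illustration: divide the forward vector by its maximum after each update, accumulate the logarithms of those divisors, and `log(sum(alpha)) + log_norm` recovers the same log marginal likelihood as an unrescaled computation.

```python
import math

K = 3
rho = [0.5, 0.3, 0.2]
Gamma = [[0.8, 0.1, 0.1],
         [0.2, 0.6, 0.2],
         [0.1, 0.1, 0.8]]
mu = [-3.0, 0.0, 3.0]
sigma = [1.0, 0.5, 1.5]
y = [-2.9, 0.1, 0.2, 3.5, 3.1]    # Invented observations

def norm_pdf(x, m, s):
    return math.exp(-0.5 * ((x - m) / s) ** 2) / (s * math.sqrt(2 * math.pi))

# Rescaled forward pass: divide by the max after each update and
# accumulate the logs of the divisors
alpha = rho[:]
log_norm = 0.0
for x in y:
    omega = [norm_pdf(x, mu[k], sigma[k]) for k in range(K)]
    alpha = [omega[k] * sum(Gamma[j][k] * alpha[j] for j in range(K))
             for k in range(K)]
    norm = max(alpha)
    log_norm += math.log(norm)
    alpha = [a / norm for a in alpha]

log_marginal_rescaled = math.log(sum(alpha)) + log_norm

# Unrescaled forward pass; safe here only because the sequence is short
alpha = rho[:]
for x in y:
    omega = [norm_pdf(x, mu[k], sigma[k]) for k in range(K)]
    alpha = [omega[k] * sum(Gamma[j][k] * alpha[j] for j in range(K))
             for k in range(K)]
log_marginal_direct = math.log(sum(alpha))

assert math.isclose(log_marginal_rescaled, log_marginal_direct, rel_tol=1e-9)
```

For long sequences the unrescaled pass would underflow to zero, while the rescaled pass keeps every intermediate value in a numerically safe range.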

hmm_rescaled.stan
data {
  int<lower=0> K;  // Number of component observational models
  int<lower=0> I;  // Number of latent states
  array[I] real y; // Observations of each latent state
}

parameters {
  // Component observational model configurations
  ordered[K] mu;
  array[K] real<lower=0> sigma;

  // Latent state dynamics
  simplex[K] rho;
  array[K] simplex[K] gamma;
}

model {
  // Prior model
  target += normal_lpdf(mu | 0, 10 / 2.32);   // -10 <~ mu[k] <~ +10
  target += normal_lpdf(sigma | 0, 5 / 2.57); // 0 <~ sigma[k] <~ 5

  target += dirichlet_lpdf(rho | rep_vector(1, K));

  for (k in 1:K)
    target += dirichlet_lpdf(gamma[k] | rep_vector(1, K));

  // Observational model
  {
    // Construct transition matrix
    matrix[K, K] Gamma;
    for (k in 1:K) Gamma[k,] = gamma[k]';

    // Forward algorithm
    real norm;
    real log_norm = 0;

    vector[K] alpha = rho;

    norm = max(alpha);
    log_norm += log(norm);
    alpha /= norm;

    for (i in 1:I) {
      vector[K] omega;
      for (k in 1:K)
        omega[k] = exp(normal_lpdf(y[i] | mu[k], sigma[k]));

      alpha = omega .* (Gamma' * alpha);

      norm = max(alpha);
      log_norm += log(norm);
      alpha /= norm;
    }

    // Marginal observational model
    target += log(sum(alpha)) + log_norm;
  }
}

generated quantities {
  // Marginal latent state posterior probabilities
  array[I] simplex[K] z_prob;

  array[I] int z;       // Posterior latent states
  array[I] real y_pred; // Posterior predictive observations

  {
    // Construct transition matrix
    matrix[K, K] Gamma;
    for (k in 1:K) Gamma[k,] = gamma[k]';

    // Forward algorithm
    array[I] vector[K] alpha;

    for (i in 1:I) {
      vector[K] omega;
      for (k in 1:K)
        omega[k] = exp(normal_lpdf(y[i] | mu[k], sigma[k]));

      if (i > 1)
        alpha[i] = omega .* (Gamma' * alpha[i - 1]);
      else
        alpha[i] = omega .* (Gamma' * rho);

      alpha[i] /= max(alpha[i]);
    }

    // Sample final latent state
    vector[K] r = alpha[I];
    vector[K] lambda = r / sum(r);

    z[I] = categorical_rng(lambda);
    y_pred[I] = normal_rng(mu[z[I]], sigma[z[I]]);

    // Sample latent states while running the backward algorithm
    vector[K] beta = ones_vector(K);

    z_prob[I] = alpha[I] .* beta;
    z_prob[I] /= sum(z_prob[I]);

    for (ri in 2:I) {
      int i = I + 1 - ri;
      int z_prev = z[i + 1];

      vector[K] omega_prev;
      for (k in 1:K)
        omega_prev[k] = exp(normal_lpdf(y[i + 1] | mu[k], sigma[k]));

      r =   beta[z_prev] * omega_prev[z_prev]
          * ( alpha[i] .* Gamma[,z_prev] );
      lambda = r / sum(r);

      z[i] = categorical_rng(lambda);
      y_pred[i] = normal_rng(mu[z[i]],
                             sigma[z[i]]);

      beta = Gamma * (omega_prev .* beta);
      beta /= max(beta);

      z_prob[i] = alpha[i] .* beta;
      z_prob[i] /= sum(z_prob[i]);
    }
  }
}
fit <- stan(file="stan_programs/hmm_rescaled.stan",
            data=data, seed=4938483,
            warmup=1000, iter=2024, refresh=0)

No computational diagnostic warnings arise with this new implementation.

diagnostics <- util$extract_hmc_diagnostics(fit)
util$check_all_hmc_diagnostics(diagnostics)
  All Hamiltonian Monte Carlo diagnostics are consistent with reliable
Markov chain Monte Carlo.
samples2 <- util$extract_expectand_vals(fit)
base_samples <- util$filter_expectands(samples2,
                                       c('mu', 'sigma',
                                         'rho', 'gamma'),
                                       check_arrays=TRUE)
util$check_all_expectand_diagnostics(base_samples)
All expectands checked appear to be behaving well enough for reliable
Markov chain Monte Carlo estimation.

Moreover the posterior retrodictive performance not only continues to be excellent but is also consistent with what we saw from the log-sum-exp implementation.

par(mfrow=c(1, 1))

util$plot_hist_quantiles(samples2, 'y_pred', -8, 8, 1,
                         baseline_values=data$y, xlab='y')
Warning in check_bin_containment(bin_min, bin_max, collapsed_values,
"predictive value"): 5 predictive values (0.0%) fell below the binning.
Warning in check_bin_containment(bin_min, bin_max, collapsed_values,
"predictive value"): 70 predictive values (0.0%) fell above the binning.

par(mfrow=c(1, 1))

names <- sapply(1:data$I, function(i) paste0('y_pred[', i, ']'))
util$plot_disc_pushforward_quantiles(samples2, names,
                                     baseline_values=data$y,
                                     xlab="Latent State", ylab="y")

This is not surprising given the similarity of the posterior inferences.

par(mfrow=c(3, 2))

for (k in 1:data$K) {
  name <- paste0('mu[', k, ']')
  util$plot_expectand_pushforward(samples2[[name]], 75,
                                  display_name=name, flim=c(-10, 10))

  xs <- seq(-10, 10, 0.1)
  ys <- dnorm(xs, 0, 10 / 2.32);
  lines(xs, ys, lwd=2, col="white")
  lines(xs, ys, lwd=1, col=util$c_mid_teal)

  name <- paste0('sigma[', k, ']')
  util$plot_expectand_pushforward(samples2[[name]], 50,
                                  display_name=name, flim=c(0, 5))

  xs <- seq(0, 5, 0.1)
  ys <- 2 * dnorm(xs, 0, 5 / 2.57);
  lines(xs, ys, lwd=2, col="white")
  lines(xs, ys, lwd=1, col=util$c_mid_teal)
}

par(mfrow=c(1, 1))

names <- sapply(1:data$I, function(i) paste0('z[', i, ']'))
util$plot_disc_pushforward_quantiles(samples2, names,
                                     xlab="Latent State", ylab="z")

par(mfrow=c(2, 1))

names <- sapply(1:data$K, function(kk) paste0('rho[', kk, ']'))
util$plot_disc_pushforward_quantiles(samples2, names,
                                     xlab="Component",
                                     ylab="rho")

for (k in 1:data$K) {
  names <- sapply(1:data$K,
                  function(kk) paste0('gamma[', k, ',', kk, ']'))
  util$plot_disc_pushforward_quantiles(samples2, names,
                                       xlab="Component",
                                       ylab=paste("Row", k, "of Gamma"))
}

7.2.5 Hidden Markov Model Built-In Implementation

Finally we can use the hidden Markov model functions that are provided in the Stan Modeling Language. These functions take as inputs a matrix of component observational model evaluations, a homogeneous transition matrix, and initial state probabilities. They then run the forward and forward-backward algorithms using dynamic rescaling.

Using these built-in functions should give results equivalent to the hand-coded implementation in the previous section. Because they are written in C++ they should also be a bit faster. On the other hand their restriction to homogeneous hidden Markov models can limit their scope.
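The following Python sketch, a hypothetical toy setup with invented numbers, illustrates the kind of computation that `hmm_hidden_state_prob` performs: rescaled forward and backward passes whose normalized elementwise product recovers the exact marginal posterior state probabilities, verified here against brute-force enumeration.

```python
import itertools
import math

K, I = 3, 4
rho = [0.4, 0.35, 0.25]
Gamma = [[0.7, 0.2, 0.1],
         [0.1, 0.8, 0.1],
         [0.2, 0.2, 0.6]]
mu = [-2.0, 0.0, 2.0]
sigma = [1.0, 1.0, 1.0]
y = [-1.8, -0.2, 1.9, 2.2]        # Invented observations

def norm_pdf(x, m, s):
    return math.exp(-0.5 * ((x - m) / s) ** 2) / (s * math.sqrt(2 * math.pi))

omega = [[norm_pdf(y[i], mu[k], sigma[k]) for k in range(K)] for i in range(I)]

# Rescaled forward pass: alpha[i][k] proportional to p(y_1..y_i, z_i = k)
alpha = []
prev = rho[:]
for i in range(I):
    a = [omega[i][k] * sum(Gamma[j][k] * prev[j] for j in range(K))
         for k in range(K)]
    m = max(a)
    alpha.append([v / m for v in a])
    prev = alpha[i]

# Rescaled backward pass: beta[k] proportional to p(y_{i+1}..y_I | z_i = k);
# the normalized product alpha .* beta gives the marginal state posteriors
z_prob = [None] * I
beta = [1.0] * K
for i in range(I - 1, -1, -1):
    un = [alpha[i][k] * beta[k] for k in range(K)]
    s = sum(un)
    z_prob[i] = [v / s for v in un]
    if i > 0:
        beta = [sum(Gamma[k][j] * omega[i][j] * beta[j] for j in range(K))
                for k in range(K)]
        m = max(beta)
        beta = [b / m for b in beta]

# Brute-force check against enumeration of all latent trajectories
total = 0.0
marg = [[0.0] * K for _ in range(I)]
for path in itertools.product(range(K), repeat=I):
    p = sum(rho[j] * Gamma[j][path[0]] for j in range(K)) * omega[0][path[0]]
    for i in range(1, I):
        p *= Gamma[path[i - 1]][path[i]] * omega[i][path[i]]
    total += p
    for i in range(I):
        marg[i][path[i]] += p

for i in range(I):
    for k in range(K):
        assert math.isclose(z_prob[i][k], marg[i][k] / total, rel_tol=1e-6)
```

Because the per-step rescaling factors cancel in the normalization, the rescaled passes give exactly the same marginal probabilities as the unrescaled ones, just without any risk of underflow.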

hmm_builtin.stan
data {
  int<lower=0> K;  // Number of component observational models
  int<lower=0> I;  // Number of latent states
  array[I] real y; // Observations of each latent state
}

parameters {
  // Component observational model configurations
  ordered[K] mu;
  array[K] real<lower=0> sigma;

  // Latent state dynamics
  simplex[K] rho;
  array[K] simplex[K] gamma;
}

model {
  // Prior model
  target += normal_lpdf(mu | 0, 10 / 2.32);   // -10 <~ mu[k] <~ +10
  target += normal_lpdf(sigma | 0, 5 / 2.57); // 0 <~ sigma[k] <~ 5

  target += dirichlet_lpdf(rho | rep_vector(1, K));

  for (k in 1:K)
    target += dirichlet_lpdf(gamma[k] | rep_vector(1, K));

  // Observational model
  {
    // Construct transition matrix
    matrix[K, K] Gamma;
    for (k in 1:K) Gamma[k,] = gamma[k]';

    // Construct component observational model evaluations
    matrix[K, I] log_omega;
    for (i in 1:I)
      for (k in 1:K)
        log_omega[k, i] = normal_lpdf(y[i] | mu[k], sigma[k]);

    // Marginal observational model
    target += hmm_marginal(log_omega, Gamma, rho);
  }
}

generated quantities {
  // Marginal latent state posterior probabilities
  array[I] simplex[K] z_prob;

  array[I] int z;       // Posterior latent states
  array[I] real y_pred; // Posterior predictive observations

  {
    // Construct transition matrix
    matrix[K, K] Gamma;
    for (k in 1:K) Gamma[k,] = gamma[k]';

    // Construct component observational model evaluations
    matrix[K, I] log_omega;
    for (i in 1:I)
      for (k in 1:K)
        log_omega[k, i] = normal_lpdf(y[i] | mu[k], sigma[k]);

    z = hmm_latent_rng(log_omega, Gamma, rho);
    y_pred = normal_rng(mu[z], sigma[z]);

    matrix[K, I] A = hmm_hidden_state_prob(log_omega, Gamma, rho);
    for (i in 1:I)
      z_prob[i] = A[,i];
  }
}
fit <- stan(file="stan_programs/hmm_builtin.stan",
            data=data, seed=4938483,
            warmup=1000, iter=2024, refresh=0)

Encouragingly, the computational diagnostics remain clean.

diagnostics <- util$extract_hmc_diagnostics(fit)
util$check_all_hmc_diagnostics(diagnostics)
  All Hamiltonian Monte Carlo diagnostics are consistent with reliable
Markov chain Monte Carlo.
samples3 <- util$extract_expectand_vals(fit)
base_samples <- util$filter_expectands(samples3,
                                       c('mu', 'sigma',
                                         'rho', 'gamma'),
                                       check_arrays=TRUE)
util$check_all_expectand_diagnostics(base_samples)
All expectands checked appear to be behaving well enough for reliable
Markov chain Monte Carlo estimation.

We then can examine the posterior retrodictive and posterior inferential behaviors as before. As expected everything appears to be consistent with the previous model implementations.

par(mfrow=c(1, 1))

util$plot_hist_quantiles(samples3, 'y_pred', -8, 8, 1,
                         baseline_values=data$y, xlab='y')
Warning in check_bin_containment(bin_min, bin_max, collapsed_values,
"predictive value"): 11 predictive values (0.0%) fell below the binning.
Warning in check_bin_containment(bin_min, bin_max, collapsed_values,
"predictive value"): 77 predictive values (0.0%) fell above the binning.

par(mfrow=c(1, 1))

names <- sapply(1:data$I, function(i) paste0('y_pred[', i, ']'))
util$plot_disc_pushforward_quantiles(samples3, names,
                                     baseline_values=data$y,
                                     xlab="Latent State", ylab="y")

par(mfrow=c(3, 2))

for (k in 1:data$K) {
  name <- paste0('mu[', k, ']')
  util$plot_expectand_pushforward(samples3[[name]], 75,
                                  display_name=name, flim=c(-10, 10))

  xs <- seq(-10, 10, 0.1)
  ys <- dnorm(xs, 0, 10 / 2.32);
  lines(xs, ys, lwd=2, col="white")
  lines(xs, ys, lwd=1, col=util$c_mid_teal)

  name <- paste0('sigma[', k, ']')
  util$plot_expectand_pushforward(samples3[[name]], 50,
                                  display_name=name, flim=c(0, 5))

  xs <- seq(0, 5, 0.1)
  ys <- 2 * dnorm(xs, 0, 5 / 2.57);
  lines(xs, ys, lwd=2, col="white")
  lines(xs, ys, lwd=1, col=util$c_mid_teal)
}

par(mfrow=c(1, 1))

names <- sapply(1:data$I, function(i) paste0('z[', i, ']'))
util$plot_disc_pushforward_quantiles(samples3, names,
                                     xlab="Latent State", ylab="z")

par(mfrow=c(2, 1))

names <- sapply(1:data$K, function(kk) paste0('rho[', kk, ']'))
util$plot_disc_pushforward_quantiles(samples3, names,
                                     xlab="Component",
                                     ylab="rho")

for (k in 1:data$K) {
  names <- sapply(1:data$K,
                  function(kk) paste0('gamma[', k, ',', kk, ']'))
  util$plot_disc_pushforward_quantiles(samples3, names,
                                       xlab="Component",
                                       ylab=paste("Row", k, "of Gamma"))
}

7.2.6 Implementation Consistency Check

To properly compare the outputs of these three model implementations we need to visualize them side by side.

As expected, or at least hoped, the posterior retrodictive performance for the empirical trajectories is consistent across all three model implementations.

par(mfrow=c(3, 1))

names <- sapply(1:data$I, function(i) paste0('y_pred[', i, ']'))

util$plot_disc_pushforward_quantiles(samples1, names,
                                     baseline_values=data$y,
                                     xlab="Latent State", ylab="y",
                                     main="Log-Sum-Exp")

util$plot_disc_pushforward_quantiles(samples2, names,
                                     baseline_values=data$y,
                                     xlab="Latent State", ylab="y",
                                     main="Rescaling")

util$plot_disc_pushforward_quantiles(samples3, names,
                                     baseline_values=data$y,
                                     xlab="Latent State", ylab="y",
                                     main="Built-In")

So too is the inferred behavior of the latent states.

par(mfrow=c(3, 1))

names <- sapply(1:data$I, function(i) paste0('z[', i, ']'))

util$plot_disc_pushforward_quantiles(samples1, names,
                                     xlab="Latent State", ylab="z",
                                     main="Log-Sum-Exp")

util$plot_disc_pushforward_quantiles(samples2, names,
                                     xlab="Latent State", ylab="z",
                                     main="Rescaling")

util$plot_disc_pushforward_quantiles(samples3, names,
                                     xlab="Latent State", ylab="z",
                                     main="Built-In")

The consistency continues into the inferences for the dynamical parameters.

par(mfrow=c(3, 1))

names <- sapply(1:data$K, function(kk) paste0('rho[', kk, ']'))

util$plot_disc_pushforward_quantiles(samples1, names,
                                     xlab="Component",
                                     ylab="rho",
                                     display_ylim=c(0, 1),
                                     main="Log-Sum-Exp")

util$plot_disc_pushforward_quantiles(samples2, names,
                                     xlab="Component",
                                     ylab="rho",
                                     display_ylim=c(0, 1),
                                     main="Rescaling")

util$plot_disc_pushforward_quantiles(samples3, names,
                                     xlab="Component",
                                     ylab="rho",
                                     display_ylim=c(0, 1),
                                     main="Built-In")

par(mfrow=c(3, 3))

for (k in 1:data$K) {
  names <- sapply(1:data$K,
                  function(kk) paste0('gamma[', k, ',', kk, ']'))

  util$plot_disc_pushforward_quantiles(samples1, names,
                                       xlab="Component",
                                       ylab=paste("Row", k, "of Gamma"),
                                       main="Log-Sum-Exp")

  util$plot_disc_pushforward_quantiles(samples2, names,
                                       xlab="Component",
                                       ylab=paste("Row", k, "of Gamma"),
                                       main="Rescaling")

  util$plot_disc_pushforward_quantiles(samples3, names,
                                       xlab="Component",
                                       ylab=paste("Row", k, "of Gamma"),
                                       main="Built-In")
}

Finally the inferred behavior of the component observational models is the same across all three model implementations.

par(mfrow=c(3, 2))

for (k in 1:data$K) {
  name <- paste0('mu[', k, ']')
  util$plot_expectand_pushforward(samples1[[name]],
                                  75, flim=c(-10, 10),
                                  display_name=name)
  util$plot_expectand_pushforward(samples2[[name]],
                                  75, flim=c(-10, 10),
                                  col=util$c_mid,
                                  border="#DDDDDD88",
                                  add=TRUE)
  util$plot_expectand_pushforward(samples3[[name]],
                                  75, flim=c(-10, 10),
                                  col=util$c_light,
                                  border="#DDDDDD88",
                                  add=TRUE)

  xs <- seq(-10, 10, 0.1)
  ys <- dnorm(xs, 0, 10 / 2.32);
  lines(xs, ys, lwd=2, col="white")
  lines(xs, ys, lwd=1, col=util$c_mid_teal)

  name <- paste0('sigma[', k, ']')
  util$plot_expectand_pushforward(samples1[[name]],
                                  50, flim=c(0, 5),
                                  display_name=name)
  util$plot_expectand_pushforward(samples2[[name]],
                                  50, flim=c(0, 5),
                                  col=util$c_mid,
                                  border="#DDDDDD88",
                                  add=TRUE)
  util$plot_expectand_pushforward(samples3[[name]],
                                  50, flim=c(0, 5),
                                  col=util$c_light,
                                  border="#DDDDDD88",
                                  add=TRUE)

  xs <- seq(0, 5, 0.1)
  ys <- 2 * dnorm(xs, 0, 5 / 2.57);
  lines(xs, ys, lwd=2, col="white")
  lines(xs, ys, lwd=1, col=util$c_mid_teal)
}

7.2.7 Static/Dynamic Mixture Model Comparison

Finally, to see if there is any advantage to modeling the dynamics of the latent states, let's compare the hidden Markov model inferences to inferences from a static mixture model.

static_mixture.stan
data {
  int<lower=0> K;  // Number of component observational models
  int<lower=0> I;  // Number of latent states
  array[I] real y; // Observations at each step
}

parameters {
  // Component observational model configurations
  ordered[K] mu;
  array[K] real<lower=0> sigma;

  // Homogeneous mixture probabilities
  simplex[K] lambda;
}

model {
  // Prior model
  target += normal_lpdf(mu | 0, 10 / 2.32);   // -10 <~ mu[k] <~ +10
  target += normal_lpdf(sigma | 0, 5 / 2.57); // 0 <~ sigma[k] <~ 5
  target += dirichlet_lpdf(lambda | rep_vector(1, K));

  // Observational model
  for (i in 1:I) {
    array[K] real lpds;
    for (k in 1:K) {
      lpds[k] =   log(lambda[k])
                + normal_lpdf(y[i] | mu[k], sigma[k]);
    }
    target += log_sum_exp(lpds);
  }
}

generated quantities {
  array[I] real y_pred; // Posterior predictive observations

  for (i in 1:I) {
    int z = categorical_rng(lambda);
    y_pred[i] = normal_rng(mu[z], sigma[z]);
  }
}
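The per-observation marginalization in this model block reduces to a log-sum-exp over components. A minimal Python sketch, with invented numbers, confirms that the stabilized computation matches a direct computation on the natural scale.

```python
import math

K = 3
lam = [0.3, 0.5, 0.2]             # Invented mixture probabilities
mu = [-3.0, 0.0, 3.0]
sigma = [1.0, 0.5, 1.5]
y = [-2.7, 0.4, 3.3]              # Invented observations

def normal_lpdf(x, m, s):
    return -0.5 * ((x - m) / s) ** 2 - math.log(s) - 0.5 * math.log(2 * math.pi)

def log_sum_exp(xs):
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

# Stable per-observation marginalization, as in the model block
ll = sum(log_sum_exp([math.log(lam[k]) + normal_lpdf(x, mu[k], sigma[k])
                      for k in range(K)])
         for x in y)

# Equivalent direct computation on the natural scale
ll_direct = sum(math.log(sum(lam[k] * math.exp(normal_lpdf(x, mu[k], sigma[k]))
                             for k in range(K)))
                for x in y)

assert math.isclose(ll, ll_direct, rel_tol=1e-9)
```

Unlike the hidden Markov model, the static mixture's component assignments are independent across observations, so the marginalization factorizes into these independent per-observation sums with no forward recursion required.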
fit <- stan(file="stan_programs/static_mixture.stan",
            data=data, seed=4938483,
            warmup=1000, iter=2024, refresh=0)

The computational diagnostics are clean.

diagnostics <- util$extract_hmc_diagnostics(fit)
util$check_all_hmc_diagnostics(diagnostics)
  All Hamiltonian Monte Carlo diagnostics are consistent with reliable
Markov chain Monte Carlo.
samples4 <- util$extract_expectand_vals(fit)
base_samples <- util$filter_expectands(samples4,
                                       c('mu', 'sigma',
                                         'lambda'),
                                       check_arrays=TRUE)
util$check_all_expectand_diagnostics(base_samples)
All expectands checked appear to be behaving well enough for reliable
Markov chain Monte Carlo estimation.

We also see excellent posterior retrodictive performance with respect to the histogram summary statistic. When the component observational models are homogeneous across the latent states static mixture models often perform reasonably well in aggregate even though they ignore the latent dynamics.

par(mfrow=c(1, 1))

util$plot_hist_quantiles(samples4, 'y_pred', -8, 8, 1,
                         baseline_values=data$y, xlab='y')
Warning in check_bin_containment(bin_min, bin_max, collapsed_values,
"predictive value"): 48 predictive values (0.0%) fell below the binning.
Warning in check_bin_containment(bin_min, bin_max, collapsed_values,
"predictive value"): 647 predictive values (0.2%) fell above the binning.

Inferences for the location and scale of the first component observational model are nearly identical between the dynamic and static models, but the inferences for the second and third component observational model configurations are more strongly informed with the hidden Markov model.

par(mfrow=c(3, 2))

mu_lims <- list(c(-4, -2), c(-5, 5), c(3.5, 6))
mu_ylims <- list(c(0, 2.75), c(0, 1.25), c(0, 4))

sigma_lims <- list(c(0.25, 1.75), c(0, 4.5), c(0.25, 2))
sigma_ylims <- list(c(0, 3.5), c(0, 1.5), c(0, 5.5))

for (k in 1:data$K) {
  name <- paste0('mu[', k, ']')
  util$plot_expectand_pushforward(samples3[[name]], 50,
                                  display_name=name,
                                  flim=mu_lims[[k]],
                                  ylim=mu_ylims[[k]])
  util$plot_expectand_pushforward(samples4[[name]], 50,
                                  display_name=name,
                                  flim=mu_lims[[k]],
                                  ylim=mu_ylims[[k]],
                                  col=util$c_mid,
                                  border="#DDDDDD88",
                                  add=TRUE)

  name <- paste0('sigma[', k, ']')
  util$plot_expectand_pushforward(samples3[[name]], 50,
                                  display_name=name,
                                  flim=sigma_lims[[k]],
                                  ylim=sigma_ylims[[k]])
  util$plot_expectand_pushforward(samples4[[name]], 50,
                                  display_name=name,
                                  flim=sigma_lims[[k]],
                                  ylim=sigma_ylims[[k]],
                                  col=util$c_mid,
                                  border="#DDDDDD88",
                                  add=TRUE)
}
Warning in util$plot_expectand_pushforward(samples4[[name]], 50, display_name =
name, : 1 value (0.0%) fell below the histogram binning.
Warning in util$plot_expectand_pushforward(samples4[[name]], 50, display_name =
name, : 1 value (0.0%) fell above the histogram binning.

Even when a static mixture model adequately captures the aggregate behavior of the observed data, ignoring the latent dynamics generally results in larger inferential uncertainties.

7.3 Modeling Unobserved Latent States

Now let’s explore some of the ways that we can accommodate latent states that have been only partially observed.

For this exercise we'll use a data set with fewer observations than latent states, so at least some of the latent states must be missing complementary observations.

data <- read_rdump("data/miss.data.R")

cat(sprintf('%i latent states', data$I))
100 latent states
cat(sprintf('%i observational model components', data$K))
3 observational model components
cat(sprintf('%i observations', data$N))
60 observations

If we look at the empirical trajectory we can see two prominent gaps in the observations.

par(mfrow=c(1, 1))

plot(1, type="n",
     xlab="Latent State", xlim=c(0.5, data$I + 0.5),
     ylab="y", ylim=c(-8, 8))

for (miss_win in list(c(31, 55), c(76, 90))) {
  polygon(c(miss_win[1] - 0.5, miss_win[2] + 0.5,
            miss_win[2] + 0.5, miss_win[1] - 0.5),
          c(-8, -8, 8, 8), col="#DDDDDD", border=NA)
}

for (n in 1:data$N) {
  lines(c(data$state[n] - 0.5, data$state[n] + 0.5),
        rep(data$y[n], 2),
        col="black", lwd=2)
}