In order to apply probability theory in practice we need to be able to work with a given target probability distribution. More formally we need to be able to compute expectation values of relevant functions with respect to that target distribution or, more realistically, accurately estimate them. Unfortunately this is no easy task, especially when our ambient space exceeds a few dimensions and the target probability distribution concentrates into a typical set.
As we saw in my probabilistic computation case study, stochastic methods offer the possibility of quantifying high-dimensional target distributions with increasing accuracy given a corresponding increase in computation. In particular, stochastic methods efficiently compress the target distribution into a finite ensemble of samples that naturally concentrate within its typical set. Monte Carlo methods utilize exact samples to construct incredibly well-behaved estimators, but generating those exact samples is not feasible for most practical problems. Markov chain Monte Carlo generates correlated samples that share some of the same benefits as the Monte Carlo method under ideal circumstances but devolve to significantly worse behavior under less ideal circumstances.
Markov chain Monte Carlo is one of our best tools in the desperate struggle against high-dimensional probabilistic computation, but its fragility makes it dangerous to wield without adequate training. In order to use the method responsibly, and ensure accurate quantification of the target distribution, practitioners need to know not just how it works under ideal circumstances but also how it breaks under less ideal, but potentially more common, circumstances. Most importantly a practitioner needs to be able to identify when the method breaks and they shouldn’t trust the results that it gives. In other words a responsible user of Markov chain Monte Carlo needs to know how to manage its risks.
Unfortunately the Markov chain Monte Carlo literature provides limited guidance for practical risk management. Most introductory references, especially those coming from applied fields, oversimplify the method and give readers a treacherous overconfidence which can facilitate a poor quantification, and hence understanding, of the target distributions they study. More formal treatments are more thorough, but they’re also overwhelmed with intense technical detail that is largely irrelevant to practitioners. In this case study I attempt a compromise between these two extremes, introducing all of the important concepts needed to understand how Markov chain Monte Carlo performs in practice with clear motivations and without any extraneous technical detail. This compromise, however, will still be significantly more comprehensive than most introductory treatments so the reader should adjust their expectations accordingly. Provided you can estimate those expectations accurately, of course…
Before introducing Markov chain Monte Carlo we will begin with a short review of the Monte Carlo method. We will then continue to discuss Markov transitions that generate Markov chains, Markov chain Monte Carlo estimators derived from those Markov chains, and the delicate Markov chain Monte Carlo central limit theorem that is critical for the practical utility of this approach. Finally we will discuss how these theoretical concepts manifest in practice and carefully study the behavior of an explicit implementation of Markov chain Monte Carlo. Section 2.4 and Section 3.3 expand on some more technical considerations but are not strictly necessary for practical use of Markov chain Monte Carlo and may be skipped. They do, however, connect the conceptual ideas introduced here to the more formal concepts in the statistical literature and are recommended for anyone looking for more context.
There is a large literature on Markov chain Monte Carlo that both complements and contrasts with the presentation given here, and hence may prove useful to the reader. [1] is a nice high-level introduction while [2] and [3] collect a diversity of articles on some of the theoretical and practical aspects of Markov chain Monte Carlo, with lots of references to further literature. For a more theoretical treatment I recommend starting with the excellent review article [4] and the text [5] before considering the dense enchiridions, [6] and [7]. For a history of Markov chain Monte Carlo see [8].
Recall that an exact sampling mechanism is a procedure for generating arbitrarily long sequences of independent points in the ambient space, \[ \{ q_{1}, \ldots, q_{N} \} \in Q, \] such that the ensemble average of any real-valued, integrable function \(f\), \[ \hat{f}_{N}^{\text{MC}} = \frac{1}{N} \sum_{n = 1}^{N} f(q_{n}), \] is asymptotically consistent. More formally the pushforward distribution of the ensemble average converges to a Dirac distribution concentrated at the target expectation value, \[ \lim_{N \rightarrow \infty} \pi_{f^{\text{MC}}_{N}} = \delta_{\mathbb{E}_{\pi}[f]}. \] In other words the ensemble average in this limit evaluates to the target expectation value, \[ \lim_{N \rightarrow \infty} \hat{f}_{N}^{\text{MC}} = \mathbb{E}_{\pi}[f], \] for all but a potentially finite number of realized samples. These ensemble averages are denoted Monte Carlo estimators.
By construction these exact samples concentrate in the typical set of the target distribution, or the target typical set. As the sample size grows they define increasingly refined quantifications of that geometry.
Importantly samples provide an accurate quantification even as the target typical set becomes relatively narrow in higher dimensions. This robustness to the specific geometry of the typical set helps explain why Monte Carlo estimators have no apparent dimensionality dependence – all of the hard work needed to scale with dimension has already been absorbed in the generation of the exact samples.
Asymptotic consistency, however, isn’t particularly useful in practice because we’ll never be able to generate infinitely large ensembles with our finite computational resources. For Monte Carlo estimators to be practical we need to know how they converge to the limiting Dirac distribution. Fortunately the finite ensemble behavior of Monte Carlo estimators can also be quantified to high accuracy.
The Monte Carlo estimator for any square-integrable real-valued function, meaning that both \(\mathbb{E}_{\pi}[f]\) and \(\mathbb{E}_{\pi}[f^{2}]\) exist and are finite, satisfies a central limit theorem. The central limit theorem states that as the number of samples increases the pushforward distribution of standardized Monte Carlo estimators converges towards a probability distribution specified by a unit normal probability density function, \[ \lim_{N \rightarrow \infty} \frac{ \hat{f}_{N}^{\text{MC}} - \mathbb{E}_{\pi}[f] } { \text{MC-SE}_{N}[f] } \sim \text{normal}(0, 1), \] where the Monte Carlo standard error, \(\text{MC-SE}_{N}[f]\), is defined by \[ \text{MC-SE}_{N}[f] = \sqrt{ \frac{ \text{Var}_{\pi}[f]}{N} }. \]
Critically the relative difference between the exact probability distribution of the standardized estimators and the distribution specified by the unit normal probability density function is inversely proportional to the ensemble size and hence rapidly decays with increasing \(N\). Once we’ve generated a reasonably large ensemble, at least ten samples or so, the difference becomes negligible and for each fixed ensemble size we have \[ \frac{ \hat{f}_{N}^{\text{MC}} - \mathbb{E}_{\pi}[f] } { \text{MC-SE}_{N}[f] } \sim \text{normal}(0, 1), \] or upon rearrangement, \[ \hat{f}_{N}^{\text{MC}} \sim \text{normal}(\mathbb{E}_{\pi}[f], \text{MC-SE}_{N}[f]). \] In other words once the central limit theorem becomes an accurate approximation we can use the Monte Carlo standard error to quantify the total error of each Monte Carlo estimator for a given \(N\). While asymptotic consistency tells us the limiting distribution of the Monte Carlo estimators, the central limit theorem tells us how the estimator distribution converges to that limit for reasonably large but finite ensembles.
Technically the error quantification provided by the central limit theorem requires the exact variance, \(\text{Var}_{\pi}[f]\), in order to evaluate the Monte Carlo standard error. In practice we won’t know the variance exactly but if \(f^{2}\) is square-integrable, meaning that \(\mathbb{E}_{\pi}[f]\), \(\mathbb{E}_{\pi}[f^{2}]\), and \(\mathbb{E}_{\pi}[f^{4}]\) exist and are finite, then we can approximate the variance with another Monte Carlo estimator \[ \text{Var}_{\pi}[f] \approx \widehat{\text{Var}}_{\pi}[f], \] to give the empirical Monte Carlo standard error, \[ \widehat{\text{MC-SE}}_{N}[f] = \sqrt{ \frac{ \widehat{\text{Var}}_{\pi}[f]}{N} }. \] Conveniently the error introduced by this approximation scales inversely with the ensemble size and so is negligible exactly when the errors in assuming exact normality are negligible.
One nice feature of the exact and empirical Monte Carlo standard errors is their fixed scaling with the ensemble size, \(N\). In particular this allows us to immediately determine how many exact samples we need to achieve a given error, or how many more samples we need to generate in order to improve the error of an existing estimator by a certain amount.
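To ground these formulas, here is a minimal sketch in R under assumed specifics: a one-dimensional unit normal target from which we can generate exact samples and the function \(f(q) = q^{2}\), whose exact expectation value is one.
# Monte Carlo estimator and empirical Monte Carlo standard error for a
# one-dimensional unit normal target and the function f(q) = q^2, whose
# exact expectation value is one
set.seed(8675309)
N <- 1000
samples <- rnorm(N, 0, 1)
fs <- samples**2
f_hat <- mean(fs)           # Monte Carlo estimator
mc_se <- sqrt(var(fs) / N)  # Empirical Monte Carlo standard error
f_hat
mc_se
# The standardized error should, with high probability, be of order one
(f_hat - 1) / mc_se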
Besides the difficulty of generating exact samples in practice, the main limitation of the Monte Carlo method is that the error quantification is probabilistic. While most samples will yield Monte Carlo estimators with error within a few \(\text{MC-SE}_{N}[f]\) of the exact value, some samples will inevitably suffer from much larger errors and we can never quite be sure when we’ll get unlucky.
Markov chain Monte Carlo applies the Monte Carlo method to the correlated ensembles generated by Markov chains. Markov chains define discrete paths through the ambient space that with careful engineering will explore the typical set of a specified target distribution.
Let’s say that we want to evaluate expectation values with respect to a target distribution \(\pi\) defined on the ambient space \(Q\) equipped with the \(\sigma\)-algebra \(\mathcal{Q}\).
In order to explore a target probability distribution we first have to be able to move through the ambient space. One natural way to define jumps across \(Q\) is to consider the product space, \(Q \times Q\), where the first component defines the points at which we can start and the second component defines the points to which we can jump. We can then introduce a distribution of possible jumps by specifying a probability distribution over the second \(Q\) for each possible initial point in the first \(Q\).
This collection of probability distributions, however, is exactly a conditional probability distribution over the product space, \[ \begin{alignat*}{6} T :\; &Q \times \mathcal{Q}& &\rightarrow& \; &[0, 1]& \\ &(q, B)& &\mapsto& &T [B \mid q]&, \end{alignat*} \] which we denote a Markov transition distribution. Markov refers to the fact that each transition distribution depends only on a single point in the ambient space, as opposed to depending on multiple points. In practice we typically specify a Markov transition with a conditional probability density function within a given parameterization of the ambient space, \[ \begin{alignat*}{6} T :\; &Q \times Q& &\rightarrow& \; &\mathbb{R}^{+}& \\ &(q, q')& &\mapsto& &T(q' \mid q)&. \end{alignat*} \]
Conditioning the Markov transition on a particular point \(q_{0} \in Q\) defines a probability distribution over \(Q\), \(T [\cdot \mid q_{0}]\), which itself defines the propensity for jumps to fall into each neighborhood. For example we are more likely to jump into a neighborhood \(B_{1}\) with higher transition probability than a neighborhood \(B_{2}\) with lower transition probability, \[ T [B_{1} \mid q_{0}] > T [B_{2} \mid q_{0}]. \] The corresponding probability density function provides a convenient visualization of which neighborhoods are endowed with higher transition probabilities given a particular \(q_{0}\).
An exact sample from the probability distribution \(T [\cdot \mid q_{0}]\) realizes an explicit jump, or transition, to a new point in the ambient space, \[ \tilde{q}_{1} \sim T(q_{1} \mid q_{0}). \] Because the probability distribution from which we sampled depends on the initial point \(q_{0}\), this new point will in general be correlated with that initial point.
Provided that we can readily generate exact samples from the transition distribution, the implementation of these Markov transitions is straightforward in a programming environment like R. Let’s consider a two-dimensional ambient space \(Q = \mathbb{R}^{2}\) with the coordinate functions \[
\begin{alignat*}{6}
\varpi_{1} :\; &Q& &\rightarrow& \; &\mathbb{R}&
\\
&q& &\mapsto& &q^{1}&,
\end{alignat*}
\] and \[
\begin{alignat*}{6}
\varpi_{2} :\; &Q& &\rightarrow& \; &\mathbb{R}&
\\
&q& &\mapsto& &q^{2}&,
\end{alignat*}
\] that project each point onto the two component axes; note that I am using superscripts to define the components to avoid any confusion with the subscripts that index points in an ensemble of samples. On this space we will define a Markov transition through a product of conditional probability density functions, \[
\begin{align*}
T(q_{1} \mid q_{0} )
&=
T(q^{1}_{1}, q^{2}_{1} \mid q^{1}_{0}, q^{2}_{0} )
\\
&=
T(q^{1}_{1} \mid q^{1}_{0}) \cdot
T(q^{2}_{1} \mid q^{2}_{0})
\\
&=
\text{normal}(q^{1}_{1} \mid q^{1}_{0}, \sigma) \cdot
\text{normal}(q^{2}_{1} \mid q^{2}_{0}, \sigma).
\end{align*}
\]
# When working with Markov chains we always want to set an explicit seed
# to ensure that we can reproduce any weird behavior that might arise
set.seed(295148748)
# Initial point
D <- 2
q0 <- c(0, 0)
# Some possible transitions from the initial point
n_possible <- 50
possible_transitions <- matrix(data = 0, nrow = n_possible, ncol = D)
for (c in 1:n_possible) {
possible_transitions[c, 1] <- rnorm(1, q0[1], 1)
possible_transitions[c, 2] <- rnorm(1, q0[2], 1)
}
c_light <- c("#DCBCBC")
c_light_highlight <- c("#C79999")
c_mid <- c("#B97C7C")
c_mid_highlight <- c("#A25050")
c_dark <- c("#8F2727")
c_dark_highlight <- c("#7C0000")
c_light_trans <- c("#DCBCBC30")
c_dark_trans <- c("#8F272730")
plot(possible_transitions[,1], possible_transitions[,2],
main="Possible Transitions From Initial Point",
col="green", pch=16, cex=1.2,
xlab="q^1", xlim=c(-3, 3), ylab="q^2", ylim=c(-3, 3))
points(q0[1], q0[2], col=c_dark, pch=16, cex=1.2)
points(q0[1], q0[2], col=c_light, pch=16, cex=0.8)
Each sample defines possible transitions from the initial point, but we will consider only one realized transition, \(\tilde{q}_{1}\).
q1 <- c(0, 0)
q1[1] <- rnorm(1, q0[1], 1)
q1[2] <- rnorm(1, q0[2], 1)
plot(possible_transitions[,1], possible_transitions[,2],
main="Realized Transition From Initial Point",
col="green", pch=16, cex=1.2,
xlab="q^1", xlim=c(-3, 3), ylab="q^2", ylim=c(-3, 3))
points(q0[1], q0[2], col=c_dark, pch=16, cex=1.2)
points(q0[1], q0[2], col=c_light, pch=16, cex=0.8)
lines(c(q0[1], q1[1]), c(q0[2], q1[2]), col=c_dark, lwd=2)
points(c(q0[1], q1[1]), c(q0[2], q1[2]), col=c_dark, pch=16, cex=1.2)
points(c(q0[1], q1[1]), c(q0[2], q1[2]), col=c_light, pch=16, cex=0.8)
Once we have jumped to a new point we can define a second transition by conditioning the Markov transition distribution on \(\tilde{q}_{1}\) instead of \(q_{0}\). Sampling from the resulting probability distribution defines another point, \[ \tilde{q}_{2} \sim T(q_{2} \mid \tilde{q}_{1}). \]
possible_transitions <- matrix(data = 0, nrow = n_possible, ncol = D)
for (c in 1:n_possible) {
possible_transitions[c, 1] <- rnorm(1, q1[1], 1)
possible_transitions[c, 2] <- rnorm(1, q1[2], 1)
}
plot(possible_transitions[,1], possible_transitions[,2],
main="Possible Transitions From First Transition",
col="green", pch=16, cex=1.2,
xlab="q^1", xlim=c(-3, 3), ylab="q^2", ylim=c(-3, 3))
lines(c(q0[1], q1[1]), c(q0[2], q1[2]), col=c_dark, lwd=2)
points(c(q0[1], q1[1]), c(q0[2], q1[2]), col=c_dark, pch=16, cex=1.2)
points(c(q0[1], q1[1]), c(q0[2], q1[2]), col=c_light, pch=16, cex=0.8)
q2 <- c(0, 0)
q2[1] <- rnorm(1, q1[1], 1)
q2[2] <- rnorm(1, q1[2], 1)
plot(possible_transitions[,1], possible_transitions[,2],
main="Realized Transition From First Transition",
col="green", pch=16, cex=1.2,
xlab="q^1", xlim=c(-3, 3), ylab="q^2", ylim=c(-3, 3))
lines(c(q0[1], q1[1], q2[1]), c(q0[2], q1[2], q2[2]), col=c_dark, lwd=2)
points(c(q0[1], q1[1], q2[1]), c(q0[2], q1[2], q2[2]), col=c_dark, pch=16, cex=1.2)
points(c(q0[1], q1[1], q2[1]), c(q0[2], q1[2], q2[2]), col=c_light, pch=16, cex=0.8)
Iterating Markov transitions, \[ \begin{align*} \tilde{q}_{1} &\sim T(q_{1} \mid q_{0}) \\ \tilde{q}_{2} &\sim T(q_{2} \mid \tilde{q}_{1}) \\ \tilde{q}_{3} &\sim T(q_{3} \mid \tilde{q}_{2}) \\ \ldots \\ \tilde{q}_{N} &\sim T(q_{N} \mid \tilde{q}_{N - 1}), \end{align*} \] generates a sequence of points in the ambient space that we call a Markov chain.
Because of this sequential sampling, each point \(q_{n}\) in the sequence is conditionally dependent on only the previous point, \(q_{n - 1}\): if we fix \(q_{n - 1}\) then \(q_{n}\) will be independent of all previous points. If \(q_{n - 1}\) itself is sampled from \(q_{n - 2}\), however, then \(q_{n}\) will in general assume a correlation with \(q_{n - 2}\) as well. Because we are sampling all of the points after the initial one, the Markov chain will define a correlated ensemble of samples.
The correlations spanning a Markov chain are evident if we compare multiple Markov chains generated from the same initial point. The local correlations between neighboring points give each realized Markov chain a stiffness that constrains the relationship between even distant points.
# Initial point
q0.1 <- 0
q0.2 <- 0
# Let's generate ten chains, each comprised of ten transitions from the initial point
n_transitions <- 10
n_chains <- 10
plot(1, type="n", main="Markov Chain Realizations",
xlab="q^1", xlim=c(-7, 7),
ylab="q^2", ylim=c(-7, 7))
for (c in 1:(n_chains - 1)) {
mcmc_samples <- matrix(data = 0, nrow = n_transitions + 1, ncol = D)
mcmc_samples[1, 1] <- q0.1
mcmc_samples[1, 2] <- q0.2
for (n in 1:n_transitions) {
mcmc_samples[n + 1, 1] <- rnorm(1, mcmc_samples[n , 1], 1)
mcmc_samples[n + 1, 2] <- rnorm(1, mcmc_samples[n , 2], 1)
}
lines(mcmc_samples[,1], mcmc_samples[,2], col=c_dark_trans, lwd=2)
points(mcmc_samples[,1], mcmc_samples[,2], col=c_dark_trans, pch=16, cex=1.2)
points(mcmc_samples[,1], mcmc_samples[,2], col=c_light_trans, pch=16, cex=0.8)
}
# Realized chain
mcmc_samples <- matrix(data = 0, nrow = n_transitions + 1, ncol = D)
mcmc_samples[1, 1] <- q0.1
mcmc_samples[1, 2] <- q0.2
for (n in 1:n_transitions) {
mcmc_samples[n + 1, 1] <- rnorm(1, mcmc_samples[n , 1], 1)
mcmc_samples[n + 1, 2] <- rnorm(1, mcmc_samples[n , 2], 1)
}
lines(mcmc_samples[,1], mcmc_samples[,2], col=c_dark, lwd=2)
points(mcmc_samples[,1], mcmc_samples[,2], col=c_dark, pch=16, cex=1.2)
points(mcmc_samples[,1], mcmc_samples[,2], col=c_light, pch=16, cex=0.8)
Although the Markov chains realized from an arbitrary Markov transition distribution may seem to aimlessly wander through the ambient space, they are in fact looking for something particular. A probability distribution that is preserved under Markov transitions, \[ \pi(q) = \int \mathrm{d} q' \, \pi(q') \, T(q \mid q'), \] is known as the stationary or invariant distribution of the Markov transition distribution. Well-behaved Markov transition distributions admit a unique stationary distribution, and it is within the stationary typical set that the Markov chains are most likely to wander.
No matter where we are in the ambient space the typical set of each Markov transition distribution will concentrate towards the stationary typical set.
Consequently realized transitions will jump towards the typical set with exceedingly high probability. After jumping the circumstance repeats, with the following transition distribution once again concentrating towards the typical set.
As we realize more and more Markov transitions the resulting Markov chain will drift towards the typical set of the stationary distribution and, once it has reached the typical set, the transitions will guide the Markov chain through it. In other words the Markov transitions serve as a probabilistic bloodhound, sniffing around each point for subtle hints of probability and following a trail that guides us towards the stationary typical set. All we have to do is follow.
As our chains grow longer they fill up the stationary typical set. Eventually the newest points in the Markov chain start looking suspiciously similar to an ensemble of samples from the stationary distribution.
If we can engineer a Markov transition whose stationary distribution is our target distribution then we might even be able to use those points to construct expectation value estimators just as we did with Monte Carlo.
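To make this concrete, consider a minimal sketch in R of a Markov transition engineered to preserve a chosen target. Here I assume a hypothetical first-order autoregressive transition, \( T(q' \mid q) = \text{normal}(q' \mid \phi \, q, \sqrt{1 - \phi^{2}}) \), which is not the random walk transition above but is constructed so that a unit normal distribution is left invariant.
# A minimal sketch of a Markov transition engineered to preserve a chosen
# stationary distribution; the autoregressive transition
#   T(q' | q) = normal(q' | phi * q, sqrt(1 - phi^2))
# leaves a unit normal distribution invariant
set.seed(5849586)
phi <- 0.9
n_transitions <- 10000
q <- rep(0, n_transitions + 1)
q[1] <- -10  # Initialize far outside of the stationary typical set
for (n in 1:n_transitions) {
  q[n + 1] <- rnorm(1, phi * q[n], sqrt(1 - phi**2))
}
# After an initial transient the states resemble samples from the unit
# normal stationary distribution
mean(q[1001:(n_transitions + 1)])
sd(q[1001:(n_transitions + 1)])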
In order to formalize the intuition that Markov chains explore the stationary distribution we have to construct intermediate probability distributions that quantify the possible transitions we could realize at each iteration of the Markov chain.
First let’s consider initializing our Markov chain with a sample from some initial probability distribution, \[ \tilde{q}_{0} \sim \rho. \] We can always set this initial distribution to a Dirac distribution concentrated at a specific point, \[ \rho = \delta_{q_{0}}, \] to recover the fixed initialization that we used above.
To quantify where we might be after one transition we have to convolve this initial distribution with the Markov transition, \[ (T \rho)[B] = \mathbb{E}_{\rho} [ T[B \mid q_{0}] ], \] or in terms of probability density functions, \[ (T \rho)(q_{1}) = \int \mathrm{d} q_{0} \, T(q_{1} \mid q_{0}) \, \rho( q_{0} ). \] We can then quantify where we might be after two transitions by convolving this convolution with the Markov transition again, \[ (T^{2} \rho)[B] = \mathbb{E}_{T \rho} [ T[B \mid q_{1}] ], \] or in terms of probability density functions, \[ \begin{align*} (T^{2} \rho)(q_{2}) &= ( T \cdot T \rho )(q_{2}) \\ &= \int \mathrm{d} q_{1} \, T(q_{2} \mid q_{1}) \int \mathrm{d} q_{0} \, T(q_{1} \mid q_{0}) \, \rho(q_{0}) \\ &= \int \mathrm{d} q_{1} \, \mathrm{d} q_{0} \, T(q_{2} \mid q_{1}) \, T(q_{1} \mid q_{0}) \, \rho(q_{0}). \end{align*} \]
Iterating this convolution along with each transition we construct the \(N\)-step distributions, \[ (T^{N} \rho)(q_{N}) = ( T \cdot T^{N - 1} \rho ) (q_{N}) = ( \underbrace{T \cdot \ldots \cdot T}_{N \text{ times}} \rho ) (q_{N}) \] which quantify the possible points our Markov chain could reach after \(N\) transitions.
Equivalently the initial distribution and the Markov transition distribution can be used to construct a probability distribution over the product space, \(Q^{N}\), whose \(n\)th component captures the state of the Markov chain after \(n\) transitions. From this perspective the \(n\)-step distributions are just the pushforward of the product distribution along the \(n\)th component projection function.
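We can also approximate the \(N\)-step distributions empirically by running many independent Markov chains from a common initialization and histogramming their states after \(N\) transitions. This sketch reuses the hypothetical autoregressive transition from above; the histograms are only crude stand-ins for the exact \(N\)-step probability density functions.
# Empirically approximating the N-step distributions by running many
# independent chains from a common point initialization and collecting
# the states after N transitions
set.seed(48383839)
phi <- 0.9
n_chains <- 10000
q0 <- -10
n_step_states <- function(N) {
  q <- rep(q0, n_chains)
  for (n in 1:N) {
    q <- rnorm(n_chains, phi * q, sqrt(1 - phi**2))
  }
  q
}
# The empirical N-step distributions drift towards the stationary unit normal
hist(n_step_states(1), breaks=50, main="1-Step Distribution", xlab="q")
hist(n_step_states(5), breaks=50, main="5-Step Distribution", xlab="q")
hist(n_step_states(50), breaks=50, main="50-Step Distribution", xlab="q")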
We can formalize the concept of Markov chain convergence as the convergence of the \(N\)-step distributions themselves. In particular we can ask to what probability distribution, \(\omega\), the \(N\)-step distributions converge, \[ \lim_{N \rightarrow \infty} T^{N} \rho = \omega. \] Equivalently, we can consider the convergence of probability density function representations of the \(N\)-step distributions.
The \(N\)-step distribution won’t always have a well-defined limit, but if a limit exists then it has to be the stationary distribution, \[ \lim_{N \rightarrow \infty} T^{N} \rho = \pi, \] defined by \[ T \pi = \pi. \]
The technical conditions under which a limit exists, and the \(N\)-step distribution converges to the stationary distribution, are subtle. A critical aspect of developing new Markov chain Monte Carlo implementations is being able to derive those conditions and how they might be obstructed in practice.
A well-defined limiting distribution is necessary for well-behaved Markov chains, but the practical utility of a given Markov transition really depends on how quickly the \(N\)-step distributions converge to that limiting distribution.
In order to quantify how the \(N\)-step distributions converge we need some notion of distance between probability distributions in order to compare them to the limiting stationary distribution as a function of the number of iterations, \(N\). Given a distance function \(\vert\vert \cdot \vert\vert\) we say that the \(N\)-step distributions converge towards a limiting probability distribution if we can always find a sufficiently large \(N\) such that the distance between all of the subsequent \(N\)-step distributions and that limiting distribution is smaller than any given value. More formally, for any arbitrarily small distance \(\epsilon \in \mathbb{R}^{+}\) there has to be some finite integer that can possibly depend on the initial distribution, \(N(\rho)\), such that \[ \vert\vert T^{n} \rho - \pi \vert\vert \le \epsilon \] for all \(n > N(\rho)\).
In practice we aim for a more constructive result by bounding these distances with a monotonically decreasing function of \(N\), \[ \vert\vert T^{N} \rho - \pi \vert \vert \le f(\rho, N), \] which implies the formal definition of convergence but allows us to quantify the rate of convergence using the form of the bound, \(f(\rho, N)\).
One of the most common definitions of distance between probability distributions is the total variation distance. There are multiple equivalent definitions of the total variation distance, each with their own intuitions, but my favorite is as the largest difference of probabilities allocated across the ambient \(\sigma\)-algebra, \[ \vert\vert \rho - \pi \vert\vert_{\mathrm{TV}} = \mathrm{sup}_{B \in \mathcal{Q}} \left| \rho[B] - \pi[B] \right|. \] Here \(\mathrm{sup}\) refers to the supremum, which is just a pedantic least upper bound.
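For probability distributions over a finite space the supremum is straightforward to evaluate: the total variation distance reduces to half of the sum of the absolute differences between the component probabilities. Here is a quick sketch with made-up probability vectors.
# Total variation distance between two distributions over a finite space,
# where the supremum over subsets reduces to half of the sum of the absolute
# differences between the component probabilities
total_variation <- function(rho, target) {
  0.5 * sum(abs(rho - target))
}
rho <- c(0.7, 0.2, 0.1)
target <- c(0.3, 0.4, 0.3)
total_variation(rho, target)  # 0.4, achieved by the subset containing only the first element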
If the decay of the total variation distance is bounded above by a polynomially decaying function of \(N\), \[ \vert\vert T^{N} \rho - \pi \vert\vert_{\mathrm{TV}} \le C(\rho) (N + 1)^{-\beta}, \] then we say that the Markov transition has polynomial ergodicity. Polynomial ergodicity is a relatively weak condition due to the slow decay of the bound, but in some circumstances it is enough to ensure well-behaved Markov chains.
When the decay of the total variation distance is bounded above by a geometrically decaying function of \(N\), \[ \vert\vert T^{N} \rho - \pi \vert\vert_{\mathrm{TV}} \le C(\rho) r^{N}, \] for some \(r < 1\), we say that the Markov transition has geometric ergodicity. These geometric bounds decay exponentially fast with the number of iterations, which places a much stronger constraint on the convergence of the \(N\)-step distributions.
Stronger still is uniform ergodicity where the constant in the geometric bound does not depend on the initial distribution, \[ \vert\vert T^{N} \rho - \pi \vert \vert_{\mathrm{TV}} \le C r^{N}. \] Uniform ergodicity guarantees rapid convergence no matter where we initialize our Markov chains, but results this strong are typically limited to ambient spaces with finite boundaries.
Sufficiently strong ergodicity bounds ensure that as \(N\) increases the \(N\)-step distributions become more and more similar to each other and the stationary distribution. This increasing uniformity prevents transient behaviors that might hinder the effective exploration of Markov chains after only a finite number of iterations.
Another way to analyze the convergence of the \(N\)-step distributions is through the Markov transition itself. To motivate this approach let’s consider a discrete space with only \(I\) elements where we can analyze the Markov transition using linear algebra. In this case a conditional probability distribution like a Markov transition distribution can be represented with a square matrix whose columns sum to one. Similarly probability distributions can be represented with vectors whose elements sum to one.
The Markov transition matrix admits the \(I\) eigenvalues, \(\left\{\lambda_{1}, \ldots, \lambda_{I} \right\}\), and the \(I\) eigenvectors, \(\left\{v_{1}, \ldots, v_{I} \right\}\), that satisfy \[ T \, v_{i} = \lambda_{i} \, v_{i}. \] The collection of eigenvalues is also known as the eigenspectrum of the Markov transition.
Because of the summation constraint across the columns the magnitudes of the eigenvalues are bounded by one, \[ -1 \le \lambda_{i} \le 1, \] with at least one of the eigenvalues saturating the upper bound. Without loss of generality let’s define our labels such that the smallest indices correspond to the eigenvalues with the largest magnitudes, \[ | \lambda_{1} | \ge | \lambda_{2} | \ge \ldots \ge | \lambda_{I - 1} | \ge | \lambda_{I} |. \]
Let’s now make the assumption that our Markov transition matrix is not degenerate so that all of the eigenvalues are distinct and the eigenvectors form a complete basis for our ambient space. In this case one and only one eigenvalue saturates the upper bound, \[ \lambda_{1} = 1, \, | \lambda_{i \ne 1} | < 1, \] and the corresponding eigenvector \(v_{1}\) specifies the unique stationary distribution of the transition, \[ T \, v_{1} = \lambda_{1} \, v_{1} = v_{1}. \]
If we expand an initial distribution in the eigenvectors of the Markov transition matrix, \[ \rho = \sum_{i = 1}^{I} \alpha_{i} \, v_{i} \] then the \(1\)-step distribution is given by a single matrix multiplication, \[ \begin{align*} T \rho &= T \cdot \sum_{i = 1}^{I} \alpha_{i} \, v_{i} \\ &= \sum_{i = 1}^{I} \alpha_{i} \, T v_{i} \\ &= \sum_{i = 1}^{I} \alpha_{i} \, \lambda_{i} \, v_{i}, \end{align*} \] and the \(N\)-step distribution is given by \(N\) matrix multiplications, \[ T^{N} \rho = \sum_{i = 1}^{I} \alpha_{i} \, \lambda_{i}^{N} \, v_{i}. \]
At this point let’s separate out the stationary distribution from the other eigenvectors, \[ \begin{align*} T^{N} \rho &= \sum_{i = 1}^{I} \alpha_{i} \, \lambda_{i}^{N} \, v_{i} \\ &= \alpha_{1} \, \lambda_{1}^{N} \, v_{1} + \sum_{i = 2}^{I} \alpha_{i} \, \lambda_{i}^{N} \, v_{i} \\ &= \alpha_{1} \, v_{1} + \sum_{i = 2}^{I} \alpha_{i} \, \lambda_{i}^{N} \, v_{i}. \end{align*} \] Because the magnitudes of all of the eigenvalues in the second term are smaller than one, the powers \(\lambda_{i}^{N}\) quickly vanish and suppress the parts of the initial distribution that aren’t aligned with the stationary distribution. So long as the initial distribution has some overlap with the stationary distribution, \[ \alpha_{1} > 0, \] the \(N\)-step distributions will converge to the stationary distribution exponentially fast with a rate proportional to the absolute spectral gap, \[ \delta = \lambda_{1} - | \lambda_{2} | = 1 - | \lambda_{2} |. \]
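These spectral manipulations are easy to reproduce numerically. The following sketch builds a hypothetical three-state Markov transition matrix, extracts its eigenspectrum, stationary distribution, and absolute spectral gap, and then verifies that the \(N\)-step distributions converge to the stationary distribution.
# A minimal sketch of the spectral perspective for a hypothetical three-state
# Markov transition; each column of the transition matrix holds the transition
# probabilities out of one state and hence sums to one (the name T_mat avoids
# clashing with R's shorthand for TRUE)
T_mat <- matrix(c(0.90, 0.05, 0.05,
                  0.10, 0.80, 0.10,
                  0.05, 0.15, 0.80), nrow=3)
decomp <- eigen(T_mat)
# Eigenvalues are returned in order of decreasing magnitude; the largest is one
decomp$values
# The eigenvector with unit eigenvalue, normalized to sum to one, represents
# the stationary distribution
stationary <- decomp$vectors[, 1] / sum(decomp$vectors[, 1])
stationary
# The absolute spectral gap controls how quickly the N-step distributions converge
1 - abs(decomp$values[2])
# N-step distributions from a point initialization rapidly approach the
# stationary distribution
rho <- c(1, 0, 0)
for (n in 1:50) rho <- T_mat %*% rho
cbind(rho, stationary)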
With a significant amount of careful mathematical machinery these ideas can be generalized from finite spaces to the infinite spaces with which we typically work in practice. The resulting spectral theory treats a Markov transition as an operator that consumes the \((N - 1)\)-step distribution and returns the \(N\)-step distribution, working out the infinite-dimensional eigenspectra and absolute spectral gaps of these operators. These then provide the foundation for formal methods of studying how the \(N\)-step distributions converge.
As we apply our tailored Markov transition to an initial point we generate a Markov chain that finds and then explores the typical set of the stationary distribution. If the Markov chain is long enough then these points might provide a sufficiently thorough quantification of the stationary typical set to serve as the basis for an accurate stochastic estimator, just as exact samples provide the basis for Monte Carlo estimators.
The reality is, unfortunately, complicated. While we can construct Markov chain Monte Carlo estimators in this way, they will converge to the exact expectation values only under certain conditions. Even then that convergence often leaves much to be desired.
Given a Markov chain with points \(\{\tilde{q}_{0}, \ldots, \tilde{q}_{N} \}\) we can construct a Markov chain Monte Carlo estimator in the same way that we construct a Monte Carlo estimator, \[ \hat{f}^{\text{MCMC}}_{N} = \frac{1}{N + 1} \sum_{n = 0}^{N} f(\tilde{q}_{n}). \] The question is then how the pushforward distribution of these estimators behaves as the chains grow to be infinitely long. In particular, are they at least asymptotically consistent, \[ \lim_{N \rightarrow \infty} \pi_{f^{\text{MCMC}}_{N}} = \delta_{\mathbb{E}_{\pi}[f]}? \]
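Before tackling that question in general, here is a minimal sketch of what such an estimator looks like for the well-behaved autoregressive transition assumed above, tracking the running Markov chain Monte Carlo estimator of the identity function \(f(q) = q\), whose exact stationary expectation value is zero.
# Running Markov chain Monte Carlo estimator along a single realized chain,
# again using the hypothetical autoregressive transition with a unit normal
# stationary distribution; the exact expectation value of f(q) = q is zero
set.seed(38585838)
phi <- 0.9
N <- 5000
q <- rep(0, N + 1)
q[1] <- -10  # Deliberately poor initialization
for (n in 1:N) {
  q[n + 1] <- rnorm(1, phi * q[n], sqrt(1 - phi**2))
}
running_est <- cumsum(q) / (1:(N + 1))
plot(0:N, running_est, type="l", col=c_dark, lwd=2,
     xlab="Iteration", ylab="Running MCMC Estimator of E[q]")
abline(h=0, col="grey", lty=2)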
Unlike Monte Carlo estimators, the consistency of Markov chain Monte Carlo estimators is not universal. There are a few circumstances that can block a Markov chain from exploring the entirety of the stationary distribution and recovering asymptotically consistent estimators. To ensure good limiting behavior we need to avoid two particular behaviors.
A Markov transition is reducible if Markov chains initialized within some neighborhood are not able to escape that neighborhood and hence do not have the opportunity to explore the entirety of the ambient space. In this case the ambient space might decompose into a number of partitions, with Markov chains restricted to the partition in which they’re initialized; the resulting chains are able to explore the extent of their initial partition, but they’re not able to explore any of the other partitions. For example consider a Markov transition over the real numbers that jumps only to points of the same sign – a Markov chain initialized at a positive number will only ever be able to jump to other positive numbers, completely ignoring all of the negative numbers.
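To make the sign example concrete, here is a toy sketch, not any standard algorithm, of a reducible Markov transition whose realized chains can never change sign and hence can never explore half of the ambient space.
# A toy reducible Markov transition over the real line: proposals are
# reflected so that they always keep the sign of the current point
set.seed(9958284)
sign_preserving_transition <- function(q) {
  sign(q) * abs(rnorm(1, q, 1))
}
N <- 1000
q <- rep(0, N + 1)
q[1] <- 1  # A positive initialization traps the chain in the positive reals
for (n in 1:N) {
  q[n + 1] <- sign_preserving_transition(q[n])
}
min(q)  # Strictly positive; the negative half line is never explored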
To ensure that Markov chains can explore anywhere in the ambient space as needed we have to engineer a Markov transition that is irreducible and will be able to jump into any sufficiently nice neighborhood with non-zero probability after only a finite number of transitions. One common way to ensure irreducibility is to specify a Markov transition with transition density functions that are nonzero across the entire ambient space.
Irreducibility ensures that Markov chains have the opportunity to explore any nice neighborhood in the ambient space at least once, but it offers no assurance that Markov chains will be able to explore those neighborhoods repeatedly. A periodic Markov transition is susceptible to cycles, neighborhoods that trap Markov chains and prevent them from exploring any other points afterwards. Even if a Markov transition is irreducible, giving its chains a chance to reach anywhere in the ambient space from their initialization, a chain can be absorbed into one of these cycles and suffer from insufficient exploration ever after.
A Markov transition is aperiodic if it is not vulnerable to any cycles. The typical way to avoid cycles in practice is to ensure that a Markov transition always assigns a nonzero probability to staying at the initial point.
Together irreducibility and aperiodicity of a Markov transition ensure that realized Markov chains are recurrent. Recurrent Markov chains from most initializations will explore any neighborhood of the ambient space not just a finite number of times but a countably infinite number of times as the chains are run infinitely long. Consequently these Markov chains will be able to find the typical set and then quantify it with increasingly refined resolution. This guarantees that the Markov chain Monte Carlo estimator for any integrable function will be asymptotically consistent, \[ \lim_{N \rightarrow \infty} \pi_{f^{\text{MCMC}}_{N}} = \delta_{\mathbb{E}_{\pi}[f]} \] for all but a finite number of initializations. In order to generalize this result to include all initial points we have to improve the recurrence implied by irreducibility and aperiodicity to a technical condition called Harris recurrence.
While these technical conditions might seem esoteric and irrelevant to the practice of Markov chain Monte Carlo estimation, they are in fact absolutely critical to its robust use. We can always evaluate Markov chain Monte Carlo estimators from a given Markov chain, but those estimates will not have any relation to the exact expectation values unless these conditions are met. Only by carefully validating these conditions can we have any confidence that the estimators might provide useful information about the expectation values of interest.
As a consumer of Markov chain Monte Carlo this means that one should use Markov transitions designed by a trusted source with the expertise to verify these conditions. At the same time those interested in designing their own Markov transitions should invest the time to understand these conditions enough to be able to verify them.
The asymptotic consistency of Markov chain Monte Carlo estimators is not guaranteed, but in practice it will hold for most well-behaved Markov transitions. Exactly how useful, however, is that asymptotic consistency?
Asymptotic consistency is relevant only in the limit of infinitely long Markov chains. Unfortunately infinity is really big. No, bigger than that. Keep going. On and on, forever. With finite computation we will always be bound to finite Markov chains and never reach that asymptotic limit where the estimators converge to the exact expectation values. Asymptotic convergence is a probabilistic happily ever after, and unfortunately just as much of a fairy tale.
Consequently what is more relevant for the practical utility of Markov chain Monte Carlo is not the behavior of Markov chain Monte Carlo estimators after an infinite number of iterations but rather their behavior after only a finite number of iterations. In other words the limiting behavior isn’t as important as how the estimators converge. While asymptotic consistency is typically required for good finite-iteration behavior, it is nowhere near sufficient. Estimators that converge asymptotically can exhibit atrocious behavior for any finite number of iterations.
The finite-iteration behavior of Markov chain Monte Carlo estimators depends on the intimate details of the Markov transition. Ensuring good finite-iteration behavior is made all the more challenging by how subtly pathologies often manifest, and hence how difficult it can be to identify them empirically. To demonstrate the intricacies of finite-iteration behavior let’s consider how Markov chain Monte Carlo estimators behave under ideal, and not so ideal, circumstances. Those interested in a more formal perspective can continue all the way through Section 3.2.3 where we will more formally define the convergence of Markov chain Monte Carlo estimators.
Under ideal circumstances Markov chains will quickly find and then explore the typical set of the target distribution with progressively finer resolution. This evolution proceeds in three distinct phases that we can demonstrate using our cartoon of a high-dimensional typical set.
On the left we have the cartoon of the stationary typical set and on the right we will see how the absolute error of a Markov chain Monte Carlo estimator evolves with the growing Markov chain.
Our initial point, whether fixed or drawn from some auxiliary probability distribution, will fall outside of the typical set unless that auxiliary probability distribution happens to be the stationary distribution itself. Consequently the first responsibility of the Markov chain is to locate the typical set. The Markov transition distributions outside of the typical set will be heavily skewed towards the typical set and any realized Markov chains will evolve almost as if they were attracted to all of that probability mass.
Because it takes time to reach the typical set, the beginning of the Markov chain will poorly inform Markov chain Monte Carlo estimators. Estimators constructed entirely from these early iterations will exhibit significant bias.
Once it finds the typical set and has tasted the sweet nectar of concentrating probability, the Markov chain will begin to explore the neighborhoods of high probability mass with an initial sojourn or tour. The points accumulated in this second phase very quickly resolve the gross structure of the stationary distribution, overwhelming the initial bias and rapidly improving the overall accuracy of Markov chain Monte Carlo estimators. This threshold behavior of estimators is a classic indicator that we’ve settled into the stationary typical set.
After that initial settling the Markov chain continues to refine its quantification of the stationary typical set. As it continues to explore it captures finer and finer details, and the corresponding Markov chain Monte Carlo estimators fall into a gradual but consistent convergence.
As we will see in Section 4, under these so far ill-defined “ideal” circumstances the convergence of the estimators in this final phase will reproduce the square root convergence of Monte Carlo estimators, allowing us to quantify the estimator error and its scaling with iteration.
Ultimately this ideal behavior is characterized by an initial transient phase where the Markov chain realizations find and then settle into the stationary typical set followed by an equilibrium phase where the Markov chains steadily survey that typical set to inform increasingly precise Markov chain Monte Carlo estimators. The consistency of the exploration in the equilibrium phase is what distinguishes it from the transient phase – once a Markov chain has equilibrated it loses any residue of where it started. At the same time any finite subsequence will behave identically to any other. This inability to tell where we are in the Markov chain is also known as stationarity.
Now that we’ve laid out the qualitative convergence behavior of Markov chain Monte Carlo estimators under ideal circumstances let’s consider what happens under less ideal, dare I say typical, circumstances. These adverse circumstances should manifest in transient behavior that persists throughout the entire Markov chain, indicating that an equilibrium doesn’t exist or we are yet to reach it.
There are countless deviations from ideal circumstances, but here we will consider only two: suspensions and metastabilities. In order to use Markov chain Monte Carlo responsibly we need to be able to identify the signs of these deviations empirically, so that we know when not to trust the accuracy of the resulting Markov chain Monte Carlo estimators.
Let’s consider a stationary distribution which exhibits a pinch at the bottom of its typical set.
Away from this pinch the shape of the stationary typical set is relatively uniform, but in the neighborhood of the pinch the geometry starts to change rapidly. This strong curvature frustrates the exploration of Markov chains and corrupts the corresponding Markov chain Monte Carlo estimators.
We can demonstrate this corruption using a similar setup as before. On the left we’ll observe the evolution of a Markov chain through the stationary typical set and on the right we’ll observe the evolution of a Markov chain Monte Carlo estimator of the expectation value of the vertical coordinate function, \(\varpi_{v}\). Note that, in contrast to the previous demonstration, here we will look at the signed error instead of the absolute error. In between these two we will add a new figure – the trace plot of the vertical coordinate that shows how that coordinate evolves along with the Markov chain.
Let’s presume that our Markov chain has already found the typical set so that it just needs to explore. When exploring away from the pinch the Markov chain will have no problem evolving through the smooth geometry. The trace plot exhibits the incremental exploration and the estimator error appears to converge towards an equilibrium value, albeit one biased above the exact value.
The Markov chain, however, can’t enjoy that smooth geometry forever. Eventually it will have to confront the pinch, and that’s where our troubles begin.
Most Markov transitions feature a length scale that characterizes how far the realized transitions jump on average. When that length scale is too large the Markov transition won’t be able to resolve the structure of the pinch. Most of the time the realized Markov chains will wander right past the pinch, ignoring it entirely and generating only incomplete exploration of the typical set. Incomplete exploration yields biased Markov chain Monte Carlo estimators and, if not resolved, that bias will obstruct asymptotic consistency. The Markov chains, however, prove to be rather stubborn in their quest for consistency.
When a Markov chain gets close to the pinch there is a very small probability that the blunt transitions will jump deeper into the pinch rather than passing by it. After passing by it enough times it will eventually take a detour into the pinch and once there the Markov chain will become entangled by the strong curvature and freeze, no longer able to continue on its way. The suspended Markov chain drags the Markov chain Monte Carlo estimator with it, inducing a crude correction to the initial positive bias! This correction, however, quickly overcompensates and forces a negative bias in the estimator.
Eventually transitions will jump back out of the pinch, allowing the Markov chain to escape and explore the less frustrating regions of the typical set once again. As it explores away from the pinch the estimators are pulled back up resulting in another positive, although slightly smaller, bias.
In other words this Markov chain Monte Carlo estimator converges to the exact expectation value only awkwardly. Most of the time the estimator exhibits a positive bias, and occasionally it exhibits a larger negative bias. Only when the Markov chain is run infinitely long will these biases cancel to give the exact value.
These transient suspensions indicate that we lack the equilibrium needed for Markov chain Monte Carlo estimators that are well-behaved after only a finite number of iterations. If we can empirically differentiate the behavior of Markov chains near and far from the pinch then we might be able to identify the lack of equilibrium, and poor behavior of the resulting estimators, using the history of the Markov chain itself.
Even if an equilibrium does exist, however, there is no guarantee that we’ll be able to run long enough to reach it. Typically Markov chains equilibrate relatively quickly, but stationary distributions that exhibit metastabilities prove to be far less accommodating.
Metastabilities occur when a stationary distribution exhibits multiple neighborhoods of high probability surrounded in all directions by neighborhoods of low probability. Because Markov chains are attracted to probability mass they avoid these moats of low probability that separate the metastabilities and instead focus their attention on the closest metastability. Eventually realized Markov chains will brave these low probability valleys and jump from one metastability to another in order to explore the entire stationary distribution, but in practice we might exhaust our computational resources long before that happens.
Probability distributions that exhibit metastabilities are also known as multimodal distributions, with each metastability corresponding to a particular mode of the representative probability density function. Here I will use the metastability nomenclature to focus on the structure of the probability distribution and not any particular probability density function representation.
To demonstrate the typical behavior of Markov chains in the presence of metastabilities let’s consider a similar setup as before. On the left we have the stationary typical set, in this case exhibiting disconnected surfaces corresponding to each of the two metastabilities. In the middle we’ll look at the trace plot of the vertical coordinate function, and on the right the expectation value of that vertical coordinate function.
Initially a realized Markov chain will be attracted to one of the metastabilities, exploring it as if it were the entirety of the typical set. The exploration and the corresponding Markov chain Monte Carlo estimators all behave as if the Markov chain has reached a comfortable equilibrium.
Unfortunately this is only a local equilibrium as it ignores the other metastability. If we wait long enough the Markov chain will overcome the allure of that local equilibrium and venture over to the other metastability. It is only at this time that we have any indication that our initial exploration was incomplete.
When exploring a stationary distribution with metastabilities the proper equilibrium is defined not by the exploration within a single metastability but rather the exploration between metastabilities. The initial exploration of a single metastability is only transient behavior masquerading as equilibrium behavior. The fact that local equilibria can mimic the full equilibrium makes metastability particularly hard to identify in practice.
While the full equilibrium might be reached after a finite number of iterations, that number might be far too large for Markov chain Monte Carlo to be feasible in practice.
In Section 2.4.2 we used the total variation distance to quantify the convergence of the \(N\)-step distributions towards the stationary distribution. We can also use the total variation distance, and distances like it, to quantify the convergence of expectation values with respect to the \(N\)-step distributions and, eventually, the convergence of Markov chain Monte Carlo estimators.
The total variation distance can be defined in terms of differences in probability allocations but it can also be defined in terms of the differences of certain expectation values. In particular let \(\mathcal{F}\) be the space of continuous functions with outputs in the bounded interval \([-1, 1]\), \[ -1 \le f(q) \le 1, \, \forall q \in Q. \] We can write the total variation distance in terms of the supremum of differences of expectation values over this space, \[ \vert\vert \pi_{1} - \pi_{2} \vert\vert_{\mathrm{TV}} = \frac{1}{2} \mathrm{sup}_{f \in \mathcal{F}} \left| \mathbb{E}_{\pi_{1}}[f] - \mathbb{E}_{\pi_{2}}[f] \right|. \]
Consequently geometric ergodicity of the \(N\)-step distributions \[ \vert\vert T^{N} \rho - \pi \vert\vert_{\mathrm{TV}} \le C(\rho) r^{N}, \] implies that the \(N\)-step expectation values of functions in \(\mathcal{F}\) also converge geometrically to the stationary expectation values, \[ \left| \mathbb{E}_{T^{N} \rho}[f] - \mathbb{E}_{\pi}[f] \right| \le 2 \, C(\rho) \, r^{N}, \, \forall f \in \mathcal{F}. \]
This then allows us to quantify the finite-iteration error of Markov chain Monte Carlo estimators corresponding to functions in \(\mathcal{F}\). For example the bias of one of these estimators is given by \[ \begin{align*} \left| \mathbb{E} \left[ \hat{f}_{N}^{\text{MCMC}} \right] - \mathbb{E}_{\pi}[f] \right| &= \left| \Big( \frac{1}{N + 1} \sum_{n = 0}^{N} \mathbb{E}_{T^{n} \rho}[f] \Big) - \mathbb{E}_{\pi}[f] \right| \\ &= \left| \frac{1}{N + 1} \sum_{n = 0}^{N} \mathbb{E}_{T^{n} \rho}[f] - \frac{1}{N + 1} \sum_{n = 0}^{N} \mathbb{E}_{\pi}[f] \right| \\ &= \frac{1}{N + 1} \left| \sum_{n = 0}^{N} \Big( \mathbb{E}_{T^{n} \rho}[f] - \mathbb{E}_{\pi}[f] \Big) \right| \\ &\le \frac{1}{N + 1} \sum_{n = 0}^{N} \Big| \mathbb{E}_{T^{n} \rho}[f] - \mathbb{E}_{\pi}[f] \Big| \\ &\le \frac{1}{N + 1} \sum_{n = 0}^{N} 2 \, C(\rho) \, r^{n} \\ &= \frac{ 2 \, C(\rho) }{N + 1} \sum_{n = 0}^{N} r^{n} \\ &= \frac{ 2 \, C(\rho) }{N + 1} \frac{ 1 - r^{N + 1} }{1 - r}. \end{align*} \] Because \(r < 1\) the geometric series stays bounded, and so the bias decays towards zero inversely with the length of the Markov chain whenever our Markov transition is geometrically ergodic.
Results like this, however, are limited to only the functions in \(\mathcal{F}\). Unfortunately this space of bounded functions isn’t expressive enough to capture the functions whose expectation values are of interest in typical applications. Consequently Markov chain Monte Carlo estimator convergence based on total variation bounds can have limited practical utility.
To do better we have to consider more general integral probability metrics that bound the difference in expectation values over larger classes of functions [9]. A general integral probability metric is defined by a generating class of real-valued functions, \(\mathcal{F}\), and the corresponding distance \[ \vert\vert \pi_{1} - \pi_{2} \vert \vert_{\mathrm{IPM}(\mathcal{F})} = C \, \mathrm{sup}_{f \in \mathcal{F}} \left| \mathbb{E}_{\pi_{1}}[f] - \mathbb{E}_{\pi_{2}}[f] \right|, \] for some constant \(C\).
For example if we can establish a drift function, \(V\), that quantifies how strongly a Markov transition contracts then we can extend the total variation integral probability metric to a \(V\)-norm integral probability metric whose generating class spans all of the functions bounded by \(V\) [10], \[ -V(q) \le f(q) \le V(q), \, \forall q \in Q. \] The stronger the contraction of a Markov transition, the larger the drift function we can establish, the larger the generating class, and the more useful the \(V\)-norm integral probability metric bounds will be.
In the statistical literature the Wasserstein-\(2\) distance, defined by a generating class of all square integrable functions with respect to a given measure, and the Wasserstein-\(1\) distance, defined by a generating class of all integrable functions with respect to a given measure, are becoming increasingly common.
Because they completely quantify the behavior of Markov chain Monte Carlo estimators after only a finite number of iterations, explicit bounds on the convergence of the \(N\)-step distributions to the stationary distribution with respect to a sufficiently rich integral probability metric are prized by theoretical statisticians. Unfortunately the practical utility of these bounds is often less satisfying.
Bounds that have been worked out in the literature, for example, are often limited to small generating classes that don’t capture the expectation values of practical interest. Moreover the bounds often apply to only simple Markov transition distributions that have little to do with the sophisticated Markov transition distributions of applied interest. For instance Wasserstein-\(2\) and Wasserstein-\(1\) bounds are typically limited to Markov transition distributions whose stationary distributions are specified with strongly log concave probability density functions that are rare outside of the exponential family. Finally the bounds themselves are often too loose to offer any practical constraint on convergence [11].
Some integral probability bounds, however, have other, more practical uses than directly constraining convergence. In particular while geometric ergodicity in the total variation distance is often weak as a direct constraint, it implies that Markov chain Monte Carlo estimators of the expectation values of well-behaved functions satisfy a central limit theorem. In other words the practical utility of establishing geometric ergodicity is not so much an explicit bound but rather a verification that we can probabilistically quantify the convergence of Markov chain Monte Carlo estimators just as we did for Monte Carlo estimators.
Monte Carlo methods are so exceptionally useful because Monte Carlo estimators satisfy a central limit theorem that allows us to quantify their errors. We have already qualitatively seen that Markov chain Monte Carlo estimators can be much more fragile than Monte Carlo estimators, but every now and then they can be useful, too. In particular under sufficiently ideal circumstances Markov chain Monte Carlo estimators satisfy their own central limit theorem. The conditions for that Markov chain Monte Carlo central limit theorem, however, are significantly more complex.
Let’s consider a square integrable function, \(f : Q \rightarrow \mathbb{R}\), and assume that our Markov transition distribution enjoys a central limit theorem. In this case the probability distribution of realized Markov chain Monte Carlo estimators from sufficiently long Markov chains is approximately specified by a Gaussian probability density function, \[ \hat{f}^{\text{MCMC}}_{N} \sim \text{normal}( \mathbb{E} [f], \text{MCMC-SE}[f] ), \] where the Markov chain Monte Carlo standard error is defined as \[ \begin{align*} \text{MCMC-SE}[f] &= \sqrt{ \frac{ \text{Var} [f] }{ \text{ESS}[f] } } \\ &= \sqrt{ \frac{ \text{Var} [f] }{ \lambda[f] \cdot N} }. \end{align*} \] The mysterious term \(\text{ESS}[f] = \lambda[f] \cdot N\) is called the effective sample size.
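As a rough illustration of these quantities, here is a sketch that estimates the effective sample size of the hypothetical autoregressive chain from above with a naive truncated autocorrelation sum and then assembles the corresponding Markov chain Monte Carlo standard error; the careful estimation of these terms is exactly what the rest of this section considers.
# A crude sketch of the quantities in the Markov chain Monte Carlo central
# limit theorem for the hypothetical autoregressive chain; the effective
# sample size is estimated here with a naive truncated autocorrelation sum,
# not a production-quality method
set.seed(77883838)
phi <- 0.9
N <- 10000
q <- rep(0, N)
q[1] <- rnorm(1, 0, 1)  # Exact sample from the stationary distribution
for (n in 2:N) {
  q[n] <- rnorm(1, phi * q[n - 1], sqrt(1 - phi**2))
}
# Empirical autocorrelations, truncated at the first negative estimate
acs <- acf(q, lag.max=100, plot=FALSE)$acf[-1]
L <- min(which(acs < 0), length(acs) + 1) - 1
ess <- N / (1 + 2 * sum(acs[seq_len(L)]))
# Markov chain Monte Carlo estimator and standard error for f(q) = q
f_hat <- mean(q)
mcmc_se <- sqrt(var(q) / ess)
c(f_hat, mcmc_se, ess)
# For this autoregressive transition the exact effective sample size is
# N * (1 - phi) / (1 + phi), here about 526
N * (1 - phi) / (1 + phi)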
Just as with Monte Carlo the existence of a central limit theorem allows us to quantify the error of Markov chain Monte Carlo estimators, and even control the errors by adjusting the length of a given Markov chain. This principled understanding of the estimator error elevates the method into the realm of robust probabilistic computation.
In this section we will carefully inspect each element of the Markov chain Monte Carlo central limit theorem in theory, and how it can be implemented in practice. We will begin by discussing the need for sufficiently long chains before examining the intimate details of the effective sample size. Finally we will take a more practical perspective and consider how we can approximate each of the terms in the central limit theorem and then attempt to verify its existence in the first place.
One of the more subtle aspects of Markov chains is that they never quite forget from where they came. Unless they’re seeded with an exact sample from the stationary distribution every Markov chain will contain some residual influence from points outside of the stationary typical set. With each iteration the newest point in a Markov chain looks more like an exact sample from the stationary distribution, but after only a finite number of iterations the points will always have some memory of from where they came.
When we construct Markov chain Monte Carlo estimators we average over the entire Markov chain, including not only the newest points but also the older points that are more strongly influenced by the initialization. Even under ideal circumstances this results in a small bias in Markov chain Monte Carlo estimators that decays inversely with the number of iterations. For a derivation of this bias see Section 3.3.
For sufficiently long chains this bias is negligible, but for short chains it cannot be ignored. One of the most important consequences of this bias arises when trying to construct Markov chain Monte Carlo estimators from many parallel Markov chains. Averaging over many chains with the same initialization might reduce the variance of the estimators but it can’t reduce the bias. If all of the chains are short then this bias will persist and dominate the error of the pooled estimator.
In order for a Markov chain Monte Carlo central limit theorem to be a reasonable approximation we need to run our Markov chains long enough for the memories of the initialization, and the residual bias in Markov chain Monte Carlo estimators, to become negligible. This forgetfulness starts to formalize the concept of equilibrium that we introduced in the previous section; the transient phase is characterized by a significant influence from the initialization whereas equilibrium is characterized by a vanishing influence. While the influence never exactly disappears, under ideal circumstances there is a threshold where the influence suddenly plummets that we can use to separate the regimes.
This heuristic perspective will be useful when we attempt to verify a central limit theorem in practice. While we don’t have many ways to quantify how much memory of its initialization a Markov chain retains, we can look for any transient behavior it manifests.
With the approximate nature of the central limit theorem in mind we can now move our attention to the effective sample size that defines the Markov chain Monte Carlo standard error. The effective sample size, commonly denoted \(\text{ESS}\) or \(N_{\text{eff}}\), is a critical component of the Markov chain Monte Carlo central limit theorem and may even come to stalk not only your dreams but also your waking hours. On a trip to Germany a few years ago I encountered not only the essBar cafe,
but also the Neff bakery.