My current introduction to Markov chain Monte Carlo uses the Metropolis-Hastings method to construct demonstrative Markov transitions. That coupling of Markov chain Monte Carlo in general and Metropolis-Hastings in particular, however, can confuse many of the important concepts. Here I present a variety of Markov transitions that are not derived from the Metropolis-Hastings method and have been carefully engineered to demonstrate specific aspects of Markov chain Monte Carlo.

The code presented in this case study is meant to be demonstrative. In particular it is designed for pedagogy and understanding rather than performance. I would not recommend applying these functions in complex analyses without first optimizing the code for better performance.

For more detail on the theoretical concepts here see for example the reviews [1] and [2].

1 Markov, Markov, Markov

First let's review the basics of Markov chain Monte Carlo, that is, using Markov chains to estimate expectation values with respect to certain probability distributions. To simplify the presentation we will restrict our consideration to real-valued spaces, \(X = \mathbb{R}^{D}\), so that all of the relevant probability distributions can be specified by probability density functions.

1.1 From Point A To Point B

A Markov transition distribution \(\tau\) is a conditional probability distribution defined over the product space \((x_{n}, x_{n + 1}) \in X \times X\). Given an initial state \(\tilde{x}_{n}\) a conditional sample \[ \tilde{x}_{n + 1} \sim \tau(x_{n + 1} \mid \tilde{x}_{n}) \] realizes a Markov transition that jumps from \(\tilde{x}_{n} \in X\) to some new state \(\tilde{x}_{n + 1} \in X\). In practice we will specify a Markov transition distribution with a Markov transition density function \[ \begin{alignat*}{6} \tau :\; &X \times X& &\rightarrow& \; &\mathbb{R}^{+}& \\ &(x_{n}, x_{n + 1})& &\mapsto& &\tau(x_{n + 1} \mid x_{n})&. \end{alignat*} \]

Consider for example the Markov transition distribution specified by the conditional density function \[ \tau(x_{n + 1} \mid x_{n}) = \text{normal}(x_{n + 1} \mid x_{n}, \omega). \] Each choice of scale \(\omega\) defines a different Markov transition distribution; if \(\omega\) is small then the Markov transitions will tend to stay close to the initial state and if \(\omega\) is large then the Markov transitions will tend to stray further.

c_light <- c("#DCBCBC")
c_light_highlight <- c("#C79999")
c_mid <- c("#B97C7C")
c_mid_highlight <- c("#A25050")
c_dark <- c("#8F2727")
c_dark_highlight <- c("#7C0000")

par(family="serif", las=1, bty="l", cex.axis=1, cex.lab=1, cex.main=1,
    xaxs="i", yaxs="i", mar = c(5, 5, 3, 1))
xs <- seq(-15, 15, 0.01)

ys <- dnorm(xs, 0, 2)
plot(xs, ys, type="l", lwd=2, col=c_dark,
     xlab="x_{n + 1}", ylab="tau", yaxt='n', ylim=c(0, 0.4))
text(-6, 0.23, cex=1.25, label="x_n = 0, omega = 2",
     pos=4, col=c_dark)

ys <- dnorm(xs, -2, 3)
lines(xs, ys, lwd=2, col=c_mid)
text(-14, 0.1, cex=1.25, label="x_n = -2, omega = 3",
     pos=4, col=c_mid)

ys <- dnorm(xs, 5, 1)
lines(xs, ys, lwd=2, col=c_light)
text(-5, 0.32, cex=1.25, label="x_n = 5, omega = 1",
     pos=4, col=c_light)

transition <- function(x, omega) {
  rnorm(1, x, omega)
}
set.seed(9958284)

omega <- 3

x0 <- 5

x1 <- transition(x0, omega)
x1
[1] 5.121533
R <- 1000
x1_ensemble <- sapply(1:R, function(n) transition(x0, omega))

hist(x1_ensemble, breaks=seq(-5, 15, 0.75), prob="T",
     xlab="x_{n + 1}", ylab="tau", yaxt='n',
     col=c_dark, border=c_dark_highlight)

xs <- seq(-5, 15, 0.01)
ys <- dnorm(xs, x0, omega)
lines(xs, ys, lwd=2, col=c_light)

A Markov transition distribution can act not only on points in the ambient space but also on probability distributions. For example a Markov transition distribution lifts a probability distribution \(\rho\) into a joint probability distribution over the product space \(X \times X\): \[ \tau \times \rho(x_{n + 1}, x_{n}) = \tau(x_{n + 1} \mid x_{n}) \, \rho(x_{n}). \]

If we interpret \(\rho(x_{n})\) as a probability distribution of initial states then \(\tau \times \rho(x_{n + 1}, x_{n})\) defines the joint probability distribution of both initial and final states. Sampling from this joint probability distribution, \[ \begin{align*} \tilde{x}_{n} &\sim \rho(x_{n}) \\ \tilde{x}_{n + 1} &\sim \tau(x_{n + 1} \mid \tilde{x}_{n}), \end{align*} \] realizes a particular pair of initial and final states.

For example if \(\rho\) is specified by the probability density function \[ \rho(x_{n}) = \text{normal}(x_{n} \mid 0, 1) \] then \[ \begin{align*} \tau \times \rho(x_{n + 1}, x_{n}) &= \text{normal}(x_{n + 1} \mid x_{n}, \omega) \, \text{normal}(x_{n} \mid 0, 1). \end{align*} \]

init <- function() {
  rnorm(1, 0, 1)
}
par(mfrow=c(1, 2))

plot(0, type="n",
     xlab="x_{n}", xlim=c(-8, 8),
     ylab="x_{n + 1}", ylim=c(-8, 8))

R <- 25
for (r in 1:R) {
  x0 <- init()
  x1 <- transition(x0, omega)
  points(x0, x1, col=c_dark, pch=16, cex=1.2)
  points(x0, x1, col=c_light, pch=16, cex=0.8)
}

G <- 100
x0_grid <- seq(-8, 8, 16 / G)
x1_grid <- seq(-8, 8, 16 / G)
joint_densities <- matrix(data = 0, nrow = G + 1, ncol = G + 1)

for (g1 in 1:(G + 1))
  for (g2 in 1:(G + 1))
    joint_densities[g1, g2] <- dnorm(x1_grid[g2], x0_grid[g1], omega) *
                               dnorm(x0_grid[g1], 0, 1)

contour(x0_grid, x1_grid, joint_densities, nlevels=10,
        xlab="x_{n}", xlim=c(-8, 8),
        ylab="x_{n + 1}", ylim=c(-8, 8),
        drawlabels = FALSE, col = c_dark, lwd = 2)

Marginalizing the initial state out of this joint probability distribution defines a probability distribution over the final state, \[ \tau \circ \rho (x_{n + 1}) = \int \mathrm{d} x_{n} \, \tau(x_{n + 1} \mid x_{n}) \, \rho(x_{n}). \] In other words a Markov transition distribution defines a pushforward operation that updates the probability distribution of initial states \(\rho\) into a probability distribution of final states \(\tau \circ \rho(x)\). Consequently a Markov transition can be used to update not only points in the ambient space but also probability distributions over the ambient space.

In our running example \[ \begin{align*} \tau \circ \rho (x_{n + 1}) &= \int \mathrm{d} x_{n} \, \tau(x_{n + 1} \mid x_{n}) \, \rho(x_{n}) \\ &= \int \mathrm{d} x_{n} \, \text{normal}(x_{n + 1} \mid x_{n}, \omega) \, \text{normal}(x_{n} \mid 0, 1) \\ &= \text{normal}(x_{n + 1} \mid 0, \sqrt{\omega^{2} + 1}). \end{align*} \]

par(mfrow=c(1, 1))

R <- 1000
x1_ensemble <- sapply(1:R, function(n) transition(init(), omega))

hist(x1_ensemble, breaks=seq(-15, 15, 0.6), prob="T",
     xlab="x_{n + 1}", ylab="tau", yaxt='n',
     col=c_dark, border=c_dark_highlight)

xs <- seq(-11, 11, 0.01)
ys <- dnorm(xs, 0, sqrt(omega**2 + 1))
lines(xs, ys, lwd=2, col=c_light)

1.2 Chain, Chain, Chain

Applying a Markov transition to some initial state \[ \tilde{x}_{0} \sim \rho(x_{0}) \] gives a second state, \[ \tilde{x}_{1} \sim \tau(x_{1} \mid \tilde{x}_{0}). \] Once we've realized this second state we can apply the Markov transition again to give a third state, \[ \tilde{x}_{2} \sim \tau(x_{2} \mid \tilde{x}_{1}). \] The joint probability distribution over all three states is specified by the probability density function \[ \begin{align*} \tau^{2} \times \rho(x_{2}, x_{1}, x_{0}) &\equiv \tau \times (\tau \times \rho) (x_{2}, x_{1}, x_{0}) \\ &= \tau(x_{2} \mid x_{1}) \, \tau \times \rho(x_{1}, x_{0}) \\ &= \tau(x_{2} \mid x_{1}) \, \tau(x_{1} \mid x_{0}) \, \rho(x_{0}) \end{align*} \] Likewise the probability distribution over the new final state is specified by the probability density function \[ \begin{align*} \tau^{2} \circ \rho(x_{2}) &\equiv \tau \circ (\tau \circ \rho) (x_{2}) \\ &= \int \mathrm{d} x_{1} \, \tau(x_{2} \mid x_{1}) \, \tau \circ \rho(x_{1}) \\ &= \int \mathrm{d} x_{1} \, \tau(x_{2} \mid x_{1}) \, \int \mathrm{d} x_{0} \, \tau(x_{1} \mid x_{0}) \, \rho(x_{0}). \end{align*} \] This latter probability distribution is referred to as the \(2\)-step distribution.
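
We can check the \(2\)-step distribution empirically; the following demonstration is a sketch in the spirit of the checks above, reusing the transition and init functions already defined. Because each transition adds an independent normal increment, iterating the \(1\)-step result gives \(\tau^{2} \circ \rho(x_{2}) = \text{normal}(x_{2} \mid 0, \sqrt{2 \, \omega^{2} + 1})\).

R <- 1000
x2_ensemble <- sapply(1:R,
                      function(r) transition(transition(init(), omega), omega))

hist(x2_ensemble, breaks=seq(-25, 25, 1), prob="T",
     xlab="x_{2}", ylab="tau^2 circ rho", yaxt='n',
     col=c_dark, border=c_dark_highlight)

xs <- seq(-15, 15, 0.01)
ys <- dnorm(xs, 0, sqrt(2 * omega**2 + 1))
lines(xs, ys, lwd=2, col=c_light)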

Iterating this procedure \(N\) times realizes a sequence of \(N + 1\) states \[ \begin{align*} \tilde{x}_{1} &\sim \tau(x_{1} \mid \tilde{x}_{0}) \\ \tilde{x}_{2} &\sim \tau(x_{2} \mid \tilde{x}_{1}) \\ \ldots \\ \tilde{x}_{n} &\sim \tau(x_{n} \mid \tilde{x}_{n - 1}) \\ \ldots \\ \tilde{x}_{N} &\sim \tau(x_{N} \mid \tilde{x}_{N - 1}) \end{align*} \] that we refer to as a Markov chain.

These sequences are often visualized in a trace plot that plots each state, or a one-dimensional output of an expectand evaluated at each state, versus an iteration index.

generate_chain <- function(x0, N, transition) {
  xs <- rep(0, N + 1)
  xs[1] <- x0
  for (n in 1:N) {
    xs[n + 1] <- transition(xs[n])
  }
  (xs)
}
transition <- function(x) {
  rnorm(1, x, 3)
}
N <- 25

x0 <- init()
chain <- generate_chain(x0, N, transition)

plot(0:N, chain, type="l", lwd=2, col=c_dark,
     xlim=c(0, N), xlab="Iteration",
     ylab="State")

The joint probability distribution over all of the states in a Markov chain is specified by the probability density function \[ \begin{align*} \tau^{N} \times \rho(x_{N}, \ldots, x_{n}, \ldots, x_{0}) &\equiv \underbrace{\tau \times \ldots \times \tau}_{N\text{ times}} \times \rho (x_{N}, \ldots, x_{n}, \ldots, x_{0}) \\ &= \left[ \prod_{n = 1}^{N} \tau(x_{n} \mid x_{n - 1}) \right] \, \rho(x_{0}). \end{align*} \] In particular we can interpret each realized Markov chain as a sample from this joint probability distribution.

plot(0, type="n",
     xlim=c(0, N), xlab="Iteration",
     ylim=c(-20, 20), ylab="State")

lines(0:N, generate_chain(init(), N, transition), lwd=2, col=c_dark)
lines(0:N, generate_chain(init(), N, transition), lwd=2, col=c_mid_highlight)
lines(0:N, generate_chain(init(), N, transition), lwd=2, col=c_mid)
lines(0:N, generate_chain(init(), N, transition), lwd=2, col=c_light_highlight)

Finally the \(N\)-step distribution characterizes the behavior of the final state after having applied the Markov transition \(N\) times, \[ \begin{align*} \tau^{N} \circ \rho(x_{N}) &\equiv \underbrace{\tau \circ \ldots \circ \tau}_{N\text{ times}} \circ \rho (x_{N}) \\ &= \left[ \prod_{n = 1}^{N - 1} \int \mathrm{d} x_{n} \, \tau(x_{n + 1} \mid x_{n}) \right] \, \int \mathrm{d} x_{0} \, \tau(x_{1} \mid x_{0}) \, \rho(x_{0}). \end{align*} \] We can also interpret this \(N\)-step distribution as a projection of the joint distribution, \[ \tau^{N} \circ \rho(x_{N}) = \left[ \prod_{n = 0}^{N - 1} \int \mathrm{d} x_{n} \right] \, \tau^{N} \times \rho(x_{N}, \ldots, x_{n}, \ldots, x_{0}). \]

Our demonstrative Markov transition admits an explicit \(N\)-step probability density function, \[ \tau^{N} \circ \rho (x_{N}) = \text{normal}(x_{N} \mid 0, \sqrt{N \, \omega^{2} + 1}). \]

R <- 1000
realized_chains <- matrix(data = 0, nrow = R, ncol = N + 1)

for (r in 1:R) {
  realized_chains[r,] <- generate_chain(init(), N, transition)
}

par(mfrow=c(2, 2))

for (n in c(5, 10, 15, 20)) {
  hist(realized_chains[, n + 1], breaks=seq(-60, 60, 3), prob="T",
       main=paste0("N = ", n),
       xlab=paste0("x_", n),
       ylab=paste0("tau^", n, " circ rho"), yaxt='n',
       col=c_dark, border=c_dark_highlight)

  xs <- seq(-50, 50, 0.01)
  ys <- dnorm(xs, 0, sqrt(n * omega**2 + 1))
  lines(xs, ys, lwd=2, col=c_light)
}

If we want to fix the initial state for multiple Markov chain realizations then we can employ a Dirac distribution, \[ \rho = \delta_{x_{0}}, \] that concentrates all of its probability at \(x_{0}\). Every sample from this Dirac distribution will yield \(x_{0}\), ensuring a deterministic initialization.

init <- function() {
  -2;
}

par(mfrow=c(1, 1))

plot(0, type="n",
     xlim=c(0, N), xlab="Iteration",
     ylim=c(-30, 30), ylab="State")

lines(0:N, generate_chain(init(), N, transition), lwd=2, col=c_dark)
lines(0:N, generate_chain(init(), N, transition), lwd=2, col=c_mid_highlight)
lines(0:N, generate_chain(init(), N, transition), lwd=2, col=c_mid)
lines(0:N, generate_chain(init(), N, transition), lwd=2, col=c_light_highlight)

1.3 Estimation Inclination

Once we have generated a Markov chain \(\{x_{0}, \ldots, x_{N} \}\) we can construct empirical averages of any real-valued function, or expectand, \(f : X \rightarrow \mathbb{R}\), \[ \left< f \right>_{0:N} \equiv \frac{1}{N + 1} \sum_{n = 0}^{N} f(x_{n}). \]

The possible output values of these empirical averages are characterized by pushing forward the joint probability distribution of the states along the empirical average function, \[ ( \left< f \right>_{0:N} )_{*} \tau^{N} \times \rho, \] with the corresponding pushforward probability density function \[ ( \left< f \right>_{0:N} )_{*} \tau^{N} \times \rho(s). \]

Consider for example our demonstrative Markov transition distribution. The pushforward density function for the empirical average of the identity function, \[ \begin{align*} \left< f \right>_{0:N} &= \frac{1}{N + 1} \sum_{n = 0}^{N} \text{Id}(x_{n}) \\ &= \frac{1}{N + 1} \sum_{n = 0}^{N} x_{n} \end{align*} \] can be evaluated in closed form: \[ ( \left< f \right>_{0:N} )_{*} \tau^{N} \times \rho(s) = \text{normal} \left(s \mid 0, \sqrt{1 + \frac{N \, (2 \, N + 1)}{6 \, (N + 1)} \, \omega^{2}} \right). \]

First we evaluate the expectand at each of our realized Markov chain states.

identity <- function(x) {
  (x)
}

N <- 25
pushforward_values <- matrix(data = 0, nrow = R, ncol = N + 1)
for (r in 1:R)
  for (n in 1:(N + 1))
    pushforward_values[r, n] <- identity(realized_chains[r, n])

Then we use a Welford accumulator to efficiently, and numerically stably, evaluate the empirical average for each Markov chain.

welford_average <- function(xs) {
  N <- length(xs)
  average <- 0
  for (n in 1:N) {
    delta <- xs[n] - average
    average <- average + delta / n
  }
  (average)
}

ensemble_empirical_averages <-
  sapply(1:R, function(r) welford_average(pushforward_values[r,]))

Finally we can compare the distribution of these empirical averages to the exact pushforward probability density function.

par(mfrow=c(1, 1))

hist(ensemble_empirical_averages, breaks=seq(-40, 40, 2), prob="T",
     xlab="x", ylab="", yaxt='n',
     col=c_dark, border=c_dark_highlight)

xs <- seq(-30, 30, 0.01)
sum_var <- 1 + omega**2 * N * (2 * N + 1) / (6 * (N + 1))
ys <- dnorm(xs, 0, sqrt(sum_var))
lines(xs, ys, lwd=2, col=c_light)

At this point we might be curious if, as with Monte Carlo, these empirical averages are related to expectation values with respect to some probability distribution, in which case we could use them to estimate these expectation values. Unfortunately explicitly constructing the pushforward density function for an empirical average, let alone relating it to the expectation value with respect to some distribution, cannot be done for most Markov transitions and functions.

Fortunately when a Markov transition distribution is sufficiently well-behaved we can relate empirical averages to explicit expectation values when the Markov chains are long enough.

1.4 You Don't Need To. Converge. To Nothing.

A probability distribution over the ambient space is referred to as a stationary probability distribution, or more compactly just a stationary distribution, of a Markov transition distribution if it equals its own \(1\)-step distribution, \[ \sigma = \tau \circ \sigma, \] or in terms of probability density functions \[ \begin{align*} \sigma(x_{n + 1}) &= \tau \circ \sigma(x_{n + 1}) \\ &= \int \mathrm{d} x_{n} \, \tau(x_{n + 1} \mid x_{n}) \, \sigma(x_{n}). \end{align*} \] Some Markov transition distributions feature no stationary distributions, some feature multiple stationary distributions, and a few special Markov transition distributions feature a single, unique stationary distribution. Identifying the stationary distributions for a general Markov transition distribution, however, is a notoriously difficult task.

Stationary distributions are critical because they determine the asymptotic behavior of Markov chains as the total number of states grows to infinity \(N \rightarrow \infty\). For example if a Markov transition distribution admits any stationary distributions then the \(N\)-step distributions will converge to one of those stationary distributions, \[ \lim_{N \rightarrow \infty} \tau^{N} \circ \rho = \sigma. \] In general the \(N\)-step distributions from different initial distributions \(\rho\) will converge to different stationary distributions \(\sigma\).

When the \(N\)-step distribution converges to a stationary distribution the corresponding distribution of empirical averages will converge to a Dirac distribution that concentrates on the corresponding stationary expectation value, \[ \lim_{N \rightarrow \infty} ( \left< f \right>_{0:N} )_{*} \tau^{N} \times \rho = \delta_{\mathbb{E}_{\sigma}[f]}, \] provided that the stationary expectation value \(\mathbb{E}_{\sigma}[f]\) is well-defined.

For example the expectation value of an indicator function is always well-defined, \[ \mathbb{E}_{\sigma} [ \mathbb{I}_{A} ] = \mathbb{P}_{\sigma}[A]. \] Consequently the empirical averages \[ \frac{1}{N + 1} \sum_{n = 0}^{N} \mathbb{I}_{A}[x_{n}] \] will converge to \(\mathbb{P}_{\sigma}[A]\) for almost every Markov chain realization \[ \begin{align*} \tilde{x}_{0} &\sim \rho(x_{0}) \\ \tilde{x}_{1} &\sim \tau(x_{1} \mid \tilde{x}_{0}) \\ \tilde{x}_{2} &\sim \tau(x_{2} \mid \tilde{x}_{1}) \\ \ldots \\ \tilde{x}_{n} &\sim \tau(x_{n} \mid \tilde{x}_{n - 1}) \\ \ldots \\ \tilde{x}_{N} &\sim \tau(x_{N} \mid \tilde{x}_{N - 1}). \end{align*} \]

Markov transition distributions that are irreducible and aperiodic feature a single, unique stationary distribution to which the \(N\)-step distribution converges for all initial distributions, if they admit any stationary distributions at all.

Our demonstrative Markov transition distribution is irreducible and aperiodic. Unfortunately it doesn't quite admit a well-defined stationary probability distribution; the unique stationary distribution is specified by a uniform density function with ill-defined normalization, \[ \text{uniform}(x) = \lim_{L \rightarrow \infty} \frac{1}{L - (-L)}. \] In this case this is straightforward to verify, \[ \begin{align*} \int_{-\infty}^{\infty} \mathrm{d} x_{n} \, \tau(x_{n + 1} \mid x_{n}) \, \sigma(x_{n}) &= \int_{-\infty}^{\infty} \mathrm{d} x_{n} \, \text{normal}(x_{n + 1} \mid x_{n}, \omega) \, \text{uniform}(x_{n}) \\ &= \int_{-\infty}^{\infty} \mathrm{d} x_{n} \, \text{normal}(x_{n + 1} \mid x_{n}, \omega) \, \text{const} \\ &= \text{const} \, \int_{-\infty}^{\infty} \mathrm{d} x_{n} \, \text{normal}(x_{n + 1} \mid x_{n}, \omega) \\ &= \text{const} \, \int_{-\infty}^{\infty} \mathrm{d} x_{n} \, \text{normal}(x_{n} \mid x_{n + 1}, \omega) \\ &= \text{const} \\ &= \text{uniform}(x_{n + 1}). \end{align*} \]

Consequently the \(N\)-step density functions from any initial distribution should converge to this uniform density function, \[ \lim_{N \rightarrow \infty} \tau^{N} \circ \rho(x) = \text{uniform}(x). \] We can immediately verify this convergence using our analytic form of the \(N\)-step density function, \[ \begin{align*} \lim_{N \rightarrow \infty} \tau^{N} \circ \rho(x) &= \lim_{N \rightarrow \infty} \text{normal}(x \mid 0, \sqrt{N \, \omega^{2} + 1}) \\ &= \lim_{N \rightarrow \infty} \text{normal}(x \mid 0, \sqrt{N} \, \omega) \\ &= \text{uniform}(x). \end{align*} \] The \(N\)-step density functions stay centered at zero but become more and more diffuse until they eventually yield a uniform density function in the infinite limit.

The probability allocated to any finite set by this uniform distribution is zero, and so we expect that \[ \lim_{N \rightarrow \infty} \frac{1}{N + 1} \sum_{n = 0}^{N} \mathbb{I}_{A}[x_{n}] = 0. \]

N <- 1000
R <- 5
realized_chains <- matrix(data = 0, nrow = R, ncol = N + 1)

for (r in 1:R) {
  realized_chains[r,] <- generate_chain(init(), N, transition)
}
indicator <- function(x) {
  ifelse(0 < x & x < 1, 1, 0)
}

pushforward_values <- matrix(data = 0, nrow = R, ncol = N + 1)
for (r in 1:R)
  for (n in 1:(N + 1))
    pushforward_values[r, n] <- indicator(realized_chains[r, n])
running_averages <- matrix(data = 0, nrow = R, ncol = N + 1)
for (r in 1:R)
  for (n in 1:(N + 1))
    running_averages[r, n] <- welford_average(pushforward_values[r, 1:n])

plot(0, type="n",
     xlim=c(0, N), xlab="Iteration",
     ylim=c(0, 0.1), ylab="Running Empirical Average")

lines(0:N, running_averages[1,], lwd=2, col=c_dark)
lines(0:N, running_averages[2,], lwd=2, col=c_mid_highlight)
lines(0:N, running_averages[3,], lwd=2, col=c_mid)
lines(0:N, running_averages[4,], lwd=2, col=c_light_highlight)
lines(0:N, running_averages[5,], lwd=2, col=c_light)

Moreover, because the empirical averages from almost all Markov chain realizations converge to the same value we can also average over an ensemble of Markov chains \[ \begin{align*} \left< f \right> &= \frac{1}{C} \sum_{c = 1}^{C} \left< f \right>_{c} \\ &= \frac{1}{C} \sum_{c = 1}^{C} \frac{1}{N_{c} + 1} \sum_{n = 0}^{N_{c}} f(x_{c, n}). \end{align*} \]

running_ensemble_averages <- rep(0, N + 1)
for (n in 1:(N + 1))
  running_ensemble_averages[n] <- welford_average(pushforward_values[1:R, 1:n])

plot(0:N, running_ensemble_averages, type="l", lwd=2, col=c_dark,
     xlim=c(0, N), xlab="Iteration",
     ylim=c(0, 0.1), ylab="Running Ensemble Empirical Average")

In both cases the empirical averages are converging towards zero, but because we cannot generate infinitely long Markov chains in practice we cannot empirically confirm that they will reach zero in the asymptotic limit.

2 Markov Chain Monte Carlo

For empirical averages over Markov chains to be useful we need to understand how they relate to stationary expectation values after only a finite number of iterations. In other words we need to know not only to what values empirical averages will converge but also how they converge to those values. Unfortunately quantifying how empirical averages converge is much more difficult than quantifying the asymptotic value alone.

In Monte Carlo the behavior of any empirical average over independent sequences can be characterized by a central limit theorem so long as the expectand \(f\) is sufficiently integrable. For sufficiently large \(N\) the behavior of the empirical averages is well-characterized by a normal probability density function \[ \pi(\left< f \right>_{0:N}) \approx \text{normal} \left( \left< f \right>_{0:N} \mid \mathbb{E}_{\pi}[f], \text{MC-SE}[f] \right), \] where the Monte Carlo standard error \[ \text{MC-SE}[f] = \sqrt{ \frac{\mathbb{V}_{\pi}[f]}{N} } \] probabilistically quantifies how strongly the empirical averages will concentrate around the exact expectation value. This central limit theorem provides the context needed to estimate the exact expectation value \(\mathbb{E}_{\pi}[f]\) with an empirical average \(\left< f \right>_{0:N}\), so long as the exact expectation value is well-defined.
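
As a quick illustration, and a sketch that is not part of the formal development, we can realize this behavior with independent draws; the name N_indep is hypothetical. Empirical averages of N_indep independent samples from \(\text{normal}(0, 1)\) should concentrate around the exact expectation value \(0\) with standard error \(\sqrt{1 / N_{\text{indep}}}\).

N_indep <- 100
R <- 1000
mc_averages <- sapply(1:R, function(r) mean(rnorm(N_indep, 0, 1)))

hist(mc_averages, breaks=seq(-0.5, 0.5, 0.025), prob="T",
     xlab="Empirical Average", ylab="", yaxt='n',
     col=c_dark, border=c_dark_highlight)

xs <- seq(-0.5, 0.5, 0.001)
ys <- dnorm(xs, 0, sqrt(1 / N_indep))
lines(xs, ys, lwd=2, col=c_light)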

In some exceptional cases a Markov transition distribution will also induce a central limit theorem that characterizes the behavior of empirical averages for sufficiently long Markov chains, \[ \pi(\left< f \right>_{0:N}) \approx \text{normal} \left( \left< f \right>_{0:N} \mid \mathbb{E}_{\pi}[f], \text{MCMC-SE}[f] \right), \] where the precision is now defined by a Markov chain Monte Carlo standard error \[ \text{MCMC-SE}[f] = \sqrt{ \frac{\mathbb{V}_{\pi}[f]}{ \text{ESS}[f] } }. \] Here the sequence length \(N\) is replaced by the expectand-dependent effective sample size \(\text{ESS}[f]\). When such a central limit theorem holds we can use empirical averages over Markov chains to estimate exact expectation values and probabilistically quantify the approximation error. This estimation methodology is referred to as Markov chain Monte Carlo.

2.1 The Markov Chain Monte Carlo Standard Error

The effective sample size directly influences the precision of a Markov chain Monte Carlo estimator; the larger the effective sample size the smaller the probabilistic error and vice versa. For infinitely long Markov chains the effective sample size is defined as a sum of expectand autocorrelations, \[ \begin{align*} \text{ESS}[f] &= \frac{N}{\sum_{l = -\infty}^{\infty} \rho_{l}[f]} \\ &= \frac{N}{1 + 2 \, \sum_{l = 1}^{\infty} \rho_{l}[f]}, \end{align*} \] where the \(\rho_{l}[f]\) are the \(l\)-lag autocorrelations, \[ \rho_{l}[f] = \frac{ \mathbb{E}_{\tau^{N} \times \rho} [ ( f (x_{n + l}) - \mathbb{E}_{\pi}[f] ) ( f(x_{n}) - \mathbb{E}_{\pi}[f] ) ] } { \mathrm{Var}_{\pi}[f] }. \] Inspecting this definition we can see that the more strongly the expectand output is correlated across distant states of the Markov chain the smaller the effective sample size will be.

In practice we cannot derive an exact effective sample size for each expectand, but if a Markov chain Monte Carlo central limit theorem holds then we can estimate it from a realized Markov chain. Specifically we can construct Markov chain Monte Carlo estimators for the autocorrelations, \(\hat{\rho}_{l}[f]\), and the variance \(\widehat{\mathrm{Var}}_{\pi}[f]\) and then compute \[ \widehat{\text{MCMC-SE}}[f] = \sqrt{ \frac{ \widehat{\mathbb{V}}_{\pi}[f]}{ \widehat{\text{ESS}} [f] } } \] where \[ \widehat{\text{ESS}}[f] = \frac{N}{1 + 2 \, \sum_{l = 1}^{\infty} \hat{\rho}_{l}[f]}. \]

Effective sample size estimation becomes more robust if we process the empirical autocorrelations slightly. For example for high enough lags the exact autocorrelations will decay towards zero and the empirical autocorrelations will be dominated by estimator error. We can limit how much of this error propagates into the effective sample size estimator by truncating the sum over lags. Geyer's initial positive sequence truncates the sum as soon as a paired sum of empirical autocorrelations becomes negative, which is impossible for the exact autocorrelations. Similarly Geyer's initial monotone sequence further suppresses estimator error by enforcing the monotonic decay expected of the exact paired sums. These are just a few of the possible corrections; see for example [3].

Here we use a Welford accumulator to efficiently evaluate empirical autocorrelations up to a given lag using only a single pass through the states of a realized Markov chain.

welford_summary <- function(fs, L) {
  summary <- rep(0, L + 1)

  N <- length(fs)
  for (n in 1:N) {
    delta <- fs[n] - summary[1]
    summary[1] <- summary[1] + delta / n
    for (l in 1:L) {
      if (n > l)
        summary[l + 1] <- summary[l + 1] + delta * (fs[n - l + 1] - summary[1])
    }
  }

  norm <- 1 / (N - 1)
  summary[2:(L + 1)] <- summary[2:(L + 1)] * norm

  (summary)
}

We can then process these empirical autocorrelations into a robust effective sample size estimator.

mcmc_est <- function(fs, L = 100) {
  # Compute cumulants
  summary <- welford_summary(fs, L)

  mean = summary[1]
  acov = summary[2:(L + 1)] # Empirical autocovariances

  ## Return a NaN MCMC standard error if the sequence is constant
  if (all(acov == 0))
    return (c(mean, 0, NaN))

  rho_hat = acov / acov[1]  # Empirical autocorrelations

  # Compute the effective sample size

  # Geyer's "initial positive sequence" of truncated empirical autocorrelations.
  # For every even lag l the exact autocorrelation paired sum rho_{l} + rho_{l + 1}
  # is guaranteed to be positive.  The initial positive sequence heuristic
  # truncates the empirical autocorrelations when the first paired sum is negative,
  # corresponding to the threshold where estimator noise starts to dominate.

  ips_rho_hat <- rep(0, L)

  ips_rho_hat[1] <- rho_hat[1] # Lag 0
  ips_rho_hat[2] <- rho_hat[2] # Lag 1

  max_s <- 2

  for (s in seq(3, L, 2)) {
    rho_hat_even <- rho_hat[s]     # Lag s - 1
    rho_hat_odd  <- rho_hat[s + 1] # Lag s
  
    max_s <- s + 1
  
    if (rho_hat_even + rho_hat_odd > 0) {
      ips_rho_hat[s]     = rho_hat_even
      ips_rho_hat[s + 1] = rho_hat_odd
    } else {
      break
    }
  }

  # Geyer's "initial monotone sequence" of empirical autocorrelations.
  # Exact autocorrelation paired sums rho_{l} + rho_{l + 1} are guaranteed
  # to exhibit monotonic decay with l.  The initial monotone sequence heuristic
  # replaces pairs of empirical autocorrelations that don't decay with the
  # previous sum, splitting the total evenly across the two components.

  ims_rho_hat <- ips_rho_hat

  for (s in seq(3, max_s, 2)) {
    current_sum <- ims_rho_hat[s] + ims_rho_hat[s + 1]
    previous_sum <- ims_rho_hat[s - 2] + ims_rho_hat[s - 1]
    if (current_sum > previous_sum) {
      ims_rho_hat[s]     <- 0.5 * previous_sum
      ims_rho_hat[s + 1] <- 0.5 * previous_sum
    }
  }

  # Finally we regularize the initial monotone sequence in case of antithetic
  # autocorrelations.  The sum of exact autocorrelations with lag >= 1 is
  # guaranteed to be greater than -1/2.  If the sum is too negative then we
  # replace it with -1/4, a conservative bound that caps the effective sample
  # size estimator at twice the total number of states.

  rho_sum <- sum(ims_rho_hat[2:L])
  if (rho_sum <= -0.25) rho_sum <- -0.25

  # After all of that refinement of the empirical autocorrelations we can
  # finally construct an effective sample size estimator.

  ess <- length(fs) / (1.0 + 2 * rho_sum)

  c(mean, sqrt(acov[1] / ess), ess)
}

Asymptotic convergence requires only that the exact expectation value \(\mathbb{E}_{\pi}[f]\) is well-defined, and a central limit theorem requires that \(\mathbb{E}_{\pi}[f^{2}]\) is well-defined as well. One limitation to this empirical estimation of the Markov chain Monte Carlo standard error is that we also need \(\mathbb{E}_{\pi}[f^{4}]\) to be well-defined so that the error in estimating \(\widehat{\mathbb{V}}_{\pi}[f]\) is under control.

Replacing the exact variance and effective sample size with these estimates does introduce additional error, but that error is of the same order as the error in the Markov chain Monte Carlo central limit theorem itself. In particular both errors will become negligible so long as the Markov chain is long enough.

2.2 Warmup

The Monte Carlo central limit theorem requires \(N\) to be large enough for the normal approximation to be sufficiently accurate, but because the sequences are independent the normal approximation will be valid for any subset of \(N\) points in a realized sequence. States in a Markov chain, however, are by construction correlated, which complicates determining exactly when a central limit theorem will hold.

Within a Markov chain the earliest states will be most correlated with the initial state, and hence the choice of initial distribution. As we generate more states, and the \(N\)-step distribution converges to a stationary distribution, the states become less correlated with the initial state, "forgetting" about the initialization.

The longer these correlations persist the longer a Markov chain we will need for the convergent behavior of later states to dominate empirical averages so that a Markov chain Monte Carlo central limit theorem can hold. In particular the more discrepant the initial distribution is from the relevant stationary distribution the longer our Markov chains will need to be.

Excising the early states that are strongly correlated with the initial state, and only those states, preserves the convergent behavior of the later states. In other words if empirical averages over the full Markov chain are well-characterized by a Markov chain Monte Carlo central limit theorem then removing the early states will give shorter Markov chains whose resulting empirical averages will still be well-characterized by the central limit theorem.

When applying Markov chain Monte Carlo in practice we can set aside an initial window of states to allow the Markov chains to equilibrate and forget about the initial distribution before we start aggregating states into empirical averages. This process is known as warmup, with that initial window of ignored states known as the warmup period.

2.3 Markov Chain Monte Carlo With Multiple Markov Chains

A Markov chain Monte Carlo central limit theorem quantifies the behavior of Markov chain Monte Carlo estimators derived from a single Markov chain. An immediate question is then what kinds of estimators we might be able to construct from multiple Markov chains. Fortunately if the Markov transition distribution is recurrent then combining individual empirical averages is relatively straightforward.

If a recurrent Markov transition distribution admits a central limit theorem then each Markov chain yields the Markov chain Monte Carlo estimator \[ \left< f \right>_{0:N_{c}} = \frac{1}{N_{c} + 1} \sum_{n = 0}^{N_{c}} f(x_{c, n}), \] whose distribution is specified by the probability density function \[ \pi(\left< f \right>_{0:N_{c}}) \approx \text{normal} \left( \left< f \right>_{0:N_{c}} \mid \mathbb{E}_{\pi}[f], \text{MCMC-SE}_{c}[f] \right). \] Even if a Markov chain Monte Carlo central limit theorem holds the standard error might vary across the individual Markov chains if they are not all the same length.

While we cannot combine the Markov chain states directly we can combine these empirical estimators. Because the Markov chains are independent the distribution of the ensemble estimator \[ \begin{align*} \left<\left< f \right>\right> &\equiv \frac{1}{C} \sum_{c = 1}^{C} \left< f \right>_{0:N_{c}} \\ &= \frac{1}{C} \sum_{c = 1}^{C} \frac{1}{N_{c} + 1} \sum_{n = 0}^{N_{c}} f(x_{c, n}) \end{align*} \] will be characterized by the normal probability density function \[ \pi(\left<\left< f \right>\right>) \approx \text{normal} \left( \left<\left< f \right>\right> \mid \mathbb{E}_{\pi}[f], \left< \text{MCMC-SE}[f] \right> \right) \] where \[ \begin{align*} \left< \text{MCMC-SE}[f] \right> &= \frac{1}{C} \sqrt{ \sum_{c = 1}^{C} \text{MCMC-SE}_{c}[f]^{2} } \\ &= \frac{1}{C} \sqrt{ \sum_{c = 1}^{C} \frac{\mathbb{V}_{\pi}[f]}{ \text{ESS}_{c}[f] } } \\ &= \frac{\sqrt{\mathbb{V}_{\pi}[f]}}{C} \, \sqrt{ \sum_{c = 1}^{C} \frac{1}{ \text{ESS}_{c}[f] } }. \end{align*} \]

When we are estimating these standard errors from each realized Markov chain we have two strategies. The simplest approach is to estimate the variance \(\mathbb{V}_{\pi}[f]\) and the autocorrelations \(\rho_{l}[f]\) from each Markov chain separately, and then combine the individual empirical averages together based on these separate standard error estimates. A more sophisticated approach would take into account the fact that these objects are fixed across all Markov chain realizations and so we can pool these estimates while constructing the standard errors.
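
Here is a minimal sketch of the simpler strategy, assuming the mcmc_est function defined above; the helper name naive_ensemble_mcmc_est is hypothetical. Each Markov chain is summarized separately and the individual estimators are combined with the ensemble standard error formula from the previous section.

naive_ensemble_mcmc_est <- function(chains, L = 100) {
  C <- length(chains)
  chain_ests <- lapply(chains, function(c) mcmc_est(c, L))

  # Unweighted average of the individual Markov chain estimators
  m <- mean(sapply(chain_ests, function(est) est[1]))

  # Combine the individual standard errors in quadrature
  se <- sqrt(sum(sapply(chain_ests, function(est) est[2]**2))) / C

  c(m, se)
}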

For example the following code pools the variance estimator but not the individual autocorrelations. This is a common approach, but by no means the most effective one.

pushforward_chains <- function(chains, expectand) {
  lapply(chains, function(c) sapply(c, function(x) expectand(x)))
}

ensemble_mcmc_est <- function(chains, L = 100) {
  C <- length(chains)
  chain_ests <- lapply(chains, function(c) mcmc_est(c, L))

  # Total effective sample size
  total_ess <- sum(sapply(chain_ests, function(est) est[3]))

  if (is.nan(total_ess)) {
    m <- mean(sapply(chain_ests, function(est) est[1]))
    se <- mean(sapply(chain_ests, function(est) est[2]))
    return (c(m, se, NaN))
  }

  # Ensemble average weighted by effective sample size
  mean <- sum(sapply(chain_ests,
                     function(est) est[3] * est[1])) / total_ess

  # Ensemble variance weighted by effective sample size
  # including correction for the fact that individual Markov chain
  # variances are defined relative to the individual mean estimators
  # and not the ensemble mean estimator
  vars <- rep(0, C)

  for (c in 1:C) {
    est <- chain_ests[[c]]
    chain_var <- est[3] * est[2]**2
    var_update <- (est[1] - mean)**2
    vars[c] <- est[3] * (var_update + chain_var)
  }
  var <- sum(vars) / total_ess

  c(mean, sqrt(var / total_ess), total_ess)
}

# Format ensemble Markov chain Monte Carlo estimator for display
display_ensemble_mcmc_est <- function(chains, L = 100, name) {
  est <- ensemble_mcmc_est(chains, L)
  cat(sprintf("E[%s] = %.3f +/- %.3f (Total Effective Sample Size = %.0f)\n",
              name, est[1], 2 * est[2], est[3]))
}

2.4 Diagnostics

In an ideal world we would be able to theoretically verify whether or not a given Markov transition distribution admits a central limit theorem, and then whether or not our Markov chains are long enough for the central limit theorem to hold so that empirical averages can be used as Markov chain Monte Carlo estimators. Unfortunately this theoretical verification is difficult even for relatively simple Markov transition distributions, and outright infeasible for more sophisticated Markov transitions distributions.

If we want to ensure robust Markov chain Monte Carlo estimation then we cannot take the existence of a Markov chain Monte Carlo central limit theorem for granted. Instead we have to carefully investigate any realized Markov chains for any behavior inconsistent with a central limit theorem. Any inconsistencies suggest that the Markov transition is not recurrent, or does not admit a Markov chain Monte Carlo central limit theorem, or we have not generated long enough Markov chains for the central limit theorem to hold, or that the exact expectation value is not sufficiently well-behaved.

Consider for example a recurrent Markov transition distribution with the single stationary distribution \(\pi\). If the exact expectation value \(\mathbb{E}_{\pi}[f]\) is well-defined then the empirical averages from any realized Markov chains, no matter the initial distribution, should be consistent with each other up to the Markov chain Monte Carlo standard errors.

2.4.1 Visual Diagnostics

An immediate way to visualize discrepancies across realized Markov chains is to examine the trace plots for each expectand of interest.

plot_traces <- function(chains, N_warmup = 0) {
  C <- length(chains)

  n_cols <- 2
  n_rows <- ceiling(C / 2)
  par(mfrow=c(n_rows, n_cols))

  min_y <- min(unlist(chains))
  max_y <- max(unlist(chains))

  for (c in 1:C) {
    fs <- chains[[c]]
    N <- length(fs)
    plot(0, type="n", main=paste("Chain", c),
         xlim=c(0, N + 1), xlab="Iteration",
         ylim=c(min_y, max_y), ylab="State")
    rect(0, min_y, N_warmup, max_y, col=c("#BBBBBB88"), border=FALSE)
    lines(1:N, fs, lwd=1, col=c_dark)
  }
}

At the same time we could compare the Markov chain Monte Carlo estimators derived from these Markov chains. If the intervals \[ \left< f \right>_{0:N_{c}} \pm \delta \cdot \text{MCMC-SE}_{c}[f] \] for some constant \(\delta\) don't visually overlap well then we should be suspicious of a central limit theorem holding, and hence the relevance of terms like \(\text{MCMC-SE}_{c}[f]\)!

plot_est_comp <- function(chains) {
  # Evaluate Markov chain Monte Carlo estimators
  C <- length(chains)

  indiv_est <- sapply(chains, function(c) mcmc_est(c))
  ensemble_est <- ensemble_mcmc_est(chains)

  # Construct central limit theorem estimator intervals
  M <- C + 1
  inters <- matrix(data = 0, nrow = 9, ncol = M)

  for (d in 1:9) {
    for (c in 1:C) {
      inters[d, c] <- indiv_est[1, c] + (d - 5) * indiv_est[2, c]
    }
    inters[d, C + 1] <- ensemble_est[1] + (d - 5) * ensemble_est[2]
  }

  # Plot interval comparison
  par(mfrow=c(1, 1))

  idx <- rep(1:M, each=2)
  x <- sapply(1:length(idx), function(m) if(m %% 2 == 0) idx[m] + 0.5 else idx[m] - 0.5)

  pad_inters <- do.call(cbind, lapply(idx, function(m) inters[1:9, m]))

  plot(0, type="n", main="",
       xlab="", xlim=c(0.5, M + 0.5), xaxt="n",
       ylab="Markov chain\nMonte Carlo Estimator",
       ylim=c(min(pad_inters[1,]), max(pad_inters[9,])))
  abline(v=M - 0.5, col="gray80", lwd=2, lty=3)

  polygon(c(x, rev(x)), c(pad_inters[1,], rev(pad_inters[9,])),
          col = c_light, border = NA)
  polygon(c(x, rev(x)), c(pad_inters[2,], rev(pad_inters[8,])),
          col = c_light_highlight, border = NA)
  polygon(c(x, rev(x)), c(pad_inters[3,], rev(pad_inters[7,])),
          col = c_mid, border = NA)
  polygon(c(x, rev(x)), c(pad_inters[4,], rev(pad_inters[6,])),
          col = c_mid_highlight, border = NA)
  for (m in 1:M)
    lines(x[(2 * m - 1):(2 * m)], pad_inters[5,(2 * m - 1):(2 * m)],
          col=c_dark, lwd=2)

  indiv_names <- sapply(1:C, function(c) paste("Chain", c))
  ensemble_name <- "Ensemble\nAverage"

  axis(1, at=1:M, labels=c(indiv_names, ensemble_name), mgp=c(1, 2, 0))
}

2.4.2 R-hat

Consistency can also be investigated more quantitatively. For example the \(\hat{R}\) diagnostic compares expectand values across multiple Markov chain realizations with a modified analysis-of-variance statistic. The closer the statistic is to \(1\) the more homogeneous the behavior of the expectand across the individual Markov chains; the larger the statistic the more heterogeneous the behavior and the less likely a central limit theorem will accurately characterize empirical estimators.

The split \(\hat{R}\) diagnostic goes one step further, splitting each realized Markov chain into two halves and then using the modified analysis-of-variance statistic to quantify the heterogeneity across all \(2 \, C\) Markov chain segments. While this can be a bit aggressive -- Markov chain segments might not be long enough for a Markov chain Monte Carlo central limit theorem to hold even if the full Markov chains are -- the additional sensitivity to heterogeneity within the Markov chain realizations is often useful in practice.

split_chain <- function(chain) {
  N <- length(chain)
  M <- N %/% 2
  list(chain[1:M], chain[(M + 1):N])
}

split_rhat <- function(chains) {
  split_chains <- unlist(lapply(chains, function(c) split_chain(c)), recursive=FALSE)

  N_chains <- length(split_chains)
  N <- sum(sapply(chains, function(c) length(c)))

  means <- rep(0, N_chains)
  vars <- rep(0, N_chains)

  for (c in 1:N_chains) {
    summary <- welford_summary(split_chains[[c]], 1)
    means[c] <- summary[1]
    vars[c] <- summary[2]
  }

  total_mean <- sum(means) / N_chains
  W = sum(vars) / N_chains
  B = N * sum(sapply(means, function(m) (m - total_mean)**2)) / (N_chains - 1)

  rhat = NaN
  if (abs(W) > 1e-10)
    rhat = sqrt( (N - 1 + B / W) / N )

  (rhat)
}

One significant limitation of \(\hat{R}\) and split \(\hat{R}\) is that their sensitivity to heterogeneity across the realized Markov chains, or Markov chain segments, depends on how much useful information is contained within each Markov chain. If the states were independent then the sensitivity would depend on the total number of states, but because the states in each Markov chain will be correlated the sensitivity will also depend on the correlation between the states. In other words these diagnostics can stray from \(1\) when the Markov chains are not long enough even if there are no actual obstructions to a Markov chain Monte Carlo central limit theorem.

2.4.3 K-hat

Diagnostics like split \(\hat{R}\) are sensitive to many different failure modes of the Markov chain Monte Carlo central limit theorem. That broad sensitivity, however, also makes it difficult to identify the particular failure mode responsible for the obstruction of a central limit theorem in any given application. More precise diagnostics that are sensitive to only particular failure modes can be much more useful for assigning blame and then identifying productive resolutions.

For example one failure mode is the expectand not being sufficiently integrable. If \(\mathbb{E}_{\pi}[f]\) is not well-defined then there is nothing to which empirical averages can converge even asymptotically; if \(\mathbb{E}_{\pi}[f^{2}]\) is not well-defined then the Markov chain Monte Carlo standard error will not be well-defined; and if \(\mathbb{E}_{\pi}[f^{4}]\) is not well-defined then we cannot estimate the standard error from realized Markov chains.

Typically the integrability of an expectand is determined by the tails of the pushforward distribution \(f_{*} \pi\). The more diffuse that pushforward distribution, and the heavier the tails, the less likely the higher powers of \(f\) will admit well-defined expectation values. The \(\hat{k}\) statistic empirically quantifies the heaviness of these pushforward tails. Heuristically values of \(\hat{k}\) larger than \(0.25\) typically indicate problematic behavior, and we will use that as a diagnostic cutoff here.

# Empirical Pareto tail shape k-hat for sample of positive values
# Returns -2 if tail is too truncated
compute_khat <- function(fs) {
  N <- length(fs)
  sorted_fs <- sort(fs)

  if (sorted_fs[1] == sorted_fs[N]) {
    return (-2)
  }

  if (sorted_fs[1] < 0) {
    print("x must be positive!")
    return (NA)
  }

  # Estimate 25% quantile
  q <- sorted_fs[floor(0.25 * N + 0.5)]

  if (q == sorted_fs[1]) {
    return (-2)
  }

  # Heuristic Pareto configuration
  M <- 20 + floor(sqrt(N))

  b_hat_vec <- rep(0, M)
  log_w_vec <- rep(0, M)

  for (m in 1:M) {
    b_hat_vec[m] <- 1 / sorted_fs[N] + (1 - sqrt(M / (m - 0.5))) / (3 * q)
    k_hat <- - mean( log(1 - b_hat_vec[m] * sorted_fs) )
    log_w_vec[m] <- N * ( log(b_hat_vec[m] / k_hat) + k_hat - 1)
  }

  max_log_w <- max(log_w_vec)
  b_hat <- sum(b_hat_vec * exp(log_w_vec - max_log_w)) /
    sum(exp(log_w_vec - max_log_w))

  mean( log (1 - b_hat * sorted_fs) )
}

# Empirical k-hat tail diagnostic for a given expectand of interest
# Returns -2 when tail is too truncated
tail_khats <- function(fs) {
  f_center <- median(fs)
  fs_left <- abs(fs[fs < f_center] - f_center)
  fs_right <- fs[fs > f_center] - f_center

  # Default to -2 if the left tail is ill-defined
  khat_left <- -2
  if (length(fs_left) > 0)
    khat_left <- compute_khat(fs_left)

  # Default to -2 if the right tail is ill-defined
  khat_right <- -2
  if (length(fs_right) > 0)
    khat_right <- compute_khat(fs_right)

  c(khat_left, khat_right)
}

# Check tail k-hats for all Markov chains
ensemble_tail_khats <- function(chains) {
  sapply(chains, function(c) tail_khats(c))
}

Particular Markov transition distributions, in some cases even certain families of Markov transition distributions, can also admit bespoke diagnostics that are more sensitive than these generic diagnostics. Engineering these diagnostics, however, requires deep understanding of those particular Markov transition distributions.

Finally note that these diagnostics assume not only the existence of a Markov chain Monte Carlo central limit theorem but also that the Markov transition distribution is recurrent so that there is only one stationary distribution and no ambiguity about the limiting behavior of empirical averages. Central limit theorems can also hold for reducible and periodic Markov transition distributions, but we have to be much more careful about the initial distribution and to which stationary distribution realized Markov chains will converge. While Markov chain Monte Carlo is technically possible with non-recurrent Markov transitions it is often much too fragile to be all that effective in practice.

2.5 Thinning

If \(\tau\) is a recurrent Markov transition with unique stationary distribution \(\pi\) then \(\tau^{M}\) will also be a recurrent Markov transition with unique stationary distribution \(\pi\) for any positive integer \(M\). Consequently thinning realized Markov chains by keeping only every \(M\)th iteration will not change the asymptotic properties of Markov chain Monte Carlo estimators.

For example if \[ (x_{0}, x_{1}, x_{2}, x_{3}, x_{4}, \ldots, x_{2 \, N}) \] is a Markov chain generated from the Markov transition \(\tau\) then \[ (x_{0}, x_{2}, x_{4}, \ldots, x_{2 \, N}) \] will be a Markov chain generated from the Markov transition \(\tau^{2}\). Both \[ \frac{1}{2 \, N + 1} \sum_{n = 0}^{2 \, N} f(x_{n}) \] and \[ \frac{1}{N + 1} \sum_{n = 0}^{N} f(x_{2 \, n}) \] yield Markov chain Monte Carlo estimators that will asymptotically converge to the corresponding stationary expectation value.
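
We can check this behavior with our demonstrative random walk transition; the helper thin_chain is a hypothetical sketch of the thinning operation. For \(\tau(x_{n + 1} \mid x_{n}) = \text{normal}(x_{n + 1} \mid x_{n}, \omega)\) the thinned transitions realize \(\tau^{M}(x_{n + M} \mid x_{n}) = \text{normal}(x_{n + M} \mid x_{n}, \sqrt{M} \, \omega)\), and so the increments of the thinned Markov chain should exhibit a standard deviation of about \(\sqrt{M} \, \omega\).

thin_chain <- function(chain, period) {
  chain[seq(1, length(chain), period)]
}

omega <- 3
rw_transition <- function(x) {
  rnorm(1, x, omega)
}

long_chain <- generate_chain(0, 10000, rw_transition)
thinned_chain <- thin_chain(long_chain, 4)

# Increments of the thinned chain realize tau^4 transitions
sd(diff(thinned_chain)) # Should be close to sqrt(4) * omega = 6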

For finite Markov chains thinning can influence the effective sample size for a given expectand, and hence the precision of the Markov chain Monte Carlo estimator. This dependence is complicated, but if the thinning period is shorter than the autocorrelation length then the effective sample size of the thinned Markov chain should not be much smaller than the effective sample size of the full Markov chain. When working with strongly autocorrelated Markov chains this means that thinning can yield shorter Markov chains, and hence a smaller memory footprint, without reducing the precision of Markov chain Monte Carlo estimators too much.

2.6 A Robust Markov Chain Monte Carlo Workflow

With the potentials and pitfalls of the Markov chain Monte Carlo central limit theorem established we can put together a workflow for implementing Markov chain Monte Carlo in a principled way.

We begin by implementing a function that realizes Markov transitions.

transition <- function(x) {
  # Sample a transition given the initial state x
}

Once we have specified a Markov transition in this way we are responsible for identifying the stationary distributions, if any, as well as whether the Markov transition distribution is recurrent or not. Alternatively if we have a particular target distribution of interest then we have to engineer a Markov transition distribution, ideally a recurrent one, that preserves that target distribution.

With a useful Markov transition ready we can generate Markov chains by repeatedly applying the transition to some initial states. The initial states will themselves have to be sampled from some initial distribution.

generate_chain <- function(x0, N, transition) {
  xs <- rep(0, N + 1)
  xs[1] <- x0
  for (n in 1:N) {
    xs[n + 1] <- transition(xs[n])
  }
  (xs)
}
N <- # Number of transitions
x0s <- # Vector of initial states
chains <- lapply(x0s, function(x0) generate_chain(x0, N, transition))

Once we've generated our Markov chains we can consider whether or not we want to define and then exclude a warmup interval.

remove_warmup <- function(chains, N_warmup) {
  lapply(chains, function(c) c[(N_warmup + 1):length(c)])
}
N_warmup <- 51
plot_traces(chains, N_warmup)

chains <- remove_warmup(chains, N_warmup)

plot_traces(chains)

At this point we can thin the realized Markov chains if desired. I don't recommend thinning your Markov chains by default, but instead thinning only when required by exceptional circumstances.

thin_chains <- function(chains, period) {
  lapply(chains, function(c) c[seq(1, length(c), period)])
}

Now we can analyze the realized Markov chains for each real-valued expectand \(f: X \rightarrow \mathbb{R}\) of interest. For example if \(X = \mathbb{R}\) then we can consider the identity function and if \(X = \mathbb{R}^{D}\) then we might consider each of the coordinate functions \[ \begin{alignat*}{6} \varpi_{d} :\; &X& &\rightarrow& \; &\mathbb{R}& \\ &(x_{1}, \ldots, x_{d}, \ldots, x_{D})& &\mapsto& &x_{d}&. \end{alignat*} \]

expectand <- function(x) {
  # Real-valued output
}
pushforward_chains <- function(chains, expectand) {
  lapply(chains, function(c) sapply(c, function(x) expectand(x)))
}
expectand_pushforward_chains <- pushforward_chains(chains, expectand)

For each of these expectands we need to check our empirical diagnostics for any behavior inconsistent with a Markov chain Monte Carlo central limit theorem.

check_diagnostics <- function(chains, name=NULL) {
  if (!is.null(name)) {
    cat(paste0('Expectand "', name, '":\n'))
  }

  cat("\n")

  rhat <- split_rhat(chains)
  if (is.nan(rhat)) {
    message <- paste0("  The expectand output is not varying within the ",
                      "individual Markov chains.  This could be because ",
                      "the exact posterior variance is zero or because ",
                      "the Markov transition is unable to move.\n")
    cat(message)
  } else if (rhat > 1.1) {
    message <- paste0("  Split r-hat is greater than 1.1 which is inconsistent ",
                      "with the expectand estimator satisfying a central limit ",
                      "theorem!\n")
    cat(message)
  } else {
    message <- paste0("  Split r-hat is less than 1.1 which is consistent ",
                      "with the expectand estimator satisfying a central limit ",
                      "theorem.\n")
    cat(message)
  }

  cat("\n")

  C <- length(chains)
  for (c in 1:C) {
    khat <- tail_khats(chains[[c]])
  
    cat(paste0("Chain ", c, ":\n"))
  
    if (khat[1] >= 0.25 & khat[2] >= 0.25) {
      message <- paste0("  Both tail k-hats are larger than 0.25 which is ",
                        "inconsistent with the expectand estimator satisfying ",
                        "a central limit theorem!\n")
      cat(message)
    } else if (khat[1] < 0.25 & khat[2] >= 0.25) {
      message <- paste0("  The upper tail k-hat is larger than 0.25  which is ",
                        "inconsistent with the expectand estimator satisfying ",
                        "a central limit theorem!\n")
      cat(message)
    } else if (khat[1] >= 0.25 & khat[2] < 0.25) {
      message <- paste0("  The lower tail k-hat is larger than 0.25  which is ",
                        "inconsistent with the expectand estimator satisfying ",
                        "a central limit theorem!\n")
      cat(message)
    } else {
      message <- paste0("  Both tail k-hats are smaller than 0.25 which is ",
                        "consistent with the expectand estimator satisfying ",
                        "a central limit theorem.\n")
      cat(message)
    }
  }
}
check_diagnostics(expectand_pushforward_chains, name="Expectand Name")

If we don't find any suspicious behavior then we can consider constructing Markov chain Monte Carlo estimators.

display_ensemble_mcmc_est(expectand_pushforward_chains, name="Expectand Name")

In addition to estimating the expectation value \(\mathbb{E}_{\pi}[f]\) we can also analyze the entire pushforward distribution \(f_{*} \pi\) by estimating bin probabilities to form a histogram.

plot_density_hist <- function(chains, bins) {
  B <- length(bins) - 1
  mean_p <- rep(0, B)
  delta_p <- rep(0, B)

  for (b in 1:B) {
    bin_indicator <- function(x) {
      ifelse(bins[b] <= x & x < bins[b + 1], 1, 0)
    }
    indicator_chains <- pushforward_chains(chains, bin_indicator)
    est <- ensemble_mcmc_est(indicator_chains)
  
    # Normalize bin probabilities by bin width to allow
    # for direct comparison to probability density functions
    width = bins[b + 1] - bins[b]
    mean_p[b] = est[1] / width
    delta_p[b] = est[2] / width
  }

  idx <- rep(1:B, each=2)
  x <- sapply(1:length(idx), function(b) if(b %% 2 == 1) bins[idx[b]] else bins[idx[b] + 1])
  lower_inter <- sapply(idx, function (n) max(mean_p[n] - 2 * delta_p[n], 0))
  upper_inter <- sapply(idx, function (n) mean_p[n] + 2 * delta_p[n])

  min_y <- min(lower_inter)
  max_y <- max(1.05 * upper_inter)

  plot(1, type="n", main="",
       xlim=range(bins), xlab="x",
       ylim=c(min_y, max_y), ylab="Estimated Bin\nProbability Densities")

  polygon(c(x, rev(x)), c(lower_inter, rev(upper_inter)),
          col = c_light, border = NA)
  lines(x, mean_p[idx], col=c_dark, lwd=2)
}
plot_density_hist(expectand_pushforward_chains, seq(f_min, f_max, delta_f))

Because of the fundamental diagnostic limitations this workflow will not guarantee that we can trust the ultimate estimators or their probabilistic errors, but it will ensure that we don't make any avoidable mistakes.

3 Building Markov Transitions

For Markov chain Monte Carlo to be useful in practice we need to be able to engineer well-behaved Markov transitions that preserve a given target distribution. This is no easy feat. Indeed engineering Markov transitions whose stationary distributions can even be identified is challenging. That said two techniques can often facilitate these tasks.

Useful Markov transitions are often easiest to construct through an intermediate variable \(\phi\) that is generated in between the initial and final states. In this case the Markov transition density function can be written as \[ \tau(x_{n + 1} \mid x_{n}) = \int \mathrm{d} \phi \, \tau(x_{n + 1}, \phi \mid x_{n}), \] where \(\tau(x_{n + 1}, \phi \mid x_{n})\) defines a joint Markov transition over the intermediate variable and final state.

When the intermediate variable is generated after the initial state but before the final state this joint transition density admits the conditional decomposition \[ \tau(x_{n + 1}, \phi \mid x_{n}) = \tau(x_{n + 1} \mid \phi, x_{n}) \, \tau(\phi \mid x_{n}) \] so that the final Markov transition density becomes \[ \tau(x_{n + 1} \mid x_{n}) = \int \mathrm{d} \phi \, \tau(x_{n + 1} \mid \phi, x_{n}) \, \tau(\phi \mid x_{n}). \] In other words each final state can be generated in two stages: first we sample an intermediate state \(\tilde{\phi}\) from the given initial state \(\tilde{x}_{n}\), \[ \tilde{\phi} \sim \tau(\phi \mid \tilde{x}_{n}) \] and then we sample a final state \(\tilde{x}_{n + 1}\) from \(\tilde{\phi}\) and \(\tilde{x}_{n}\) \[ \tilde{x}_{n + 1} \sim \tau(x_{n + 1} \mid \tilde{\phi}, \tilde{x}_{n}). \]

Another common tool in the construction of Markov transitions is the Dirac delta function. The Dirac delta function over a real-valued space, \(\delta(x - x')\), is defined by its integral properties, \[ \int \mathrm{d} x \, \delta(x - x') \, f(x) = f(x'), \] for all functions \(f(x)\). When the domain of integration is not the entire ambient space then we have to be careful to consider whether \(x'\) is in the domain of integration or not. If it is then the integral will behave the same, but if it is not then the integral will vanish, \[ \int_{A} \mathrm{d} x \, \delta(x - x') \, f(x) = \left\{ \begin{array}{rr} f(x') , & x' \in A \\ 0, & x' \notin A \end{array} \right. . \]

We can encode this inclusion criterion more compactly using an indicator function, \[ \int_{A} \mathrm{d} x \, \delta(x - x') \, f(x) = f(x') \, \mathbb{I}[x' \in A]. \] For example if \(X = \mathbb{R}\) and \(A = [x_{1}, x_{2}]\) then \[ \int_{x_{1}}^{x_{2}} \mathrm{d} x \, \delta(x - x') \, f(x) = f(x') \, \mathbb{I}[x_{1} \le x' \le x_{2} ]. \]

The Dirac delta function \(\delta(x - x')\) can be interpreted as a probability distribution that concentrates entirely on the point \(x'\). In particular this allows us to build completely deterministic Markov transitions. For example the Markov transition specified by the transition density \[ \tau(x_{n + 1} \mid x_{n}) = \delta( x_{n + 1} - x_{n} ) \] always returns the initial state. On the other hand \[ \tau(x_{n + 1} \mid x_{n}) = \delta( x_{n + 1} - x_{n} / 2 ) \] specifies a Markov transition that shrinks the initial state towards the origin.

Often these two techniques are integrated together. For example the transition \[ \begin{align*} \tau(x_{n + 1} \mid x_{n}) &= \int \mathrm{d} \phi \, \tau(x_{n + 1} \mid \phi, x_{n}) \, \tau(\phi \mid x_{n}) \\ &= \int \mathrm{d} \phi \, \delta(x_{n + 1} - \phi / 2) \, \text{normal}(\phi \mid x_{n}, 1) \end{align*} \] can be implemented by first sampling an intermediate state around the initial state, \[ \tilde{\phi} \sim \text{normal}(\phi \mid \tilde{x}_{n}, 1) \] and then fixing the final state to half that value, \[ \tilde{x}_{n + 1} = \tilde{\phi} / 2. \]
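This two-stage recipe translates directly into code. Here is a minimal sketch, following the convention of the other transition functions in this case study:

transition <- function(x) {
  # Stage one: sample an intermediate state around the initial state
  phi <- rnorm(1, x, 1)
  # Stage two: deterministically shrink the intermediate state
  (phi / 2)
}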

Using the properties of the Dirac delta function we can eliminate the intermediate variable in this case. First we make the substitution \(u = \phi / 2\) so that the Dirac delta function directly depends on the variable of integration, \[ \begin{align*} \tau(x_{n + 1} \mid x_{n}) &= \int \mathrm{d} \phi \, \delta(x_{n + 1} - \phi / 2) \, \text{normal}(\phi \mid x_{n}, 1) \\ &= 2 \int \mathrm{d} u \, \delta(x_{n + 1} - u) \, \text{normal}(2 \, u \mid x_{n}, 1), \end{align*} \] and then we apply the definition of the Dirac delta function, \[ \begin{align*} \tau(x_{n + 1} \mid x_{n}) &= 2 \int \mathrm{d} u \, \delta(x_{n + 1} - u) \, \text{normal}(2 \, u \mid x_{n}, 1) \\ &= 2 \, \text{normal}(2 \, x_{n + 1} \mid x_{n}, 1) \\ &= \text{normal}(x_{n + 1} \mid x_{n} / 2, 1 / 2). \end{align*} \] This explicit Markov transition density is often more amenable to theoretical analysis while the two-stage construction is more useful for implementing the Markov transition in practice.
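As a quick sanity check we can compare an ensemble of two-stage transitions from a fixed initial state to this explicit density. This is only a sketch, assuming the transition function defined above; the seed and initial state are arbitrary.

set.seed(8473625)

x0 <- 3
R <- 10000
x1_ensemble <- sapply(1:R, function(r) transition(x0))

hist(x1_ensemble, breaks=seq(-2, 5, 0.25), prob=TRUE,
     xlab="x_{n + 1}", ylab="tau", yaxt='n',
     col=c_dark, border=c_dark_highlight)

# Overlay the explicit transition density with location x0 / 2 and scale 1 / 2
plot_xs <- seq(-2, 5, 0.01)
lines(plot_xs, dnorm(plot_xs, x0 / 2, 0.5), lwd=2, col="black")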

4 Markov Chains in Action

To this point we have demonstrated some of the basics of Markov chain Monte Carlo with the superficially simple Markov transition density function \[ \tau(x_{n + 1} \mid x_{n}) = \text{normal}(x_{n + 1} \mid x_{n}, \omega). \] Unfortunately the only stationary distribution is somewhat ill-posed. In particular because the variance of most expectands is exactly zero, for example for indicator functions, infinite, for example for even powers, or ill-defined, for example for odd powers, a Markov chain Monte Carlo central limit theorem does not hold for the most common expectands.

In this section we'll consider some slightly more complicated Markov transition distributions defined over the ambient space \(X = \mathbb{R}\) that isolate, and allow us to study, various failure modes for Markov chain Monte Carlo. To avoid a deluge of integrals I will state properties of these Markov transition distributions here without proof, leaving the explicit calculations to the Appendix.

4.1 Transition One: Reducible

Let's begin by considering a reducible, but aperiodic, Markov transition distribution. Recall that a Markov transition distribution is reducible if certain sets are allocated zero probability by all of the \(N\)-step distributions, \[ \tau^{n} \circ \rho (A) = \int_{A} \mathrm{d} x \, \tau^{n} \circ \rho (x) = 0 \] for all positive integers \(n \in \mathbb{N}\). In other words no matter how long our Markov chains become they will never reach the set \(A\).

Often reducibility manifests as a partition of the ambient space into disjoint sets, \[ \begin{align*} A_{i} &\subset X \\ A_{i} \cap A_{i' \neq i} &= \emptyset \\ \cup_{i} A_{i} &= X, \end{align*} \] such that for any \(x_{0} \in A_{i}\) \[ \tau^{n} \circ \delta_{x_{0}} (A_{i' \neq i}) = 0 \, \forall n \in \mathbb{N}. \] Equivalently we would have \[ \tau^{n} \circ \delta_{x_{0}} (A_{i}) = 1 \, \forall n \in \mathbb{N}. \] Intuitively any Markov chain initialized at the point \(x_{0} \in A_{i}\) will be forever confined to \(A_{i}\) and unable to escape to any other neighborhood of the ambient space.

Such reducible Markov transition distributions typically admit multiple stationary distributions that correspond to these disjoint sets. For example one of these stationary distributions would span all of the sets while others would each be confined to a single partition set. From these we could then build even more stationary distributions that span combinations of these sets.

In this case Markov chains will asymptotically converge to the stationary distributions compatible with the initial distribution. The \(N\)-step distribution derived from an initial distribution that spans all \(X\) will typically converge to the stationary distribution that spans all of \(X\). Likewise the \(N\)-step distribution derived from an initial distribution that spans only a single partition set \(A_{i}\) will typically converge to a stationary distribution that concentrates within that partition. Consequently many properties of the initial distribution will persist and continue to influence the behavior of realized Markov chains no matter how long they are!

To demonstrate these behaviors consider a Markov transition that simulates an intermediate variable from a distribution specified by an exponential density function, \[ \tilde{\phi} \sim \text{exponential} (\phi \mid 1) \] and then selects a new state using the conditional logic \[ \tilde{x}_{n + 1} = \left\{ \begin{array}{rr} 2 \, \min (\tilde{x}_{n}, +\tilde{\phi}) , & \tilde{x}_{n} > 0 \\ 2 \, \max (\tilde{x}_{n}, -\tilde{\phi}), & \tilde{x}_{n} \le 0 \end{array} \right. . \] This simulation is also straightforward to implement programmatically.

transition <- function(x) {
  phi <- rexp(1, 1)
  if (x > 0)
    (2 * min(x, phi))
  else
    (2 * max(x, -phi))
}

Probabilistically we can encode this conditional but deterministic update using a mix of Dirac delta functions and indicator functions that partition the possible interactions between the initial state \(x_{n}\) and the intermediate variable \(\phi\), \[ \begin{align*} \tau(x_{n + 1} \mid \phi, x_{n}) &= \;\;\; \mathbb{I}[ x_{n} > 0 ] \, \delta( x_{n + 1} - 2 \, \min(x_{n}, \phi) ) \\ & \quad + \mathbb{I}[ x_{n} \le 0 ] \, \delta( x_{n + 1} - 2 \, \max(x_{n}, -\phi) ) \\ &= \;\;\; \mathbb{I}[ x_{n} > 0 ] \, \mathbb{I}[ x_{n} < \phi ] \, \delta( x_{n + 1} - 2 \, x_{n} ) \\ & \quad + \mathbb{I}[ x_{n} > 0 ] \, \mathbb{I}[ x_{n} \ge \phi ] \, \delta( x_{n + 1} - 2 \, \phi ) \\ & \quad + \mathbb{I}[ x_{n} \le 0 ] \, \mathbb{I}[ x_{n} > -\phi ] \, \delta( x_{n + 1} - 2 \, x_{n} ) \\ & \quad + \mathbb{I}[ x_{n} \le 0 ] \, \mathbb{I}[ x_{n} \le -\phi ] \, \delta( x_{n + 1} + 2 \, \phi ). \end{align*} \] All four of these terms can be further simplified by combining the indicator functions together, \[ \begin{align*} \tau(x_{n + 1} \mid \phi, x_{n}) &= \;\;\; \mathbb{I}[ x_{n} > 0 ] \, \mathbb{I}[ x_{n} < +\phi ] \, \delta( x_{n + 1} - 2 \, x_{n} ) \\ & \quad + \mathbb{I}[ x_{n} > 0 ] \, \mathbb{I}[ x_{n} \ge +\phi ] \, \delta( x_{n + 1} - 2 \, \phi ) \\ & \quad + \mathbb{I}[ x_{n} \le 0 ] \, \mathbb{I}[ x_{n} > -\phi ] \, \delta( x_{n + 1} - 2 \, x_{n} ) \\ & \quad + \mathbb{I}[ x_{n} \le 0 ] \, \mathbb{I}[ x_{n} \le -\phi ] \, \delta( x_{n + 1} + 2 \, \phi ) \\ &= \;\;\; \mathbb{I}[ 0 < x_{n} < +\phi ] \, \delta( x_{n + 1} - 2 \, x_{n} ) \\ & \quad + \mathbb{I}[ 0 \le +\phi \le x_{n} ] \, \delta( x_{n + 1} - 2 \, \phi ) \\ & \quad + \mathbb{I}[ -\phi < x_{n} \le 0 ] \, \delta( x_{n + 1} - 2 \, x_{n} ) \\ & \quad + \mathbb{I}[ x_{n} \le -\phi \le 0 ] \, \delta( x_{n + 1} + 2 \, \phi ). \end{align*} \]

The complete Markov transition density function is then given by \[ \begin{align*} \tau(x_{n + 1} \mid x_{n}) &= \int_{0}^{\infty} \mathrm{d} \phi \, \tau(x_{n + 1} \mid \phi, x_{n}) \, \tau(\phi \mid x_{n}) \\ &= \quad \int_{0}^{\infty} \mathrm{d} \phi \, \mathbb{I}[ 0 < x_{n} < +\phi ] \, \delta( x_{n + 1} - 2 \, x_{n} ) \, \text{exponential} (\phi \mid 1) \\ & \quad + \int_{0}^{\infty} \mathrm{d} \phi \, \mathbb{I}[ 0 \le +\phi \le x_{n} ] \, \delta( x_{n + 1} - 2 \, \phi) \,\;\; \text{exponential} (\phi \mid 1) \\ & \quad + \int_{0}^{\infty} \mathrm{d} \phi \, \mathbb{I}[ -\phi < x_{n} \le 0 ] \, \delta( x_{n + 1} - 2 \, x_{n} ) \, \text{exponential} (\phi \mid 1) \\ & \quad + \int_{0}^{\infty} \mathrm{d} \phi \, \mathbb{I}[ x_{n} \le -\phi \le 0 ] \, \delta( x_{n + 1} + 2 \, \phi) \,\;\; \text{exponential} (\phi \mid 1) \\ &= \quad \int_{0}^{\infty} \mathrm{d} \phi \, \mathbb{I}[ 0 < x_{n} < +\phi ] \, \delta( x_{n + 1} - 2 \, x_{n} ) \, \exp(-\phi) \\ & \quad + \int_{0}^{\infty} \mathrm{d} \phi \, \mathbb{I}[ 0 \le +\phi \le x_{n} ] \, \delta( x_{n + 1} - 2 \, \phi) \,\;\; \exp(-\phi) \\ & \quad + \int_{0}^{\infty} \mathrm{d} \phi \, \mathbb{I}[ -\phi < x_{n} \le 0 ] \, \delta( x_{n + 1} - 2 \, x_{n} ) \, \exp(-\phi) \\ & \quad + \int_{0}^{\infty} \mathrm{d} \phi \, \mathbb{I}[ x_{n} \le -\phi \le 0 ] \, \delta( x_{n + 1} + 2 \, \phi) \,\;\; \exp(-\phi). \end{align*} \]

A direct calculation shows that this Markov transition preserves the probability distribution specified by the Laplace density function, \[ \text{Laplace}(x; 0, 1) = \frac{1}{2} \exp( - | x | ). \] Under ideal circumstances Markov chain Monte Carlo estimators will also rapidly converge to stationary expectation values. To see how ideal the circumstances actually are let's realize some Markov chains.
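Before doing so we can informally corroborate the stationarity claim: if we push a large ensemble of exact Laplace samples through a single Markov transition then the output ensemble should also look Laplace-distributed. This is only a sketch, assuming the transition function above and an arbitrary seed.

set.seed(2853011)

# Generate exact samples from the Laplace stationary distribution
R <- 10000
laplace_samples <- sample(c(-1, 1), R, replace=TRUE) * rexp(R, 1)

# Push each sample through a single Markov transition
pushed_samples <- sapply(laplace_samples, transition)

# If the stationary distribution is preserved then the two ensembles
# should exhibit similar quantiles
quantile(laplace_samples, probs=c(0.1, 0.25, 0.5, 0.75, 0.9))
quantile(pushed_samples, probs=c(0.1, 0.25, 0.5, 0.75, 0.9))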

We'll generate four Markov chains from different point initializations. If all of these Markov chains converge to the same stationary distribution then they should yield consistent Markov chain Monte Carlo estimators.

x0s <- c(-2, -5, -13, 20)
N <- 1000
chains <- lapply(x0s, function(x0) generate_chain(x0, N, transition))

Because the ambient space is \(\mathbb{R}\) we will see a pretty complete picture if we use the identity function as our expectand of interest.

identity <- function(x) {
  (x)
}

id_pushforward_chains <- pushforward_chains(chains, identity)

The exact expectation value of this expectand is \[ \mathbb{E}_{\pi}[\text{Id}] = \int_{-\infty}^{\infty} \mathrm{d} x \, \text{Laplace}(x; 0, 1) \, x = 0. \]

Unfortunately even after discarding states in an initial warmup period there are clear discrepancies between the expectand values in the four realized Markov chains. The first three are confined to only negative values while the last is confined to only positive values!

N_warmup <- 51
plot_traces(id_pushforward_chains, N_warmup)

id_pushforward_chains <- remove_warmup(id_pushforward_chains, N_warmup)

Fortunately the split \(\hat{R}\) statistic is sensitive to this discrepancy and our diagnostics flag the problem.

check_diagnostics(id_pushforward_chains, name="Id")
Expectand "Id":

  Split r-hat is greater than 1.1 which is inconsistent with the expectand estimator satisfying a central limit theorem!

Chain 1:
  Both tail k-hats are smaller than 0.25 which is consistent with the expectand estimator satisfying a central limit theorem.
Chain 2:
  Both tail k-hats are smaller than 0.25 which is consistent with the expectand estimator satisfying a central limit theorem.
Chain 3:
  Both tail k-hats are smaller than 0.25 which is consistent with the expectand estimator satisfying a central limit theorem.
Chain 4:
  Both tail k-hats are smaller than 0.25 which is consistent with the expectand estimator satisfying a central limit theorem.

If we weren't paying attention to the diagnostics, however, and assumed that each of the Markov chains satisfied compatible central limit theorems then we would have arrived at an ensemble estimate far from the exact value of \(0\),

display_ensemble_mcmc_est(id_pushforward_chains, name="Id")
E[Id] = -0.449 +/- 0.076 (Total Effective Sample Size = 1295)

That said we might have noticed problems if we had looked at the Markov chain Monte Carlo estimators from each individual Markov chain.

plot_est_comp(id_pushforward_chains)

The discrepancy also manifests in the pushforward distribution of the expectand: the first three Markov chains estimate pushforward distributions that concentrate entirely on negative values while the fourth concentrates entirely on positive values. Both behaviors are inconsistent with convergence to a stationary distribution specified by a Laplace density function.

par(mfrow=c(2, 2))

C <- length(chains)
for (c in 1:C) {
  plot_density_hist(id_pushforward_chains[c], seq(-10, 10, 0.3))
  title(main=paste("Chain", c))
  plot_xs <- seq(-10, 10, 0.01)
  plot_ys <- 0.5 * dexp(abs(plot_xs), 1)
  lines(plot_xs, plot_ys, lwd=3, col="white")
  lines(plot_xs, plot_ys, lwd=2, col="black")
}

Naively aggregating these inconsistent behaviors results in an ensemble pushforward distribution with an awkward jump at \(x = 0\).

par(mfrow=c(1, 1))

plot_density_hist(id_pushforward_chains, seq(-10, 10, 0.3))
plot_xs <- seq(-10, 10, 0.01)
plot_ys <- 0.5 * dexp(abs(plot_xs), 1)
lines(plot_xs, plot_ys, lwd=3, col="white")
lines(plot_xs, plot_ys, lwd=2, col="black")

Had we more carefully analyzed the Markov transition distribution before generating our Markov chains we would have been far more prepared for these inconsistencies. The Markov transition distribution is reducible, with transitions from states at positive values confined to other positive values and vice versa.

While \(\text{Laplace}(x; 0, 1)\) does specify a valid stationary distribution so too does \[ \begin{align*} \text{pos-Laplace}(x; 0, 1) &\equiv 2 \, \text{Laplace}(x; 0, 1) \, \mathbb{I}[x > 0] \\ &= \exp( -x ) \, \mathbb{I}[x > 0] \end{align*} \] and \[ \begin{align*} \text{neg-Laplace}(x; 0, 1) &\equiv 2 \, \text{Laplace}(x; 0, 1) \, \mathbb{I}[x \le 0] \\ &= \exp( +x ) \, \mathbb{I}[x \le 0]! \end{align*} \]

The Markov chains from our three negative point initializations asymptotically converge to the stationary distribution specified by \(\text{neg-Laplace}(x; 0, 1)\). Indeed if we consider just those three Markov chains then we recover estimates consistent with the exact expectation value of \(-1\).
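That exact value follows from a one-dimensional integral that we can also verify with numerical quadrature; a quick sketch using base R:

# The identity expectation value with respect to neg-Laplace(x; 0, 1);
# the quadrature should return -1 up to numerical error
integrate(function(x) x * exp(x), lower=-Inf, upper=0)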

check_diagnostics(id_pushforward_chains[1:3], name="Id")
Expectand "Id":

  Split r-hat is less than 1.1 which is consistent with the expectand estimator satisfying a central limit theorem.

Chain 1:
  Both tail k-hats are smaller than 0.25 which is consistent with the expectand estimator satisfying a central limit theorem.
Chain 2:
  Both tail k-hats are smaller than 0.25 which is consistent with the expectand estimator satisfying a central limit theorem.
Chain 3:
  Both tail k-hats are smaller than 0.25 which is consistent with the expectand estimator satisfying a central limit theorem.
display_ensemble_mcmc_est(id_pushforward_chains[1:3], name="Id")
E[Id] = -1.011 +/- 0.067 (Total Effective Sample Size = 935)
plot_est_comp(id_pushforward_chains[1:3])

plot_density_hist(id_pushforward_chains[1:3], seq(-10, 1, 0.25))
plot_xs <- seq(-10, 10, 0.01)
plot_ys <- sapply(plot_xs, function(x) ifelse(x <= 0, dexp(-x), 0))
lines(plot_xs, plot_ys, lwd=3, col="white")
lines(plot_xs, plot_ys, lwd=2, col="black")

Similarly the last Markov chain yields estimates consistent with \(\text{pos-Laplace}(x; 0, 1)\) and its exact identity expectation value of \(+1\).

check_diagnostics(id_pushforward_chains[4], name="Id")
Expectand "Id":

  Split r-hat is less than 1.1 which is consistent with the expectand estimator satisfying a central limit theorem.

Chain 1:
  Both tail k-hats are smaller than 0.25 which is consistent with the expectand estimator satisfying a central limit theorem.
display_ensemble_mcmc_est(id_pushforward_chains[4], name="Id")
E[Id] = 1.012 +/- 0.105 (Total Effective Sample Size = 360)
plot_est_comp(id_pushforward_chains[4])

par(mfrow=c(1, 1))

plot_density_hist(id_pushforward_chains[4], seq(-1, 10, 0.25))
plot_xs <- seq(-10, 10, 0.01)
plot_ys <- sapply(plot_xs, function(x) ifelse(x > 0, dexp(x), 0))
lines(plot_xs, plot_ys, lwd=3, col="white")
lines(plot_xs, plot_ys, lwd=2, col="black")

In order for our Markov chains to converge to \(\text{Laplace}(x; 0, 1)\) they need to be initialized from an initial distribution that covers both negative and positive values. No single Markov chain realization will converge to this stationary distribution, but an ensemble of Markov chain realizations from such an initialization might.

For example if we get lucky and generate the same number of positive and negative initializations then our ensemble Markov chain Monte Carlo estimators will be consistent with the stationary distribution specified by \(\text{Laplace}(x; 0, 1)\).

set.seed(48385939)
x0s <- rnorm(6, 0, 10)
chains <- lapply(x0s, function(x0) generate_chain(x0, N, transition))
chains <- remove_warmup(chains, N_warmup)
id_pushforward_chains <- pushforward_chains(chains, identity)
display_ensemble_mcmc_est(id_pushforward_chains, name="Id")
E[Id] = -0.014 +/- 0.062 (Total Effective Sample Size = 2089)
par(mfrow=c(1, 1))

plot_density_hist(id_pushforward_chains, seq(-10, 10, 0.3))
plot_xs <- seq(-10, 10, 0.01)
plot_ys <- 0.5 * dexp(abs(plot_xs), 1)
lines(plot_xs, plot_ys, lwd=3, col="white")
lines(plot_xs, plot_ys, lwd=2, col="black")

Relying on coincidentally balanced initializations, however, results in extremely fragile Markov chain Monte Carlo.

4.2 Transition Two: Periodic

With some familiarity with the fragility typical of reducible Markov transition distributions let's explore what can happen with periodic Markov transitions.

While aperiodicity is required for recurrence, and hence for the \(N\)-step distributions to converge towards a stationary distribution, it is not actually a requirement for empirical averages to converge to stationary expectation values. If \(\tau\) is a periodic but irreducible Markov transition distribution then empirical averages will still asymptotically converge to stationary expectation values, and in some cases the variation of the empirical averages from finite Markov chains can be quantified with a central limit theorem.

The key subtlety of periodic Markov transition distributions is in the behavior of composite Markov transitions, \[ \tau^{M} (x_{n + M} \mid x_{n}) = \int \left[ \prod_{m = 1}^{M - 1} \mathrm{d} x_{n + m} \right] \prod_{m = 1}^{M} \tau(x_{n + m} \mid x_{n + m - 1}). \] For an irreducible and aperiodic Markov transition distribution \(\tau^{M}\) will share the same stationary distribution as \(\tau\) for all positive integers \(M \in \mathbb{N}\); for an irreducible but periodic Markov transition distribution, however, the composite Markov transition distribution \(\tau^{M}\) for some \(M\) can exhibit a different stationary distribution than that of the base Markov transition distribution \(\tau\).
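Sampling from a composite Markov transition just means applying the base transition repeatedly. A hypothetical helper function, not used elsewhere in this case study, makes the construction concrete:

# Sample from the composite Markov transition tau^M by iterating a
# single-step transition function M times
composite_transition <- function(x, M, transition) {
  for (m in 1:M) {
    x <- transition(x)
  }
  (x)
}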

Formally a Markov transition distribution is periodic if it admits a periodic decomposition into at least two disjoint sets \(A_{1}, \ldots, A_{i}, \ldots, A_{I}\) such that all Markov transitions that start in \(A_{i}\) concentrate in \(A_{i + 1}\), \[ \tau(A_{i + 1} \mid x_{0}) = 1 \] for all \(x_{0} \in A_{i}\), with the convention \(A_{I + 1} \equiv A_{1}\). Realized Markov chains will always transition from a state in \(A_{1}\) to a state in \(A_{2}\) to a state in \(A_{3}\) and so on until after \(I\) transitions they return to \(A_{1}\).

If we compose a periodic Markov transition distribution with itself \(I\) times then we end up with a composite Markov transition distribution \(\tau^{I}\) that cannot escape the set in which it is initialized. In other words \(\tau^{M \, I}\) for any positive integer \(M\) will be a reducible Markov transition distribution even if \(\tau\) isn't. Markov chain Monte Carlo estimators constructed from Markov chains realized from \(\tau\) will typically converge to the corresponding stationary expectation value asymptotically, but those from \(\tau^{M \, I}\) will not. Specifically thinning Markov chains with a stride equal to any integer multiple of \(I\) will be problematic.

To demonstrate this behavior consider a Markov transition distribution that repeatedly reflects each state across the origin \(x = 0\), \[ \begin{align*} \tilde{\phi} &\sim \text{pos-half-normal} (\phi \mid 0, 1) \\ \tilde{x}_{n + 1} &= - \text{sign}(x_{n}) \, \tilde{\phi}, \end{align*} \] where \[ \text{pos-half-normal}(x \mid 0, 1) \equiv 2 \, \text{normal}(x \mid 0, 1) \, \mathbb{I}[x > 0]. \]

transition <- function(x) {
  # Map positive states to negative values and non-positive states,
  # including x = 0, to positive values per the encoding below
  s <- ifelse(x > 0, 1, -1)
  (-s * abs(rnorm(1, 0, 1)))
}

We can also probabilistically encode this as \[ \begin{align*} \tau(\phi \mid x_{n}) &= \text{pos-half-normal}(\phi \mid 0, 1) \\ \tau(x_{n + 1} \mid \phi, x_{n}) &= \delta( x_{n + 1} + \text{sign}(x_{n}) \, \phi ), \end{align*} \] so that full Markov transition density function is then given by \[ \begin{align*} \tau(x_{n + 1} \mid x_{n}) &= \int_{0}^{\infty} \mathrm{d} \phi \, \tau(x_{n + 1} \mid \phi, x_{n}) \, \tau(\phi \mid x_{n}) \\ &= \int_{0}^{\infty} \mathrm{d} \phi \, \delta( x_{n + 1} + \text{sign}(x_{n}) \, \phi) \, \text{pos-half-normal}(\phi \mid 0, 1) \\ &= \;\;\; \mathbb{I}[ x_{n} > 0 ] \, \int_{0}^{\infty} \mathrm{d} \phi \, \delta( x_{n + 1} + \phi) \, \text{pos-half-normal}(\phi \mid 0, 1) \\ &\quad + \mathbb{I}[ x_{n} \le 0 ] \, \int_{0}^{\infty} \mathrm{d} \phi \, \delta( x_{n + 1} - \phi) \, \text{pos-half-normal}(\phi \mid 0, 1) \\ &= \;\;\;\, \mathbb{I}[ x_{n} > 0 ] \, \text{neg-half-normal}(x_{n + 1} \mid 0, 1) \\ & \quad + \mathbb{I}[ x_{n} \le 0 ] \, \text{pos-half-normal}(x_{n + 1} \mid 0, 1), \end{align*} \] where \[ \text{neg-half-normal}(x \mid 0, 1) \equiv 2 \, \text{normal}(x \mid 0, 1) \, \mathbb{I}[x \le 0]. \]

This Markov transition distribution is irreducible, but it oscillates between the positive interval \(\mathbb{R}^{+} = (0, \infty)\) and the negative interval \(\mathbb{R}^{-} = (-\infty, 0]\) after each transition. In other words \(\mathbb{R}^{+}\) and \(\mathbb{R}^{-}\) form a periodic decomposition with period \(2\).

Direct calculation confirms that the unique stationary distribution for this Markov transition is specified by the probability density function \(\text{normal}(x \mid 0, 1)\). Consequently Markov chain Monte Carlo estimators for the expectation of the identity function should converge to the exact value of \(0\).

As before we will start by generating four Markov chains from different point initializations and remove an initial warmup interval.

N <- 1000
x0s <- c(7, -7, 2, 18)
chains <- lapply(x0s, function(x0) generate_chain(x0, N, transition))

N_warmup <- 51
chains <- remove_warmup(chains, N_warmup)

Now let's see how the output of the identity function, which is just the states themselves, varies across the Markov chains.

id_pushforward_chains <- pushforward_chains(chains, identity)

plot_traces(id_pushforward_chains)

If we zoom in we can already see some interesting behavior manifesting in these realized Markov chains.

chain_segments <- lapply(id_pushforward_chains, function(c) c[100:110])
plot_traces(chain_segments)

Each realized Markov chain oscillates around \(x = 0\), with the particular phase of that oscillation depending on whether the Markov chain is initialized in \(\mathbb{R}^{+}\) or \(\mathbb{R}^{-}\). This apparent repulsion in the expectand output between neighboring states is referred to as antithetic behavior.
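One way to quantify this antithetic behavior is with the lag-one autocorrelation of the expectand outputs, which here should be strongly negative. A quick sketch using the acf function from base R:

# Strongly negative lag-one autocorrelations quantify the antithetic
# oscillations within each realized Markov chain
sapply(id_pushforward_chains,
       function(c) acf(c, lag.max=1, plot=FALSE)$acf[2])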

Because all of the Markov chains share the same antithetic behavior our diagnostics don't pick up any signs of problems.

check_diagnostics(id_pushforward_chains, name="Id")
Expectand "Id":

  Split r-hat is less than 1.1 which is consistent with the expectand estimator satisfying a central limit theorem.

Chain 1:
  Both tail k-hats are smaller than 0.25 which is consistent with the expectand estimator satisfying a central limit theorem.
Chain 2:
  Both tail k-hats are smaller than 0.25 which is consistent with the expectand estimator satisfying a central limit theorem.
Chain 3:
  Both tail k-hats are smaller than 0.25 which is consistent with the expectand estimator satisfying a central limit theorem.
Chain 4:
  Both tail k-hats are smaller than 0.25 which is consistent with the expectand estimator satisfying a central limit theorem.

Even better the ensemble estimator for \(\mathbb{E}_{\pi}[\text{Id}]\) is consistent with the exact value of \(0\).

display_ensemble_mcmc_est(id_pushforward_chains, name="Id")
E[Id] = 0.004 +/- 0.023 (Total Effective Sample Size = 7600)

Note also that the total effective sample size is larger than the \(3800\) total post-warmup iterations. Consequently the precision of this Markov chain Monte Carlo estimator is actually better than a pure Monte Carlo estimator based on exact samples from the stationary distribution!

Unfortunately this enhancement is not universal to all expectands. For example the pushforward of the Markov chain states along the square function, \[ \begin{alignat*}{6} \text{sq} :\; &\mathbb{R}& &\rightarrow& \; &\mathbb{R}^{+}& \\ &x& &\mapsto& &x^{2}&, \end{alignat*} \] does not exhibit any antithetic behavior.

square <- function(x) {
  (x^2)
}

sq_pushforward_chains <- pushforward_chains(chains, square)

chain_segments <- lapply(sq_pushforward_chains, function(c) c[100:110])
plot_traces(chain_segments)

Moreover the estimator for the expectation of the square function, whose exact value is \(1\), is less precise than a corresponding Monte Carlo estimator would be.

display_ensemble_mcmc_est(sq_pushforward_chains, name="Square")
E[Square] = 1.044 +/- 0.050 (Total Effective Sample Size = 3662)

That said both estimators are consistent with the exact stationary expectation values which suggests that a Markov chain Monte Carlo central limit theorem does hold. This is corroborated by the estimated histogram which matches the stationary behavior.

par(mfrow=c(1, 1))

plot_density_hist(chains, seq(-5, 5, 0.3))
plot_xs <- seq(-10, 10, 0.01)
plot_ys <- dnorm(plot_xs, 0, 1)
lines(plot_xs, plot_ys, lwd=3, col="white")
lines(plot_xs, plot_ys, lwd=2, col="black")

Now if this Markov transition distribution were both irreducible and aperiodic then we should be able to thin it without compromising the accuracy of our Markov chain Monte Carlo estimators. Let's start by thinning to every other state.

thinned_chains <- thin_chains(chains, 2)
id_pushforward_chains <- pushforward_chains(thinned_chains, identity)

Trace plots indicate that this seemingly innocent thinning has introduced troubling behavior. All four of the thinned Markov chains are confined to either \(\mathbb{R}^{+}\) or \(\mathbb{R}^{-}\)!

plot_traces(id_pushforward_chains)

Consequently the split \(\hat{R}\) diagnostic flags a problem.

check_diagnostics(id_pushforward_chains, name="Id")
Expectand "Id":

  Split r-hat is greater than 1.1 which is inconsistent with the expectand estimator satisfying a central limit theorem!

Chain 1:
  Both tail k-hats are smaller than 0.25 which is consistent with the expectand estimator satisfying a central limit theorem.
Chain 2:
  Both tail k-hats are smaller than 0.25 which is consistent with the expectand estimator satisfying a central limit theorem.
Chain 3:
  Both tail k-hats are smaller than 0.25 which is consistent with the expectand estimator satisfying a central limit theorem.
Chain 4:
  Both tail k-hats are smaller than 0.25 which is consistent with the expectand estimator satisfying a central limit theorem.

Moreover neither the individual nor the ensemble Markov chain Monte Carlo estimators are consistent with the exact stationary expectation value of \(\mathbb{E}_{\pi}[\text{Id}] = 0\).

plot_est_comp(id_pushforward_chains)

In hindsight this behavior shouldn't be a surprise. Because our Markov transition distribution \(\tau\) is periodic with period \(2\) the composite Markov transition distribution \(\tau^{2}\) will be reducible. Here we can work out an explicit probability density function for this composite Markov transition distribution \[ \begin{align*} \tau^{2}(x_{n + 2} \mid x_{n}) &= \;\;\; \mathbb{I}[ x_{n} \le 0 ] \, \text{neg-half-normal}(x_{n + 2} \mid 0, 1) \\ &\quad + \mathbb{I}[ x_{n} > 0 ] \, \text{pos-half-normal}(x_{n + 2} \mid 0, 1). \end{align*} \] This Markov transition distribution is reducible and admits three stationary distributions specified by the probability density functions \(\text{normal}(x \mid 0, 1)\), \(\text{pos-half-normal}(x \mid 0, 1)\), and \(\text{neg-half-normal}(x \mid 0, 1)\) respectively.

Markov chains generated by \(\tau^{2}\) and initialized at a positive state \(x_{0} > 0\) will forever remain at positive values while those Markov chains initialized at a negative state \(x_{0} \le 0\) will forever remain at negative values.
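Using the hypothetical composite_transition helper sketched above we can watch this confinement directly; every two-step transition from a positive initial state should land at another positive value. The seed and initial state here are arbitrary.

set.seed(88231)

x0 <- 1
composite_samples <- sapply(1:1000,
                            function(r) composite_transition(x0, 2, transition))

# The proportion of positive final states should be exactly one
mean(composite_samples > 0)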

Looking back, the Markov chain Monte Carlo estimators for the individual Markov chains are all consistent with the exact expectation values \[ \begin{align*} \int_{-\infty}^{\infty} \mathrm{d} x \, \text{pos-half-normal}(x \mid 0, 1) \, x &= + \sqrt{ \frac{2}{\pi} } \approx +0.8 \\ \int_{-\infty}^{\infty} \mathrm{d} x \, \text{neg-half-normal}(x \mid 0, 1) \, x &= - \sqrt{ \frac{2}{\pi} } \approx -0.8. \end{align*} \]

plot_est_comp(id_pushforward_chains)

Finally if we thin by any stride that is not divisible by \(2\) then we recover accurate, albeit less precise, Markov chain Monte Carlo estimators.

thinned_chains <- thin_chains(chains, 3)
id_pushforward_chains <- pushforward_chains(thinned_chains, identity)
check_diagnostics(id_pushforward_chains, name="Id")
Expectand "Id":

  Split r-hat is less than 1.1 which is consistent with the expectand estimator satisfying a central limit theorem.

Chain 1:
  Both tail k-hats are smaller than 0.25 which is consistent with the expectand estimator satisfying a central limit theorem.
Chain 2:
  Both tail k-hats are smaller than 0.25 which is consistent with the expectand estimator satisfying a central limit theorem.
Chain 3:
  Both tail k-hats are smaller than 0.25 which is consistent with the expectand estimator satisfying a central limit theorem.
Chain 4:
  Both tail k-hats are smaller than 0.25 which is consistent with the expectand estimator satisfying a central limit theorem.
plot_est_comp(id_pushforward_chains)

4.3 Transition Three: Poor Convergence

Having seen the unwanted behaviors that can arise for reducible or periodic Markov transition distributions, let's see what problematic behaviors are still possible for recurrent Markov transition distributions.

Here we'll consider the restricted ambient space \(X = \mathbb{R}^{+}\) and a truncated Student-t transition to avoid violating that positivity constraint, \[ \begin{align*} \tilde{\phi} &\sim \text{Student-t} (\phi \mid 2, -0.1, 1) \\ \tilde{x}_{n + 1} &= \max(0, \tilde{x}_{n} + \tilde{\phi}). \end{align*} \]

transition <- function(x) {
  phi <- rt(1, 2) - 0.1
  (max(0, x + phi))
}

Intuitively the bias of \(\phi\) towards negative values induces a drift in the realized Markov chains towards zero, although for this particular configuration that drift is relatively weak. If a realized Markov chain gets too close to zero then Markov transitions can be truncated to avoid passing over to negative values.

Formally one can show that this Markov transition distribution is recurrent, and hence exhibits a unique stationary distribution, but an explicit probability density function for that stationary distribution is not immediately available so we do not know any exact stationary expectation values. Moreover while this Markov transition is recurrent one can show that it does not satisfy many of the conditions that guarantee a Markov chain Monte Carlo central limit theorem; in particular it is not geometrically ergodic. The details of these derivations are complicated and require some sophisticated mathematical machinery; see Section 16.1.3 of [4].

In other words Markov chain Monte Carlo estimators will converge asymptotically but we may not be able to characterize how they converge. The preasymptotic convergence might be well-behaved or not, but without verification of a central limit theorem we will not have any theoretical guidance for what kind of behavior to expect. At the same time without an explicit form for the stationary distribution we can't work out any exact expectation values, and hence have no way to empirically confirm that realized Markov chain Monte Carlo estimators are converging to their asymptotic values.

With only slow convergence the behavior of Markov chains can be quite chaotic, exhibiting extreme variations from realization to realization. In practice this can mean extreme variation across different pseudo-random number generator seeds or even between different computer environments where mathematical operators are implemented ever so slightly differently. Needless to say this chaos substantially complicates reproducible Markov chain Monte Carlo!

Here we will set R's pseudo-random number generator seed to be particularly adversarial for the implementation of our chosen Markov transitions. This will allow us to see just how pathologically the empirical averages can behave.

set.seed(5679650)

This Markov transition distribution yields Markov chains whose states tend to be highly correlated. To ensure that we have enough information to discriminate between different pathological behaviors we will run longer Markov chains here than in our other examples.

N <- 10000
x0s <- c(7, 3, 0.2, 15)
chains <- lapply(x0s, function(x0) generate_chain(x0, N, transition))

N_warmup <- 51
chains <- remove_warmup(chains, N_warmup)

id_pushforward_chains <- pushforward_chains(chains, identity)

The trace plots for each of the four realized Markov chains immediately reveal some of the key problems.

plot_traces(id_pushforward_chains)

At each iteration new states are generated by perturbing the initial state with a truncated Student-t perturbation. For \(\nu = 2\) these perturbations are sufficiently heavy-tailed that the realized Markov chains will occasionally make huge jumps to extreme values. The weak drift \(\mu = -0.1\), however, means that a Markov chain will require many, many iterations to return from this excursion back to smaller values. These excursions are expensive but without them we cannot fully explore the stationary distribution, and without that complete exploration we cannot construct accurate Markov chain Monte Carlo estimators. Here only the third realized Markov chain makes the necessary excursion, and it hasn't even returned to the boundary by the time we've exhausted our \(10000\) iterations.
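We can quantify these excursions by, for example, looking at the largest state visited by each Markov chain; a quick sketch:

# Only the third Markov chain should exhibit an extreme maximum state
sapply(chains, max)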

What would happen if we weren't expecting pathological behavior and naively assumed that the behavior of our Markov chain Monte Carlo estimators was accurately characterized by a central limit theorem?

Following the robust workflow we start by consulting our limited diagnostics.

check_diagnostics(id_pushforward_chains, name="Id")
Expectand "Id":

  Split r-hat is greater than 1.1 which is inconsistent with the expectand estimator satisfying a central limit theorem!

Chain 1:
  Both tail k-hats are smaller than 0.25 which is consistent with the expectand estimator satisfying a central limit theorem.
Chain 2:
  Both tail k-hats are smaller than 0.25 which is consistent with the expectand estimator satisfying a central limit theorem.
Chain 3:
  Both tail k-hats are smaller than 0.25 which is consistent with the expectand estimator satisfying a central limit theorem.
Chain 4:
  Both tail k-hats are smaller than 0.25 which is consistent with the expectand estimator satisfying a central limit theorem.

The split \(\hat{R}\) diagnostic does flag the discrepancy between the four realized Markov chains, although we're a little bit lucky here because one of our Markov chains made an excursion. If none of the realized Markov chains had made an excursion then they would have been consistent with each other even though none of them would be consistent with the stationary distribution. One of the key reasons for running multiple Markov chains is to increase the sensitivity of our empirical diagnostics.

We can also see the variation across the four Markov chains manifest as inconsistencies between the individual Markov chain Monte Carlo estimators, with the estimator derived from the third Markov chain dragged up due to the excursion while the others concentrate too strongly around zero.

plot_est_comp(id_pushforward_chains)

Because this variation is inconsistent with a Markov chain Monte Carlo central limit theorem our empirical estimates of the Markov chain Monte Carlo standard errors don't correspond to anything meaningful, nor does the ensemble estimator that relies on those estimated standard errors. In other words we have no way to quantify how close any of these inconsistent estimates might be to the exact stationary expectation values.

Even more evidence that a Markov chain Monte Carlo central limit theorem does not hold can be found by investigating the evolution of the Markov chain Monte Carlo estimators with the increasing number of iterations. If a central limit theorem held then we would expect to see the estimators from all four Markov chains converge to each other within the estimated standard errors, but the estimator from the third Markov chain is off doing its own thing.

par(mfrow=c(2, 2))

segment_sizes <- seq(100, N - N_warmup, 50)

for (c in 1:4) {
  plot(0, type="n", main=paste("Chain", c),
       xlab="Included Iterations", xlim=c(0, N - N_warmup),
       ylab="Markov chain\nMonte Carlo Estimator", ylim=c(0, 800))

  for (M in segment_sizes) {
    chain_segment <- id_pushforward_chains[[c]][1:M]
    est <- mcmc_est(chain_segment)
  
    rect(M - 5, est[1] - 4 * est[2], M + 5, est[1] + 4 * est[2],
         col=c_light, border=NA)
    rect(M - 5, est[1] - 3 * est[2], M + 5, est[1] + 3 * est[2],
         col=c_light_highlight, border=NA)
    rect(M - 5, est[1] - 2 * est[2], M + 5, est[1] + 2 * est[2],
         col=c_mid, border=NA)
    rect(M - 5, est[1] - 1 * est[2], M + 5, est[1] + 1 * est[2],
         col=c_mid_highlight, border=NA)
    lines(c(M - 5, M + 5), c(est[1], est[1]), col=c_dark, lwd=2)
  }
}

There are many reasons why the behavior of a particular Markov chain Monte Carlo estimator might be problematic. It could be that our chosen Markov transition distribution fundamentally obstructs a Markov chain Monte Carlo central limit theorem; it could be that the particular expectand isn't sufficiently integrable for the central limit theorem to apply to it; or it could be that we just haven't run our Markov chains long enough for the central limit theorem to become accurate.

Let's first consider what might happen if the identity function just wasn't sufficiently integrable. Ideally our \(\hat{k}\) diagnostic would have caught this but there's always a chance that it just wasn't sensitive enough, especially given that only one Markov chain explored the heavy tail of the stationary distribution.

To avoid any ambiguity let's look at an indicator function whose moments will always be finite.

indicator <- function(x) {
  ifelse(x < 50, 1, 0)
}

indicator_pushforward_chains <- pushforward_chains(chains, indicator)

The split \(\hat{R}\) diagnostic identifies inconsistencies in the behavior of this expectand across the four realized Markov chains as well.

check_diagnostics(indicator_pushforward_chains, name="I[0, 50]")
Expectand "I[0, 50]":

  Split r-hat is greater than 1.1 which is inconsistent with the expectand estimator satisfying a central limit theorem!

Chain 1:
  Both tail k-hats are smaller than 0.25 which is consistent with the expectand estimator satisfying a central limit theorem.
Chain 2:
  Both tail k-hats are smaller than 0.25 which is consistent with the expectand estimator satisfying a central limit theorem.
Chain 3:
  Both tail k-hats are smaller than 0.25 which is consistent with the expectand estimator satisfying a central limit theorem.
Chain 4:
  Both tail k-hats are smaller than 0.25 which is consistent with the expectand estimator satisfying a central limit theorem.

Moreover the Markov chain Monte Carlo estimators are not compatible with each other.

plot_est_comp(indicator_pushforward_chains)

We can see even more detail about this inconsistency in the evolution of the Markov chain Monte Carlo estimators. Note also the sudden jumps in the estimator behavior even for the three Markov chains that don't make excursions to large values. This behavior is common when a Markov chain Monte Carlo central limit theorem fails.

par(mfrow=c(2, 2))

segment_sizes <- seq(100, N - N_warmup, 50)

for (c in 1:4) {
  plot(0, type="n", main=paste("Chain", c),
       xlab="Included Iterations", xlim=c(0, N - N_warmup),
       ylab="Markov chain\nMonte Carlo Estimator", ylim=c(0, 1))

  for (M in segment_sizes) {
    chain_segment <- indicator_pushforward_chains[[c]][1:M]
    est <- mcmc_est(chain_segment)
  
    rect(M - 5, est[1] - 4 * est[2], M + 5, est[1] + 4 * est[2],
         col=c_light, border=NA)
    rect(M - 5, est[1] - 3 * est[2], M + 5, est[1] + 3 * est[2],
         col=c_light_highlight, border=NA)
    rect(M - 5, est[1] - 2 * est[2], M + 5, est[1] + 2 * est[2],
         col=c_mid, border=NA)
    rect(M - 5, est[1] - 1 * est[2], M + 5, est[1] + 1 * est[2],
         col=c_mid_highlight, border=NA)
    lines(c(M - 5, M + 5), c(est[1], est[1]), col=c_dark, lwd=2)
  }
}

Finally let's consider the possibility that our realized Markov chains were just too short for a Markov chain Monte Carlo central limit theorem to have kicked in. To see what would happen for longer Markov chains we'll run \(100\) times more iterations, and to make those long Markov chains easier to work with we'll also thin them to every \(100\)th iteration.

set.seed(5679650)

N <- 1000000
x0s <- c(7, 3, 0.2, 15)
chains <- lapply(x0s, function(x0) generate_chain(x0, N, transition))

N_warmup <- 51
chains <- remove_warmup(chains, N_warmup)

thinned_chains <- thin_chains(chains, 100)

All of these longer Markov chains now exhibit some excursions away from small values, but some exhibit even more extreme excursions than others!

id_pushforward_chains <- pushforward_chains(thinned_chains, identity)

plot_traces(id_pushforward_chains)

These more extreme excursions also suggest that the stationary distribution is too heavy-tailed for the identity function to be sufficiently integrable. This is backed up by the failure of the \(\hat{k}\) diagnostics in some of the Markov chains.

check_diagnostics(id_pushforward_chains, name="Id")
Expectand "Id":

  Split r-hat is less than 1.1 which is consistent with the expectand estimator satisfying a central limit theorem.

Chain 1:
  The upper tail k-hat is larger than 0.25  which is inconsistent with the expectand estimator satisfying a central limit theorem!
Chain 2:
  Both tail k-hats are smaller than 0.25 which is consistent with the expectand estimator satisfying a central limit theorem.
Chain 3:
  The upper tail k-hat is larger than 0.25  which is inconsistent with the expectand estimator satisfying a central limit theorem!
Chain 4:
  The upper tail k-hat is larger than 0.25  which is inconsistent with the expectand estimator satisfying a central limit theorem!

What about the estimation of the indicator function? Here neither the split \(\hat{R}\) nor the \(\hat{k}\) diagnostic indicates any problematic behavior.

indicator_pushforward_chains <- pushforward_chains(thinned_chains, indicator)

check_diagnostics(indicator_pushforward_chains, name="I[0, 50]")
Expectand "I[0, 50]":

  Split r-hat is less than 1.1 which is consistent with the expectand estimator satisfying a central limit theorem.

Chain 1:
  Both tail k-hats are smaller than 0.25 which is consistent with the expectand estimator satisfying a central limit theorem.
Chain 2:
  Both tail k-hats are smaller than 0.25 which is consistent with the expectand estimator satisfying a central limit theorem.
Chain 3:
  Both tail k-hats are smaller than 0.25 which is consistent with the expectand estimator satisfying a central limit theorem.
Chain 4:
  Both tail k-hats are smaller than 0.25 which is consistent with the expectand estimator satisfying a central limit theorem.

Moreover the Markov chain Monte Carlo estimators are consistent with each other, as is required if a Markov chain Monte Carlo central limit theorem holds. Note also that the estimators are converging to a value in between the earlier estimates that we derived from the shorter Markov chains.

plot_est_comp(indicator_pushforward_chains)

There is no guarantee, however, that these Markov chains are not still missing rarer excursions to even larger values. This is a circumstance where being able to characterize properties of the stationary distribution, such as how much probability it might allocate to extreme values, is incredibly useful.

For the moment, however, let's assume that we've sufficiently characterized the tail of the stationary distribution. In this case the total effective sample size across all four Markov chains is incredibly small given the total number of iterations.

display_ensemble_mcmc_est(indicator_pushforward_chains, name="I[0, 50]")
E[I[0, 50]] = 0.705 +/- 0.022 (Total Effective Sample Size = 1794)

Even if a central limit theorem holds every additional iteration across all four Markov chains will increase the total effective sample size by only a small amount.

1794 / (4 * 1000000)
[1] 0.0004485

In other words it will take over two thousand iterations of a single Markov chain to increase the total effective sample size by one.
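Equivalently we can invert this incremental effective sample size to count how many iterations each additional unit of effective sample size costs.

(4 * 1000000) / 1794
[1] 2229.654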

The existence of a Markov chain Monte Carlo central limit theorem is practically meaningless if it doesn't apply until the Markov chains have become infeasibly large. In these cases the domain of the central limit theorem is just as impractical as the final asymptotic limit itself. We need to exercise caution whenever the incremental effective sample size is so small.

4.4 Transition Four: Ill-Defined Moments

Let's put that ordeal behind us and consider a new Markov transition distribution defined by simulating a heavy-tailed intermediate variable and then deterministically mixing it with the initial state, \[ \begin{align*} \tilde{\phi} &\sim \text{Cauchy}(\phi \mid 0, 1) \\ \tilde{x}_{n + 1} &= (1 - \lambda) \, \tilde{\phi} + \lambda \, x_{n}, \end{align*} \] for any \(0 < \lambda < 1\). Here we'll take \(\lambda = 0.5\).

transition <- function(x) {
  lambda <- 0.5
  phi <- rcauchy(1, 0, 1)
  ((1 - lambda) * phi + lambda * x)
}

Mathematically we can encode these steps with the probability density functions \[ \begin{align*} \tau(\phi \mid x_{n}) &= \text{Cauchy}(\phi \mid 0, 1) \\ \tau(x_{n + 1} \mid \phi, x_{n}) &= \delta( x_{n + 1} - ((1 - \lambda) \, \phi + \lambda \, x_{n} ) ). \end{align*} \] The full Markov transition density function is then given by \[ \begin{align*} \tau(x_{n + 1} \mid x_{n}) &= \int_{-\infty}^{\infty} \mathrm{d} \phi \, \tau(x_{n + 1} \mid \phi, x_{n}) \, \tau(\phi \mid x_{n}) \\ &= \int_{-\infty}^{\infty} \mathrm{d} \phi \, \delta( x_{n + 1} - ( (1 - \lambda) \, \phi + \lambda \, x_{n} ) ) \, \text{Cauchy}(\phi \mid 0, 1) \\ &= \int_{-\infty}^{\infty} \mathrm{d} \phi \, \delta( x_{n + 1} - \lambda \, x_{n} - (1 - \lambda) \, \phi ) \, \text{Cauchy}(\phi \mid 0, 1). \end{align*} \]

Making the substitution \[ u = (1 - \lambda) \, \phi \] then gives \[ \begin{align*} \tau(x_{n + 1} \mid x_{n}) &= \int_{-\infty}^{\infty} \mathrm{d} \phi \, \delta( x_{n + 1} - \lambda \, x_{n} - ( 1 - \lambda) \, \phi ) \, \text{Cauchy}(\phi \mid 0, 1) \\ &= (1 - \lambda)^{-1} \int_{-\infty}^{\infty} \mathrm{d} u \, \delta( x_{n + 1} - \lambda \, x_{n} - u ) \, \text{Cauchy} \! \left( (1 - \lambda)^{-1} \, u \mid 0, 1 \right) \\ &= (1 - \lambda)^{-1} \text{Cauchy} \! \left( (1 - \lambda)^{-1} \, (x_{n + 1} - \lambda \, x_{n}) \mid 0, 1 \right). \end{align*} \] Because the Cauchy family of probability density functions is closed under translations and scalings we have \[ \begin{align*} \sigma^{-1} \text{Cauchy}( \sigma^{-1} \, (x - \mu) \mid 0, 1) &= \sigma^{-1} \, \frac{1}{\pi} \left[ 1 + \left( \sigma^{-1} \, (x - \mu) \right)^{2} \right]^{-1} \\ &= \frac{1}{\pi \, \sigma} \left[ 1 + \left( \frac{x - \mu}{ \sigma} \right)^{2} \right]^{-1} \\ &= \text{Cauchy}(x \mid \mu, \sigma). \end{align*} \] Consequently \[ \begin{align*} \tau(x_{n + 1} \mid x_{n}) &= (1 - \lambda)^{-1} \text{Cauchy} \! \left( (1 - \lambda)^{-1} \, (x_{n + 1} - \lambda \, x_{n}) \mid 0, 1 \right) \\ &= \text{Cauchy} \! \left( x_{n + 1} \mid \lambda \, x_{n} , 1 - \lambda \right). \end{align*} \]
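We can spot check this closure property against R's built-in Cauchy density; the evaluation points here are arbitrary.

xs <- c(-2, 0.5, 3)
mu <- 1
sigma <- 2

# The difference should vanish up to floating point error
max(abs(dcauchy((xs - mu) / sigma, 0, 1) / sigma - dcauchy(xs, mu, sigma)))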

This Markov transition distribution is irreducible and aperiodic; the unique stationary distribution is given by the probability density function \[ \text{Cauchy}(x \mid 0, 1) = \frac{1}{\pi} \frac{1}{1 + x^{2}}. \] Moreover the \(N\)-step distributions converge to this stationary distribution quickly so that a Markov chain Monte Carlo central limit theorem holds even for relatively small Markov chains.
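As with the reducible example we can informally corroborate this stationarity property by pushing exact Cauchy samples through a single Markov transition. This sketch assumes the transition function above; the seed and cutoffs are arbitrary.

set.seed(3385662)

R <- 10000
pushed_samples <- sapply(rcauchy(R, 0, 1), transition)

# Compare empirical probabilities of the pushed ensemble to the exact
# stationary probabilities at a few cutoffs
cutoffs <- c(-3, -1, 0, 1, 3)
rbind(sapply(cutoffs, function(cutoff) mean(pushed_samples <= cutoff)),
      pcauchy(cutoffs, 0, 1))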

Let's see just how well it converges in practice.

N <- 1000
x0s <- c(15, 3, -10, -5)
chains <- lapply(x0s, function(x0) generate_chain(x0, N, transition))

N_warmup <- 51
chains <- remove_warmup(chains, N_warmup)

All four of the realized Markov chains encounter occasional excursions to large values, but unlike in the previous example those excursions are short and the Markov chains return to small values quite quickly.

id_pushforward_chains <- pushforward_chains(chains, identity)

plot_traces(id_pushforward_chains)

Unfortunately while the output of the identity function is consistent across all four Markov chains the \(\hat{k}\) statistic suggests that this expectand is not sufficiently integrable for the corresponding expectation values to be well-characterized by a Markov chain Monte Carlo central limit theorem.

check_diagnostics(id_pushforward_chains, name="Id")
Expectand "Id":

  Split r-hat is less than 1.1 which is consistent with the expectand estimator satisfying a central limit theorem.

Chain 1:
  Both tail k-hats are larger than 0.25 which is inconsistent with the expectand estimator satisfying a central limit theorem!
Chain 2:
  Both tail k-hats are larger than 0.25 which is inconsistent with the expectand estimator satisfying a central limit theorem!
Chain 3:
  Both tail k-hats are larger than 0.25 which is inconsistent with the expectand estimator satisfying a central limit theorem!
Chain 4:
  Both tail k-hats are larger than 0.25 which is inconsistent with the expectand estimator satisfying a central limit theorem!

Indeed all moments of the stationary distribution are problematic; the even moments are infinite while the odd moments are not even well-defined. Consequently while we can naively construct an ensemble Markov chain Monte Carlo estimator that returns some number it won't actually correspond to any meaningful property of the stationary distribution.

display_ensemble_mcmc_est(id_pushforward_chains, name="Id")
E[Id] = 0.199 +/- 0.468 (Total Effective Sample Size = 1322)

While moments are ill-defined, probabilities will always be well-defined. Consequently we shouldn't have any problems estimating expectation values of indicator functions.

indicator <- function(x) {
  ifelse(-1 < x & x < 1, 1, 0)
}

indicator_pushforward_chains <- pushforward_chains(chains, indicator)

None of the empirical diagnostics indicate inconsistencies with a Markov chain Monte Carlo central limit theorem holding.

check_diagnostics(indicator_pushforward_chains, "I[-1, 1]")
Expectand "I[-1, 1]":

  Split r-hat is less than 1.1 which is consistent with the expectand estimator satisfying a central limit theorem.

Chain 1:
  Both tail k-hats are smaller than 0.25 which is consistent with the expectand estimator satisfying a central limit theorem.
Chain 2:
  Both tail k-hats are smaller than 0.25 which is consistent with the expectand estimator satisfying a central limit theorem.
Chain 3:
  Both tail k-hats are smaller than 0.25 which is consistent with the expectand estimator satisfying a central limit theorem.
Chain 4:
  Both tail k-hats are smaller than 0.25 which is consistent with the expectand estimator satisfying a central limit theorem.

Consequently we have some confidence that our ensemble Markov chain Monte Carlo estimator will be well-behaved.

display_ensemble_mcmc_est(indicator_pushforward_chains, name="I[-1, 1]")
E[I[-1, 1]] = 0.502 +/- 0.023 (Total Effective Sample Size = 1872)

Indeed it's consistent with the exact stationary expectation value.

pcauchy(1, 0, 1) - pcauchy(-1, 0, 1)
[1] 0.5

We can also use these probability estimators to construct a histogram representation of the stationary distribution. In this case that histogram is perfectly consistent with the exact stationary density function.

par(mfrow=c(1, 1))

plot_density_hist(chains, seq(-10, 10, 0.3))
plot_xs <- seq(-10, 10, 0.01)
plot_ys <- dcauchy(plot_xs, 0, 1)
lines(plot_xs, plot_ys, lwd=3, col="white")
lines(plot_xs, plot_ys, lwd=2, col="black")

4.5 Transition Five: All Clear

After slogging through so many failure modes we've earned a final example where everything goes well.

Our final Markov transition distribution is similar to that of the last section only with a more moderately-tailed intermediate variable, \[ \begin{align*} \tilde{\phi} &\sim \text{normal} (\phi \mid 0, 1) \\ \tilde{x}_{n + 1} &= \sqrt{1 - \lambda^{2}} \, \phi + \lambda \, x_{n}. \end{align*} \] for any \(0 < \lambda < 1\). As before we'll take \(\lambda = 0.5\).

transition <- function(x) {
  lambda <- 0.5
  phi <- rnorm(1, 0, 1)
  sqrt(1 - lambda^2) * phi + lambda * x
}

This two-step Markov transition can be probabilistically encoded with the conditional density functions \[ \begin{align*} \tau(\phi \mid x_{n}) &= \text{normal}(\phi \mid 0, 1) \\ \tau(x_{n + 1} \mid \phi, x_{n}) &= \delta \big( x_{n + 1} - \big( \sqrt{1 - \lambda^{2}} \, \phi + \lambda \, x_{n} \big) \big), \end{align*} \] to give the full Markov transition \[ \begin{align*} \tau(x_{n + 1} \mid x_{n}) &= \int_{-\infty}^{\infty} \mathrm{d} \phi \, \tau(x_{n + 1} \mid \phi, x_{n}) \, \tau(\phi \mid x_{n}) \\ &= \int_{-\infty}^{\infty} \mathrm{d} \phi \, \delta \big( x_{n + 1} - \big( \sqrt{1 - \lambda^{2}} \, \phi + \lambda \, x_{n} \big) \big) \, \text{normal}(\phi \mid 0, 1) \\ &= \int_{-\infty}^{\infty} \mathrm{d} \phi \, \delta \big( x_{n + 1} - \lambda \, x_{n} - \sqrt{1 - \lambda^{2}} \, \phi \big) \, \text{normal}(\phi \mid 0, 1). \end{align*} \]

If we can isolate the variable of integration in the Dirac delta function then we can evaluate the integral immediately. To that end we make the substitution \[ u = \sqrt{1 - \lambda^{2}} \, \phi \] which gives \[ \begin{align*} \tau(x_{n + 1} \mid x_{n}) &= \int_{-\infty}^{\infty} \mathrm{d} \phi \, \delta( x_{n + 1} - \lambda \, x_{n} - \sqrt{1 - \lambda^{2}} \, \phi ) \, \text{normal}(\phi \mid 0, 1) \\ &= (1 - \lambda^{2})^{-\frac{1}{2}} \int_{-\infty}^{\infty} \mathrm{d} u \, \delta( x_{n + 1} - \lambda \, x_{n} - u ) \, \text{normal} \! \left( (1 - \lambda^{2})^{-\frac{1}{2}} \, u \mid 0, 1 \right) \\ &= (1 - \lambda^{2})^{-\frac{1}{2}} \text{normal} \! \left( (1 - \lambda^{2})^{-\frac{1}{2}} \, (x_{n + 1} - \lambda \, x_{n}) \mid 0, 1 \right). \end{align*} \]

Like the Cauchy family the normal family of probability density functions is closed under translations and scalings; in particular \[ \begin{align*} \sigma^{-1} \text{normal}( \sigma^{-1} \, (x - \mu) \mid 0, 1) &= \frac{1}{\sigma} \frac{1}{\sqrt{2 \pi}} \exp \left[ - \frac{1}{2} \left( \sigma^{-1} \, (x - \mu) \right)^{2} \right] \\ &= \frac{1}{\sqrt{2 \pi \sigma^{2} }} \exp \left[ - \frac{1}{2} \left( \frac{ x - \mu}{\sigma} \right)^{2} \right] \\ &= \text{normal}(x \mid \mu, \sigma). \end{align*} \] Consequently the Markov transition can be written as \[ \begin{align*} \tau(x_{n + 1} \mid x_{n}) &= (1 - \lambda^{2})^{-\frac{1}{2}} \int \mathrm{d} u \, \delta( x_{n + 1} - \lambda \, x_{n} - u ) \, \text{normal} \! \left( (1 - \lambda^{2})^{-\frac{1}{2}} \, u \mid 0, 1 \right) \\ &= \text{normal} \! \left( x_{n + 1} \mid \lambda \, x_{n} , \sqrt{1 - \lambda^{2}} \right). \end{align*} \]
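One last time we can corroborate the claimed stationary distribution by pushing exact samples through a single Markov transition; if \(\text{normal}(x \mid 0, 1)\) is preserved then the pushed ensemble should exhibit a mean near zero and a standard deviation near one. A sketch assuming the transition function above and an arbitrary seed:

set.seed(58292351)

R <- 10000
pushed_samples <- sapply(rnorm(R, 0, 1), transition)

# These should be close to the exact stationary values of 0 and 1
c(mean(pushed_samples), sd(pushed_samples))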

Let's start by looking at short Markov chains so that we can see the residual influence of the initial states.
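These experiments rely on the helper functions introduced earlier in the case study. For readers dipping into this section directly, minimal sketches consistent with how generate_chain, remove_warmup, pushforward_chains, and thin_chains are called below might look like the following; the implementations used earlier may differ in their details.

# Realize a Markov chain of N transitions, including the initial state
generate_chain <- function(x0, N, transition) {
  xs <- rep(NA, N + 1)
  xs[1] <- x0
  for (n in 1:N)
    xs[n + 1] <- transition(xs[n])
  xs
}

# Drop the first N_warmup states from each Markov chain
remove_warmup <- function(chains, N_warmup) {
  lapply(chains, function(c) c[-(1:N_warmup)])
}

# Evaluate an expectand at every state of every Markov chain
pushforward_chains <- function(chains, f) {
  lapply(chains, function(c) sapply(c, f))
}

# Keep only every stride-th state of each Markov chain
thin_chains <- function(chains, stride) {
  lapply(chains, function(c) c[seq(1, length(c), stride)])
}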

x0s <- c(-10, 1, 20, 15)

N <- 100
chains <- lapply(x0s, function(x0) generate_chain(x0, N, transition))

That influence persists until about ten iterations, after which point all four realized Markov chains settle into consistent behavior.

id_pushforward_chains <- pushforward_chains(chains, identity)

plot_traces(id_pushforward_chains)

For inference we'll run longer Markov chains but we'll also exclude an initial warmup window so that this residual influence doesn't contaminate our Markov chain Monte Carlo estimators.

N <- 1000
chains <- lapply(x0s, function(x0) generate_chain(x0, N, transition))

N_warmup <- 51
chains <- remove_warmup(chains, N_warmup)

Evaluations of the identity expectand across the states in all four of the Markov chains look reasonable.

id_pushforward_chains <- pushforward_chains(chains, identity)

plot_traces(id_pushforward_chains)

Our limited diagnostics are clean.

check_diagnostics(id_pushforward_chains, name="Id")
Expectand "Id":

  Split r-hat is less than 1.1 which is consistent with the expectand estimator satisfying a central limit theorem.

Chain 1:
  Both tail k-hats are smaller than 0.25 which is consistent with the expectand estimator satisfying a central limit theorem.
Chain 2:
  Both tail k-hats are smaller than 0.25 which is consistent with the expectand estimator satisfying a central limit theorem.
Chain 3:
  Both tail k-hats are smaller than 0.25 which is consistent with the expectand estimator satisfying a central limit theorem.
Chain 4:
  Both tail k-hats are smaller than 0.25 which is consistent with the expectand estimator satisfying a central limit theorem.

The ensemble Markov chain Monte Carlo estimator is consistent with the exact stationary expectation value of \(\mathbb{E}_{\pi}[\text{Id}] = 0\).

display_ensemble_mcmc_est(id_pushforward_chains, name="Id")
E[Id] = 0.034 +/- 0.055 (Total Effective Sample Size = 1247)

Indeed the estimators from each individual Markov chain are compatible with each other, with the ensemble estimator, and with the exact expectation value, just as we would expect when a Markov chain Monte Carlo central limit theorem holds.

plot_est_comp(id_pushforward_chains)

Another sign that a central limit theorem holds is smooth convergence of the estimators around the exact expectation value.

par(mfrow=c(2, 2))

segment_sizes <- seq(100, N - N_warmup, 50)

for (c in 1:4) {
  plot(0, type="n", main=paste("Chain", c),
       xlab="Included Iterations", xlim=c(0, N - N_warmup),
       ylab="Markov chain\nMonte Carlo Estimator", ylim=c(-1, 1))

  for (M in segment_sizes) {
    chain_segment <- id_pushforward_chains[[c]][1:M]
    est <- mcmc_est(chain_segment)

    rect(M - 5, est[1] - 4 * est[2], M + 5, est[1] + 4 * est[2],
         col=c_light, border=NA)
    rect(M - 5, est[1] - 3 * est[2], M + 5, est[1] + 3 * est[2],
         col=c_light_highlight, border=NA)
    rect(M - 5, est[1] - 2 * est[2], M + 5, est[1] + 2 * est[2],
         col=c_mid, border=NA)
    rect(M - 5, est[1] - 1 * est[2], M + 5, est[1] + 1 * est[2],
         col=c_mid_highlight, border=NA)
    lines(c(M - 5, M + 5), c(est[1], est[1]), col=c_dark, lwd=2)
  }

  abline(h=0, col="grey", lty="dashed", lwd=2) # Exact expectation value
}

par(mfrow=c(1, 1))

segment_sizes <- seq(25, N - N_warmup, 25)

plot(0, type="n", main="All Chains",
     xlab="Included Iterations Per Chain", xlim=c(0, N - N_warmup),
     ylab="Ensemble Markov chain\nMonte Carlo Estimator", ylim=c(-1, 1))

for (M in segment_sizes) {
  chain_segments <- lapply(id_pushforward_chains, function(c) c[1:M])
  est <- ensemble_mcmc_est(chain_segments)

  rect(M - 5, est[1] - 4 * est[2], M + 5, est[1] + 4 * est[2],
       col=c_light, border=NA)
  rect(M - 5, est[1] - 3 * est[2], M + 5, est[1] + 3 * est[2],
       col=c_light_highlight, border=NA)
  rect(M - 5, est[1] - 2 * est[2], M + 5, est[1] + 2 * est[2],
       col=c_mid, border=NA)
  rect(M - 5, est[1] - 1 * est[2], M + 5, est[1] + 1 * est[2],
       col=c_mid_highlight, border=NA)
  lines(c(M - 5, M + 5), c(est[1], est[1]), col=c_dark, lwd=2)
}

abline(h=0, col="grey", lty="dashed", lwd=2) # Exact expectation value

Estimating bin probabilities using Markov chain Monte Carlo estimators gives a histogram which accurately characterizes the stationary density function.

par(mfrow=c(1, 1))

plot_density_hist(id_pushforward_chains, seq(-5, 5, 0.3))
plot_xs <- seq(-5, 5, 0.01)
plot_ys <- dnorm(plot_xs, 0, 1)
lines(plot_xs, plot_ys, lwd=3, col="white")
lines(plot_xs, plot_ys, lwd=2, col="black")

Finally let's consider thinning. Because our Markov transition distribution is aperiodic the Markov chain Monte Carlo central limit theorem should hold for any thinning stride, but what strides would be useful?

Let's go back to our Markov chain Monte Carlo ensemble estimator.

display_ensemble_mcmc_est(id_pushforward_chains, name="Id")
E[Id] = 0.034 +/- 0.055 (Total Effective Sample Size = 1247)

From the \(4 \, (1001 - 51) = 3800\) total post-warmup iterations we've recovered an effective sample size of only about \(1200\) due to the autocorrelations within each Markov chain. This suggests that there is some opportunity to thin the Markov chains slightly, reducing the autocorrelations without affecting the total effective sample size much. Naively we might even expect to be able to keep only one out of every \(3800 / 1200 \approx 3\) iterations.
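In other words, the naive stride is just the ratio of the total number of post-warmup iterations to the total effective sample size; a quick sketch of the arithmetic, using the numbers reported above:

total_iterations <- 4 * (1001 - 51)
total_ess <- 1247

floor(total_iterations / total_ess)
[1] 3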

Let's start by thinning to every second iteration.

thinned_chains <- thin_chains(chains, 2)
id_pushforward_chains <- pushforward_chains(thinned_chains, identity)

Diagnostics continue to look clean, which gives us confidence that our thinned Markov chains have not become too short for the central limit theorem to still hold.

check_diagnostics(id_pushforward_chains, name="Id")
Expectand "Id":

  Split r-hat is less than 1.1 which is consistent with the expectand estimator satisfying a central limit theorem.

Chain 1:
  Both tail k-hats are smaller than 0.25 which is consistent with the expectand estimator satisfying a central limit theorem.
Chain 2:
  Both tail k-hats are smaller than 0.25 which is consistent with the expectand estimator satisfying a central limit theorem.
Chain 3:
  Both tail k-hats are smaller than 0.25 which is consistent with the expectand estimator satisfying a central limit theorem.
Chain 4:
  Both tail k-hats are smaller than 0.25 which is consistent with the expectand estimator satisfying a central limit theorem.

Despite having removed half of the post-warmup states in each Markov chain the total effective sample size is nearly the same. Consequently the precision of the ensemble Markov chain Monte Carlo estimator has largely been unaffected by the thinning.

display_ensemble_mcmc_est(id_pushforward_chains, name="Id")
E[Id] = 0.026 +/- 0.058 (Total Effective Sample Size = 1130)

What happens when we thin to every third iteration as suggested by our naive calculation above?

thinned_chains <- thin_chains(chains, 3)
id_pushforward_chains <- pushforward_chains(thinned_chains, identity)

check_diagnostics(id_pushforward_chains, name="Id")
Expectand "Id":

  Split r-hat is less than 1.1 which is consistent with the expectand estimator satisfying a central limit theorem.

Chain 1:
  Both tail k-hats are smaller than 0.25 which is consistent with the expectand estimator satisfying a central limit theorem.
Chain 2:
  Both tail k-hats are smaller than 0.25 which is consistent with the expectand estimator satisfying a central limit theorem.
Chain 3:
  Both tail k-hats are smaller than 0.25 which is consistent with the expectand estimator satisfying a central limit theorem.
Chain 4:
  Both tail k-hats are smaller than 0.25 which is consistent with the expectand estimator satisfying a central limit theorem.
display_ensemble_mcmc_est(id_pushforward_chains, name="Id")
E[Id] = 0.046 +/- 0.059 (Total Effective Sample Size = 1032)

At this point the total effective sample size for the thinned Markov chains begins to decay. While the states we removed are correlated with their neighbors in the full Markov chain they are not completely dependent; consequently they do have a nontrivial impact on the precision of Markov chain Monte Carlo estimators.

If we continue to more extreme strides the effective sample size starts to decrease proportionally to the number of states that we remove.

thinned_chains <- thin_chains(chains, 6)
id_pushforward_chains <- pushforward_chains(thinned_chains, identity)

check_diagnostics(id_pushforward_chains, name="Id")
Expectand "Id":

  Split r-hat is less than 1.1 which is consistent with the expectand estimator satisfying a central limit theorem.

Chain 1:
  Both tail k-hats are smaller than 0.25 which is consistent with the expectand estimator satisfying a central limit theorem.
Chain 2:
  Both tail k-hats are smaller than 0.25 which is consistent with the expectand estimator satisfying a central limit theorem.
Chain 3:
  Both tail k-hats are smaller than 0.25 which is consistent with the expectand estimator satisfying a central limit theorem.
Chain 4:
  Both tail k-hats are smaller than 0.25 which is consistent with the expectand estimator satisfying a central limit theorem.
display_ensemble_mcmc_est(id_pushforward_chains, name="Id")
E[Id] = 0.003 +/- 0.080 (Total Effective Sample Size = 609)

Ultimately all of the states in our initial Markov chain inform Markov chain Monte Carlo estimators, and keeping them all will always result in the most informative estimates. That said, the aperiodicity of this Markov transition distribution gives us the flexibility to employ thinning in the limited circumstances where it can be helpful without having to worry about compromising the estimators.

5 Conclusion

Probabilistic computation is fundamentally difficult. In this light methods like Monte Carlo are somewhat miraculous, but they apply only in the exceptional cases where we can generate exact samples from the probability distributions of interest. Markov chain Monte Carlo is a more conflicted miracle; while it can be applied to many more circumstances it is much more fragile than the Monte Carlo method.

In order for us to deploy Markov chain Monte Carlo responsibly we cannot take it for granted. We have to be careful to validate useful conditions like irreducibility and aperiodicity and then look for any signs that a Markov chain Monte Carlo central limit theorem doesn't hold for each expectand of interest. With this care Markov chain Monte Carlo can make probabilistic computation feasible across an incredibly diverse range of problems.

That said we have yet to address one of the most difficult challenges in the application of Markov chain Monte Carlo. How do we actually engineer Markov transition distributions whose stationary distribution is a given target distribution of interest, let alone one that ensures rapid convergence to that target distribution? This challenge requires dedicated methods for constructing Markov transition distributions from target distributions, which we will discuss in future case studies.

6 Appendix: Integral Verve

In this appendix we verify the properties of the demonstrative Markov transition distributions that we employed in the main text. While all of these calculations employ only basic calculus techniques, many are quite involved and require some onerous algebra despite the seeming simplicity of the Markov transition distributions and their stationary distributions. This difficulty is one reason why working with bespoke Markov transition distributions is so demanding and often outside of practical feasibility.

6.1 Transition One

The Markov transition distribution in our first example was specified by the conditional density function \[ \begin{align*} \tau(x_{n + 1} \mid x_{n}) &= \quad \int_{0}^{\infty} \mathrm{d} \phi \, \mathbb{I}[ 0 < x_{n} < +\phi ] \, \delta( x_{n + 1} - 2 \, x_{n} ) \, \exp(-\phi) \\ & \quad + \int_{0}^{\infty} \mathrm{d} \phi \, \mathbb{I}[ 0 \le +\phi \le x_{n} ] \, \delta( x_{n + 1} - 2 \, \phi) \,\;\; \exp(-\phi) \\ & \quad + \int_{0}^{\infty} \mathrm{d} \phi \, \mathbb{I}[ -\phi < x_{n} \le 0 ] \, \delta( x_{n + 1} - 2 \, x_{n} ) \, \exp(-\phi) \\ & \quad + \int_{0}^{\infty} \mathrm{d} \phi \, \mathbb{I}[ x_{n} \le -\phi \le 0 ] \, \delta( x_{n + 1} + 2 \, \phi) \,\;\; \exp(-\phi). \end{align*} \]

First we want to verify this Markov transition distribution is properly normalized, \[ \begin{align*} \int_{-\infty}^{\infty} \mathrm{d} x_{n + 1} \, \tau(x_{n + 1} \mid x_{n}) &= \int_{-\infty}^{\infty} \mathrm{d} x_{n + 1} \, \int_{0}^{\infty} \mathrm{d} \phi \, \tau(x_{n + 1} \mid \phi, x_{n}) \, \tau(\phi \mid x_{n}) \\ &= \quad \int_{-\infty}^{\infty} \mathrm{d} x_{n + 1} \, \int_{0}^{\infty} \mathrm{d} \phi \, \mathbb{I}[ 0 < x_{n} < +\phi ] \, \delta( x_{n + 1} - 2 \, x_{n} ) \, \exp(-\phi) \\ & \quad + \int_{-\infty}^{\infty} \mathrm{d} x_{n + 1} \, \int_{0}^{\infty} \mathrm{d} \phi \, \mathbb{I}[ 0 \le +\phi \le x_{n} ] \, \delta( x_{n + 1} - 2 \, \phi) \,\;\; \exp(-\phi) \\ & \quad + \int_{-\infty}^{\infty} \mathrm{d} x_{n + 1} \, \int_{0}^{\infty} \mathrm{d} \phi \, \mathbb{I}[ -\phi < x_{n} \le 0 ] \, \delta( x_{n + 1} - 2 \, x_{n} ) \, \exp(-\phi) \\ & \quad + \int_{-\infty}^{\infty} \mathrm{d} x_{n + 1} \, \int_{0}^{\infty} \mathrm{d} \phi \, \mathbb{I}[ x_{n} \le -\phi \le 0 ] \, \delta( x_{n + 1} + 2 \, \phi) \,\;\; \exp(-\phi) \\ &= \quad \int_{0}^{\infty} \mathrm{d} \phi \, \mathbb{I}[ 0 < x_{n} < +\phi ] \, \exp(-\phi) \\ & \quad + \int_{0}^{\infty} \mathrm{d} \phi \, \mathbb{I}[ 0 \le +\phi \le x_{n} ] \, \exp(-\phi) \\ & \quad + \int_{0}^{\infty} \mathrm{d} \phi \, \mathbb{I}[ -\phi < x_{n} \le 0 ] \, \exp(-\phi) \\ & \quad + \int_{0}^{\infty} \mathrm{d} \phi \, \mathbb{I}[ x_{n} \le -\phi \le 0 ] \, \exp(-\phi) \\ &= \quad \int_{0}^{\infty} \mathrm{d} \phi \, \big( \mathbb{I}[ 0 < x_{n} < +\phi ] + \mathbb{I}[ 0 \le +\phi \le x_{n} ] \big) \, \exp(-\phi) \\ & \quad + \int_{0}^{\infty} \mathrm{d} \phi \, \big( \mathbb{I}[ -\phi < x_{n} \le 0 ] + \mathbb{I}[ x_{n} \le -\phi \le 0 ] \big)\, \exp(-\phi) \\ &= \quad \int_{0}^{\infty} \mathrm{d} \phi \, \mathbb{I}[ 0 < +\phi ] \, \mathbb{I}[ 0 < x_{n} ] \, \exp(-\phi) \\ & \quad + \int_{0}^{\infty} \mathrm{d} \phi \, \mathbb{I}[ -\phi < 0 ] \, \mathbb{I}[ x_{n} \le 0 ] \, \exp(-\phi) \\ &= \;\;\; \mathbb{I}[ 0 < x_{n} ] \, \int_{0}^{\infty} \mathrm{d} \phi \, \exp(-\phi) \\ & \quad + \mathbb{I}[ x_{n} \le 0 ] \, \int_{0}^{\infty} \mathrm{d} \phi \, \exp(-\phi) \\ &= \;\;\; \mathbb{I}[ 0 < x_{n} ] + \mathbb{I}[ x_{n} \le 0 ] \\ &= \;\;\; 1. \end{align*} \]

One stationary distribution is specified by the standard Laplace density function, \[ \text{Laplace}(x \mid 0, 1) = \frac{1}{2} \exp( - | x | ). \] Verifying this requires evaluating the integral \[ \begin{align*} I &= \int_{-\infty}^{\infty} \mathrm{d} x_{n} \, \text{Laplace}(x_{n} \mid 0, 1) \, \tau(x_{n + 1} \mid x_{n}) \\ &= \quad \int_{-\infty}^{\infty} \mathrm{d} x_{n} \, \frac{1}{2} \exp( - | x_{n} | ) \, \int_{0}^{\infty} \mathrm{d} \phi \, \mathbb{I}[ 0 < x_{n} < +\phi ] \, \delta( x_{n + 1} - 2 \, x_{n} ) \, \exp(-\phi) \\ & \quad + \int_{-\infty}^{\infty} \mathrm{d} x_{n} \, \frac{1}{2} \exp( - | x_{n} | ) \, \int_{0}^{\infty} \mathrm{d} \phi \, \mathbb{I}[ 0 \le +\phi \le x_{n} ] \, \delta( x_{n + 1} - 2 \, \phi) \,\;\; \exp(-\phi) \\ & \quad + \int_{-\infty}^{\infty} \mathrm{d} x_{n} \, \frac{1}{2} \exp( - | x_{n} | ) \, \int_{0}^{\infty} \mathrm{d} \phi \, \mathbb{I}[ -\phi < x_{n} \le 0 ] \, \delta( x_{n + 1} - 2 \, x_{n} ) \, \exp(-\phi) \\ & \quad + \int_{-\infty}^{\infty} \mathrm{d} x_{n} \, \frac{1}{2} \exp( - | x_{n} | ) \, \int_{0}^{\infty} \mathrm{d} \phi \, \mathbb{I}[ x_{n} \le -\phi \le 0 ] \, \delta( x_{n + 1} + 2 \, \phi) \,\;\; \exp(-\phi) \\ &\equiv I_{1} + I_{2} + I_{3} + I_{4}. \end{align*} \]

Although they may appear imposing, these integrals are actually pretty straightforward to evaluate given the Dirac delta functions; we just have to be careful to respect all of the conditional logic. For example we can rearrange the first integral into the form \[ \begin{align*} I_{1} &= \int_{-\infty}^{\infty} \mathrm{d} x_{n} \, \frac{1}{2} \exp( - | x_{n} | ) \, \int_{0}^{\infty} \mathrm{d} \phi \, \mathbb{I}[ 0 < x_{n} < +\phi ] \, \delta( x_{n + 1} - 2 \, x_{n} ) \, \exp(-\phi) \\ &= \frac{1}{2} \int_{-\infty}^{\infty} \mathrm{d} x_{n} \, \exp( - | x_{n} | ) \, \delta( x_{n + 1} - 2 \, x_{n} ) \int_{0}^{\infty} \mathrm{d} \phi \, \mathbb{I}[ 0 < x_{n} < +\phi ] \, \exp(-\phi) \\ &= \frac{1}{2} \int_{-\infty}^{\infty} \mathrm{d} x_{n} \, \exp( - | x_{n} | ) \, \delta( x_{n + 1} - 2 \, x_{n} ) \int_{x_{n}}^{\infty} \mathrm{d} \phi \, \mathbb{I}[ 0 < x_{n} ] \, \exp(-\phi) \\ &= \frac{1}{2} \int_{-\infty}^{\infty} \mathrm{d} x_{n} \, \exp( - | x_{n} | ) \, \delta( x_{n + 1} - 2 \, x_{n} ) \, \mathbb{I}[ 0 < x_{n} ] \, \int_{x_{n}}^{\infty} \mathrm{d} \phi \, \exp(-\phi) \\ &= \frac{1}{2} \int_{-\infty}^{\infty} \mathrm{d} x_{n} \, \exp( - | x_{n} | ) \, \delta( x_{n + 1} - 2 \, x_{n} ) \, \mathbb{I}[ 0 < x_{n} ] \, \big( \exp(-x_{n}) - \exp(-\infty) \big) \\ &= \frac{1}{2} \int_{-\infty}^{\infty} \mathrm{d} x_{n} \, \exp( - | x_{n} | - x_{n} ) \, \delta( x_{n + 1} - 2 \, x_{n} ) \, \mathbb{I}[ 0 < x_{n} ]. \end{align*} \] Whenever the indicator function is nonzero \(| x_{n} | = x_{n}\), so this reduces to \[ \begin{align*} I_{1} &= \frac{1}{2} \int_{-\infty}^{\infty} \mathrm{d} x_{n} \, \exp( - | x_{n} | - x_{n} ) \, \delta( x_{n + 1} - 2 \, x_{n} ) \, \mathbb{I}[ 0 < x_{n} ] \\ &= \frac{1}{2} \int_{-\infty}^{\infty} \mathrm{d} x_{n} \, \exp( - 2 \, x_{n} ) \, \delta( x_{n + 1} - 2 \, x_{n} ) \, \mathbb{I}[ 0 < x_{n} ]. \end{align*} \] Making the substitution \(u = 2 \, x_{n}\) this becomes \[ \begin{align*} I_{1} &= \frac{1}{4} \int_{-\infty}^{\infty} \mathrm{d} u \, \exp( - u ) \, \delta( x_{n + 1} - u) \, \mathbb{I}[ 0 < u / 2 ] \\ &= \frac{1}{4} \, \exp( - x_{n + 1} ) \, \mathbb{I}[ 0 < x_{n + 1} / 2 ] \\ &= \frac{1}{4} \, \exp( - x_{n + 1} ) \, \mathbb{I}[ 0 < x_{n + 1} ]. \end{align*} \]

The second integral proceeds slightly differently due to the intermediate variable appearing in the Dirac delta function, \[ \begin{align*} I_{2} &= \int_{-\infty}^{\infty} \mathrm{d} x_{n} \, \frac{1}{2} \exp( - | x_{n} | ) \, \int_{0}^{\infty} \mathrm{d} \phi \, \mathbb{I}[ 0 \le +\phi \le x_{n} ] \, \delta( x_{n + 1} - 2 \, \phi) \, \exp(-\phi) \\ &= \frac{1}{2} \int_{-\infty}^{\infty} \mathrm{d} x_{n} \, \exp( - x_{n} ) \, \int_{0}^{\infty} \mathrm{d} \phi \, \mathbb{I}[ 0 \le +\phi \le x_{n} ] \, \delta( x_{n + 1} - 2 \, \phi) \, \exp(-\phi) \\ &= \frac{1}{4} \int_{-\infty}^{\infty} \mathrm{d} x_{n} \, \exp( - x_{n} ) \, \int_{0}^{\infty} \mathrm{d} u \, \mathbb{I}[ 0 \le u / 2 \le x_{n} ] \, \delta( x_{n + 1} - u) \, \exp(-u / 2) \\ &= \frac{1}{4} \int_{-\infty}^{\infty} \mathrm{d} x_{n} \, \exp( - x_{n} ) \, \mathbb{I}[ 0 \le x_{n + 1} / 2 \le x_{n} ] \, \exp(-x_{n + 1} / 2) \\ &= \frac{1}{4} \exp(-x_{n + 1} / 2) \, \mathbb{I}[ 0 \le x_{n + 1} ] \, \int_{x_{n + 1} / 2}^{\infty} \mathrm{d} x_{n} \, \exp( - x_{n} ) \\ &= \frac{1}{4} \exp(-x_{n + 1} / 2) \, \mathbb{I}[ 0 \le x_{n + 1} ] \, \big( \exp( - x_{n + 1} / 2 ) - \exp(- \infty) \big) \\ &= \frac{1}{4} \exp(-x_{n + 1}) \, \mathbb{I}[ 0 \le x_{n + 1} ]. \end{align*} \]

The last two integrals follow the same progression as these two integrals only with flipped signs. In particular \[ \begin{align*} I_{3} &= \int_{-\infty}^{\infty} \mathrm{d} x_{n} \, \frac{1}{2} \exp( - | x_{n} | ) \, \int_{0}^{\infty} \mathrm{d} \phi \, \mathbb{I}[ -\phi < x_{n} \le 0 ] \, \delta( x_{n + 1} - 2 \, x_{n} ) \, \exp(-\phi) \\ &= \frac{1}{2} \int_{-\infty}^{\infty} \mathrm{d} x_{n} \, \exp( x_{n} ) \, \delta( x_{n + 1} - 2 \, x_{n} ) \int_{0}^{\infty} \mathrm{d} \phi \, \mathbb{I}[ -\phi < x_{n} \le 0 ] \, \exp(-\phi) \\ &= \frac{1}{2} \int_{-\infty}^{\infty} \mathrm{d} x_{n} \, \exp( x_{n} ) \, \delta( x_{n + 1} - 2 \, x_{n} ) \int_{-x_{n}}^{\infty} \mathrm{d} \phi \, \mathbb{I}[ x_{n} \le 0 ] \, \exp(-\phi) \\ &= \frac{1}{2} \int_{-\infty}^{\infty} \mathrm{d} x_{n} \, \exp( x_{n} ) \, \delta( x_{n + 1} - 2 \, x_{n} ) \, \mathbb{I}[ x_{n} \le 0 ] \, \big( \exp(x_{n}) - \exp(-\infty) \big) \\ &= \frac{1}{2} \int_{-\infty}^{\infty} \mathrm{d} x_{n} \, \exp( 2 \, x_{n} ) \, \delta( x_{n + 1} - 2 \, x_{n} ) \, \mathbb{I}[ x_{n} \le 0 ] \\ &= \frac{1}{4} \int_{-\infty}^{\infty} \mathrm{d} u \, \exp( u ) \, \delta( x_{n + 1} - u ) \, \mathbb{I}[ u / 2 \le 0 ] \\ &= \frac{1}{4} \exp( x_{n + 1} ) \, \mathbb{I}[ x_{n + 1} / 2 \le 0 ] \\ &= \frac{1}{4} \exp( x_{n + 1} ) \, \mathbb{I}[ x_{n + 1} \le 0 ], \end{align*} \] and \[ \begin{align*} I_{4} &= \int_{-\infty}^{\infty} \mathrm{d} x_{n} \, \frac{1}{2} \exp( - | x_{n} | ) \, \int_{0}^{\infty} \mathrm{d} \phi \, \mathbb{I}[ x_{n} \le -\phi \le 0 ] \, \delta( x_{n + 1} + 2 \, \phi) \, \exp(-\phi) \\ &= \frac{1}{2} \int_{-\infty}^{\infty} \mathrm{d} x_{n} \, \exp( x_{n} ) \, \int_{0}^{\infty} \mathrm{d} \phi \, \mathbb{I}[ x_{n} \le -\phi \le 0 ] \, \delta( x_{n + 1} + 2 \, \phi) \, \exp(-\phi) \\ &= \frac{1}{4} \int_{-\infty}^{\infty} \mathrm{d} x_{n} \, \exp( x_{n} ) \, \int_{0}^{\infty} \mathrm{d} u \, \mathbb{I}[ x_{n} \le -u / 2 \le 0 ] \, \delta( x_{n + 1} + u) \, \exp(-u / 2) \\ &= \frac{1}{4} \int_{-\infty}^{\infty} \mathrm{d} x_{n} \, \exp( x_{n} ) \, \mathbb{I}[ x_{n} \le x_{n + 1} / 2 \le 0 ] \, \exp(x_{n + 1} / 2) \\ &= \frac{1}{4} \exp(x_{n + 1} / 2) \, \mathbb{I}[ x_{n + 1} / 2 \le 0 ] \, \int_{-\infty}^{x_{n + 1} / 2} \mathrm{d} x_{n} \, \exp( x_{n} ) \\ &= \frac{1}{4} \exp(x_{n + 1} / 2) \, \mathbb{I}[ x_{n + 1} / 2 \le 0 ] \, \big( \exp( x_{n + 1} / 2 ) - \exp(-\infty) \big) \\ &= \frac{1}{4} \exp(x_{n + 1}) \, \mathbb{I}[ x_{n + 1} / 2 \le 0 ] \\ &= \frac{1}{4} \exp(x_{n + 1}) \, \mathbb{I}[ x_{n + 1} \le 0 ]. \end{align*} \]

Putting all four terms together gives \[ \begin{align*} I &= \int_{-\infty}^{\infty} \mathrm{d} x_{n} \, \text{Laplace}(x_{n} \mid 0, 1) \, \tau(x_{n + 1} \mid x_{n}) \\ &= I_{1} + I_{2} + I_{3} + I_{4} \\ &= \quad \frac{1}{4} \exp(-x_{n + 1}) \, \mathbb{I}[ x_{n + 1} > 0 ] + \frac{1}{4} \exp(-x_{n + 1}) \, \mathbb{I}[ x_{n + 1} > 0 ] \\ & \quad + \frac{1}{4} \exp(+x_{n + 1}) \, \mathbb{I}[ x_{n + 1} \le 0 ] + \frac{1}{4} \exp(+x_{n + 1}) \, \mathbb{I}[ x_{n + 1} \le 0 ] \\ &= \quad \frac{1}{2} \exp(-x_{n + 1}) \, \mathbb{I}[ x_{n + 1} > 0 ] \\ & \quad + \frac{1}{2} \exp(+x_{n + 1}) \, \mathbb{I}[ x_{n + 1} \le 0 ] \\ &= \quad \frac{1}{2} \exp(- | x_{n + 1}| ) \\ &= \quad \text{Laplace}(x_{n + 1} \mid 0, 1), \end{align*} \] as desired.
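We can also corroborate this result with a quick simulation. The sketch below reconstructs a sampler directly from the conditional density function above (the name transition_one and the implementation details are illustrative, not from the main text): drawing initial states exactly from the Laplace distribution and applying a single transition should leave the Laplace histogram unchanged.

# Sampler consistent with the conditional density function above
transition_one <- function(x) {
  phi <- rexp(1)
  if (x > 0) {
    if (x < phi) 2 * x else 2 * phi
  } else {
    if (x > -phi) 2 * x else -2 * phi
  }
}

set.seed(7759201)

# Exact Laplace(0, 1) samples via a random sign times an exponential
S <- 10000
x_curr <- sample(c(-1, 1), S, replace=TRUE) * rexp(S)
x_next <- sapply(x_curr, transition_one)

hist(x_next, breaks=50, prob=TRUE, main="",
     xlab="x_{n + 1}", col=c_dark, border=c_dark_highlight)

plot_xs <- seq(-8, 8, 0.01)
lines(plot_xs, 0.5 * exp(-abs(plot_xs)), lwd=2, col=c_light)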

Another stationary distribution is specified by a half-Laplace density function, \[ \text{half-Laplace}(x \mid 0, 1) = \exp( - x ) \, \mathbb{I}[x > 0]. \] Verifying this requires evaluating the integral \[ \begin{align*} I &= \int_{-\infty}^{\infty} \mathrm{d} x_{n} \, \text{half-Laplace}(x_{n} \mid 0, 1) \, \tau(x_{n + 1} \mid x_{n}) \\ &= \quad \int_{-\infty}^{\infty} \mathrm{d} x_{n} \, \exp( - x_{n} ) \, \mathbb{I}[x_{n} > 0] \, \int_{0}^{\infty} \mathrm{d} \phi \, \mathbb{I}[ 0 < x_{n} < +\phi ] \, \delta( x_{n + 1} - 2 \, x_{n} ) \, \exp(-\phi) \\ & \quad + \int_{-\infty}^{\infty} \mathrm{d} x_{n} \, \exp( - x_{n} ) \, \mathbb{I}[x_{n} > 0] \, \int_{0}^{\infty} \mathrm{d} \phi \, \mathbb{I}[ 0 \le +\phi \le x_{n} ] \, \delta( x_{n + 1} - 2 \, \phi) \,\;\; \exp(-\phi) \\ & \quad + \int_{-\infty}^{\infty} \mathrm{d} x_{n} \, \exp( - x_{n} ) \, \mathbb{I}[x_{n} > 0] \, \int_{0}^{\infty} \mathrm{d} \phi \, \mathbb{I}[ -\phi < x_{n} \le 0 ] \, \delta( x_{n + 1} - 2 \, x_{n} ) \, \exp(-\phi) \\ & \quad + \int_{-\infty}^{\infty} \mathrm{d} x_{n} \, \exp( - x_{n} ) \, \mathbb{I}[x_{n} > 0] \, \int_{0}^{\infty} \mathrm{d} \phi \, \mathbb{I}[ x_{n} \le -\phi \le 0 ] \, \delta( x_{n + 1} + 2 \, \phi) \,\;\; \exp(-\phi) \\ &= \quad \int_{-\infty}^{\infty} \mathrm{d} x_{n} \, \exp( - x_{n} ) \, \mathbb{I}[x_{n} > 0] \, \int_{0}^{\infty} \mathrm{d} \phi \, \mathbb{I}[ 0 < x_{n} < +\phi ] \, \delta( x_{n + 1} - 2 \, x_{n} ) \, \exp(-\phi) \\ & \quad + \int_{-\infty}^{\infty} \mathrm{d} x_{n} \, \exp( - x_{n} ) \, \mathbb{I}[x_{n} > 0] \, \int_{0}^{\infty} \mathrm{d} \phi \, \mathbb{I}[ 0 \le +\phi \le x_{n} ] \, \delta( x_{n + 1} - 2 \, \phi) \,\;\; \exp(-\phi) \\ &= \quad \int_{-\infty}^{\infty} \mathrm{d} x_{n} \, \exp( - x_{n} ) \, \mathbb{I}[x_{n} > 0] \, \delta( x_{n + 1} - 2 \, x_{n} ) \int_{x_{n}}^{\infty} \mathrm{d} \phi \, \exp(-\phi) \\ & \quad + \frac{1}{2} \int_{-\infty}^{\infty} \mathrm{d} x_{n} \, \exp( - x_{n} ) \, \mathbb{I}[x_{n} > 0] \, \int_{0}^{\infty} \mathrm{d} u \, \mathbb{I}[ 0 \le u / 2 \le x_{n} ] \, \delta( x_{n + 1} - u) \, \exp(-u / 2) \\ &= \quad \int_{-\infty}^{\infty} \mathrm{d} x_{n} \, \exp( - x_{n} ) \, \mathbb{I}[x_{n} > 0] \, \delta( x_{n + 1} - 2 \, x_{n} ) \exp(-x_{n}) \\ & \quad + \frac{1}{2} \int_{-\infty}^{\infty} \mathrm{d} x_{n} \, \exp( - x_{n} ) \, \mathbb{I}[x_{n} > 0] \, \mathbb{I}[ 0 \le x_{n + 1} / 2 \le x_{n} ] \, \exp(-x_{n + 1} / 2) \\ &= \quad \int_{-\infty}^{\infty} \mathrm{d} x_{n} \, \exp( - 2 \, x_{n} ) \, \mathbb{I}[x_{n} > 0] \, \delta( x_{n + 1} - 2 \, x_{n} ) \\ & \quad + \exp(-x_{n + 1} / 2) \, \frac{1}{2} \, \mathbb{I}[ 0 \le x_{n + 1} / 2 ] \int_{x_{n + 1} / 2}^{\infty} \mathrm{d} x_{n} \, \exp( - x_{n} ) \\ &= \quad \frac{1}{2} \int_{-\infty}^{\infty} \mathrm{d} u \, \exp( - u ) \, \mathbb{I}[u / 2 > 0] \, \delta( x_{n + 1} - u ) \\ & \quad + \exp(-x_{n + 1} / 2) \, \frac{1}{2} \, \mathbb{I}[ 0 \le x_{n + 1} / 2 ] \, \exp( - x_{n + 1} / 2 ) \\ &= \quad \frac{1}{2} \, \exp( - x_{n + 1} ) \, \mathbb{I}[x_{n + 1} / 2 > 0] \\ & \quad + \frac{1}{2} \, \exp(-x_{n + 1}) \, \mathbb{I}[ 0 \le x_{n + 1} / 2 ] \\ &= \quad \exp( - x_{n + 1} ) \, \mathbb{I}[0 < x_{n + 1}] \\ &= \text{half-Laplace}(x_{n + 1} \mid 0, 1). \end{align*} \]

A nearly equivalent calculation shows that \(\text{half-Laplace}(-x_{n} \mid 0, 1)\) also specifies a stationary distribution for this Markov transition.

6.2 Transition Two

Our second demonstrative Markov transition distribution was given by \[ \begin{align*} \tau(x_{n + 1} \mid x_{n}) &= \quad \mathbb{I}[ x_{n} > 0 ] \, \text{neg-half-normal}(x_{n + 1} \mid 0, 1) \\ & \quad + \mathbb{I}[ x_{n} \le 0 ] \, \text{pos-half-normal}(x_{n + 1} \mid 0, 1). \end{align*} \]

The Markov transition distribution is properly normalized, \[ \begin{align*} \int_{-\infty}^{\infty} \mathrm{d} x_{n + 1} \, \tau(x_{n + 1} \mid x_{n}) &= \quad \mathbb{I}[ x_{n} > 0 ] \, \int_{-\infty}^{\infty} \mathrm{d} x_{n + 1} \, \text{neg-half-normal}(x_{n + 1} \mid 0, 1) \\ & \quad + \mathbb{I}[ x_{n} \le 0 ] \, \int_{-\infty}^{\infty} \mathrm{d} x_{n + 1} \, \text{pos-half-normal}(x_{n + 1} \mid 0, 1) \\ &= \quad \mathbb{I}[ x_{n} > 0 ] \\ & \quad + \mathbb{I}[ x_{n} \le 0 ] \\ &= \quad 1. \end{align*} \]

This Markov transition distribution preserves the probability distribution specified by the probability density function \(\text{normal}(x \mid 0, 1)\). We can verify this invariance by direct inspection, \[ \begin{align*} I &= \int_{-\infty}^{\infty} \mathrm{d} x_{n} \, \text{normal}(x_{n} \mid 0, 1) \, \tau(x_{n + 1} \mid x_{n}) \\ &= \quad \int_{-\infty}^{\infty} \mathrm{d} x_{n} \, \text{normal}(x_{n} \mid 0, 1) \, \mathbb{I}[ x_{n} > 0 ] \, \text{neg-half-normal}(x_{n + 1} \mid 0, 1) \\ & \quad + \int_{-\infty}^{\infty} \mathrm{d} x_{n} \, \text{normal}(x_{n} \mid 0, 1) \, \mathbb{I}[ x_{n} \le 0 ] \, \text{pos-half-normal}(x_{n + 1} \mid 0, 1) \\ &= \quad \text{neg-half-normal}(x_{n + 1} \mid 0, 1) \, \int_{-\infty}^{\infty} \mathrm{d} x_{n} \, \text{normal}(x_{n} \mid 0, 1) \, \mathbb{I}[ x_{n} > 0 ] \\ & \quad + \, \text{pos-half-normal}(x_{n + 1} \mid 0, 1) \, \int_{-\infty}^{\infty} \mathrm{d} x_{n} \, \text{normal}(x_{n} \mid 0, 1) \, \mathbb{I}[ x_{n} \le 0 ] \\ &= \quad \text{neg-half-normal}(x_{n + 1} \mid 0, 1) \, \int_{0}^{\infty} \mathrm{d} x_{n} \, \text{normal}(x_{n} \mid 0, 1) \, \\ & \quad + \, \text{pos-half-normal}(x_{n + 1} \mid 0, 1) \, \int_{-\infty}^{0} \mathrm{d} x_{n} \, \text{normal}(x_{n} \mid 0, 1) \, \\ &= \quad \text{neg-half-normal}(x_{n + 1} \mid 0, 1) \, \frac{1}{2} \\ & \quad + \text{pos-half-normal}(x_{n + 1} \mid 0, 1) \, \frac{1}{2} \\ &= \text{normal}(x_{n + 1} \mid 0, 1). \end{align*} \]

Convolving the Markov transition distribution with itself gives a composite Markov transition distribution specified by the conditional density function \[ \tau^{2}(x_{n + 2} \mid x_{n}) = \int_{-\infty}^{+\infty} \mathrm{d} x_{n + 1} \, \tau(x_{n + 2} \mid x_{n + 1}) \, \tau(x_{n + 1} \mid x_{n}). \] For our base Markov transition density function the integrand becomes \[ \begin{align*} \tau(x_{n + 2} \mid x_{n + 1}) \, \tau(x_{n + 1} \mid x_{n}) &= \;\; \bigg( \;\;\, \mathbb{I}[ x_{n + 1} > 0 ] \text{neg-half-normal}(x_{n + 2} \mid 0, 1) \\ &\quad\quad + \mathbb{I}[ x_{n + 1} \le 0 ] \, \text{pos-half-normal}(x_{n + 2} \mid 0, 1) \bigg) \\ & \quad \cdot \bigg( \;\;\, \mathbb{I}[ x_{n} > 0 ] \, \text{neg-half-normal}(x_{n + 1} \mid 0, 1) \\ &\quad\quad + \mathbb{I}[ x_{n} \le 0 ] \, \text{pos-half-normal}(x_{n + 1} \mid 0, 1) \bigg) \\ &= \quad\;\, \mathbb{I}[ x_{n + 1} > 0 ] \, \text{neg-half-normal}(x_{n + 2} \mid 0, 1) \\ &\quad\quad \cdot \mathbb{I}[ x_{n} > 0 ] \, \text{neg-half-normal}(x_{n + 1} \mid 0, 1) \\ &\quad + \;\; \mathbb{I}[ x_{n + 1} > 0 ] \, \text{neg-half-normal}(x_{n + 2} \mid 0, 1) \\ &\quad\quad \cdot \mathbb{I}[ x_{n} \le 0 ] \, \text{pos-half-normal}(x_{n + 1} \mid 0, 1) \\ &\quad + \;\; \mathbb{I}[ x_{n + 1} \le 0 ] \, \text{pos-half-normal}(x_{n + 2} \mid 0, 1) \\ & \quad\quad \cdot \mathbb{I}[ x_{n} > 0 ] \, \text{neg-half-normal}(x_{n + 1} \mid 0, 1) \\ &\quad + \;\; \mathbb{I}[ x_{n + 1} \le 0 ] \, \text{pos-half-normal}(x_{n + 2} \mid 0, 1) \\ &\quad\quad \cdot \mathbb{I}[ x_{n} \le 0 ] \, \text{pos-half-normal}(x_{n + 1} \mid 0, 1) \\ &= \quad\;\, \mathbb{I}[ x_{n} \le 0 ] \\ &\quad\quad \cdot \text{neg-half-normal}(x_{n + 2} \mid 0, 1) \, \text{pos-half-normal}(x_{n + 1} \mid 0, 1) \\ &\quad + \;\; \mathbb{I}[ x_{n} > 0 ] \, \\ & \quad\quad \cdot \text{pos-half-normal}(x_{n + 2} \mid 0, 1) \, \text{neg-half-normal}(x_{n + 1} \mid 0, 1) \end{align*} \]

Consequently \[ \begin{align*} \tau^{2}(x_{n + 2} \mid x_{n}) &= \int_{-\infty}^{+\infty} \mathrm{d} x_{n + 1} \, \tau(x_{n + 2} \mid x_{n + 1}) \, \tau(x_{n + 1} \mid x_{n}) \\ &= \quad\;\, \int_{-\infty}^{+\infty} \mathrm{d} x_{n + 1} \, \;\; \mathbb{I}[ x_{n} \le 0 ] \, \\ & \quad\quad\quad\quad\quad\quad\quad \cdot \text{neg-half-normal}(x_{n + 2} \mid 0, 1) \, \text{pos-half-normal}(+x_{n + 1} \mid 0, 1) \\ &\quad + \;\; \int_{-\infty}^{+\infty} \mathrm{d} x_{n + 1} \, \;\; \mathbb{I}[ x_{n} > 0 ] \\ &\quad\quad\quad\quad\quad\quad\quad \cdot \text{pos-half-normal}(x_{n + 2} \mid 0, 1) \, \text{neg-half-normal}(x_{n + 1} \mid 0, 1) \\ &= \;\; \mathbb{I}[ x_{n} \le 0 ] \, \text{neg-half-normal}(x_{n + 2} \mid 0, 1) \\ &\quad\quad \cdot \int_{-\infty}^{+\infty} \mathrm{d} x_{n + 1} \, \text{pos-half-normal}(x_{n + 1} \mid 0, 1) \\ &\quad + \;\; \mathbb{I}[ x_{n} > 0 ] \, \text{pos-half-normal}(x_{n + 2} \mid 0, 1) \\ &\quad\quad \cdot \int_{-\infty}^{+\infty} \mathrm{d} x_{n + 1} \, \text{neg-half-normal}(x_{n + 1} \mid 0, 1) \\ &= \;\;\; \mathbb{I}[ x_{n} \le 0 ] \, \text{neg-half-normal}(x_{n + 2} \mid 0, 1) \\ &\quad + \mathbb{I}[ x_{n} > 0 ] \, \text{pos-half-normal}(x_{n + 2} \mid 0, 1). \end{align*} \]

Notice the difference in signs: while the base Markov transition distribution flips signs at every iteration the composite Markov transition preserves the sign of the initial state. If we start in \(\mathbb{R}^{+}\) then we remain in \(\mathbb{R}^{+}\), and if we start in \(\mathbb{R}^{-}\) then we remain in \(\mathbb{R}^{-}\). In other words this composite Markov transition distribution is reducible!
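We can see this reducibility directly in simulation with a sampler consistent with the base transition density (transition_two is an illustrative name, not from the main text). Single transitions flip the sign of the state at every iteration, so any even number of transitions always returns to the sign of the initial state.

transition_two <- function(x) {
  if (x > 0)
    -abs(rnorm(1)) # neg-half-normal when the current state is positive
  else
    abs(rnorm(1))  # pos-half-normal when the current state is non-positive
}

xs <- rep(0, 11)
xs[1] <- 1.5
for (n in 1:10)
  xs[n + 1] <- transition_two(xs[n])

sign(xs)
 [1]  1 -1  1 -1  1 -1  1 -1  1 -1  1
sign(xs[seq(1, 11, 2)])
[1] 1 1 1 1 1 1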

This composite Markov transition distribution exhibits three stationary distributions. One is specified by the probability density function \(\text{normal}(x \mid 0, 1)\), \[ \begin{align*} I &= \int_{-\infty}^{\infty} \mathrm{d} x_{n} \, \text{normal}(x_{n} \mid 0, 1) \, \tau^{2}(x_{n + 2} \mid x_{n}) \\ &= \quad \text{neg-half-normal}(x_{n + 2} \mid 0, 1) \, \int_{-\infty}^{\infty} \mathrm{d} x_{n} \, \mathbb{I}[ x_{n} \le 0 ] \, \text{normal}(x_{n} \mid 0, 1) \\ &\quad + \text{pos-half-normal}(x_{n + 2} \mid 0, 1) \, \int_{-\infty}^{\infty} \mathrm{d} x_{n} \, \mathbb{I}[ x_{n} > 0 ] \, \text{normal}(x_{n} \mid 0, 1) \\ &= \quad \text{neg-half-normal}(x_{n + 2} \mid 0, 1) \, \int_{-\infty}^{0} \mathrm{d} x_{n} \, \text{normal}(x_{n} \mid 0, 1) \\ &\quad + \text{pos-half-normal}(x_{n + 2} \mid 0, 1) \, \int_{0}^{\infty} \mathrm{d} x_{n} \, \text{normal}(x_{n} \mid 0, 1) \\ &= \quad \text{neg-half-normal}(x_{n + 2} \mid 0, 1) \, \frac{1}{2} \\ &\quad + \text{pos-half-normal}(x_{n + 2} \mid 0, 1) \, \frac{1}{2} \\ &= \quad \text{normal}(x_{n + 2} \mid 0, 1). \end{align*} \]

Another is specified by the probability density function \(\text{pos-half-normal}(x \mid 0, 1)\), \[ \begin{align*} I &= \int_{-\infty}^{\infty} \mathrm{d} x_{n} \, \text{pos-half-normal}(x_{n} \mid 0, 1) \, \tau^{2}(x_{n + 2} \mid x_{n}) \\ &= \quad \text{neg-half-normal}(x_{n + 2} \mid 0, 1) \, \int_{-\infty}^{\infty} \mathrm{d} x_{n} \, \mathbb{I}[ x_{n} \le 0 ] \, \text{pos-half-normal}(x_{n} \mid 0, 1) \\ &\quad + \text{pos-half-normal}(x_{n + 2} \mid 0, 1) \, \int_{-\infty}^{\infty} \mathrm{d} x_{n} \, \mathbb{I}[ x_{n} > 0 ] \, \text{pos-half-normal}(x_{n} \mid 0, 1) \\ &= \quad \text{pos-half-normal}(x_{n + 2} \mid 0, 1) \, \int_{0}^{\infty} \mathrm{d} x_{n} \, \text{pos-half-normal}(x_{n} \mid 0, 1) \\ &= \quad \text{pos-half-normal}(x_{n + 2} \mid 0, 1) \, 1 \\ &= \quad \text{pos-half-normal}(x_{n + 2} \mid 0, 1). \end{align*} \]

The final stationary distribution is specified by the probability density function \(\text{neg-half-normal}(x \mid 0, 1)\), \[ \begin{align*} I &= \int_{-\infty}^{\infty} \mathrm{d} x_{n} \, \text{neg-half-normal}(x_{n} \mid 0, 1) \, \tau^{2}(x_{n + 2} \mid x_{n}) \\ &= \quad \text{neg-half-normal}(x_{n + 2} \mid 0, 1) \, \int_{-\infty}^{\infty} \mathrm{d} x_{n} \, \mathbb{I}[ x_{n} \le 0 ] \, \text{neg-half-normal}(x_{n} \mid 0, 1) \\ &\quad + \text{pos-half-normal}(x_{n + 2} \mid 0, 1) \, \int_{-\infty}^{\infty} \mathrm{d} x_{n} \, \mathbb{I}[ x_{n} > 0 ] \, \text{neg-half-normal}(x_{n} \mid 0, 1) \\ &= \quad \text{neg-half-normal}(x_{n + 2} \mid 0, 1) \, \int_{-\infty}^{0} \mathrm{d} x_{n} \, \text{neg-half-normal}(x_{n} \mid 0, 1) \\ &= \quad \text{neg-half-normal}(x_{n + 2} \mid 0, 1) \, 1 \\ &= \quad \text{neg-half-normal}(x_{n + 2} \mid 0, 1). \end{align*} \]

To which of these stationary distributions the \(N\)-step distribution will converge depends on the particular details of the initialization.

6.3 Transition Four

The next Markov transition distribution that admits analytic calculations is defined by the conditional density function \[ \tau(x_{n + 1} \mid x_{n}) = \text{Cauchy} \! \left( x_{n + 1} \mid \lambda \, x_{n} , 1 - \lambda \right). \]

The unique stationary distribution for this Markov transition is given by the probability density function \(\text{Cauchy}(x \mid 0, 1)\). We can verify this by direct inspection using only basic calculus techniques, but it will take quite a bit of elbow grease. More advanced integration techniques, namely Fourier transformation methods and contour integration, can make this derivation much less tortuous but they go far beyond basic calculus.

To verify that \(\text{Cauchy}(x \mid 0, 1)\) specifies a stationary distribution we need to calculate the integral \[ \begin{align*} I &= \int_{-\infty}^{\infty} \mathrm{d} x_{n} \, \text{Cauchy}(x_{n} \mid 0, 1) \, \tau(x_{n + 1} \mid x_{n}) \\ &= \int_{-\infty}^{\infty} \mathrm{d} x_{n} \, \text{Cauchy}(x_{n} \mid 0, 1) \, \text{Cauchy} \! \left( x_{n + 1} \mid \lambda \, x_{n} , 1 - \lambda \right). \end{align*} \] To ease the notation a bit let's define \(\sigma = 1 - \lambda\) for now so that the integral becomes \[ \begin{align*} I &= \int_{-\infty}^{\infty} \mathrm{d} x_{n} \, \text{Cauchy}(x_{n} \mid 0, 1) \, \text{Cauchy} \! \left( x_{n + 1} \mid \lambda \, x_{n} , \sigma \right) \\ &= \int_{-\infty}^{\infty} \mathrm{d} x_{n} \, \frac{1}{\pi} \frac{1}{1 + x_{n}^{2}} \, \frac{1}{\pi \, \sigma} \frac{1}{1 + \left( \frac{x_{n + 1} - \lambda \, x_{n}}{\sigma} \right)^{2} } \\ &= \frac{1}{\pi^{2} \, \sigma} \int_{-\infty}^{\infty} \mathrm{d} x_{n} \, \frac{1}{ \bigg( 1 + x_{n}^{2} \bigg) \, \left( 1 + \left( \frac{x_{n + 1} - \lambda \, x_{n}}{\sigma} \right)^{2} \right) } \\ &\equiv \frac{1}{\pi^{2} \, \sigma} \, I_{2} \end{align*} \]

In order to evaluate this using only basic calculus techniques we have to impose a partial fraction decomposition on the integrand, \[ \frac{1}{ \bigg( 1 + x_{n}^{2} \bigg) \, \left( 1 + \left( \frac{x_{n + 1} - \lambda \, x_{n}}{\sigma} \right)^{2} \right) } = \frac{A + B \, x_{n}}{1 + x_{n}^{2}} + \frac{C + D \left( \frac{x_{n + 1} - \lambda \, x_{n}}{\sigma} \right) } {1 + \left( \frac{x_{n + 1} - \lambda \, x_{n}}{\sigma} \right)^{2} }, \] for some terms \(A\), \(B\), \(C\), and \(D\) that are independent of \(x_{n}\). Before solving for these terms, however, let's consider the consequences of this presumed decomposition for the integral, \[ \begin{align*} I_{2} &= \int_{-\infty}^{\infty} \mathrm{d} x_{n} \, \frac{1}{ \bigg( 1 + x_{n}^{2} \bigg) \, \left( 1 + \left( \frac{x_{n + 1} - \lambda \, x_{n}}{\sigma} \right)^{2} \right) } \\ &= \int_{-\infty}^{\infty} \mathrm{d} x_{n} \, \left[ \frac{A + B \, x_{n} }{1 + x_{n}^{2}} + \frac{C + D \left( \frac{x_{n + 1} - \lambda \, x_{n}}{\sigma} \right) } {1 + \left( \frac{x_{n + 1} - \lambda \, x_{n}}{\sigma} \right)^{2} } \right] \\ &= \int_{-\infty}^{\infty} \mathrm{d} x_{n} \, \frac{A + B \, x_{n} }{1 + x_{n}^{2}} + \int_{-\infty}^{\infty} \mathrm{d} x_{n} \, \frac{C + D \left( \frac{x_{n + 1} - \lambda \, x_{n}}{\sigma} \right) } {1 + \left( \frac{x_{n + 1} - \lambda \, x_{n}}{\sigma} \right)^{2} }. \end{align*} \]

In the first term we can make the substitution \(z = x_{n}\) while in the second term we can make the substitution \(z = -(x_{n + 1} - \lambda \, x_{n}) / \sigma\) to give \[ \begin{align*} I_{2} &= \int_{-\infty}^{\infty} \mathrm{d} x_{n} \, \frac{A + B \, x_{n}}{1 + x_{n}^{2}} + \int_{-\infty}^{\infty} \mathrm{d} x_{n} \, \frac{C + D \left( \frac{x_{n + 1} - \lambda \, x_{n}}{\sigma} \right) } {1 + \left( \frac{x_{n + 1} - \lambda \, x_{n}}{\sigma} \right)^{2} } \\ &= \int_{-\infty}^{\infty} \mathrm{d} z \, \frac{A + B \, z}{1 + z^{2}} + \frac{\sigma}{\lambda} \int_{-\infty}^{\infty} \mathrm{d} z \, \frac{C - D \, z }{1 + z^{2} } \\ &= \left(A + \frac{\sigma}{\lambda} \, C \right) \int_{-\infty}^{\infty} \mathrm{d} z \, \frac{1}{1 + z^{2}} + \left(B - \frac{\sigma}{\lambda} \, D \right) \int_{-\infty}^{\infty} \mathrm{d} z \, \frac{z}{1 + z^{2}}. \end{align*} \] Consequently we don't need to solve for \(A\), \(B\), \(C\), and \(D\) individually but rather just these particular combinations, which simplifies the task just a little bit.

To go further we need to construct equations which constrain \(A\), \(B\), \(C\), and \(D\). These are given by equating the partial fraction decomposition to the exact integrand, \[ \begin{align*} \frac{1}{ \bigg( 1 + x_{n}^{2} \bigg) \, \left( 1 + \left( \frac{x_{n + 1} - \lambda \, x_{n}}{\sigma} \right)^{2} \right) } &= \frac{A + B \, x_{n}}{1 + x_{n}^{2}} + \frac{C + D \left( \frac{x_{n + 1} - \lambda \, x_{n}}{\sigma} \right) } {1 + \left( \frac{x_{n + 1} - \lambda \, x_{n}}{\sigma} \right)^{2} } \\ \frac{1}{ \bigg( 1 + x_{n}^{2} \bigg) \, \left( 1 + \left( \frac{x_{n + 1} - \lambda \, x_{n}}{\sigma} \right)^{2} \right) } &= \quad \frac{ \bigg( A + B \, x_{n} \bigg) \, \left( 1 + \left( \frac{x_{n + 1} - \lambda \, x_{n}}{\sigma} \right)^{2} \right)} { \bigg( 1 + x_{n}^{2} \bigg) \, \left( 1 + \left( \frac{x_{n + 1} - \lambda \, x_{n}}{\sigma} \right)^{2} \right) } \\ &\quad + \frac{ \left( C + D \left( \frac{x_{n + 1} - \lambda \, x_{n}}{\sigma} \right) \right) \, \bigg( 1 + x_{n}^{2} \bigg) } { \bigg( 1 + x_{n}^{2} \bigg) \, \left( 1 + \left( \frac{x_{n + 1} - \lambda \, x_{n}}{\sigma} \right)^{2} \right) }. \end{align*} \] Cancelling the common denominator gives \[ \begin{align*} 1 &= \quad \Bigg( A + B \, x_{n} \Bigg) \, \left( 1 + \left( \frac{x_{n + 1} - \lambda \, x_{n}}{\sigma} \right)^{2} \right) \\ &\quad + \left( C + D \left( \frac{x_{n + 1} - \lambda \, x_{n}}{\sigma} \right) \right) \, \bigg( 1 + x_{n}^{2} \bigg) \\ 1 &= \quad A \, \left( 1 + \frac{x_{n + 1}^{2}}{\sigma^{2}} \right) + C + D \, \frac{x_{n + 1}}{\sigma} \\ & \quad + \left[ -2 \, A \, \frac{ \lambda \, x_{n + 1} }{ \sigma^{2} } + B \, \left( 1 + \frac{x_{n + 1}^{2}}{\sigma^{2}} \right) - D \, \frac{\lambda}{\sigma} \right] \, x_{n} \\ & \quad + \left[ A \frac{ \lambda^{2} }{ \sigma^{2} } - 2 \, B \, \frac{ \lambda \, x_{n + 1} }{ \sigma^{2} } + C + D \, \frac{ x_{n + 1} }{ \sigma} \right] \, x_{n}^{2} \\ & \quad + \frac{\lambda}{\sigma} \, \left[ B \, \frac{ \lambda }{ \sigma } - D \right] \, x_{n}^{3}. \end{align*} \] For the partial fraction decomposition to hold for all \(x_{n}\) each of the bracketed expressions must vanish independently of the others, resulting in the system of equations \[ \begin{align*} 1 &= A \, \left( 1 + \frac{x_{n + 1}^{2}}{\sigma^{2}} \right) + C + D \, \frac{x_{n + 1}}{\sigma} \\ 0 &= -2 \, A \, \frac{ \lambda \, x_{n + 1} }{ \sigma^{2} } + B \, \left( 1 + \frac{x_{n + 1}^{2}}{\sigma^{2}} \right) - D \, \frac{\lambda}{\sigma} \\ 0 &= A \frac{ \lambda^{2} }{ \sigma^{2} } - 2 \, B \, \frac{ \lambda \, x_{n + 1} }{ \sigma^{2} } + C + D \, \frac{ x_{n + 1} }{ \sigma} \\ 0 &= B \, \frac{ \lambda }{ \sigma } - D. \end{align*} \]

The last equation can be immediately solved for \[ D = B \, \frac{ \lambda }{ \sigma }. \] This already implies that \(B - (\sigma / \lambda) D = 0\), and hence that the contribution from the second integral completely vanishes! Substituting this back into the remaining equations and simplifying gives the reduced system of equations \[ \begin{align*} 1 &= A \, \left( 1 + \frac{x_{n + 1}^{2}}{\sigma^{2}} \right) + C + B \, \frac{\lambda \, x_{n + 1}}{\sigma^{2}} \\ 0 &= -2 \, A \, \frac{ \lambda \, x_{n + 1} }{ \sigma^{2} } + B \, \left( \frac{\sigma^{2} + x_{n + 1}^{2} - \lambda^{2} }{\sigma^{2}} \right) \\ 0 &= A \frac{ \lambda^{2} }{ \sigma^{2} } - B \, \frac{ \lambda \, x_{n + 1} }{ \sigma^{2} } + C. \end{align*} \]

The second of these reduced equations can be immediately solved for \(B\), \[ B = A \, \frac{2 \, \lambda \, x_{n + 1} }{ \sigma^{2} + x_{n + 1}^{2} - \lambda^{2} }. \] Substituting this into the remaining two equations and simplifying then gives \[ \begin{align*} 1 &= A \, \left( 1 + \frac{ x_{n + 1}^{2} }{ \sigma^{2} } + \frac{ 2 \, \lambda^{2} \, x_{n + 1}^{2} } { \sigma^{2} \, ( \sigma^{2} + x_{n + 1}^{2} - \lambda^{2} ) } \right) + C \\ 0 &= A \, \left( \frac{ \lambda^{2} }{ \sigma^{2} } - \frac{ 2 \, \lambda^{2} \, x_{n + 1}^{2} } { \sigma^{2} \, ( \sigma^{2} + x_{n + 1}^{2} - \lambda^{2} ) } \right) + C. \end{align*} \] Subtracting the second of these reduced equations from the first then gives \[ \begin{align*} 1 &= A \, \left( \frac{ \sigma^{2} + x_{n + 1}^{2} - \lambda^{2} }{ \sigma^{2} } + \frac{ 4 \, \lambda^{2} \, x_{n + 1}^{2} } { \sigma^{2} \, ( \sigma^{2} + x_{n + 1}^{2} - \lambda^{2} ) } \right) \\ 1 &= A \, \frac{ ( \sigma^{2} + x_{n + 1}^{2} - \lambda^{2} )^{2} + 4 \, \lambda^{2} \, x_{n + 1}^{2} } { \sigma^{2} ( \sigma^{2} + x_{n + 1}^{2} - \lambda^{2} ) }. \end{align*} \]

Here we take advantage of the fact that the numerator admits a pretty miraculous factorization, \[ \begin{align*} 1 &= A \, \frac{ ( \sigma^{2} + x_{n + 1}^{2} - \lambda^{2} )^{2} + 4 \, \lambda^{2} \, x_{n + 1}^{2} } { \sigma^{2} ( \sigma^{2} + x_{n + 1}^{2} - \lambda^{2} ) } \\ 1 &= A \, \frac{ \big( (\lambda - \sigma)^{2} + x_{n + 1}^{2} \big) \, \big( (\lambda + \sigma)^{2} + x_{n + 1}^{2} \big) } { \sigma^{2} ( \sigma^{2} + x_{n + 1}^{2} - \lambda^{2} ) }, \end{align*} \] or \[ A = \frac{ \sigma^{2} ( \sigma^{2} + x_{n + 1}^{2} - \lambda^{2} ) } { \big( (\lambda - \sigma)^{2} + x_{n + 1}^{2} \big) \, \big( (\lambda + \sigma)^{2} + x_{n + 1}^{2} \big) }. \]

At this point we could substitute into the remaining equations to solve for \(B\), \(C\), and \(D\), but recall that we need to solve for only \[ A + \frac{ \sigma }{ \lambda } \, C = A \left[ 1 + \frac{ \sigma }{ \lambda } \frac{C}{A} \right]. \] Conveniently the second of our last two reduced equations allows us to solve for \(C / A\), \[ \begin{align*} 0 &= A \, \left( \frac{ \lambda^{2} }{ \sigma^{2} } - \frac{ 2 \, \lambda^{2} \, x_{n + 1}^{2} } { \sigma^{2} \, ( \sigma^{2} + x_{n + 1}^{2} - \lambda^{2} ) } \right) + C \\ 0 &= - A \, \frac{ \lambda^{2} }{ \sigma^{2} } \left( \frac{ \lambda^{2} + x_{n + 1}^{2} - \sigma^{2} } { \sigma^{2} + x_{n + 1}^{2} - \lambda^{2} } \right) + C \\ \frac{C}{A} &= \frac{ \lambda^{2} }{ \sigma^{2} } \left( \frac{ \lambda^{2} + x_{n + 1}^{2} - \sigma^{2} } { \sigma^{2} + x_{n + 1}^{2} - \lambda^{2} } \right). \end{align*} \] Consequently \[ \begin{align*} 1 + \frac{ \sigma }{ \lambda } \frac{C}{A} &= 1 + \frac{ \lambda }{ \sigma } \left( \frac{ \lambda^{2} + x_{n + 1}^{2} - \sigma^{2} } { \sigma^{2} + x_{n + 1}^{2} - \lambda^{2} } \right) \\ &= \frac{ \sigma \, \big( \sigma^{2} + x_{n + 1}^{2} - \lambda^{2} \big) + \lambda \, \big( \lambda^{2} + x_{n + 1}^{2} - \sigma^{2} \big) } { \sigma \, \big( \sigma^{2} + x_{n + 1}^{2} - \lambda^{2} \big) }. \end{align*} \] You'll never believe it but the numerator here admits a miraculous factorization of its own: \[ \begin{align*} 1 + \frac{ \sigma }{ \lambda } \frac{C}{A} &= \frac{ \sigma \, \big( \sigma^{2} + x_{n + 1}^{2} - \lambda^{2} \big) + \lambda \, \big( \lambda^{2} + x_{n + 1}^{2} - \sigma^{2} \big) } { \sigma \, \big( \sigma^{2} + x_{n + 1}^{2} - \lambda^{2} \big) } \\ &= \frac{ \big( \lambda + \sigma \big) \, \big( (\lambda - \sigma)^{2} + x_{n + 1}^{2} \big) } { \sigma \, \big( \sigma^{2} + x_{n + 1}^{2} - \lambda^{2} \big) }. \end{align*} \]

Now we can finally put everything together. First we have \[ \begin{align*} A + \frac{ \sigma }{ \lambda } \, C &= A \left[ 1 + \frac{ \sigma }{ \lambda } \frac{C}{A} \right] \\ &= \frac{ \sigma^{2} ( \sigma^{2} + x_{n + 1}^{2} - \lambda^{2} ) } { \big( (\lambda - \sigma)^{2} + x_{n + 1}^{2} \big) \, \big( (\lambda + \sigma)^{2} + x_{n + 1}^{2} \big) } \, \frac{ \big( \lambda + \sigma \big) \, \big( (\lambda - \sigma)^{2} + x_{n + 1}^{2} \big) } { \sigma \, \big( \sigma^{2} + x_{n + 1}^{2} - \lambda^{2} \big) } \\ &= \frac{ \sigma } { \big( (\lambda + \sigma)^{2} + x_{n + 1}^{2} \big) } \, \frac{ \big( \lambda + \sigma \big) }{ 1 } \\ &= \frac{ \sigma \, (\lambda + \sigma) } { (\lambda + \sigma)^{2} + x_{n + 1}^{2} }. \end{align*} \] Then \[ \begin{align*} I_{2} &= \left(A + \frac{\sigma}{\lambda} \, C \right) \int_{-\infty}^{\infty} \mathrm{d} z \, \frac{1}{1 + z^{2}} + \left(B - \frac{\sigma}{\lambda} \, D \right) \int \mathrm{d} z \, \frac{z}{1 + z^{2}} \\ &= \frac{ \sigma \, (\lambda + \sigma) } { (\lambda + \sigma)^{2} + x_{n + 1}^{2} } \int_{-\infty}^{\infty} \mathrm{d} z \, \frac{1}{1 + z^{2}} + \left( 0 \right) \int_{-\infty}^{\infty} \mathrm{d} z \, \frac{z}{1 + z^{2}} \\ &= \frac{ \sigma \, (\lambda + \sigma) } { (\lambda + \sigma)^{2} + x_{n + 1}^{2} } \int_{-\infty}^{\infty} \mathrm{d} z \, \frac{1}{1 + z^{2}} \\ &= \frac{ \sigma \, (\lambda + \sigma) } { (\lambda + \sigma)^{2} + x_{n + 1}^{2} } \, \pi. \end{align*} \] Finally \[ \begin{align*} \tau(x_{n + 1}) &= \int_{-\infty}^{\infty} \mathrm{d} x_{n} \, \text{Cauchy}(x_{n} \mid 0, 1) \, \text{Cauchy} \! \left( x_{n + 1} \mid \lambda \, x_{n} , \sigma \right) \\ &= \frac{1}{\pi^{2} \, \sigma} \, I_{2} \\ &= \frac{1}{\pi^{2} \, \sigma} \frac{ \sigma \, (\lambda + \sigma) } { (\lambda + \sigma)^{2} + x_{n + 1}^{2} } \, \pi \\ &= \frac{1}{\pi} \frac{ \lambda + \sigma }{ (\lambda + \sigma)^{2} + x_{n + 1}^{2} } \\ &= \frac{1}{\pi (\lambda + \sigma)} \frac{ 1 }{ 1 + \left( \frac{ x_{n + 1} }{ \lambda + \sigma } \right) ^{2} } \\ &= \text{Cauchy} (x_{n + 1} \mid 0, \lambda + \sigma). \end{align*} \]

For \(\sigma = 1 - \lambda\) this becomes \[ \begin{align*} I &= \int_{-\infty}^{\infty} \mathrm{d} x_{n} \, \text{Cauchy}(x_{n} \mid 0, 1) \, \text{Cauchy} \! \left( x_{n + 1} \mid \lambda \, x_{n} , 1 - \lambda \right) \\ &= \text{Cauchy} (x_{n + 1} \mid 0, \lambda + 1 - \lambda) \\ &= \text{Cauchy} (x_{n + 1} \mid 0, 1), \end{align*} \] at long last confirming that \(\text{Cauchy}(x \mid 0, 1)\) does indeed specify a stationary distribution for this Markov transition.
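For those suspicious of all that elbow grease, the convolution can also be checked with numerical quadrature. This is a quick sketch using R's integrate function with \(\lambda = 0.5\) and a few arbitrary test points; each difference should vanish up to quadrature error.

lambda <- 0.5

# Left-hand side of the stationarity condition at a given x_{n + 1}
stationary_lhs <- function(y) {
  integrate(function(x) dcauchy(x, 0, 1) * dcauchy(y, lambda * x, 1 - lambda),
            -Inf, Inf)$value
}

# Compare to the conjectured stationary density at each test point
ys <- c(-3, 0, 2)
sapply(ys, stationary_lhs) - dcauchy(ys, 0, 1)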

6.4 Transition Five

Finally we have the very nicely behaved Markov transition defined by the conditional density function \[ \tau(x_{n + 1} \mid x_{n}) = \text{normal} \! \left( x_{n + 1} \mid \lambda \, x_{n} , \sqrt{1 - \lambda^{2}} \right). \]

We can verify that \(\tau(x_{n + 1} \mid x_{n})\) is properly normalized and hence well-defines a Markov transition distribution, \[ \begin{align*} \int_{-\infty}^{\infty} \mathrm{d} x_{n + 1} \, \tau(x_{n + 1} \mid x_{n}) &= \int_{-\infty}^{\infty} \mathrm{d} x_{n + 1} \, \text{normal} \! \left( x_{n + 1} \mid \lambda \, x_{n} , \sqrt{1 - \lambda^{2}} \right) \\ &= 1. \end{align*} \]

The unique stationary distribution for this Markov transition is given by the probability density function \(\text{normal}(x \mid 0, 1)\). We can verify this by direct inspection, \[ \begin{align*} I &= \int_{-\infty}^{\infty} \mathrm{d} x_{n} \, \text{normal}(x_{n} \mid 0, 1) \, \tau(x_{n + 1} \mid x_{n}) \\ &= \int_{-\infty}^{\infty} \mathrm{d} x_{n} \, \text{normal}(x_{n} \mid 0, 1) \, \text{normal} \! \left( x_{n + 1} \mid \lambda \, x_{n}, \sqrt{1 - \lambda^{2}} \right). \end{align*} \]

Now the product of the two normal density functions can be rewritten as \[ \begin{align*} & \text{normal}(x_{n} \mid 0, 1) \, \text{normal}(x_{n + 1} \mid \lambda \, x_{n}, \sqrt{1 - \lambda^{2}} ) \\ &\hspace{30mm}= \quad \frac{1}{\sqrt{2 \pi}} \exp \left[ - \frac{1}{2} x_{n}^{2} \right] \\ &\hspace{30mm} \quad\;\; \cdot \frac{1}{\sqrt{2 \pi (1 - \lambda^{2}) }} \exp \left[ - \frac{1}{2} \frac{ ( x_{n + 1} - \lambda \, x_{n} )^{2} }{ 1 - \lambda^{2} } \right] \\ &\hspace{30mm}= \frac{1}{\sqrt{2 \pi}} \, \frac{1}{\sqrt{2 \pi (1 - \lambda^{2}) }} \\ &\hspace{30mm} \quad\;\; \cdot \exp \left[ - \frac{1}{2} \left( x_{n}^{2} + \frac{ (x_{n + 1} - \lambda \, x_{n})^{2} }{ 1 - \lambda^{2} } \right) \right]. \end{align*} \] Expanding and refactoring the exponential argument gives \[ \begin{align*} x_{n}^{2} + \frac{ (x_{n + 1} - \lambda \, x_{n})^{2} }{ 1 - \lambda^{2} } &= x_{n}^{2} + \frac{ x_{n + 1}^{2} - 2 \, \lambda \, x_{n} \, x_{n + 1} + \lambda^{2} \, x_{n}^{2} } { 1 - \lambda^{2} } \\ &= \frac{ x_{n}^{2} - \lambda^{2} x_{n}^{2} + x_{n + 1}^{2} - 2 \, \lambda \, x_{n} \, x_{n + 1} + \lambda^{2} \, x_{n}^{2} } { 1 - \lambda^{2} } \\ &= \frac{ x_{n}^{2} + x_{n + 1}^{2} - 2 \, \lambda \, x_{n} \, x_{n + 1} } { 1 - \lambda^{2} } \\ &= \frac{ x_{n}^{2} - 2 \, \lambda \, x_{n} \, x_{n + 1} + \lambda^{2} \, x_{n + 1}^{2} + (1 - \lambda^{2} ) x_{n + 1}^{2} } { 1 - \lambda^{2} } \\ &= \frac{ x_{n}^{2} - 2 \, \lambda \, x_{n} \, x_{n + 1} + \lambda^{2} \, x_{n + 1}^{2} } { 1 - \lambda^{2} } + \frac{ (1 - \lambda^{2} ) x_{n + 1}^{2} }{ 1 - \lambda^{2} } \\ &= \frac{ (x_{n} - \lambda \, x_{n + 1} )^{2} }{ 1 - \lambda^{2} } + x_{n + 1}^{2}. \end{align*} \] Consequently \[ \begin{align*} & \text{normal}(x_{n} \mid 0, 1) \, \text{normal}(x_{n + 1} \mid \lambda \, x_{n}, \sqrt{1 - \lambda^{2}} ) \\ &\hspace{30mm}= \quad \frac{1}{\sqrt{2 \pi}} \, \frac{1}{\sqrt{2 \pi (1 - \lambda^{2}) }} \\ &\hspace{30mm} \quad\;\; \cdot \exp \left[ - \frac{1}{2} \left( x_{n}^{2} + \frac{ (x_{n + 1} - \lambda \, x_{n})^{2} }{ 1 - \lambda^{2} } \right) \right] \\ &\hspace{30mm}= \quad \frac{1}{\sqrt{2 \pi}} \, \frac{1}{\sqrt{2 \pi (1 - \lambda^{2}) }} \\ &\hspace{30mm} \quad\;\; \cdot \exp \left[ - \frac{1}{2} \left( \frac{ (x_{n} - \lambda \, x_{n + 1} )^{2} } { 1 - \lambda^{2} } + x_{n + 1}^{2} \right) \right] \\ &\hspace{30mm}= \quad \frac{1}{\sqrt{2 \pi (1 - \lambda^{2}) }} \exp \left[ - \frac{1}{2} \frac{ (x_{n} - \lambda \, x_{n + 1} )^{2} } { 1 - \lambda^{2} } \right] \\ &\hspace{30mm} \quad\;\; \cdot \frac{1}{\sqrt{2 \pi}} \, \exp \left[ - \frac{1}{2} x_{n + 1}^{2} \right] \\ &\hspace{30mm}= \quad \text{normal}(x_{n} \mid \lambda \, x_{n + 1}, \sqrt{1 - \lambda^{2}}) \, \text{normal}(x_{n + 1} \mid 0, 1). \end{align*} \]

At this point the integration over \(x_{n}\) is straightforward, \[ \begin{align*} I &= \int_{-\infty}^{\infty} \mathrm{d} x_{n} \, \text{normal}(x_{n} \mid 0, 1) \, \tau(x_{n + 1} \mid x_{n}) \\ &= \int_{-\infty}^{\infty} \mathrm{d} x_{n} \, \text{normal}(x_{n} \mid 0, 1) \, \text{normal} \! \left( x_{n + 1} \mid \lambda \, x_{n} , \sqrt{1 - \lambda^{2}} \right) \\ &= \int_{-\infty}^{\infty} \mathrm{d} x_{n} \, \text{normal}(x_{n} \mid \lambda \, x_{n + 1}, \sqrt{1 - \lambda^{2}}) \, \text{normal}(x_{n + 1} \mid 0, 1) \\ &= \text{normal}(x_{n + 1} \mid 0, 1) \int_{-\infty}^{\infty} \mathrm{d} x_{n} \, \text{normal}(x_{n} \mid \lambda \, x_{n + 1}, \sqrt{1 - \lambda^{2}}) \\ &= \text{normal}(x_{n + 1} \mid 0, 1), \end{align*} \] which confirms that \(\text{normal}(x \mid 0, 1)\) does indeed specify a stationary distribution for this Markov transition.
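The same kind of numerical spot check applies here; a minimal sketch, again with \(\lambda = 0.5\) and arbitrary test points, where each difference should vanish up to quadrature error.

lambda <- 0.5

# Left-hand side of the stationarity condition at a given x_{n + 1}
stationary_lhs <- function(y) {
  integrate(function(x)
              dnorm(x, 0, 1) * dnorm(y, lambda * x, sqrt(1 - lambda^2)),
            -Inf, Inf)$value
}

# Compare to the conjectured stationary density at each test point
ys <- c(-1, 0.5, 2)
sapply(ys, stationary_lhs) - dnorm(ys, 0, 1)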

7 Acknowledgements

I thank Sam Power, Sam Livingstone, and Finn Lindgren for critical insights.

References

[1] Betancourt, M. (2021). A short review of ergodicity and convergence of Markov chain Monte Carlo estimators.

[2] Roberts, G. O. and Rosenthal, J. S. (2004). General state space Markov chains and MCMC algorithms. Probability Surveys 1 20–71.

[3] Geyer, C. J. (1992). Practical Markov chain Monte Carlo. Statistical Science 7 473–483.

[4] Meyn, S. P. and Tweedie, R. L. (2009). Markov chains and stochastic stability. Cambridge University Press.

License

The code in this case study is copyrighted by Michael Betancourt and licensed under the new BSD (3-clause) license:

https://opensource.org/licenses/BSD-3-Clause

The text and figures in this case study are copyrighted by Michael Betancourt and licensed under the CC BY-NC 4.0 license:

https://creativecommons.org/licenses/by-nc/4.0/

Original Computing Environment

sessionInfo()
R version 4.0.2 (2020-06-22)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Catalina 10.15.7

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRblas.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
 [1] compiler_4.0.2  magrittr_1.5    tools_4.0.2     htmltools_0.4.0
 [5] yaml_2.2.1      Rcpp_1.0.4.6    stringi_1.4.6   rmarkdown_2.2  
 [9] knitr_1.29      stringr_1.4.0   xfun_0.16       digest_0.6.25  
[13] rlang_0.4.6     evaluate_0.14