Under ideal conditions only a small neighborhood of model configurations will be consistent with both the observed data and the domain expertise we encode in our prior model, resulting in a posterior distribution that strongly concentrates along each parameter. This not only yields precise inferences but also facilitates accurate estimation of those inferences. Under more realistic conditions, however, our measurements and domain expertise can be much less informative, allowing our posterior distribution to stretch across more expansive, complex neighborhoods of the model configuration space. These intricate uncertainties complicate not only the utility of our inferences but also our ability to quantify those inferences computationally.

In this case study we will explore the theoretical concept of identifiability and its more geometric counterpart, degeneracy, which better generalizes to applied statistical practice. We will also discuss principled ways to identify and then compensate for degenerate inferences before demonstrating these strategies in a series of pedagogical examples.

1 Identifiability

An observational model \(\pi_{\mathcal{S}}(y ; \theta)\) with observations \(y \in Y\) and model configurations \(\theta \in \Theta\) is identified when all of the data generating processes are unique [1]. In other words any two distinct model configurations \(\theta_{1} \ne \theta_{2}\) specify different probability distributions over the observational space, or equivalently different probability density functions, \[ \pi_{\mathcal{S}}(y ; \theta_{1}) \ne \pi_{\mathcal{S}}(y ; \theta_{2}). \]

For example consider the observational model specified by normal probability density functions whose location parameter is the sum of two model parameters, \[ \pi_{\mathcal{S}}(y ; \alpha, \beta, \sigma) = \text{normal}(y ; \alpha + \beta, \sigma). \] In this case the infinite number of model configurations that fall on the line \(\alpha + \beta = \mu\) for any real number \(\mu \in \mathbb{R}\) all specify exactly the same data generating process.
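This nonidentifiability is easy to verify numerically. The following sketch, with purely illustrative numbers, evaluates the normal log density of a single observation at two distinct model configurations that lie on the same line \(\alpha + \beta = 2\):

```python
import numpy as np

# Log density of the additive-location model, normal(y; alpha + beta, sigma).
def log_density(y, alpha, beta, sigma=1.0):
    return (-0.5 * ((y - (alpha + beta)) / sigma) ** 2
            - np.log(sigma) - 0.5 * np.log(2.0 * np.pi))

y = 1.5

# Two distinct model configurations on the line alpha + beta = 2...
lp1 = log_density(y, alpha=0.5, beta=1.5)
lp2 = log_density(y, alpha=-3.0, beta=5.0)

# ...specify exactly the same data generating process.
print(np.isclose(lp1, lp2))  # True
```

No observation, however large, can distinguish between configurations along this line.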

Formal identifiability, however, does not always imply such extreme degeneracy. The sets of equivalent model configurations, for example, might be compact, or bounded within a finite volume, unlike the lines that stretch all the way to infinity in the previous example. At the same time the sets of equivalent model configurations might contain only a finite number of elements, indicating that only a few model configurations specify exactly the same data generating process, with the bulk of the observational model identified.

That said, just because the data generating processes in the observational model are different doesn't mean that the realized likelihood functions or realized posterior density functions from this model will be different as well. Identifiability offers no guarantee that the differences between the data generating processes will be large enough to distinguish them after only a finite number of observations. An infinite number of observations, however, should be enough to resolve even minuscule differences between the data generating processes.

To be more precise let's define a product observational model that results from repeated, independent measurements, \(\{ y_{1}, \ldots, y_{N} \} \in Y^{N}\), \[ \pi_{\mathcal{S}_{N}}(y_{1}, \ldots, y_{N}; \theta) = \prod_{n = 1}^{N} \pi_{\mathcal{S}}(y_{n} ; \theta). \] As \(N\) approaches infinity the inferences informed by this product observational model will be well-behaved only if the component observational model is identified. The exact meaning of well-behaved depends on the nature of our inferences, in particular whether they are derived from a frequentist or a Bayesian perspective.

1.1 Frequentist Asymptotics

While there are many frequentist estimators, here we will consider only the maximum likelihood estimator.

Recall that evaluating the product observational probability density function on a realized observation, \(\{ \tilde{y}_{1}, \ldots, \tilde{y}_{N} \}\), defines a likelihood function from the model configuration space into the real numbers, \[ \mathcal{L}_{\tilde{y}_{1}, \ldots, \tilde{y}_{N}}(\theta) = \pi_{\mathcal{S}_{N}}(\tilde{y}_{1}, \ldots, \tilde{y}_{N}; \theta). \] The higher the output likelihood relative to the outputs from other model configurations the more consistent that model configuration is with the realized observation relative to the other model configurations. In particular the model configuration with the highest output likelihood is the most consistent relative to the others.

This model configuration with the largest likelihood output is exactly the definition of the maximum likelihood estimator, \[ \hat{\theta}_{\text{ML}}(\tilde{y}_{1}, \ldots, \tilde{y}_{N}) = \underset{\theta \in \Theta}{\text{argmax}} \mathcal{L}_{\tilde{y}_{1}, \ldots, \tilde{y}_{N}}(\theta). \] Ultimately the utility of this estimator is determined by how close the data generating process specified by the maximum likelihood estimator, \[ \pi_{\mathcal{S}}(y; \hat{\theta}_{\text{ML}}(\tilde{y}_{1}, \ldots, \tilde{y}_{N})) \] is to the true data generating process.
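As a minimal sketch, with illustrative numbers and an assumed \(\mathcal{M}\)-closed setup, we can compute the maximum likelihood estimator for a normal location model by maximizing the product log likelihood over a dense grid of model configurations:

```python
import numpy as np

rng = np.random.default_rng(8675309)

# True model configuration in an M-closed setting (illustrative values).
theta_star = 2.0
N = 1000
y = rng.normal(theta_star, 1.0, size=N)

# Log likelihood of the product observational model: a sum of component
# log densities, here normal(y_n; theta, 1) up to an additive constant.
def log_likelihood(theta, y):
    return np.sum(-0.5 * (y - theta) ** 2)

# Maximize over a dense grid of model configurations.
thetas = np.linspace(0.0, 4.0, 4001)
theta_ml = thetas[np.argmax([log_likelihood(t, y) for t in thetas])]

# For this model the maximum likelihood estimator coincides with the
# sample mean, and for large N it lands close to theta_star.
```

With this identified model the estimated configuration recovers the true location up to the expected \(1 / \sqrt{N}\) sampling variation.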

Identifiability is a critical requirement for the utility of the maximum likelihood estimator even under ideal circumstances [3]. More formally let's assume an \(\mathcal{M}\)-closed context such that the observed data are drawn from a data generating process within the observational model, \[ y_{n} \sim \pi(y_{n} ; \theta^{*}), \, \theta^{*} \in \Theta. \] In the asymptotic \(N \rightarrow \infty\) limit the maximum likelihood estimator will converge to the true model configuration for almost all observations only if the component observational model is identified. In other words if our observational model is identified then we will almost always be able to exactly recover the truth given enough data, at least within the context of that model.

Subject to some regularity conditions -- such as the true model configuration being in the interior of the model configuration space and not on a boundary, the likelihood functions being sufficiently smooth, etc -- the asymptotic consistency of the maximum likelihood estimator also implies that the likelihood function realized from almost every product observation itself converges to a delta function around the true model configuration. With some even stronger regularity conditions asymptotic consistency implies that the likelihood functions contract to this delta function uniformly in all directions. Consequently given these regularity conditions identifiability implies that the geometry of the likelihood function is well-defined asymptotically.

Critically these consequences of identifiability are well-defined only in the \(\mathcal{M}\)-closed context. Even when an observational model is identified there are no guarantees regarding how the likelihood functions realized from external observations, that is observations generated from data generating processes outside of the observational model, might behave in the asymptotic limit.

Although confirmation of identifiability has limited utility in practice, where we have neither infinitely large observations nor an \(\mathcal{M}\)-closed context, the failure of frequentist identifiability can be a useful diagnostic tool. Observational models that don't work well in the idealized \(\mathcal{M}\)-closed, infinite limit are unlikely to provide well-behaved inferences in \(\mathcal{M}\)-open, finite data circumstances either. Consequently while frequentist identifiability is not a sufficient condition for useful models it is typically a necessary one, and checking it can help us to find pathological model behavior.

If our observational model depends on auxiliary data, for example geographical locations where measurements are being performed, concepts like identifiability become a bit more complicated. More formally let's define the data, \(y\), that results from the measurement as a variate and the data, \(x\), that defines the context of a measurement as a covariate. Our observational model then takes the conditional form \[ \pi_{\mathcal{S}}(y | x ; \theta). \] We can then define identifiability conditionally on a given covariate value, or marginally given an explicit model for the covariates, \(\pi_{S}(x)\).

The asymptotic limit, and the consequences of these different forms of identifiability, also become more complicated in the presence of covariates. For example we can consider the asymptotic limit given a fixed covariate value, in which case identifiability conditioned on that value is necessary for good asymptotic behavior. We get a much more informative picture, however, when we consider the asymptotic limit with the covariate values spanning the entire range of possible covariate values, giving us sensitivity to possible pathologies near the nominal values. In the statistics literature this limit is known as infill asymptotics.

1.2 Bayesian Asymptotics

The asymptotic behavior of the maximum likelihood estimator and the likelihood function itself also has a natural Bayesian equivalent. If the observational model is identified and the realized likelihood functions all concentrate around the true model configuration then the realized posterior density functions should as well, at least provided that the prior density functions don't obstruct the convergence. This intuition is formalized in the Bernstein-von Mises theorem [3], which defines the conditions under which realized posterior point estimates, such as the posterior mean, median, and mode, converge to the true model configuration while the realized posterior density functions contract around the true model configuration.
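A minimal illustration of this posterior contraction uses the conjugate normal model, where the posterior standard deviation is available in closed form; all of the numbers here are illustrative only.

```python
import numpy as np

# Conjugate normal model: y_n ~ normal(mu, 1) with prior mu ~ normal(0, 1).
# The posterior standard deviation has the closed form
#   sd_post = 1 / sqrt(1 / prior_sd^2 + N / obs_sd^2),
# so the posterior contracts around the truth at the usual 1 / sqrt(N) rate.
def posterior_sd(N, prior_sd=1.0, obs_sd=1.0):
    return 1.0 / np.sqrt(1.0 / prior_sd**2 + N / obs_sd**2)

sds = [posterior_sd(N) for N in (10, 100, 1000, 10000)]
# Each factor-of-100 increase in N shrinks the posterior sd by about 10.
```

For this identified model the prior's influence washes out and the posterior collapses towards the true location, exactly as the Bernstein-von Mises theorem suggests.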

As with frequentist asymptotics, the practical utility of formal Bayesian asymptotics is diagnosing any particularly nasty pathologies in the observational model or the prior model that persist to the asymptotic limit but are likely to corrupt preasymptotic inferences as well.

2 Degeneracy

We will never have perfect models or infinitely many repeated observations. What really matters in practice is the preasymptotic behavior of our inferences, especially when the model only approximates the true data generating process. While the specific results considered above hold only in the asymptotic limit, the geometric considerations are far more general and are perfectly well-defined not only preasymptotically but also in the \(\mathcal{M}\)-open context. In particular we can always investigate how uniformly a realized likelihood function or posterior density function concentrates around any single point in the model configuration space.

Following that intuition let's define degeneracy as a qualitative description of how strongly and how uniformly a realized likelihood function or posterior density function concentrates around a single point in the model configuration space. Our inferences are strongly non-degenerate, or weakly degenerate, if they appear to concentrate around a single point rather than some extended, degenerate surface.




The stronger and more uniformly a realized likelihood function or posterior density function concentrates around a single point the more informed inferences about the given parameters will be. In particular degeneracy is not a binary qualification like asymptotic consistency but rather a continuous one. Exactly how degenerate our inferences can be before they become problematic depends on the details of a given analysis.

Note that nonidentified models will, by definition, be degenerate both asymptotically and preasymptotically. Consequently degeneracy expands upon the more formal notion of identifiability.

2.1 Consequences of Degeneracy

A realized likelihood function or posterior density function becomes problematic when the resulting inferences don't have enough information to achieve the inferential goals or, perhaps more critically, when their geometry frustrates accurate computation of those inferences in the first place.

Whether or not we can expect to accumulate enough information to achieve our inferential goals from typical measurement outcomes is a question of experimental design. Degenerate inferences suggest a mismatch between the experimental design and those goals -- either our design doesn't probe the desired phenomena well enough or our goals are simply too ambitious for the available experimental technology.

Regardless of our particular inferential goals, degeneracy can have critical computational consequences. The more degenerate a realized likelihood function or posterior density function, the more expensive it will be to algorithmically quantify the full extent of the model configurations consistent with the observed data, if it is possible at all. For example deterministic approximations based on point estimates, such as maximum likelihood confidence intervals and Laplace approximations, become less accurate the less uniformly the likelihood function concentrates. At the same time stochastic approximations like Markov chain Monte Carlo will typically be more expensive, rendering them less accurate for a fixed computational cost.

If degeneracy prevents us from being able to fully quantify a realized likelihood function or posterior density function, and hence the inferences they encode, then we won't be able to accurately judge the utility of those inferences at all. In practice we need our inferences to be sufficiently well-identified so that our computational tools have a reasonable opportunity to succeed, and we have a reasonable opportunity to understand how our model responds to a given observation.

2.2 Subtleties of Degeneracy

Although this geometric perspective of degeneracy is motivated by ideal asymptotic behavior, there are some subtle but critically important differences between the asymptotic and preasymptotic perspectives.

2.2.1 A Million Ways To Be Cruel

While the realized likelihood functions and posterior density functions from non-degenerate models all look similar, the realizations from less well-behaved models can each be degenerate in their own unique way. In particular the exact consequences of strong degeneracies will depend on the shape and orientation of the specific surface on which the realizations appear to concentrate. I will refer to the particular geometry of a degenerate likelihood function or posterior density function as a degeneracy.

Perhaps the most important classification of these geometries is between continuous degeneracies and discontinuous degeneracies. A continuous degeneracy is characterized by model configurations consistent with the observed data organizing into a single connected, fuzzy surface.




We can further classify continuous degeneracies that concentrate within any finite volume, such as the right two degeneracies above, as compact, and those that extend all the way to infinity in one or more directions, such as the left degeneracy above, as non-compact.

On the other hand discontinuous degeneracies are characterized by the model configurations consistent with the observed data clustering into modes surrounded by regions of inconsistent model configurations in all directions.




A realized likelihood function or posterior density function can suffer from one or both of these degeneracies at the same time. While it might decompose into a number of discontinuous degeneracies where each individual mode is itself locally non-degenerate, as above, a particularly pathological realization might consist of a collection of continuously degenerate modes.




2.2.2 The Most Capricious of Insects

Asymptotic non-degeneracy requires the collapse of the likelihood functions realized from almost every infinitely large observation. In particular it does not depend on the details of any particular observation, and hence it characterizes the entire observational model.

The shape of the likelihood functions and posterior density functions realized from finite observations, however, will in general depend on those particular details.




Consequently preasymptotic degeneracies will also depend on the particular observation considered. The realized likelihood functions and posterior density functions might be only weakly degenerate for some particularly informative observations, but manifest strong degeneracies for others. Indeed the more complex the measurement process and higher dimensional the observational space, the more opportunities there will be for fluctuations to conspire to obscure the latent phenomena of interest.

In other words we can't talk about an entire model being preasymptotically non-degenerate unless we can verify that every realization of a likelihood function or posterior density function is strongly and uniformly concentrated. At the same time we can't talk about the robustness of a model, and how prone it might be to contorting itself to accommodate misfit, unless we can quantify the behavior of realizations from data generated outside of the model. Such inferential calibrations require either careful theoretical work or expansive simulation studies.

2.2.3 Parameterization Dependence

Because the particular geometry of a realized likelihood function or posterior density function depends on the chosen parameterization of the model configuration space, so too does our definition of degeneracy. A realization that appears degenerate in one parameterization might appear much less degenerate in another!




This parameterization dependence, however, also provides an opportunity for moderating degeneracy problems. Weak degeneracy implies that every parameter is strongly informed independently of the others. If some parameters are strongly coupled then we might be able to find a reparameterization that transforms them into parameters that are less coupled, and hence better identified. At the same time if some parameters are particularly weakly-informed then we might be able to remove them from the model entirely.

3 Managing Degeneracies

Degeneracy is ultimately a question of information, in particular how much information we have about the chosen parameters that coordinate the model configuration space. If a realized likelihood function or posterior density function is degenerate then we need to incorporate more information into our inferences, change the parameters that we're trying to identify, or weaken our inferential goals so that what little information we have is sufficient.

3.1 More Information

The most direct way to reduce degenerate behavior in our inferences is to add more information. In turn the most direct way to add more information is to repeat the measurement process multiple times and aggregate the resulting observations together into one larger, more informative observation. In practice, however, repeating the measurement process exactly is often prohibitively expensive if not outright impossible. Moreover there is no guarantee that some degeneracies won't persist through repeated measurements.

A powerful alternative to repeating the initial measurement process is to complement it with additional measurement processes that are sensitive to only some phenomenological, environmental, or experimental parameters. This complementary data can then break degeneracies in the likelihood function, allowing the original data to inform the rest of the parameters.

Consider, for example, a measurement of some phenomenon, quantified by the parameter \(\lambda_{S}\), that is corrupted by some irreducible background, quantified with the parameter \(\lambda_{B}\). No matter how much data we collect we will not be able to differentiate between the desired phenomenon and the background, leading to degenerate likelihood functions.




Incorporating data from additional measurements where the phenomenon has been removed, however, informs only the background contribution, allowing the excess observed in the original data to inform the latent phenomenon.
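A sketch of this auxiliary-measurement strategy, using Poisson counting models and purely illustrative counts: the nominal likelihood is constant along \(\lambda_{S} + \lambda_{B} = \text{const}\), while the joint likelihood that includes a background-only measurement is not.

```python
import numpy as np
from math import lgamma

# Poisson log probability mass.
def pois_lp(k, lam):
    return k * np.log(lam) - lam - lgamma(k + 1)

y = 12   # counts from the nominal measurement, Poisson(lambda_S + lambda_B)
z = 7    # counts from a background-only measurement, Poisson(lambda_B)

def loglik_nominal(lam_S, lam_B):
    return pois_lp(y, lam_S + lam_B)

def loglik_joint(lam_S, lam_B):
    return loglik_nominal(lam_S, lam_B) + pois_lp(z, lam_B)

# The nominal likelihood is constant along lambda_S + lambda_B = 12...
a = loglik_nominal(2.0, 10.0)
b = loglik_nominal(9.0, 3.0)

# ...but the background-only measurement breaks the ridge.
c = loglik_joint(2.0, 10.0)
d = loglik_joint(9.0, 3.0)
```

Equal values of `a` and `b` exhibit the ridge; the differing values of `c` and `d` show how the auxiliary data isolates the background contribution.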




Bayesian analyses also benefit from the opportunity to add information through the prior model. For example principled domain expertise can suppress extreme model configurations and contain the posterior density function within a reasonably small neighborhood even when the realized likelihood function is degenerate.
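We can see this containment in a grid-based sketch of the additive-location model from Section 1, with illustrative numbers: a single observation informs only \(\alpha + \beta\), yet standard normal priors on each parameter yield a proper posterior with finite marginal variances.

```python
import numpy as np

# One observation y = 0 from normal(alpha + beta, 1): the likelihood is
# degenerate, constant along alpha + beta = const. Standard normal priors
# on alpha and beta nevertheless contain the posterior; analytically the
# posterior precision matrix is [[2, 1], [1, 2]], so Var(alpha) = 2 / 3.
grid = np.linspace(-8.0, 8.0, 801)
A, B = np.meshgrid(grid, grid, indexing="ij")

log_post = -0.5 * (A + B) ** 2 - 0.5 * A**2 - 0.5 * B**2
post = np.exp(log_post - log_post.max())
post /= post.sum()

# Marginal posterior for alpha and its variance from the grid.
marg_alpha = post.sum(axis=1)
var_alpha = np.sum(marg_alpha * grid**2) - np.sum(marg_alpha * grid) ** 2
# var_alpha is close to the analytic value 2 / 3.
```

Without the priors the posterior would stretch without bound along the degenerate direction; with them every marginal is finite and well-behaved.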




At the same time targeted domain expertise can complement the nominal measurement process in the same way as auxiliary measurements, pinning down a few parameters and preventing degeneracies in the realized likelihood function from propagating to the posterior density function.




Of course we have to be responsible and add only information that is actually compatible with our domain expertise. Forcing less principled information into the analysis simply because it resolves an inconvenient degeneracy is a highway to the poor generalizability zone.

3.2 Reparameterizations

The parameters in the initial parameterization being poorly informed doesn't mean that parameters in other parameterizations suffer the same fate. In some instances we can reparameterize the model configuration space, replacing poorly-identified parameters with new parameters that are better suited to the measurement process. Sometimes we can find alternative parameterizations where all of the parameters are strongly informed, and even if we can't we might be able to find parameterizations that isolate degeneracies into just a few parameters which are then easier to manage.

For example consider the observational model that depends on the parameters only through a function \(f : \Theta \rightarrow \mathbb{R}\), \[ \pi_{S}(y; f(\theta_{1}, \ldots, \theta_{N})). \] If the function \(f\) is many-to-one then the observational model will be nonidentified and no amount of data will be able to inform the parameters \(\{\theta_{1}, \ldots, \theta_{N} \}\) individually. On the other hand if we can find a new parameterization with a parameter \[ \phi_{1} = f(\theta_{1}, \ldots, \theta_{N}) \] then the data will be able to inform \(\phi_{1}\) and we can resort to the prior model to inform the remaining \(\{ \phi_{2}, \ldots, \phi_{N} \}\).
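As a concrete sketch, take the hypothetical many-to-one function \(f(\theta_{1}, \theta_{2}) = \theta_{1} \cdot \theta_{2}\); after reparameterizing with \(\phi_{1} = \theta_{1} \cdot \theta_{2}\) and \(\phi_{2} = \theta_{1}\) the likelihood depends on \(\phi_{1}\) alone.

```python
import numpy as np

# Observational model that depends on two parameters only through their
# product: y ~ normal(theta1 * theta2, 1). The map f is many-to-one, so
# theta1 and theta2 are not individually identified.
def loglik_theta(y, t1, t2):
    return -0.5 * (y - t1 * t2) ** 2

# Reparameterize with phi1 = theta1 * theta2 and phi2 = theta1; the data
# inform phi1 while phi2 is left entirely to the prior model.
def loglik_phi(y, phi1, phi2):
    return -0.5 * (y - phi1) ** 2  # no phi2 dependence at all

y = 2.0
# Distinct configurations with the same product are indistinguishable in
# the theta parameterization, and phi2 drops out entirely in the new one.
```

The degeneracy has not disappeared, but it is now isolated in \(\phi_{2}\), where a prior can do all of the remaining work.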

Reparameterizations are especially important when strong degeneracies limit how well a computational algorithm is able to survey the entirety of a realized likelihood function or posterior density function. Careful reparameterizations can reorganize the model configurations consistent with the data into geometries that better match the capabilities of the algorithm and actually facilitate accurate computation.

Many modeling techniques have natural parameterizations that are optimal in different circumstances. In order to use these techniques as responsibly as possible we need to familiarize ourselves with not only the basic structure of the technique but also the natural parameterizations and how they might interact with observed data to yield degenerate behavior.

3.3 Lowered Expectations

Sometimes the degeneracies of a realized likelihood function or posterior density function are inescapable, communicating that we have neither the data nor the domain expertise to learn enough about our model of the latent data generating process.

If our computational tools are able to accurately quantify the degenerate likelihood function or posterior density function then we might have to just report the weak inferences that result as the honest outcome of the analysis. Careful review of these degeneracies can then determine the failings of the initial experimental design and suggest refined measurements moving forwards.

On the other hand when the degeneracies are so bad that they obstruct the complete quantification of the realized likelihood function or posterior density function we won't be able to report faithful inferences at all. In that case we might have to set aside the analysis until we can equip better algorithms that allow us to better understand our inferences and their limitations.

4 Diagnosing Degeneracies

If we know that a realized likelihood function or posterior density function is degenerate then we can manage the degeneracies with the strategies discussed in Section 3. But how exactly can we work out how degenerate a realization is in practice, especially when our model is sophisticated and the model configuration space is high-dimensional?

4.1 All That We See or Seem

One approach is to utilize computational methods that explore the entirety of the realized likelihood function or posterior density function and examine projections of that exploration onto lower-dimensional surfaces that we can visualize. In particular we can't study degeneracy with methods that return only point estimates.

In Bayesian analyses Markov chain Monte Carlo has especially notable diagnostic power. Effective Markov chains will expand into the most relevant neighborhoods of the model configuration space, probing any potential degenerate behavior in the process. Moreover the individual states of the resulting Markov chains can be projected onto any subspace, allowing for immediate visualization.

A common diagnostic strategy is to consider pairs plots, or scatter plots of the realized Markov chain states projected onto each pair of functions of interest, for example the parameter functions.
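Before committing to a full grid of pairs plots we can cheaply scan the draws for strongly coupled pairs. This sketch uses synthetic draws standing in for Markov chain states, with one linear degeneracy planted by hand; all names and numbers are illustrative.

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(1)

# Hypothetical posterior draws: four parameters, with a strong linear
# degeneracy planted between the first two columns.
n_draws, D = 2000, 4
draws = rng.normal(size=(n_draws, D))
draws[:, 1] = -draws[:, 0] + 0.05 * rng.normal(size=n_draws)

# Scan every pair of parameters and flag those with large empirical
# correlation as candidates for closer inspection in pairs plots.
corr = np.corrcoef(draws, rowvar=False)
flagged = [(i, j) for i, j in combinations(range(D), 2)
           if abs(corr[i, j]) > 0.9]
# flagged singles out the planted (0, 1) degeneracy.
```

A scan like this catches only linear couplings, so it complements rather than replaces visual inspection of the plots themselves.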




Because we are looking for degenerate behavior in the posterior density function that might frustrate algorithmic exploration we want to focus on the posterior density function actually encountered by our chosen algorithm. In tools like Stan that means looking at the parameter functions on the unconstrained model configuration space and not the nominal constrained model configuration space that appears in a Stan program itself. Exploring the unconstrained space is made easier by using the diagnostic_file option in the Stan interfaces which writes the unconstrained state at each iteration of the Markov chain to a text file.

These plots can be useful in identifying degeneracies that manifest in only two parameters, but they are not necessarily revealing of degeneracies that manifest on higher-dimensional surfaces. In this case the scatter plots collapse the projected surface and obstruct the underlying degeneracy.




Only by looking at higher-dimensional scatter plots can we see the true nature of the degeneracy.




To avoid this projection collapse we would have to slice the samples into bands of the remaining parameters. Beyond a few dimensions, however, slicing generates an impractical number of plots, each of which is only sparsely populated and hence conveys little information on its own.

Even without slicing the number of possible pairs plots is the biggest limitation to this approach. Without domain expertise to motivate potential degenerate surfaces we would have to force ourselves through at least \(D \cdot (D - 1) / 2\) scatter plots, where \(D\) is the number of parameters in our model. Even for moderate dimensions this quickly expands to hundreds if not thousands of plots that might not even reveal problematic degeneracies at all.

Exploring high-dimensional spaces through visualizations is hard and it's all the more challenging when we don't know for what we're looking!

4.2 My Engine Makes This Strange Sound

Visualizations of Markov chains are only as good as the ability of those Markov chains to capture the full extent of the posterior distribution. In particular slow or ineffectual Markov chains provide only an incomplete view of the full posterior distribution and hence can lead to misleading pairs plot diagnostics.

Consider, for example, a Markov chain that slowly wanders around its initialization, exploring only a very small part of the posterior distribution that concentrates on a curved surface. In this case the pairs plots would show no signs of serious degeneracy.




If we ran multiple Markov chains then we would visualize only multiple incomplete explorations. This could then easily mislead us into thinking that we're dealing with the wrong kind of degeneracy, for example a discrete one instead of a continuous one.




This behavior is particularly troubling when degenerate behavior in a realized posterior density function prevents Markov chains from effectively exploring the full posterior distribution in the first place. Our diagnostics will fail exactly when we need them!

Fortunately not all Markov chain Monte Carlo algorithms fail so silently. Some algorithms, namely Hamiltonian Monte Carlo, actually complain quite loudly when they struggle with degenerate posterior density functions. At least, that is, if you listen to them. Although these complaints are certainly an annoyance they are often our only hint at the underlying pathologies.

Consequently our most powerful diagnostic strategy is a postmortem investigation where we carefully scrutinize how a Markov chain Monte Carlo algorithm struggles or outright fails in order to track down the underlying pathologies and motivate hypotheses for their origins. The louder our algorithm, the more information we will have to guide us to degenerate behavior.

Exactly how degenerate a posterior density function has to be before a Markov chain Monte Carlo algorithm starts to struggle, or breaks outright, depends on the detailed interactions between the algorithm, the configuration of that algorithm, and the posterior density function. Some algorithms are fragile, breaking under even small degeneracies, while others are more robust, bending only slowly as the degeneracies become stronger. In order to maximize our diagnostic power we need to be familiar with how a given algorithm responds to different degeneracies.

For example by comparing multiple Markov chains initialized at different places in the model configuration space, perhaps with summary statistics like the potential scale reduction factor \(\hat{R}\), we might be able to detect discrete degeneracies. The more Markov chains we run the better we will be able to resolve the full extent of the degeneracy, even if each individual Markov chain explores only a local mode.
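A minimal sketch of the classic (non-split, non-rank-normalized) \(\hat{R}\) makes the mechanism concrete; the two synthetic "chains" below, stuck at hypothetical modes, are illustrative only.

```python
import numpy as np

# Potential scale reduction factor from per-chain means and variances;
# a minimal sketch of the classic non-split R-hat.
def rhat(chains):
    chains = np.asarray(chains)           # shape (n_chains, n_iterations)
    M, N = chains.shape
    W = chains.var(axis=1, ddof=1).mean()      # within-chain variance
    B = N * chains.mean(axis=1).var(ddof=1)    # between-chain variance
    var_hat = (N - 1) / N * W + B / N
    return np.sqrt(var_hat / W)

rng = np.random.default_rng(2)

# Chains stuck in two separate modes betray the discrete degeneracy...
stuck = np.stack([rng.normal(0.0, 1.0, 1000), rng.normal(10.0, 1.0, 1000)])
# ...while chains exploring the same distribution do not.
mixed = np.stack([rng.normal(0.0, 1.0, 1000), rng.normal(0.0, 1.0, 1000)])

# rhat(stuck) is far above 1; rhat(mixed) is close to 1.
```

Production tools use refined split and rank-normalized variants of this statistic, but the diagnostic logic is the same: disagreement between chains inflates the between-chain variance and hence \(\hat{R}\).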

4.3 Sonic Boom

Hamiltonian Monte Carlo has a reputation for being fast, but that is only one part of its success. Yes, for non-degenerate posterior density functions Hamiltonian Monte Carlo rapidly explores the posterior distribution even in high-dimensional model configuration spaces. Yes, Hamiltonian Monte Carlo can manage moderately degenerate posterior density functions by consuming more computational resources. What makes Hamiltonian Monte Carlo so powerful in practice, however, is that when it struggles or outright fails to fully quantify a degenerate posterior density function it does so loudly.

Hamiltonian Monte Carlo [4] exploits Hamiltonian dynamics which generate trajectories that rapidly and coherently explore the posterior distribution. In practice we can't generate these trajectories exactly so instead we approximate them with discrete, numerical trajectories. The smaller the discretization time, or step size, the more accurate but also more computationally expensive these numerical trajectories will be.
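The trade-off between step size, cost, and accuracy, as well as the instability behind divergences, can be seen in a self-contained sketch of the leapfrog integrator on a standard normal target; the step sizes and step counts are illustrative only.

```python
import numpy as np

# Hamiltonian for a standard normal target: H(q, p) = q^2 / 2 + p^2 / 2.
def hamiltonian(q, p):
    return 0.5 * q**2 + 0.5 * p**2

# Leapfrog integration, tracking the worst energy error along the
# numerical trajectory (for this target -dV/dq = -q).
def max_energy_error(q, p, step_size, n_steps):
    H0 = hamiltonian(q, p)
    err = 0.0
    for _ in range(n_steps):
        p -= 0.5 * step_size * q   # half step in momentum
        q += step_size * p         # full step in position
        p -= 0.5 * step_size * q   # half step in momentum
        err = max(err, abs(hamiltonian(q, p) - H0))
    return err

# Same total integration time, coarser versus finer step sizes: the
# finer trajectory is more expensive but conserves energy far better.
err_coarse = max_energy_error(1.0, 1.0, 0.5, 20)
err_fine = max_energy_error(1.0, 1.0, 0.05, 200)

# Past the integrator's stability limit the trajectory diverges, with
# the energy error exploding -- the origin of divergent transitions.
err_div = max_energy_error(1.0, 1.0, 2.1, 20)
```

For this quadratic target the leapfrog integrator is stable only for step sizes below 2; the unstable run with step size 2.1 is exactly the kind of divergent trajectory discussed below.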




Any residual error introduced by these numerical trajectories is corrected by introducing a sampling procedure that preferentially returns points along the trajectory with smaller error. Under ideal conditions the performance of this sampling procedure can also be used to automatically tune the step size to achieve near optimal performance in tools like Stan [5]. When Stan encounters complex geometries the step size will automatically be decreased so that the numerical trajectories can resolve all of the fine details of the realized posterior density function as much as possible.

Unfortunately there is a limit to this adaptive feedback. When the posterior density function changes too quickly the numerical integration can become unstable, resulting in numerical trajectories that rocket towards infinity instead of exploring the delicate features. On the other hand these divergent trajectories also provide a clear indication of when Hamiltonian Monte Carlo is unable to fully resolve the details of the target posterior distribution.




Markov chain iterations sampled from divergent trajectories are marked as divergent iterations in Stan. Because the states in divergent trajectories are spatially correlated with the neighborhoods of high curvature that induce the instabilities, we can use these divergent iterations to approximately locate those neighborhoods. We can then verify any suspicious behavior by rerunning with a less aggressive, more refined step size adaptation that should result in a smaller step size and more accurate trajectories that can better penetrate the problematic neighborhood.
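
In rstan this refined adaptation is requested through the `control` argument; `adapt_delta` sets the target acceptance statistic, and pushing it towards one forces a smaller adapted step size. As a sketch, with `my_model.stan` and `input_data` placeholder names for an existing Stan program and data list,

```r
# Increase the adaptation target from its default of 0.8 towards 1
# to force a smaller adapted step size and hence more accurate
# numerical trajectories
fit <- stan(file='my_model.stan', data=input_data,
            seed=4938483, refresh=0,
            control=list(adapt_delta=0.99))
```

If divergences persist even as `adapt_delta` approaches one then they are unlikely to be false positives, and the underlying geometry will have to be addressed with a modified model or parameterization.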




The details of the numerical Hamiltonian dynamics that drive Hamiltonian Monte Carlo serve as a powerful probe of degenerate behavior. For example the less uniformly the posterior density function concentrates, the smaller the step size will have to be to resolve the directions of stronger concentration and the longer the overall trajectories will have to be to fully explore the directions of weaker concentration.




In other words the more continuously degenerate the posterior density function the more integrator steps, or leapfrog steps, we will need to simulate the latent Hamiltonian dynamics. The number of leapfrog steps needed for each iteration can then help to identify problematic geometries.
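
Given a fitted stanfit object, here denoted `fit`, the number of leapfrog steps in each iteration can be extracted from the unpermuted sampler parameters, for example:

```r
# Extract per-iteration sampler diagnostics, one matrix per chain
sampler_params <- get_sampler_params(fit, inc_warmup=FALSE)

# Pool the chains and summarize the number of leapfrog steps
# needed in each iteration
n_leapfrog <- do.call(rbind, sampler_params)[, 'n_leapfrog__']
summary(n_leapfrog)
```

Large numbers of leapfrog steps, especially numbers that saturate the maximum trajectory length, suggest that the sampler is fighting a continuously degenerate geometry.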

Particularly extreme degeneracies often "pinch" at their boundaries, manifesting in neighborhoods where the curvature rapidly increases and renders numerical trajectories unstable. Consequently the appearance and structure of divergent iterations provide a sensitive diagnostic of these problematic boundaries. Because divergent iterations are sampled from divergent trajectories they will concentrate towards the problematic neighborhoods, literally pointing the way to the pathology. Overlaying divergent iterations and non-divergent iterations in a pairs plot is particularly informative, especially when the functions being compared are well-chosen.
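
One way to construct such a plot, sketched here for a stanfit object `fit` with hypothetical parameters `a` and `b`, is to partition the iterations by the `divergent__` sampler parameter and overlay the two groups:

```r
# Flag the divergent iterations across all chains
sampler_params <- get_sampler_params(fit, inc_warmup=FALSE)
divergent <- do.call(rbind, sampler_params)[, 'divergent__'] == 1

# as.data.frame preserves the chain ordering of the sampler parameters,
# unlike extract which permutes the iterations
samples <- as.data.frame(fit)

# Overlay divergent iterations (green) on non-divergent iterations (red)
plot(samples$a[!divergent], samples$b[!divergent],
     col="#8F272780", pch=16, cex=0.8, xlab="a", ylab="b")
points(samples$a[divergent], samples$b[divergent],
       col="#00FF0080", pch=16, cex=0.8)
```

If the green points cluster in one region of the pairs plot instead of scattering uniformly then that region deserves closer scrutiny.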

Sometimes the geometry of the posterior density function is so adversarial that even exact Hamiltonian trajectories don't go very far and Hamiltonian Monte Carlo devolves into incoherent, diffusive exploration. Although this is a relatively rare pathology it can be detected with another diagnostic bespoke to Hamiltonian Monte Carlo -- the energy fraction of missing information, or E-FMI [6]. Because these pathologies are somewhat rare, and often buried under other, more pressing pathologies, we don't yet have a full understanding of the offending degeneracies and hence how to resolve them.
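
Although `util$check_all_diagnostics` reports the E-FMI for us, it can also be computed directly from the `energy__` sampler parameter of a stanfit object, again denoted `fit` here. The statistic compares the variation between successive energies to the total variation of the energies within each chain:

```r
# Compute the energy fraction of missing information for each chain;
# values below roughly 0.2 are conventionally taken to indicate that
# the momenta resampling is failing to explore the energy distribution
sampler_params <- get_sampler_params(fit, inc_warmup=FALSE)
e_fmi <- sapply(sampler_params, function(sp) {
  e <- sp[, 'energy__']
  sum(diff(e)^2) / sum((e - mean(e))^2)
})
```

The 0.2 threshold is an empirical convention rather than a theoretical guarantee, so small E-FMI values are best treated as invitations to investigate further.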

5 Don't Do Me Like That

Because the diversity of degenerate behavior is so expansive, identifying and investigating it, not to mention reacting effectively, can be an overwhelming challenge. To reduce the difficulty we need to build experience, and experience building is facilitated by compartmentalizing the possibilities. Limiting ourselves to a single degeneracy at a time presents a much more manageable lesson, especially when those degeneracies are typical of a new modeling technique that we're learning.

In case studies where I introduce new modeling techniques I take the opportunity to also discuss the degeneracies to which the technique might be vulnerable and the bespoke ways of diagnosing and resolving the underlying pathologies. Here we will consider some much less ambitious examples that demonstrate relatively simple degeneracies and their consequences.

Although these examples are artificial let's assume that in the initial parameterizations our domain expertise would constrain the parameters to values around or smaller than one. Consequently \(1\) defines the scale that separates "weakly informative" from "strongly informative".

As always, however, we have to start by setting up our local R environment.

library(rstan)
Loading required package: StanHeaders
Loading required package: ggplot2
rstan (Version 2.19.3, GitRev: 2e1f913d3ca3)
For execution on a local, multicore CPU with excess RAM we recommend calling
options(mc.cores = parallel::detectCores()).
To avoid recompilation of unchanged Stan programs, we recommend calling
rstan_options(auto_write = TRUE)
rstan_options(auto_write = TRUE)            # Cache compiled Stan programs
options(mc.cores = parallel::detectCores()) # Parallelize chains

util <- new.env()
source('stan_utility.R', local=util)

set.seed(5389533)

c_light <- c("#DCBCBC")
c_light_highlight <- c("#C79999")
c_mid <- c("#B97C7C")
c_mid_highlight <- c("#A25050")
c_dark <- c("#8F2727")
c_dark_highlight <- c("#7C0000")

c_light_trans <- c("#DCBCBC80")
c_dark_trans <- c("#8F272780")
c_green_trans <- c("#00FF0080")

5.1 Extraneous Parameters

The simplest degeneracy is having an extraneous parameter. Here there are no problematic interactions between the parameters, just one parameter that doesn't influence the likelihood function at all. Technically the extraneous parameter makes this observational model fully non-identified.
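
We can see the non-identifiability directly in the likelihood function: because the extraneous parameter never appears in the observational density the log likelihood is exactly constant along it. A small R sketch, using arbitrary made-up observations and a hypothetical `log_lik` helper, makes this explicit:

```r
# Arbitrary observations for illustration only
y_demo <- c(0.3, -1.2, 0.8)

# The log likelihood function depends on a but not on the extraneous b
log_lik <- function(a, b, sigma=0.75) {
  sum(dnorm(y_demo, mean=a, sd=sigma, log=TRUE))  # b never enters
}

log_lik(1, -0.5) == log_lik(1, 100)  # TRUE for any two values of b
```

Every likelihood function realized from this observational model is consequently flat along the extraneous direction, no matter the observed data.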

To see the problem let's simulate an observation from an arbitrary model configuration.

writeLines(readLines("stan_programs/simu_extra.stan"))
data {
  int N;
}

transformed data {
  real a = 1;
  real b = -0.5;
}

generated quantities {
  real<lower=0> sigma = 0.75;
  real y[N];
  for (n in 1:N) y[n] = normal_rng(a, sigma);
}
N <- 100
simu_data <- list("N" = N)

simu <- stan(file='stan_programs/simu_extra.stan', data=simu_data,
             iter=1, chains=1, seed=4838282, algorithm="Fixed_param")

SAMPLING FOR MODEL 'simu_extra' NOW (CHAIN 1).
Chain 1: Iteration: 1 / 1 [100%]  (Sampling)
Chain 1: 
Chain 1:  Elapsed Time: 0 seconds (Warm-up)
Chain 1:                4.5e-05 seconds (Sampling)
Chain 1:                4.5e-05 seconds (Total)
Chain 1: 
y <- array(extract(simu)$y[1,])
sigma <- extract(simu)$sigma[1]
input_data <- list("N" = N, "y" = y, "sigma" = sigma)

We will first consider a Bayesian model with a uniform prior density function that allows any degeneracies in the likelihood function to fully propagate to the posterior density function.

writeLines(readLines("stan_programs/fit_extra_flat.stan"))
data {
  int<lower=1> N;      // Number of observations
  real y[N];           // Observations
  real<lower=0> sigma; // Measurement variability
}

parameters {
  real a;
  real b;
}

model {
  // Implicit flat prior model
  
  // Observational model
  y ~ normal(a, sigma);
}
fit <- stan(file='stan_programs/fit_extra_flat.stan', data=input_data,
            seed=4938483, refresh=0)

Unfortunately diagnostic checks indicate that Stan's Hamiltonian Monte Carlo sampler is complaining. Not only is \(\hat{R}\) for \(b\) indicating mixing problems, but the numerical trajectories are also hitting the default upper bound of 1024 leapfrog steps, indicating that the sampler is working extraordinarily hard to generate each transition.

util$check_all_diagnostics(fit)
[1] "n_eff / iter looks reasonable for all parameters"
[1] "Rhat for parameter b is 1.35009175072942!"
[1] "  Rhat above 1.1 indicates that the chains very likely have not mixed"
[1] "0 of 4000 iterations ended with a divergence (0%)"
[1] "2402 of 4000 iterations saturated the maximum tree depth of 10 (60.05%)"
[1] "  Run again with max_treedepth set to a larger value to avoid saturation"
[1] "E-FMI indicated no pathological behavior"

The pairs plot of the two parameters shows the Markov chains struggling to explore the non-compact degeneracy that stretches all the way to positive and negative infinity along the extraneous dimension. Note the scale of the vertical axis!

samples <- extract(fit)

par(mfrow=c(1, 1))

plot(samples$a, samples$b,
     col=c_dark_trans, pch=16, cex=0.8,
     xlab="a", ylab="b")