Under ideal conditions only a small neighborhood of model configurations will be consistent with both the observed data and the domain expertise we encode in our prior model, resulting in a posterior distribution that strongly concentrates along each parameter. This not only yields precise inferences but also facilitates accurate estimation of those inferences. Under more realistic conditions, however, our measurements and domain expertise can be much less informative, allowing our posterior distribution to stretch across more expansive, complex neighborhoods of the model configuration space. These intricate uncertainties complicate not only the utility of our inferences but also our ability to quantify those inferences computationally.

In this case study we will explore the theoretical concept of identifiability and its more geometric counterpart, degeneracy, which better generalizes to applied statistical practice. We will also discuss principled ways to identify and then compensate for degenerate inferences before demonstrating these strategies in a series of pedagogical examples.

1 Identifiability

An observational model \(\pi_{\mathcal{S}}(y ; \theta)\) with observations \(y \in Y\) and model configurations \(\theta \in \Theta\) is identified when all of the data generating processes are unique [1]. In other words any two distinct model configurations \(\theta_{1} \ne \theta_{2}\) specify different probability distributions over the observational space, or equivalently different probability density functions, \[ \pi_{\mathcal{S}}(y ; \theta_{1}) \ne \pi_{\mathcal{S}}(y ; \theta_{2}). \]

For example consider the observational model specified by normal probability density functions whose location parameter is the sum of two model parameters, \[ \pi_{\mathcal{S}}(y ; \alpha, \beta, \sigma) = \text{normal}(y ; \alpha + \beta, \sigma). \] In this case the infinite number of model configurations that fall on the line \(\alpha + \beta = \mu\) for any real number \(\mu \in \mathbb{R}\) all specify exactly the same data generating process.
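This nonidentifiability is easy to verify numerically. The following sketch, with purely illustrative numbers, evaluates the normal log density of a single observation at two distinct model configurations that lie on the same line \(\alpha + \beta = 2\):

```python
import numpy as np

# Log density of the additive-location model, normal(y; alpha + beta, sigma).
def log_density(y, alpha, beta, sigma=1.0):
    return (-0.5 * ((y - (alpha + beta)) / sigma) ** 2
            - np.log(sigma) - 0.5 * np.log(2.0 * np.pi))

y = 1.5

# Two distinct model configurations on the line alpha + beta = 2...
lp1 = log_density(y, alpha=0.5, beta=1.5)
lp2 = log_density(y, alpha=-3.0, beta=5.0)

# ...specify exactly the same data generating process.
print(np.isclose(lp1, lp2))  # True
```

No observation, however large, can distinguish between configurations along this line.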

Formal identifiability, however, does not always imply such extreme degeneracy. The sets of equivalent model configurations, for example, might be compact, or bounded within a finite volume, unlike the lines that stretch all the way to infinity in the previous example. At the same time the sets of equivalent model configurations might contain only a finite number of elements, indicating that only a few model configurations specify exactly the same data generating process, with the bulk of the observational model identified.

That said, just because the data generating processes in the observational model are different doesn't mean that the realized likelihood functions or realized posterior density functions from this model will be different as well. Identifiability offers no guarantee that the differences between the data generating processes will be large enough to distinguish them after only a finite number of observations. An infinite number of observations, however, should be enough to resolve even minuscule differences between the data generating processes.

To be more precise let's define a product observational model that results from repeated, independent measurements, \(\{ y_{1}, \ldots, y_{N} \} \in Y^{N}\), \[ \pi_{\mathcal{S}_{N}}(y_{1}, \ldots, y_{N}; \theta) = \prod_{n = 1}^{N} \pi_{\mathcal{S}}(y_{n} ; \theta). \] As \(N\) approaches infinity the inferences informed by this product observational model will be well-behaved only if the component observational model is identified. The exact meaning of well-behaved depends on the nature of our inferences, in particular whether they are derived from a frequentist or a Bayesian perspective.

1.1 Frequentist Asymptotics

While there are many frequentist estimators, here we will consider only the maximum likelihood estimator.

Recall that evaluating the product observational probability density function on a realized observation, \(\{ \tilde{y}_{1}, \ldots, \tilde{y}_{N} \}\), defines a likelihood function from the model configuration space into the real numbers, \[ \mathcal{L}_{\tilde{y}_{1}, \ldots, \tilde{y}_{N}}(\theta) = \pi_{\mathcal{S}_{N}}(\tilde{y}_{1}, \ldots, \tilde{y}_{N}; \theta). \] The higher the output likelihood relative to the outputs from other model configurations the more consistent that model configuration is with the realized observation relative to the other model configurations. In particular the model configuration with the highest output likelihood is the most consistent relative to the others.

This model configuration with the largest likelihood output is exactly the definition of the maximum likelihood estimator, \[ \hat{\theta}_{\text{ML}}(\tilde{y}_{1}, \ldots, \tilde{y}_{N}) = \underset{\theta \in \Theta}{\text{argmax}} \mathcal{L}_{\tilde{y}_{1}, \ldots, \tilde{y}_{N}}(\theta). \] Ultimately the utility of this estimator is determined by how close the data generating process specified by the maximum likelihood estimator, \[ \pi_{\mathcal{S}}(y; \hat{\theta}_{\text{ML}}(\tilde{y}_{1}, \ldots, \tilde{y}_{N})) \] is to the true data generating process.
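As a minimal sketch, with illustrative numbers and an assumed \(\mathcal{M}\)-closed setup, we can compute the maximum likelihood estimator for a normal location model by maximizing the product log likelihood over a dense grid of model configurations:

```python
import numpy as np

rng = np.random.default_rng(8675309)

# True model configuration in an M-closed setting (illustrative values).
theta_star = 2.0
N = 1000
y = rng.normal(theta_star, 1.0, size=N)

# Log likelihood of the product observational model: a sum of component
# log densities, here normal(y_n; theta, 1) up to an additive constant.
def log_likelihood(theta, y):
    return np.sum(-0.5 * (y - theta) ** 2)

# Maximize over a dense grid of model configurations.
thetas = np.linspace(0.0, 4.0, 4001)
theta_ml = thetas[np.argmax([log_likelihood(t, y) for t in thetas])]

# For this model the maximum likelihood estimator coincides with the
# sample mean, and for large N it lands close to theta_star.
```

With this identified model the estimated configuration recovers the true location up to the expected \(1 / \sqrt{N}\) sampling variation.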

Identifiability is a critical requirement for the utility of the maximum likelihood estimator even under ideal circumstances [3]. More formally let's assume an \(\mathcal{M}\)-closed context such that the observed data are drawn from a data generating process within the observational model, \[ y_{n} \sim \pi(y_{n} ; \theta^{*}), \, \theta^{*} \in \Theta. \] In the asymptotic \(N \rightarrow \infty\) limit the maximum likelihood estimator will converge to the true model configuration for almost all observations only if the component observational model is identified. In other words if our observational model is identified then we will almost always be able to exactly recover the truth given enough data, at least within the context of that model.

Subject to some regularity conditions -- such as the true model configuration being in the interior of the model configuration space and not on a boundary, the likelihood functions being sufficiently smooth, etc -- the asymptotic consistency of the maximum likelihood estimator also implies that the likelihood function realized from almost every product observation itself converges to a delta function around the true model configuration. With some even stronger regularity conditions asymptotic consistency implies that the likelihood functions contract to this delta function uniformly in all directions. Consequently given these regularity conditions identifiability implies that the geometry of the likelihood function is well-defined asymptotically.

Critically these consequences of identifiability are well-defined only in the \(\mathcal{M}\)-closed context. Even when an observational model is identified there are no guarantees regarding how the likelihood functions realized from external observations, that is observations generated from data generating processes outside of the observational model, might behave in the asymptotic limit.

Although confirmation of identifiability has limited utility in practice, where we have neither infinitely large observations nor an \(\mathcal{M}\)-closed context, the failure of frequentist identifiability can be a useful diagnostic tool. Observational models that don't work well in the idealized \(\mathcal{M}\)-closed, infinite limit are unlikely to provide well-behaved inferences in \(\mathcal{M}\)-open, finite data circumstances either. Consequently while frequentist identifiability is not a sufficient condition for useful models it is typically a necessary one, and checking it can help us to find pathological model behavior.

If our observational model depends on auxiliary data, for example geographical locations where measurements are being performed, concepts like identifiability become a bit more complicated. More formally let's define the data, \(y\), that results from the measurement as a variate and the data, \(x\), that defines the context of a measurement as a covariate. Our observational model then takes the conditional form \[ \pi_{\mathcal{S}}(y | x ; \theta). \] We can then define identifiability conditionally on a given covariate value, or marginally given an explicit model for the covariates, \(\pi_{S}(x)\).

The asymptotic limit, and the consequences of these different forms of identifiability, also become more complicated in the presence of covariates. For example we can consider the asymptotic limit given a fixed covariate value, in which case identifiability conditioned on that value is necessary for good asymptotic behavior. We get a much more informative picture, however, when we consider the asymptotic limit with the covariate values spanning the entire range of possible covariate values, giving us sensitivity to possible pathologies near the nominal values. In the statistics literature this limit is known as infill asymptotics.

1.2 Bayesian Asymptotics

The asymptotic behavior of the maximum likelihood estimator and the likelihood function itself also has a natural Bayesian equivalent. If the observational model is identified and the realized likelihood functions all concentrate around the true model configuration then the realized posterior density functions should as well, at least provided that the prior density functions don't obstruct the convergence. This intuition is formalized in the Bernstein-von Mises theorem [3], which defines the conditions under which realized posterior point estimates, such as the posterior mean, median, and mode, converge to the true model configuration while the realized posterior density functions contract around the true model configuration.
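A minimal illustration of this posterior contraction uses the conjugate normal model, where the posterior standard deviation is available in closed form; all of the numbers here are illustrative only.

```python
import numpy as np

# Conjugate normal model: y_n ~ normal(mu, 1) with prior mu ~ normal(0, 1).
# The posterior standard deviation has the closed form
#   sd_post = 1 / sqrt(1 / prior_sd^2 + N / obs_sd^2),
# so the posterior contracts around the truth at the usual 1 / sqrt(N) rate.
def posterior_sd(N, prior_sd=1.0, obs_sd=1.0):
    return 1.0 / np.sqrt(1.0 / prior_sd**2 + N / obs_sd**2)

sds = [posterior_sd(N) for N in (10, 100, 1000, 10000)]
# Each factor-of-100 increase in N shrinks the posterior sd by about 10.
```

For this identified model the prior's influence washes out and the posterior collapses towards the true location, exactly as the Bernstein-von Mises theorem suggests.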

As with frequentist asymptotics, the practical utility of formal Bayesian asymptotics is diagnosing any particularly nasty pathologies in the observational model or the prior model that persist to the asymptotic limit but are likely to corrupt preasymptotic inferences as well.

2 Degeneracy

We will never have perfect models or infinitely many repeated observations. What really matters in practice is the preasymptotic behavior of our inferences, especially when the model only approximates the true data generating process. While the specific results considered above hold only in the asymptotic limit, the geometric considerations are far more general and are perfectly well-defined not only preasymptotically but also in the \(\mathcal{M}\)-open context. In particular we can always investigate how uniformly a realized likelihood function or posterior density function concentrates around any single point in the model configuration space.

Following that intuition let's define degeneracy as a qualitative description of how strongly and how uniformly a realized likelihood function or posterior density function concentrates around a single point in the model configuration space. Our inferences are strongly non-degenerate, or weakly degenerate, if they appear to concentrate around a single point rather than some extended, degenerate surface.




The stronger and more uniformly a realized likelihood function or posterior density function concentrates around a single point the more informed inferences about the given parameters will be. In particular degeneracy is not a binary qualification like asymptotic consistency but rather a continuous one. Exactly how degenerate our inferences can be before they become problematic depends on the details of a given analysis.

Note that nonidentified models will, by definition, be degenerate both asymptotically and preasymptotically. Consequently degeneracy expands upon the more formal notion of identifiability.

2.1 Consequences of Degeneracy

A realized likelihood function or posterior density function becomes problematic when the resulting inferences don't have enough information to achieve the inferential goals or, perhaps more critically, when their geometry frustrates accurate computation of those inferences in the first place.

Whether or not we can expect to accumulate enough information to achieve our inferential goals from typical measurement outcomes is a question of experimental design. Degenerate inferences suggest a mismatch between the experimental design and those goals -- either our design doesn't probe the desired phenomena well enough or our goals are simply too ambitious for the available experimental technology.

Regardless of our particular inferential goals, degeneracy can have critical computational consequences. The more degenerate a realized likelihood function or posterior density function, the more expensive it will be to algorithmically quantify the full extent of the model configurations consistent with the observed data, if it is possible at all. For example deterministic approximations based on point estimates, such as maximum likelihood confidence intervals and Laplace approximations, become less accurate the less uniformly the likelihood function concentrates. At the same time stochastic approximations like Markov chain Monte Carlo will typically be more expensive, rendering them less accurate for a fixed computational cost.

If degeneracy prevents us from being able to fully quantify a realized likelihood function or posterior density function, and hence the inferences they encode, then we won't be able to accurately judge the utility of those inferences at all. In practice we need our inferences to be sufficiently well-identified so that our computational tools have a reasonable opportunity to succeed, and we have a reasonable opportunity to understand how our model responds to a given observation.

2.2 Subtleties of Degeneracy

Although this geometric perspective of degeneracy is motivated by ideal asymptotic behavior, there are some subtle but critically important differences between the asymptotic and preasymptotic perspectives.

2.2.1 A Million Ways To Be Cruel

While the realized likelihood functions and posterior density functions from non-degenerate models all look similar, the realizations from less well-behaved models can each be degenerate in their own unique way. In particular the exact consequences of strong degeneracies will depend on the shape and orientation of the specific surface on which the realizations appear to concentrate. I will refer to the particular geometry of a degenerate likelihood function or posterior density function as a degeneracy.

Perhaps the most important classification of these geometries is between continuous degeneracies and discontinuous degeneracies. A continuous degeneracy is characterized by model configurations consistent with the observed data organizing into a single connected, fuzzy surface.




We can further classify continuous degeneracies that concentrate within any finite volume, such as the right two degeneracies above, as compact, and those that extend all the way to infinity in one or more directions, such as the left degeneracy above, as non-compact.

On the other hand discontinuous degeneracies are characterized by the model configurations consistent with the observed data clustering into modes surrounded by regions of inconsistent model configurations in all directions.




A realized likelihood function or posterior density function can suffer from one or both of these degeneracies at the same time. While it might decompose into a number of discontinuous degeneracies where each individual mode is itself locally non-degenerate, as above, a particularly pathological realization might consist of a collection of continuously degenerate modes.




2.2.2 The Most Capricious of Insects

Asymptotic non-degeneracy requires the collapse of the likelihood functions realized from almost every infinitely large observation. In particular it does not depend on the details of any particular observation, and hence it characterizes the entire observational model.

The shape of the likelihood functions and posterior density functions realized from finite observations, however, will in general depend on those particular details.




Consequently preasymptotic degeneracies will also depend on the particular observation considered. The realized likelihood functions and posterior density functions might be only weakly degenerate for some particularly informative observations, but manifest strong degeneracies for others. Indeed the more complex the measurement process and higher dimensional the observational space, the more opportunities there will be for fluctuations to conspire to obscure the latent phenomena of interest.

In other words we can't talk about an entire model being preasymptotically non-degenerate unless we can verify that every realization of a likelihood function or posterior density function is strongly and uniformly concentrated. At the same time we can't talk about the robustness of a model, and how prone it might be to contorting itself to accommodate misfit, unless we can quantify the behavior of realizations from data generated outside of the model. Such inferential calibrations require either careful theoretical work or expansive simulation studies.

2.2.3 Parameterization Dependence

Because the particular geometry of a realized likelihood function or posterior density function depends on the chosen parameterization of the model configuration space, so too does our definition of degeneracy. A realization that appears degenerate in one parameterization might appear much less degenerate in another!




This parameterization dependence, however, also provides an opportunity for moderating degeneracy problems. Weak degeneracy implies that every parameter is strongly informed independently of the others. If some parameters are strongly coupled then we might be able to find a reparameterization that transforms them into parameters that are less coupled, and hence better identified. At the same time if some parameters are particularly weakly-informed then we might be able to remove them from the model entirely.

3 Managing Degeneracies

Degeneracy is ultimately a question of information, in particular how much information we have about the chosen parameters that coordinate the model configuration space. If a realized likelihood function or posterior density function is degenerate then we need to incorporate more information into our inferences, change the parameters that we're trying to identify, or weaken our inferential goals so that what little information we have is sufficient.

3.1 More Information

The most direct way to reduce degenerate behavior in our inferences is to add more information. In turn the most direct way to add more information is to repeat the measurement process multiple times and aggregate the resulting observations together into one larger, more informative observation. In practice, however, repeating the measurement process exactly is often prohibitively expensive if not outright impossible. Moreover there is no guarantee that some degeneracies won't persist through repeated measurements.

A powerful alternative to repeating the initial measurement process is to complement it with additional measurement processes that are sensitive to only some phenomenological, environmental, or experimental parameters. This complementary data can then break degeneracies in the likelihood function, allowing the original data to inform the rest of the parameters.

Consider, for example, a measurement of some phenomenon, quantified by the parameter \(\lambda_{S}\), that is corrupted by some irreducible background, quantified with the parameter \(\lambda_{B}\). No matter how much data we collect we will not be able to differentiate between the desired phenomenon and the background, leading to degenerate likelihood functions.




Incorporating data from additional measurements where the phenomenon has been removed, however, informs only the background contribution, allowing the excess observed in the original data to inform the latent phenomenon.
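A sketch of this auxiliary-measurement strategy, using Poisson counting models and purely illustrative counts: the nominal likelihood is constant along \(\lambda_{S} + \lambda_{B} = \text{const}\), while the joint likelihood that includes a background-only measurement is not.

```python
import numpy as np
from math import lgamma

# Poisson log probability mass.
def pois_lp(k, lam):
    return k * np.log(lam) - lam - lgamma(k + 1)

y = 12   # counts from the nominal measurement, Poisson(lambda_S + lambda_B)
z = 7    # counts from a background-only measurement, Poisson(lambda_B)

def loglik_nominal(lam_S, lam_B):
    return pois_lp(y, lam_S + lam_B)

def loglik_joint(lam_S, lam_B):
    return loglik_nominal(lam_S, lam_B) + pois_lp(z, lam_B)

# The nominal likelihood is constant along lambda_S + lambda_B = 12...
a = loglik_nominal(2.0, 10.0)
b = loglik_nominal(9.0, 3.0)

# ...but the background-only measurement breaks the ridge.
c = loglik_joint(2.0, 10.0)
d = loglik_joint(9.0, 3.0)
```

Equal values of `a` and `b` exhibit the ridge; the differing values of `c` and `d` show how the auxiliary data isolates the background contribution.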




Bayesian analyses also benefit from the opportunity to add information through the prior model. For example principled domain expertise can suppress extreme model configurations and contain the posterior density function within a reasonably small neighborhood even when the realized likelihood function is degenerate.
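We can see this containment in a grid-based sketch of the additive-location model from Section 1, with illustrative numbers: a single observation informs only \(\alpha + \beta\), yet standard normal priors on each parameter yield a proper posterior with finite marginal variances.

```python
import numpy as np

# One observation y = 0 from normal(alpha + beta, 1): the likelihood is
# degenerate, constant along alpha + beta = const. Standard normal priors
# on alpha and beta nevertheless contain the posterior; analytically the
# posterior precision matrix is [[2, 1], [1, 2]], so Var(alpha) = 2 / 3.
grid = np.linspace(-8.0, 8.0, 801)
A, B = np.meshgrid(grid, grid, indexing="ij")

log_post = -0.5 * (A + B) ** 2 - 0.5 * A**2 - 0.5 * B**2
post = np.exp(log_post - log_post.max())
post /= post.sum()

# Marginal posterior for alpha and its variance from the grid.
marg_alpha = post.sum(axis=1)
var_alpha = np.sum(marg_alpha * grid**2) - np.sum(marg_alpha * grid) ** 2
# var_alpha is close to the analytic value 2 / 3.
```

Without the priors the posterior would stretch without bound along the degenerate direction; with them every marginal is finite and well-behaved.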




At the same time targeted domain expertise can complement the nominal measurement process in the same way as auxiliary measurements, pinning down a few parameters and preventing degeneracies in the realized likelihood function from propagating to the posterior density function.




Of course we have to be responsible and add only information that is actually compatible with our domain expertise. Forcing less principled information into the analysis simply because it resolves an inconvenient degeneracy is a highway to the poor generalizability zone.

3.2 Reparameterizations

The parameters in the initial parameterization being poorly informed doesn't mean that parameters in other parameterizations suffer the same fate. In some instances we can reparameterize the model configuration space, replacing poorly-identified parameters with new parameters that are better suited to the measurement process. Sometimes we can find alternative parameterizations where all of the parameters are strongly informed, and even if we can't we might be able to find parameterizations that isolate degeneracies into just a few parameters which are then easier to manage.

For example consider the observational model that depends on the parameters only through a function \(f : \Theta \rightarrow \mathbb{R}\), \[ \pi_{S}(y; f(\theta_{1}, \ldots, \theta_{N})). \] If the function \(f\) is many-to-one then the observational model will be nonidentified and no amount of data will be able to inform the parameters \(\{\theta_{1}, \ldots, \theta_{N} \}\) individually. On the other hand if we can find a new parameterization with a parameter \[ \phi_{1} = f(\theta_{1}, \ldots, \theta_{N}) \] then the data will be able to inform \(\phi_{1}\) and we can resort to the prior model to inform the remaining \(\{ \phi_{2}, \ldots, \phi_{N} \}\).
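As a concrete sketch, take the hypothetical many-to-one function \(f(\theta_{1}, \theta_{2}) = \theta_{1} \cdot \theta_{2}\); after reparameterizing with \(\phi_{1} = \theta_{1} \cdot \theta_{2}\) and \(\phi_{2} = \theta_{1}\) the likelihood depends on \(\phi_{1}\) alone.

```python
import numpy as np

# Observational model that depends on two parameters only through their
# product: y ~ normal(theta1 * theta2, 1). The map f is many-to-one, so
# theta1 and theta2 are not individually identified.
def loglik_theta(y, t1, t2):
    return -0.5 * (y - t1 * t2) ** 2

# Reparameterize with phi1 = theta1 * theta2 and phi2 = theta1; the data
# inform phi1 while phi2 is left entirely to the prior model.
def loglik_phi(y, phi1, phi2):
    return -0.5 * (y - phi1) ** 2  # no phi2 dependence at all

y = 2.0
# Distinct configurations with the same product are indistinguishable in
# the theta parameterization, and phi2 drops out entirely in the new one.
```

The degeneracy has not disappeared, but it is now isolated in \(\phi_{2}\), where a prior can do all of the remaining work.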

Reparameterizations are especially important when strong degeneracies limit how well a computational algorithm is able to survey the entirety of a realized likelihood function or posterior density function. Careful reparameterizations can reorganize the model configurations consistent with the data into geometries that better match the capabilities of the algorithm and actually facilitate accurate computation.

Many modeling techniques have natural parameterizations that are optimal in different circumstances. In order to use these techniques as responsibly as possible we need to familiarize ourselves with not only the basic structure of the technique but also the natural parameterizations and how they might interact with observed data to yield degenerate behavior.

3.3 Lowered Expectations

Sometimes the degeneracies of a realized likelihood function or posterior density function are inescapable, communicating that we have neither the data nor the domain expertise to learn enough about our model of the latent data generating process.

If our computational tools are able to accurately quantify the degenerate likelihood function or posterior density function then we might have to just report the weak inferences that result as the honest outcome of the analysis. Careful review of these degeneracies can then determine the failings of the initial experimental design and suggest refined measurements moving forwards.

On the other hand when the degeneracies are so bad that they obstruct the complete quantification of the realized likelihood function or posterior density function we won't be able to report faithful inferences at all. In that case we might have to set aside the analysis until we can equip better algorithms that allow us to better understand our inferences and their limitations.

4 Diagnosing Degeneracies

If we know that a realized likelihood function or posterior density function is degenerate then we can manage the degeneracies with the strategies discussed in Section 3. But how exactly can we work out how degenerate a realization is in practice, especially when our model is sophisticated and the model configuration space is high-dimensional?

4.1 All That We See or Seem

One approach is to utilize computational methods that explore the entirety of the realized likelihood function or posterior density function and examine projections of that exploration onto lower-dimensional surfaces that we can visualize. In particular we can't study degeneracy with methods that return only point estimates.

In Bayesian analyses Markov chain Monte Carlo has especially notable diagnostic power. Effective Markov chains will expand into the most relevant neighborhoods of the model configuration space, probing any potential degenerate behavior in the process. Moreover the individual states of the resulting Markov chains can be projected onto any subspace, allowing for immediate visualization.

A common diagnostic strategy is to consider pairs plots, or scatter plots of the realized Markov chain states projected onto each pair of functions of interest, for example the parameter functions.
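Before committing to a full grid of pairs plots we can cheaply scan the draws for strongly coupled pairs. This sketch uses synthetic draws standing in for Markov chain states, with one linear degeneracy planted by hand; all names and numbers are illustrative.

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(1)

# Hypothetical posterior draws: four parameters, with a strong linear
# degeneracy planted between the first two columns.
n_draws, D = 2000, 4
draws = rng.normal(size=(n_draws, D))
draws[:, 1] = -draws[:, 0] + 0.05 * rng.normal(size=n_draws)

# Scan every pair of parameters and flag those with large empirical
# correlation as candidates for closer inspection in pairs plots.
corr = np.corrcoef(draws, rowvar=False)
flagged = [(i, j) for i, j in combinations(range(D), 2)
           if abs(corr[i, j]) > 0.9]
# flagged singles out the planted (0, 1) degeneracy.
```

A scan like this catches only linear couplings, so it complements rather than replaces visual inspection of the plots themselves.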




Because we are looking for degenerate behavior in the posterior density function that might frustrate algorithmic exploration we want to focus on the posterior density function actually encountered by our chosen algorithm. In tools like Stan that means looking at the parameter functions on the unconstrained model configuration space and not the nominal constrained model configuration space that appears in a Stan program itself. Exploring the unconstrained space is made easier by using the diagnostic_file option in the Stan interfaces which writes the unconstrained state at each iteration of the Markov chain to a text file.

These plots can be useful in identifying degeneracies that manifest in only two parameters, but they are not necessarily revealing of degeneracies that manifest on higher-dimensional surfaces. In this case the scatter plots collapse the projected surface and obstruct the underlying degeneracy.




Only by looking at higher-dimensional scatter plots can we see the true nature of the degeneracy.




To avoid this projection collapse we would have to slice the samples into bands of the remaining parameters. Beyond a few dimensions, however, slicing generates an impractical number of plots, each of which is only sparsely populated and hence conveys little information on its own.

Even without slicing the number of possible pairs plots is the biggest limitation to this approach. Without domain expertise to motivate potential degenerate surfaces we would have to force ourselves through at least \(D \cdot (D - 1) / 2\) scatter plots, where \(D\) is the number of parameters in our model. Even for moderate dimensions this quickly expands to hundreds if not thousands of plots that might not even reveal problematic degeneracies at all.

Exploring high-dimensional spaces through visualizations is hard and it's all the more challenging when we don't know for what we're looking!

4.2 My Engine Makes This Strange Sound

Visualizations of Markov chains are only as good as the ability of those Markov chains to capture the full extent of the posterior distribution. In particular slow or ineffectual Markov chains provide only an incomplete view of the full posterior distribution and hence can lead to misleading pairs plot diagnostics.

Consider, for example, a Markov chain that slowly wanders around its initialization, exploring only a very small part of the posterior distribution that concentrates on a curved surface. In this case the pairs plots would show no signs of serious degeneracy.




If we ran multiple Markov chains then we would visualize only multiple incomplete explorations. This could then easily mislead us into thinking that we're dealing with the wrong kind of degeneracy, for example a discrete one instead of a continuous one.




This behavior is particularly troubling when degenerate behavior in a realized posterior density function prevents Markov chains from effectively exploring the full posterior distribution in the first place. Our diagnostics will fail exactly when we need them!

Fortunately not all Markov chain Monte Carlo algorithms fail so silently. Some algorithms, namely Hamiltonian Monte Carlo, actually complain quite loudly when they struggle with degenerate posterior density functions. At least, that is, if you listen to them. Although these complaints are certainly an annoyance they are often our only hint at the underlying pathologies.

Consequently our most powerful diagnostic strategy is a postmortem investigation where we carefully scrutinize how a Markov chain Monte Carlo algorithm struggles or outright fails in order to track down the underlying pathologies and motivate hypotheses for their origins. The louder our algorithm, the more information we will have to guide us to degenerate behavior.

Exactly how degenerate a posterior density function has to be before a Markov chain Monte Carlo algorithm starts to struggle, or breaks outright, depends on the detailed interactions between the algorithm, the configuration of that algorithm, and the posterior density function. Some algorithms are fragile, breaking under even small degeneracies, while others are more robust, bending only slowly as the degeneracies become stronger. In order to maximize our diagnostic power we need to be familiar with how a given algorithm responds to different degeneracies.

For example by comparing multiple Markov chains initialized at different places in the model configuration space, perhaps with summary statistics like the potential scale reduction factor \(\hat{R}\), we might be able to detect discrete degeneracies. The more Markov chains we run the better we will be able to resolve the full extent of the degeneracy, even if each individual Markov chain explores only a local mode.
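A minimal sketch of the classic (non-split, non-rank-normalized) \(\hat{R}\) makes the mechanism concrete; the two synthetic "chains" below, stuck at hypothetical modes, are illustrative only.

```python
import numpy as np

# Potential scale reduction factor from per-chain means and variances;
# a minimal sketch of the classic non-split R-hat.
def rhat(chains):
    chains = np.asarray(chains)           # shape (n_chains, n_iterations)
    M, N = chains.shape
    W = chains.var(axis=1, ddof=1).mean()      # within-chain variance
    B = N * chains.mean(axis=1).var(ddof=1)    # between-chain variance
    var_hat = (N - 1) / N * W + B / N
    return np.sqrt(var_hat / W)

rng = np.random.default_rng(2)

# Chains stuck in two separate modes betray the discrete degeneracy...
stuck = np.stack([rng.normal(0.0, 1.0, 1000), rng.normal(10.0, 1.0, 1000)])
# ...while chains exploring the same distribution do not.
mixed = np.stack([rng.normal(0.0, 1.0, 1000), rng.normal(0.0, 1.0, 1000)])

# rhat(stuck) is far above 1; rhat(mixed) is close to 1.
```

Production tools use refined split and rank-normalized variants of this statistic, but the diagnostic logic is the same: disagreement between chains inflates the between-chain variance and hence \(\hat{R}\).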

4.3 Sonic Boom

Hamiltonian Monte Carlo has a reputation for being fast, but that is only one part of its success. Yes, for non-degenerate posterior density functions Hamiltonian Monte Carlo rapidly explores the posterior distribution even in high-dimensional model configuration spaces. Yes, Hamiltonian Monte Carlo can manage moderately degenerate posterior density functions by consuming more computational resources. What makes Hamiltonian Monte Carlo so powerful in practice, however, is that when it struggles or outright fails to fully quantify a degenerate posterior density function it does so loudly.

Hamiltonian Monte Carlo [4] exploits Hamiltonian dynamics which generate trajectories that rapidly and coherently explore the posterior distribution. In practice we can't generate these trajectories exactly so instead we approximate them with discrete, numerical trajectories. The smaller the discretization time, or step size, the more accurate but also more computationally expensive these numerical trajectories will be.
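The trade-off between step size, cost, and accuracy, as well as the instability behind divergences, can be seen in a self-contained sketch of the leapfrog integrator on a standard normal target; the step sizes and step counts are illustrative only.

```python
import numpy as np

# Hamiltonian for a standard normal target: H(q, p) = q^2 / 2 + p^2 / 2.
def hamiltonian(q, p):
    return 0.5 * q**2 + 0.5 * p**2

# Leapfrog integration, tracking the worst energy error along the
# numerical trajectory (for this target -dV/dq = -q).
def max_energy_error(q, p, step_size, n_steps):
    H0 = hamiltonian(q, p)
    err = 0.0
    for _ in range(n_steps):
        p -= 0.5 * step_size * q   # half step in momentum
        q += step_size * p         # full step in position
        p -= 0.5 * step_size * q   # half step in momentum
        err = max(err, abs(hamiltonian(q, p) - H0))
    return err

# Same total integration time, coarser versus finer step sizes: the
# finer trajectory is more expensive but conserves energy far better.
err_coarse = max_energy_error(1.0, 1.0, 0.5, 20)
err_fine = max_energy_error(1.0, 1.0, 0.05, 200)

# Past the integrator's stability limit the trajectory diverges, with
# the energy error exploding -- the origin of divergent transitions.
err_div = max_energy_error(1.0, 1.0, 2.1, 20)
```

For this quadratic target the leapfrog integrator is stable only for step sizes below 2; the unstable run with step size 2.1 is exactly the kind of divergent trajectory discussed below.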




Any residual error introduced by these numerical trajectories is corrected by introducing a sampling procedure that preferentially returns points along the trajectory with smaller error. Under ideal conditions the performance of this sampling procedure can also be used to automatically tune the step size to achieve near optimal performance in tools like Stan [5]. When Stan encounters complex geometries the step size will automatically be decreased so that the numerical trajectories can resolve all of the fine details of the realized posterior density function as much as possible.

Unfortunately there is a limit to this adaptive feedback. When the posterior density function changes too quickly the numerical integration can become unstable, resulting in numerical trajectories that rocket towards infinity instead of exploring the delicate features. On the other hand these divergent trajectories also provide a clear indication of when Hamiltonian Monte Carlo is unable to fully resolve the details of the target posterior distribution.




Markov chain iterations sampled from divergent trajectories are marked as divergent iterations in Stan. Because the states in divergent trajectories are spatially correlated with the neighborhoods of high curvature that induce the instabilities, we can use these divergent iterations to approximately locate those neighborhoods. We can then verify any suspicious behavior by rerunning with a less aggressive, more refined step size adaptation that should result in a smaller step size and more accurate trajectories that can better penetrate the problematic neighborhood.
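
In rstan this refined adaptation is requested through the `control` argument; `adapt_delta` sets the target acceptance statistic, and pushing it towards one forces a smaller adapted step size. As a sketch, with `my_model.stan` and `input_data` placeholder names for an existing Stan program and data list,

```r
# Increase the adaptation target from its default of 0.8 towards 1
# to force a smaller adapted step size and hence more accurate
# numerical trajectories
fit <- stan(file='my_model.stan', data=input_data,
            seed=4938483, refresh=0,
            control=list(adapt_delta=0.99))
```

If divergences persist even as `adapt_delta` approaches one then they are unlikely to be false positives, and the underlying geometry will have to be addressed with a modified model or parameterization.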




The details of the numerical Hamiltonian dynamics that drive Hamiltonian Monte Carlo serve as a powerful probe of degenerate behavior. For example the less uniformly the posterior density function concentrates, the smaller the step size will have to be to resolve the directions of stronger concentration and the longer the overall trajectories will have to be to fully explore the directions of weaker concentration.




In other words the more continuously degenerate the posterior density function the more integrator steps, or leapfrog steps, we will need to simulate the latent Hamiltonian dynamics. The number of leapfrog steps needed for each iteration can then help to identify problematic geometries.
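
Given a fitted stanfit object, here denoted `fit`, the number of leapfrog steps in each iteration can be extracted from the unpermuted sampler parameters, for example:

```r
# Extract per-iteration sampler diagnostics, one matrix per chain
sampler_params <- get_sampler_params(fit, inc_warmup=FALSE)

# Pool the chains and summarize the number of leapfrog steps
# needed in each iteration
n_leapfrog <- do.call(rbind, sampler_params)[, 'n_leapfrog__']
summary(n_leapfrog)
```

Large numbers of leapfrog steps, especially numbers that saturate the maximum trajectory length, suggest that the sampler is fighting a continuously degenerate geometry.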

Particularly extreme degeneracies often "pinch" at their boundaries, manifesting in neighborhoods where the curvature rapidly increases and renders numerical trajectories unstable. Consequently the appearance and structure of divergent iterations provide a sensitive diagnostic of these problematic boundaries. Because divergent iterations are sampled from divergent trajectories they will concentrate towards the problematic neighborhoods, literally pointing the way to the pathology. Overlaying divergent iterations and non-divergent iterations in a pairs plot is particularly informative, especially when the functions being compared are well-chosen.
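
One way to construct such a plot, sketched here for a stanfit object `fit` with hypothetical parameters `a` and `b`, is to partition the iterations by the `divergent__` sampler parameter and overlay the two groups:

```r
# Flag the divergent iterations across all chains
sampler_params <- get_sampler_params(fit, inc_warmup=FALSE)
divergent <- do.call(rbind, sampler_params)[, 'divergent__'] == 1

# as.data.frame preserves the chain ordering of the sampler parameters,
# unlike extract which permutes the iterations
samples <- as.data.frame(fit)

# Overlay divergent iterations (green) on non-divergent iterations (red)
plot(samples$a[!divergent], samples$b[!divergent],
     col="#8F272780", pch=16, cex=0.8, xlab="a", ylab="b")
points(samples$a[divergent], samples$b[divergent],
       col="#00FF0080", pch=16, cex=0.8)
```

If the green points cluster in one region of the pairs plot instead of scattering uniformly then that region deserves closer scrutiny.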

Sometimes the geometry of the posterior density function is so adversarial that even exact Hamiltonian trajectories don't go very far and Hamiltonian Monte Carlo devolves into incoherent, diffusive exploration. Although this is a relatively rare pathology it can be detected with another diagnostic bespoke to Hamiltonian Monte Carlo -- the energy fraction of missing information, or E-FMI [6]. Because these pathologies are somewhat rare, and often buried under other, more pressing pathologies, we don't yet have a full understanding of the offending degeneracies and hence how to resolve them.
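
Although `util$check_all_diagnostics` reports the E-FMI for us, it can also be computed directly from the `energy__` sampler parameter of a stanfit object, again denoted `fit` here. The statistic compares the variation between successive energies to the total variation of the energies within each chain:

```r
# Compute the energy fraction of missing information for each chain;
# values below roughly 0.2 are conventionally taken to indicate that
# the momenta resampling is failing to explore the energy distribution
sampler_params <- get_sampler_params(fit, inc_warmup=FALSE)
e_fmi <- sapply(sampler_params, function(sp) {
  e <- sp[, 'energy__']
  sum(diff(e)^2) / sum((e - mean(e))^2)
})
```

The 0.2 threshold is an empirical convention rather than a theoretical guarantee, so small E-FMI values are best treated as invitations to investigate further.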

5 Don't Do Me Like That

Because the diversity of degenerate behavior is so expansive, identifying and investigating it, not to mention reacting effectively, can be an overwhelming challenge. To reduce the difficulty we need to build experience, and experience building is facilitated by compartmentalizing the possibilities. Limiting ourselves to a single degeneracy at a time presents a much more manageable lesson, especially when those degeneracies are typical of a new modeling technique that we're learning.

In case studies where I introduce new modeling techniques I take the opportunity to also discuss the degeneracies to which the technique might be vulnerable and the bespoke ways of diagnosing and resolving the underlying pathologies. Here we will consider some much less ambitious examples that demonstrate relatively simple degeneracies and their consequences.

Although these examples are artificial let's assume that in the initial parameterizations our domain expertise would constrain the parameters to values around or smaller than one. Consequently \(1\) defines the scale that separates "weakly informative" from "strongly informative".

As always, however, we have to start by setting up our local R environment.

library(rstan)
Loading required package: StanHeaders
Loading required package: ggplot2
rstan (Version 2.19.3, GitRev: 2e1f913d3ca3)
For execution on a local, multicore CPU with excess RAM we recommend calling
options(mc.cores = parallel::detectCores()).
To avoid recompilation of unchanged Stan programs, we recommend calling
rstan_options(auto_write = TRUE)
rstan_options(auto_write = TRUE)            # Cache compiled Stan programs
options(mc.cores = parallel::detectCores()) # Parallelize chains

util <- new.env()
source('stan_utility.R', local=util)

set.seed(5389533)

c_light <- c("#DCBCBC")
c_light_highlight <- c("#C79999")
c_mid <- c("#B97C7C")
c_mid_highlight <- c("#A25050")
c_dark <- c("#8F2727")
c_dark_highlight <- c("#7C0000")

c_light_trans <- c("#DCBCBC80")
c_dark_trans <- c("#8F272780")
c_green_trans <- c("#00FF0080")

5.1 Extraneous Parameters

The simplest degeneracy is having an extraneous parameter. Here there are no problematic interactions between the parameters, just one parameter that doesn't influence the likelihood function at all. Technically the extraneous parameter makes this observational model fully non-identified.
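
We can see the non-identifiability directly in the likelihood function: because the extraneous parameter never appears in the observational density the log likelihood is exactly constant along it. A small R sketch, using arbitrary made-up observations and a hypothetical `log_lik` helper, makes this explicit:

```r
# Arbitrary observations for illustration only
y_demo <- c(0.3, -1.2, 0.8)

# The log likelihood function depends on a but not on the extraneous b
log_lik <- function(a, b, sigma=0.75) {
  sum(dnorm(y_demo, mean=a, sd=sigma, log=TRUE))  # b never enters
}

log_lik(1, -0.5) == log_lik(1, 100)  # TRUE for any two values of b
```

Every likelihood function realized from this observational model is consequently flat along the extraneous direction, no matter the observed data.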

To see the problem let's simulate an observation from an arbitrary model configuration.

writeLines(readLines("stan_programs/simu_extra.stan"))
data {
  int N;
}

transformed data {
  real a = 1;
  real b = -0.5;
}

generated quantities {
  real<lower=0> sigma = 0.75;
  real y[N];
  for (n in 1:N) y[n] = normal_rng(a, sigma);
}
N <- 100
simu_data <- list("N" = N)

simu <- stan(file='stan_programs/simu_extra.stan', data=simu_data,
             iter=1, chains=1, seed=4838282, algorithm="Fixed_param")

SAMPLING FOR MODEL 'simu_extra' NOW (CHAIN 1).
Chain 1: Iteration: 1 / 1 [100%]  (Sampling)
Chain 1: 
Chain 1:  Elapsed Time: 0 seconds (Warm-up)
Chain 1:                4.5e-05 seconds (Sampling)
Chain 1:                4.5e-05 seconds (Total)
Chain 1: 
y <- array(extract(simu)$y[1,])
sigma <- extract(simu)$sigma[1]
input_data <- list("N" = N, "y" = y, "sigma" = sigma)

We will first consider a Bayesian model with a uniform prior density function that allows any degeneracies in the likelihood function to fully propagate to the posterior density function.

writeLines(readLines("stan_programs/fit_extra_flat.stan"))
data {
  int<lower=1> N;      // Number of observations
  real y[N];           // Observations
  real<lower=0> sigma; // Measurement variability
}

parameters {
  real a;
  real b;
}

model {
  // Implicit flat prior model
  
  // Observational model
  y ~ normal(a, sigma);
}
fit <- stan(file='stan_programs/fit_extra_flat.stan', data=input_data,
            seed=4938483, refresh=0)

Unfortunately diagnostic checks indicate that Stan's Hamiltonian Monte Carlo sampler is complaining. Not only is \(\hat{R}\) for \(b\) indicating mixing problems, but the numerical trajectories are also hitting the default upper bound of 1024 leapfrog steps, indicating that the sampler is working extraordinarily hard to generate each transition.

util$check_all_diagnostics(fit)
[1] "n_eff / iter looks reasonable for all parameters"
[1] "Rhat for parameter b is 1.35009175072942!"
[1] "  Rhat above 1.1 indicates that the chains very likely have not mixed"
[1] "0 of 4000 iterations ended with a divergence (0%)"
[1] "2402 of 4000 iterations saturated the maximum tree depth of 10 (60.05%)"
[1] "  Run again with max_treedepth set to a larger value to avoid saturation"
[1] "E-FMI indicated no pathological behavior"

The pairs plot of the two parameters shows the Markov chains struggling to explore the non-compact degeneracy that stretches all the way to positive and negative infinity along the extraneous dimension. Note the scale of the vertical axis!

samples <- extract(fit)

par(mfrow=c(1, 1))

plot(samples$a, samples$b,
     col=c_dark_trans, pch=16, cex=0.8,
     xlab="a", ylab="b")