Hierarchical modeling is a powerful technique for modeling heterogeneity and, consequently, it is becoming increasingly ubiquitous in contemporary applied statistics. Unfortunately that ubiquity has not brought with it an equivalent understanding of how awkward these models can be to fit in practice. In this case study we dive deep into hierarchical models, from their theoretical motivations to their inherent degeneracies and the strategies needed to ensure robust computation. We'll learn not only how to use hierarchical models but also how to use them robustly.
In probabilistic modeling heterogeneity refers to systematic variation in the structure of a data generating process beyond the unpredictable, probabilistic variation inherent to measurement processes. Critically heterogeneity implies the existence of distinct contexts in which the structure of the data generating process is fixed; the resulting variation manifests only in measurements that span multiple contexts.
For example heterogeneity might manifest because of different ways a measurement is made, corresponding to the systematic variation in measurements made at different locations or times, with different measurement techniques, or even by different technicians. At the same time heterogeneity might be due to the nature of what is being observed, corresponding to varying subjects, environmental conditions, and the like. Heterogeneity could even be a manifestation of how our domain expertise varies across contexts.
In this case study I will focus on discrete heterogeneity defined by a discrete, although potentially infinite, set of contexts. This includes for example geographic heterogeneity across different countries, but not as a continuous function of distance. I will refer to the set of all contexts and corresponding behaviors as the population and the particular contexts as individuals or items.
From a modeling perspective systematic variation manifests as the replication of some or even all of the parameters across the defining contexts. Here we'll denote the nominal parameters potentially sensitive to the heterogeneity as \[ \theta \in \Theta. \] Note that \(\theta\) might be one-dimensional but could also be multi-dimensional depending on how sophisticated our model of each context is.
In a homogeneous or complete pooling model we ignore the potential variation entirely and assume that the behavior in each individual context is captured by the nominal parameters. Because every context is modeled with the same, monolithic parameters all of the observations across those contexts pool together to inform those parameters, ensuring strongly informed inferences. If the actual variations are large, however, then these strong inferences come at the cost of substantial model misfit.
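As a minimal Python sketch, with hypothetical simulated data, a complete pooling likelihood evaluates every observation against the same monolithic parameters:

```python
import numpy as np
from scipy import stats

# Hypothetical data: a flat array of observations pooled across all contexts.
rng = np.random.default_rng(1)
y_obs = rng.normal(2.0, 1.0, size=30)

def complete_pooling_loglik(mu, sigma, y):
    # Every observation, regardless of context, informs the same
    # monolithic parameters (mu, sigma).
    return stats.norm.logpdf(y, loc=mu, scale=sigma).sum()

print(complete_pooling_loglik(2.0, 1.0, y_obs))
```

Because all thirty observations inform the single pair \((\mu, \sigma)\), the resulting inferences are strongly informed, at the cost of misfit if the contexts actually differ.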
To model heterogeneity in general we have to replace the monolithic parameters with a deterministic function of the varying contexts. In the case of discrete heterogeneity we can replace the monolithic parameters \(\theta \in \Theta\) with a set of \(K\) parameters modeling the individual contexts, \[ \{ \theta_{1}, \ldots, \theta_{K} \}. \]
For example consider the humble normal observational model, \[ y \sim \text{normal}(\mu, \sigma). \] If we expected heterogeneity to manifest in the location of the observations then we might replace the monolithic parameter \(\mu\) with the varying parameters \(\{ \mu_{1}, \ldots, \mu_{K} \}\), \[ y_{k} \sim \text{normal}(\mu_{k}, \sigma). \] At the same time heterogeneity in the scale, also known as heteroskedasticity, could be captured by replacing \(\sigma\) with \(\{ \sigma_{1}, \ldots, \sigma_{K} \}\), \[ y_{k} \sim \text{normal}(\mu, \sigma_{k}). \] Of course both the location and scale could be sensitive to the varying contexts, in which case we would need to replace \(\{ \mu, \sigma \}\) with \[ \{ \{\mu_{1}, \sigma_{1} \}, \ldots, \{\mu_{K}, \sigma_{K} \} \} \] to give \[ y_{k} \sim \text{normal}(\mu_{k}, \sigma_{k}). \]
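A sketch of these three variants, again with hypothetical data where `y[k]` collects the observations made in context \(k\):

```python
import numpy as np
from scipy import stats

# Hypothetical data: y[k] holds the observations from context k.
rng = np.random.default_rng(2)
y = [rng.normal(mu_k, 1.0, size=10) for mu_k in (-1.0, 0.0, 1.0)]

def varying_location_loglik(mus, sigma, y):
    # y_k ~ normal(mu_k, sigma): heterogeneity in the location only.
    return sum(stats.norm.logpdf(y_k, loc=mu_k, scale=sigma).sum()
               for mu_k, y_k in zip(mus, y))

def varying_scale_loglik(mu, sigmas, y):
    # y_k ~ normal(mu, sigma_k): heteroskedasticity.
    return sum(stats.norm.logpdf(y_k, loc=mu, scale=sigma_k).sum()
               for sigma_k, y_k in zip(sigmas, y))

def varying_both_loglik(mus, sigmas, y):
    # y_k ~ normal(mu_k, sigma_k): both location and scale vary.
    return sum(stats.norm.logpdf(y_k, loc=mu_k, scale=sigma_k).sum()
               for mu_k, sigma_k, y_k in zip(mus, sigmas, y))

print(varying_location_loglik([-1.0, 0.0, 1.0], 1.0, y))
```

Setting all of the varying parameters to a common value recovers the corresponding complete pooling likelihood, which is a useful sanity check on any implementation.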
A model for discrete heterogeneity, however, isn't complete until we specify a probabilistic model for the individual parameters, \(\pi(\theta_{1}, \ldots, \theta_{K})\).
An independent heterogeneity or no pooling model assumes no latent interactions between the individual parameters, \[ \pi(\theta_{1}, \ldots, \theta_{K}) = \pi_{1}(\theta_{1}) \cdot \ldots \cdot \pi_{K}(\theta_{K}). \] The flexibility afforded by this independence assumption allows the model to capture any context-to-context variation. It also implies that only observations within each context inform the individual parameters, resulting in less informed inferences relative to the homogeneous model.
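The factorization above translates directly into code: the joint log prior density is just a sum of independent, context-wise terms. The particular per-context priors below are hypothetical choices for illustration only.

```python
import numpy as np
from scipy import stats

# No pooling: pi(theta_1, ..., theta_K) = prod_k pi_k(theta_k).
# These per-context prior densities are hypothetical illustrations.
prior_logpdfs = [
    lambda t: stats.norm.logpdf(t, loc=0.0, scale=1.0),
    lambda t: stats.norm.logpdf(t, loc=0.0, scale=2.0),
    lambda t: stats.norm.logpdf(t, loc=1.0, scale=1.0),
]

def no_pooling_logprior(thetas, prior_logpdfs):
    # Independence means the joint log density is a sum of individual
    # log densities, one per context.
    return sum(lp(t) for lp, t in zip(prior_logpdfs, thetas))

print(no_pooling_logprior([0.1, -0.5, 1.2], prior_logpdfs))
```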
Including latent interactions between the individual parameters implements a partial pooling model. Here the interactions allow observations to potentially inform other contexts without assuming full homogeneity. If the interaction model is a reasonable approximation to the heterogeneity in the true data generating process then our inferences would be more strongly informed than in the no pooling model but suffer less misfit than the complete pooling model.
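One common way to implement such interactions, sketched here under the assumption of a normal population model (an illustrative choice, not the only possibility), is to draw the individual locations from a shared distribution, \(\mu_{k} \sim \text{normal}(\mu_{\text{pop}}, \tau)\):

```python
import numpy as np
from scipy import stats

# Hypothetical data: y[k] holds the observations from context k.
rng = np.random.default_rng(3)
y = [rng.normal(m, 1.0, size=10) for m in (-1.0, 0.0, 1.0)]

def partial_pooling_logjoint(mus, mu_pop, tau, sigma, y):
    # The population model couples the individual mu_k to one another
    # through the shared parameters (mu_pop, tau)...
    lp = stats.norm.logpdf(np.asarray(mus), loc=mu_pop, scale=tau).sum()
    # ...while each context's observations inform its own mu_k directly.
    lp += sum(stats.norm.logpdf(y_k, loc=mu_k, scale=sigma).sum()
              for mu_k, y_k in zip(mus, y))
    return lp

print(partial_pooling_logjoint([-0.8, 0.1, 0.9], 0.0, 1.0, 1.0, y))
```

As \(\tau \rightarrow 0\) this model collapses toward complete pooling, while as \(\tau \rightarrow \infty\) it approaches the no pooling behavior, with intermediate values interpolating between the two extremes.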
This raises the question, however, of exactly how we might model those interactions between the individual parameters.
Conveniently in practice discrete heterogeneity often manifests, or at least approximately manifests, in a way that drastically constrains the latent interactions. In this section we discuss the permutation symmetry inherent to exchangeable populations and the hierarchical models they imply.
Often we are left without any underlying theory for why a data generating process behaves differently in different contexts. In this case we know that the behavior should vary across contexts but we don't have any canonical way of ordering or arranging those contexts by the unknown behavior.
Probabilistically this ignorance implies that the behavior in each context correlates with the behavior in every other context in exactly the same way; without a sense of order we can't define a sense of distance, let alone distance-dependent interactions. Mathematically any model of heterogeneity consistent with this ignorance must be invariant to any permutation of the individual contexts, and hence any permutation of the individual parameters that model the behavior in those contexts. Invariance of a probabilistic model to any permutation of the parameters is known as exchangeability.
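We can verify this permutation invariance numerically for a simple exchangeable density. An independent, identically-distributed model is one sufficient, though not necessary, way to obtain exchangeability; the standard normal components below are an illustrative assumption.

```python
import itertools
import numpy as np
from scipy import stats

# A density built from iid components is exchangeable: permuting its
# arguments leaves the log density unchanged.
def iid_logdensity(thetas):
    return stats.norm.logpdf(np.asarray(thetas)).sum()

thetas = [0.3, -1.2, 2.0]
base = iid_logdensity(thetas)
for perm in itertools.permutations(thetas):
    # Every ordering of the individual parameters yields the same value.
    assert np.isclose(iid_logdensity(list(perm)), base)
```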
Consider for example modeling the rules of a game based on a deck of cards labeled only by mysterious symbols. Even if we know that each card has a different effect in the game we have no knowledge about how those effects might be ordered. Consequently any model of how that game plays out must be invariant to shuffling of the cards.
Typically exchangeability is the result of an explicit or implicit censoring process where we ignore any information about why the data generating process behaves differently in different contexts. In other words exchangeability is not a fundamental property of a system being observed but rather a consequence of how we choose to observe that system!
From this perspective we can think about exchangeability as a variant of Three Card Monte. Although we know that technically there are multiple balls, each with their own discriminating characteristics, once the balls are covered and shuffled we lose track of which ball features which characteristics. What information remains has to be invariant to the many permutations of which we can't keep track, naturally motivating exchangeable prior models.
Another way of interpreting exchangeability is as label invariance. Once we've censored all discriminating information about each context then any labeling of those contexts, whether alphabetical, numerical, or otherwise, will be arbitrary.
In this case any relabeling of the contexts is equivalent to permuting the contexts while keeping the labels fixed, and once again we are naturally led to consider exchangeable models.
All of this said, anytime we consider exchangeability we have to recognize that ignoring discriminating information can be a very poor modeling choice. Consider for example modeling the behavior of a continuous function \(f(x)\) with a set of function values at discretized inputs, \(f_{k} = f(x_{k})\).