Hierarchical modeling is a powerful technique for modeling heterogeneity and, consequently, it is becoming increasingly ubiquitous in contemporary applied statistics. Unfortunately that ubiquity has not brought with it an equivalent understanding of how awkward these models can be to fit in practice. In this case study we dive deep into hierarchical models, from their theoretical motivations to their inherent degeneracies and the strategies needed to ensure robust computation. We'll learn not only how to use hierarchical models but also how to use them robustly.
In probabilistic modeling heterogeneity refers to systematic variation in the structure of a data generating process beyond the unpredictable, probabilistic variation inherent to measurement processes. Critically heterogeneity implies the existence of distinct contexts in which the structure of the data generating process is fixed; the resulting variation manifests only in measurements that span multiple contexts.
For example heterogeneity might manifest because of different ways a measurement is made, corresponding to the systematic variation in measurements made at different locations or times, with different measurement techniques, or even by different technicians. At the same time heterogeneity might be due to the nature of what is being observed, corresponding to varying subjects, environmental conditions, and the like. Heterogeneity could even be a manifestation of how our domain expertise varies across contexts.
In this case study I will focus on discrete heterogeneity defined by a discrete, although potentially infinite, set of contexts. This includes for example geographic heterogeneity across different countries, but not as a continuous function of distance. I will refer to the set of all contexts and corresponding behaviors as the population and the particular contexts as individuals or items.
From a modeling perspective systematic variation manifests as the replication of some or even all of the parameters across the defining contexts. Here we'll denote the nominal parameters potentially sensitive to the heterogeneity as \[ \theta \in \Theta. \] Note that \(\theta\) might be one-dimensional but could also be multi-dimensional depending on how sophisticated our model of each context is.
In a homogeneous or complete pooling model we ignore the potential variation entirely and assume that the behavior in each individual context is captured by the nominal parameters. Because every context is modeled with the same, monolithic parameters all of the observations across those contexts pool together to inform those parameters, ensuring strongly informed inferences. If the actual variations are large, however, then these strong inferences come at the cost of substantial model misfit.
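As a minimal Python sketch, with hypothetical simulated data, a complete pooling likelihood evaluates every observation against the same monolithic parameters:

```python
import numpy as np
from scipy import stats

# Hypothetical data: a flat array of observations pooled across all contexts.
rng = np.random.default_rng(1)
y_obs = rng.normal(2.0, 1.0, size=30)

def complete_pooling_loglik(mu, sigma, y):
    # Every observation, regardless of context, informs the same
    # monolithic parameters (mu, sigma).
    return stats.norm.logpdf(y, loc=mu, scale=sigma).sum()

print(complete_pooling_loglik(2.0, 1.0, y_obs))
```

Because all thirty observations inform the single pair \((\mu, \sigma)\), the resulting inferences are strongly informed, at the cost of misfit if the contexts actually differ.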
To model heterogeneity in general we have to replace the monolithic parameters with a deterministic function of the varying contexts. In the case of discrete heterogeneity we can replace the monolithic parameters \(\theta \in \Theta\) with a set of \(K\) parameters modeling the individual contexts, \[ \{ \theta_{1}, \ldots, \theta_{K} \}. \]
For example consider the humble normal observational model, \[ y \sim \text{normal}(\mu, \sigma). \] If we expected heterogeneity to manifest in the location of the observations then we might replace the monolithic parameter \(\mu\) with the varying parameters \(\{ \mu_{1}, \ldots, \mu_{K} \}\), \[ y_{k} \sim \text{normal}(\mu_{k}, \sigma). \] At the same time heterogeneity in the scale, also known as heteroskedasticity, could be captured by replacing \(\sigma\) with \(\{ \sigma_{1}, \ldots, \sigma_{K} \}\), \[ y_{k} \sim \text{normal}(\mu, \sigma_{k}). \] Of course both the location and scale could be sensitive to the varying contexts, in which case we would need to replace \(\{ \mu, \sigma \}\) with \[ \{ \{\mu_{1}, \sigma_{1} \}, \ldots, \{\mu_{K}, \sigma_{K} \} \} \] to give \[ y_{k} \sim \text{normal}(\mu_{k}, \sigma_{k}). \]
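A sketch of these three variants, again with hypothetical data where `y[k]` collects the observations made in context \(k\):

```python
import numpy as np
from scipy import stats

# Hypothetical data: y[k] holds the observations from context k.
rng = np.random.default_rng(2)
y = [rng.normal(mu_k, 1.0, size=10) for mu_k in (-1.0, 0.0, 1.0)]

def varying_location_loglik(mus, sigma, y):
    # y_k ~ normal(mu_k, sigma): heterogeneity in the location only.
    return sum(stats.norm.logpdf(y_k, loc=mu_k, scale=sigma).sum()
               for mu_k, y_k in zip(mus, y))

def varying_scale_loglik(mu, sigmas, y):
    # y_k ~ normal(mu, sigma_k): heteroskedasticity.
    return sum(stats.norm.logpdf(y_k, loc=mu, scale=sigma_k).sum()
               for sigma_k, y_k in zip(sigmas, y))

def varying_both_loglik(mus, sigmas, y):
    # y_k ~ normal(mu_k, sigma_k): both location and scale vary.
    return sum(stats.norm.logpdf(y_k, loc=mu_k, scale=sigma_k).sum()
               for mu_k, sigma_k, y_k in zip(mus, sigmas, y))

print(varying_location_loglik([-1.0, 0.0, 1.0], 1.0, y))
```

Setting all of the varying parameters to a common value recovers the corresponding complete pooling likelihood, which is a useful sanity check on any implementation.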
A model for discrete heterogeneity, however, isn't complete until we specify a probabilistic model for the individual parameters, \(\pi(\theta_{1}, \ldots, \theta_{K})\).
An independent heterogeneity or no pooling model assumes no latent interactions between the individual parameters, \[ \pi(\theta_{1}, \ldots, \theta_{K}) = \pi_{1}(\theta_{1}) \cdot \ldots \cdot \pi_{K}(\theta_{K}). \] The flexibility afforded by this independence assumption allows the model to capture any context-to-context variation. It also implies that only observations within each context inform the individual parameters, resulting in less informed inferences relative to the homogeneous model.
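The factorization above translates directly into code: the joint log prior density is just a sum of independent, context-wise terms. The particular per-context priors below are hypothetical choices for illustration only.

```python
import numpy as np
from scipy import stats

# No pooling: pi(theta_1, ..., theta_K) = prod_k pi_k(theta_k).
# These per-context prior densities are hypothetical illustrations.
prior_logpdfs = [
    lambda t: stats.norm.logpdf(t, loc=0.0, scale=1.0),
    lambda t: stats.norm.logpdf(t, loc=0.0, scale=2.0),
    lambda t: stats.norm.logpdf(t, loc=1.0, scale=1.0),
]

def no_pooling_logprior(thetas, prior_logpdfs):
    # Independence means the joint log density is a sum of individual
    # log densities, one per context.
    return sum(lp(t) for lp, t in zip(prior_logpdfs, thetas))

print(no_pooling_logprior([0.1, -0.5, 1.2], prior_logpdfs))
```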
Including latent interactions between the individual parameters implements a partial pooling model. Here the interactions allow observations to potentially inform other contexts without assuming full homogeneity. If the interaction model is a reasonable approximation to the heterogeneity in the true data generating process then our inferences would be more strongly informed than in the no pooling model but suffer less misfit than the complete pooling model.
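One common way to implement such interactions, sketched here under the assumption of a normal population model (an illustrative choice, not the only possibility), is to draw the individual locations from a shared distribution, \(\mu_{k} \sim \text{normal}(\mu_{\text{pop}}, \tau)\):

```python
import numpy as np
from scipy import stats

# Hypothetical data: y[k] holds the observations from context k.
rng = np.random.default_rng(3)
y = [rng.normal(m, 1.0, size=10) for m in (-1.0, 0.0, 1.0)]

def partial_pooling_logjoint(mus, mu_pop, tau, sigma, y):
    # The population model couples the individual mu_k to one another
    # through the shared parameters (mu_pop, tau)...
    lp = stats.norm.logpdf(np.asarray(mus), loc=mu_pop, scale=tau).sum()
    # ...while each context's observations inform its own mu_k directly.
    lp += sum(stats.norm.logpdf(y_k, loc=mu_k, scale=sigma).sum()
              for mu_k, y_k in zip(mus, y))
    return lp

print(partial_pooling_logjoint([-0.8, 0.1, 0.9], 0.0, 1.0, 1.0, y))
```

As \(\tau \rightarrow 0\) this model collapses toward complete pooling, while as \(\tau \rightarrow \infty\) it approaches the no pooling behavior, with intermediate values interpolating between the two extremes.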
This raises the question, however, of exactly how we might model those interactions between the individual parameters.
Conveniently in practice discrete heterogeneity often manifests, or at least approximately manifests, in a way that drastically constrains the latent interactions. In this section we discuss the permutation symmetry inherent to exchangeable populations and the hierarchical models they imply.
Often we are left without any underlying theory for why a data generating process behaves differently in different contexts. In this case we know that the behavior should vary across contexts but we don't have any canonical way of ordering or arranging those contexts by the unknown behavior.
Probabilistically this ignorance implies that the behavior in each context correlates with the behavior in every other context in exactly the same way; without a sense of order we can't define a sense of distance, let alone distance-dependent interactions. Mathematically any model of heterogeneity consistent with this ignorance must be invariant to any permutation of the individual contexts, and hence any permutation of the individual parameters that model the behavior in those contexts. Invariance of a probabilistic model to any permutation of the parameters is known as exchangeability.
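We can verify this permutation invariance numerically for a simple exchangeable density. An independent, identically-distributed model is one sufficient, though not necessary, way to obtain exchangeability; the standard normal components below are an illustrative assumption.

```python
import itertools
import numpy as np
from scipy import stats

# A density built from iid components is exchangeable: permuting its
# arguments leaves the log density unchanged.
def iid_logdensity(thetas):
    return stats.norm.logpdf(np.asarray(thetas)).sum()

thetas = [0.3, -1.2, 2.0]
base = iid_logdensity(thetas)
for perm in itertools.permutations(thetas):
    # Every ordering of the individual parameters yields the same value.
    assert np.isclose(iid_logdensity(list(perm)), base)
```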
Consider for example modeling the rules of a game based on a deck of cards labeled only by mysterious symbols. Even if we know that each card has a different effect in the game we have no knowledge about how those effects might be ordered. Consequently any model of how that game plays out must be invariant to shuffling of the cards.
Typically exchangeability is the result of an explicit or implicit censoring process where we ignore any information about why the data generating process behaves differently in different contexts. In other words exchangeability is not a fundamental property of a system being observed but rather a consequence of how we choose to observe that system!
From this perspective we can think about exchangeability as a variant of Three Card Monte. Although we know that technically there are multiple balls, each with their own discriminating characteristics, once the balls are covered and shuffled we lose track of which ball features which characteristics. What information remains has to be invariant to the many permutations of which we can't keep track, naturally motivating exchangeable prior models.
Another way of interpreting exchangeability is as label invariance. Once we've censored all discriminating information about each context then any labeling of those contexts, whether alphabetical, numerical, or otherwise, will be arbitrary.
In this case any relabeling of the contexts is equivalent to permuting the contexts while keeping the labels fixed, and once again we are naturally led to consider exchangeable models.
All of this said, anytime we consider exchangeability we have to recognize that ignoring discriminating information can be a very poor modeling choice. Consider for example modeling the behavior of a continuous function \(f(x)\) with a set of function values at discretized inputs, \(f_{k} = f(x_{k})\).