In Bayesian inference the prior model provides a valuable opportunity to incorporate domain expertise into our inferences. Unfortunately this opportunity often becomes a contentious issue in many fields, and its potential value is lost in the debate. In this case study I will discuss the challenges of building prior models that capture meaningful domain expertise, along with some practical strategies for ameliorating those challenges as much as possible.

To begin we will review exactly what a prior model is and the subtleties of translating implicit domain expertise into explicit prior models. We will then discuss strategies for constructing prior models in one and then many dimensions before ending with a short discussion of meta-analysis and its relationship to some common prior modeling heuristics.

1 The Prior Mire

The precise definition of a prior model is often taken for granted, but it can be surprisingly subtle once we start to model more elaborate systems.

Typically the prior model is formally defined in the context of an assumed observational model, which is itself defined by which variables can be directly observed and which cannot. The observational model, \(\pi(y \mid \theta)\), is first defined as a collection of probability distributions over the observable variables in the observational space \(y \in Y\), each indexed by the unobservable variables in the model configuration space, \(\theta \in \Theta\). A prior model, \(\pi(\theta)\), can then be defined as a probability distribution over those unobservable model configurations.

Evaluating the observational model on realized values of the observable variables, \(\tilde{y}\), gives a likelihood function, \[ l(\theta; \tilde{y}) \propto \pi(\tilde{y}\mid \theta). \] This realized likelihood function can then be used to update the prior model into a posterior distribution over the model configuration space, \[ \pi(\theta \mid \tilde{y}) \propto l(\theta; \tilde{y}) \cdot \pi(\theta), \] that quantifies how compatible different values of the unobservable variables are with the observations and whatever domain expertise is encoded in the assumed observational and prior models.
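
To make this update concrete, here is a minimal sketch in Python assuming a hypothetical beta-binomial setup: a binomial observational model with \(N\) trials and unknown success probability \(\theta\), paired with a beta prior model. The particular numbers are for illustration only.

```python
import numpy as np
from scipy.stats import beta, binom
from scipy.integrate import trapezoid

# Hypothetical setup: a binomial observational model with N trials and
# unknown success probability theta, plus a beta prior model.
N, y_tilde = 10, 7                         # realized observation
theta = np.linspace(0, 1, 1001)            # grid over the model configuration space

prior = beta.pdf(theta, 2, 2)              # pi(theta)
likelihood = binom.pmf(y_tilde, N, theta)  # l(theta; y_tilde)

# The posterior is proportional to the realized likelihood function times
# the prior density; normalizing on the grid gives pi(theta | y_tilde).
posterior = likelihood * prior
posterior /= trapezoid(posterior, theta)
```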

At the same time the observational model and prior model together define a full Bayesian model over all of the variables in both the observational and model configuration spaces, \[ \pi(y, \theta) = \pi(y \mid \theta) \, \pi(\theta). \] Conditioning the full Bayesian model on realized values of the observable variables, \(\tilde{y}\), gives the same posterior distribution, \[ \begin{align*} \pi(\theta \mid \tilde{y}) &\propto \pi(\tilde{y}, \theta) \\ &\propto \pi(\tilde{y} \mid \theta) \, \pi(\theta). \end{align*} \]

Because we've defined the prior model by which variables are observable and which are not, we can always recover this formal prior model from the full Bayesian model by marginalizing over the observable variables, \[ \pi(\theta) = \int \mathrm{d} y \, \pi(y, \theta). \] That said once we've constructed a full Bayesian model the decomposition into an observational and prior model doesn't actually affect the posterior inferences. This raises the question of whether there are other decompositions of a full Bayesian model into different observational and prior models that might also be useful.
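
Continuing the hypothetical beta-binomial sketch from above, we can verify this marginalization numerically: summing the joint density \(\pi(y, \theta) = \pi(y \mid \theta) \, \pi(\theta)\) over all possible observations recovers the prior density.

```python
import numpy as np
from scipy.stats import beta, binom

N = 10
theta = np.linspace(0, 1, 1001)
prior = beta.pdf(theta, 2, 2)

# Marginalize the joint density over the observational space; because y
# is discrete here the integral reduces to a sum over y = 0, ..., N.
marginal = sum(binom.pmf(y, N, theta) * prior for y in range(N + 1))

print(np.allclose(marginal, prior))  # True
```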

Indeed while this formal decomposition is natural in the context of the mathematical operations from which we can derive a posterior distribution, it is not always so appropriate when interpreting a Bayesian model. That interpretational context is itself critical to developing useful Bayesian models in the first place.

For example a narratively generative perspective interprets the observational model as a collection of data generating processes, each of which defines one possible probabilistic story for how values of the observable variables \(y \in Y\) could be generated. The model configurations \(\theta \in \Theta\) then quantify the particular behaviors in each of those stories, and the prior model quantifies how consistent each of those data generating behaviors is with our domain expertise. These interpretations then provide the scaffolding on which we can incorporate modeling assumptions to develop the full Bayesian model.

The observational and prior models motivated by this narrative perspective don't always align with the formal definitions we introduced above. Consider for example the full Bayesian model \[ \pi(y, \theta, \phi) = \pi(y \mid \phi) \, \pi(\phi \mid \theta) \, \pi(\theta), \] where \(\phi\) models some intermediate phenomenon that couples the phenomenon modeled by \(\theta\) and the observations \(y\). Depending on how exactly we interpret this intermediate phenomenon different definitions of the observational and prior models will be useful.

If the intermediate phenomenon is interpreted as part of the data generating processes then the natural observational model subsumes the first two terms, \[ \pi(y, \phi \mid \theta) = \pi(y \mid \phi) \, \pi(\phi \mid \theta), \] or, if we want to isolate the observable variables, \[ \pi(y \mid \theta) = \int \mathrm{d} \phi \, \pi(y \mid \phi) \, \pi(\phi \mid \theta). \] In this case the remaining term \(\pi(\theta)\) becomes the complementary prior model. When developing this model our domain expertise about the internal structure of the data generating processes would inform \(\pi(y \mid \phi)\) and \(\pi(\phi \mid \theta)\) while our domain expertise about the reasonable configurations of those data generating processes would inform \(\pi(\theta)\).
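
As a sketch of this marginalization, assume a hypothetical linear-Gaussian structure where \(\phi\) is normally distributed around \(\theta\) and \(y\) is normally distributed around \(\phi\); a Monte Carlo average over \(\phi\) then estimates the induced observational density. All of the particular distributions and values here are illustrative assumptions.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(8675309)

# Hypothetical intermediate structure, for illustration only:
# phi ~ normal(theta, 1) and y ~ normal(phi, 0.5).
def marginal_obs_density(y, theta, S=100_000):
    # Monte Carlo estimate of pi(y | theta) = int dphi pi(y | phi) pi(phi | theta).
    phi = rng.normal(theta, 1.0, size=S)
    return np.mean(norm.pdf(y, loc=phi, scale=0.5))

# In this conjugate case the exact marginal is normal(theta, sqrt(1 + 0.5**2)).
print(marginal_obs_density(1.0, 0.0))
print(norm.pdf(1.0, 0.0, np.sqrt(1.0 + 0.25)))
```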

Alternatively we can interpret the data generating processes only conditional on the behavior of all of the latent phenomena. In this case the natural observational model is given by just the first term, \(\pi(y \mid \phi)\), leaving the last two terms to form the complementary prior model, \[ \pi(\phi, \theta) = \pi(\phi \mid \theta) \, \pi(\theta). \] From this perspective our domain expertise about the internal structure of the data generating processes would inform only \(\pi(y \mid \phi)\) while our domain expertise about the reasonable configurations of those data generating processes would inform \(\pi(\phi \mid \theta)\) and \(\pi(\theta)\).

If we can't uniquely define observational and prior models then how can we uniquely define prior modeling in contrast to the design of the observational model? In a very real sense we can't. The full Bayesian model not only fully defines our inferences but also unifies all of our modeling assumptions into a common mathematical framework. Any distinction between assumptions about the structure of the data generating processes versus assumptions about the configurations of that structure, and hence how we might inform those assumptions from our domain expertise, depends on how we choose to interpret the system being modeled.

To provide some structure here I will define prior modeling as the completion of a partial narratively generative model to a full Bayesian model. Once we have developed a narratively generative observational model a prior model introduces a probability distribution over any residual degrees of freedom, quantifying their consistency with our domain expertise.

In other words the goal of prior modeling is to introduce additional domain expertise that complements the domain expertise that has already gone into the development of the observational model.

2 Illicit Elicitation

The actualization of implicit, often qualitative, domain expertise into an explicit, quantitative prior model is referred to as elicitation. Regardless of whether that implicit knowledge is drawn from ourselves, our colleagues, or even our communities more generally the challenges of this translation task are similar.

Anytime we take the responsibility of making a principled assumption, instead of defaulting to some conventional or otherwise unconsidered assumption, we have to consult our domain expertise. We elicit our domain expertise when choosing an observational model and a prior model relevant to a given application or, more holistically, when choosing a full Bayesian model that integrates all of our modeling assumptions into a common probabilistic framework.

While most recognize the need for eliciting domain expertise when designing an observational model, many find the very concept of prior elicitation distasteful, so much so that they often reject outright any method that requires the elicitation of anything resembling a prior model. Arguments for this distaste vary, but many reduce to the presumption that any assumptions encoded in a prior model are more "subjective" than those encoded in an observational model despite there being no mathematical difference.

A fairer reason why prior elicitation can appear so unpleasant is that it's often presented as the elicitation of all available domain expertise about a given observational model, which would indeed be a massive and overwhelming undertaking. Formally a probability distribution over a space with an infinite number of elements, such as the integers or the real numbers, contains an infinite amount of information, and specifying a particular prior model requires eliciting all of that information from our domain expertise. Moreover that elicitation is not free; the translation of even small bits of implicit domain expertise into explicit mathematical assumptions requires time and effort that eat away at the finite resources available in any given analysis. Eliciting all of the infinitesimal details of our domain expertise is indeed completely impractical [1].

The same argument, however, can be made for the elicitation of the observational model. Designing an observational model that encapsulates every microscopic detail of some true data generating process also requires an infinite amount of information, most of which isn't even available within our domain expertise. In practice we instead aim for an observational model that captures the more substantial, macroscopic behaviors of the assumed data generating process which can be informed by a limited elicitation of our domain expertise.

Similarly a useful prior model does not have to incorporate all of our domain expertise but rather just enough domain expertise to ensure well-behaved inferences. Because posterior inferences incorporate information from both the realized likelihood function and the prior model, any information encoded in both the likelihood function and the prior model is redundant. An effective prior model needs to elicit only information that's not already available in the realized likelihood function.

If observed data are available then we might analyze a realized likelihood function for the presence of any uncertainties that motivate exactly what type of domain expertise we should elicit into a complementary prior model. That said we have to be very careful to use this analysis to inform only what domain expertise we elicit and not the fabrication of domain expertise itself. A prior model that exactly rectifies any pathological behavior in a realized likelihood function but is incompatible with our actual domain expertise is not a valid prior model. Counterfeiting domain expertise in this way incorporates the observed data twice, resulting in inferences that overfit to the particular details of the observed data and generalize poorly to other circumstances.

Often, however, we don't have access to the observed data when developing a prior model. Without the observed data we can't construct the realized likelihood function, and without that likelihood function we can't motivate exactly what kind of information our prior model needs to capture, and hence what kind of domain expertise to elicit. Once we've developed an observational model, however, we can often constrain the range of potentially problematic behaviors that might arise in realized likelihood functions and build prior models that can compensate for most if not all of them. I will refer to this approach as defensive prior modeling.

By considering the consequences of the observational model we can prioritize what kind of domain expertise we want to incorporate into our prior model. This drastically reduces the elicitation burden, often to the point where the specification of the necessary assumptions becomes much more palatable. The remaining challenge is identifying what kinds of domain expertise are useful for defensive prior modeling, both in one and many dimensional model configuration spaces, and how we can integrate those partial elicitations into self-consistent prior models.

3 One-Dimensional Expertise

Eliciting enough domain expertise to inform a prior model over one-, and sometimes two-, dimensional model configuration spaces is greatly facilitated by our ability to directly visualize probability density functions. That said even when we can picture a prior density function we still have an infinite amount of information to elicit. Around which model configurations does our domain expertise concentrate? How strong is that concentration? Is the concentration symmetric around the centrality or skewed towards model configurations on one side or another? How quickly does the concentration decay as we move away from the centrality?

To make this elicitation more manageable we can prioritize information that will compensate for any poorly informative likelihood functions, in particular diffuse likelihood functions that extend out to extreme model configurations that clash with our domain expertise. These extreme model configurations are not impossible but rather noticeably more inconsistent with our domain expertise than other, more well-behaved model configurations.

In other words the goal of this defensive prior model is not to emphasize consistent model configurations but rather to suppress extreme model configurations in case a realized likelihood function is too diffuse. This more adversarial focus often facilitates the elicitation of domain expertise, especially the external elicitation from colleagues. While we might be hesitant to take responsibility for accepting certain model configurations as reasonable, we tend to be much more decisive in rejecting model configurations as unreasonable. Moreover even if multiple domain experts disagree on which model configurations should be promoted we can often find agreement in which model configurations should be suppressed.

Before we can suppress extreme model configurations we need to separate them out from less extreme model configurations. This requires the elicitation of just enough domain expertise to identify approximate thresholds between the two. Consider, for example, a one-dimensional model configuration space that is parameterized such that model configurations are least extreme around some finite value but become monotonically more extreme as we move closer to positive or negative infinity. Here we need two thresholds, one to denote when we've ventured too far towards positive infinity and one when we've ventured too far towards negative infinity.

In other words these extremity thresholds mark the transition where model configurations become so extreme that our domain expertise better approximates them as infinite rather than as vanishing. Model configurations below an extremity threshold more closely resemble zero than infinity while those above the threshold more closely resemble infinity than zero.

Another benefit of focusing on thresholds is that in practice they rarely need to be elicited all that precisely. The order of magnitude, or factor of ten, where the model configurations transition into more extreme behavior is often sufficient. Conveniently the relevant orders of magnitude are often already implicitly elicited in the units that have become conventional in a given circumstance. A model configuration typically quoted in grams is unlikely to take on values as small as milligrams or as large as kilograms, or else those units would have been taken as the convention!

A containment prior model concentrates prior probability within the elicited extremity thresholds to ensure that the posterior distribution cannot extend far past the thresholds even when a realized likelihood function does. In order to implement a containment prior model, however, we have to define exactly how this containment is achieved.

3.1 Hard Versus Soft Containment

A robust containment prior model has to work well under all three basic interactions between the prior model and a realized likelihood function introduced in my modeling and inference case study. In particular the prior model should have a negligible influence under the contraction scenario but offer substantial regularization under the containment and compromise scenarios.

There are two main strategies that achieve this behavior. Hard containment immediately suppresses all model configurations outside of the extremity thresholds, ensuring that those model configurations are excluded from the posterior distribution no matter the behavior of the realized likelihood function. Soft containment, on the other hand, only gradually suppresses the model configurations past the thresholds, allowing some extreme model configurations to propagate to the posterior distribution. To determine which of these approaches might be more useful let's compare their influences in the three basic inferential scenarios.
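
Before comparing the scenarios it may help to see how the two strategies might be implemented. The following sketch uses hypothetical thresholds, a normal density for soft containment, and that same density truncated to the thresholds for hard containment; all of the particular choices are illustrative assumptions rather than unique prescriptions.

```python
import numpy as np
from scipy.stats import norm, truncnorm

# Hypothetical extremity thresholds, for illustration only.
l, u = 0.0, 10.0
theta = np.linspace(l - 5.0, u + 5.0, 1001)

# Soft containment: a normal density that concentrates between the
# thresholds but decays only gradually past them.
mu = 0.5 * (l + u)
sigma = (u - mu) / norm.ppf(0.99)  # leaks about 1% past each threshold
soft = norm.pdf(theta, mu, sigma)

# Hard containment: the same normal density truncated to [l, u] so that
# no prior probability at all escapes past the thresholds.
a, b = (l - mu) / sigma, (u - mu) / sigma
hard = truncnorm.pdf(theta, a, b, loc=mu, scale=sigma)

# Both densities are now ready to plot or to combine with a likelihood.
```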

First let's assume that the realized likelihood function strongly concentrates within the extremity thresholds so that the posterior distribution will contract from any containment prior model. Here both hard and soft containment prior models offer negligible contributions to the posterior distribution.

Indeed if we zoom into the neighborhood where the realized likelihood function peaks we see that both prior models are locally well-approximated by a uniform prior model.

What happens when the realized likelihood function is only weakly informative and spreads past the extremity thresholds? In this case both prior models contain the posterior distribution away from the extreme model configurations. The hard containment model prevents any model configurations past the thresholds from propagating to the posterior distribution while the soft containment model allows for some leakage.
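
Continuing the hypothetical setup from the sketch above, we can quantify this leakage by pairing each containment prior with a deliberately diffuse likelihood function and computing how much posterior probability escapes past the thresholds.

```python
import numpy as np
from scipy.stats import norm, truncnorm

# Same hypothetical thresholds as before, now paired with a weakly
# informative likelihood function that spreads well past them.
l, u = 0.0, 10.0
theta = np.linspace(-20.0, 30.0, 5001)
likelihood = norm.pdf(theta, 5.0, 15.0)

mu = 0.5 * (l + u)
sigma = (u - mu) / norm.ppf(0.99)
a, b = (l - mu) / sigma, (u - mu) / sigma

def normalize(density):
    return density / (density.sum() * (theta[1] - theta[0]))

hard_post = normalize(likelihood * truncnorm.pdf(theta, a, b, loc=mu, scale=sigma))
soft_post = normalize(likelihood * norm.pdf(theta, mu, sigma))

# Posterior probability past the thresholds: exactly zero under hard
# containment, small but nonzero under soft containment.
outside = (theta < l) | (theta > u)
dt = theta[1] - theta[0]
print(hard_post[outside].sum() * dt, soft_post[outside].sum() * dt)
```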

Finally let's consider when the realized likelihood function concentrates strongly outside of the extremity thresholds, as might happen if we poorly elicited the thresholds from our domain expertise. The conflict between the assumed prior model and the likelihood function here requires that the posterior distribution compromises between the two, but the nature of the compromise depends on the type of containment.

A hard containment model doesn't allow for any compromise outside of the assumed extremity thresholds, and the posterior distribution ends up piling up against the hard boundaries. The resulting awkward posterior density function is often difficult to quantify accurately with computational methods. Soft containment, however, allows for a more reasonable compromise where the posterior distribution can escape past the extremity thresholds towards the model configurations more consistent with the observed data.
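
A sketch of this compromise scenario, again assuming the hypothetical thresholds from above but now with a likelihood function concentrating well outside of them:

```python
import numpy as np
from scipy.stats import norm, truncnorm

# Same hypothetical thresholds, now with a likelihood function that
# concentrates around theta = 15, well past the upper threshold.
l, u = 0.0, 10.0
theta = np.linspace(-5.0, 25.0, 3001)
likelihood = norm.pdf(theta, 15.0, 1.0)

mu = 0.5 * (l + u)
sigma = (u - mu) / norm.ppf(0.99)
a, b = (l - mu) / sigma, (u - mu) / sigma

def normalize(density):
    return density / (density.sum() * (theta[1] - theta[0]))

hard_post = normalize(likelihood * truncnorm.pdf(theta, a, b, loc=mu, scale=sigma))
soft_post = normalize(likelihood * norm.pdf(theta, mu, sigma))

# The hard posterior piles up against the upper threshold while the soft
# posterior compromises between the thresholds and the likelihood.
print(theta[np.argmax(hard_post)])  # approximately 10
print(theta[np.argmax(soft_post)])  # approximately 13
```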

Which of these behaviors is more useful depends on how much we trust our elicitation of the extremity thresholds. If we are completely confident in our elicitation then the compromise situation will either never arise or arise only due to irrelevant fluctuations in the observed data, in which case the hard containment ensures the best possible inferences. When we are less confident, however, the tension between the assumed prior model and the realized likelihood function is critical feedback that the thresholds may have been poorly elicited. In this case the greater compromise allowed by the soft containment approach allows not only for better inferences but also empirical diagnostics of the tension.

Because diagnostics of problematic modeling assumptions are so powerful I find that soft containment prior models are much more robust in practice than hard containment models and I strongly recommend their use. Our remaining challenge is specifying exactly how soft the containment should be.

3.2 Soft, Softer, or Softest?

Qualitatively the probability density function that defines a one-dimensional soft containment prior model should be approximately uniform between the extremity thresholds before rapidly decaying past those thresholds. In order to specify a quantitative prior model, however, we have to define exactly how much prior probability leaks past the thresholds and then exactly how quickly that excess probability decays.

The more prior probability contained within the extremity thresholds the more strongly the reasonable model configurations within those thresholds will influence any resulting posterior distribution; the more prior probability that leaks beyond the thresholds the easier it will be for a resulting posterior distribution to escape beyond the extremity thresholds if needed. In that sense this balance of probability captures our faith in the elicitation of the extremity thresholds themselves.

In practice I have found that allocating \(1\%\) of the prior probability beyond each threshold works well. For example if we are trying to enforce containment between a lower and upper threshold then we would tune our prior model so that \(1\%\) of the prior probability is below the lower threshold, \(1\%\) is above the upper threshold, and \(98\%\) is contained between the two.
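
For a symmetric family like the normal this tuning has a simple closed form, which the following sketch demonstrates with hypothetical thresholds:

```python
from scipy.stats import norm

# Hypothetical elicited extremity thresholds, for illustration only.
l, u = 0.0, 10.0

# Tune a normal prior model so that 1% of the prior probability falls
# below l and 1% above u.  Symmetry puts the location at the midpoint of
# the thresholds; the scale then follows from the standard normal 99%
# quantile.
mu = 0.5 * (l + u)
sigma = (u - mu) / norm.ppf(0.99)

print(norm.cdf(l, mu, sigma))  # approximately 0.01
print(norm.sf(u, mu, sigma))   # approximately 0.01
```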

Acknowledging how approximately we might elicit those thresholds, however, we shouldn't be too precious about that exact probability. Anything between \(0.2\%\) and \(5\%\), and in many cases even between \(0.1\%\) and \(10\%\), will likely achieve qualitatively similar containment.

To completely specify our soft containment prior model we need to define how the prior probability that leaks past the extremity thresholds is distributed. In other words we need to define the behavior of the tails of the prior density function. Lighter tailed probability density functions decay more quickly, concentrating the leaked probability closer to the extremity thresholds, while heavier tailed probability density functions decay more slowly, allowing the leaked probability to spread out to more extreme model configurations and weakening the effective containment of the prior model.

Consider for example a prior model specified by a normal density function with moderate tails and one specified by a Cauchy density function with extremely heavy tails, each tuned so that \(1\%\) of the prior probability is allocated below the lower threshold and above the upper threshold. The Cauchy density function spikes too strongly in the middle to see the tails but if we zoom in on the upper tails we can see how much more slowly the Cauchy tails decay, extending to orders of magnitude larger model configurations.
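
We can reproduce this comparison numerically. The sketch below tunes both a normal and a Cauchy density to the same hypothetical thresholds and then compares how far out their extreme quantiles reach.

```python
from scipy.stats import norm, cauchy

# Same hypothetical thresholds as above, for illustration only.
l, u = 0.0, 10.0
loc = 0.5 * (l + u)

# Tune both families so that 1% of the prior probability falls below l
# and 1% above u.
sigma = (u - loc) / norm.ppf(0.99)    # normal scale
gamma = (u - loc) / cauchy.ppf(0.99)  # Cauchy scale

# The tail probabilities at the thresholds match by construction...
print(norm.sf(u, loc, sigma), cauchy.sf(u, loc, gamma))  # 0.01 0.01

# ...but the leaked probability spreads very differently; the extreme
# Cauchy quantiles reach orders of magnitude further out.
for p in [0.999, 0.9999]:
    print(p, norm.ppf(p, loc, sigma), cauchy.ppf(p, loc, gamma))
```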

We can also see this in the corresponding cumulative distribution functions. Because the tail probabilities match, the normal and Cauchy cumulative distribution functions meet at the extremity thresholds. Beyond those thresholds, however, the Cauchy cumulative distribution function moves towards \(0\) and \(1\) much more slowly than the normal cumulative distribution function.