Not every observation is whole. In many circumstances some components of the observational space are straightforward to observe while others are more difficult to capture. This disparity motivates variate-covariate models that learn the statistical relationships between these components to predictively complete incomplete observations when needed. With some strong assumptions variate-covariate models eventually yield the infamous, although poorly named, regression models.

In this case study we'll review the foundations of variate-covariate modeling and techniques for building and implementing these models in practice, demonstrating the resulting methodology with some extensive examples. Throughout I will attempt to relate this modeling perspective to more conventional treatments of regression modeling.

1 Foundations of Variate-Covariate Modeling

We begin by establishing the foundational concepts of variates, covariates, and models that consistently incorporate the two. We will then review some of the more common configurations of these models that lead to substantial simplifications, and the practical consequences of those simplifications.

1.1 Variates and Covariates

We often consider the observational space to be indivisible, with all of its components observed or predicted together. In some applications, however, not all of the components are always observed. This results in a missing data problem.

Consider for example the observational space \[ Y_{1} \times \ldots \times Y_{n} \times \ldots \times Y_{N} \] with the component variables \[ (y_{1}, \ldots, y_{n}, \ldots, y_{N}) \] and the complete observational model \[ \pi(y_{1}, \ldots, y_{n}, \ldots, y_{N} \mid \theta). \] When the first component is missing the resulting observation is modeled by the marginal observational model \[ \pi(y_{2}, \ldots, y_{n}, \ldots, y_{N} \mid \theta) = \int \mathrm{d} y_{1} \, \pi(y_{1}, \ldots, y_{n}, \ldots, y_{N} \mid \theta). \] Likewise if the first two components are missing then the resulting observation is modeled by \[ \pi(y_{3}, \ldots, y_{n}, \ldots, y_{N} \mid \theta) = \int \mathrm{d} y_{1} \, \mathrm{d} y_{2} \, \pi(y_{1}, \ldots, y_{n}, \ldots, y_{N} \mid \theta), \] and so on.
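These marginalizations are mechanical once the complete observational model is in hand. As a minimal sketch, assuming a hypothetical discrete model with three binary components, each marginal observational model is just a sum over the missing axes:

```python
import numpy as np

# Hypothetical joint pmf over three binary components, with
# p[i, j, k] = pi(y1 = i, y2 = j, y3 = k | theta) for a fixed theta.
rng = np.random.default_rng(1)
p = rng.random((2, 2, 2))
p /= p.sum()

# Observation missing the first component:
# pi(y2, y3 | theta) = sum_{y1} pi(y1, y2, y3 | theta).
p_23 = p.sum(axis=0)

# Observation missing the first two components:
# pi(y3 | theta) = sum_{y1, y2} pi(y1, y2, y3 | theta).
p_3 = p.sum(axis=(0, 1))
```

Both marginal models remain properly normalized probability distributions, as they must.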

One of the most common missing data problems arises when the observational space decomposes into two components, one of which is included only in some observations while the other is included in all observations. Notationally I will refer to the observational space in this case as \(Y \times X\) with the variates \(y \in Y\) included only in some observations and the covariates \(x \in X\) included in all observations. In many cases the variate space \(Y\) and the covariate space \(X\) will also consist of multiple components of their own, but to simplify the notation I will generally denote them with single variables.

Moreover I will refer to an observation that contains just covariate values, \(\tilde{x}\), without an accompanying variate observation as an incomplete observation. On the other hand I will refer to an observation that contains both covariate and variate values, \((\tilde{y}, \tilde{x})\), as a complete observation.

There are many equivalent, or nearly equivalent, terminologies that one will find in the statistics and machine learning literature. For example complete observations are often referred to as "in sample" data while incomplete observations are referred to as "out of sample" data. Additionally complete observations used to fit a model are often denoted "training data" while complete observations intentionally censored into incomplete observations to assess predictive performance are denoted "testing data".

1.2 Variate-Covariate Models

Once the full observational space has been partitioned, modeling complete and incomplete observations is conceptually straightforward. Complete observations are modeled with a complete observational model, \[ \pi(y, x \mid \theta), \] while incomplete observations are modeled with the corresponding marginal observational model, \[ \pi(x \mid \theta) = \int \mathrm{d} y \, \pi(y, x \mid \theta). \]

Neither of these models, however, informs what unobserved variate configurations are consistent with an incomplete observation \(\tilde{x}\). In order to complete an incomplete, and lonely, observation we need to understand the statistical relationship between the variates \(y\) and the covariates \(x\); not how \(y\) varies alone but rather how \(y\) covaries with \(x\).




If we can learn this covariation from complete observations then we might be able to apply it to predicting missing variates.

Mathematically the covariation between \(y\) and \(x\) is captured in the conditional observational model \[ \pi(y \mid x, \theta) \] that lifts the incomplete observational model back to the complete observational model. In other words the statistical resolution of an incomplete observation motivates a particular conditional decomposition of the complete observational model \(\pi(y, x \mid \theta)\) into a conditional variate model \(\pi(y \mid x, \theta)\) and a marginal covariate model \(\pi(x \mid \theta)\), \[ \pi(y, x \mid \theta) = \pi(y \mid x, \theta) \, \pi(x \mid \theta). \]
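Because this decomposition separates how \(x\) is generated from how \(y\) covaries with \(x\), it also prescribes an ancestral sampling scheme: draw a covariate from the marginal covariate model, then draw a variate from the conditional variate model. Here is a minimal sketch assuming a hypothetical normal linear model for both components; all parameter values are arbitrary choices for illustration:

```python
import numpy as np

# Ancestral sampling from a hypothetical complete observational model
# pi(y, x | theta) = pi(y | x, theta) * pi(x | theta):
# x ~ normal(mu_x, tau), then y | x ~ normal(alpha + beta * x, sigma).
rng = np.random.default_rng(8675309)
mu_x, tau = 1.0, 2.0                 # marginal covariate model configuration
alpha, beta, sigma = -1.0, 0.5, 0.3  # conditional variate model configuration

x = rng.normal(mu_x, tau, size=10_000)   # marginal covariate model
y = rng.normal(alpha + beta * x, sigma)  # conditional variate model

# The empirical covariation of y with x recovers beta = cov(y, x) / var(x).
beta_hat = np.cov(y, x)[0, 1] / np.var(x)
```

The empirical slope estimate `beta_hat` recovers the assumed covariation parameter `beta` up to sampling noise.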

This decomposition is motivated by the variate prediction task; critically it does not assume any narratively generative relationship. Variates and covariates are differentiated not by what is generated first in a complete observation but rather by what tends to be missing in a given application. The categorization of the observational space into variates and covariates can differ from one application to the next even when the data generating process remains the same up to the censoring process!

Narratively generative assumptions might motivate a completely different decomposition of the complete observational model \(\pi(y, x \mid \theta)\).




The number of conditional decompositions, and possible generative stories, explodes even further when the variate and covariate spaces are themselves composite.

For example consider measurements, \(y\), made at varying spatial locations, \(x\). In this case the conditional decomposition \[ \pi(y, x \mid \theta) = \pi(y \mid x, \theta) \, \pi(x \mid \theta) \] captures the structure of the data generating process. When the spatial locations are fixed by the measurement process, but not all measurements have been completed, \(y\) becomes a natural variate and \(x\) a natural covariate, and this narratively generative decomposition aligns with the natural variate-covariate decomposition. On the other hand if all of the measurements are observed but some of the location information has been lost then the non-narratively generative decomposition \[ \pi(y, x \mid \theta) = \pi(x \mid y, \theta) \, \pi(y \mid \theta) \] captures the appropriate variate-covariate structure.
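To make these two decompositions concrete, consider a hypothetical bivariate normal complete observational model with \(\theta = (\mu_{y}, \mu_{x}, \sigma_{y}, \sigma_{x}, \rho)\). The variate-covariate decomposition appropriate for missing measurements is \[ \pi(y \mid x, \theta) = \text{normal} \left( y \mid \mu_{y} + \rho \frac{\sigma_{y}}{\sigma_{x}} \, (x - \mu_{x}), \sqrt{1 - \rho^{2}} \, \sigma_{y} \right), \quad \pi(x \mid \theta) = \text{normal}(x \mid \mu_{x}, \sigma_{x}), \] while the decomposition appropriate for missing locations swaps the roles of the two components, \[ \pi(x \mid y, \theta) = \text{normal} \left( x \mid \mu_{x} + \rho \frac{\sigma_{x}}{\sigma_{y}} \, (y - \mu_{y}), \sqrt{1 - \rho^{2}} \, \sigma_{x} \right), \quad \pi(y \mid \theta) = \text{normal}(y \mid \mu_{y}, \sigma_{y}). \] Both products reconstruct the same joint density; only the conditional structure, and hence the prediction each decomposition enables, differs.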

Constructing a conditional variate model and a marginal covariate model directly isn't always straightforward, especially if the interaction between the variates and covariates is complicated. In those cases it is often easier to use any narratively generative structure to first build the complete observational model before decomposing it into the appropriate conditional variate and marginal covariate models.

2 You Complete Me (Statistically)

Conceptually incomplete observations inform the model configurations through the marginal covariate model, \[ \pi(\theta \mid \tilde{x}) \propto \pi(\tilde{x} \mid \theta) \, \pi(\theta), \] while complete observations inform the model configurations through both the conditional variate model and marginal covariate model, \[ \pi(\theta \mid \tilde{y}, \tilde{x}) \propto \pi(\tilde{y} \mid \tilde{x}, \theta) \, \pi(\tilde{x} \mid \theta) \, \pi(\theta). \] Any constraints on these model configurations \(\theta\) inform predictions of missing variates through the partial evaluation of the conditional variate model, \[ \pi(y \mid \tilde{x}, \theta). \] Mathematically integrating complete and incomplete observations into consistent predictions, however, isn't always straightforward.

The key to avoiding mathematical errors is to always work with the full Bayesian model, \[ \pi(y, x, \theta) = \pi(y \mid x, \theta) \, \pi(x \mid \theta) \, \pi(\theta). \] Evaluating the full model on whatever variables have been observed automatically integrates all of the available information to form consistent inferences of the model configurations and consistent predictions of any missing variates.

In this section we will investigate various evaluations of the full Bayesian model to understand how the conditional variate model and marginal covariate model contribute to these inferential outcomes in different circumstances. I have placed a particular focus on how these contributions are determined by the assumed interactions between the two models. To introduce the mathematical derivations more gently we will begin by considering a single observation before moving on to multiple observations where inferences and predictions are woven together.

Before that, however, let me emphasize that these derivations are meant to only illustrate how inferences from the conditional variate model and the marginal covariate model are intertwined in complex ways. To implement these inferences in practice we only ever need to construct the full Bayesian model and evaluate it on all of the observed variables!

2.1 Single Observation

Let's begin our analysis of the variate-covariate model by considering what happens when we evaluate the full Bayesian model for a single observation, \[ \pi(y, x, \theta) = \pi(y \mid x, \theta) \, \pi(x \mid \theta) \, \pi(\theta), \] on both a complete and an incomplete observation. We will first consider the model configurations monolithically before decomposing them into multiple parameters that capture the individual and common dependencies of the conditional variate and marginal covariate models.

2.1.1 Encapsulated Model Configurations

When the model configurations are treated as a monolithic variable \(\theta\) they inform both the marginal covariate model and the conditional variate model. Consequently we will in general learn about the model configurations from both complete and incomplete observations.




Keep in mind that we're denoting the variate and covariate spaces with single variables to ease the notational burden here. The variate and covariate spaces could themselves decompose into multiple components, for example when modeling repeated observations.




Given a complete observation \((\tilde{y}, \tilde{x})\) the posterior distribution for \(\theta\) becomes \[ \begin{align*} \pi(\theta \mid \tilde{y}, \tilde{x}) &\propto \pi(\tilde{y}, \tilde{x}, \theta) \\ &\propto \pi(\tilde{y} \mid \tilde{x}, \theta) \, \pi(\tilde{x} \mid \theta) \, \pi(\theta). \end{align*} \] We can also recover this posterior distribution by incorporating the covariates and variates sequentially. The evaluated marginal covariate model first updates the prior model into a covariate only posterior distribution, \[ \pi(\theta \mid \tilde{x}) \propto \pi(\tilde{x} \mid \theta) \, \pi(\theta). \] Then the evaluated conditional variate model updates this into the full posterior distribution, \[ \pi(\theta \mid \tilde{y}, \tilde{x}) \propto \pi(\tilde{y} \mid \tilde{x}, \theta) \, \pi(\theta \mid \tilde{x}). \]

When only an incomplete observation of covariates \(\tilde{x}\) is available the conditional variate model informs a predictive distribution for the missing variate. To see this we first marginalize out the model configurations from the full Bayesian model, \[ \begin{align*} \pi(y, x) &= \int \mathrm{d} \theta \, \pi(y, x, \theta) \\ &= \int \mathrm{d} \theta \, \pi(y \mid x, \theta) \, \pi(x \mid \theta) \, \pi(\theta), \end{align*} \] and then evaluate on the observed covariate, \[ \begin{align*} \pi(y \mid \tilde{x}) &\propto \pi(y, \tilde{x}) \\ &\propto \int \mathrm{d} \theta \, \pi(y \mid \tilde{x}, \theta) \, \pi(\tilde{x} \mid \theta) \, \pi(\theta) \\ &\propto \int \mathrm{d} \theta \, \pi(y \mid \tilde{x}, \theta) \, \pi(\theta \mid \tilde{x}). \end{align*} \] Here the prior model and the evaluated marginal covariate model inform the model configurations \(\theta\), which then inform the conditional variate model how to make predictions for the unobserved variate \(y\).
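This predictive construction can be checked numerically. The following sketch evaluates \(\pi(y \mid \tilde{x})\) on a grid for a hypothetical one-parameter model, with normal covariate and variate models chosen purely for illustration:

```python
import numpy as np

def normal_pdf(z, loc, scale):
    return np.exp(-0.5 * ((z - loc) / scale) ** 2) / (scale * np.sqrt(2 * np.pi))

# Toy model, assumed purely for illustration: scalar theta ~ normal(0, 1),
# marginal covariate model x | theta ~ normal(theta, 1), and conditional
# variate model y | x, theta ~ normal(theta * x, 1).
theta = np.linspace(-6, 6, 1201)
dtheta = theta[1] - theta[0]
x_tilde = 1.5  # an incomplete observation: covariate only

# pi(theta | x_tilde) propto pi(x_tilde | theta) * pi(theta)
post = normal_pdf(x_tilde, theta, 1.0) * normal_pdf(theta, 0.0, 1.0)
post /= post.sum() * dtheta

# pi(y | x_tilde) = int dtheta pi(y | x_tilde, theta) * pi(theta | x_tilde)
y = np.linspace(-12, 12, 2401)
dy = y[1] - y[0]
pred = (normal_pdf(y[:, None], theta[None, :] * x_tilde, 1.0)
        * post[None, :]).sum(axis=1) * dtheta
```

Here the observed covariate first updates the prior model into \(\pi(\theta \mid \tilde{x})\), and only then does the conditional variate model translate that information into a prediction for the missing variate.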

2.1.2 Component Model Configurations

In most cases not every component of the model configuration space will inform both the conditional variate model and the marginal covariate model. A more realistic picture arises when we decompose the model configuration space into component parameters that influence just the conditional variate model, \(\eta\), component parameters that influence just the marginal covariate model, \(\gamma\), and component parameters that influence both, \(\psi\), \[ \theta = (\eta, \gamma, \psi). \]

The parameters \(\psi\) couple the behavior of the conditional variate model and the marginal covariate model together so that if one changes so too must the other. Because this coupling complicates inferential behavior the \(\psi\) are known as confounders.

Under this decomposition the full Bayesian model becomes \[ \pi(y, x, \eta, \gamma, \psi) = \pi(y \mid x, \eta, \psi) \, \pi(x \mid \gamma, \psi) \, \pi(\eta) \, \pi(\gamma) \, \pi(\psi). \]




Given a complete observation \((\tilde{y}, \tilde{x})\) we can inform a posterior distribution over all of these parameters, \[ \pi(\eta, \gamma, \psi \mid \tilde{y}, \tilde{x}) \propto \pi(\tilde{y} \mid \tilde{x}, \eta, \psi) \, \pi(\tilde{x} \mid \gamma, \psi) \, \pi(\eta) \, \pi(\gamma) \, \pi(\psi). \] The observed covariates inform \(\gamma\) and \(\psi\) through the marginal covariate model while the covariation between the observed variates and covariates informs \(\eta\) and \(\psi\) through the conditional variate model.

To make predictions for the missing variate accompanying an observed covariate we proceed as before. First we marginalize the parameters out of the full Bayesian model, \[ \begin{align*} \pi(y, x) &= \int \mathrm{d} \eta \, \mathrm{d} \gamma \, \mathrm{d} \psi \, \pi(y, x, \eta, \gamma, \psi) \\ &= \int \mathrm{d} \eta \, \mathrm{d} \gamma \, \mathrm{d} \psi \, \pi(y \mid x, \eta, \psi) \, \pi(x \mid \gamma, \psi) \, \pi(\eta) \, \pi(\gamma) \, \pi(\psi) \\ &= \int \mathrm{d} \psi \left[ \int \mathrm{d} \eta \, \pi(y \mid x, \eta, \psi) \, \pi(\eta) \right] \left[ \int \mathrm{d} \gamma \, \pi(x \mid \gamma, \psi) \, \pi(\gamma) \right] \pi(\psi). \end{align*} \] Then we condition on the observed covariates \(\tilde{x}\), \[ \begin{align*} \pi(y \mid \tilde{x}) &\propto \pi(y, \tilde{x}) \\ &\propto \int \mathrm{d} \psi \left[ \int \mathrm{d} \eta \, \pi(y \mid \tilde{x}, \eta, \psi) \, \pi(\eta) \right] \left[ \int \mathrm{d} \gamma \, \pi(\tilde{x} \mid \gamma, \psi) \, \pi(\gamma) \right] \pi(\psi). \end{align*} \]

Without any observed variates \(\eta\) is informed by only the prior model. Integrating out \(\eta\) gives a marginal predictive distribution for the missing variate that depends on only the unknown confounding behavior, \[ \int \mathrm{d} \eta \, \pi(y \mid \tilde{x}, \eta, \psi) \, \pi(\eta) \propto \pi(y \mid \tilde{x}, \psi). \]

While the parameters \(\gamma\) don't inform this predictive distribution they do influence what we learn about the confounding parameters \(\psi\) from the observed covariates. After integrating out \(\gamma\) we can interpret the last two terms as a marginal posterior distribution for the confounding variables given just the observed covariates, \[ \left[ \int \mathrm{d} \gamma \, \pi(\tilde{x} \mid \gamma, \psi) \, \pi(\gamma) \right] \pi(\psi) \propto \pi(\psi \mid \tilde{x}). \]

The final predictive distribution is then given by weighting the confounder-dependent predictive distribution by this marginal posterior for the confounding behavior, \[ \pi(y \mid \tilde{x}) \propto \int \mathrm{d} \psi \, \pi(y \mid \tilde{x}, \psi) \, \pi(\psi \mid \tilde{x}). \]

Because of the confounding behavior both the marginal covariate model and the conditional variate model are necessary for making consistent inferences and predictions.
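A small numerical experiment makes the role of the marginal covariate model concrete. Assuming a hypothetical model where a confounder \(\psi\) shifts both the covariate and the variate, dropping the marginal covariate model, and hence the information that \(\tilde{x}\) carries about \(\psi\), visibly biases the prediction:

```python
import numpy as np

def normal_pdf(z, loc, scale):
    return np.exp(-0.5 * ((z - loc) / scale) ** 2) / (scale * np.sqrt(2 * np.pi))

# Hypothetical confounded model, chosen purely for illustration:
# psi ~ normal(0, 1), x | psi ~ normal(psi, 1), y | x, psi ~ normal(x + psi, 1).
psi = np.linspace(-6, 6, 1201)
dpsi = psi[1] - psi[0]
x_tilde = 2.0  # an incomplete observation

# The marginal covariate model lets the observed covariate inform the
# confounder: pi(psi | x_tilde) propto pi(x_tilde | psi) * pi(psi).
post = normal_pdf(x_tilde, psi, 1.0) * normal_pdf(psi, 0.0, 1.0)
post /= post.sum() * dpsi

y = np.linspace(-10, 14, 2401)
dy = y[1] - y[0]

# Consistent prediction: average pi(y | x_tilde, psi) over pi(psi | x_tilde).
pred = (normal_pdf(y[:, None], x_tilde + psi[None, :], 1.0)
        * post[None, :]).sum(axis=1) * dpsi

# Dropping the marginal covariate model replaces pi(psi | x_tilde) with the prior.
prior = normal_pdf(psi, 0.0, 1.0)
prior /= prior.sum() * dpsi
pred_drop = (normal_pdf(y[:, None], x_tilde + psi[None, :], 1.0)
             * prior[None, :]).sum(axis=1) * dpsi

mean_pred = (y * pred).sum() * dy       # near x_tilde + E[psi | x_tilde] = 3
mean_drop = (y * pred_drop).sum() * dy  # near x_tilde + E[psi] = 2
```

In this toy setup the covariate-informed prediction centers a full unit away from the prediction that ignores the marginal covariate model, exactly because \(\tilde{x}\) is informative about \(\psi\).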

2.1.3 No Confounders

Inferential outcomes simplify dramatically when there are no confounding parameters. In this case the full Bayesian model decomposes into independent conditional variate and marginal covariate models, \[ \pi(y, x, \eta, \gamma) = \pi(y \mid x, \eta) \, \pi(x \mid \gamma) \, \pi(\eta) \, \pi(\gamma). \]




Given a complete observation \((\tilde{y}, \tilde{x})\) the posterior distribution over \(\eta\) and \(\gamma\) becomes \[ \begin{align*} \pi(\eta, \gamma \mid \tilde{y}, \tilde{x}) &\propto \pi(\tilde{y} \mid \tilde{x}, \eta) \, \pi(\tilde{x} \mid \gamma) \, \pi(\eta) \, \pi(\gamma) \\ &\propto \pi(\tilde{y} \mid \tilde{x}, \eta) \, \pi(\eta) \, \pi(\tilde{x} \mid \gamma) \, \pi(\gamma) \\ &\propto \pi(\eta \mid \tilde{y}, \tilde{x}) \, \pi(\gamma \mid \tilde{x}). \end{align*} \] Without any confounding parameters inferences for the conditional variate parameter \(\eta\) and the marginal covariate parameter \(\gamma\) completely decouple! The empirical covariation between \(\tilde{y}\) and \(\tilde{x}\) informs the configuration of the conditional variate model completely independently of how \(\tilde{x}\) informs the configuration of the marginal covariate model. Consequently we can learn the conditional variate parameters \(\eta\) without constructing a marginal covariate model at all.
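We can verify this decoupling numerically. In the following sketch a hypothetical unconfounded model is evaluated on a grid, and the joint posterior density factorizes exactly into the two marginal posteriors:

```python
import numpy as np

def normal_pdf(z, loc, scale):
    return np.exp(-0.5 * ((z - loc) / scale) ** 2) / (scale * np.sqrt(2 * np.pi))

# Hypothetical unconfounded model, chosen purely for illustration:
# eta ~ normal(0, 1), gamma ~ normal(0, 1),
# x | gamma ~ normal(gamma, 1), y | x, eta ~ normal(eta * x, 1).
eta = np.linspace(-4, 4, 201)
gamma = np.linspace(-4, 4, 201)
y_tilde, x_tilde = 0.7, 1.2  # a complete observation

# Joint posterior density on the grid, normalized to sum to one.
joint = (normal_pdf(y_tilde, eta[:, None] * x_tilde, 1.0)
         * normal_pdf(eta[:, None], 0.0, 1.0)
         * normal_pdf(x_tilde, gamma[None, :], 1.0)
         * normal_pdf(gamma[None, :], 0.0, 1.0))
joint /= joint.sum()

# Without confounders the posterior factorizes:
# pi(eta, gamma | y, x) = pi(eta | y, x) * pi(gamma | x).
marg_eta = joint.sum(axis=1)
marg_gamma = joint.sum(axis=0)
```

Because the unnormalized joint posterior is a product of a function of \(\eta\) alone and a function of \(\gamma\) alone, the outer product of the marginals reproduces the joint exactly.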

Predictions for unobserved variates simplify as well. The marginal distribution for the full observational space is given by \[ \begin{align*} \pi(y, x) &= \int \mathrm{d} \eta \, \mathrm{d} \gamma \, \pi(y, x, \eta, \gamma) \\ &= \int \mathrm{d} \eta \, \mathrm{d} \gamma \, \pi(y \mid x, \eta) \, \pi(x \mid \gamma) \, \pi(\eta) \, \pi(\gamma) \\ &= \left[ \int \mathrm{d} \eta \, \pi(y \mid x, \eta) \, \pi(\eta) \right] \left[ \int \mathrm{d} \gamma \, \pi(x \mid \gamma) \, \pi(\gamma) \right]. \end{align*} \] Conditioning on the observed covariates \(\tilde{x}\) then gives the predictive distribution for the missing variate \(y\), \[ \begin{align*} \pi(y \mid \tilde{x}) &\propto \pi(y, \tilde{x}) \\ &\propto \left[ \int \mathrm{d} \eta \, \pi(y \mid \tilde{x}, \eta) \, \pi(\eta) \right] \left[ \int \mathrm{d} \gamma \, \pi(\tilde{x} \mid \gamma) \, \pi(\gamma) \right]. \end{align*} \]

In this case the evaluation of the marginal covariate model is just a constant that can be dropped entirely to give \[ \pi(y \mid \tilde{x}) \propto \int \mathrm{d} \eta \, \pi(y \mid \tilde{x}, \eta) \, \pi(\eta). \] Once again the marginal covariate model is irrelevant; given an observed covariate we can make predictions for the unobserved variate using only the conditional variate model and the marginal prior model for \(\eta\).

Being able to disregard the marginal covariate model entirely significantly reduces our modeling burden, especially when the provenance of the covariates is poorly understood. This is possible, however, only in the absence of confounding behaviors and cannot be taken for granted.

2.2 Multiple Observations

Generalizing these evaluations to multiple observations, some of which might be complete and some of which might be incomplete, is straightforward if the configurations of the conditional variate and marginal covariate models are fixed. If those configurations vary across observations, however, then we have to proceed very carefully. In this section we will investigate inferential outcomes under various assumptions about the heterogeneity of the conditional variate and marginal covariate model configurations.

2.2.1 Encapsulated Model Configurations

To warm up let's consider the entire model configuration space encapsulated into a single variable \(\theta\) once again and build models for homogeneous and heterogeneous configurations.

If the monolithic model configuration is fixed across observations then the full Bayesian model is given by \[ \pi(y_{1}, x_{1}, \ldots, y_{K}, x_{K}, \theta) = \left[ \prod_{k = 1}^{K} \pi(y_{k} \mid x_{k}, \theta) \, \pi(x_{k} \mid \theta) \right] \pi(\theta). \]




For just two observations, \((y_{1}, x_{1})\) and \((y_{2}, x_{2})\), this reduces to \[ \pi(y_{1}, x_{1}, y_{2}, x_{2}, \theta) = \pi(y_{1} \mid x_{1}, \theta) \, \pi(x_{1} \mid \theta) \, \pi(y_{2} \mid x_{2}, \theta) \, \pi(x_{2} \mid \theta) \, \pi(\theta). \]

When the first observation is complete but the second is incomplete, so that we have \(\tilde{y}_{1}\) and \(\tilde{x}_{1}\) but only \(\tilde{x}_{2}\), we can construct a predictive distribution for the missing variate as \[ \begin{align*} \pi(y_{2} \mid \tilde{y}_{1}, \tilde{x}_{1}, \tilde{x}_{2}) &\propto \pi(\tilde{y}_{1}, \tilde{x}_{1}, y_{2}, \tilde{x}_{2}) \\ &\propto \int \mathrm{d} \theta \, \pi(\tilde{y}_{1}, \tilde{x}_{1}, y_{2}, \tilde{x}_{2}, \theta) \\ &\propto \int \mathrm{d} \theta \, \pi(\tilde{y}_{1} \mid \tilde{x}_{1}, \theta) \, \pi(\tilde{x}_{1} \mid \theta) \, \pi(y_{2} \mid \tilde{x}_{2}, \theta) \, \pi(\tilde{x}_{2} \mid \theta) \, \pi(\theta) \\ &\propto \int \mathrm{d} \theta \, \pi(y_{2} \mid \tilde{x}_{2}, \theta) \, \pi(\tilde{y}_{1} \mid \tilde{x}_{1}, \theta) \, \pi(\tilde{x}_{1} \mid \theta) \, \pi(\tilde{x}_{2} \mid \theta) \, \pi(\theta). \end{align*} \]

The last three terms can be written as a posterior distribution for \(\theta\) given both covariate observations, \[ \pi(\tilde{x}_{1} \mid \theta) \, \pi(\tilde{x}_{2} \mid \theta) \, \pi(\theta) \propto \pi(\theta \mid \tilde{x}_{1}, \tilde{x}_{2}). \] This quantifies how much the prior model and the marginal covariate model inform the model configurations \(\theta\).

Evaluating the conditional variate model on the observed variate \(\tilde{y}_{1}\) defines a likelihood function that updates this covariate posterior distribution into a posterior distribution informed by all of the observed quantities, \[ \begin{align*} \pi(\tilde{y}_{1} \mid \tilde{x}_{1}, \theta) \, \pi(\tilde{x}_{1} \mid \theta) \, \pi(\tilde{x}_{2} \mid \theta) \, \pi(\theta) &\propto \pi(\tilde{y}_{1} \mid \tilde{x}_{1}, \theta) \, \pi(\theta \mid \tilde{x}_{1}, \tilde{x}_{2}) \\ &\propto \pi(\theta \mid \tilde{y}_{1}, \tilde{x}_{1}, \tilde{x}_{2}). \end{align*} \]

Finally the predictive distribution for the missing variate \(y_{2}\) can be interpreted as the conditional variate model evaluated at \(\tilde{x}_{2}\) averaged over this posterior distribution, \[ \pi(y_{2} \mid \tilde{y}_{1}, \tilde{x}_{1}, \tilde{x}_{2}) \propto \int \mathrm{d} \theta \, \pi(y_{2} \mid \tilde{x}_{2}, \theta) \, \pi(\theta \mid \tilde{y}_{1}, \tilde{x}_{1}, \tilde{x}_{2}). \]
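The following grid sketch demonstrates this pooling for a hypothetical one-parameter model, with normal covariate and variate models chosen purely for illustration: the complete observation \((\tilde{y}_{1}, \tilde{x}_{1})\) sharpens inferences for \(\theta\), which then shift predictions for the missing variate \(y_{2}\):

```python
import numpy as np

def normal_pdf(z, loc, scale):
    return np.exp(-0.5 * ((z - loc) / scale) ** 2) / (scale * np.sqrt(2 * np.pi))

# Hypothetical homogeneous model, chosen purely for illustration:
# theta ~ normal(0, 1), x_k | theta ~ normal(theta, 1),
# y_k | x_k, theta ~ normal(theta * x_k, 1).
theta = np.linspace(-6, 6, 1201)
dtheta = theta[1] - theta[0]
y1, x1, x2 = 2.0, 1.0, 1.5  # complete observation (y1, x1), incomplete x2

# pi(theta | y1, x1, x2): both covariates and the observed variate contribute.
post = (normal_pdf(y1, theta * x1, 1.0) * normal_pdf(x1, theta, 1.0)
        * normal_pdf(x2, theta, 1.0) * normal_pdf(theta, 0.0, 1.0))
post /= post.sum() * dtheta

# pi(y2 | y1, x1, x2) = int dtheta pi(y2 | x2, theta) pi(theta | y1, x1, x2).
y2 = np.linspace(-12, 12, 2401)
dy2 = y2[1] - y2[0]
pred = (normal_pdf(y2[:, None], theta[None, :] * x2, 1.0)
        * post[None, :]).sum(axis=1) * dtheta

# Covariate-only posterior, ignoring the complete observation's variate.
post_x = (normal_pdf(x1, theta, 1.0) * normal_pdf(x2, theta, 1.0)
          * normal_pdf(theta, 0.0, 1.0))
post_x /= post_x.sum() * dtheta
pred_x = (normal_pdf(y2[:, None], theta[None, :] * x2, 1.0)
          * post_x[None, :]).sum(axis=1) * dtheta

mean_pred = (y2 * pred).sum() * dy2      # pulled upward by the observed y1
mean_pred_x = (y2 * pred_x).sum() * dy2
```

In this toy setup the relatively large observed variate \(\tilde{y}_{1}\) pulls the posterior for \(\theta\) upward, and the prediction for \(y_{2}\) follows; inferences transfer from one observation to the other precisely because \(\theta\) is shared.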

Conversely when the model configuration varies across observations the full Bayesian model is given by \[ \pi(y_{1}, x_{1}, \ldots, y_{K}, x_{K}, \theta_{1}, \ldots, \theta_{K}) = \prod_{k = 1}^{K} \pi(y_{k} \mid x_{k}, \theta_{k}) \, \pi(x_{k} \mid \theta_{k}) \, \pi(\theta_{k}). \]




If these individual observations also consist of multiple components that behave homogeneously then this model could be expanded to \[ \pi(y_{1}, x_{1}, \ldots, y_{N}, x_{N}, \theta_{1}, \ldots, \theta_{K}) = \prod_{n = 1}^{N} \pi(y_{n} \mid x_{n}, \theta_{k(n)}) \, \pi(x_{n} \mid \theta_{k(n)}) \prod_{k = 1}^{K} \pi(\theta_{k}), \] where \(k(n)\) identifies to which observation, and hence model configuration, each component belongs.




Again for notational simplicity we will consider only \(y_{k}\) and \(x_{k}\) here, and not any substructure.

For \(K = 2\) observations the full Bayesian model for the multiple observations reduces to \[ \pi(y_{1}, x_{1}, y_{2}, x_{2}, \theta_{1}, \theta_{2}) = \pi(y_{1} \mid x_{1}, \theta_{1}) \, \pi(x_{1} \mid \theta_{1}) \, \pi(\theta_{1}) \, \pi(y_{2} \mid x_{2}, \theta_{2}) \, \pi(x_{2} \mid \theta_{2}) \, \pi(\theta_{2}). \]

If the first observation is complete but the second is incomplete as before we can construct a predictive distribution for the missing variate as \[ \begin{align*} \pi(y_{2} \mid \tilde{y}_{1}, \tilde{x}_{1}, \tilde{x}_{2}) &\propto \pi(\tilde{y}_{1}, \tilde{x}_{1}, y_{2}, \tilde{x}_{2}) \\ &\propto \int \mathrm{d} \theta_{1} \, \mathrm{d} \theta_{2} \, \pi(\tilde{y}_{1}, \tilde{x}_{1}, y_{2}, \tilde{x}_{2}, \theta_{1}, \theta_{2}) \\ &\propto \int \mathrm{d} \theta_{1} \, \mathrm{d} \theta_{2} \, \pi(\tilde{y}_{1} \mid \tilde{x}_{1}, \theta_{1}) \, \pi(\tilde{x}_{1} \mid \theta_{1}) \, \pi(\theta_{1}) \, \pi(y_{2} \mid \tilde{x}_{2}, \theta_{2}) \, \pi(\tilde{x}_{2} \mid \theta_{2}) \, \pi(\theta_{2}) \\ &\propto \int \mathrm{d} \theta_{1} \pi(\tilde{y}_{1} \mid \tilde{x}_{1}, \theta_{1}) \, \pi(\tilde{x}_{1} \mid \theta_{1}) \, \pi(\theta_{1}) \int \mathrm{d} \theta_{2} \pi(y_{2} \mid \tilde{x}_{2}, \theta_{2}) \, \pi(\tilde{x}_{2} \mid \theta_{2}) \, \pi(\theta_{2}). \end{align*} \]

Because the model configurations are completely heterogeneous inferences for the two observations completely decouple. Indeed because the first integral is a constant the predictive distribution reduces to \[ \pi(y_{2} \mid \tilde{y}_{1}, \tilde{x}_{1}, \tilde{x}_{2}) \propto \int \mathrm{d} \theta_{2} \pi(y_{2} \mid \tilde{x}_{2}, \theta_{2}) \, \pi(\tilde{x}_{2} \mid \theta_{2}) \, \pi(\theta_{2}), \] which is the exact same predictive distribution we derived when we had just one incomplete observation, \(\tilde{x}_{2}\). We need some common structure in order to transfer inferences from one context to another.

2.2.2 Component Model Configurations

In more realistic settings the model configurations aren't monolithically homogeneous or heterogeneous across observations. Instead some components might vary while others remain fixed, allowing at least some information to be pooled between different observations.

The behaviors quantified by each of the three component parameters \(\theta = (\eta, \gamma, \psi)\) can in general feature some element that is homogeneous and some element that is heterogeneous across observations. For demonstration purposes we will consider here heterogeneous parameters informing the marginal covariate model, \(\gamma_{k}\), homogeneous parameters informing the conditional variate model, \(\eta\), heterogeneous parameters informing the conditional variate model, \(\zeta_{k}\), and finally heterogeneous confounding parameters informing both models, \(\psi_{k}\). I will refer to the \(\zeta_{k}\) as treatment parameters as they often model active interventions designed to influence the covariation between the variates and covariates.

Under these assumptions the full Bayesian model becomes \[ \begin{align*} &\pi(y_{1}, x_{1}, \ldots, y_{K}, x_{K}, \eta, \zeta_{1}, \ldots, \zeta_{K}, \gamma_{1}, \ldots, \gamma_{K}, \psi_{1}, \ldots, \psi_{K}) \\ &\quad = \left[ \prod_{k = 1}^{K} \pi(y_{k} \mid x_{k}, \eta, \zeta_{k}, \psi_{k}) \, \pi(x_{k} \mid \gamma_{k}, \psi_{k}) \, \pi(\zeta_{k}) \, \pi(\gamma_{k}) \, \pi(\psi_{k}) \right] \pi(\eta). \end{align*} \]




For reference if each of these observations contains multiple components then the model can also be expanded into \[ \begin{align*} &\pi(y_{1}, x_{1}, \ldots, y_{N}, x_{N}, \eta, \zeta_{1}, \ldots, \zeta_{K}, \gamma_{1}, \ldots, \gamma_{K}, \psi_{1}, \ldots, \psi_{K}) \\ &\quad = \left[ \prod_{n = 1}^{N} \pi(y_{n} \mid x_{n}, \eta, \zeta_{k(n)}, \psi_{k(n)}) \, \pi(x_{n} \mid \gamma_{k(n)}, \psi_{k(n)}) \right] \left[ \prod_{k = 1}^{K} \pi(\zeta_{k}) \, \pi(\gamma_{k}) \, \pi(\psi_{k}) \right] \pi(\eta). \end{align*} \]




For two observations, \((y_{1}, x_{1})\) and \((y_{2}, x_{2})\), the full Bayesian model reduces to \[ \begin{align*} &\pi(y_{1}, x_{1}, y_{2}, x_{2}, \eta, \zeta_{1}, \zeta_{2}, \gamma_{1}, \gamma_{2}, \psi_{1}, \psi_{2}) \\ &\quad\quad\quad\quad = \quad \pi(y_{1} \mid x_{1}, \eta, \zeta_{1}, \psi_{1}) \, \pi(x_{1} \mid \gamma_{1}, \psi_{1}) \, \pi(\zeta_{1}) \, \pi(\gamma_{1}) \, \pi(\psi_{1}) \\ &\quad\quad\quad\quad \quad\;\; \cdot \pi(y_{2} \mid x_{2}, \eta, \zeta_{2}, \psi_{2}) \, \pi(x_{2} \mid \gamma_{2}, \psi_{2}) \, \pi(\zeta_{2}) \, \pi(\gamma_{2}) \, \pi(\psi_{2}) \\ &\quad\quad\quad\quad \quad\;\; \cdot \pi(\eta). \end{align*} \]

If both observations are complete, and we have both \((\tilde{y}_{1}, \tilde{x}_{1})\) and \((\tilde{y}_{2}, \tilde{x}_{2})\), then we can inform all of the parameters beyond the prior model. The joint posterior distribution becomes \[ \begin{align*} \pi(\eta, \zeta_{1}, \zeta_{2}, \gamma_{1}, \gamma_{2}, \psi_{1}, \psi_{2} \mid \tilde{y}_{1}, \tilde{x}_{1}, \tilde{y}_{2}, \tilde{x}_{2}) &\propto \pi(\tilde{y}_{1}, \tilde{x}_{1}, \tilde{y}_{2}, \tilde{x}_{2}, \eta, \zeta_{1}, \zeta_{2}, \gamma_{1}, \gamma_{2}, \psi_{1}, \psi_{2}) \\ &\propto \quad \pi(\tilde{y}_{1} \mid \tilde{x}_{1}, \eta, \zeta_{1}, \psi_{1}) \, \pi(\tilde{x}_{1} \mid \gamma_{1}, \psi_{1}) \, \pi(\zeta_{1}) \, \pi(\gamma_{1}) \, \pi(\psi_{1}) \\ &\quad\;\; \cdot \pi(\tilde{y}_{2} \mid \tilde{x}_{2}, \eta, \zeta_{2}, \psi_{2}) \, \pi(\tilde{x}_{2} \mid \gamma_{2}, \psi_{2}) \, \pi(\zeta_{2}) \, \pi(\gamma_{2}) \, \pi(\psi_{2}) \\ &\quad\;\; \cdot \pi(\eta). \end{align*} \] The marginal covariate model allows \(\tilde{x}_{1}\) to inform \(\psi_{1}\) and \(\gamma_{1}\) and \(\tilde{x}_{2}\) to inform \(\psi_{2}\) and \(\gamma_{2}\). At the same time the conditional variate model allows the covariation between \(\tilde{y}_{1}\) and \(\tilde{x}_{1}\) to inform \(\zeta_{1}\) and \(\psi_{1}\) while the covariation between \(\tilde{y}_{2}\) and \(\tilde{x}_{2}\) informs \(\zeta_{2}\) and \(\psi_{2}\). Both complete observations inform the shared parameter \(\eta\).

When only the first observation is complete, so that we have \((\tilde{y}_{1}, \tilde{x}_{1})\) but only \(\tilde{x}_{2}\), we can construct a predictive distribution for the unobserved variate \(y_{2}\), \[ \begin{align*} \pi(y_{2} \mid \tilde{y}_{1}, \tilde{x}_{1}, \tilde{x}_{2}) &\propto \pi(\tilde{y}_{1}, \tilde{x}_{1}, y_{2}, \tilde{x}_{2}) \\ &\propto \int \mathrm{d} \eta \, \mathrm{d} \zeta_{1} \, \mathrm{d} \zeta_{2} \, \mathrm{d} \gamma_{1} \, \mathrm{d} \gamma_{2} \, \mathrm{d} \psi_{1} \, \mathrm{d} \psi_{2} \, \pi(\tilde{y}_{1}, \tilde{x}_{1}, y_{2}, \tilde{x}_{2}, \eta, \zeta_{1}, \zeta_{2}, \gamma_{1}, \gamma_{2}, \psi_{1}, \psi_{2}) \\ &\propto \int \mathrm{d} \eta \, \mathrm{d} \zeta_{1} \, \mathrm{d} \zeta_{2} \, \mathrm{d} \gamma_{1} \, \mathrm{d} \gamma_{2} \, \mathrm{d} \psi_{1} \, \mathrm{d} \psi_{2} \, \\ &\hspace{10mm} \;\;\, \pi(\tilde{y}_{1} \mid \tilde{x}_{1}, \eta, \zeta_{1}, \psi_{1}) \, \pi(\tilde{x}_{1} \mid \gamma_{1}, \psi_{1}) \, \pi(\zeta_{1}) \, \pi(\gamma_{1}) \, \pi(\psi_{1}) \\ &\hspace{10mm} \cdot \pi(y_{2} \mid \tilde{x}_{2}, \eta, \zeta_{2}, \psi_{2}) \, \pi(\tilde{x}_{2} \mid \gamma_{2}, \psi_{2}) \, \pi(\zeta_{2}) \, \pi(\gamma_{2}) \, \pi(\psi_{2}) \\ &\hspace{10mm} \cdot \pi(\eta) \\ &\propto \int \mathrm{d} \psi_{1} \, \mathrm{d} \psi_{2} \, \pi(\psi_{1}) \, \pi(\psi_{2}) \\ &\hspace{10mm} \int \mathrm{d} \eta \, \pi(\eta) \\ &\hspace{20mm} \int \mathrm{d} \zeta_{2} \, \pi(y_{2} \mid \tilde{x}_{2}, \eta, \zeta_{2}, \psi_{2}) \, \pi(\zeta_{2}) \\ &\hspace{20mm} \int \mathrm{d} \zeta_{1} \, \pi(\tilde{y}_{1} \mid \tilde{x}_{1}, \eta, \zeta_{1}, \psi_{1}) \, \pi(\zeta_{1}) \\ &\hspace{10mm} \int \mathrm{d} \gamma_{1} \, \mathrm{d} \gamma_{2} \, \pi(\tilde{x}_{1} \mid \gamma_{1}, \psi_{1}) \, \pi(\tilde{x}_{2} \mid \gamma_{2}, \psi_{2}) \, \pi(\gamma_{1}) \, \pi(\gamma_{2}). \end{align*} \]

Rewriting the last term as \[ \int \mathrm{d} \gamma_{1} \, \mathrm{d} \gamma_{2} \, \pi(\tilde{x}_{1} \mid \gamma_{1}, \psi_{1}) \, \pi(\tilde{x}_{2} \mid \gamma_{2}, \psi_{2}) \, \pi(\gamma_{1}) \, \pi(\gamma_{2}) \, \pi(\psi_{1}) \, \pi(\psi_{2}) \propto \pi(\psi_{1}, \psi_{2} \mid \tilde{x}_{1}, \tilde{x}_{2}) \] clarifies that it quantifies how the observed covariates \(\tilde{x}_{1}\) and \(\tilde{x}_{2}\) inform the heterogeneous confounding parameters.

Similarly we can rewrite the second term as \[ \int \mathrm{d} \zeta_{1} \, \pi(\tilde{y}_{1} \mid \tilde{x}_{1}, \eta, \zeta_{1}, \psi_{1}) \, \pi(\zeta_{1}) \, \pi(\eta) \propto \pi(\eta \mid \tilde{y}_{1}, \tilde{x}_{1}, \psi_{1}). \] In words this term quantifies how the complete observation \((\tilde{x}_{1}, \tilde{y}_{1})\) informs the homogeneous parameter \(\eta\) given a fixed value of the confounding variable \(\psi_{1}\).

Finally the first term can be manipulated into \[ \int \mathrm{d} \zeta_{2} \, \pi(y_{2} \mid \tilde{x}_{2}, \eta, \zeta_{2}, \psi_{2}) \, \pi(\zeta_{2}) \propto \pi(y_{2} \mid \tilde{x}_{2}, \eta, \psi_{2}). \] Without observing \(y_{2}\) only the prior model \(\pi(\zeta_{2})\) informs the treatment parameter configuration for this incomplete observation.

Altogether we have \[ \begin{align*} \pi(y_{2} \mid \tilde{y}_{1}, \tilde{x}_{1}, \tilde{x}_{2}) &\propto \int \mathrm{d} \psi_{1} \, \mathrm{d} \psi_{2} \, \mathrm{d} \eta \, \pi(y_{2} \mid \tilde{x}_{2}, \eta, \psi_{2}) \, \pi(\eta \mid \tilde{y}_{1}, \tilde{x}_{1}, \psi_{1}) \, \pi(\psi_{1}, \psi_{2} \mid \tilde{x}_{1}, \tilde{x}_{2}). \end{align*} \] Through the marginal covariate model \(\tilde{x}_{1}\) and \(\tilde{x}_{2}\) inform \(\gamma_{1}\), \(\gamma_{2}\), and the confounding parameters \(\psi_{1}\) and \(\psi_{2}\). The conditional variate model allows the complete observation \((\tilde{y}_{1}, \tilde{x}_{1})\) to inform \(\eta\), the treatment parameter \(\zeta_{1}\), and the confounding parameter \(\psi_{1}\). Given all of this information, as well as the prior model for the treatment parameter \(\zeta_{2}\), the conditional variate model also informs the predictions for the unobserved variate \(y_{2}\) given \(\tilde{x}_{2}\).

In order to derive consistent inferences and predictions in this case we need to build both a marginal covariate model and a conditional variate model, and jointly infer their configurations across observations.

2.2.3 No Treatment, No Confounders

The structure of the predictive distribution simplifies dramatically when there are no confounding or treatment parameters. In this case the marginal covariate and conditional variate models once again decouple in both inferences and predictions.

Without any confounding or treatment parameters the full Bayesian model reduces to \[ \begin{align*} &\pi(y_{1}, x_{1}, \ldots, y_{K}, x_{K}, \eta, \gamma_{1}, \ldots, \gamma_{K}) \\ &\quad = \left[ \prod_{k = 1}^{K} \pi(y_{k} \mid x_{k}, \eta) \, \pi(x_{k} \mid \gamma_{k}) \, \pi(\gamma_{k}) \right] \pi(\eta). \end{align*} \]




For two observations \((y_{1}, x_{1})\) and \((y_{2}, x_{2})\) this becomes \[ \pi(y_{1}, x_{1}, y_{2}, x_{2}, \eta, \gamma_{1}, \gamma_{2}) = \pi(y_{1} \mid x_{1}, \eta) \, \pi(x_{1} \mid \gamma_{1}) \, \pi(\gamma_{1}) \pi(y_{2} \mid x_{2}, \eta) \, \pi(x_{2} \mid \gamma_{2}) \, \pi(\gamma_{2}) \, \pi(\eta). \]

If both observations are complete then the posterior distribution becomes \[ \begin{align*} \pi(\eta, \gamma_{1}, \gamma_{2} \mid \tilde{y}_{1}, \tilde{x}_{1}, \tilde{y}_{2}, \tilde{x}_{2}) &\propto \pi(\tilde{y}_{1}, \tilde{x}_{1}, \tilde{y}_{2}, \tilde{x}_{2}, \eta, \gamma_{1}, \gamma_{2}) \\ &\propto \pi(\tilde{y}_{1} \mid \tilde{x}_{1}, \eta) \, \pi(\tilde{x}_{1} \mid \gamma_{1}) \, \pi(\gamma_{1}) \, \pi(\tilde{y}_{2} \mid \tilde{x}_{2}, \eta) \, \pi(\tilde{x}_{2} \mid \gamma_{2}) \, \pi(\gamma_{2}) \, \pi(\eta) \\ &\propto \big[ \pi(\tilde{y}_{1} \mid \tilde{x}_{1}, \eta) \, \pi(\tilde{y}_{2} \mid \tilde{x}_{2}, \eta) \, \pi(\eta) \big] \big[ \pi(\tilde{x}_{1} \mid \gamma_{1}) \, \pi(\gamma_{1}) \, \pi(\tilde{x}_{2} \mid \gamma_{2}) \, \pi(\gamma_{2}) \big] \\ &\propto \big[ \pi(\eta \mid \tilde{y}_{1}, \tilde{x}_{1}, \tilde{y}_{2}, \tilde{x}_{2}) \big] \big[ \pi(\gamma_{1} \mid \tilde{x}_{1}) \, \pi(\gamma_{2} \mid \tilde{x}_{2}) \big]. \end{align*} \] The observed covariation between the variate-covariate pairs informs \(\eta\) through the conditional variate model independently of the marginal covariate model or its configurations. When we're interested in inferring only the statistical relationship between the variates and the covariates we don't need to consider the marginal covariate model at all!
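This factorization can be checked numerically with a small, self-contained sketch; the normal densities, observed values, and parameter settings below are all illustrative assumptions, not anything prescribed above. Because the unnormalized log posterior is a sum of a term in \(\eta\) and terms in each \(\gamma_{k}\), swapping arguments across two evaluations leaves the total unchanged.

```python
import math

def norm_lpdf(y, mu, sigma):
    # log density of normal(y | mu, sigma)
    return -0.5 * math.log(2.0 * math.pi * sigma**2) - 0.5 * ((y - mu) / sigma) ** 2

# Two hypothetical complete observations (y_k, x_k); all scales fixed to 1.
obs = [(1.2, 0.8), (-0.5, -0.3)]

def log_post(eta, gamma1, gamma2):
    # Unnormalized log posterior with no confounders or treatments:
    # conditional variate model  normal(y | eta * x, 1),
    # marginal covariate model   normal(x | gamma_k, 1),
    # independent unit normal priors on eta and each gamma_k.
    lp = norm_lpdf(eta, 0.0, 1.0) + norm_lpdf(gamma1, 0.0, 1.0) + norm_lpdf(gamma2, 0.0, 1.0)
    for (y, x), g in zip(obs, (gamma1, gamma2)):
        lp += norm_lpdf(y, eta * x, 1.0) + norm_lpdf(x, g, 1.0)
    return lp

# Additive separability test: if log_post(eta, g1, g2) = f(eta) + h1(g1) + h2(g2)
# then interchanging arguments across two evaluations leaves the sum unchanged.
lhs = log_post(0.3, 0.1, -0.2) + log_post(-0.7, 0.9, 0.4)
rhs = log_post(0.3, 0.9, 0.4) + log_post(-0.7, 0.1, -0.2)
print(abs(lhs - rhs) < 1e-12)  # True: the eta factor decouples from the gammas
```

The same interchange test fails as soon as a confounder couples the two model components, which is one quick way to probe a candidate model for this decoupling.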

If only the first observation is complete then we can construct a predictive distribution for the unobserved variate \(y_{2}\), \[ \begin{align*} \pi(y_{2} \mid \tilde{y}_{1}, \tilde{x}_{1}, \tilde{x}_{2}) &\propto \pi(\tilde{y}_{1}, \tilde{x}_{1}, y_{2}, \tilde{x}_{2}) \\ &\propto \int \mathrm{d} \eta \, \mathrm{d} \gamma_{1} \, \mathrm{d} \gamma_{2} \, \pi(\tilde{y}_{1}, \tilde{x}_{1}, y_{2}, \tilde{x}_{2}, \eta, \gamma_{1}, \gamma_{2}) \\ &\propto \int \mathrm{d} \eta \, \mathrm{d} \gamma_{1} \, \mathrm{d} \gamma_{2} \, \pi(\tilde{y}_{1} \mid \tilde{x}_{1}, \eta) \, \pi(\tilde{x}_{1} \mid \gamma_{1}) \, \pi(\gamma_{1}) \, \pi(y_{2} \mid \tilde{x}_{2}, \eta) \, \pi(\tilde{x}_{2} \mid \gamma_{2}) \, \pi(\gamma_{2}) \, \pi(\eta) \\ &\propto \;\; \int \mathrm{d} \eta \, \pi(y_{2} \mid \tilde{x}_{2}, \eta) \, \pi(\tilde{y}_{1} \mid \tilde{x}_{1}, \eta) \, \pi(\eta) \\ &\quad \cdot \int \mathrm{d} \gamma_{1} \, \pi(\tilde{x}_{1} \mid \gamma_{1}) \, \pi(\gamma_{1}) \int \mathrm{d} \gamma_{2} \, \pi(\tilde{x}_{2} \mid \gamma_{2}) \, \pi(\gamma_{2}) \\ &\propto \;\; \int \mathrm{d} \eta \, \pi(y_{2} \mid \tilde{x}_{2}, \eta) \, \pi(\eta \mid \tilde{y}_{1}, \tilde{x}_{1}) \\ &\quad \cdot \int \mathrm{d} \gamma_{1} \, \pi(\gamma_{1} \mid \tilde{x}_{1}) \int \mathrm{d} \gamma_{2} \, \pi(\gamma_{2} \mid \tilde{x}_{2}) \\ &\propto \;\; \int \mathrm{d} \eta \, \pi(y_{2} \mid \tilde{x}_{2}, \eta) \, \pi(\eta \mid \tilde{y}_{1}, \tilde{x}_{1}). \end{align*} \] Without any confounders the contributions from the marginal covariate model reduce to constants that can be ignored; the marginal covariate model completely decouples from the prediction. Instead the covariation in the complete observation informs the configuration of the conditional variate model which then quantifies the unobserved variate values \(y_{2}\) consistent with the observed covariate \(\tilde{x}_{2}\).
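This final integral can be made concrete with a conjugate sketch: take a hypothetical conditional variate model \(y \sim \text{normal}(\eta \, x, \sigma)\) with known \(\sigma\) and a normal prior on \(\eta\), so that both \(\pi(\eta \mid \tilde{y}_{1}, \tilde{x}_{1})\) and the predictive distribution are available in closed form. All numerical settings here are invented for illustration.

```python
import math

# Conjugate sketch of pi(y2 | y1, x1, x2) with no confounders:
# conditional variate model  y ~ normal(eta * x, sigma), sigma known,
# prior                      eta ~ normal(0, tau).
sigma, tau = 1.0, 1.0
y1, x1 = 2.0, 1.0   # the single complete observation
x2 = 3.0            # covariate of the incomplete observation

# Posterior pi(eta | y1, x1) is normal(m, sqrt(v)):
v = 1.0 / (1.0 / tau**2 + x1**2 / sigma**2)
m = v * (x1 * y1 / sigma**2)

# Predictive pi(y2 | y1, x1, x2)
#   = int d'eta normal(y2 | eta * x2, sigma) normal(eta | m, sqrt(v))
# is again normal, with x2 propagating the residual eta uncertainty:
pred_mean = x2 * m
pred_sd = math.sqrt(sigma**2 + x2**2 * v)

print(pred_mean, pred_sd)  # 3.0 2.3452...
```

Note that no marginal covariate model appears anywhere in the computation, exactly as the derivation above promises.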

So long as the configuration of the conditional variate model is homogeneous we can transfer inferences across observations, using the covariation in complete observations to inform predictions of missing variates in incomplete observations. Moreover this transfer is viable no matter the behavior of the covariates in either context; we can not only disregard whether the configuration of the marginal covariate model is homogenous or heterogenous across observations but also avoid building that model entirely!

2.2.4 Treatment, No Confounders

A slightly less simple case occurs when there are no confounding parameters but we do have heterogenous treatment parameters. Here the full Bayesian model reduces to \[ \begin{align*} &\pi(y_{1}, x_{1}, \ldots, y_{K}, x_{K}, \eta, \zeta_{1}, \ldots, \zeta_{K}, \gamma_{1}, \ldots, \gamma_{K}) \\ &\quad = \left[ \prod_{k = 1}^{K} \pi(y_{k} \mid x_{k}, \eta, \zeta_{k}) \, \pi(x_{k} \mid \gamma_{k}) \, \pi(\zeta_{k}) \, \pi(\gamma_{k}) \right] \pi(\eta). \end{align*} \]




For two observations \((y_{1}, x_{1})\) and \((y_{2}, x_{2})\) this becomes \[ \begin{align*} &\pi(y_{1}, x_{1}, y_{2}, x_{2}, \eta, \zeta_{1}, \zeta_{2}, \gamma_{1}, \gamma_{2}) \\ &\quad\quad\quad\quad = \quad \pi(y_{1} \mid x_{1}, \eta, \zeta_{1}) \, \pi(x_{1} \mid \gamma_{1}) \, \pi(\zeta_{1}) \, \pi(\gamma_{1}) \\ &\quad\quad\quad\quad \quad\;\; \cdot \pi(y_{2} \mid x_{2}, \eta, \zeta_{2}) \, \pi(x_{2} \mid \gamma_{2}) \, \pi(\zeta_{2}) \, \pi(\gamma_{2}) \\ &\quad\quad\quad\quad \quad\;\; \cdot \pi(\eta). \end{align*} \]

A single complete observation \((\tilde{y}_{k}, \tilde{x}_{k})\) informs the corresponding treatment parameter \(\zeta_{k}\). Two complete observations \((\tilde{y}_{1}, \tilde{x}_{1})\) and \((\tilde{y}_{2}, \tilde{x}_{2})\) allow us to infer both \(\zeta_{1}\) and \(\zeta_{2}\) and compare them, \[ \begin{align*} \pi(\eta, \zeta_{1}, \zeta_{2}, \gamma_{1}, \gamma_{2} \mid \tilde{y}_{1}, \tilde{x}_{1}, \tilde{y}_{2}, \tilde{x}_{2}) &\propto \pi(\tilde{y}_{1}, \tilde{x}_{1}, \tilde{y}_{2}, \tilde{x}_{2}, \eta, \zeta_{1}, \zeta_{2}, \gamma_{1}, \gamma_{2}) \\ &\propto \quad \pi(\tilde{y}_{1} \mid \tilde{x}_{1}, \eta, \zeta_{1}) \, \pi(\tilde{x}_{1} \mid \gamma_{1}) \, \pi(\zeta_{1}) \, \pi(\gamma_{1}) \\ &\quad\;\; \cdot \pi(\tilde{y}_{2} \mid \tilde{x}_{2}, \eta, \zeta_{2}) \, \pi(\tilde{x}_{2} \mid \gamma_{2}) \, \pi(\zeta_{2}) \, \pi(\gamma_{2}) \\ &\quad\;\; \cdot \pi(\eta) \\ &\propto \quad \pi(\tilde{y}_{1} \mid \tilde{x}_{1}, \eta, \zeta_{1}) \, \pi(\zeta_{1}) \\ &\quad\;\; \cdot \pi(\tilde{y}_{2} \mid \tilde{x}_{2}, \eta, \zeta_{2}) \, \pi(\zeta_{2}) \\ &\quad\;\; \cdot \pi(\eta) \, \pi(\tilde{x}_{1} \mid \gamma_{1}) \, \pi(\gamma_{1}) \pi(\tilde{x}_{2} \mid \gamma_{2}) \, \pi(\gamma_{2}) \\ &\propto \quad \pi(\zeta_{1} \mid \tilde{y}_{1}, \tilde{x}_{1}, \eta) \\ &\quad\;\; \cdot \pi(\zeta_{2} \mid \tilde{y}_{2}, \tilde{x}_{2}, \eta) \\ &\quad\;\; \cdot \pi(\eta) \, \pi(\gamma_{1} \mid \tilde{x}_{1}) \, \pi(\gamma_{2} \mid \tilde{x}_{2}) \\ &\propto \quad \pi(\zeta_{1} \mid \tilde{y}_{1}, \tilde{x}_{1}, \eta) \\ &\quad\;\; \cdot \pi(\zeta_{2} \mid \tilde{y}_{2}, \tilde{x}_{2}, \eta) \\ &\quad\;\; \cdot \pi(\eta). \end{align*} \] Once again without any confounding parameters inferences of the marginal covariate model configurations and the conditional variate model configurations completely decouple from each other. 
Using only the conditional variate model we can construct a marginal posterior distribution for \(\zeta_{1}\) and \(\zeta_{2}\) that allows us to directly compare the two treatment behaviors.

Because the treatment parameters vary from observation to observation we need multiple complete observations to infer the heterogenous configurations of the conditional variate model beyond the given prior model. Prediction of missing variates corresponding to an incomplete covariate observation is less useful here unless the treatment parameters are somehow coupled in the model, for example probabilistically with a non-independent prior model \(\pi(\zeta_{1}, \zeta_{2})\) or deterministically with an assumed functional relationship \(\zeta_{1} = f(\zeta_{2})\).
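The probabilistic coupling mentioned above can be sketched with a hypothetical bivariate normal prior \(\pi(\zeta_{1}, \zeta_{2})\) with unit scales and correlation \(\rho\): conditioning on an inferred \(\zeta_{1}\) shifts and narrows the prior for \(\zeta_{2}\), allowing some transfer to the incomplete observation. The numbers are purely illustrative.

```python
import math

def zeta2_given_zeta1(z1, rho):
    # Conditional distribution zeta2 | zeta1 = z1 under a bivariate normal
    # prior with unit scales and correlation rho:
    #   normal(rho * z1, sqrt(1 - rho^2)).
    return rho * z1, math.sqrt(1.0 - rho**2)

# With strong coupling an inferred zeta1 shifts and narrows the zeta2 prior...
m, s = zeta2_given_zeta1(1.5, 0.9)
print(m, round(s, 4))   # 1.35 0.4359

# ...while with independent treatment parameters nothing transfers at all.
m0, s0 = zeta2_given_zeta1(1.5, 0.0)
print(m0, s0)           # 0.0 1.0
```

The second call recovers the marginal prior, matching the observation that without coupling an incomplete observation learns nothing about its treatment parameter beyond the prior model.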

2.2.5 No Treatment, Confounders

Finally let's consider the circumstance where we don't have any heterogenous treatment parameters but we do have heterogenous confounding parameters. In this case the full Bayesian model is given by \[ \begin{align*} &\pi(y_{1}, x_{1}, \ldots, y_{K}, x_{K}, \eta, \gamma_{1}, \ldots, \gamma_{K}, \psi_{1}, \ldots, \psi_{K}) \\ &\quad = \left[ \prod_{k = 1}^{K} \pi(y_{k} \mid x_{k}, \eta, \psi_{k}) \, \pi(x_{k} \mid \gamma_{k}, \psi_{k}) \, \pi(\gamma_{k}) \, \pi(\psi_{k}) \right] \pi(\eta). \end{align*} \]




For two observations \((y_{1}, x_{1})\) and \((y_{2}, x_{2})\) this reduces to \[ \begin{align*} &\pi(y_{1}, x_{1}, y_{2}, x_{2}, \eta, \gamma_{1}, \gamma_{2}, \psi_{1}, \psi_{2}) \\ &\quad\quad\quad\quad = \quad \pi(y_{1} \mid x_{1}, \eta, \psi_{1}) \, \pi(x_{1} \mid \gamma_{1}, \psi_{1}) \, \pi(\gamma_{1}) \, \pi(\psi_{1}) \\ &\quad\quad\quad\quad \quad\;\; \cdot \pi(y_{2} \mid x_{2}, \eta, \psi_{2}) \, \pi(x_{2} \mid \gamma_{2}, \psi_{2}) \, \pi(\gamma_{2}) \, \pi(\psi_{2}) \\ &\quad\quad\quad\quad \quad\;\; \cdot \pi(\eta). \end{align*} \]

When both observations are complete the posterior distribution becomes \[ \begin{align*} \pi(\eta, \gamma_{1}, \gamma_{2}, \psi_{1}, \psi_{2} \mid \tilde{y}_{1}, \tilde{x}_{1}, \tilde{y}_{2}, \tilde{x}_{2}) &\propto \pi(\tilde{y}_{1}, \tilde{x}_{1}, \tilde{y}_{2}, \tilde{x}_{2}, \eta, \gamma_{1}, \gamma_{2}, \psi_{1}, \psi_{2}) \\ &\propto \quad \pi(\tilde{y}_{1} \mid \tilde{x}_{1}, \eta, \psi_{1}) \, \pi(\tilde{x}_{1} \mid \gamma_{1}, \psi_{1}) \, \pi(\gamma_{1}) \, \pi(\psi_{1}) \\ &\quad\;\; \cdot \pi(\tilde{y}_{2} \mid \tilde{x}_{2}, \eta, \psi_{2}) \, \pi(\tilde{x}_{2} \mid \gamma_{2}, \psi_{2}) \, \pi(\gamma_{2}) \, \pi(\psi_{2}) \\ &\quad\;\; \cdot \pi(\eta). \end{align*} \] The confounding parameters \(\psi_{1}\) and \(\psi_{2}\) couple the conditional variate model and the marginal covariate models together; changing \(\psi_{k}\) changes not only the marginal covariate behavior but also the covariation between the variate and the covariate. We need both model components to learn the heterogenous configurations that give rise to both observations in order to construct consistent inferences for any of the parameters.

When only the first observation is complete the predictive distribution for the missing second variate becomes \[ \begin{align*} \pi(y_{2} \mid \tilde{y}_{1}, \tilde{x}_{1}, \tilde{x}_{2}) &\propto \pi(\tilde{y}_{1}, \tilde{x}_{1}, y_{2}, \tilde{x}_{2}) \\ &\propto \int \mathrm{d} \eta \, \mathrm{d} \gamma_{1} \, \mathrm{d} \gamma_{2} \, \mathrm{d} \psi_{1} \, \mathrm{d} \psi_{2} \, \pi(\tilde{y}_{1}, \tilde{x}_{1}, y_{2}, \tilde{x}_{2}, \eta, \gamma_{1}, \gamma_{2}, \psi_{1}, \psi_{2}) \\ &\propto \int \mathrm{d} \eta \, \mathrm{d} \gamma_{1} \, \mathrm{d} \gamma_{2} \, \mathrm{d} \psi_{1} \, \mathrm{d} \psi_{2} \, \pi(\eta) \\ &\hspace{10mm} \;\;\, \pi(\tilde{y}_{1} \mid \tilde{x}_{1}, \eta, \psi_{1}) \, \pi(\tilde{x}_{1} \mid \gamma_{1}, \psi_{1}) \, \pi(\gamma_{1}) \, \pi(\psi_{1}) \\ &\hspace{10mm} \cdot \pi(y_{2} \mid \tilde{x}_{2}, \eta, \psi_{2}) \, \pi(\tilde{x}_{2} \mid \gamma_{2}, \psi_{2}) \, \pi(\gamma_{2}) \, \pi(\psi_{2}) \\ &\propto \int \mathrm{d} \psi_{1} \, \mathrm{d} \psi_{2} \, \pi(\psi_{1}) \, \pi(\psi_{2}) \\ &\hspace{10mm} \int \mathrm{d} \eta \, \pi(y_{2} \mid \tilde{x}_{2}, \eta, \psi_{2}) \, \pi(\tilde{y}_{1} \mid \tilde{x}_{1}, \eta, \psi_{1}) \, \pi(\eta) \\ &\hspace{10mm} \int \mathrm{d} \gamma_{1} \, \pi(\tilde{x}_{1} \mid \gamma_{1}, \psi_{1}) \, \pi(\gamma_{1}) \int \mathrm{d} \gamma_{2} \, \pi(\tilde{x}_{2} \mid \gamma_{2}, \psi_{2}) \, \pi(\gamma_{2}). \end{align*} \] The confounding parameters obstruct a clean sequence of inferences that informs the final prediction. Both the conditional variate model and marginal covariate model are needed to consistently separate out inferences of \(\eta\) and \(\psi_{1}\) from the complete observation. The marginal covariate model allows the incomplete observation \(\tilde{x}_{2}\) to inform \(\psi_{2}\) which, along with the previous inferences of \(\eta\), allows the conditional variate model to inform predictions for the unobserved \(y_{2}\).

Naively assuming the simplifications that arise when there are no confounding or treatment parameters, in particular ignoring the contributions from the marginal covariate model and presuming a homogenous configuration of the conditional variate model, introduces two immediate problems. Firstly without the marginal covariate model the conditional variate model is left to inform both \(\eta\) and \(\psi_{1}\) alone. This results in artificially increased uncertainties that not only limit inferential precision but also can obstruct accurate posterior computation. Secondly if we apply the configuration of the conditional variate model from the first observation to predictions of the missing \(y_{2}\) in the second observation then subsequent predictions will suffer from strong errors unless \(\psi_{2}\) happens to be very close to \(\psi_{1}\).
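The second problem can be demonstrated with a small simulation under an assumed data generating process where a heterogenous confounder \(\psi_{k}\) additively shifts both the covariates and the variates; everything here, including the additive form of the confounding and the numeric settings, is a hypothetical choice for illustration.

```python
import random
import statistics

# Hypothetical truth: y_k = eta * x_k + psi_k + noise, with the confounder
# psi_k also shifting the covariate distribution, x_k ~ normal(psi_k, 1).
rng = random.Random(8675309)
eta, psi1, psi2 = 1.0, 0.0, 5.0

errors = []
for _ in range(1000):
    x2 = rng.gauss(psi2, 1.0)
    y2 = eta * x2 + psi2 + rng.gauss(0.0, 0.2)  # actual second observation
    y2_naive = eta * x2 + psi1                  # prediction recycling psi1
    errors.append(y2 - y2_naive)

# The naive predictions are biased by roughly psi2 - psi1 = 5.
print(round(statistics.mean(errors), 1))
```

Even with the homogenous parameter \(\eta\) known exactly, recycling \(\psi_{1}\) for the second observation produces a systematic error of order \(\psi_{2} - \psi_{1}\).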

We cannot take any modeling shortcuts when confounding parameters couple the marginal covariate and conditional variate models!

3 Building Conditional Variate Models

Ideally we would engineer a conditional variate model from the appropriate decomposition of a meaningful full Bayesian model, \[ \pi(y, x, \theta) = \pi(y \mid x, \theta) \, \pi(x \mid \theta) \, \pi(\theta). \] For example we might construct \(\pi(y, x, \theta)\) from narratively generative considerations before decomposing it based on which components of the observational space are prone to missingness.

Sometimes, however, it is more convenient to construct a conditional variate model directly, without having first constructed a full Bayesian model and possibly without constructing an accompanying marginal covariate model at all. In this section we discuss the circumstances in which a conditional variate model is sufficient before I introduce some heuristics that can facilitate this direct construction. We will then discuss some of the practical consequences of using these heuristic models in practice.

3.1 Confound'it

In general we need to construct both a conditional variate model and a marginal covariate model in order to derive well-behaved inferences from complete variate-covariate observations. Our investigation in Section 2.1.3 and Section 2.2.3, however, showed that without confounding behavior inferences for these two model components completely decouple. We can derive well-behaved inferences for the conditional variate model configurations, and apply them to predictions of missing variates, without any reference to the marginal covariate model.

In this case we can avoid the burden of explicitly constructing a marginal covariate model entirely, and inferences of the remaining conditional variate model configurations can be translated to any full Bayesian model that decomposes into the same conditional variate model but a different, unconfounded marginal covariate model. Using only the conditional variate model we can pool complete observations with different covariate distributions to inform self-consistent inferences, and then apply those inferences to predictions of missing variates accompanying covariates following any other distribution.

To understand when we can take advantage of these massive simplifications we need to thoroughly understand the realistic potential of confounding. In this section we dive deeper into confounding behavior from a modeling perspective and the various circumstances in which we can have some confidence in its negligibility.

3.1.1 Coupling

By definition the presence of confounding behavior implies that the covariation between the variates and the covariates, as captured by the conditional variate model, is coupled to the distribution of the covariates, as captured by the marginal covariate model. In other words any change in the covariation due to changes in confounding parameters immediately implies a change in the marginal covariate behavior, and vice versa. Equivalently the lack of confounding behavior is distinguished by independence between the conditional variate distribution and marginal covariate distribution.

Consider for example a disease that may or may not be correlated with exposure to a potentially dangerous substance and a diagnostic test of that disease in an individual. If the exposure received by each individual in a population has been well-surveyed but tests are expensive and comparatively rare then we might consider exposure as a covariate and the binary test output as an often missing variate. In this case a key parameter of the corresponding conditional variate model will be how susceptible an individual is to the suspected exposure.

Confounding is introduced when the distribution of observed covariates is also influenced by this susceptibility. For example if complete observations are drawn from individuals who volunteer then those less susceptible, and experiencing fewer or weaker effects of the disease, might be less inclined to volunteer while those more susceptible, and experiencing more or stronger symptoms, might be more inclined to volunteer. In this case the distribution of observed covariates will be coupled to the susceptibility of the observed individuals and hence the covariation between exposure and test results.

What if tests are performed on everyone employed at a particular company? Confounding might arise here if company employees are subjected to more exposure than the general population and more susceptible individuals tend to seek alternative employment options. Similarly testing performed within a particular neighborhood would manifest confounding behavior if exposure varies from neighborhood to neighborhood and individuals are able to move relatively freely between them. More susceptible individuals will tend to move to less exposed neighborhoods, displacing less susceptible individuals to the more exposed neighborhoods and confounding the exposure distribution in each neighborhood with the susceptibility of individuals in those neighborhoods.

As another example consider a more controlled experiment where a source device generates some input, such as colliding beams of particles or a stream of ionized molecules, that is incident on a detector apparatus that converts that input into a recordable output measurement, such as an electrical readout. Here we might model the configuration of the inputs as covariates and the configuration of the final readout as variates.

An absence of confounders implies that the conversion of inputs to outputs in the detector apparatus proceeds independently of the preparation of the inputs, but this can't always be guaranteed. For example consider a smaller laboratory with a limited power source fueling both the source and detector equipment. If the generation of stronger inputs drains too much power then less power will be available to amplify the conversion in the detector apparatus, resulting in weaker and maybe even missing outputs. One of the telltale signs of this kind of confounding is that the final distribution of observed covariates differs from the initial distribution of covariates prepared for the experiment.

Ultimately confounding behavior is not a property of a conditional variate model or a marginal covariate model alone but rather their interaction in the full Bayesian model. Many different data generating processes might result in the same conditional variate model but different marginal covariate model and vice versa. Some of these data generating processes might exhibit variate-covariate decompositions afflicted by confounding behaviors while some might be conveniently confounder-free. If we want to argue for the absence of confounding behavior, and the sufficiency of just a conditional variate model, then we have to, at least conceptually, confront the full context of that entire data generating process.

3.1.2 Designed Experiments

The most reliable strategy for avoiding confounding behavior is a carefully designed experiment that endows the variate-covariate decomposition with a narratively generative interpretation. If the generation of the covariates, as modeled by the marginal covariate model, occurs before, and independently of, the conditional generation of the variates, as modeled by the conditional variate model, then the variate-covariate decomposition \[ \pi(y, x, \eta, \gamma) = \pi(y \mid x, \eta) \, \pi(x \mid \gamma) \, \pi(\eta, \gamma), \] will not only be narratively generative but also free of confounders. In this case the distribution of covariates can be interpreted as the initial configuration of the experiment while the variates become the outcome of measurements of that configured system.

One of the most common experimental designs of this form is a randomization design where subjects that will be observed in a study are selected from a population based on a random, or more likely pseudo-random, sequence. Because these pseudo-random sequences are generated from a predetermined set of rules there is no way for them to be influenced by external factors, let alone any factors that might influence the covariation captured by the conditional variate model.

Note that randomization itself is not critical here; any experiment that is able to configure the distribution of covariates before any measurements are made, or even just without any knowledge of those measurement outcomes, will avoid any confounding behaviors. In the physical sciences, for example, covariate distributions are often configured to be uniform over a range of interesting behaviors, or discretized to a uniform grid within that scope, independently of how the measurements themselves proceed.
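A minimal sketch of such a design, assuming a hypothetical study with twenty subjects and five exposure levels: the covariate values are laid out on a uniform grid and assigned to subjects by a seeded pseudo-random shuffle, all before any variate is measured, so nothing downstream can influence the assignment.

```python
import random
from collections import Counter

# Pre-registered design: covariate values fixed on a uniform grid over [0, 1],
# assigned to subjects by a predetermined pseudo-random shuffle.
n_subjects = 20
grid = [k / 4 for k in range(5)]                           # 0.0, 0.25, ..., 1.0
covariates = [grid[k % len(grid)] for k in range(n_subjects)]

rng = random.Random(2024)   # predetermined seed, fixed before the experiment
rng.shuffle(covariates)     # randomize only the order of assignment

# The designed covariate distribution is exactly uniform over the grid,
# regardless of how the measurements later turn out.
print(sorted(Counter(covariates).values()))  # [4, 4, 4, 4, 4]
```

Because the shuffle only permutes a predetermined list, the marginal covariate distribution is known by construction, which is precisely what licenses ignoring the marginal covariate model.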

The ability for these designs to suppress confounding behavior, however, is limited by how well they can actually be implemented in practice. Imperfect implementations of these designs can introduce subtle confounding behavior that must be modeled in order for meaningful inferences to be drawn.

One of the more common implementation failures occurs when some measurements are censored, and neither the covariates nor the variates are recorded. In this case the observed covariate distribution will differ from the designed covariate distribution. If the censoring process is correlated with the covariation between the inputs and outputs then the marginal covariate model for the observed covariate distribution will be confounded with the conditional variate model even if the marginal covariate model expected from the experimental design is not!

For example a randomized experimental design can be compromised by subjects dropping out from the study before measurements are made. If the probability of dropping out is correlated with the behavior of conditional measurement process then the marginal covariate model for the remaining subjects will be confounded with the conditional variate model of that measurement process.
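A toy simulation of this failure mode, under entirely hypothetical assumptions: exposures are designed to be uniform, but subjects with high latent susceptibility drop out more often at high exposure, shifting the observed covariate distribution away from the designed one.

```python
import random
import statistics

# Designed covariates are uniform on [0, 1]; a latent susceptibility that also
# drives the measurement process determines who remains in the study.
rng = random.Random(1234)

designed, observed = [], []
for _ in range(20000):
    x = rng.uniform(0.0, 1.0)        # designed covariate value
    suscept = rng.gauss(0.0, 1.0)    # latent susceptibility
    designed.append(x)
    # Higher exposure lowers the dropout threshold for susceptible subjects;
    # only subjects below the threshold stay and are observed.
    if suscept < 1.5 - x:
        observed.append(x)

print(round(statistics.mean(designed), 2))   # close to 0.50 by design
print(round(statistics.mean(observed), 2))   # noticeably smaller
```

The telltale signature is exactly the one described above: the observed covariate distribution is visibly shifted relative to the designed one, here toward lower exposures.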

In any deliberate experiment we cannot take the implementation of an experimental design for granted. We have to carefully validate the execution of any experimental design before we take any modeling shortcuts, such as ignoring the marginal covariate model, enabled by that design. Ultimately we don't model how measurements should have been made but rather how they are actually made. The latter is often much more interesting, but also much more challenging.

3.1.3 Natural Experiments

A lack of confounding behavior isn't always the consequence of an intentionally designed experiment. Sometimes circumstances outside of our direct control conveniently conspire to obstruct confounding behavior. These fortunate conditions are often referred to as natural experiments [1].

As with deliberate experiments, natural experiments are characterized by a narratively generative structure where the generation of the covariates is independent of the conditional generation of the variates. In natural experiments this is typically a consequence of some external phenomenon that selects out covariate observations but has no influence on the accompanying variate observations. If we can infer conditional variate model configurations in these fortunate circumstances then we can apply those inferences to predictions of missing variates in any other unconfounded circumstance, including intentional interventions.

In most cases the existence of a natural experiment is an assumption that we make based on our domain expertise. Because natural experiments are unintentional they usually can't be controlled and studied to determine their validity in any given circumstance. At best we can utilize standard model validation techniques, such as posterior retrodictive checks, to determine how appropriate a model based on an assumed natural experiment might be.

To demonstrate a natural experiment let's go back to the test-exposure example that we introduced above, where many methods of obtaining individuals to test introduced potential confounding behavior. Now, however, consider a concert where all attendees were tested as a requirement for entry. When interest in the concert and ability to attend are not influenced by susceptibility, which is not a trivial assumption by any means, this entry testing forms a convenient natural experiment.

If we also observe other manifestations of an external phenomenon that generates a natural experiment, beyond just the covariates, then we can also incorporate them into our inferences by including them in an expanded model. For example if \(\gamma\) parameterizes the possible behaviors of the external phenomenon and \(z\) denotes another observation besides the covariates then the joint model admits a narratively generative decomposition, \[ \pi(x, z, \gamma) = \pi(x, z \mid \gamma) \, \pi(\gamma). \] Motivated by the variate-covariate decomposition, however, we can also consider further decomposing the latent model as \[ \pi(x, z, \gamma) = \pi(x \mid z, \gamma) \, \pi(z \mid \gamma) \, \pi(\gamma). \] Assuming no confounding in this latent structure introduces two stages of conditioning in the full model, \[ \begin{align*} \pi(y, x, z, \eta, \gamma) &= \pi(y \mid x, \eta) \, \pi(x, z, \gamma) \, \pi(\eta) \\ &= \pi(y \mid x, \eta) \, \pi(x \mid z, \gamma) \, \pi(z \mid \gamma) \, \pi(\gamma) \, \pi(\eta). \end{align*} \]




In the econometrics literature these auxiliary observations would be considered a special case of instrumental variables.

Because they rely on coincidence, natural experiments are much more fragile than deliberate experiments. For example they can often suppress confounding behavior only for certain groups of variate-covariate observations. In these cases we can use the natural experiments to fully infer conditional variate model configurations for the unconfounded groups but not for all groups.

3.1.4 Looking Within

When confounding behavior cannot be avoided we have to explicitly model it in order to derive inferences that can be generalized beyond the scope of the observed covariates. That modeling task, however, is not always feasible in practice.

If we are not confident enough in our understanding of a system to build an appropriate marginal covariate model, and any confounding with the conditional variate model that it would exhibit, then we can still construct inferences that are valid only in the scope of the given observation. In general inferences derived from only a conditional variate model, \[ \pi(\theta \mid \tilde{y}, \tilde{x}) \propto \pi(\tilde{y} \mid \tilde{x}, \theta) \, \pi(\theta), \] are well-defined for only the observed covariates \(\tilde{x}\). They do not characterize general covariation for any other covariate observations unless we can verify that any confounding behavior is negligible.

Forcing conditional inferences in these cases requires that we accept the responsibility of clearly communicating the resulting limitations of those inferences. Otherwise the audience might inadvertently assume that the forced conditional inferences generalize, and be left to deal with any poor consequences of that assumption on their own.

3.2 Recipe du Variate

If we ignore the marginal covariate model -- because we are confident in the absence of confounding behavior, take that absence for granted, or simply force conditional inferences -- then we are often left to develop a conditional variate model outside of a narratively generative context. In these cases we often have to rely on heuristics to motivate a model that is as appropriate as possible for a given application.

Step One: Model The Variates

A useful first step in a direct construction of a conditional variate model is to use the structure of the variates to motivate an appropriate family of probability density functions, \[ \pi(y \mid \theta). \] The type -- binary, integer, real, vector, etc -- and constraints -- interval-valued, positive-valued, simplex-valued, etc -- of the variates substantially limit the possible families. Any other available domain expertise can then be used to select an appropriate family from these compatible candidates.

Consider for example variates that take on positive integer values, \(y \in \mathbb{N}\). The simplest family of probability density functions over these values is the Poisson family. Another option is the negative binomial family. While there are other, more sophisticated possibilities, these two families offer an accommodating starting point for modeling.
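The practical difference between these two families is how breadth scales with centrality. In a mean/dispersion-style parameterization (conventions vary across references and libraries; the one below is a common choice, not the only one) the comparison is immediate.

```python
# Variance as a function of centrality for the two count families.
def poisson_var(mu):
    return mu                   # Poisson: variance equals the mean

def neg_binomial_var(mu, phi):
    # Negative binomial with mean mu and dispersion phi:
    # extra "over-dispersed" variance beyond the Poisson baseline.
    return mu + mu**2 / phi

mu = 10.0
print(poisson_var(mu))                # 10.0
print(neg_binomial_var(mu, 5.0))      # 30.0
print(neg_binomial_var(mu, 1e9))      # ~10.0, recovering the Poisson limit
```

If preliminary data show variances well in excess of the means then the negative binomial family is the more accommodating starting point; as the dispersion parameter grows it smoothly degenerates back to the Poisson family.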

Likewise for interval-constrained real valued variates we might consider the beta family of probability density functions. For positive-constrained real valued variates we might consider the gamma, inverse-gamma, or log normal families depending on what we know about the tail behavior of the variates.

Step Two: Inject The Covariates

Once we've settled on a family of probability density functions we can introduce explicit covariate dependence by replacing any or all of the family parameters with deterministic functions of the covariates, \[ \pi(y \mid x, \eta) = \pi(y \mid f(x, \eta)). \] The choice of parameters to replace with deterministic functions, and the structure of those functions, determines the range of covariation behaviors supported by the resulting conditional variate model. Families of probability density functions with particularly interpretable parameters allow for more principled incorporation of the covariates.

For example consider unconstrained, real-valued variates \(y \in \mathbb{R}\) modeled with the normal family of probability density functions, \[ \pi(y \mid \mu, \sigma) = \text{normal}(y \mid \mu, \sigma). \] To elevate this choice to a conditional variate model we need to replace the location parameter \(\mu\), the scale parameter \(\sigma\), or even both with deterministic functions of the covariates.

If we believe that the covariates will influence the centrality of the variate values but not their breadth then we might replace only the location parameter with the parameterized function \(f(-, \eta) : X \rightarrow \mathbb{R}\), \[ \pi(y \mid x, \eta, \sigma) = \text{normal}(y \mid f(x, \eta), \sigma). \] Alternatively if we believe that the covariates will influence the breadth but not the centrality of the variate distribution then we might replace the scale parameter with a parameterized function \(g(-, \eta) : X \rightarrow \mathbb{R}^{+}\), \[ \pi(y \mid x, \mu, \eta) = \text{normal}(y \mid \mu, g(x, \eta)). \] Moreover there's no reason why both behaviors can't be moderated by the covariates at the same time, \[ \pi(y \mid x, \theta, \eta, \zeta) = \text{normal}(y \mid f(x, \theta, \eta), g(x, \theta, \zeta)). \] Here the parameter \(\theta\) shared by both functions allows the influence of the covariate values on the centrality and breadth to be coupled.
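The location-only construction above can be sketched in a few lines. Here the linear location function \(f(x, \eta) = \eta_{0} + \eta_{1} \, x\) and all parameter values are hypothetical choices made for illustration.

```python
import numpy as np

# Hypothetical location function f(x, eta) = eta0 + eta1 * x.
def f(x, eta):
    eta0, eta1 = eta
    return eta0 + eta1 * x

rng = np.random.default_rng(seed=8675309)
eta = (1.0, 2.0)   # configuration of the location function (illustrative)
sigma = 0.5        # scale parameter, left unmoderated by the covariates

# Simulate from the conditional variate model: the covariates shift the
# centrality of the variates while the breadth stays fixed at sigma.
x = rng.uniform(-1.0, 1.0, size=500)
y = rng.normal(loc=f(x, eta), scale=sigma)

print(y.mean())
```

Replacing the scale parameter instead would require a function mapping into the positive reals, for example by exponentiating a linear expression, which is why the output constraints of \(f\) and \(g\) are stated explicitly in the text.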

Families of probability density functions that manifest richer behaviors also feature more parameters, and more parameters require more choices to be made when introducing covariate dependence. The decision of which parameters should depend on the covariates, let alone how they should depend on them, can quickly become overwhelming. This burden, however, is a fundamental cost of a more heuristic approach to constructing conditional variate models. There are many assumptions to make, and our modeling responsibility is to acknowledge them all rather than take any one for granted.

That said, reparameterizations of the base family of probability density functions can often ease these design challenges. Consider for example variates taking positive, real values, \(y \in \mathbb{R}^{+}\), that we decide to model with the gamma family. The most common parameterization of the gamma family is through shape and rate parameters, both of which moderate the centrality and breadth of the resulting probability distributions. If we want the covariates to influence just the centrality of the variates then both the shape and rate parameters will have to be replaced with functions of the covariates. The location-dispersion parameterization of the gamma family, however, isolates control of the centrality directly into the one location parameter. The natural choice in this parameterization is to replace just that location parameter with a deterministic function of the covariates, leaving the dispersion parameter alone.
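One way to realize such a reparameterization is sketched below. The particular convention here -- mean \(\mu\) and dispersion \(\phi\) with shape \(1 / \phi\) and scale \(\mu \, \phi\), so that the mean is \(\mu\) and the variance is \(\phi \, \mu^{2}\) -- is an assumption for illustration; other location-dispersion conventions exist.

```python
from scipy import stats

# Assumed location-dispersion convention: shape = 1 / phi, scale = mu * phi,
# which gives mean = mu and variance = phi * mu**2.  Under this convention
# a covariate-dependent function need replace only mu to shift the
# centrality, leaving the dispersion phi untouched.
def gamma_mu_phi(mu, phi):
    return stats.gamma(a=1.0 / phi, scale=mu * phi)

dist = gamma_mu_phi(mu=3.0, phi=0.25)
print(dist.mean(), dist.var())  # mean 3.0, variance 0.25 * 3.0**2 = 2.25
```

In the standard shape-rate parameterization the same shift in centrality would require coordinated changes to both parameters, which is exactly the bookkeeping the reparameterization avoids.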

By first eliciting which properties of the variate distribution we want to relate to the covariate values, and then reparameterizing the base family of probability density functions so that these properties are moderated by just a few parameters, we can make the engineering of an appropriate conditional variate model much more straightforward.

Corollary: Visualizing Functional Behaviors

Marginal inferences for a single, one-dimensional parameter are straightforward to visualize. Visualizing inferences for an entire covariate-dependent function \(f(x, \eta)\) that replaces such a parameter in a heuristic conditional variate model, however, is a bit more complicated due to the dependence on both the input covariate \(x\) and functional configuration parameter \(\eta\).

Given a joint probability distribution over the covariate inputs and parameter values we can construct a pushforward distribution over functional outputs, \[ f_{*} \pi(\theta) = \int \mathrm{d} x \, \mathrm{d} \eta \, \pi(x, \eta) \, \mathbb{I} [ f(x, \eta) - \theta ], \] that is amenable to visualization. In practice we can construct these visualizations from Monte Carlo estimators derived from the samples \[ \begin{align*} \tilde{x}_{i}, \tilde{\eta}_{i} &\sim \pi(x, \eta) \\ \theta_{i} &= f(\tilde{x}_{i}, \tilde{\eta}_{i}). \end{align*} \] This joint distribution can for example be constructed from a conditional covariate model and prior models for \(\gamma\) and \(\eta\), \[ \begin{align*} \pi(x, \eta) &= \int \mathrm{d} \gamma \, \pi(x, \gamma, \eta) \\ &= \int \mathrm{d} \gamma \, \pi(x \mid \gamma, \eta) \, \pi(\gamma) \, \pi(\eta), \end{align*} \] or if there is no confounding simply \[ \pi(x, \eta) = \left[ \int \mathrm{d} \gamma \, \pi(x \mid \gamma) \, \pi(\gamma) \right] \, \pi(\eta). \] Alternatively we might utilize a posterior distribution for \(\eta\) informed by previous observations.
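The Monte Carlo construction above can be sketched directly. Everything here is a hypothetical choice made for illustration: a linear function \(f\), independent marginal distributions for \(x\) and \(\eta\) (i.e. assuming no confounding), and quantiles as the pushforward summary.

```python
import numpy as np

# Hypothetical function whose pushforward distribution we want to summarize.
def f(x, eta):
    return eta[0] + eta[1] * x

rng = np.random.default_rng(seed=1)
S = 4000

# Sample covariate inputs and configurations; independence corresponds to
# the no-confounding case where pi(x, eta) = pi(x) * pi(eta).
x_samples = rng.normal(0.0, 1.0, size=S)                # pi(x)
eta_samples = rng.normal([0.0, 1.0], 0.5, size=(S, 2))  # pi(eta)

# Push each joint sample through f to get pushforward samples.
theta = f(x_samples, eta_samples.T)

# Summarize the pushforward distribution with empirical quantiles.
print(np.quantile(theta, [0.1, 0.5, 0.9]))
```

Swapping the prior samples for posterior samples of \(\eta\) changes nothing in this recipe; only the source of `eta_samples` differs.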

When we have multiple covariate observations, \[ (\tilde{x}_{1}, \ldots, \tilde{x}_{n}, \ldots, \tilde{x}_{N}), \] then we can also utilize the corresponding empirical distribution instead of an explicit marginal covariate model, \[ \begin{align*} \tilde{\eta}_{i} &\sim \pi(\eta) \\ \theta_{ni} &= f(\tilde{x}_{n}, \tilde{\eta}_{i}). \end{align*} \]
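With observed covariates the marginal covariate model drops out of the recipe entirely: we evaluate the function at each observed input for each sampled configuration. The observed values and the prior for \(\eta\) below are illustrative placeholders.

```python
import numpy as np

# Hypothetical function, written to broadcast over both covariates and
# sampled configurations.
def f(x, eta):
    return eta[..., 0] + eta[..., 1] * x

rng = np.random.default_rng(seed=2)
x_obs = np.array([-1.0, 0.0, 0.5, 2.0])                    # observed covariates
eta_samples = rng.normal([0.0, 1.0], 0.5, size=(1000, 2))  # samples from pi(eta)

# theta[n, i] = f(x_obs[n], eta_samples[i]), matching the text's indexing.
theta = f(x_obs[:, None], eta_samples[None, :, :])

print(theta.shape)
```

Each row of `theta` is then a pushforward sample for one observed covariate, and averaging across rows recovers the empirical-distribution analogue of the marginalized pushforward.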

The immediate limitation of this approach is that it quantifies the functional behavior in only the particular context of a given marginal covariate model or empirical covariate distribution. If we are interested in how the functional behavior might generalize to other marginal covariate behaviors then we would have to construct a new pushforward distribution for each marginal covariate model of interest. Alternatively we could not marginalize over the covariate inputs at all.

Using only a probability distribution for the parameter \(\eta\) we can construct a pushforward distribution for the function value at each covariate input, \[ f_{*} \pi(\theta(x)) = \int \mathrm{d} \eta \, \pi(\eta) \, \mathbb{I} [ f(x, \eta) - \theta(x) ], \] or with samples \[ \begin{align*} \tilde{\eta}_{i} &\sim \pi(\eta) \\ \theta_{i}(x) &= f(x, \tilde{\eta}_{i}). \end{align*} \] When the covariate space is one-dimensional we can visualize the pushforward distributions for an interval of covariate inputs at the same time to communicate a more comprehensive picture of the functional behavior.
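For a one-dimensional covariate space this per-input pushforward is readily computed along a fine grid and summarized with pointwise quantiles, which is what underlies the ribbon-style visualizations. The function and the distribution for \(\eta\) below are again hypothetical.

```python
import numpy as np

# Hypothetical function, broadcasting over grid points and configurations.
def f(x, eta):
    return eta[..., 0] + eta[..., 1] * x

rng = np.random.default_rng(seed=3)
x_grid = np.linspace(-2.0, 2.0, 101)                       # fine grid of inputs
eta_samples = rng.normal([0.0, 1.0], 0.5, size=(2000, 2))  # samples from pi(eta)

# theta[n, i] = f(x_grid[n], eta_samples[i]): one pushforward sample per
# grid point and configuration.
theta = f(x_grid[:, None], eta_samples[None, :, :])

# Pointwise quantiles across configurations summarize each pushforward
# distribution f_* pi(theta(x)).
lower, median, upper = np.quantile(theta, [0.1, 0.5, 0.9], axis=1)

print(lower.shape, median.shape, upper.shape)
```

Plotting `lower`, `median`, and `upper` against `x_grid` gives quantile ribbons, while plotting individual columns `theta[:, i]` gives the function realizations used in the spaghetti plots discussed next.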

For example we can interpolate a single sequence of sampled function values \(\theta_{i}(x_{n})\) along a fine grid of covariate inputs \(x_{n}\) to visualize a single realization of the functional behavior.




Overlaying multiple interpolations begins to capture the full distribution of functional behaviors.




These spaghetti plots are dense with information, communicating how the function outputs are coupled across the entire range of covariate inputs. Unfortunately if we try to overlay too many sampled functional behaviors then we lose the ability to trace each function realization by eye, and the utility of this visualization begins to decrease.