Gaussian processes are a powerful approach for modeling functional behavior but, as with any powerful tool, they can be wielded responsibly only with deliberate care. In this case study I introduce the basic usage of Gaussian processes in modeling applications, and how to implement them in Stan, before considering some of their subtle but ubiquitous inferential pathologies and the techniques for addressing those pathologies with principled domain expertise.

1 Modeling Functional Relationships

A common problem in probabilistic modeling is capturing the impact of some continuous variable, \(x \in X\), such as the known time or spatial location at which a measurement is made. Even once we've chosen an appropriate observational model, for example \(\pi(y; \theta, \phi)\), and identified which parameter is influenced by the covariate, for example \(\theta\), we still have to model the functional relationship between that parameter and the covariates, \(\theta = f(x)\).

For example consider a normal observation model where the covariates influence only the location parameter, \[ y \sim \text{normal} \left( f(x), \sigma \right). \] In this special case of modeling the structure of the location parameter we can interpret the observed data \((y_{n}, x_{n})\) as points scattered around the location function \(f(x)\) and overlay them on the same plot, even though they technically live in different spaces.
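This setup can be sketched in a few lines of simulation. Here the location function, measurement variability, and number of observations are all arbitrary choices made purely for illustration; the point is only that the observed pairs \((y_{n}, x_{n})\) scatter around \(f(x)\).

```python
import numpy as np

rng = np.random.default_rng(8675309)

# Hypothetical location function; any smooth choice would do here.
def f(x):
    return np.sin(2 * np.pi * x)

sigma = 0.25                          # Assumed measurement variability
x = np.sort(rng.uniform(-1, 1, 50))   # Observed covariates
y = rng.normal(f(x), sigma)           # Observations scattered around f(x)
```

Plotting `y` against `x` on top of `f` recovers exactly the overlay described above, even though the observations and the location function technically live in different spaces.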




In some cases this functional relationship is motivated by explicit generative models, but when we aren't privileged with that depth of understanding we have to consider more heuristic models. For example if \(\theta\) takes values on the entire real line then we can consider a linear approximation where we assume a linear relationship between the covariate \(x\) and the parameter, \[ f(x) = \alpha + x^{T} \cdot \boldsymbol{\beta}. \]
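The linear model is straightforward to evaluate directly. In this sketch the intercept \(\alpha\), slopes \(\boldsymbol{\beta}\), and the two-dimensional covariate are all arbitrary values chosen for illustration.

```python
import numpy as np

# Illustrative linear model f(x) = alpha + x . beta with assumed values
alpha = 1.0
beta = np.array([0.5, -2.0])

def f(x):
    # Inner product of the covariate with the slopes, plus the intercept
    return alpha + x @ beta

x = np.array([1.0, 0.25])
f(x)  # 1.0 + 0.5 * 1.0 - 2.0 * 0.25 = 1.0
```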




This approximation, however, is often not rich enough to capture the functional behavior that manifests in a given system. A more sophisticated heuristic model might consider a higher-order polynomial approximation that captures more complex functional behaviors than the linear model.
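For a one-dimensional covariate a polynomial approximation takes the form \(f(x) = \sum_{k = 0}^{K} \beta_{k} \, x^{k}\), which a short sketch makes concrete. The particular coefficients here are arbitrary choices for illustration.

```python
import numpy as np

# Assumed coefficients beta_0, ..., beta_3 of a third-order polynomial
beta = np.array([0.5, -1.0, 0.0, 2.0])

def f(x, beta):
    # f(x) = sum_k beta_k * x^k
    return sum(b * x**k for k, b in enumerate(beta))

f(0.5, beta)  # 0.5 - 0.5 + 0.0 + 0.25 = 0.25
```

The linear model is recovered as the special case \(K = 1\); each additional order adds a new coefficient and with it new functional behaviors.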




As the order of the polynomial grows, however, the space of polynomial functions starts to be dominated by more extreme functional relationships that usually conflict with our own domain expertise.
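One way to see this domination empirically is to sample polynomial coefficients from some simple distribution and track how extreme the resulting functions become as the order grows. The unit normal coefficient distribution and the covariate grid below are assumptions made only for this demonstration.

```python
import numpy as np

rng = np.random.default_rng(42)
xs = np.linspace(-1, 1, 201)  # Covariate grid for the demonstration

def mean_max_amplitude(order, n_samples=500):
    # Draw polynomial coefficients independently from a unit normal
    # and average the largest output magnitude over the covariate grid.
    coefs = rng.normal(size=(n_samples, order + 1))
    V = xs[:, None] ** np.arange(order + 1)[None, :]  # Vandermonde matrix
    fs = coefs @ V.T                                  # Sampled functions
    return np.abs(fs).max(axis=1).mean()

# The average extremity grows steadily with the polynomial order.
[mean_max_amplitude(K) for K in (1, 5, 25)]
```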




Unfortunately this behavior is somewhat inevitable. In practice it is incredibly difficult, if not outright impossible, to engineer function spaces that are rich enough to capture complex behaviors of interest without also being so large that those desired behaviors are overwhelmed by less desirable behaviors. To compensate we need to ensure that our probabilistic model for the functional behavior provides sufficient regularization, concentrating on the desired behaviors while suppressing the undesirable behaviors.

Principled regularization, however, requires understanding not only which behaviors are consistent with our domain expertise and which are not, but also how those behaviors relate to our representation of the function space. For example robust use of the polynomial model requires understanding how all of the polynomial coefficients interact to control the overall functional behavior, which quickly becomes unmanageable as the order of the polynomial grows. Interpretable regularization requires much more than just shrinking all of the coefficients towards zero! Polynomial models are problematic not because they're too flexible but rather because it's challenging to regularize that flexibility in a principled way.

Gaussian processes define probabilistic models of functional behavior not through any finite-dimensional parametric model but rather by defining probability distributions over function spaces directly. This more direct approach can drastically improve the interpretability of the model and hence facilitate its principled use.

1.1 Gaussian Processes In Theory

Gaussian processes are built on top of a sophisticated, and honestly a bit overwhelming, mathematical foundation [1, 2]. In this case study we will consider only a relatively simple abstraction of that theory that will be sufficient for their basic use with one-dimensional covariate spaces. The deeper theory becomes much more relevant once approximations are involved; see Section 4.3 for a limited discussion.

A Gaussian process defines a probability distribution over functions; in other words every sample from a Gaussian process is an entire function from the covariate space \(X\) to the real-valued output space.
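In practice we only ever evaluate a sampled function at a finite set of covariate values, where the Gaussian process reduces to a multivariate normal distribution over those evaluations. A minimal sketch, assuming a squared exponential kernel with arbitrary marginal standard deviation `alpha` and length scale `rho`:

```python
import numpy as np

rng = np.random.default_rng(54321)

# Assumed kernel hyperparameters, chosen purely for illustration
alpha, rho = 1.0, 0.5
xs = np.linspace(-1, 1, 100)  # Finite grid of covariate values

# Squared exponential covariance matrix over the grid, with a small
# jitter added to the diagonal for numerical stability
K = alpha**2 * np.exp(-0.5 * ((xs[:, None] - xs[None, :]) / rho) ** 2)
K += 1e-8 * np.eye(len(xs))

# Each multivariate normal draw is one sampled function evaluated at xs
f_sample = rng.multivariate_normal(np.zeros(len(xs)), K)
```

Because nearby covariate values are strongly correlated under this kernel, each draw traces out a smooth curve over the grid, consistent with the interpretation of a single sample as an entire function.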