Part 1: Introduction to Gaussian Processes
Part 2: Optimizing Gaussian Processes Hyperparameters
Part 3: Bayesian Inference of Gaussian Processes Hyperparameters
In Part 1 of this case study we learned the basics of Gaussian processes and their implementation in Stan. We assumed, however, the knowledge of the correct hyperparameters for the squared exponential kernel, \(\alpha\) and \(\rho\), as well as the measurement variability, \(\sigma\), in the Gaussian observation model. In practice we will not know these hyperparameters a priori but will instead have to infer them from the observed data themselves.
Unfortunately, inferring Gaussian process hyperparameters is a notoriously challenging problem. In this part of the case study we will consider fitting the hyperparameters with maximum marginal likelihood, a computationally inexpensive technique for generating point estimates, and examine the utility of those point estimates in the context of Gaussian process regression with a Gaussian observation model.
First things first we set up our local computing environment,
library(rstan)
Loading required package: ggplot2
Loading required package: StanHeaders
rstan (Version 2.16.2, packaged: 2017-07-03 09:24:58 UTC, GitRev: 2e1f913d3ca3)
For execution on a local, multicore CPU with excess RAM we recommend calling
rstan_options(auto_write = TRUE)
options(mc.cores = parallel::detectCores())
rstan_options(auto_write = TRUE)
options(mc.cores = parallel::detectCores())
source("gp_utility.R")
and then load in both the simulated data and the ground truth,
data <- read_rdump('gp.data.R')
true_realization <- read_rdump('gp.truth.R')
Finally we recreate the true data generating process that we are attempting to model,
f_data <- list(sigma=true_realization$sigma_true,
               N=length(true_realization$f_total),
               f=true_realization$f_total)
dgp_fit <- stan(file='simu_gauss_dgp.stan', data=f_data, iter=1000, warmup=0,
                chains=1, seed=5838298, refresh=1000, algorithm="Fixed_param")
SAMPLING FOR MODEL 'simu_gauss_dgp' NOW (CHAIN 1).
Iteration: 1 / 1000 [ 0%] (Sampling)
Iteration: 1000 / 1000 [100%] (Sampling)
Elapsed Time: 0 seconds (Warm-up)
0.056014 seconds (Sampling)
0.056014 seconds (Total)
plot_gp_pred_quantiles(dgp_fit, data, true_realization,
"True Data Generating Process Quantiles")
Maximum marginal likelihood optimizes the likelihood of the observed data conditioned on the Gaussian process hyperparameters, with the realizations of the Gaussian process themselves marginalized out. Note that this is fundamentally different from constructing a modal estimator by optimizing the posterior density for the hyperparameters: the marginal likelihood is a function of the hyperparameters while the posterior density is a density over them. Differences between the two approaches arise when transformations are introduced to bound parameters, for example enforcing positivity of \(\alpha\), \(\rho\), and \(\sigma\), because a density acquires a Jacobian correction under such a transformation while a function does not.
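To make this concrete, for the squared exponential kernel and Gaussian observation model used in this case study the marginal likelihood is available in closed form. Writing \(K_{\alpha, \rho}\) for the kernel matrix over the observed covariates, notation introduced just for this display, we have
\[
\pi(y \mid \alpha, \rho, \sigma)
= \int \mathcal{N}(y \mid f, \sigma^{2} \, I) \,
\mathcal{N}(f \mid 0, K_{\alpha, \rho}) \, \mathrm{d} f
= \mathcal{N}(y \mid 0, K_{\alpha, \rho} + \sigma^{2} \, I),
\]
where
\[
\left( K_{\alpha, \rho} \right)_{ij}
= \alpha^{2} \exp \left( - \frac{ (x_{i} - x_{j})^{2} }{ 2 \rho^{2} } \right),
\]
and maximum marginal likelihood maximizes this density over \(\alpha\), \(\rho\), and \(\sigma\).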
To implement maximum marginal likelihood we move the now unknown hyperparameters to the parameters block of our Stan program and then construct the analytic marginal likelihood in the model block,
writeLines(readLines("opt1.stan"))
data {
  int<lower=1> N;
  real x[N];
  vector[N] y;
}

parameters {
  real<lower=0> rho;
  real<lower=0> alpha;
  real<lower=0> sigma;
}

model {
  matrix[N, N] cov =   cov_exp_quad(x, alpha, rho)
                     + diag_matrix(rep_vector(square(sigma), N));
  matrix[N, N] L_cov = cholesky_decompose(cov);
  y ~ multi_normal_cholesky(rep_vector(0, N), L_cov);
}
We then optimize the resulting Stan program to give
gp_opt1 <- stan_model(file='opt1.stan')
opt_fit <- optimizing(gp_opt1, data=data, seed=5838298, hessian=FALSE)
Initial log joint probability = -20.4511
Optimization terminated normally:
Convergence detected: relative gradient magnitude is below tolerance
alpha <- opt_fit$par[2]
rho <- opt_fit$par[1]
sigma <- opt_fit$par[3]
sprintf('alpha = %s', alpha)
[1] "alpha = 2.37412983590049"
sprintf('rho = %s', rho)
[1] "rho = 0.334039378319107"
sprintf('sigma = %s', sigma)
[1] "sigma = 0.198461596554666"
Note the large marginal standard deviation, \(\alpha\), the small length scale, \(\rho\), and the small measurement variability, \(\sigma\). Together these imply a Gaussian process that supports very wiggly functions.
Given the maximum marginal likelihood estimates of the hyperparameters we can then simulate from the induced Gaussian process posterior distribution and its corresponding posterior predictive distribution,
pred_data <- list(alpha=alpha, rho=rho, sigma=sigma, N=data$N, x=data$x, y=data$y,
                  N_predict=data$N_predict, x_predict=data$x_predict)
pred_opt_fit <- stan(file='predict_gauss.stan', data=pred_data, iter=1000, warmup=0,
                     chains=1, seed=5838298, refresh=1000, algorithm="Fixed_param")
SAMPLING FOR MODEL 'predict_gauss' NOW (CHAIN 1).
Iteration: 1 / 1000 [ 0%] (Sampling)
Iteration: 1000 / 1000 [100%] (Sampling)
Elapsed Time: 0 seconds (Warm-up)
7.92415 seconds (Sampling)
7.92415 seconds (Total)
Indeed the wiggly functions favored by this fit seriously overfit to the observed data,
plot_gp_realizations(pred_opt_fit, data, true_realization,
"Posterior Realizations")
plot_gp_quantiles(pred_opt_fit, data, true_realization,
"Posterior Quantiles")
The overfitting is particularly evident in the posterior predictive visualizations, which demonstrate poor out-of-sample performance, especially in the neighborhoods near the observed data.
plot_gp_realizations(pred_opt_fit, data, true_realization,
"Posterior Predictive Realizations")
par(mfrow=c(1, 2))
plot_gp_pred_quantiles(pred_opt_fit, data, true_realization,
"PP Quantiles")
plot_gp_pred_quantiles(dgp_fit, data, true_realization,
"True DGP Quantiles")
But how reproducible is this result? Let’s try the fit again, only this time with a different random number generator seed that will select a different starting point for the hyperparameters.
opt_fit <- optimizing(gp_opt1, data=data, seed=2384853, hessian=FALSE)
Initial log joint probability = -20.3157
Optimization terminated normally:
Convergence detected: relative gradient magnitude is below tolerance
alpha <- opt_fit$par[2]
rho <- opt_fit$par[1]
sigma <- opt_fit$par[3]
sprintf('alpha = %s', alpha)
[1] "alpha = 1.5388985463163"
sprintf('rho = %s', rho)
[1] "rho = 0.234581345618205"
sprintf('sigma = %s', sigma)
[1] "sigma = 1.81866609858178"
The different starting point has yielded a completely different fit, which should already provoke unease. In this case the smaller marginal standard deviation and larger measurement variability produce a somewhat better fit, although the manifestation of overfitting is still evident.
pred_data <- list(alpha=alpha, rho=rho, sigma=sigma, N=data$N, x=data$x, y=data$y,
                  N_predict=data$N_predict, x_predict=data$x_predict)
pred_opt_fit <- stan(file='predict_gauss.stan', data=pred_data, iter=1000, warmup=0,
                     chains=1, seed=5838298, refresh=1000, algorithm="Fixed_param")
SAMPLING FOR MODEL 'predict_gauss' NOW (CHAIN 1).
Iteration: 1 / 1000 [ 0%] (Sampling)
Iteration: 1000 / 1000 [100%] (Sampling)
Elapsed Time: 0 seconds (Warm-up)
8.00264 seconds (Sampling)
8.00264 seconds (Total)
plot_gp_realizations(pred_opt_fit, data, true_realization,
"Posterior Realizations")
plot_gp_quantiles(pred_opt_fit, data, true_realization,
"Posterior Quantiles")
plot_gp_realizations(pred_opt_fit, data, true_realization,
"Posterior Predictive Realizations")
par(mfrow=c(1, 2))
plot_gp_pred_quantiles(pred_opt_fit, data, true_realization,
"PP Quantiles")
plot_gp_pred_quantiles(dgp_fit, data, true_realization,
"True DGP Quantiles")
The maximum marginal likelihood estimates of the hyperparameters are highly sensitive to the initial starting point, and the many different results we get all seem prone to overfitting. One potential means of improving the robustness of these fits is to introduce regularization. Let’s see if that does any better.
Regularized maximum marginal likelihood introduces a loss function that prevents the fit from straying into neighborhoods that are known to be pathological, such as those that might be prone to overfitting. Exactly what kind of regularization is needed is unclear at this point – indeed we will see in Part 3 that a Bayesian workflow is much more adept at identifying what kind of regularization is required.
Here we will presume a given regularization scheme and leave its motivation, and further discussion about why unregularized maximum marginal likelihood is so fragile, to Part 3.
In order to implement regularization in Stan we add prior densities to our Stan program that emulate the desired loss functions,
writeLines(readLines("opt2.stan"))
data {
  int<lower=1> N;
  real x[N];
  vector[N] y;
}

parameters {
  real<lower=0> rho;
  real<lower=0> alpha;
  real<lower=0> sigma;
}

model {
  matrix[N, N] cov =   cov_exp_quad(x, alpha, rho)
                     + diag_matrix(rep_vector(square(sigma), N));
  matrix[N, N] L_cov = cholesky_decompose(cov);

  rho ~ inv_gamma(8.91924, 34.5805);
  alpha ~ normal(0, 2);
  sigma ~ normal(0, 1);

  y ~ multi_normal_cholesky(rep_vector(0, N), L_cov);
}
In particular, the regularization penalizes the small length scales that we kept getting in the unregularized fits.
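To get a sense for how strongly this prior penalizes small length scales we can compute a few of its tail probabilities, using the fact that if \(1 / \rho\) follows a gamma distribution with shape \(a\) and rate \(b\) then \(\rho\) follows an inverse gamma distribution with shape \(a\) and scale \(b\). A quick check in R, with a helper function defined just for this purpose,
# Probability mass that the inv_gamma(8.91924, 34.5805) prior places below
# a given length scale, via the reciprocal gamma relationship
a <- 8.91924
b <- 34.5805
p_below <- function(rho) 1 - pgamma(1 / rho, shape=a, rate=b)
p_below(0.5) # essentially no mass at the length scales favored by the fits above
p_below(2)   # prior probability that rho is below 2
b / (a + 1)  # prior mode, approximately 3.49
The prior mode is close to the regularized estimate of \(\rho\) that we will find below.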
gp_opt2 <- stan_model(file='opt2.stan')
opt_fit <- optimizing(gp_opt2, data=data, seed=5838298, hessian=FALSE)
Initial log joint probability = -76.3954
Optimization terminated normally:
Convergence detected: relative gradient magnitude is below tolerance
alpha <- opt_fit$par[2]
rho <- opt_fit$par[1]
sigma <- opt_fit$par[3]
sprintf('alpha = %s', alpha)
[1] "alpha = 0.000445554939726782"
sprintf('rho = %s', rho)
[1] "rho = 3.48610316803464"
sprintf('sigma = %s', sigma)
[1] "sigma = 2.03159847725428"
Somewhat more encouragingly, running again with a different seed yields essentially the same result.
gp_opt2 <- stan_model(file='opt2.stan')
opt_fit <- optimizing(gp_opt2, data=data, seed=95848338, hessian=FALSE)
Initial log joint probability = -40.0973
Optimization terminated normally:
Convergence detected: relative gradient magnitude is below tolerance
alpha <- opt_fit$par[2]
rho <- opt_fit$par[1]
sigma <- opt_fit$par[3]
sprintf('alpha = %s', alpha)
[1] "alpha = 0.000589796067037712"
sprintf('rho = %s', rho)
[1] "rho = 3.48620430056269"
sprintf('sigma = %s', sigma)
[1] "sigma = 2.03157221267242"
Unfortunately, the extremely small marginal standard deviation returned by the regularized maximum marginal likelihood fit leads to very rigid functions that do not capture the structure of the true data generating process,
pred_data <- list(alpha=alpha, rho=rho, sigma=sigma, N=data$N, x=data$x, y=data$y,
                  N_predict=data$N_predict, x_predict=data$x_predict)
pred_opt_fit <- stan(file='predict_gauss.stan', data=pred_data, iter=1000, warmup=0,
                     chains=1, seed=5838298, refresh=1000, algorithm="Fixed_param")
SAMPLING FOR MODEL 'predict_gauss' NOW (CHAIN 1).
Iteration: 1 / 1000 [ 0%] (Sampling)
Iteration: 1000 / 1000 [100%] (Sampling)
Elapsed Time: 0 seconds (Warm-up)
8.21524 seconds (Sampling)
8.21524 seconds (Total)
plot_gp_realizations(pred_opt_fit, data, true_realization,
"Posterior Realizations")
plot_gp_quantiles(pred_opt_fit, data, true_realization,
"Posterior Quantiles")
plot_gp_realizations(pred_opt_fit, data, true_realization,
"Posterior Predictive Realizations")
Despite the rigidity of the fitted Gaussian process, the larger measurement variability does capture the data generating process better than the unregularized fit did.
par(mfrow=c(1, 2))
plot_gp_pred_quantiles(pred_opt_fit, data, true_realization,
"PP Quantiles")
plot_gp_pred_quantiles(dgp_fit, data, true_realization,
"True DGP Quantiles")
Still, the overall result leaves much to be desired.
Given some of the more subtle mathematical properties of Gaussian processes, the limitations of even the regularized maximum marginal likelihood fit are actually not all that surprising. In particular, Gaussian processes are notoriously singular distributions over the Hilbert space of functions.
More intuitively, given a particular kernel with particular hyperparameters a Gaussian process does not support an entire Hilbert space but rather only a single, singular slice through that space. In other words, the function realizations that we can draw from the Gaussian process are limited to this slice. Changing the hyperparameters by even an infinitesimal amount yields a different slice through the Hilbert space that has no overlap with the original slice.
Consequently, whenever we use a Gaussian process with fixed hyperparameters we consider only an infinitesimal set of functions that could define the variate-covariate relationship. If the fixed hyperparameters don’t exactly correspond to the true hyperparameters – even if they are off by the smallest margin – then we won’t be able to generate any of the functions that would be generated by the true data generating process.
In practice this singular behavior isn’t quite as pathological as it sounds, as the behavior that differentiates between neighboring slices tends to be isolated to high-frequency, low-amplitude wiggles that have only small effects. That said, when we don’t know the true values of the hyperparameters, even these small wiggles can have significant effects on the maximum marginal likelihood and hence strongly affect the resulting fit.
The only way to take full advantage of a Gaussian process is to consider the entire Hilbert space of functions, and that requires considering all of the possible hyperparameters for the chosen kernel. Fortunately, that is a natural consequence of the Bayesian approach which we will discuss in Part 3.
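Schematically, the Bayesian approach averages the conditional Gaussian process posteriors over the hyperparameter posterior,
\[
\pi(f \mid y)
= \int \pi(f \mid y, \alpha, \rho, \sigma) \,
\pi(\alpha, \rho, \sigma \mid y) \,
\mathrm{d} \alpha \, \mathrm{d} \rho \, \mathrm{d} \sigma,
\]
so that every slice through the Hilbert space contributes in proportion to how consistent its hyperparameters are with the observed data.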
Gaussian processes provide a flexible means of nonparametrically modeling regression behavior, but that flexibility also hinders the performance of point estimates derived from maximum marginal likelihood and even regularized maximum marginal likelihood, especially when the observed data are sparse. In order to take best advantage of this flexibility we need to infer Gaussian process hyperparameters with Bayesian methods.
Part 3: Bayesian Inference of Gaussian Processes Hyperparameters
The insights motivating this case study came from a particularly fertile research project with Dan Simpson, Rob Trangucci, and Aki Vehtari.
I thank Dan Simpson, Aki Vehtari, and Rob Trangucci for many helpful comments on the case study.
writeLines(readLines(file.path(Sys.getenv("HOME"), ".R/Makevars")))
CXXFLAGS=-O3 -mtune=native -march=native -Wno-unused-variable -Wno-unused-function -Wno-macro-redefined
CC=clang
CXX=clang++ -arch x86_64 -ftemplate-depth-256
devtools::session_info("rstan")
Session info -------------------------------------------------------------
setting value
version R version 3.4.2 (2017-09-28)
system x86_64, darwin15.6.0
ui X11
language (EN)
collate en_US.UTF-8
tz America/New_York
date 2017-11-12
Packages -----------------------------------------------------------------
package * version date source
BH 1.65.0-1 2017-08-24 CRAN (R 3.4.1)
colorspace 1.3-2 2016-12-14 CRAN (R 3.4.0)
dichromat 2.0-0 2013-01-24 CRAN (R 3.4.0)
digest 0.6.12 2017-01-27 CRAN (R 3.4.0)
ggplot2 * 2.2.1 2016-12-30 CRAN (R 3.4.0)
graphics * 3.4.2 2017-10-04 local
grDevices * 3.4.2 2017-10-04 local
grid 3.4.2 2017-10-04 local
gridExtra 2.3 2017-09-09 CRAN (R 3.4.1)
gtable 0.2.0 2016-02-26 CRAN (R 3.4.0)
inline 0.3.14 2015-04-13 CRAN (R 3.4.0)
labeling 0.3 2014-08-23 CRAN (R 3.4.0)
lattice 0.20-35 2017-03-25 CRAN (R 3.4.2)
lazyeval 0.2.1 2017-10-29 CRAN (R 3.4.2)
magrittr 1.5 2014-11-22 CRAN (R 3.4.0)
MASS 7.3-47 2017-02-26 CRAN (R 3.4.2)
Matrix 1.2-11 2017-08-21 CRAN (R 3.4.2)
methods * 3.4.2 2017-10-04 local
munsell 0.4.3 2016-02-13 CRAN (R 3.4.0)
plyr 1.8.4 2016-06-08 CRAN (R 3.4.0)
R6 2.2.2 2017-06-17 CRAN (R 3.4.0)
RColorBrewer 1.1-2 2014-12-07 CRAN (R 3.4.0)
Rcpp 0.12.13 2017-09-28 CRAN (R 3.4.2)
RcppEigen 0.3.3.3.0 2017-05-01 CRAN (R 3.4.0)
reshape2 1.4.2 2016-10-22 CRAN (R 3.4.0)
rlang 0.1.4 2017-11-05 CRAN (R 3.4.2)
rstan * 2.16.2 2017-07-03 CRAN (R 3.4.1)
scales 0.5.0 2017-08-24 CRAN (R 3.4.1)
StanHeaders * 2.16.0-1 2017-07-03 CRAN (R 3.4.1)
stats * 3.4.2 2017-10-04 local
stats4 3.4.2 2017-10-04 local
stringi 1.1.5 2017-04-07 CRAN (R 3.4.0)
stringr 1.2.0 2017-02-18 CRAN (R 3.4.0)
tibble 1.3.4 2017-08-22 CRAN (R 3.4.1)
tools 3.4.2 2017-10-04 local
utils * 3.4.2 2017-10-04 local
viridisLite 0.2.0 2017-03-24 CRAN (R 3.4.0)