This case study has been deprecated; the current version can be found here.

Part 1: Introduction to Gaussian Processes
Part 2: Optimizing Gaussian Processes Hyperparameters
Part 3: Bayesian Inference of Gaussian Processes Hyperparameters

In Part 1 of this case study we learned the basics of Gaussian processes and their implementation in Stan. We assumed, however, knowledge of the correct hyperparameters for the squared exponential kernel, \(\alpha\) and \(\rho\), as well as the measurement variability, \(\sigma\), in the Gaussian observation model. In practice we will not know these hyperparameters a priori but will instead have to infer them from the observed data themselves.
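
For reference, the squared exponential kernel from Part 1, as implemented by Stan's cov_exp_quad, is

\[ k(x_{i}, x_{j} \mid \alpha, \rho) = \alpha^{2} \exp \left( - \frac{(x_{i} - x_{j})^{2}}{2 \rho^{2}} \right), \]

so that \(\alpha\) sets the marginal standard deviation of the latent functions and \(\rho\) the length scale over which they vary.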

Unfortunately, inferring Gaussian process hyperparameters is a notoriously challenging problem. In this part of the case study we will consider fitting the hyperparameters with maximum marginal likelihood, a computationally inexpensive technique for generating point estimates, and then consider the utility of those point estimates in the context of Gaussian process regression with a Gaussian observation model.

1 Initial Setup

First things first, we set up our local computing environment,

library(rstan)
Loading required package: ggplot2
Loading required package: StanHeaders
rstan (Version 2.16.2, packaged: 2017-07-03 09:24:58 UTC, GitRev: 2e1f913d3ca3)
For execution on a local, multicore CPU with excess RAM we recommend calling
rstan_options(auto_write = TRUE)
options(mc.cores = parallel::detectCores())
rstan_options(auto_write = TRUE)
options(mc.cores = parallel::detectCores())
source("gp_utility.R")

and then load in both the simulated data and the ground truth,

data <- read_rdump('gp.data.R')
true_realization <- read_rdump('gp.truth.R')

Finally we recreate the true data generating process that we are attempting to model,

f_data <- list(sigma=true_realization$sigma_true,
               N=length(true_realization$f_total),
               f=true_realization$f_total)
dgp_fit <- stan(file='simu_gauss_dgp.stan', data=f_data, iter=1000, warmup=0,
                chains=1, seed=5838298, refresh=1000, algorithm="Fixed_param")

SAMPLING FOR MODEL 'simu_gauss_dgp' NOW (CHAIN 1).
Iteration:   1 / 1000 [  0%]  (Sampling)
Iteration: 1000 / 1000 [100%]  (Sampling)

 Elapsed Time: 0 seconds (Warm-up)
               0.056014 seconds (Sampling)
               0.056014 seconds (Total)
plot_gp_pred_quantiles(dgp_fit, data, true_realization,
                       "True Data Generating Process Quantiles")

2 Inferring the Hyperparameters with Maximum Marginal Likelihood

Maximum marginal likelihood optimizes the likelihood of the observed data conditioned on the Gaussian process hyperparameters, with the realizations of the Gaussian process themselves marginalized out. Note that this is fundamentally different from constructing a modal estimator by optimizing the posterior density for the hyperparameters: the marginal likelihood is a function of the hyperparameters whereas the posterior density is a density, and the location of a density's maximum depends on the chosen parameterization. Differences between the two approaches consequently arise when transformations are introduced to bound parameters, for example to enforce the positivity of \(\alpha\), \(\rho\), and \(\sigma\).
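
For a Gaussian observation model this marginalization over the latent Gaussian process can be done analytically, giving

\[ p(y \mid \alpha, \rho, \sigma) = \int p(y \mid f, \sigma) \, p(f \mid \alpha, \rho) \, \mathrm{d} f = \mathcal{N} \! \left( y \mid 0, \, K(x \mid \alpha, \rho) + \sigma^{2} \mathbb{I} \right), \]

where \(K(x \mid \alpha, \rho)\) is the squared exponential kernel matrix evaluated at the observed covariates. Maximum marginal likelihood then maximizes this function over \(\alpha\), \(\rho\), and \(\sigma\), which is exactly the objective constructed in the Stan program below.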

To implement maximum marginal likelihood we move the now unknown hyperparameters into the parameters block of our Stan program and then construct the analytically marginalized Gaussian likelihood in the model block,

writeLines(readLines("opt1.stan"))
data {
  int<lower=1> N;
  real x[N];
  vector[N] y;
}

parameters {
  real<lower=0> rho;
  real<lower=0> alpha;
  real<lower=0> sigma;
}

model {
  matrix[N, N] cov =   cov_exp_quad(x, alpha, rho)
                     + diag_matrix(rep_vector(square(sigma), N));
  matrix[N, N] L_cov = cholesky_decompose(cov);

  y ~ multi_normal_cholesky(rep_vector(0, N), L_cov);
}

We then optimize the resulting Stan program to give

gp_opt1 <- stan_model(file='opt1.stan')
opt_fit <- optimizing(gp_opt1, data=data, seed=5838298, hessian=FALSE)
Initial log joint probability = -20.4511
Optimization terminated normally: 
  Convergence detected: relative gradient magnitude is below tolerance
alpha <- opt_fit$par[2]
rho <- opt_fit$par[1]
sigma <- opt_fit$par[3]

sprintf('alpha = %s', alpha)
[1] "alpha = 2.37412983590049"
sprintf('rho = %s', rho)
[1] "rho = 0.334039378319107"
sprintf('sigma = %s', sigma)
[1] "sigma = 0.198461596554666"

Note the large marginal standard deviation, \(\alpha\), the small length scale, \(\rho\), and the small measurement variability, \(\sigma\). Together these imply a Gaussian process that supports very wiggly functions.
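
To get a rough feel for just how wiggly, we can draw a few realizations from the Gaussian process prior with these point estimates plugged into the squared exponential kernel. This is only an illustrative sketch, not part of the original analysis; it assumes the MASS package is available and reuses the prediction grid x_predict from the data,

# Squared exponential kernel matrix on a one-dimensional grid
se_cov <- function(x, alpha, rho)
  alpha^2 * exp(-0.5 * outer(x, x, "-")^2 / rho^2)

x_grid <- sort(data$x_predict)
# A small jitter on the diagonal keeps the covariance numerically positive definite
K <- se_cov(x_grid, alpha, rho) + diag(1e-8, length(x_grid))
f_draws <- MASS::mvrnorm(3, mu=rep(0, length(x_grid)), Sigma=K)

matplot(x_grid, t(f_draws), type="l", lty=1, xlab="x", ylab="f(x)",
        main="Prior Realizations at the Optimized Hyperparameters")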

Given the maximum marginal likelihood estimate of the hyperparameters we can then simulate from the induced Gaussian process posterior distribution and its corresponding posterior predictive distribution,

pred_data <- list(alpha=alpha, rho=rho, sigma=sigma, N=data$N, x=data$x, y=data$y,
                  N_predict=data$N_predict, x_predict=data$x_predict)
pred_opt_fit <- stan(file='predict_gauss.stan', data=pred_data, iter=1000, warmup=0,
                     chains=1, seed=5838298, refresh=1000, algorithm="Fixed_param")

SAMPLING FOR MODEL 'predict_gauss' NOW (CHAIN 1).
Iteration:   1 / 1000 [  0%]  (Sampling)
Iteration: 1000 / 1000 [100%]  (Sampling)

 Elapsed Time: 0 seconds (Warm-up)
               7.92415 seconds (Sampling)
               7.92415 seconds (Total)

Indeed the wiggly functions favored by this fit seriously overfit to the observed data,

plot_gp_realizations(pred_opt_fit, data, true_realization,
                     "Posterior Realizations")

plot_gp_quantiles(pred_opt_fit, data, true_realization,
                  "Posterior Quantiles")

The overfitting is particularly evident in the posterior predictive visualizations, which demonstrate poor out-of-sample performance, especially in the neighborhoods near the observed data.

plot_gp_realizations(pred_opt_fit, data, true_realization,
                     "Posterior Predictive Realizations")

par(mfrow=c(1, 2))

plot_gp_pred_quantiles(pred_opt_fit, data, true_realization,
                       "PP Quantiles")

plot_gp_pred_quantiles(dgp_fit, data, true_realization,
                       "True DGP Quantiles")

But how reproducible is this result? Let’s try the fit again, only this time with a different random number generator seed that will select a different starting point for the hyperparameters.

opt_fit <- optimizing(gp_opt1, data=data, seed=2384853, hessian=FALSE)
Initial log joint probability = -20.3157
Optimization terminated normally: 
  Convergence detected: relative gradient magnitude is below tolerance
alpha <- opt_fit$par[2]
rho <- opt_fit$par[1]
sigma <- opt_fit$par[3]

sprintf('alpha = %s', alpha)
[1] "alpha = 1.5388985463163"
sprintf('rho = %s', rho)
[1] "rho = 0.234581345618205"
sprintf('sigma = %s', sigma)
[1] "sigma = 1.81866609858178"

The different starting point has yielded a completely different fit, which should already provoke unease. In this case the smaller marginal standard deviation and larger measurement variability produce a somewhat better fit, although signs of overfitting are still evident.
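
The sensitivity to the starting point can be probed more systematically by restarting the optimizer from several explicit initial values, supplied through the init argument of optimizing, and comparing the optima it reaches. The following is only a sketch, with arbitrarily chosen initial values that are not part of the original case study,

init_list <- list(list(rho=0.1, alpha=3, sigma=0.1),
                  list(rho=1, alpha=1, sigma=1),
                  list(rho=5, alpha=0.5, sigma=2))
restarts <- lapply(init_list, function(init)
  optimizing(gp_opt1, data=data, init=init, seed=5838298, hessian=FALSE))

# Compare the hyperparameter estimates and achieved log density across restarts
sapply(restarts, function(fit) c(fit$par, lp__=fit$value))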

pred_data <- list(alpha=alpha, rho=rho, sigma=sigma, N=data$N, x=data$x, y=data$y,
                  N_predict=data$N_predict, x_predict=data$x_predict)
pred_opt_fit <- stan(file='predict_gauss.stan', data=pred_data, iter=1000, warmup=0,
                     chains=1, seed=5838298, refresh=1000, algorithm="Fixed_param")

SAMPLING FOR MODEL 'predict_gauss' NOW (CHAIN 1).
Iteration:   1 / 1000 [  0%]  (Sampling)
Iteration: 1000 / 1000 [100%]  (Sampling)

 Elapsed Time: 0 seconds (Warm-up)
               8.00264 seconds (Sampling)
               8.00264 seconds (Total)
plot_gp_realizations(pred_opt_fit, data, true_realization,
                     "Posterior Realizations")

plot_gp_quantiles(pred_opt_fit, data, true_realization,
                  "Posterior Quantiles")