This case study has been deprecated; the current version can be found here.

Part 1: Introduction to Gaussian Processes
Part 2: Optimizing Gaussian Processes Hyperparameters
Part 3: Bayesian Inference of Gaussian Processes Hyperparameters

In Part 1 of this case study we learned the basics of Gaussian processes and their implementation in Stan. We assumed, however, knowledge of the correct hyperparameters for the squared exponential kernel, \(\alpha\) and \(\rho\), as well as of the measurement variability, \(\sigma\), in the Gaussian observation model. In practice we will not know these hyperparameters a priori and will instead have to infer them from the observed data themselves.

Unfortunately, inferring Gaussian process hyperparameters is a notoriously challenging problem. In this part of the case study we will fit the hyperparameters with maximum marginal likelihood, a computationally inexpensive technique for generating point estimates, and then consider the utility of those point estimates in the context of Gaussian process regression with a Gaussian observation model.

1 Initial Setup

First things first we set up our local computing environment,

library(rstan)
Loading required package: ggplot2
Loading required package: StanHeaders
rstan (Version 2.16.2, packaged: 2017-07-03 09:24:58 UTC, GitRev: 2e1f913d3ca3)
For execution on a local, multicore CPU with excess RAM we recommend calling
rstan_options(auto_write = TRUE)
options(mc.cores = parallel::detectCores())
rstan_options(auto_write = TRUE)
options(mc.cores = parallel::detectCores())
source("gp_utility.R")

and then load in both the simulated data and the ground truth,

data <- read_rdump('gp.data.R')
true_realization <- read_rdump('gp.truth.R')

Finally we recreate the true data generating process that we are attempting to model,

f_data <- list(sigma=true_realization$sigma_true,
               N=length(true_realization$f_total),
               f=true_realization$f_total)
dgp_fit <- stan(file='simu_gauss_dgp.stan', data=f_data, iter=1000, warmup=0,
                chains=1, seed=5838298, refresh=1000, algorithm="Fixed_param")

SAMPLING FOR MODEL 'simu_gauss_dgp' NOW (CHAIN 1).
Iteration:   1 / 1000 [  0%]  (Sampling)
Iteration: 1000 / 1000 [100%]  (Sampling)

 Elapsed Time: 0 seconds (Warm-up)
               0.056014 seconds (Sampling)
               0.056014 seconds (Total)
plot_gp_pred_quantiles(dgp_fit, data, true_realization,
                       "True Data Generating Process Quantiles")

2 Inferring the Hyperparameters with Maximum Marginal Likelihood

Maximum marginal likelihood optimizes the likelihood of the observed data conditioned on the Gaussian process hyperparameters, with the realizations of the Gaussian process themselves marginalized out. Note that this is fundamentally different from constructing a modal estimator by optimizing the posterior density for the hyperparameters: the marginal likelihood is a function of the hyperparameters, not a density over them, so its maximum does not depend on how those hyperparameters are parameterized. Differences between the two approaches arise when transformations are introduced to bound parameters, for example enforcing positivity of \(\alpha\), \(\rho\), and \(\sigma\), because the posterior density then picks up Jacobian corrections that shift its mode.
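Concretely, for the squared exponential kernel and Gaussian observation model used here the marginalization over the latent function values can be done analytically, giving

\[ \pi(y \mid \alpha, \rho, \sigma) = \mathcal{N} \! \left( y \mid 0, \, K_{\alpha, \rho}(x, x) + \sigma^{2} \, \mathbb{I} \right), \quad K_{\alpha, \rho}(x_{i}, x_{j}) = \alpha^{2} \exp \! \left( - \frac{(x_{i} - x_{j})^{2}}{2 \rho^{2}} \right), \]

and maximum marginal likelihood maximizes this expression with respect to \(\alpha\), \(\rho\), and \(\sigma\).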

To implement maximum marginal likelihood we move the now unknown hyperparameters to the parameters block of our Stan program and then construct the analytically marginalized likelihood in the model block,

writeLines(readLines("opt1.stan"))
data {
  int<lower=1> N;
  real x[N];
  vector[N] y;
}

parameters {
  real<lower=0> rho;
  real<lower=0> alpha;
  real<lower=0> sigma;
}

model {
  matrix[N, N] cov =   cov_exp_quad(x, alpha, rho)
                     + diag_matrix(rep_vector(square(sigma), N));
  matrix[N, N] L_cov = cholesky_decompose(cov);

  y ~ multi_normal_cholesky(rep_vector(0, N), L_cov);
}

We then optimize the resulting Stan program to give

gp_opt1 <- stan_model(file='opt1.stan')
opt_fit <- optimizing(gp_opt1, data=data, seed=5838298, hessian=FALSE)
Initial log joint probability = -20.4511
Optimization terminated normally: 
  Convergence detected: relative gradient magnitude is below tolerance
alpha <- opt_fit$par[2]
rho <- opt_fit$par[1]
sigma <- opt_fit$par[3]

sprintf('alpha = %s', alpha)
[1] "alpha = 2.37412983590049"
sprintf('rho = %s', rho)
[1] "rho = 0.334039378319107"
sprintf('sigma = %s', sigma)
[1] "sigma = 0.198461596554666"

Note the large marginal standard deviation, \(\alpha\), the small length scale, \(\rho\), and the small measurement variability, \(\sigma\). Together these imply a Gaussian process that supports very wiggly functions.

Given the maximum marginal likelihood estimates of the hyperparameters we can then simulate from the induced Gaussian process posterior distribution and its corresponding posterior predictive distribution,
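For reference, with the hyperparameters fixed both of these distributions are Gaussian with analytic means and covariances; writing \(K\) for the squared exponential kernel matrix evaluated at the fitted \(\alpha\) and \(\rho\), the posterior over function values at the prediction points \(x_{*}\) is

\[ f_{*} \mid y \sim \mathcal{N} \! \left( K(x_{*}, x) \left[ K(x, x) + \sigma^{2} \mathbb{I} \right]^{-1} y, \; K(x_{*}, x_{*}) - K(x_{*}, x) \left[ K(x, x) + \sigma^{2} \mathbb{I} \right]^{-1} K(x, x_{*}) \right), \]

and the posterior predictive distribution for new observations simply adds \(\sigma^{2} \mathbb{I}\) to this covariance.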

pred_data <- list(alpha=alpha, rho=rho, sigma=sigma, N=data$N, x=data$x, y=data$y,
                  N_predict=data$N_predict, x_predict=data$x_predict)
pred_opt_fit <- stan(file='predict_gauss.stan', data=pred_data, iter=1000, warmup=0,
                     chains=1, seed=5838298, refresh=1000, algorithm="Fixed_param")

SAMPLING FOR MODEL 'predict_gauss' NOW (CHAIN 1).
Iteration:   1 / 1000 [  0%]  (Sampling)
Iteration: 1000 / 1000 [100%]  (Sampling)

 Elapsed Time: 0 seconds (Warm-up)
               7.92415 seconds (Sampling)
               7.92415 seconds (Total)

Indeed, the wiggly functions favored by this fit seriously overfit the observed data,

plot_gp_realizations(pred_opt_fit, data, true_realization,
                     "Posterior Realizations")

plot_gp_quantiles(pred_opt_fit, data, true_realization,
                  "Posterior Quantiles")

The overfitting is particularly evident in the posterior predictive visualizations, which demonstrate poor out-of-sample performance, especially near the observed data.

plot_gp_realizations(pred_opt_fit, data, true_realization,
                     "Posterior Predictive Realizations")

par(mfrow=c(1, 2))

plot_gp_pred_quantiles(pred_opt_fit, data, true_realization,
                       "PP Quantiles")

plot_gp_pred_quantiles(dgp_fit, data, true_realization,
                       "True DGP Quantiles")

But how reproducible is this result? Let’s try the fit again, only this time with a different random number generator seed that will select a different starting point for the hyperparameters.

opt_fit <- optimizing(gp_opt1, data=data, seed=2384853, hessian=FALSE)
Initial log joint probability = -20.3157
Optimization terminated normally: 
  Convergence detected: relative gradient magnitude is below tolerance
alpha <- opt_fit$par[2]
rho <- opt_fit$par[1]
sigma <- opt_fit$par[3]

sprintf('alpha = %s', alpha)
[1] "alpha = 1.5388985463163"
sprintf('rho = %s', rho)
[1] "rho = 0.234581345618205"
sprintf('sigma = %s', sigma)
[1] "sigma = 1.81866609858178"

The different starting point has yielded a completely different fit, which should already provoke unease. In this case the smaller marginal standard deviation and larger measurement variability produce a somewhat better fit, although the manifestation of overfitting is still evident.

pred_data <- list(alpha=alpha, rho=rho, sigma=sigma, N=data$N, x=data$x, y=data$y,
                  N_predict=data$N_predict, x_predict=data$x_predict)
pred_opt_fit <- stan(file='predict_gauss.stan', data=pred_data, iter=1000, warmup=0,
                     chains=1, seed=5838298, refresh=1000, algorithm="Fixed_param")

SAMPLING FOR MODEL 'predict_gauss' NOW (CHAIN 1).
Iteration:   1 / 1000 [  0%]  (Sampling)
Iteration: 1000 / 1000 [100%]  (Sampling)

 Elapsed Time: 0 seconds (Warm-up)
               8.00264 seconds (Sampling)
               8.00264 seconds (Total)
plot_gp_realizations(pred_opt_fit, data, true_realization,
                     "Posterior Realizations")

plot_gp_quantiles(pred_opt_fit, data, true_realization,
                  "Posterior Quantiles")

plot_gp_realizations(pred_opt_fit, data, true_realization,
                     "Posterior Predictive Realizations")

par(mfrow=c(1, 2))

plot_gp_pred_quantiles(pred_opt_fit, data, true_realization,
                       "PP Quantiles")

plot_gp_pred_quantiles(dgp_fit, data, true_realization,
                       "True DGP Quantiles")

The maximum marginal likelihood estimates of the hyperparameters are highly sensitive to the initial starting point, and the many different results we get all seem prone to overfitting (the seed sweep sketched below makes this concrete). One potential means of improving the robustness of these fits is to introduce regularization. Let's see if that does any better.
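First, here is the sketch of that seed sweep. It is not part of the original analysis; it simply reuses the gp_opt1 model and data objects defined above, with an arbitrary set of seeds, and collects the resulting point estimates for comparison,

seeds <- c(5838298, 2384853, 48383, 190234, 93844)
estimates <- sapply(seeds, function(s) {
  fit <- optimizing(gp_opt1, data=data, seed=s, hessian=FALSE)
  fit$par[c("alpha", "rho", "sigma")]
})
colnames(estimates) <- seeds
print(estimates)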

3 Regularized Maximum Marginal Likelihood

Regularized maximum marginal likelihood introduces a loss function that prevents the fit from straying into neighborhoods that are known to be pathological, such as those that might be prone to overfitting. Exactly what kind of regularization is needed is unclear at this point – indeed we will see in Part 3 that a Bayesian workflow is much more adept at identifying what kind of regularization is required.

Here we will presume a given regularization scheme and leave its motivation, and further discussion about why unregularized maximum marginal likelihood is so fragile, to Part 3.

In order to implement regularization in Stan we add prior densities to our Stan program that emulate the desired loss functions,

writeLines(readLines("opt2.stan"))
data {
  int<lower=1> N;
  real x[N];
  vector[N] y;
}

parameters {
  real<lower=0> rho;
  real<lower=0> alpha;
  real<lower=0> sigma;
}

model {
  matrix[N, N] cov =   cov_exp_quad(x, alpha, rho)
                     + diag_matrix(rep_vector(square(sigma), N));
  matrix[N, N] L_cov = cholesky_decompose(cov);

  rho ~ inv_gamma(8.91924, 34.5805);
  alpha ~ normal(0, 2);
  sigma ~ normal(0, 1);

  y ~ multi_normal_cholesky(rep_vector(0, N), L_cov);
}

In particular, the regularization penalizes the small length scales that we kept getting in the unregularized fits.
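To get a feel for how strongly this inverse gamma prior suppresses small length scales, note that if \(\rho \sim \text{Inv-Gamma}(a, b)\) then \(1 / \rho \sim \text{Gamma}(a, b)\) with rate \(b\), so prior tail probabilities can be checked with base R alone. This is a quick sketch, not part of the original analysis,

a <- 8.91924
b <- 34.5805

# Prior mass below rho = 2, the regime of small length scales
# favored by the unregularized fits
1 - pgamma(1 / 2, shape=a, rate=b)

# Prior mass above rho = 10, i.e. very long length scales
pgamma(1 / 10, shape=a, rate=b)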

gp_opt2 <- stan_model(file='opt2.stan')
opt_fit <- optimizing(gp_opt2, data=data, seed=5838298, hessian=FALSE)
Initial log joint probability = -76.3954
Optimization terminated normally: 
  Convergence detected: relative gradient magnitude is below tolerance
alpha <- opt_fit$par[2]
rho <- opt_fit$par[1]
sigma <- opt_fit$par[3]

sprintf('alpha = %s', alpha)
[1] "alpha = 0.000445554939726782"
sprintf('rho = %s', rho)
[1] "rho = 3.48610316803464"
sprintf('sigma = %s', sigma)
[1] "sigma = 2.03159847725428"

Somewhat more encouraging, running again with a different seed yields essentially the same result.

gp_opt2 <- stan_model(file='opt2.stan')
opt_fit <- optimizing(gp_opt2, data=data, seed=95848338, hessian=FALSE)
Initial log joint probability = -40.0973
Optimization terminated normally: 
  Convergence detected: relative gradient magnitude is below tolerance
alpha <- opt_fit$par[2]
rho <- opt_fit$par[1]
sigma <- opt_fit$par[3]

sprintf('alpha = %s', alpha)
[1] "alpha = 0.000589796067037712"
sprintf('rho = %s', rho)
[1] "rho = 3.48620430056269"
sprintf('sigma = %s', sigma)
[1] "sigma = 2.03157221267242"

Unfortunately, the extremely small marginal standard deviations returned by the regularized maximum marginal likelihood fit lead to very rigid functions that do not capture the structure of the true data generating process,

pred_data <- list(alpha=alpha, rho=rho, sigma=sigma, N=data$N, x=data$x, y=data$y,
                  N_predict=data$N_predict, x_predict=data$x_predict)
pred_opt_fit <- stan(file='predict_gauss.stan', data=pred_data, iter=1000, warmup=0,
                    chains=1, seed=5838298, refresh=1000, algorithm="Fixed_param")

SAMPLING FOR MODEL 'predict_gauss' NOW (CHAIN 1).
Iteration:   1 / 1000 [  0%]  (Sampling)
Iteration: 1000 / 1000 [100%]  (Sampling)

 Elapsed Time: 0 seconds (Warm-up)
               8.21524 seconds (Sampling)
               8.21524 seconds (Total)
plot_gp_realizations(pred_opt_fit, data, true_realization,
                     "Posterior Realizations")

plot_gp_quantiles(pred_opt_fit, data, true_realization,
                  "Posterior Quantiles")

plot_gp_realizations(pred_opt_fit, data, true_realization,
                     "Posterior Predictive Realizations")

Despite the rigidity of the fitted Gaussian process, the larger measurement variability does better capture the data generating process than the unregularized fit.

par(mfrow=c(1, 2))

plot_gp_pred_quantiles(pred_opt_fit, data, true_realization,
                       "PP Quantiles")

plot_gp_pred_quantiles(dgp_fit, data, true_realization,
                       "True DGP Quantiles")

Still, the overall result leaves much to be desired.

4 One Singular Fluctuation

Given some of the more subtle mathematical properties of Gaussian processes the limitations of even the regularized maximum marginal likelihood fit are actually not all that surprising. In particular, Gaussian processes are notoriously singular distributions over the Hilbert space of functions.

More intuitively, given a particular kernel with particular hyperparameters a Gaussian process does not support an entire Hilbert space but rather only a single, singular slice through that space. In other words, the function realizations that we can draw from the Gaussian process are limited to this slice. Changing the hyperparameters by even an infinitesimal amount yields a different slice through the Hilbert space that has no overlap with the original slice.

Consequently, whenever we use a Gaussian process with fixed hyperparameters we consider only an infinitesimally small set of the functions that could define the variate-covariate relationship. If the fixed hyperparameters don't exactly correspond to the true hyperparameters – even if they are off by the smallest margin – then we won't be able to generate any of the functions that would be generated by the true data generating process.

In practice this singular behavior isn’t quite as pathological as it sounds, as the behavior that differentiates between neighboring slices tends to be isolated to wiggles with high frequency and low amplitude that would have small effects. That said, when we don’t know the true values of the hyperparameters then even these small wiggles can have significant effects on the maximum marginal likelihood and hence strongly affect the resulting fit.
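One way to build some intuition for this sensitivity is to profile the log marginal likelihood over a grid of length scales with the other hyperparameters held fixed. The sketch below is purely illustrative and not part of the original analysis; it fixes \(\alpha\) and \(\sigma\) at arbitrary values and uses the mvtnorm package for the multivariate normal density,

library(mvtnorm)

# Squared exponential kernel plus measurement noise on the diagonal
sq_exp_cov <- function(x, alpha, rho, sigma) {
  d2 <- outer(x, x, function(a, b) (a - b)^2)
  alpha^2 * exp(-0.5 * d2 / rho^2) + diag(sigma^2, length(x))
}

rhos <- seq(0.2, 10, by=0.2)
lml <- sapply(rhos, function(rho)
  dmvnorm(data$y, sigma=sq_exp_cov(data$x, alpha=2, rho=rho, sigma=0.5), log=TRUE))

plot(rhos, lml, type="l", xlab="rho", ylab="log marginal likelihood")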

The only way to take full advantage of a Gaussian process is to consider the entire Hilbert space of functions, and that requires considering all of the possible hyperparameters for the chosen kernel. Fortunately, that is a natural consequence of the Bayesian approach which we will discuss in Part 3.

5 Conclusion

Gaussian processes provide a flexible means of nonparametrically modeling regression behavior, but that flexibility also hinders the performance of point estimates derived from maximum marginal likelihood and even regularized maximum marginal likelihood, especially when the observed data are sparse. In order to take best advantage of this flexibility we need to infer Gaussian process hyperparameters with Bayesian methods.

Part 3: Bayesian Inference of Gaussian Processes Hyperparameters

6 Acknowledgements

The insights motivating this case study came from a particularly fertile research project with Dan Simpson, Rob Trangucci, and Aki Vehtari.

I thank Dan Simpson, Aki Vehtari, and Rob Trangucci for many helpful comments on the case study.

7 Original Computing Environment

writeLines(readLines(file.path(Sys.getenv("HOME"), ".R/Makevars")))
CXXFLAGS=-O3 -mtune=native -march=native -Wno-unused-variable -Wno-unused-function -Wno-macro-redefined

CC=clang
CXX=clang++ -arch x86_64 -ftemplate-depth-256
devtools::session_info("rstan")
Session info -------------------------------------------------------------
 setting  value                       
 version  R version 3.4.2 (2017-09-28)
 system   x86_64, darwin15.6.0        
 ui       X11                         
 language (EN)                        
 collate  en_US.UTF-8                 
 tz       America/New_York            
 date     2017-11-12                  
Packages -----------------------------------------------------------------
 package      * version   date       source        
 BH             1.65.0-1  2017-08-24 CRAN (R 3.4.1)
 colorspace     1.3-2     2016-12-14 CRAN (R 3.4.0)
 dichromat      2.0-0     2013-01-24 CRAN (R 3.4.0)
 digest         0.6.12    2017-01-27 CRAN (R 3.4.0)
 ggplot2      * 2.2.1     2016-12-30 CRAN (R 3.4.0)
 graphics     * 3.4.2     2017-10-04 local         
 grDevices    * 3.4.2     2017-10-04 local         
 grid           3.4.2     2017-10-04 local         
 gridExtra      2.3       2017-09-09 CRAN (R 3.4.1)
 gtable         0.2.0     2016-02-26 CRAN (R 3.4.0)
 inline         0.3.14    2015-04-13 CRAN (R 3.4.0)
 labeling       0.3       2014-08-23 CRAN (R 3.4.0)
 lattice        0.20-35   2017-03-25 CRAN (R 3.4.2)
 lazyeval       0.2.1     2017-10-29 CRAN (R 3.4.2)
 magrittr       1.5       2014-11-22 CRAN (R 3.4.0)
 MASS           7.3-47    2017-02-26 CRAN (R 3.4.2)
 Matrix         1.2-11    2017-08-21 CRAN (R 3.4.2)
 methods      * 3.4.2     2017-10-04 local         
 munsell        0.4.3     2016-02-13 CRAN (R 3.4.0)
 plyr           1.8.4     2016-06-08 CRAN (R 3.4.0)
 R6             2.2.2     2017-06-17 CRAN (R 3.4.0)
 RColorBrewer   1.1-2     2014-12-07 CRAN (R 3.4.0)
 Rcpp           0.12.13   2017-09-28 CRAN (R 3.4.2)
 RcppEigen      0.3.3.3.0 2017-05-01 CRAN (R 3.4.0)
 reshape2       1.4.2     2016-10-22 CRAN (R 3.4.0)
 rlang          0.1.4     2017-11-05 CRAN (R 3.4.2)
 rstan        * 2.16.2    2017-07-03 CRAN (R 3.4.1)
 scales         0.5.0     2017-08-24 CRAN (R 3.4.1)
 StanHeaders  * 2.16.0-1  2017-07-03 CRAN (R 3.4.1)
 stats        * 3.4.2     2017-10-04 local         
 stats4         3.4.2     2017-10-04 local         
 stringi        1.1.5     2017-04-07 CRAN (R 3.4.0)
 stringr        1.2.0     2017-02-18 CRAN (R 3.4.0)
 tibble         1.3.4     2017-08-22 CRAN (R 3.4.1)
 tools          3.4.2     2017-10-04 local         
 utils        * 3.4.2     2017-10-04 local         
 viridisLite    0.2.0     2017-03-24 CRAN (R 3.4.0)