par(family="serif", las=1, bty="l",
cex.axis=1, cex.lab=1, cex.main=1,
xaxs="i", yaxs="i", mar = c(5, 5, 3, 1))
Bae’s Theorem
In 2006 Netflix announced the infamous Netflix Prize. The competition challenged anyone to use a data set of customer movie reviews to inform predictions of a second, held-out set of customer movie reviews. Superficially the Netflix challenge was a mixed success, although from a broader perspective it demonstrated the perils of poorly-chosen metrics for predictive performance and the subtleties of data privacy. Wikipedia summarizes the history well.
Beyond the Netflix Prize itself the corresponding data set is a nice example of some of the problems that can arise in a wide range of practical applications. For instance not only are customer preferences limited to five star ratings but also the interpretation of those ratings is ambiguous and typically not consistent across customers. Some customers are generous with their five star ratings while others are meager with not only their five star ratings but also their four star and sometimes even three star ratings.
In addition to the idiosyncratic rating scales any analysis of this data also has to contend with idiosyncratic customer preferences. Because not every customer will agree on the quality of a given movie we have to decide whether we want to try to learn an aggregate preference or the individual customer preferences.
In this chapter I will develop a Bayesian analysis of a subset of the Netflix training data set, not in an attempt to win the Netflix Prize decades too late but rather to demonstrate some of these common analysis challenges.
Importantly this analysis will not be the first time that Netflix has been associated with Bayesian inference. In 2016 Amy Hogan (@alittlestats) presented her influential Bae’s Theorem, p( \text{chill} \mid \text{Netflix} ) = \frac{ p( \text{Netflix} \mid \text{chill} ) \, p( \text{chill} ) } { p( \text{Netflix} ) }.
1 Setup
As always we begin by setting up our local R
environment.
library(rstan)
rstan_options(auto_write = TRUE) # Cache compiled Stan programs
options(mc.cores = parallel::detectCores()) # Parallelize chains
parallel:::setDefaultClusterOptions(setup_strategy = "sequential")

util <- new.env()
source('mcmc_analysis_tools_rstan.R', local=util)
source('mcmc_visualization_tools.R', local=util)
2 Data Exploration
The full Netflix Prize training data set consisted of 100,480,507 customer-movie pairs, each accompanied by an ordinal rating between one and five “stars”, with one being the worst rating and five being the best. The observed ratings spanned 480,189 anonymized customers and 17,770 movies.
To allow for a more manageable analysis I reduced the full data set by considering only the first 1000 movies and then randomly subsampling 100 customers and 200 movies with probabilities proportional to the total number of ratings. This left 2,415 total ratings.
Finally to facilitate modeling and add another layer of anonymizing obfuscation I relabeled the selected customers and movies with contiguous indices.
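For concreteness, a reduction like this can be sketched in a few lines of R. To be clear this is only a hypothetical reconstruction, not the actual preprocessing code; the full_ratings data frame and its column names are stand-ins for however the raw Netflix files might have been loaded.

# Hypothetical sketch of the data reduction; `full_ratings` and its
# columns are assumed stand-ins, not objects from this analysis
first_movies <- full_ratings[full_ratings$movie_id <= 1000,]

# Subsample customers and movies with probabilities proportional
# to their total numbers of ratings
customer_counts <- table(first_movies$customer_id)
movie_counts <- table(first_movies$movie_id)

sel_customers <- sample(names(customer_counts), 100, prob=customer_counts)
sel_movies <- sample(names(movie_counts), 200, prob=movie_counts)

keep <- first_movies$customer_id %in% sel_customers &
        first_movies$movie_id %in% sel_movies
sub_ratings <- first_movies[keep,]

# Relabel the selections with contiguous indices
sub_ratings$customer_idx <- as.integer(factor(sub_ratings$customer_id))
sub_ratings$movie_idx <- as.integer(factor(sub_ratings$movie_id))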
data <- read_rdump('data/ratings.data.R')
cat(sprintf("%s Customers", data$N_customers))
100 Customers
cat(sprintf("%s Movies", data$N_movies))
200 Movies
cat(sprintf("%s Total Ratings", data$N_ratings))
2415 Total Ratings
Despite favoring customers with more ratings in the data subsampling most of the selected customers still rated only a few movies. Overall the training data set is relatively sparse, with only a few customers contributing most of the ratings.
par(mfrow=c(1, 1), mar=c(5, 5, 2, 1))
util$plot_line_hist(table(data$customer_idxs),
                    -0.5, 95.5, 5,
                    xlab="Number of Ratings Per Customer")
Similarly most of the selected movies have only a few ratings.
par(mfrow=c(1, 1), mar=c(5, 5, 2, 1))
util$plot_line_hist(table(data$movie_idxs),
                    -0.5, 55.5, 2,
                    xlab="Number of Ratings Per Movie")
Warning in check_bin_containment(bin_min, bin_max, values): 2 values (1.0%)
fell above the binning.
This sparsity is even more evident if we visualize the customer-movie pairings.
xs <- seq(1, data$N_movies, 1)
ys <- seq(1, data$N_customers, 1)
zs <- matrix(0, nrow=data$N_movies, ncol=data$N_customers)

for (n in 1:data$N_ratings) {
  zs[data$movie_idxs[n], data$customer_idxs[n]] <- 1
}

par(mfrow=c(1, 1), mar = c(5, 5, 1, 1))
image(xs, ys, zs, col=c("white", util$c_dark_teal),
      xlab="Movie", ylab="Customer")
The observed ratings are slightly biased towards large values, with far more four star ratings than two star ratings. From the data alone, however, we cannot determine whether or not this is because most movies that had been rated were really good or because most customers were just generous with their ratings.
par(mfrow=c(1, 1), mar=c(5, 5, 2, 1))
util$plot_line_hist(data$ratings,
                    -0.5, 6.5, 1, xlab="Ratings")
That said there is substantial heterogeneity in the ratings across customers. Customer 70, for example, gave many high ratings while Customer 23 gave many low ratings.
par(mfrow=c(2, 3), mar=c(5, 5, 1, 1))
for (c in c(7, 23, 40, 70, 77, 100)) {
  util$plot_line_hist(data$ratings[data$customer_idxs == c],
                      -0.5, 6.5, 1,
                      xlab="Rating", main=paste('Customer', c))
}
Similarly we see strong variation in the observed ratings across movies. Movies 117 and 180 are particularly well-reviewed.
par(mfrow=c(2, 3), mar=c(5, 5, 1, 1))
for (m in c(33, 53, 61, 80, 117, 180)) {
  util$plot_line_hist(data$ratings[data$movie_idxs == m],
                      -0.5, 6.5, 1,
                      xlab="Rating", main=paste('Movie', m))
}
When critiquing any model of these ratings we want to be able to interrogate this variation in rating behavior across customers and movies. Even with the subsampled data, however, a histogram of ratings for each customer and movie would be too ungainly.
One alternative approach is to compute a scalar summary of the ratings within each group, and then construct a histogram of those stratified values. For example we might consider histograms of empirical means stratified by customer or movie.
par(mfrow=c(1, 2), mar=c(5, 5, 2, 1))
mean_rating_customer <-
  sapply(1:data$N_customers,
         function(c) mean(data$ratings[data$customer_idxs == c]))

util$plot_line_hist(mean_rating_customer,
                    0, 6, 1,
                    xlab="Customer-wise Average Ratings")

mean_rating_movie <-
  sapply(1:data$N_movies,
         function(m) mean(data$ratings[data$movie_idxs == m]))

util$plot_line_hist(mean_rating_movie,
                    0, 6, 0.5,
                    xlab="Movie-wise Average Ratings")
Of course an empirical mean captures only some of the rating behavior within each stratum. We can capture more with a summary that is sensitive to the dispersion of ratings in each stratum, such as the empirical variance or empirical entropy.
Here let’s go with the empirical variance, or rather a modified empirical variance that defaults to zero when a stratum consists of only one value.
safe_var <- function(vals) {
  if (length(vals) == 1)
    (0)
  else
    (var(vals))
}
par(mfrow=c(1, 2), mar=c(5, 5, 2, 1))
var_rating_customer <-
  sapply(1:data$N_customers,
         function(c) safe_var(data$ratings[data$customer_idxs == c]))

util$plot_line_hist(var_rating_customer,
                    0, 5, 0.5,
                    xlab="Customer-wise Rating Variances")

var_rating_movie <-
  sapply(1:data$N_movies,
         function(m) safe_var(data$ratings[data$movie_idxs == m]))

util$plot_line_hist(var_rating_movie,
                    0, 5, 0.5,
                    xlab="Movie-wise Rating Variances")
The main limitation with these stratified summary statistics is that they are sensitive to only the marginal variation across movies and customers. In other words they are sensitive to heterogeneity across movies or customers but not heterogeneity across movies and customers at the same time. Unfortunately because each customer-movie pair has at most one rating, and most have no ratings, we can’t just stratify summary statistics by both customer and movie.
One potential compromise is to construct empirical covariances for each pair of customers or movies. For example the empirical covariance between two movies m_{1} and m_{2} is defined by \hat{\rho}_{m_{1} m_{2}} = \hat{\rho}_{m_{2} m_{1}} = \frac{1}{N_{C} - 1} \sum_{c = 1}^{N_{C}} (r_{c m_{1}} - \hat{\mu}_{m_{1}}) \cdot (r_{c m_{2}} - \hat{\mu}_{m_{2}}) where N_{C} is the number of customers, r_{c m} is the rating given to movie m by customer c, and \hat{\mu}_{m} = \frac{1}{N_{C}} \sum_{c = 1}^{N_{C}} r_{c m}. The immediate issue is that, because not every customer rates every movie, many of the r_{c m} in these sums will be undefined.
All we can do is limit the sums to the customers who have rated both movies. More formally let \mathsf{c}_{m} denote the set of customers who have rated movie m and N_{C, m} denote the number of elements in that set. Similarly let \mathsf{c}_{m_{1} m_{2}} denote the set of customers who have rated both movies m_{1} and m_{2} with N_{C, m_{1} m_{2}} the number of elements in that set. Then we can define \hat{\rho}_{m_{1} m_{2}} = \hat{\rho}_{m_{2} m_{1}} = \frac{1}{N_{C, m_{1} m_{2}} - 1} \sum_{c \in \mathsf{c}_{m_{1} m_{2}}} (r_{c m_{1}} - \hat{\mu}_{m_{1}}) \cdot (r_{c m_{2}} - \hat{\mu}_{m_{2}}) with \hat{\mu}_{m} = \frac{1}{N_{C, m}} \sum_{c \in \mathsf{c}_{m}} r_{c m} the movie-wise empirical means that we’ve already constructed.
All of this said, given the relative sparsity of the observed ratings, the set \mathsf{c}_{m_{1} m_{2}} will be empty for most pairs of movies. Even fewer pairs of movies will have the N_{C, m_{1} m_{2}} > 1 needed for \hat{\rho}_{m_{1} m_{2}} to be well-defined, let alone N_{C, m_{1} m_{2}} large enough for \hat{\rho}_{m_{1} m_{2}} to provide an informative summary.
We can avoid any ill-defined or poorly informative empirical covariances by including only those movie pairs with sufficiently large N_{C, m_{1} m_{2}} in the final histogram. This has the added benefit of reducing the total number of summaries that we have to bin into the final histogram visualization. Here I will require N_{C, m_{1} m_{2}} > 7.
Now that we’ve carefully laid out the math the implementation is relatively straightforward. First we loop over the observed ratings twice, incrementing the partial sums for each pair of movies when appropriate. This gives us \hat{\Sigma}_{m_{1} m_{2}} = \sum_{c \in \mathsf{c}_{m_{1} m_{2}}} (r_{c m_{1}} - \hat{\mu}_{m_{1}}) \cdot (r_{c m_{2}} - \hat{\mu}_{m_{2}}) and N_{C, m_{1} m_{2}} = \sum_{c \in \mathsf{c}_{m_{1} m_{2}}} 1.
covar_rating_movie <- matrix(0,
                             nrow=data$N_movies,
                             ncol=data$N_movies)
movie_pair_counts <- matrix(0,
                            nrow=data$N_movies,
                            ncol=data$N_movies)

for (n1 in 1:data$N_ratings) {
  for (n2 in 1:data$N_ratings) {
    if (data$customer_idxs[n1] == data$customer_idxs[n2]) {
      m1 <- data$movie_idxs[n1]
      m2 <- data$movie_idxs[n2]
      y <- (data$ratings[n1] - mean_rating_movie[m1]) *
           (data$ratings[n2] - mean_rating_movie[m2])
      covar_rating_movie[m1, m2] <- covar_rating_movie[m1, m2] + y
      covar_rating_movie[m2, m1] <- covar_rating_movie[m2, m1] + y
      movie_pair_counts[m1, m2] <- movie_pair_counts[m1, m2] + 1
      movie_pair_counts[m2, m1] <- movie_pair_counts[m2, m1] + 1
    }
  }
}
Next we compute \hat{\rho}_{m_{1} m_{2}} = \frac{ \hat{\Sigma}_{m_{1} m_{2}} }{ N_{C, m_{1} m_{2}} - 1 } for each pair of movies where N_{C, m_{1} m_{2}} is larger than 7.
m_pairs <- list()
covar_rating_movie_filt <- c()

for (m1 in 2:data$N_movies) {
  for (m2 in 1:(m1 - 1)) {
    if (movie_pair_counts[m1, m2] > 7) {
      m_pairs[[length(m_pairs) + 1]] <- c(m1, m2)
      covar_rating_movie_filt <- c(covar_rating_movie_filt,
                                   covar_rating_movie[m1, m2] /
                                   (movie_pair_counts[m1, m2] - 1))
    }
  }
}
Finally we bin these values into a histogram that visualizes the range of these partial empirical covariance behaviors.
par(mfrow=c(1, 1), mar=c(5, 5, 2, 1))
util$plot_line_hist(covar_rating_movie_filt,
                    -4, 4, 0.25,
                    xlab="Movie-wise Rating Covariances")
In order to construct posterior retrodictive checks later on we will need to select the posterior predictive values for these same selected movie pairs. We might as well construct the appropriate variable names now and have them ready.
covar_rating_movie_filt_names <-
  sapply(m_pairs,
         function(p) paste0('covar_rating_movie_pred[',
                            p[1], ',', p[2], ']'))
All of this said I think that there is still a lot of opportunity for better summary statistics in applications like these.
3 Homogeneous Customer Model
Now that we’ve familiarized ourselves with the data we can make our first attempt at modeling the data generating process that, well, generated it.
First and foremost we need to account for both the ordering and discreteness of the five star rating system. Fortunately this is straightforward with the latent cut point construction that I discuss in my chapter on ordinal modeling. At the same time we can use the induced Dirichlet prior model to avoid any unreasonable behavior in those cut points.
We can then model the consumer preference for each movie with a separate affinity parameter that shifts the latent probability density function relative to the fixed cut points, and hence the induced probabilities for each rating. Note that we can interpret this construction as a bipartite pair-wise comparison model that contrasts the austerity of customers with the quality of each movie.
The subtlety with this interpretation is that the customer austerity is not a separate parameter but rather implicit in the translation freedom of the cut points. Setting the anchor point in the induced Dirichlet prior model to zero effectively removes this translation freedom and centers the customer austerity around zero.
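To see how the affinities and cut points interact, recall that the ordered logistic model defines the rating probabilities through differences of logistic cumulative distribution function values. A quick R sketch makes this concrete; the function and the numerical values here are purely illustrative.

# Rating probabilities induced by an affinity and four interior
# cut points under the ordered logistic model,
#   P(r = k) = plogis(gamma - c_{k - 1}) - plogis(gamma - c_k),
# with the conventions c_0 = -Inf and c_5 = +Inf
ordinal_probs <- function(gamma, cut_points) {
  s <- plogis(gamma - cut_points)
  c(1 - s[1], -diff(s), s[length(s)])
}

# Larger affinities shift probability towards higher ratings
ordinal_probs(0, c(-2, -1, 1, 2)) # Neutral movie
ordinal_probs(2, c(-2, -1, 1, 2)) # Well-liked movie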
Initially we will assume that the customer behavior is homogeneous so that we need only a single set of cut points. On the other hand each movie needs to be modeled with a distinct affinity parameter.
Instead of modeling those parameters independently we will assume that our domain expertise about the movies is exchangeable so that we can model them hierarchically. To avoid degeneracy in the pair-wise comparisons we will need to anchor the population location to zero, the same anchor location used in the induced Dirichlet prior model. Finally we will begin with a monolithic non-centered parameterization given the sparsity of observed ratings.
model1.stan
functions {
  real induced_dirichlet_lpdf(vector c, vector alpha, real phi) {
    int K = num_elements(c) + 1;
    vector[K - 1] sigma = inv_logit(phi - c);
    vector[K] p;
    matrix[K, K] J = rep_matrix(0, K, K);

    // Induced ordinal probabilities
    p[1] = 1 - sigma[1];
    for (k in 2:(K - 1))
      p[k] = sigma[k - 1] - sigma[k];
    p[K] = sigma[K - 1];

    // Baseline column of Jacobian
    for (k in 1:K) J[k, 1] = 1;

    // Diagonal entries of Jacobian
    for (k in 2:K) {
      real rho = sigma[k - 1] * (1 - sigma[k - 1]);
      J[k, k] = - rho;
      J[k - 1, k] = + rho;
    }

    return dirichlet_lpdf(p | alpha)
           + log_determinant(J);
  }
}
data {
  int<lower=1> N_ratings;
  array[N_ratings] int<lower=1, upper=5> ratings;

  int<lower=1> N_customers;
  array[N_ratings] int<lower=1, upper=N_customers> customer_idxs;

  int<lower=1> N_movies;
  array[N_ratings] int<lower=1, upper=N_movies> movie_idxs;
}

parameters {
  vector[N_movies] gamma_ncp; // Non-centered movie affinities
  real<lower=0> tau_gamma;    // Movie affinity population scale

  ordered[4] cut_points;      // Customer rating cut points
}

transformed parameters {
  // Centered movie affinities
  vector[N_movies] gamma = tau_gamma * gamma_ncp;
}
model {
  // Prior model
  gamma_ncp ~ normal(0, 1);
  tau_gamma ~ normal(0, 5 / 2.57);
  cut_points ~ induced_dirichlet(rep_vector(1, 5), 0);

  // Observational model
  ratings ~ ordered_logistic(gamma[movie_idxs], cut_points);
}
generated quantities {
  array[N_ratings] int<lower=1, upper=5> rating_pred;

  array[N_customers] real mean_rating_customer_pred
    = rep_array(0, N_customers);
  array[N_customers] real var_rating_customer_pred
    = rep_array(0, N_customers);

  array[N_movies] real mean_rating_movie_pred = rep_array(0, N_movies);
  array[N_movies] real var_rating_movie_pred = rep_array(0, N_movies);

  matrix[N_movies, N_movies] covar_rating_movie_pred;

  {
    array[N_customers] real C = rep_array(0, N_customers);
    array[N_movies] real M = rep_array(0, N_movies);

    // Accumulate stratified predictive means and variances
    // with Welford's online updates
    for (n in 1:N_ratings) {
      real delta = 0;
      int c = customer_idxs[n];
      int m = movie_idxs[n];

      rating_pred[n] = ordered_logistic_rng(gamma[m], cut_points);

      C[c] += 1;
      delta = rating_pred[n] - mean_rating_customer_pred[c];
      mean_rating_customer_pred[c] += delta / C[c];
      var_rating_customer_pred[c]
        += delta * (rating_pred[n] - mean_rating_customer_pred[c]);

      M[m] += 1;
      delta = rating_pred[n] - mean_rating_movie_pred[m];
      mean_rating_movie_pred[m] += delta / M[m];
      var_rating_movie_pred[m]
        += delta * (rating_pred[n] - mean_rating_movie_pred[m]);
    }

    for (c in 1:N_customers) {
      if (C[c] > 1)
        var_rating_customer_pred[c] /= C[c] - 1;
      else
        var_rating_customer_pred[c] = 0;
    }

    for (m in 1:N_movies) {
      if (M[m] > 1)
        var_rating_movie_pred[m] /= M[m] - 1;
      else
        var_rating_movie_pred[m] = 0;
    }
  }

  {
    matrix[N_movies, N_movies] counts;

    for (m1 in 1:N_movies) {
      for (m2 in 1:N_movies) {
        counts[m1, m2] = 0;
        covar_rating_movie_pred[m1, m2] = 0;
      }
    }

    // Accumulate partial empirical covariances of the predicted
    // ratings, mirroring the computation on the observed data
    for (n1 in 1:N_ratings) {
      for (n2 in 1:N_ratings) {
        if (customer_idxs[n1] == customer_idxs[n2]) {
          int m1 = movie_idxs[n1];
          int m2 = movie_idxs[n2];
          real y = (rating_pred[n1] - mean_rating_movie_pred[m1])
                   * (rating_pred[n2] - mean_rating_movie_pred[m2]);
          covar_rating_movie_pred[m1, m2] += y;
          covar_rating_movie_pred[m2, m1] += y;
          counts[m1, m2] += 1;
          counts[m2, m1] += 1;
        }
      }
    }

    for (m1 in 1:N_movies) {
      for (m2 in 1:N_movies) {
        if (counts[m1, m2] > 1)
          covar_rating_movie_pred[m1, m2] /= counts[m1, m2] - 1;
      }
    }
  }
}
fit <- stan(file="stan_programs/model1.stan",
            data=data, seed=8438338,
            warmup=1000, iter=2024, refresh=0)
Despite the initial model complexity there are no diagnostic issues indicating suspect computation.
diagnostics <- util$extract_hmc_diagnostics(fit)
util$check_all_hmc_diagnostics(diagnostics)
All Hamiltonian Monte Carlo diagnostics are consistent with reliable
Markov chain Monte Carlo.
samples1 <- util$extract_expectand_vals(fit)
base_samples <- util$filter_expectands(samples1,
                                       c('gamma_ncp',
                                         'tau_gamma',
                                         'cut_points'),
                                       check_arrays=TRUE)
util$check_all_expectand_diagnostics(base_samples)
All expectands checked appear to be behaving well enough for reliable
Markov chain Monte Carlo estimation.
Consequently we’re ready to investigate this model’s retrodictive performance.
The model appears to be flexible enough to capture the behavior of the aggregate ratings.
par(mfrow=c(1, 1), mar=c(5, 5, 1, 1))
util$plot_hist_quantiles(samples1, 'rating_pred', -0.5, 6.5, 1,
                         baseline_values=data$ratings,
                         xlab="All Ratings")
On the other hand the retrodictive performance is much worse if we look at individual customer behaviors. In particular there is much more heterogeneity in the observed data than what the model can reproduce, which makes sense given that we explicitly assumed homogeneous customer behavior.
par(mfrow=c(2, 3), mar=c(5, 5, 1, 1))
for (c in c(7, 23, 40, 70, 77, 100)) {
  names <- sapply(which(data$customer_idxs == c),
                  function(n) paste0('rating_pred[', n, ']'))
  filtered_samples <- util$filter_expectands(samples1, names)

  customer_ratings <- data$ratings[data$customer_idxs == c]
  util$plot_hist_quantiles(filtered_samples, 'rating_pred',
                           -0.5, 6.5, 1,
                           baseline_values=customer_ratings,
                           xlab="Ratings",
                           main=paste('Customer', c))
}
On the other hand the model doesn’t seem to have a problem capturing the heterogeneity in the observed ratings across movies, at least for this quick spot check.
par(mfrow=c(2, 3), mar=c(5, 5, 1, 1))
for (m in c(33, 53, 61, 80, 117, 180)) {
  names <- sapply(which(data$movie_idxs == m),
                  function(n) paste0('rating_pred[', n, ']'))
  filtered_samples <- util$filter_expectands(samples1, names)

  movie_ratings <- data$ratings[data$movie_idxs == m]
  util$plot_hist_quantiles(filtered_samples, 'rating_pred',
                           -0.5, 6.5, 1,
                           baseline_values=movie_ratings,
                           xlab="Ratings",
                           main=paste('Movie', m))
}
To really investigate the variation in retrodictive performance, however, we need to look beyond just a few customers and movies. Instead we can examine the behavior of all customers and movies at the same time with histograms of stratified summary statistics.
Here we see that the posterior predictive behavior of the customer-wise means are more narrowly distributed than what we see in the observed data. At the same time the posterior predictive customer-wise variances concentrate at larger values than the observed customer-wise variances.
par(mfrow=c(2, 2), mar=c(5, 5, 1, 1))
util$plot_hist_quantiles(samples1, 'mean_rating_customer_pred',
                         0, 6, 0.5,
                         baseline_values=mean_rating_customer,
                         xlab="Customer-wise Average Ratings")
util$plot_hist_quantiles(samples1, 'mean_rating_movie_pred',
                         0, 6, 0.6,
                         baseline_values=mean_rating_movie,
                         xlab="Movie-wise Average Ratings")
util$plot_hist_quantiles(samples1, 'var_rating_customer_pred',
                         0, 7, 0.5,
                         baseline_values=var_rating_customer,
                         xlab="Customer-wise Rating Variances")
Warning in check_bin_containment(bin_min, bin_max, collapsed_values,
"predictive value"): 232 predictive values (0.1%) fell above the binning.
util$plot_hist_quantiles(samples1, 'var_rating_movie_pred',
                         0, 7, 0.5,
                         baseline_values=var_rating_movie,
                         xlab="Movie-wise Rating Variances")
Warning in check_bin_containment(bin_min, bin_max, collapsed_values,
"predictive value"): 1348 predictive values (0.2%) fell above the binning.
Finally the collection of selected movie covariances appears to be more heavy-tailed in the observed data relative to the posterior predictions.
par(mfrow=c(1, 1), mar=c(5, 5, 1, 1))
filtered_samples <-
  util$filter_expectands(samples1,
                         covar_rating_movie_filt_names)

util$plot_hist_quantiles(filtered_samples, 'covar_rating_movie_pred',
                         -4.25, 4.25, 0.25,
                         baseline_values=covar_rating_movie_filt,
                         xlab="Filtered Movie-wise Rating Covariances")
Warning in check_bin_containment(bin_min, bin_max, collapsed_values,
"predictive value"): 282 predictive values (0.0%) fell below the binning.
Warning in check_bin_containment(bin_min, bin_max, collapsed_values,
"predictive value"): 2310 predictive values (0.0%) fell above the binning.
4 Independent, Heterogeneous Customer Model
All of our retrodictive checks tell a consistent story – customers do not rate movies the same way as each other. Fortunately it’s straightforward to just model each customer’s idiosyncratic behavior with their own set of cut points.
model2.stan
functions {
  real induced_dirichlet_lpdf(vector c, vector alpha, real phi) {
    int K = num_elements(c) + 1;
    vector[K - 1] sigma = inv_logit(phi - c);
    vector[K] p;
    matrix[K, K] J = rep_matrix(0, K, K);

    // Induced ordinal probabilities
    p[1] = 1 - sigma[1];
    for (k in 2:(K - 1))
      p[k] = sigma[k - 1] - sigma[k];
    p[K] = sigma[K - 1];

    // Baseline column of Jacobian
    for (k in 1:K) J[k, 1] = 1;

    // Diagonal entries of Jacobian
    for (k in 2:K) {
      real rho = sigma[k - 1] * (1 - sigma[k - 1]);
      J[k, k] = - rho;
      J[k - 1, k] = + rho;
    }

    return dirichlet_lpdf(p | alpha)
           + log_determinant(J);
  }
}
data {
  int<lower=1> N_ratings;
  array[N_ratings] int<lower=1, upper=5> ratings;

  int<lower=1> N_customers;
  array[N_ratings] int<lower=1, upper=N_customers> customer_idxs;

  int<lower=1> N_movies;
  array[N_ratings] int<lower=1, upper=N_movies> movie_idxs;
}

parameters {
  vector[N_movies] gamma_ncp; // Non-centered movie qualities
  real<lower=0> tau_gamma;    // Movie quality population scale

  array[N_customers] ordered[4] cut_points; // Customer rating cut points
}

transformed parameters {
  vector[N_movies] gamma = tau_gamma * gamma_ncp;
}
model {
  // Prior model
  gamma_ncp ~ normal(0, 1);
  tau_gamma ~ normal(0, 5 / 2.57);

  for (c in 1:N_customers)
    cut_points[c] ~ induced_dirichlet(rep_vector(1, 5), 0);

  // Observational model
  for (n in 1:N_ratings) {
    int c = customer_idxs[n];
    int m = movie_idxs[n];
    ratings[n] ~ ordered_logistic(gamma[m], cut_points[c]);
  }
}
generated quantities {
  array[N_ratings] int<lower=1, upper=5> rating_pred;

  array[N_customers] real mean_rating_customer_pred
    = rep_array(0, N_customers);
  array[N_customers] real var_rating_customer_pred
    = rep_array(0, N_customers);

  array[N_movies] real mean_rating_movie_pred = rep_array(0, N_movies);
  array[N_movies] real var_rating_movie_pred = rep_array(0, N_movies);

  matrix[N_movies, N_movies] covar_rating_movie_pred;

  {
    array[N_customers] real C = rep_array(0, N_customers);
    array[N_movies] real M = rep_array(0, N_movies);

    // Accumulate stratified predictive means and variances
    // with Welford's online updates
    for (n in 1:N_ratings) {
      real delta = 0;
      int c = customer_idxs[n];
      int m = movie_idxs[n];

      rating_pred[n] = ordered_logistic_rng(gamma[m], cut_points[c]);

      C[c] += 1;
      delta = rating_pred[n] - mean_rating_customer_pred[c];
      mean_rating_customer_pred[c] += delta / C[c];
      var_rating_customer_pred[c]
        += delta * (rating_pred[n] - mean_rating_customer_pred[c]);

      M[m] += 1;
      delta = rating_pred[n] - mean_rating_movie_pred[m];
      mean_rating_movie_pred[m] += delta / M[m];
      var_rating_movie_pred[m]
        += delta * (rating_pred[n] - mean_rating_movie_pred[m]);
    }

    for (c in 1:N_customers) {
      if (C[c] > 1)
        var_rating_customer_pred[c] /= C[c] - 1;
      else
        var_rating_customer_pred[c] = 0;
    }

    for (m in 1:N_movies) {
      if (M[m] > 1)
        var_rating_movie_pred[m] /= M[m] - 1;
      else
        var_rating_movie_pred[m] = 0;
    }
  }

  {
    matrix[N_movies, N_movies] counts;

    for (m1 in 1:N_movies) {
      for (m2 in 1:N_movies) {
        counts[m1, m2] = 0;
        covar_rating_movie_pred[m1, m2] = 0;
      }
    }

    // Accumulate partial empirical covariances of the predicted
    // ratings, mirroring the computation on the observed data
    for (n1 in 1:N_ratings) {
      for (n2 in 1:N_ratings) {
        if (customer_idxs[n1] == customer_idxs[n2]) {
          int m1 = movie_idxs[n1];
          int m2 = movie_idxs[n2];
          real y = (rating_pred[n1] - mean_rating_movie_pred[m1])
                   * (rating_pred[n2] - mean_rating_movie_pred[m2]);
          covar_rating_movie_pred[m1, m2] += y;
          covar_rating_movie_pred[m2, m1] += y;
          counts[m1, m2] += 1;
          counts[m2, m1] += 1;
        }
      }
    }

    for (m1 in 1:N_movies) {
      for (m2 in 1:N_movies) {
        if (counts[m1, m2] > 1)
          covar_rating_movie_pred[m1, m2] /= counts[m1, m2] - 1;
      }
    }
  }
}
fit <- stan(file="stan_programs/model2.stan",
            data=data, seed=8438338,
            warmup=1000, iter=2024, refresh=0)
Frustratingly the computation suffers from a few stray divergences.
diagnostics <- util$extract_hmc_diagnostics(fit)
util$check_all_hmc_diagnostics(diagnostics)
Chain 1: 1 of 1024 transitions (0.1%) diverged.
Chain 2: 6 of 1024 transitions (0.6%) diverged.
Chain 3: 2 of 1024 transitions (0.2%) diverged.
Chain 4: 1 of 1024 transitions (0.1%) diverged.
Divergent Hamiltonian transitions result from unstable numerical
trajectories. These instabilities are often due to degenerate target
geometry, especially "pinches". If there are only a small number of
divergences then running with adapt_delta larger than 0.801 may reduce
the instabilities at the cost of more expensive Hamiltonian
transitions.
samples2 <- util$extract_expectand_vals(fit)
base_samples <- util$filter_expectands(samples2,
                                       c('gamma_ncp',
                                         'tau_gamma',
                                         'cut_points'),
                                       check_arrays=TRUE)
util$check_all_expectand_diagnostics(base_samples)
All expectands checked appear to be behaving well enough for reliable
Markov chain Monte Carlo estimation.
One possibility is that the replicated cut points are messing with the hierarchical geometry of the movie affinities. The movie affinities most susceptible to degenerate behavior in a non-centered parameterization are those informed by the most ratings, however, and those do not exhibit any obvious geometric pathologies.
idxs <- as.numeric(names(tail(sort(table(data$movie_idxs)), 9)))
names <- sapply(idxs, function(m) paste0('gamma[', m, ']'))
util$plot_div_pairs(names, 'tau_gamma', samples2, diagnostics)
At this point we could spend some time investigating for any degenerate behavior between the cut points that could be causing problems, or even between some cut points and some movie affinities. Given the small number of divergences, however, let’s just run again with a less aggressive step size adaptation and cope with the increased computational cost. We can always come back to this investigation later if this ends up being our final model.
fit <- stan(file="stan_programs/model2.stan",
            data=data, seed=8438338,
            warmup=1000, iter=2024, refresh=0,
            control=list('adapt_delta'=0.9))
Fortunately that seems to have done the trick and now our diagnostics are squeaky clean.
diagnostics <- util$extract_hmc_diagnostics(fit)
util$check_all_hmc_diagnostics(diagnostics)
All Hamiltonian Monte Carlo diagnostics are consistent with reliable
Markov chain Monte Carlo.
samples2 <- util$extract_expectand_vals(fit)
base_samples <- util$filter_expectands(samples2,
                                       c('gamma_ncp',
                                         'tau_gamma',
                                         'cut_points'),
                                       check_arrays=TRUE)
util$check_all_expectand_diagnostics(base_samples)
All expectands checked appear to be behaving well enough for reliable
Markov chain Monte Carlo estimation.
Has our retrodictive performance improved?
Interestingly the behavior of the aggregate ratings isn’t quite as consistent between the observed data and posterior predictions as it was before. That said the increased retrodictive tension isn’t necessarily large enough to be a concern yet.
par(mfrow=c(1, 1), mar=c(5, 5, 1, 1))
util$plot_hist_quantiles(samples2, 'rating_pred', -0.5, 6.5, 1,
                         baseline_values=data$ratings,
                         xlab="All Ratings")
More importantly the retrodictive performance for the customers that we’ve spot checked is much better.
par(mfrow=c(2, 3), mar=c(5, 5, 1, 1))
for (c in c(7, 23, 40, 70, 77, 100)) {
  names <- sapply(which(data$customer_idxs == c),
                  function(n) paste0('rating_pred[', n, ']'))
  filtered_samples <- util$filter_expectands(samples2, names)

  customer_ratings <- data$ratings[data$customer_idxs == c]
  util$plot_hist_quantiles(filtered_samples, 'rating_pred',
                           -0.5, 6.5, 1,
                           baseline_values=customer_ratings,
                           xlab="Ratings",
                           main=paste('Customer', c))
}
This improvement in the customer-wise behaviors hasn’t come at any cost to the movie-wise retrodictive performance, at least for this limited spot check.
par(mfrow=c(2, 3), mar=c(5, 5, 1, 1))
for (m in c(33, 53, 61, 80, 117, 180)) {
  names <- sapply(which(data$movie_idxs == m),
                  function(n) paste0('rating_pred[', n, ']'))
  filtered_samples <- util$filter_expectands(samples2, names)

  movie_ratings <- data$ratings[data$movie_idxs == m]
  util$plot_hist_quantiles(filtered_samples, 'rating_pred',
                           -0.5, 6.5, 1,
                           baseline_values=movie_ratings,
                           xlab="Ratings",
                           main=paste('Movie', m))
}
Do the histograms of stratified summary statistics tell a similar story? The retrodictive tension in the customer-wise means and variances has decreased, although some remains in the variances.
par(mfrow=c(2, 2), mar=c(5, 5, 1, 1))
util$plot_hist_quantiles(samples2, 'mean_rating_customer_pred',
                         0, 6, 0.5,
                         baseline_values=mean_rating_customer,
                         xlab="Customer-wise Average Ratings")
util$plot_hist_quantiles(samples2, 'mean_rating_movie_pred',
                         0, 6, 0.6,
                         baseline_values=mean_rating_movie,
                         xlab="Movie-wise Average Ratings")
util$plot_hist_quantiles(samples2, 'var_rating_customer_pred',
                         0, 7, 0.5,
                         baseline_values=var_rating_customer,
                         xlab="Customer-wise Rating Variances")
Warning in check_bin_containment(bin_min, bin_max, collapsed_values,
"predictive value"): 390 predictive values (0.1%) fell above the binning.
util$plot_hist_quantiles(samples2, 'var_rating_movie_pred',
                         0, 7, 0.5,
                         baseline_values=var_rating_movie,
                         xlab="Movie-wise Rating Variances")
Warning in check_bin_containment(bin_min, bin_max, collapsed_values,
"predictive value"): 2586 predictive values (0.3%) fell above the binning.
Moreover the retrodictive tension in the collection of selected movie covariances remains.
par(mfrow=c(1, 1), mar=c(5, 5, 1, 1))
filtered_samples <-
  util$filter_expectands(samples2,
                         covar_rating_movie_filt_names)

util$plot_hist_quantiles(filtered_samples, 'covar_rating_movie_pred',
                         -4.25, 4.25, 0.25,
                         baseline_values=covar_rating_movie_filt,
                         xlab="Filtered Movie-wise Rating Covariances")
Warning in check_bin_containment(bin_min, bin_max, collapsed_values,
"predictive value"): 363 predictive values (0.0%) fell below the binning.
Warning in check_bin_containment(bin_min, bin_max, collapsed_values,
"predictive value"): 280 predictive values (0.0%) fell above the binning.
If we were content with these mild retrodictive tensions then we would move on to exploring the posterior inferences themselves. For example we could visualize the marginal posterior inferences for the cut points of each customer.
par(mfrow=c(4, 1), mar=c(5, 5, 1, 1))
for (k in 1:4) {
  names <- sapply(1:data$N_customers,
                  function(r) paste0('cut_points[', r, ',', k, ']'))
  util$plot_disc_pushforward_quantiles(samples2, names,
                                       xlab="Customer",
                                       display_ylim=c(-6, 6),
                                       ylab=paste0('cut_point[', k, ']'))
}
The component cut points are a bit more interpretable when visualized together, especially when comparing customers. For example to compare Customer 23 and Customer 70 we might overlay the marginal posterior visualizations of the four component cut points of each customer in adjacent plots.
par(mfrow=c(2, 1), mar=c(5, 5, 1, 1))
cols <- c(util$c_mid, util$c_mid_highlight,
          util$c_dark, util$c_dark_highlight)

c <- 23

k <- 1
name <- paste0('cut_points[', c, ',', k, ']')
util$plot_expectand_pushforward(samples2[[name]],
                                50, flim=c(-9, 9), ylim=c(0, 2),
                                col=cols[k], display_name='Cut Points',
                                main=paste('Customer', c))
for (k in 2:4) {
  name <- paste0('cut_points[', c, ',', k, ']')
  util$plot_expectand_pushforward(samples2[[name]],
                                  50, flim=c(-9, 9),
                                  col=cols[k], border="#BBBBBB88",
                                  add=TRUE)
}

text(0, 1.65, "cut_points[1]", col=util$c_mid)
text(2, 1.4, "cut_points[2]", col=util$c_mid_highlight)
text(4, 1, "cut_points[3]", col=util$c_dark)
text(6, 0.5, "cut_points[4]", col=util$c_dark_highlight)
c <- 70

k <- 1
name <- paste0('cut_points[', c, ',', k, ']')
util$plot_expectand_pushforward(samples2[[name]],
                                50, flim=c(-9, 9), ylim=c(0, 2),
                                col=cols[k], display_name='Cut Points',
                                main=paste('Customer', c))
Warning in util$plot_expectand_pushforward(samples2[[name]], 50, flim = c(-9, :
12 values (0.3%) fell below the histogram binning.
Warning in util$plot_expectand_pushforward(samples2[[name]], 50, flim = c(-9, :
0 values (0.0%) fell above the histogram binning.
for (k in 2:4) {
  name <- paste0('cut_points[', c, ',', k, ']')
  util$plot_expectand_pushforward(samples2[[name]],
                                  50, flim=c(-9, 9),
                                  col=cols[k], border="#BBBBBB88",
                                  add=TRUE)
}

text(-5.5, 0.35, "cut_points[1]", col=util$c_mid)
text(-3, 0.75, "cut_points[2]", col=util$c_mid_highlight)
text(-2, 1.0, "cut_points[3]", col=util$c_dark)
text(1.0, 1.25, "cut_points[4]", col=util$c_dark_highlight)
This allows us to see that Customer 23 is pretty stingy; a movie affinity needs to be pretty large in order for the probability of a large rating to be non-negligible. On the other hand Customer 70 is much more generous; they are likely to give even a mediocre movie a high rating.
At the same time we can investigate the inferred affinities for each movie.
par(mfrow=c(2, 1), mar=c(5, 5, 1, 1))
util$plot_expectand_pushforward(samples2[['tau_gamma']],
                                25, flim=c(0, 1),
                                display_name='tau_gamma')

names <- sapply(1:data$N_movies,
                function(m) paste0('gamma[', m, ']'))
util$plot_disc_pushforward_quantiles(samples2, names,
                                     xlab="Movie",
                                     ylab="Affinity")
While there is a lot of uncertainty, not surprising given the limited data, we can definitely see some trends. For example the concentration of the tau_gamma
marginal posterior distribution away from zero indicates substantial variation in the movie affinities. Moreover despite the uncertainties we can clearly differentiate the best movies from the worst movies. We’ll do even more with these movie affinity inferences in the next section.
5 Hierarchical Customer Model
At this point we need to address a stark asymmetry in the last model. Although we’re accounting for heterogeneity in both customer and movie behaviors, we’re modeling only the latter hierarchically. We don’t have any domain expertise that obstructs the exchangeability of the customers so why don’t we model the individual cut points hierarchically as well? All we need is an appropriate multivariate population model.
Conveniently the induced Dirichlet prior naturally composes with a hyper Dirichlet population model that I discussed in my die fairness case study. This makes for a natural cut point population model that pools each customer’s ratings to a common distribution of rating probabilities.
That said this is not the most general hierarchical model that we might consider. This model assumes that the heterogeneity in the cut points is independent of the heterogeneity in the movie affinities, while more generally those heterogeneities could be coupled together. I think, however, that it is more than reasonable to assume that how each customer translates their preferences into a movie rating is independent of those preferences themselves.
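Before looking at the Stan program it's worth getting a feel for the population model that it implements, where each customer's baseline rating simplex is drawn from dirichlet(mu_q / tau_q + 1) so that smaller values of tau_q concentrate the simplices around mu_q. A minimal prior simulation sketch, with illustrative values only:

# Draw one simplex from a Dirichlet distribution via gamma variates
rdirichlet_one <- function(alpha) {
  g <- rgamma(length(alpha), shape=alpha)
  g / sum(g)
}

mu_q <- c(0.10, 0.15, 0.25, 0.30, 0.20) # Illustrative population location

# Small tau_q: customer simplices cluster tightly around mu_q
round(t(replicate(3, rdirichlet_one(mu_q / 0.01 + 1))), 2)

# Larger tau_q: customer simplices scatter widely around mu_q
round(t(replicate(3, rdirichlet_one(mu_q / 1 + 1))), 2)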
model3.stan
functions {
  real induced_dirichlet_lpdf(vector c, vector alpha, real phi) {
    int K = num_elements(c) + 1;
    vector[K - 1] sigma = inv_logit(phi - c);
    vector[K] p;
    matrix[K, K] J = rep_matrix(0, K, K);

    // Induced ordinal probabilities
    p[1] = 1 - sigma[1];
    for (k in 2:(K - 1))
      p[k] = sigma[k - 1] - sigma[k];
    p[K] = sigma[K - 1];

    // Baseline column of Jacobian
    for (k in 1:K) J[k, 1] = 1;

    // Diagonal entries of Jacobian
    for (k in 2:K) {
      real rho = sigma[k - 1] * (1 - sigma[k - 1]);
      J[k, k] = - rho;
      J[k - 1, k] = + rho;
    }

    return dirichlet_lpdf(p | alpha)
           + log_determinant(J);
  }
}
data {
  int<lower=1> N_ratings;
  array[N_ratings] int<lower=1, upper=5> ratings;

  int<lower=1> N_customers;
  array[N_ratings] int<lower=1, upper=N_customers> customer_idxs;

  int<lower=1> N_movies;
  array[N_ratings] int<lower=1, upper=N_movies> movie_idxs;
}

parameters {
  vector[N_movies] gamma_ncp; // Non-centered movie affinities
  real<lower=0> tau_gamma;    // Movie affinity population scale

  array[N_customers] ordered[4] cut_points; // Customer rating cut points

  simplex[5] mu_q;     // Rating simplex population location
  real<lower=0> tau_q; // Rating simplex population scale
}

transformed parameters {
  // Centered movie affinities
  vector[N_movies] gamma = tau_gamma * gamma_ncp;
}
model {
  vector[5] ones = rep_vector(1, 5);
  vector[5] alpha = mu_q / tau_q + ones;

  // Prior model
  gamma_ncp ~ normal(0, 1);
  tau_gamma ~ normal(0, 5 / 2.57);

  mu_q ~ dirichlet(5 * ones);
  tau_q ~ normal(0, 1);

  for (c in 1:N_customers)
    cut_points[c] ~ induced_dirichlet(alpha, 0);

  // Observational model
  for (n in 1:N_ratings) {
    int c = customer_idxs[n];
    int m = movie_idxs[n];
    ratings[n] ~ ordered_logistic(gamma[m], cut_points[c]);
  }
}
generated quantities {
  array[N_ratings] int<lower=1, upper=5> rating_pred;

  array[N_customers] real mean_rating_customer_pred
    = rep_array(0, N_customers);
  array[N_customers] real var_rating_customer_pred
    = rep_array(0, N_customers);

  array[N_movies] real mean_rating_movie_pred = rep_array(0, N_movies);
  array[N_movies] real var_rating_movie_pred = rep_array(0, N_movies);

  matrix[N_movies, N_movies] covar_rating_movie_pred;

  {
    array[N_customers] real C = rep_array(0, N_customers);
    array[N_movies] real M = rep_array(0, N_movies);

    // Accumulate stratified predictive means and variances
    // with Welford's online updates
    for (n in 1:N_ratings) {
      real delta = 0;
      int c = customer_idxs[n];
      int m = movie_idxs[n];

      rating_pred[n] = ordered_logistic_rng(gamma[m], cut_points[c]);

      C[c] += 1;
      delta = rating_pred[n] - mean_rating_customer_pred[c];
      mean_rating_customer_pred[c] += delta / C[c];
      var_rating_customer_pred[c]
        += delta * (rating_pred[n] - mean_rating_customer_pred[c]);

      M[m] += 1;
      delta = rating_pred[n] - mean_rating_movie_pred[m];
      mean_rating_movie_pred[m] += delta / M[m];
      var_rating_movie_pred[m]
        += delta * (rating_pred[n] - mean_rating_movie_pred[m]);
    }

    for (c in 1:N_customers) {
      if (C[c] > 1)
        var_rating_customer_pred[c] /= C[c] - 1;
      else
        var_rating_customer_pred[c] = 0;
    }

    for (m in 1:N_movies) {
      if (M[m] > 1)
        var_rating_movie_pred[m] /= M[m] - 1;
      else
        var_rating_movie_pred[m] = 0;
    }
  }

  {
    matrix[N_movies, N_movies] counts;

    for (m1 in 1:N_movies) {
      for (m2 in 1:N_movies) {
        counts[m1, m2] = 0;
        covar_rating_movie_pred[m1, m2] = 0;
      }
    }

    // Accumulate partial empirical covariances of the predicted
    // ratings, mirroring the computation on the observed data
    for (n1 in 1:N_ratings) {
      for (n2 in 1:N_ratings) {
        if (customer_idxs[n1] == customer_idxs[n2]) {
          int m1 = movie_idxs[n1];
          int m2 = movie_idxs[n2];
          real y = (rating_pred[n1] - mean_rating_movie_pred[m1])
                   * (rating_pred[n2] - mean_rating_movie_pred[m2]);
          covar_rating_movie_pred[m1, m2] += y;
          covar_rating_movie_pred[m2, m1] += y;
          counts[m1, m2] += 1;
          counts[m2, m1] += 1;
        }
      }
    }

    for (m1 in 1:N_movies) {
      for (m2 in 1:N_movies) {
        if (counts[m1, m2] > 1)
          covar_rating_movie_pred[m1, m2] /= counts[m1, m2] - 1;
      }
    }
  }
}
fit <- stan(file="stan_programs/model3.stan",
            data=data, seed=8438338,
            warmup=1000, iter=2024, refresh=0)
The hierarchical model over the cut points has already proved useful – the mild computational issues that we encountered in the last model fit have vanished.
diagnostics <- util$extract_hmc_diagnostics(fit)
util$check_all_hmc_diagnostics(diagnostics)
All Hamiltonian Monte Carlo diagnostics are consistent with reliable
Markov chain Monte Carlo.
samples3 <- util$extract_expectand_vals(fit)
base_samples <- util$filter_expectands(samples3,
                                       c('gamma_ncp',
                                         'tau_gamma',
                                         'cut_points',
                                         'mu_q', 'tau_q'),
                                       check_arrays=TRUE)
util$check_all_expectand_diagnostics(base_samples)
All expectands checked appear to be behaving well enough for reliable
Markov chain Monte Carlo estimation.
Overall the retrodictive performance is a little bit better than in the previous model. In particular the agreements in the aggregate ratings histogram and stratified covariances histogram have improved slightly. On the other hand the retrodictive tension in the stratified variances stubbornly persists. We shouldn’t expect too much improvement, however, given that the hierarchical structure doesn’t add any flexibility to the customer behaviors beyond what we had already included.
par(mfrow=c(1, 1), mar=c(5, 5, 1, 1))
util$plot_hist_quantiles(samples3, 'rating_pred', -0.5, 6.5, 1,
                         baseline_values=data$ratings,
                         xlab="All Ratings")
par(mfrow=c(2, 3), mar=c(5, 5, 1, 1))
for (c in c(7, 23, 40, 70, 77, 100)) {
  names <- sapply(which(data$customer_idxs == c),
                  function(n) paste0('rating_pred[', n, ']'))
  filtered_samples <- util$filter_expectands(samples3, names)

  customer_ratings <- data$ratings[data$customer_idxs == c]
  util$plot_hist_quantiles(filtered_samples, 'rating_pred',
                           -0.5, 6.5, 1,
                           baseline_values=customer_ratings,
                           xlab="Ratings",
                           main=paste('Customer', c))
}
par(mfrow=c(2, 3), mar=c(5, 5, 1, 1))
for (m in c(33, 53, 61, 80, 117, 180)) {
  names <- sapply(which(data$movie_idxs == m),
                  function(n) paste0('rating_pred[', n, ']'))
  filtered_samples <- util$filter_expectands(samples3, names)

  movie_ratings <- data$ratings[data$movie_idxs == m]
  util$plot_hist_quantiles(filtered_samples, 'rating_pred',
                           -0.5, 6.5, 1,
                           baseline_values=movie_ratings,
                           xlab="Ratings",
                           main=paste('Movie', m))
}
par(mfrow=c(2, 2), mar=c(5, 5, 1, 1))
util$plot_hist_quantiles(samples3, 'mean_rating_customer_pred',
                         0, 6, 0.5,
                         baseline_values=mean_rating_customer,
                         xlab="Customer-wise Average Ratings")
util$plot_hist_quantiles(samples3, 'mean_rating_movie_pred',
                         0, 6, 0.6,
                         baseline_values=mean_rating_movie,
                         xlab="Movie-wise Average Ratings")
util$plot_hist_quantiles(samples3, 'var_rating_customer_pred',
                         0, 7, 0.5,
                         baseline_values=var_rating_customer,
                         xlab="Customer-wise Rating Variances")
Warning in check_bin_containment(bin_min, bin_max, collapsed_values,
"predictive value"): 234 predictive values (0.1%) fell above the binning.
util$plot_hist_quantiles(samples3, 'var_rating_movie_pred',
                         0, 7, 0.5,
                         baseline_values=var_rating_movie,
                         xlab="Movie-wise Rating Variances")
Warning in check_bin_containment(bin_min, bin_max, collapsed_values,
"predictive value"): 2222 predictive values (0.3%) fell above the binning.
par(mfrow=c(1, 1), mar=c(5, 5, 1, 1))
filtered_samples <-
  util$filter_expectands(samples3,
                         covar_rating_movie_filt_names)

util$plot_hist_quantiles(filtered_samples, 'covar_rating_movie_pred',
                         -4.25, 4.25, 0.25,
                         baseline_values=covar_rating_movie_filt,
                         xlab="Filtered Movie-wise Rating Covariances")
Warning in check_bin_containment(bin_min, bin_max, collapsed_values,
"predictive value"): 286 predictive values (0.0%) fell below the binning.
Warning in check_bin_containment(bin_min, bin_max, collapsed_values,
"predictive value"): 410 predictive values (0.0%) fell above the binning.
If we were satisfied with this retrodictive performance then there are already many ways that we can explore and utilize these inferences.
With the new hierarchical structure we can investigate what we learned about not only each individual customer but also the population of customers. The marginal posterior distribution for tau_q
suggests a small but non-negligible heterogeneity in the cut points. Moreover the inferences for the baseline probabilities mu_q
suggest that customers are relatively optimistic, with four star ratings being more probable than three star ratings for a neutral movie with zero affinity.
par(mfrow=c(2, 1), mar=c(5, 5, 1, 1))
util$plot_expectand_pushforward(samples3[['tau_q']],
                                25, flim=c(0, 0.3),
                                display_name='tau_q')

names <- sapply(1:5, function(k) paste0('mu_q[', k, ']'))
util$plot_disc_pushforward_quantiles(samples3, names,
                                     xlab="Rating",
                                     ylab="Baseline Rating Probability")
The regularizing influence of the hierarchical model is strongest for those customers with the fewest ratings. For example Customer 42 has only three observed ratings, and the hierarchical inferences for their cut points shift and narrow pretty substantially relative to the inferences from the previous model.
c <- 42
table(data$customer_idxs)[c]
42
3
par(mfrow=c(2, 2), mar=c(5, 5, 1, 1))
lab2_xs <- c(1.5, 2, -4, -3)
lab2_ys <- c(0.15, 0.25, 0.35, 0.35)

lab3_xs <- c(-6, -5, 3, 3.5)
lab3_ys <- c(0.25, 0.4, 0.5, 0.6)

for (k in 1:4) {
  name <- paste0('cut_points[', c, ',', k, ']')
  util$plot_expectand_pushforward(samples2[[name]],
                                  50, flim=c(-10, 5), ylim=c(0, 1.0),
                                  col=util$c_light,
                                  display_name='Cut Points',
                                  main=paste('Customer', c))

  name <- paste0('cut_points[', c, ',', k, ']')
  util$plot_expectand_pushforward(samples3[[name]],
                                  50, flim=c(-10, 5),
                                  col=util$c_dark, border="#BBBBBB88",
                                  add=TRUE)

  text(lab2_xs[k], lab2_ys[k], "Model 2", col=util$c_light)
  text(lab3_xs[k], lab3_ys[k], "Model 3", col=util$c_dark)
}
Warning in util$plot_expectand_pushforward(samples2[[name]], 50, flim = c(-10,
: 3 values (0.1%) fell below the histogram binning.
Warning in util$plot_expectand_pushforward(samples2[[name]], 50, flim = c(-10,
: 0 values (0.0%) fell above the histogram binning.
Warning in util$plot_expectand_pushforward(samples3[[name]], 50, flim = c(-10,
: 1 value (0.0%) fell below the histogram binning.
Warning in util$plot_expectand_pushforward(samples3[[name]], 50, flim = c(-10,
: 0 value (0.0%) fell above the histogram binning.
We can also see the impact of the hierarchical model if we examine the cut point inferences for the two models across all customers. The more uncertain customer inferences in the previous model, especially for the first and last cut points, are pulled towards the other customer inferences.
par(mfrow=c(4, 1), mar=c(5, 5, 1, 1))
for (k in 1:4) {
  names <- sapply(1:data$N_customers,
                  function(c) paste0('cut_points[', c, ',', k, ']'))
  yname <- paste0('cut_points[', k, ']')
  util$plot_disc_pushforward_quantiles(samples3, names,
                                       xlab="Customer",
                                       display_ylim=c(-6, 6),
                                       ylab=yname)
}
The hierarchical influence, however, doesn’t change the qualitative details. For example Customer 23 is still stingy with their ratings while Customer 70 is still generous.
par(mfrow=c(2, 1), mar=c(5, 5, 1, 1))
cols <- c(util$c_mid, util$c_mid_highlight,
          util$c_dark, util$c_dark_highlight)

c <- 23

k <- 1
name <- paste0('cut_points[', c, ',', k, ']')
util$plot_expectand_pushforward(samples3[[name]],
                                50, flim=c(-9, 9), ylim=c(0, 2),
                                col=cols[k], display_name='Cut Points',
                                main=paste('Customer', c))
for (k in 2:4) {
  name <- paste0('cut_points[', c, ',', k, ']')
  util$plot_expectand_pushforward(samples3[[name]],
                                  50, flim=c(-9, 9),
                                  col=cols[k], border="#BBBBBB88",
                                  add=TRUE)
}

text(0, 1.65, "cut_points[1]", col=util$c_mid)
text(2, 1.45, "cut_points[2]", col=util$c_mid_highlight)
text(4, 1, "cut_points[3]", col=util$c_dark)
text(6, 0.5, "cut_points[4]", col=util$c_dark_highlight)
c <- 70

k <- 1
name <- paste0('cut_points[', c, ',', k, ']')
util$plot_expectand_pushforward(samples3[[name]],
                                50, flim=c(-9, 9), ylim=c(0, 2),
                                col=cols[k], display_name='Cut Points',
                                main=paste('Customer', c))
Warning in util$plot_expectand_pushforward(samples3[[name]], 50, flim = c(-9, :
10 values (0.2%) fell below the histogram binning.
Warning in util$plot_expectand_pushforward(samples3[[name]], 50, flim = c(-9, :
0 values (0.0%) fell above the histogram binning.
for (k in 2:4) {
  name <- paste0('cut_points[', c, ',', k, ']')
  util$plot_expectand_pushforward(samples3[[name]],
                                  50, flim=c(-9, 9),
                                  col=cols[k], border="#BBBBBB88",
                                  add=TRUE)
}

text(-5.5, 0.4, "cut_points[1]", col=util$c_mid)
text(-3, 0.85, "cut_points[2]", col=util$c_mid_highlight)
text(-2, 1.15, "cut_points[3]", col=util$c_dark)
text(1.0, 1.35, "cut_points[4]", col=util$c_dark_highlight)
Of course we can still investigate the behavior of individual movies and the hierarchical population of movies. The hierarchical regularization of the cut points allows the observed ratings to slightly better inform the movie affinities.
par(mfrow=c(2, 1), mar=c(5, 5, 1, 1))
util$plot_expectand_pushforward(samples3[['tau_gamma']],
                                50, flim=c(0, 1),
                                display_name='tau_gamma')

names <- sapply(1:data$N_movies,
                function(m) paste0('gamma[', m, ']'))
util$plot_disc_pushforward_quantiles(samples3, names,
                                     xlab="Movie",
                                     ylab="Affinity")
Now let’s go one step further than we did in the previous section and use these movie inferences to rank the movies by their expected affinities. This is just one heuristic for ranking items based on their inferred qualities, but one that has the advantage of being relatively fast to compute.
expected_affinity <- function(m) {
  util$ensemble_mcmc_est(samples3[[paste0('gamma[', m, ']')]])[1]
}

expected_affinities <- sapply(1:data$N_movies,
                              function(m) expected_affinity(m))
post_mean_ordering <- sort(expected_affinities, index.return=TRUE)$ix
We can then use this ranking to select out the five worst movies.
print(data.frame("Rank"=200:196,
"Movie"=head(post_mean_ordering, 5)),
row.names=FALSE)
Rank Movie
200 31
199 159
198 13
197 40
196 175
To be a bit less pessimistic we could also consider the five best movies.
print(data.frame("Rank"=5:1,
"Movie"=tail(post_mean_ordering, 5)),
row.names=FALSE)
Rank Movie
5 193
4 33
3 117
2 44
1 180
We have a variety of ways to make inferential comparisons between two movies at a time. For example we could just overlay the marginal posterior distributions for each movie affinity.
par(mfrow=c(1, 1), mar=c(5, 5, 1, 1))
m1 <- head(post_mean_ordering, 1)
name <- paste0('gamma[', m1, ']')
util$plot_expectand_pushforward(samples3[[name]],
                                50, flim=c(-3, 3),
                                ylim=c(0, 1.3),
                                col=util$c_mid,
                                display_name='Affinity')
text(-1.25, 1.2, paste('Movie', m1), col=util$c_mid)

m2 <- tail(post_mean_ordering, 1)
name <- paste0('gamma[', m2, ']')
util$plot_expectand_pushforward(samples3[[name]],
                                50, flim=c(-3, 3),
                                col=util$c_dark,
                                border="#BBBBBB88",
                                add=TRUE)
text(1.25, 1.2, paste('Movie', m2), col=util$c_dark)
Even better we can directly compute the probability that one movie affinity is larger than the other. Here there is little ambiguity about whether or not the top ranked movie is actually better than the worst ranked movie.
var_repl <- list('g1' = paste0('gamma[', m1, ']'),
                 'g2' = paste0('gamma[', m2, ']'))
p_est <-
  util$implicit_subset_prob(samples3,
                            function(g1, g2) g1 < g2,
                            var_repl)

format_string <- paste0("Posterior probability that movie %i affinity ",
                        "< movie %i affinity = %.3f +/- %.3f.")
cat(sprintf(format_string, m1, m2, p_est[1], 2 * p_est[2]))
Posterior probability that movie 31 affinity < movie 180 affinity = 1.000 +/- 0.000.
6 Hierarchical Customer Model With Heterogeneous Affinities
At this point there are two main limitations with the last model. Firstly there is the mild retrodictive tension in a few of the summary statistics. Secondly it makes the strong assumption that once the different interpretations of ratings are taken into account all customers have the same opinions about each movie.
This for example prevents us from making personalized recommendations to each customer. Incorporating individual tastes into the model would inform much more nuanced inferences and predictions. It could even resolve some of the retrodictive tension.
The immediate challenge with trying to infer personal movie preferences, however, is that customers rate only a very small proportion of the available movies.
xs <- seq(1, data$N_movies, 1)
ys <- seq(1, data$N_customers, 1)
zs <- matrix(0, nrow=data$N_movies, ncol=data$N_customers)

for (n in 1:data$N_ratings) {
  zs[data$movie_idxs[n], data$customer_idxs[n]] <- 1
}

par(mfrow=c(1, 1), mar = c(5, 5, 1, 1))
image(xs, ys, zs, col=c("white", util$c_dark_teal),
      xlab="Movie", ylab="Customer")
In the machine learning literature the problem of filling in unobserved pairs, like customer-movie ratings in this application, is often known as “matrix completion” in analogy to filling in the missing cells of this visualization.
Because most of the ratings are unobserved the only way to inform individual customer preferences for each movie is to pool correlations in the movie affinities across customers. For example we might model each customer's movie affinities as common baseline affinities plus individual deviations, \boldsymbol{\gamma}_{c} = \boldsymbol{\gamma}_{0} + \boldsymbol{\delta}_{c}. We can then pool these individual deviations together with a multivariate normal population model, p(\boldsymbol{\delta}_{c}) = \text{multi-normal}( \mathbf{0}, \boldsymbol{\Sigma} ) with \boldsymbol{\Sigma} = \text{diag}(\boldsymbol{\tau}_{\gamma}) \cdot \boldsymbol{\Phi}_{\gamma} \cdot \text{diag}(\boldsymbol{\tau}_{\gamma}). Whatever we learn about the population baseline \boldsymbol{\gamma}_{0}, the population scales \boldsymbol{\tau}_{\gamma}, and the population correlations \boldsymbol{\Phi}_{\gamma} then fills in whatever elements of any affinity vector \boldsymbol{\gamma}_{c} might be unobserved.
Because the movies are still exchangeable we can model the baseline movie affinities hierarchically as well, p( \gamma_{0, m} ) = \text{normal}(0, \tau_{\gamma_{0}} ). We’re now modeling the cut points, the baseline movie affinities, and the individual customer affinities hierarchically all at the same time. Pretty neat.
Given the sparsity of observed ratings we’ll implement the normal and multivariate normal hierarchical models with non-centered parameterizations.
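Concretely the non-centered parameterization draws independent standard normal variates and then colors them with the Cholesky factor of the population covariance, exactly as in the transformed parameters block of the Stan program below. A small R sketch, with illustrative dimensions and values:

# Non-centered draw from multi-normal(0, Sigma) with
# Sigma = diag(tau) %*% Phi %*% diag(tau)
N <- 3
tau <- c(0.5, 1.0, 1.5)                   # Population scales
Phi <- matrix(c( 1.0,  0.3,  0.0,
                 0.3,  1.0, -0.2,
                 0.0, -0.2,  1.0), nrow=N) # Population correlations

L_cov <- diag(tau) %*% t(chol(Phi)) # chol() in R returns the upper factor
delta_ncp <- rnorm(N)               # Standard normal variates
delta <- L_cov %*% delta_ncp        # Correlated affinity deviations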
model4.stan
functions {
  real induced_dirichlet_lpdf(vector c, vector alpha, real phi) {
    int K = num_elements(c) + 1;
    vector[K - 1] sigma = inv_logit(phi - c);
    vector[K] p;
    matrix[K, K] J = rep_matrix(0, K, K);

    // Induced ordinal probabilities
    p[1] = 1 - sigma[1];
    for (k in 2:(K - 1))
      p[k] = sigma[k - 1] - sigma[k];
    p[K] = sigma[K - 1];

    // Baseline column of Jacobian
    for (k in 1:K) J[k, 1] = 1;

    // Diagonal entries of Jacobian
    for (k in 2:K) {
      real rho = sigma[k - 1] * (1 - sigma[k - 1]);
      J[k, k] = - rho;
      J[k - 1, k] = + rho;
    }

    return dirichlet_lpdf(p | alpha)
           + log_determinant(J);
  }
}
data {
  int<lower=1> N_ratings;
  array[N_ratings] int<lower=1, upper=5> ratings;

  int<lower=1> N_customers;
  array[N_ratings] int<lower=1, upper=N_customers> customer_idxs;

  int<lower=1> N_movies;
  array[N_ratings] int<lower=1, upper=N_movies> movie_idxs;
}

parameters {
  // Baseline movie affinities population model
  vector[N_movies] gamma0_ncp;
  real<lower=0> tau_gamma0;

  // Individual customer affinity population model
  array[N_customers] vector[N_movies] delta_gamma_ncp;
  vector<lower=0>[N_movies] tau_delta_gamma;
  cholesky_factor_corr[N_movies] L_delta_gamma;

  array[N_customers] ordered[4] cut_points; // Customer rating cut points

  simplex[5] mu_q;     // Rating simplex population location
  real<lower=0> tau_q; // Rating simplex population scale
}

transformed parameters {
  // Centered baseline movie affinities
  vector[N_movies] gamma0 = tau_gamma0 * gamma0_ncp;

  // Centered customer movie affinities
  matrix[N_movies, N_movies] Phi;
  array[N_customers] vector[N_movies] delta_gamma;

  {
    matrix[N_movies, N_movies] L_cov
      = diag_pre_multiply(tau_delta_gamma, L_delta_gamma);
    Phi = L_cov * L_cov';
    for (c in 1:N_customers)
      delta_gamma[c] = L_cov * delta_gamma_ncp[c];
  }
}
model {
  vector[5] ones = rep_vector(1, 5);
  vector[5] alpha = mu_q / tau_q + ones;

  // Prior model
  gamma0_ncp ~ normal(0, 1);
  tau_gamma0 ~ normal(0, 5 / 2.57);

  for (r in 1:N_customers)
    delta_gamma_ncp[r] ~ normal(0, 1);
  tau_delta_gamma ~ normal(0, 5 / 2.57);
  L_delta_gamma ~ lkj_corr_cholesky(5.0 * sqrt(N_movies));

  for (c in 1:N_customers)
    cut_points[c] ~ induced_dirichlet(alpha, 0);

  mu_q ~ dirichlet(5 * ones);
  tau_q ~ normal(0, 1);

  // Observational model
  for (n in 1:N_ratings) {
    int c = customer_idxs[n];
    int m = movie_idxs[n];
    ratings[n] ~ ordered_logistic(gamma0[m] + delta_gamma[c][m],
                                  cut_points[c]);
  }
}
generated quantities {
  array[N_ratings] int<lower=1, upper=5> rating_pred;

  array[N_customers] real mean_rating_customer_pred
    = rep_array(0, N_customers);
  array[N_customers] real var_rating_customer_pred
    = rep_array(0, N_customers);

  array[N_movies] real mean_rating_movie_pred = rep_array(0, N_movies);
  array[N_movies] real var_rating_movie_pred = rep_array(0, N_movies);

  matrix[N_movies, N_movies] covar_rating_movie_pred;

  {
    array[N_customers] real C = rep_array(0, N_customers);
    array[N_movies] real M = rep_array(0, N_movies);

    // Accumulate stratified predictive means and variances
    // with Welford's online updates
    for (n in 1:N_ratings) {
      real delta = 0;
      int c = customer_idxs[n];
      int m = movie_idxs[n];

      rating_pred[n]
        = ordered_logistic_rng(gamma0[m] + delta_gamma[c][m],
                               cut_points[c]);

      C[c] += 1;
      delta = rating_pred[n] - mean_rating_customer_pred[c];
      mean_rating_customer_pred[c] += delta / C[c];
      var_rating_customer_pred[c]
        += delta * (rating_pred[n] - mean_rating_customer_pred[c]);

      M[m] += 1;
      delta = rating_pred[n] - mean_rating_movie_pred[m];
      mean_rating_movie_pred[m] += delta / M[m];
      var_rating_movie_pred[m]
        += delta * (rating_pred[n] - mean_rating_movie_pred[m]);
    }

    for (c in 1:N_customers) {
      if (C[c] > 1)
        var_rating_customer_pred[c] /= C[c] - 1;
      else
        var_rating_customer_pred[c] = 0;
    }

    for (m in 1:N_movies) {
      if (M[m] > 1)
        var_rating_movie_pred[m] /= M[m] - 1;
      else
        var_rating_movie_pred[m] = 0;
    }
  }

  {
    matrix[N_movies, N_movies] counts;

    for (m1 in 1:N_movies) {
      for (m2 in 1:N_movies) {
        counts[m1, m2] = 0;
        covar_rating_movie_pred[m1, m2] = 0;
      }
    }

    // Accumulate partial empirical covariances of the predicted
    // ratings, mirroring the computation on the observed data
    for (n1 in 1:N_ratings) {
      for (n2 in 1:N_ratings) {
        if (customer_idxs[n1] == customer_idxs[n2]) {
          int m1 = movie_idxs[n1];
          int m2 = movie_idxs[n2];
          real y = (rating_pred[n1] - mean_rating_movie_pred[m1])
                   * (rating_pred[n2] - mean_rating_movie_pred[m2]);
          covar_rating_movie_pred[m1, m2] += y;
          covar_rating_movie_pred[m2, m1] += y;
          counts[m1, m2] += 1;
          counts[m2, m1] += 1;
        }
      }
    }

    for (m1 in 1:N_movies) {
      for (m2 in 1:N_movies) {
        if (counts[m1, m2] > 1)
          covar_rating_movie_pred[m1, m2] /= counts[m1, m2] - 1;
      }
    }
  }
}
fit <- stan(file="stan_programs/model4.stan",
            data=data, seed=8438338, init=0,
            warmup=1000, iter=2024, refresh=0)
People complain about the annoyance of diagnostic warnings, but how bad can they be if we don’t see any for this complex model with 40,706 degrees of freedom? It turns out that Hamiltonian Monte Carlo is pretty good!
diagnostics <- util$extract_hmc_diagnostics(fit)
util$check_all_hmc_diagnostics(diagnostics)
All Hamiltonian Monte Carlo diagnostics are consistent with reliable
Markov chain Monte Carlo.
samples4 <- util$extract_expectand_vals(fit)
base_samples <- util$filter_expectands(samples4,
                                       c('gamma0_ncp',
                                         'tau_gamma0',
                                         'delta_gamma_ncp',
                                         'tau_delta_gamma',
                                         'L_delta_gamma',
                                         'cut_points',
                                         'mu_q', 'tau_q'),
                                       check_arrays=TRUE)
util$check_all_expectand_diagnostics(base_samples,
                                     exclude_zvar=TRUE)
All expectands checked appear to be behaving well enough for reliable
Markov chain Monte Carlo estimation.
Unfortunately all of this added sophistication doesn’t actually seem to improve the retrodictive performance much. Even the retrodictive tension in the covariances, which should be particularly sensitive to the added flexibility in the new model, is similar to what we saw with the previous model. One possibility is that the multivariate normal population model isn’t sufficiently heavy-tailed to accommodate the more extreme tastes of the customers.
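If we wanted to chase that possibility then one natural extension would be to replace the multivariate normal population model with a multivariate Student-t population model, implemented as a gaussian scale mixture so that the non-centered parameterization is preserved. What follows is only a minimal sketch of the modifications to model4.stan; the names nu_delta_gamma and psi are hypothetical additions, not part of the model above.
// Sketch only: hypothetical heavy-tailed modifications to model4.stan
parameters {
  // ... existing parameters ...
  real<lower=1> nu_delta_gamma;     // Student-t degrees of freedom
  vector<lower=0>[N_customers] psi; // Per-customer mixing scales
}
transformed parameters {
  // Replaces the existing definition of delta_gamma
  array[N_customers] vector[N_movies] delta_gamma;
  {
    matrix[N_movies, N_movies] L_cov
      = diag_pre_multiply(tau_delta_gamma, L_delta_gamma);
    // Scaling each customer's deviations by sqrt(psi[c]) inflates
    // the tails of the population model
    for (c in 1:N_customers)
      delta_gamma[c] = sqrt(psi[c]) * (L_cov * delta_gamma_ncp[c]);
  }
}
model {
  // Marginalizing over psi[c] ~ inv_gamma(nu / 2, nu / 2) yields a
  // multivariate Student-t population model with nu_delta_gamma
  // degrees of freedom; nu_delta_gamma would also need its own prior
  psi ~ inv_gamma(0.5 * nu_delta_gamma, 0.5 * nu_delta_gamma);
  // ... existing prior and observational model ...
}
Heavier tails would allow a few customers with extreme tastes to deviate strongly from the baseline without inflating the population scales for everyone else.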
par(mfrow=c(1, 1), mar=c(5, 5, 1, 1))
util$plot_hist_quantiles(samples4, 'rating_pred', -0.5, 6.5, 1,
                         baseline_values=data$ratings,
                         xlab="All Ratings")
par(mfrow=c(2, 3), mar=c(5, 5, 1, 1))
for (c in c(7, 23, 40, 70, 77, 100)) {
  names <- sapply(which(data$customer_idxs == c),
                  function(n) paste0('rating_pred[', n, ']'))
  filtered_samples <- util$filter_expectands(samples4, names)

  customer_ratings <- data$ratings[data$customer_idxs == c]
  util$plot_hist_quantiles(filtered_samples, 'rating_pred',
                           -0.5, 6.5, 1,
                           baseline_values=customer_ratings,
                           xlab="Ratings",
                           main=paste('Customer', c))
}
par(mfrow=c(2, 3), mar=c(5, 5, 1, 1))
for (m in c(33, 53, 61, 80, 117, 180)) {
<- sapply(which(data$movie_idxs == m),
names function(n) paste0('rating_pred[', n, ']'))
<- util$filter_expectands(samples4, names)
filtered_samples
<- data$ratings[data$movie_idxs == m]
movie_ratings $plot_hist_quantiles(filtered_samples, 'rating_pred',
util-0.5, 6.5, 1,
baseline_values=movie_ratings,
xlab="Ratings",
main=paste('Movie', m))
}
par(mfrow=c(2, 2), mar=c(5, 5, 1, 1))
util$plot_hist_quantiles(samples4, 'mean_rating_customer_pred',
                         0, 6, 0.5,
                         baseline_values=mean_rating_customer,
                         xlab="Customer-wise Average Ratings")
util$plot_hist_quantiles(samples4, 'mean_rating_movie_pred',
                         0, 6, 0.6,
                         baseline_values=mean_rating_movie,
                         xlab="Movie-wise Average Ratings")
util$plot_hist_quantiles(samples4, 'var_rating_customer_pred',
                         0, 7, 0.5,
                         baseline_values=var_rating_customer,
                         xlab="Customer-wise Rating Variances")
Warning in check_bin_containment(bin_min, bin_max, collapsed_values,
"predictive value"): 283 predictive values (0.1%) fell above the binning.
util$plot_hist_quantiles(samples4, 'var_rating_movie_pred',
                         0, 7, 0.5,
                         baseline_values=var_rating_movie,
                         xlab="Movie-wise Rating Variances")
Warning in check_bin_containment(bin_min, bin_max, collapsed_values,
"predictive value"): 2337 predictive values (0.3%) fell above the binning.
par(mfrow=c(1, 1), mar=c(5, 5, 1, 1))
filtered_samples <-
  util$filter_expectands(samples4,
                         covar_rating_movie_filt_names)
util$plot_hist_quantiles(filtered_samples, 'covar_rating_movie_pred',
                         -4.25, 4.25, 0.25,
                         baseline_values=covar_rating_movie_filt,
                         xlab="Filtered Movie-wise Rating Covariances")
Warning in check_bin_containment(bin_min, bin_max, collapsed_values,
"predictive value"): 124 predictive values (0.0%) fell below the binning.
Warning in check_bin_containment(bin_min, bin_max, collapsed_values,
"predictive value"): 90 predictive values (0.0%) fell above the binning.
The lack of any substantial improvement in the retrodictive performance suggests that we might not have included enough data to really resolve individual customer preferences yet. Fortunately we can quantify directly just how well individual customer preferences are resolved by examining our posterior inferences.
Posterior inferences for the cut point population behaviors are mostly consistent with the previous model, although the baseline rating probabilities do shift down slightly to be more centered around 3. This suggests that the previous model may have been contorting itself a bit to account for the variation in customer tastes.
par(mfrow=c(1, 1), mar=c(5, 5, 1, 1))
util$plot_expectand_pushforward(samples3[['tau_q']],
                                25, flim=c(0, 0.3),
                                col=util$c_light,
                                display_name='tau_q')
text(0.05, 10, "Model 3", col=util$c_light)

util$plot_expectand_pushforward(samples4[['tau_q']],
                                25, flim=c(0, 0.3),
                                col=util$c_dark,
                                border="#BBBBBB88",
                                add=TRUE)
text(0.2, 10, "Model 4", col=util$c_dark)
par(mfrow=c(2, 1), mar=c(5, 5, 1, 1))
names <- sapply(1:5, function(k) paste0('mu_q[', k, ']'))
util$plot_disc_pushforward_quantiles(samples3, names,
                                     xlab="Rating",
                                     ylab="Baseline Rating Probability",
                                     main="Model 3")

names <- sapply(1:5, function(k) paste0('mu_q[', k, ']'))
util$plot_disc_pushforward_quantiles(samples4, names,
                                     xlab="Rating",
                                     ylab="Baseline Rating Probability",
                                     main="Model 4")
Similarly some of the individual customer cut points change slightly. For example the cut points for Customer 23 shift a bit towards larger values.
par(mfrow=c(2, 2), mar=c(5, 5, 1, 1))
c <- 23

lab3_xs <- c(0, -0.5, 0, 1)
lab3_ys <- c(1.75, 0.5, 0.5, 0.25)

lab4_xs <- c(2, 4, 5.5, 8)
lab4_ys <- c(0.5, 0.5, 0.5, 0.25)

for (k in 1:4) {
  name <- paste0('cut_points[', c, ',', k, ']')
  util$plot_expectand_pushforward(samples3[[name]],
                                  40, flim=c(-2, 10),
                                  col=util$c_light,
                                  display_name=name)
  util$plot_expectand_pushforward(samples4[[name]],
                                  40, flim=c(-2, 10),
                                  col=util$c_dark,
                                  border="#BBBBBB88",
                                  add=TRUE)
  text(lab3_xs[k], lab3_ys[k], "Model 3", col=util$c_light)
  text(lab4_xs[k], lab4_ys[k], "Model 4", col=util$c_dark)
}
The baseline movie affinities emulate the universal movie preferences in the previous model, and indeed inferences for them are similar if slightly more heterogeneous.
par(mfrow=c(2, 2), mar=c(5, 5, 1, 1))
util$plot_expectand_pushforward(samples3[['tau_gamma']],
                                25, flim=c(0, 1.25),
                                display_name="tau_gamma",
                                main="Model 3")

names <- sapply(1:data$N_movies,
                function(m) paste0('gamma[', m, ']'))
util$plot_disc_pushforward_quantiles(samples3, names,
                                     xlab="Movie",
                                     ylab="Affinities",
                                     main="Model 3")

util$plot_expectand_pushforward(samples4[['tau_gamma0']],
                                25, flim=c(0, 1.25),
                                display_name="tau_gamma0",
                                main="Model 4")

names <- sapply(1:data$N_movies,
                function(m) paste0('gamma0[', m, ']'))
util$plot_disc_pushforward_quantiles(samples4, names,
                                     xlab="Movie",
                                     ylab="Baseline Affinities",
                                     main="Model 4")
Now, however, we can investigate the preferences idiosyncratic to each customer. For example the individual movie affinity scales quantify how much the customers disagree about a particular movie.
par(mfrow=c(1, 1), mar=c(5, 5, 1, 1))
names <- sapply(1:data$N_movies,
                function(m) paste0('tau_delta_gamma[', m, ']'))
util$plot_disc_pushforward_quantiles(samples4, names,
                                     xlab="Movie",
                                     ylab="Affinity Variation Scales")
Overall there is a lot of uncertainty, with the inferences for most of the movie affinity scales concentrating around zero. That said, a few stand out. For example the posterior inferences for \tau_{\gamma, 159} are starting to pull away from zero, suggesting that customers tend to disagree about the quality of this movie more than usual.
par(mfrow=c(1, 1), mar=c(5, 5, 1, 1))
m <- 159
name <- paste0('tau_delta_gamma[', m, ']')
util$plot_expectand_pushforward(samples4[[name]],
                                25, flim=c(0, 10),
                                display_name=name)
The inferred correlations in the multivariate normal population model allow us to inform predictions for how a customer would react to unrated movies given the movies they have rated. While most of the correlations are consistent with zero many are large enough to be resolved even from this limited data set.
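More formally, if we partition a customer’s deviations into components for rated and unrated movies, \boldsymbol{\delta}_{c} = ( \boldsymbol{\delta}_{c, r}, \boldsymbol{\delta}_{c, u} ), then the multivariate normal population model implies the conditional distribution p( \boldsymbol{\delta}_{c, u} \mid \boldsymbol{\delta}_{c, r} ) = \text{multi-normal}( \boldsymbol{\Sigma}_{ur} \, \boldsymbol{\Sigma}_{rr}^{-1} \, \boldsymbol{\delta}_{c, r}, \boldsymbol{\Sigma}_{uu} - \boldsymbol{\Sigma}_{ur} \, \boldsymbol{\Sigma}_{rr}^{-1} \, \boldsymbol{\Sigma}_{ru} ), where the subscripts r and u pick out the rated and unrated blocks of \boldsymbol{\Sigma}. Non-zero off-diagonal blocks both shift the unrated affinities away from zero and narrow their uncertainties.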
par(mfrow=c(1, 1), mar=c(5, 5, 1, 1))
util$plot_hist_quantiles(samples4, 'Phi', -1, 1, 0.01,
                         xlab="Movie-wise Affinity Correlations")
Warning in check_bin_containment(bin_min, bin_max, collapsed_values,
"predictive value"): 136930 predictive values (0.1%) fell below the binning.
Warning in check_bin_containment(bin_min, bin_max, collapsed_values,
"predictive value"): 543674 predictive values (0.3%) fell above the binning.
It’s more practical, not to mention more interpretable, to investigate the consequences of these correlations. In particular we can look at the movie affinities for each customer by adding the common baselines to their individual deviations. Here we’ll consider Customer 23.
par(mfrow=c(2, 1), mar=c(5, 5, 1, 1))
c <- 23

names <- sapply(1:data$N_movies,
                function(m) paste0('gamma0[', m, ']'))
util$plot_disc_pushforward_quantiles(samples4, names,
                                     xlab="Movie",
                                     ylab="Baseline Affinity",
                                     main="Baseline")

names <- sapply(1:data$N_movies,
                function(m) paste0('delta_gamma[', c, ',', m, ']'))
util$plot_disc_pushforward_quantiles(samples4, names,
                                     xlab="Movie",
                                     ylab="Change in Affinity",
                                     main=paste('Customer', c))
expectands <- sapply(1:data$N_movies,
                     function(m)
                       local({ idx = m; function(x1, x2)
                                          x1[idx] + x2[idx] }) )
names(expectands) <- sapply(1:data$N_movies,
                            function(m)
                              paste0('gamma[', c, ',', m, ']'))

var_repl <- list('x1'=array(sapply(1:data$N_movies,
                                   function(m)
                                     paste0('gamma0[', m, ']'))),
                 'x2'=array(sapply(1:data$N_movies,
                                   function(m)
                                     paste0('delta_gamma[', c,
                                            ',', m, ']'))))

affinity_samples <-
  util$eval_expectand_pushforwards(samples4,
                                   expectands, var_repl)
par(mfrow=c(1, 1), mar=c(5, 5, 1, 1))
names <- sapply(1:data$N_movies,
                function(m) paste0('gamma[', c, ',', m, ']'))
util$plot_disc_pushforward_quantiles(affinity_samples, names,
                                     xlab="Movie",
                                     ylab="Affinity",
                                     main=paste('Customer', c))
Even better, we can separately visualize the movie affinities that are directly informed by observed ratings and those informed only by the multivariate normal hierarchical model. Note that the affinities for the movies that Customer 23 did not rate are not only much more uncertain but also much more uniform.
rated_movie_idxs <- data$movie_idxs[data$customer_idxs == c]
unrated_movie_idxs <- setdiff(1:data$N_movies, rated_movie_idxs)

par(mfrow=c(2, 1), mar=c(5, 5, 1, 1))

names <- sapply(1:data$N_movies,
                function(m) paste0('gamma[', c, ',', m, ']'))
util$plot_disc_pushforward_quantiles(affinity_samples, names,
                                     xlab="Rated Movie",
                                     ylab="Customer Affinity",
                                     main=paste('Customer', c))
for (m in unrated_movie_idxs) {
  polygon(c(m - 0.5, m + 0.5, m + 0.5, m - 0.5),
          c(-4.75, -4.75, 4.75, 4.75), col="white", border=NA)
}

util$plot_disc_pushforward_quantiles(affinity_samples, names,
                                     xlab="Unrated Movie",
                                     ylab="Customer Affinity",
                                     main=paste('Customer', c))
for (m in rated_movie_idxs) {
  polygon(c(m - 0.5, m + 0.5, m + 0.5, m - 0.5),
          c(-4.75, -4.75, 4.75, 4.75), col="white", border=NA)
}
The immediate benefit of modeling individual preferences is that we can now make movie recommendations bespoke to Customer 23.
expected_affinity <- function(m) {
  name <- paste0('gamma[', c, ',', m, ']')
  util$ensemble_mcmc_est(affinity_samples[[name]])[1]
}

expected_affinities <- sapply(1:data$N_movies,
                              function(m) expected_affinity(m))
post_mean_ordering <- sort(expected_affinities, index.return=TRUE)$ix
par(mfrow=c(1, 1), mar=c(5, 5, 1, 1))
names <- sapply(post_mean_ordering,
                function(m) paste0('gamma[', c, ',', m, ']'))
xname <- "Movies Ordered by Expected Affinity"
util$plot_disc_pushforward_quantiles(affinity_samples, names,
                                     xlab=xname,
                                     ylab="Affinity")
From this we can infer what movies we think Customer 23 will like the least.
print(data.frame("Rank"=200:196,
"Movie"=head(post_mean_ordering, 5)),
row.names=FALSE)
Rank Movie
200 40
199 13
198 78
197 55
196 183
As well as what movies we think they will like the most.
print(data.frame("Rank"=5:1,
"Movie"=tail(post_mean_ordering, 5)),
row.names=FALSE)
Rank Movie
5 167
4 61
3 156
2 23
1 97
Of course there’s not much utility in recommending a customer a movie that they’ve already seen. A much more useful recommendation is for movies that they haven’t yet seen but might enjoy.
Here let’s assume that a movie is unrated by a customer only if the customer has not yet seen it. Consequently the recommendation task comes down to inferring which unrated movies have the highest affinities for Customer 23.
expected_affinities <- sapply(unrated_movie_idxs,
                              function(m) expected_affinity(m))
post_mean_ordering <- sort(expected_affinities, index.return=TRUE)$ix
We can finally present a list of the top movies to recommend to Customer 23.
print(data.frame("Rank"=10:1,
"Movie"=tail(unrated_movie_idxs[post_mean_ordering], 10)),
row.names=FALSE)
Rank Movie
10 95
9 186
8 155
7 15
6 107
5 161
4 43
3 86
2 87
1 193
All of this said we should have only mild confidence in these recommendations given the large uncertainties.
par(mfrow=c(1, 1), mar=c(5, 5, 1, 1))
names <- sapply(unrated_movie_idxs[post_mean_ordering],
                function(m) paste0('gamma[', c, ',', m, ']'))
xname <- "Unrated Movies Ordered by Expected Affinity"
util$plot_disc_pushforward_quantiles(affinity_samples, names,
                                     xlab=xname,
                                     ylab="Affinity")
One subtlety with recommendations is that in most applications we cannot evaluate their performance directly. For example, absent any additional interrogation of Customer 23, the only indication of how much they agree with one of our recommendations is how highly they rate the movie in the future.
Fortunately we can use our inferences to predict not only the latent affinity of these movie recommendations but also how we think Customer 23 would rate them. This would allow us to compare predicted ratings to actual ratings.
movie_idx <- tail(unrated_movie_idxs[post_mean_ordering], 1)

logistic <- function(x) {
  if (x > 0) {
    1 / (1 + exp(-x))
  } else {
    e <- exp(x)
    e / (1 + e)
  }
}

expectands <- list(function(c, gamma) 1 - logistic(gamma - c[1]),
                   function(c, gamma) logistic(gamma - c[1]) -
                                      logistic(gamma - c[2]),
                   function(c, gamma) logistic(gamma - c[2]) -
                                      logistic(gamma - c[3]),
                   function(c, gamma) logistic(gamma - c[3]) -
                                      logistic(gamma - c[4]),
                   function(c, gamma) logistic(gamma - c[4]))
names(expectands) <- c('p[1]', 'p[2]', 'p[3]', 'p[4]', 'p[5]')

var_repl <- list('c'=array(sapply(1:4,
                                  function(k)
                                    paste0('cut_points[', c, ',', k, ']'))),
                 'gamma'=paste0('gamma[', c, ',', movie_idx, ']'))

for (k in 1:4) {
  name <- paste0('cut_points[', c, ',', k, ']')
  affinity_samples[[name]] <- samples4[[name]]
}

prob_samples <- util$eval_expectand_pushforwards(affinity_samples,
                                                 expectands, var_repl)
par(mfrow=c(1, 1), mar=c(5, 5, 1, 1))
util$plot_disc_pushforward_quantiles(prob_samples, names(expectands),
                                     xlab="Rating",
                                     ylab="Posterior Probability")
Interestingly we don’t actually predict a high rating for our top recommendation. In hindsight, however, this shouldn’t be unexpected given how austere Customer 23 is with their stars.
par(mfrow=c(1, 1), mar=c(5, 5, 1, 1))
util$plot_line_hist(data$ratings[data$customer_idxs == c],
                    -0.5, 6.5, 1,
                    xlab="Rating", main=paste('Customer', c))
Another benefit of this hierarchical approach is that we are not limited to making inferences and predictions for existing customers. In particular we can also make inferences and predictions for new customers by sampling new cut points and movie affinities from the respective hierarchical population models. With the dearth of observed ratings these predictions will be highly uncertain, but at the same time that uncertainty prevents us from making overly confident claims.
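As a minimal sketch, additions to the generated quantities block that simulate a single hypothetical new customer might look like the following; all of the *_new names are my own assumptions, not part of model4.stan.
// Sketch only: hypothetical additions to the generated quantities
// block of model4.stan
vector[N_movies] delta_gamma_new;
vector[4] cut_points_new;
array[N_movies] int rating_new_pred;
{
  matrix[N_movies, N_movies] L_cov
    = diag_pre_multiply(tau_delta_gamma, L_delta_gamma);
  // Baseline rating probabilities for the new customer drawn from
  // the induced Dirichlet population model
  vector[5] p_new = dirichlet_rng(mu_q / tau_q + rep_vector(1, 5));

  // New affinity deviations drawn from the multivariate normal
  // population model
  delta_gamma_new
    = L_cov * to_vector(normal_rng(rep_vector(0, N_movies), 1));

  // Invert the induced ordinal probabilities into cut points,
  // anchoring the latent logistic density at phi = 0
  for (k in 1:4)
    cut_points_new[k] = logit(sum(p_new[1:k]));

  for (m in 1:N_movies)
    rating_new_pred[m]
      = ordered_logistic_rng(gamma0[m] + delta_gamma_new[m],
                             cut_points_new);
}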
7 Computational Considerations
I want to emphasize that this case study is first and foremost a demonstrative analysis. In particular I reduced the data not for any statistical reason but rather to ensure that the models would not take too long to run as I developed the analysis. Ultimately the final model took about three hours to run on my laptop, which wasn’t too onerous, especially given the total number of parameters.
That said I do think it is useful to at least consider what the different priorities might be for a more realistic analysis where a specific inferential goal would be driving the amount of data to include and different computational resources might be available. What would it take to speed up the fit of the final model or scale it up to a larger data set?
Recall that the overall cost of running Hamiltonian Monte Carlo can roughly be decomposed into the number of iterations, the number of model evaluations per iteration, and the cost of each model evaluation. The number of model evaluations per iteration is driven by the posterior geometry and how hard the Hamiltonian Monte Carlo algorithm has to work to explore it. For a fixed data set the two main ways that we can reduce computation are to improve the posterior geometry or to speed up the model evaluations.
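Schematically, \text{total cost} \approx N_{\text{iterations}} \cdot \bar{N}_{\text{evaluations per iteration}} \cdot \bar{C}_{\text{evaluation}}; with the number of iterations fixed by how many posterior samples we need, only the last two factors are up for negotiation.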
Whenever working with hierarchical models we need to be mindful of the potentially problematic geometries to which they are prone. In this case study we seemed to do okay with a monolithic non-centered parameterization for the normal and multivariate normal hierarchies, but we could possibly improve the geometry by centering the parameters corresponding to the more prolific movies and customers. Before doing that, however, we can estimate the potential for improvement by examining the length of the numerical Hamiltonian trajectories in our last fit, and hence how many model evaluations were needed per iteration.
util$plot_num_leapfrogs_by_chain(diagnostics)
Despite the complexity of the final model the numerical Hamiltonian trajectories weren’t all that long. Even in an ideal case we can’t do much better than ten or so leapfrog steps per trajectory; consequently the maximum possible speed up that we could get from improving the geometry here would be less than an order of magnitude! That’s not trivial, but it suggests that the computational cost is dominated not by the number of model evaluations but rather by the cost of each model evaluation itself.
So how can we speed up the model evaluations? One immediate strategy is parallelization, especially if we’re working with computers blessed with lots and lots of threads. For example we can, at least in theory, parallelize the many matrix-vector products that are required in the transformed parameters block. That said, achieving these potential speed ups in practice is always frustrated by the subtle input/output costs of passing all of the needed information to each thread and back in each model evaluation.
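A complementary option in Stan is to parallelize the observational model itself with reduce_sum. Here is a minimal sketch under the assumptions that the toolchain is built with threading enabled and that a positive integer grainsize is added to the data block; neither is part of the analysis above.
// Sketch only: a hypothetical reduce_sum parallelization of the
// observational model in model4.stan
functions {
  // Log probability of a contiguous slice of the observed ratings
  real partial_sum_lpmf(array[] int ratings_slice,
                        int start, int end,
                        array[] int customer_idxs,
                        array[] int movie_idxs,
                        vector gamma0,
                        array[] vector delta_gamma,
                        array[] vector cut_points) {
    real lp = 0;
    for (n in start:end) {
      int c = customer_idxs[n];
      int m = movie_idxs[n];
      lp += ordered_logistic_lpmf(ratings_slice[n - start + 1] |
                                  gamma0[m] + delta_gamma[c][m],
                                  cut_points[c]);
    }
    return lp;
  }
}
model {
  // Replaces the sequential loop over the observed ratings
  target += reduce_sum(partial_sum_lpmf, ratings, grainsize,
                       customer_idxs, movie_idxs,
                       gamma0, delta_gamma, cut_points);
}
Of course the same input/output overhead applies here, and with only 2,415 ratings the slices may be too small for the parallelization to pay for itself.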
Either approach to speeding up the fitting of the final model will be challenging, requiring careful investigations and implementations and offering no guarantee of success. Even worse, neither of these strategies will really be able to compete with the quadratic cost of evaluating the model, both in terms of N_{\text{customer}} \cdot N_{\text{movie}} and N_{\text{movie}}^{2}, if we attempt to add more customers and/or movies. For example scaling up from 200 movies to 2000 movies, still only a fraction of the total data set and an even more negligible part of the full data a company like Netflix would have available, would require a 100-fold increase in the cost of evaluating the model. Even if the posterior geometry doesn’t get any worse that pushes three hours to over a full day of computation.
In practice we can fight quadratic scaling only so far. Ultimately the problem is that the final model has to compare every movie to every other movie. Consequently the most effective scaling strategy is to limit the number of movies to which each movie is compared. More formally we need to introduce an appropriate sparsity structure on the movies so that most of the N_{\text{movie}}^{2} covariances are zero.
Many methods attempt to learn a sparsity structure consistent with the observed data, dynamically turning off covariances that end up too small. This, however, is an outrageously difficult learning problem and approximate results tend to be fragile without unreasonable amounts of data. We can usually do much better by taking advantage of our domain expertise to motivate appropriate sparsity structures directly.
For example we could first group movies into genres before modeling common baseline affinities, correlated deviations across genres, and perhaps even correlated deviations for each movie within each genre. This effectively introduces a block-diagonal structure to the full covariance matrix which scales much more efficiently without sacrificing all of the correlations that can help inform predictions for unrated movies.
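As a rough sketch, assuming hypothetical genre labels genre_idxs were added to the data, correlated customer deviations across genres instead of across individual movies might look like this; none of these names appear in the models above.
// Sketch only: hypothetical genre-level customer deviations
data {
  int<lower=1> N_genres;
  array[N_movies] int<lower=1, upper=N_genres> genre_idxs;
}
parameters {
  array[N_customers] vector[N_genres] delta_genre_ncp;
  vector<lower=0>[N_genres] tau_delta_genre;
  cholesky_factor_corr[N_genres] L_delta_genre;
}
transformed parameters {
  array[N_customers] vector[N_genres] delta_genre;
  {
    matrix[N_genres, N_genres] L_cov
      = diag_pre_multiply(tau_delta_genre, L_delta_genre);
    for (c in 1:N_customers)
      delta_genre[c] = L_cov * delta_genre_ncp[c];
  }
}
model {
  // Each rating is informed by the customer's deviation for the
  // movie's genre rather than for the movie itself
  for (n in 1:N_ratings) {
    int c = customer_idxs[n];
    int m = movie_idxs[n];
    ratings[n] ~ ordered_logistic(gamma0[m] + delta_genre[c][genre_idxs[m]],
                                  cut_points[c]);
  }
}
With N_{\text{genre}} much smaller than N_{\text{movie}} the dense correlation matrix collapses from N_{\text{movie}}^{2} entries to N_{\text{genre}}^{2}, and per-movie deviations within each genre can be layered back in as needed.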
Given the sparsity of the observed ratings we can learn only so much. Consequently we might as well build a more restricted model of meaningful behaviors that we have a hope of resolving than attempting to learn intricate details about which we just don’t have enough information.
8 Conclusion
In this case study I developed a relatively sophisticated analysis of consumer feedback that accounts for not only how each customer interprets the possible ordinal ratings in different ways but also the variation in their preferences. To learn anything about these behaviors in spite of the sparsity of the data we had to leverage our domain expertise and some intricate modeling techniques.
If anything I hope that this case study demonstrates how powerful hierarchical modeling techniques can be when used carefully. We used our domain expertise to sketch out the data generating process first, and only then considered opportunities for heterogeneous behaviors.
By starting with the broad features of the data generating process we established an explicit context that made it easier to identify not only which behaviors were heterogeneous but also which heterogeneous behaviors might be coupled together. Moreover the structure of those behaviors motivated appropriate population models. In this way we were able to develop an elaborate model with multiple, multivariate hierarchies without becoming too overwhelmed in the process.
Hierarchical modeling is so much more than one-dimensional normal population models!
Acknowledgements
A very special thanks to everyone supporting me on Patreon: Adam Fleischhacker, Adriano Yoshino, Alejandro Navarro-Martínez, Alessandro Varacca, Alex D, Alexander Noll, Andrea Serafino, Andrew Mascioli, Andrew Rouillard, Andrew Vigotsky, Ara Winter, Austin Rochford, Avraham Adler, Ben Matthews, Ben Swallow, Benoit Essiambre, Bertrand Wilden, boot, Bradley Kolb, Brendan Galdo, Bryan Chang, Brynjolfur Gauti Jónsson, Cameron Smith, Canaan Breiss, Cat Shark, CG, Charles Naylor, Chase Dwelle, Chris Jones, Christopher Mehrvarzi, Colin Carroll, Colin McAuliffe, Damien Mannion, dan mackinlay, Dan W Joyce, Dan Waxman, Dan Weitzenfeld, Daniel Edward Marthaler, Daniel Saunders, Darshan Pandit, Darthmaluus , David Galley, David Wurtz, Doug Rivers, Dr. Jobo, Dr. Omri Har Shemesh, Dylan Maher, Ed Cashin, Edgar Merkle, Eli Witus, Eric LaMotte, Ero Carrera, Eugene O’Friel, Felipe González, Fergus Chadwick, Finn Lindgren, Francesco Corona, Geoff Rollins, Guilherme Marthe, Håkan Johansson, Hamed Bastan-Hagh, haubur, Hector Munoz, Henri Wallen, hs, Hugo Botha, Ian, Ian Costley, idontgetoutmuch, Ignacio Vera, Ilaria Prosdocimi, Isaac Vock, Isidor Belic, jacob pine, Jair Andrade, James C, James Hodgson, James Wade, Janek Berger, Jarrett Byrnes, Jason Martin, Jason Pekos, Jason Wong, jd, Jeff Burnett, Jeff Dotson, Jeff Helzner, Jeffrey Erlich, Jerry Lin , Jessica Graves, Joe Sloan, John Flournoy, Jonathan H. Morgan, Jonathon Vallejo, Joran Jongerling, Josh Knecht, June, Justin Bois, Kádár András, Karim Naguib, Karim Osman, Kristian Gårdhus Wichmann, Lars Barquist, lizzie , LOU ODETTE, Luís F, Marcel Lüthi, Marek Kwiatkowski, Mariana Carmona, Mark Donoghoe, Markus P., Márton Vaitkus, Matthew, Matthew Kay, Matthew Mulvahill, Matthieu LEROY, Mattia Arsendi, Matěj Kolouch Grabovský, Maurits van der Meer, Max, Michael Colaresi, Michael DeWitt, Michael Dillon, Michael Lerner, Mick Cooney, Mike Lawrence, MisterMentat , N Sanders, N.S. , Name, Nathaniel Burbank, Nic Fishman, Nicholas Clark, Nicholas Cowie, Nick S, Ole Rogeberg, Oliver Crook, Olivier Ma, Patrick Kelley, Patrick Boehnke, Pau Pereira Batlle, Pete St. Marie, Peter Johnson, Pieter van den Berg , ptr, quasar, Ramiro Barrantes Reynolds, Raúl Peralta Lozada, Ravin Kumar, Rémi , Rex Ha, Riccardo Fusaroli, Richard Nerland, Robert Frost, Robert Goldman, Robert kohn, Robin Taylor, Ryan Gan, Ryan Grossman, Ryan Kelly, S Hong, Sean Wilson, Sergiy Protsiv, Seth Axen, shira, Simon Duane, Simon Lilburn, Simone Sebben, sssz, Stefan Lorenz, Stephen Lienhard, Steve Harris, Stew Watts, Stone Chen, Susan Holmes, Svilup, Tao Ye, Tate Tunstall, Tatsuo Okubo, Teresa Ortiz, Theodore Dasher, Thomas Siegert, Thomas Vladeck, Tobychev , Tony Wuersch, Tyler Burch, Virginia Fisher, Vladimir Markov, Wil Yegelwel, Will Farr, Will Lowe, Will Wen, woejozney, yolhaj , yureq , Zach A, Zad Rafi, and Zhengchen Cai.
License
A repository containing all of the files used to generate this chapter is available on GitHub.
The code in this case study is copyrighted by Michael Betancourt and licensed under the new BSD (3-clause) license:
https://opensource.org/licenses/BSD-3-Clause
The text and figures in this chapter are copyrighted by Michael Betancourt and licensed under the CC BY-NC 4.0 license:
https://creativecommons.org/licenses/by-nc/4.0/
Original Computing Environment
writeLines(readLines(file.path(Sys.getenv("HOME"), ".R/Makevars")))
CC=clang
CXXFLAGS=-O3 -mtune=native -march=native -Wno-unused-variable -Wno-unused-function -Wno-macro-redefined -Wno-unneeded-internal-declaration
CXX=clang++ -arch x86_64 -ftemplate-depth-256
CXX14FLAGS=-O3 -mtune=native -march=native -Wno-unused-variable -Wno-unused-function -Wno-macro-redefined -Wno-unneeded-internal-declaration -Wno-unknown-pragmas
CXX14=clang++ -arch x86_64 -ftemplate-depth-256
sessionInfo()
R version 4.3.2 (2023-10-31)
Platform: x86_64-apple-darwin20 (64-bit)
Running under: macOS Sonoma 14.4.1
Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/4.3-x86_64/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.3-x86_64/Resources/lib/libRlapack.dylib; LAPACK version 3.11.0
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
time zone: America/New_York
tzcode source: internal
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] colormap_0.1.4 rstan_2.32.6 StanHeaders_2.32.7
loaded via a namespace (and not attached):
[1] gtable_0.3.4 jsonlite_1.8.8 compiler_4.3.2 Rcpp_1.0.11
[5] parallel_4.3.2 gridExtra_2.3 scales_1.3.0 yaml_2.3.8
[9] fastmap_1.1.1 ggplot2_3.4.4 R6_2.5.1 curl_5.2.0
[13] knitr_1.45 htmlwidgets_1.6.4 tibble_3.2.1 munsell_0.5.0
[17] pillar_1.9.0 rlang_1.1.2 utf8_1.2.4 V8_4.4.1
[21] inline_0.3.19 xfun_0.41 RcppParallel_5.1.7 cli_3.6.2
[25] magrittr_2.0.3 digest_0.6.33 grid_4.3.2 lifecycle_1.0.4
[29] vctrs_0.6.5 evaluate_0.23 glue_1.6.2 QuickJSR_1.0.8
[33] codetools_0.2-19 stats4_4.3.2 pkgbuild_1.4.3 fansi_1.0.6
[37] colorspace_2.1-0 rmarkdown_2.25 matrixStats_1.2.0 tools_4.3.2
[41] loo_2.6.0 pkgconfig_2.0.3 htmltools_0.5.7