Churn, Bayes-by, Churn

Author

Michael Betancourt

Published

May 1, 2026

How do customers bring value to a company? In many cases value is driven not so much by the total number of customers at any given time, but rather for how long those customers will stay engaged. This constant ebb and flow of active customers is known as ,churn.

To properly forecast income for a given company, we need to infer when customers will leave. In this case study, we’ll model bank customer lifetimes using survival modeling techniques, and use those inferences to predict the most valuable customer segments.

1 Setup

Let’s begin by setting up our local python environment.

import matplotlib
import matplotlib.pyplot as plot
plot.rcParams['figure.figsize'] = [8, 4]
plot.rcParams['figure.dpi'] = 100
plot.rcParams['font.family'] = "Serif"

import math
import numpy
import json

This includes loading up pystan.

# Needed to run pystan through a jupyter kernel
import nest_asyncio
nest_asyncio.apply()

import stan

Next, we have a suite Markov chain Monte Carlo diagnostics and estimation tools; this code and supporting documentation are both available on GitHub.

import mcmc_analysis_tools_pystan3 as util

Finally, we have a suite of probabilistic visualization functions based on Markov chain Monte Carlo estimation. Again the code and supporting documentation are available on GitHub.

import mcmc_visualization_tools as putil

2 Data Exploration

To understand bank customer churn, customer acquisition and loss were collected across a sixty month observation period. Only those customers who joined after the observation period were recorded, but many of the customers were still active when the observation period ended.

with open("data/historical_data.json", "r") as infile:
  data = json.load(infile)

All times were recorded in months relative to the start of the observation window. More precisely, the observation window started at time T_{0} = 0 \text{months} and ended at time T_{1} = 60 \text{months}.

print(f'Observation period: [{data['T0']}, {data['T1']}]')

Observation period: [0, 60]

In total, 2243 customers had churned during the observation period, with 257 additional customers still active afterwards.

print(f'{data['N_churn']} churned customers')

2243 churned customers

print(f'{data['N_active']} active customers')

257 active customers

The hard cutoff of the observation period results in some interesting selection effects in the observations. As we reach the end of the observation period, churned customers are less likely to have joined because they had less time to leave before the cutoff. On the other hand, active customers were more likely to join as we approached the cutoff where they had less time to churn.

putil.plot_line_hists(plot.gca(),
                      data['t_join_churn'],
                      data['t_join_active'],
                      data['T0'] - 3, data['T1'] + 3, 3,
                      xlabel="Time Customer Joins (months)")

plot.gca().axvline(x=data['T0'], linewidth=2,
                   linestyle='dashed', color="#DDDDDD")
plot.gca().axvline(x=data['T1'], linewidth=2,
                   linestyle='dashed', color="#DDDDDD")

plot.gca().text(10, 80, "Churned\nCustomers",
                color='black')

plot.gca().text(35, 25, "Active\nCustomers",
                color=putil.mid_teal)

plot.show()

Because data collection ended at T_{1}, we observed leave times for only the customers who churn before the observational period ends.

putil.plot_line_hist(plot.gca(), data['t_leave_churn'],
                     data['T0'] - 3, data['T1'] + 3, 3,
                     xlabel="Time Customer Leaves (months)")

plot.gca().axvline(x=data['T0'], linewidth=2,
                   linestyle='dashed', color="#DDDDDD")
plot.gca().axvline(x=data['T1'], linewidth=2,
                   linestyle='dashed', color="#DDDDDD")

plot.show()

The join and leave times are not as interesting as their difference, which gives observed customer lifetimes, or tenures (Figure 1).

Most customers have churned after twenty months, but we also see a long tail of more loyal customers.

tenure_churn = [ tl - tj for tj, tl
                 in zip(data['t_join_churn'],
                        data['t_leave_churn']) ]

putil.plot_line_hist(plot.gca(), tenure_churn,
                     data['T0'] - 2, data['T1'] + 2, 2,
                     xlabel="Customer Tenure (months)")
plot.show()

Demographic information was also recorded for each of the observed customers, regardless of whether or not they had left by the end of the observation period (Table 1).

Table 1: The data includes demographic information about the customers who joined during the observation period.
Variable Name	Description	Values	Units
online_banking	Online Banking Participation	{0, 1}	False/True
X[n,0]	Customer Age	Positive Real	Years
X[n,1]	Number of Active Bank Products	Positive Integer	Counts
X[n,2]	Relative Wealth	[0, 1]	Percentile
X[n,3]	Monthly Margin	Positive Real	Euro

A little over half of all churned customers participated in online banking.

N_online_banking = sum([ 1 for o in
                         data['online_banking_churn']
                         if o == 1 ])

print(f'{N_online_banking} out of {data['N_churn']} '
      f'({100 * N_online_banking / data['N_churn']:.2f}%) '
       'churned customers used online banking.')

1548 out of 2243 (69.01%) churned customers used online banking.

The other demographic variables span a pretty wide range of values. Note, however, that we also see a secondary peak of high wealth, high margin customers.

bin_min = [ 0, 0, 0, 0 ]
bin_max = [ 100, 15, 1.05, 1000 ]
bin_delta = [ 5, 1, 0.05, 10 ]

f, axarr = plot.subplots(2, 2, layout="constrained")

for i in range(data['I']):
  i1 = i // 2
  i2 = i % 2

  putil.plot_line_hist(axarr[i1, i2],
                       [ x[i] for x in data['X_churn'] ],
                       bin_min[i], bin_max[i], bin_delta[i],
                       xlabel=data['x_names'][i])

plot.show()

1 values (0.04%) fell below the histogram binning.
6 values (0.27%) fell above the histogram binning.

Whenever demographic information is available, we can consider whether or not any behavior of interest, such as customer lifetimes, vary across the demographics. The empirical covariation doesn’t show any obvious patterns.

putil.plot_line_hists(plot.gca(),
                      [ t for t, o
                        in zip(tenure_churn,
                               data['online_banking_churn'])
                        if o == 0 ],
                      [ t for t, o
                        in zip(tenure_churn,
                               data['online_banking_churn'])
                        if o == 1 ],
                      data['T0'] - 2, data['T1'] + 2, 2,
                      prob=True,
                      xlabel="Customer Tenure (months)")

plot.gca().text(15, 0.04, "No Online Banking",
                color='black')

plot.gca().text(10, 0.12, "Online Banking",
                color=putil.mid_teal)

plot.show()

f, axarr = plot.subplots(2, 2, layout="constrained")

for i in range(data['I']):
  i1 = i // 2
  i2 = i % 2

  axarr[i1, i2].scatter([ x[i] for x in data['X_churn'] ],
                        tenure_churn,
                        color="black", s=2)
  axarr[i1, i2].set_xlim([bin_min[i], bin_max[i]])
  axarr[i1, i2].set_xlabel(data['x_names'][i])
  axarr[i1, i2].set_ylim([data['T0'], data['T1']])
  axarr[i1, i2].set_ylabel("Tenure (months)")

plot.show()

That said, raw scatter plots are not the most sensitive visualizations.

3 Model 1: Constant Stimulus/Hazard Model

Having familiarized ourselves with the data, let’s now attempt to infer the configuration of a customer lifetime model with those observations.

3.1 Observational Model

Churn is unavoidable in practice; it’s only a matter of time before a customer eventually leaves. Modeling how long it takes for an inevitable event to occur is the purview of survival modeling.

In survival modeling, we don’t directly model event times. Instead we model whatever temporal stimulus, classically referred to as the hazard, that eventually triggers an event to occur. In the case of customer lifetimes, the stimulus might be interpreted as modeling the overall content of a customer relative to other banking options.

Given a stimulus function \lambda(t), we can derive the cumulative stimulus function, which quantifies how much stimulus has built-up by a certain time, \Lambda(t) = \int_{0}^{t} \mathrm{d} t' \, \lambda(t'). The cumulative stimulus function then defines a complementary cumulative distribution function, or survival function, S(t) = \exp \left( - \Lambda(t) \right). If stimulus accumulates quickly, then the probability that an event has not yet occurred by a given time rapidly decays.

Finally, given a survival function we can derive a time-to-event model, p(t) = \lambda(t) \, S(t). Events occur when the instantaneous stimulus is large and the survival function has not yet decayed too much.

Let’s start simple and assume that the stimulus for customer churn is fixed in time, \lambda(t) = \gamma. In this case, the cumulative stimulus function grows linearly, \Lambda(t) = \int_{0}^{t} \mathrm{d} t' \, \lambda(t') = \gamma \, t, while the survival function decays exponentially, S(t) = \exp \left( - \Lambda(t) \right) = \exp(-\gamma \, t).

The resulting time-to-event model is given by a familiar exponential probability density function, \begin{align*} p(t) &= \lambda(t) \, S(t) \\ &= \gamma \, \exp(-\gamma \, t) \\ &= \mathrm{exponential}(t \mid \gamma). \end{align*} This makes implementing this model in practice particularly straightforward.

Because each customer’s tenure starts at their specific join time, the time in the time-to-event model is given by t = t_{\text{leave}} - t_{\text{join}}. More compactly, we’re modeling p( t_{\text{leave}} - t_{\text{join}} ) = \mathrm{exponential}( t_{\text{leave}} - t_{\text{join}} \mid \gamma ).

3.2 Prior Model

To perform a Bayesian analysis, we need to complement this observational model with a prior model. The prior model quantifies any available domain expertise that we want to include in the analysis.

For this analysis, let’s say that our domain expertise was already used to design the original observation period. Customer tenure longer than 60 months is not impossible, but it should be relatively rare.

There are various ways to quantify “rare”, but here let’s say that we want to tune our prior model so that \pi( t < 60 \text{ months } ) \approx 0.99, or equivalently \pi( t > 60 \text{ months } ) \approx 0.01. That said, we usually don’t have to be particularly precious about the precise tail probability. For example, in practice \pi( t > 60 \text{ months } ) \approx 0.05 or even \pi( t > 60 \text{ months } ) \approx 0.10 would likely result in equivalent inferences. Our priority here is just suppressing unrealistic behaviors.

Conveniently, the tail probability in a survival model is given directly by the survival function. Consequently, our domain expertise directly constrains the reasonable behaviors of the survival function, \begin{align*} \pi( t > 60 \text{ months } ) &= 0.01 \\ S( 60 \text{ months } ) &= 0.01 \\ \exp(- \gamma \, 60 \text{ months } ) &= 0.01. \end{align*}

This, in, turn, constrains the reasonable values of \gamma, \begin{align*} \exp(- \gamma \, 60 \text{ months } ) &= 0.01 \\ - \gamma \, 60 \text{ months } &= \log(0.01) \\ \gamma &= - \frac{\log(0.01) }{ 60 \text{ months } } \\ \gamma &\approx \frac{0.1 }{ \text{ months } }. \end{align*}

As \gamma increased above this threshold, the model predicts tenures that concentrate more and more strongly towards zero. When \gamma is below this threshold, however, the more predicts more diffuse tenures. Overall we’ll need to balance these two extremes behaviors.

Let’s give a bit of wiggle room and suppress model configurations with \gamma \lessapprox 1, or, equivalently, 0 \lessapprox \eta = \frac{1}{\gamma} \lessapprox 1. We can implement this with a half-normal prior model that allocates 99\% of its probability between 0 and 1, p( \eta ) = \text{half-normal}( \eta \mid 0, 1 / 2.57 ).

To verify that this prior model enforces the desired regularization, we can perform a prior predictive check. Here we compare the prior predictions for tenure against our elicited domain expertise.

t_upper = 60

model1\_prior.stan

functions {
  // Log stimulus function
  real log_lambda(real t, real gamma) {
    return log(gamma);
  }
  
  // Cumulative stimulus function
  real Lambda(real t, real gamma) {
    return gamma * t;
  }

  // Log survival function
  real log_survival(real t, real gamma) {
    return -Lambda(t, gamma);
  }

  // Inverse cumulative stimulus function
  real Lambda_inv(real l, real gamma) {
    return l / gamma;
  }
  
  // Inverse survival function
  real inv_survival(real u, real gamma) {
    return Lambda_inv(-log(u), gamma);
  }
}

data {
  int<lower=0> N;
  real<lower=0> t_upper;
}

parameters {
  real<lower=0> eta; // Inverse scale parameter (Months)
}

model {
  // Prior Model
  // 0 <~ eta <~ 50
  target += normal_lpdf(eta | 0, 40 / 2.57);
}

generated quantities {
  array[N] real<lower=0> tenure_pred;
  real upper_tail_p_hat = 0;

  for (n in 1:N) {
    tenure_pred[n]
      = inv_survival(uniform_rng(0, 1), 1 / eta);
    if (tenure_pred[n] > t_upper)
      upper_tail_p_hat += 1;
  }
  upper_tail_p_hat /= N;
}

with open('stan_programs/model1_prior.stan', 'r') as file:
  stan_program = file.read()
model = stan.build(stan_program,
                   random_seed=8438338,
                   data={'N': 2500, 't_upper': t_upper})
fit = model.sample(num_samples=1024, refresh=0)

samples = util.extract_expectand_vals(fit)

Building...

Qualitatively, the prior predictions from our model look great. Shorter tenures are more likely, but very long tenures are not entirely impossible.

putil.plot_hist_quantiles(
  plot.gca(), samples, 'tenure_pred',
  0, t_upper + 12, 2,
  xlabel='Customer Tenure (months)'
)
plot.gca().set_xlim([0, t_upper + 10])

rect = plot.Rectangle((t_upper, 0), 10, 1600,
                      facecolor="#BBBBBB", alpha=0.25)
plot.gca().add_patch(rect)
plot.axvline(x=t_upper, linewidth=2, color="#BBBBBB")

plot.show()

179867 predictive values (1.76%) fell above the histogram binning.

We can also provide a more quantitative check by computing the prior predictive tail probability.

est = util.ensemble_mcmc_est(samples['upper_tail_p_hat'])

print(f'pi( t > 60 ) = {est[0]:.3f} +/- {2 * est[1]:.3f}')

pi( t > 60 ) = 0.029 +/- 0.002

This isn’t too far from our initial 1\% target!

3.3 Posterior Quantification

Together, the observational model and the prior define a full Bayesian model. We can summarize the structure of this model using a directed graph (Figure 2), but to work with it in practice we’ll need to implement it as a probabilistic program.

Figure 2: Our initial survival model concerns how long a customer stays after joining, t_{\text{leave}} - t_{\text{join}}.

model1.stan

functions {
  // Log stimulus function
  real log_lambda(real t, real gamma) {
    return log(gamma);
  }
  
  // Cumulative stimulus function
  real Lambda(real t, real gamma) {
    return gamma * t;
  }

  // Log survival function
  real log_survival(real t, real gamma) {
    return -Lambda(t, gamma);
  }

  // Inverse cumulative stimulus function
  real Lambda_inv(real l, real gamma) {
    return l / gamma;
  }
  
  // Inverse survival function
  real inv_survival(real u, real gamma) {
    return Lambda_inv(-log(u), gamma);
  }
}

data {
  real           T0; // Months
  real<lower=T0> T1; // Months

  // Churned customers
  int<lower=0> N_churn;
  array[N_churn] real<lower=T0, upper=T1> t_join_churn;  // Months
  array[N_churn] real<lower=T0, upper=T1> t_leave_churn; // Months
}

parameters {
  real<lower=0> eta; // Inverse scale parameter (Months)
}

transformed parameters {
  real<lower=0> gamma = 1 / eta;
}

model {
  // Prior Model
  // 0 <~ eta <~ 40
  target += normal_lpdf(eta | 0, 40 / 2.57);

  // Observational Model
  for (n in 1:N_churn) {
    real tenure = t_leave_churn[n] - t_join_churn[n];
    target +=   log_lambda(tenure, gamma)
              + log_survival(tenure, gamma);
  }
}

generated quantities {
  array[N_churn] real<lower=0> tenure_churn_pred;
  array[N_churn] real<lower=T0> t_leave_churn_pred;

  for (n in 1:N_churn) {
    tenure_churn_pred[n]
      = inv_survival(uniform_rng(0, 1), gamma);
    t_leave_churn_pred[n]
      = t_join_churn[n] + tenure_churn_pred[n];
  }
}

Using the full Bayesian model, we can learn which model configurations are consistent with the observed data with a posterior distribution.

numeric_keys = ['T0', 'T1', 'I', 'N_churn', 'X_churn',
                'online_banking_churn', 't_join_churn',
                't_leave_churn', 'N_active', 'X_active',
                'online_banking_active', 't_join_active']
stan_data = {k: data[k] for k in numeric_keys}

In practice, we’ll use Stan’s Hamiltonian Monte Carlo sampler to quantify the posterior distribution.

with open('stan_programs/model1.stan', 'r') as file:
  stan_program = file.read()
model = stan.build(stan_program,
                   random_seed=8438338,
                   data=stan_data)
fit = model.sample(num_samples=1024, refresh=0)

Building...

As powerful as Hamiltonian Monte Carlo is, the accuracy of this posterior quantification is not guaranteed. To ensure faithful posterior inferences, we have to first review the available computational diagnostics. In this case, there fortunately don’t appear to be any signs of computational problems.

diagnostics = util.extract_hmc_diagnostics(fit)
util.check_all_hmc_diagnostics(diagnostics)

samples = util.extract_expectand_vals(fit)
base_samples = util.filter_expectands(samples, ['eta'])
util.check_all_expectand_diagnostics(base_samples)

All Hamiltonian Monte Carlo diagnostics are consistent with accurate
Markov chain Monte Carlo.
 
All expectands checked appear to be behaving well enough for reliable
Markov chain Monte Carlo estimation.

3.4 Posterior Retrodictive Checks

Before reviewing any posterior inferences, we first have to judge the posterior retrodictive performance of our model. This requires investigating how well the posterior predictive distribution recovers various features of the observed data. Inferences drawn from a model with poor retrodictive performance poorly describe the actual system being analyzed.

Here the posterior predictive distribution of tenures skews towards smaller tenures more strongly than what we see in the observed data.

putil.plot_hist_quantiles(
  plot.gca(), samples, 'tenure_churn_pred',
  data['T0'] - 1, data['T1'] + 1, 1,
  baseline_values=tenure_churn,
  xlabel='Customer Tenure (months)'
)
plot.show()

55 predictive values (0.00%) fell above the histogram binning.

On their own, the posterior predictive leave times not only decay too quickly, but also and extend into impossible values! This suggests that we might not be properly accounting for the constraints of the observation period.

putil.plot_hist_quantiles(
  plot.gca(), samples, 't_leave_churn_pred',
  data['T0'] - 3, data['T1'] + 23, 3,
  baseline_values=data['t_leave_churn'],
  xlabel="Time Customer Leaves (months)"
)
plot.gca().axvline(x=data['T0'], linewidth=2,
                   linestyle='dashed', color="#DDDDDD")
plot.gca().axvline(x=data['T1'], linewidth=2,
                   linestyle='dashed', color="#DDDDDD")

plot.show()

3550 predictive values (0.04%) fell above the histogram binning.

Although we haven’t included demographic heterogeneity in our model, we can still see if the posterior predictions behave similarly to the observed data across the demographics. Here we’ll compare median tenure in bins of each demographic variable; for a more detailed discussion of this summary see this section.

bin_min = [ 0, 0, 0, 0 ]
bin_max = [ 100, 15, 1.05, 1000 ]
bin_delta = [ 20, 1, 0.20, 200 ]

pred_names = [ f'tenure_churn_pred[{n + 1}]'
               for n in range(data['N_churn']) ]

f, axarr = plot.subplots(2, 2, layout="constrained")

for i in range(data['I']):
  i1 = i // 2
  i2 = i % 2

  putil.plot_conditional_median_quantiles(
    axarr[i1, i2], samples, pred_names,
    [ x[i] for x in data['X_churn'] ],
    bin_min[i], bin_max[i], bin_delta[i],
    tenure_churn, residual=False,
    xlabel=data['x_names'][i], ylabel=""
  )

plot.show()

1 conditioning values (0.04%) fell below the histogram binning.
6 conditioning values (0.27%) fell above the histogram binning.

There is clear retrodictive disagreement here, as well.

4 Model 2: Constant Stimulus/Hazard Model with Censoring

All of this poor retrodictive performance indicates that our initial model is nowhere near adequate enough to fit the observed data. Fortunately, the precise nature of the retrodictive tensions motivates a variety of explicit improvements to our model.

Let’s start by addressing the impossible posterior predictions for the customer leave times. Predicted leave times for churned customers have to be before the end of the observational window, otherwise the customers would still be active.

At the same time, we are not completely ignorant of the leave times of the active customers. They must be after the close of the observational period.

Fortunately, both of these behaviors are straightforward to accommodate into our survival model.

4.1 Censored Observational Model

When making predictions for churned customers, t_{\text{leave}} has to be smaller than the right observational boundary T_{1}, t_{\text{leave}} < T_{1}. This then constrains the possible tenures; if the customer joined at t_{\text{join}} then \begin{align*} t_{\text{leave}} &< T_{1} \\ t_{\text{leave}} - t_{\text{join}} &< T_{1} - t_{\text{join}} \\ t &< T_{1} - t_{\text{join}}. \end{align*}

To accommodate active customers, we treat their unobserved leave times as auxiliary parameters and that we then marginalize. For a single active customer, the leave time model is given by p( t_{\text{leave}} \mid t_{\text{join}}, \theta ) = p( t_{\text{leave}} - t_{\text{join}} \mid \theta ), where p( t \mid \theta ) is the time-to-event model. To marginalize the unobserved t_{\text{leave}} from the model, we integrate over the possible values, \begin{align*} \int_{T_{1}}^{\infty} \mathrm{d} t_{\text{leave}} \, I &= \int_{T_{1}}^{\infty} \mathrm{d} t_{\text{leave}} \, p( t_{\text{leave}} \mid t_{\text{join}}, \theta ) \\ &= \int_{T_{1}}^{\infty} \mathrm{d} t_{\text{leave}} \, p( t_{\text{leave}} - t_{\text{join}} \mid \theta ) \\ &= \int_{T_{1} - t_{\text{join}}}^{\infty} \mathrm{d} t \, p( t \mid \theta ) \\ &= \pi[ t > T_{1} - t_{\text{join}} ] \\ &= S( T_{1} - t_{\text{join}} ). \end{align*}

In other words, each active customer doesn’t contribute a point-wise density p(t) to the observational model, but rather an entire interval probability \pi[ t > T_{1} - t_{\text{join}} ].

4.2 Posterior Quantification

With a bit more care, our full Bayesian model now considers both the churned and active customers (Figure 3).

Figure 3: Although we don’t observe the precise leave times of the active customers, we can lower bound them by how far the join times are from the end of the observation period. These censored observations also inform the parameters of our survival model.

Conveniently, censored observations and predictions can both be implemented with the log survival function.

model2.stan

functions {
  // Log stimulus function
  real log_lambda(real t, real gamma) {
    return log(gamma);
  }
  
  // Cumulative stimulus function
  real Lambda(real t, real gamma) {
    return gamma * t;
  }

  // Log survival function
  real log_survival(real t, real gamma) {
    return -Lambda(t, gamma);
  }

  // Inverse cumulative stimulus function
  real Lambda_inv(real l, real gamma) {
    return l / gamma;
  }
  
  // Inverse survival function
  real inv_survival(real u, real gamma) {
    return Lambda_inv(-log(u), gamma);
  }
}

data {
  real           T0; // Months
  real<lower=T0> T1; // Months

  // Churned customers
  int<lower=0> N_churn;
  array[N_churn] real<lower=T0, upper=T1> t_join_churn;  // Months
  array[N_churn] real<lower=T0, upper=T1> t_leave_churn; // Months

  // Active customers
  int<lower=0> N_active;
  array[N_active] real<lower=T0, upper=T1> t_join_active; // Months
}

parameters {
  real<lower=0> eta; // Inverse scale parameter (Months)
}

transformed parameters {
  real<lower=0> gamma = 1 / eta;
}

model {
  // Prior Model
  // 0 <~ eta <~ 40
  target += normal_lpdf(eta | 0, 40 / 2.57);

  // Observational Model
  for (n in 1:N_churn) {
    real tenure = t_leave_churn[n] - t_join_churn[n];
    target +=   log_lambda(tenure, gamma)
              + log_survival(tenure, gamma);
  }

  for (n in 1:N_active) {
    target += log_survival(T1 - t_join_active[n], gamma);
  }
}

generated quantities {
  array[N_churn] real<lower=0> tenure_churn_pred;
  array[N_churn] real<lower=T0> t_leave_churn_pred;

  for (n in 1:N_churn) {
    real S = exp(log_survival(T1 - t_join_churn[n], gamma));
    real u = uniform_rng(S, 1);
    tenure_churn_pred[n]
      = inv_survival(u, gamma);
    t_leave_churn_pred[n]
      = t_join_churn[n] + tenure_churn_pred[n];
  }
}

with open('stan_programs/model2.stan', 'r') as file:
  stan_program = file.read()
model = stan.build(stan_program,
                   random_seed=8438338,
                   data=stan_data)
fit = model.sample(num_samples=1024, refresh=0)

Building...

Despite the added complexity, no computational problems have arisen.

diagnostics = util.extract_hmc_diagnostics(fit)
util.check_all_hmc_diagnostics(diagnostics)

samples = util.extract_expectand_vals(fit)
base_samples = util.filter_expectands(samples, ['eta'])
util.check_all_expectand_diagnostics(base_samples)

All Hamiltonian Monte Carlo diagnostics are consistent with accurate
Markov chain Monte Carlo.
 
All expectands checked appear to be behaving well enough for reliable
Markov chain Monte Carlo estimation.

4.3 Posterior Retrodictive Checks

Did these changes improve the adequacy of our model at all? Well, the posterior predictive leave times now agree much better with the observed behavior.

putil.plot_hist_quantiles(
  plot.gca(), samples, 't_leave_churn_pred',
  data['T0'] - 3, data['T1'] + 23, 3,
  baseline_values=data['t_leave_churn'],
  xlabel="Time Customer Leaves (months)"
)
plot.gca().axvline(x=data['T0'], linewidth=2,
                   linestyle='dashed', color="#DDDDDD")
plot.gca().axvline(x=data['T1'], linewidth=2,
                   linestyle='dashed', color="#DDDDDD")

plot.show()

Unfortunately, there is still tension in the customer tenure. Specifically, the observed behavior is peaked away from zero while the posterior predictive behavior peaks at zero.

putil.plot_hist_quantiles(
  plot.gca(), samples, 'tenure_churn_pred',
  data['T0'] - 1, data['T1'] + 1, 1,
  baseline_values=tenure_churn,
  xlabel='Customer Tenure (months)'
)
plot.show()

Examining our model, the exponential time-to-event model always peaks at zero. In other words, there is no configuration of our initial survival model that can match this observed behavior.

Let’s try to address this inadequacy before we consider the retrodictive performance across the demographic variables.

5 Model 3: Power Stimulus/Hazard Model with Censoring

Fortunately, our initial survival model isn’t too difficult to generalize.

5.1 Observational Model

Instead of a constant stimulus, let’s allow the stimulus to also increase with a power of time, \lambda(t) = \alpha \, \frac{1}{\sigma} \, \left( \frac{t}{\sigma} \right)^{\alpha - 1} This results in the cumulative stimulus function \Lambda(t) = \int_{0}^{t} \mathrm{d} t' \, \lambda(t') = \left( \frac{t}{\sigma} \right)^{\alpha}, and the survival function S(t) = \exp \left( - \Lambda(t) \right) = \exp \left( - \left( \frac{t}{\sigma} \right)^{\alpha} \right).

Interestingly, the generalized time-to-event model also reduces to a familiar form, this time the Weibull family of probability density functions, \begin{align*} p(t) &= \lambda(t) \, S(t) \\ &= \alpha \, \frac{1}{\sigma} \, \left( \frac{t}{\sigma} \right)^{\alpha - 1} \, \exp \left( - \left( \frac{t}{\sigma} \right)^{\alpha} \right) \\ &= \mathrm{Weibull}(t \mid \alpha, \sigma). \end{align*}

5.2 Prior Model

The single threshold that we considered before was enough to configure the reasonable configurations of the single parameter \gamma in our initial model. It is not, however, sufficient to constrain both \alpha and \sigma in our generalized model.

Here we’ll consider two thresholds, one suppressing unreasonably large tenures and one suppressing unreasonably small tenures. Let’s say that customers leaving before only a single month should be rare.

This gives us two tail probability constraints \begin{align*} \pi( t < 1 \text{ month } ) \approx 0.01 \\ \pi( t > 60 \text{ months } ) \approx 0.01, \end{align*} which gives us two equations for the reasonable values of \alpha and \sigma.

First we have \begin{align*} S( 1 \text{ month }) &= 0.99 \\ \exp \left( -\left( \frac{ 1 \text{ month } }{ \sigma } \right)^\alpha \right) &= 0.99 \\ \left( \frac{ 1 \text{ month } }{ \sigma } \right)^\alpha &= -\log(0.99) \\ -\alpha \, \log \left( \frac{ \sigma }{ 1 \text{ month } } \right) &= \log(-\log(0.99)) \\ \alpha \, \log \left( \frac{ \sigma }{ 1 \text{ month } } \right) &= -\log(-\log(0.99)) \end{align*} Second we have \begin{align*} S( 60 \text{ months }) &= 0.01 \\ \exp \left( -\left( \frac{ 60 \text{ months } }{ \sigma } \right)^\alpha \right) &= 0.01 \\ \left( \frac{ 60 \text{ months } }{ \sigma } \right)^\alpha &= -\log(0.01) \\ -\alpha \, \log \left( \frac{ \sigma }{ 60 \text{ months } } \right) &= \log(-\log(0.01)) \\ \alpha \, \log \left( \frac{ \sigma }{ 60 \text{ months } } \right) &= -\log(-\log(0.01)) \end{align*}

After first defining \kappa = \frac{ \log(-\log(0.99)) }{ \log(-\log(0.01)) }, we can divide the two equations to give an equation entirely in terms of \sigma, \begin{align*} \frac{ \log \left( \frac{ \sigma }{ 1 \text{ month } } \right) }{ \log \left( \frac{ \sigma }{ 1 \text{ month } } \right) - \log 60 } &= \kappa \\ \log \left( \frac{ \sigma }{ 1 \text{ month } } \right) &= \kappa \, \log \left( \frac{ \sigma }{ 1 \text{ month } } \right) - \kappa \, \log 60 \\ ( 1 - \kappa ) \, \log \left( \frac{ \sigma }{ 1 \text{ month } } \right) &= - \kappa \, \log 60 \\ \log \left( \frac{ \sigma }{ 1 \text{ month } } \right) &= - \frac{ \kappa }{ 1 - \kappa} \, \log 60 \\ \sigma &= 60^{- \frac{ \kappa }{ 1 - \kappa} } \text{ month } \end{align*}

Model configurations with \sigma larger than this value will be too diffuse, allocating excessive probability below and above our two thresholds.

Next, we can solve the first equation for \sigma, \log( \sigma) = - \frac{ \log(-\log(0.99)) }{ \alpha }, and then substitute it into the second equation to give an equation entirely in terms of \alpha, \begin{align*} \alpha \, \log \left( \frac{ \sigma }{ 60 \text{ months } } \right) &= -\log(-\log(0.01)) \\ \alpha \, \log \left( \frac{ \sigma }{ 1 \text{ month } } \right) - \alpha \, \log 60 &= -\log(-\log(0.01)) \\ \alpha \, \left( - \frac{ \log(-\log(0.99)) }{ \alpha } \right) - \alpha \, \log 60 &= -\log(-\log(0.01)) \\ - \log(-\log(0.99)) - \alpha \, \log 60 &= -\log(-\log(0.01)) \\ - \alpha \, \log 60 &= -\log(-\log(0.01)) + \log(-\log(0.99)) \\ \alpha &= \frac{ \log(-\log(0.01)) - \log(-\log(0.99)) }{ \log 60 } \end{align*} In this case model configurations with \alpha below this value will concentrate at tenures that are too long.

Adding a little bit of a buffer, these calculations suggest the parameter constraints 0 \lessapprox \sigma \lessapprox 75 and 1 \lessapprox \alpha or 0 \lessapprox \omega = \frac{1}{\alpha} \lessapprox 1. Once again, half-normal prior models neatly ensure the desired regularization, \begin{align*} p( \sigma ) = \text{half-normal}( \sigma \mid 0, 75 / 2.57 ) \\ p( \omega ) = \text{half-normal}( \omega \mid 0, 1 / 2.57) \end{align*}

We can then verify that this prior model behaves as desired by implementing another prior predictive check.

t_lower = 1
t_upper = 60

model3\_prior.stan

functions {
  // Log stimulus function
  real log_lambda(real t, real alpha, real sigma) {
    return log(alpha) - log(sigma) + (alpha - 1) * log(t / sigma);
  }
  
  // Cumulative stimulus function
  real Lambda(real t, real alpha, real sigma) {
    return pow(t / sigma, alpha);
  }

  // Log survival function
  real log_survival(real t, real alpha, real sigma) {
    return -Lambda(t, alpha, sigma);
  }

  // Inverse cumulative stimulus function
  real Lambda_inv(real l, real alpha, real sigma) {
    return sigma * pow(l, 1 / alpha);
  }
  
  // Inverse survival function
  real inv_survival(real u, real alpha, real sigma) {
    return Lambda_inv(-log(u), alpha, sigma);
  }
}

data {
  int<lower=0> N;
  real<lower=0> t_lower;
  real<lower=0> t_upper;
}

parameters {
  real<lower=0> omega; // Inverse shape parameter (Unitless)
  real<lower=0> sigma; // Scale parameter (Months)
}

model {
  // Prior Model
  // 1 <~ alpha
  // 0 <~ omega = 1 / alpha <~ 1 / 1
  target += normal_lpdf(omega | 0, 1 / 2.57);

  // 0 <~ sigma <~ 75
  target += normal_lpdf(sigma | 0, 75 / 2.57);
}

generated quantities {
  array[N] real<lower=0> tenure_pred;
  real lower_tail_p_hat = 0;
  real upper_tail_p_hat = 0;

  for (n in 1:N) {
    tenure_pred[n]
      = inv_survival(uniform_rng(0, 1), 1 / omega, sigma);
    if (tenure_pred[n] < t_lower)
      lower_tail_p_hat += 1;
    if (tenure_pred[n] > t_upper)
      upper_tail_p_hat += 1;
  }
  lower_tail_p_hat /= N;
  upper_tail_p_hat /= N;
}

with open('stan_programs/model3_prior.stan', 'r') as file:
  stan_program = file.read()
model = stan.build(stan_program,
                   random_seed=8438338,
                   data={'N': 2500,
                         't_lower': t_lower,
                         't_upper': t_upper})
fit = model.sample(num_samples=1024, refresh=0)

samples = util.extract_expectand_vals(fit)

Building...

Qualitatively, the prior predictive behavior is similar to that in our previous model.

putil.plot_hist_quantiles(
  plot.gca(), samples, 'tenure_pred',
  0, t_upper + 12, 2,
  xlabel='Customer Tenure (months)'
)
plot.gca().set_xlim([0, t_upper + 10])

rect_lower = plot.Rectangle((0, 0), t_lower, 1600,
                      facecolor="#BBBBBB", alpha=0.25)
plot.gca().add_patch(rect_lower)
plot.axvline(x=t_lower, linewidth=2, color="#BBBBBB")

rect_upper = plot.Rectangle((t_upper, 0), 10, 1600,
                      facecolor="#BBBBBB", alpha=0.25)
plot.gca().add_patch(rect_upper)
plot.axvline(x=t_upper, linewidth=2, color="#BBBBBB")

plot.show()

226894 predictive values (2.22%) fell above the histogram binning.

Quantitatively, both prior predictive tail probabilities are reasonable. Again, we typically don’t need to be precious about the precise values, especially given the imprecision of our domain expertise elicitation in the first place.

est = util.ensemble_mcmc_est(samples['lower_tail_p_hat'])

print(f'pi( t < 1 ) = {est[0]:.3f} +/- {2 * est[1]:.3f}')

pi( t < 1 ) = 0.040 +/- 0.009

est = util.ensemble_mcmc_est(samples['upper_tail_p_hat'])

print(f'pi( t > 60 ) = {est[0]:.3f} +/- {2 * est[1]:.3f}')

pi( t > 60 ) = 0.043 +/- 0.005

5.3 Posterior Quantification

The graphical structure of this new model is the same as before, we just replace the initial stimulus function with a more general one (Figure 4).

Figure 4: Moving from a constant to a power stimulus model changes the derived time-to-event model, but not the overall structure of the data generating process.

model3.stan

functions {
  // Log stimulus function
  real log_lambda(real t, real alpha, real sigma) {
    return log(alpha) - log(sigma) + (alpha - 1) * log(t / sigma);
  }
  
  // Cumulative stimulus function
  real Lambda(real t, real alpha, real sigma) {
    return pow(t / sigma, alpha);
  }

  // Log survival function
  real log_survival(real t, real alpha, real sigma) {
    return -Lambda(t, alpha, sigma);
  }

  // Inverse cumulative stimulus function
  real Lambda_inv(real l, real alpha, real sigma) {
    return sigma * pow(l, 1 / alpha);
  }
  
  // Inverse survival function
  real inv_survival(real u, real alpha, real sigma) {
    return Lambda_inv(-log(u), alpha, sigma);
  }
}

data {
  real           T0; // Months
  real<lower=T0> T1; // Months

  // Churned customers
  int<lower=0> N_churn;
  array[N_churn] real<lower=T0, upper=T1> t_join_churn;  // Months
  array[N_churn] real<lower=T0, upper=T1> t_leave_churn; // Months

  // Active customers
  int<lower=0> N_active;
  array[N_active] real<lower=T0, upper=T1> t_join_active; // Months
}

parameters {
  real<lower=0> omega; // Inverse shape parameter (Unitless)
  real<lower=0> sigma; // Scale parameter (Months)
}

transformed parameters {
  real<lower=0> alpha = 1 / omega;
}

model {
  // Prior Model
  // 0 <~ omega <~ 1
  target += normal_lpdf(omega | 0, 1 / 2.57);

  // 0 <~ sigma <~ 75
  target += normal_lpdf(sigma | 0, 75 / 2.57);

  // Observational Model
  for (n in 1:N_churn) {
    real tenure = t_leave_churn[n] - t_join_churn[n];
    target +=   log_lambda(tenure, alpha, sigma)
              + log_survival(tenure, alpha, sigma);
  }

  for (n in 1:N_active) {
    target += log_survival(T1 - t_join_active[n], alpha, sigma);
  }
}

generated quantities {
  array[N_churn] real<lower=0> tenure_churn_pred;
  array[N_churn] real<lower=T0> t_leave_churn_pred;

  for (n in 1:N_churn) {
    real S = exp(log_survival(T1 - t_join_churn[n],
                              alpha, sigma));
    real u = uniform_rng(S, 1);
    tenure_churn_pred[n]
      = inv_survival(u, alpha, sigma);
    t_leave_churn_pred[n]
      = t_join_churn[n] + tenure_churn_pred[n];
  }
}

with open('stan_programs/model3.stan', 'r') as file:
  stan_program = file.read()
model = stan.build(stan_program,
                   random_seed=8438338,
                   data=stan_data)
fit = model.sample(num_samples=1024, refresh=0)

Building...

Conveniently, the posterior computation remains well-behaved.

diagnostics = util.extract_hmc_diagnostics(fit)
util.check_all_hmc_diagnostics(diagnostics)

samples = util.extract_expectand_vals(fit)
base_samples = util.filter_expectands(samples,
                                      ['omega', 'sigma'])
util.check_all_expectand_diagnostics(base_samples)

All Hamiltonian Monte Carlo diagnostics are consistent with accurate
Markov chain Monte Carlo.
 
All expectands checked appear to be behaving well enough for reliable
Markov chain Monte Carlo estimation.

5.4 Posterior Retrodictive Checks

Overall, the retrodictive agreement for both the tenure and leave times has improved quite a bit. That said, the agreement still leaves something to be desired.

f, axarr = plot.subplots(2, 1, layout="constrained")

putil.plot_hist_quantiles(
  axarr[0], samples, 'tenure_churn_pred',
  data['T0'] - 1, data['T1'] + 1, 1,
  baseline_values=tenure_churn,
  xlabel='Customer Tenure (months)'
)


putil.plot_hist_quantiles(
  axarr[1], samples, 't_leave_churn_pred',
  data['T0'] - 3, data['T1'] + 23, 3,
  baseline_values=data['t_leave_churn'],
  xlabel="Time Customer Leaves (months)"
)
axarr[1].axvline(x=data['T0'], linewidth=2,
                 linestyle='dashed', color="#DDDDDD")
axarr[1].axvline(x=data['T1'], linewidth=2,
                 linestyle='dashed', color="#DDDDDD")

plot.show()

This residual tension isn’t necessarily surprising. Remember that we have yet to consider demographic heterogeneity! Indeed, when we examine the binned medians, we see strong disagreement between the posterior predictions and the observed data.

bin_min = [ 0, 0, 0, 0 ]
bin_max = [ 100, 15, 1.05, 1000 ]
bin_delta = [ 20, 1, 0.20, 200 ]

pred_names = [ f'tenure_churn_pred[{n + 1}]'
               for n in range(data['N_churn']) ]

f, axarr = plot.subplots(2, 2, layout="constrained")

for i in range(data['I']):
  i1 = i // 2
  i2 = i % 2

  putil.plot_conditional_median_quantiles(
    axarr[i1, i2], samples, pred_names,
    [ x[i] for x in data['X_churn'] ],
    bin_min[i], bin_max[i], bin_delta[i],
    tenure_churn, residual=False,
    xlabel=data['x_names'][i], ylabel=""
  )

plot.show()

1 conditioning values (0.04%) fell below the histogram binning.
6 conditioning values (0.27%) fell above the histogram binning.

6 Model 4: Power Stimulus/Hazard Model with Censoring and Proportional Stimuli/Hazards

In order to improve our model, we have to address a question that is easy to take for granted but is often surprisingly subtle. Exactly what behaviors should vary with the demographic variables?

6.1 Observational Model

Instead of considering which features of the time-to-event model vary with external variables, the proportional stimulus model has the external variables scale the underlying stimulus function. This dependence will then propagate to the derived time-to-event model.

The dependence of the stimulus function on the external covariates is not necessarily constant. In fact, the dependence can itself depend on the behavior of the external variables. In the language of regression modeling, the conditional relationship between event time and the external variables can be confounded with the marginal behavior of the external variables. Here we will assume that any confounding is negligible, so that we can safely ignore the latter.

If we denote the continuous demographic variables for the nth customer as (x_{n1}, \ldots, x_{nI}) and the categorical demographic variables as, (l^{1}_{n}, \ldots, l^{K}_{n}) then the proportional stimulus model can be written as \lambda(t) = \lambda_{0}(t) \, \exp ( f( x_{n1}, \ldots, x_{nI}, \nu^{1}_{l^{1}_{n}}, \ldots, \nu^{K}_{l^{K}_{n}}), ) To ensure that the model is well-behaved, the outputs of the function f must be positive.

The baseline stimulus function \lambda_{0}(t) models the behavior for the implicit baseline configuration satisfying f(x_{n1}, \ldots, x_{nI}, \nu^{1}_{l^{1}_{n}}, \ldots, \nu^{K}_{l^{K}_{n}}, ) = 0. If the output of f is smaller than one, then it suppresses the driving stimulation, leading to later event times. When the output is larger than one, the stimulus is amplified and event times accelerate. The precise form of f determines how the behavior varies as the demographic variables deviate from the implicit baseline.

Proportional stimulus models often assume an exponential linear functional form, \lambda(t) = \lambda_{0}(t) \, \exp \left( \sum_{i = 1}^{I} \beta_{i} \, (x_{ni} - x_{0i}) + \sum_{k = 1}^{K} \delta^{k}_{l^{k}_{n}} \right), where \delta^{k}_{l^{k}_{n}} = \nu^{k}_{l^{k}_{n}} - \nu^{k}_{l^{k}_{0}}. We can always interpret this exponential linear model as an approximation of a nonlinear functional model around a baseline configuration. Conveniently, the baseline configuration for this functional model is explicitly given by x_{ni} = x_{0i} and l^{k}_{n} = l^{k}_{0}.

A proportional power stimulus function takes the form, \lambda(t) = \alpha \, \frac{1}{\sigma} \, \left( \frac{t}{\sigma} \right)^{\alpha - 1} \, \exp \left( \sum_{i = 1}^{I} \beta_{i} \, (x_{ni} - x_{0i}) + \sum_{k = 1}^{K} \delta^{k}_{l^{k}_{n}} \right) with the resulting cumulative stimulus function \Lambda(t) = \left( \frac{t}{\sigma} \right)^{\alpha} \, \exp \left( \sum_{i = 1}^{I} \beta_{i} \, (x_{ni} - x_{0i}) + \sum_{k = 1}^{K} \delta^{k}_{l^{k}_{n}} \right), and survival function S(t) = \exp \left( - \left( \frac{t}{\sigma} \right)^{\alpha} \, \exp \left( \sum_{i = 1}^{I} \beta_{i} \, (x_{ni} - x_{0i}) + \sum_{k = 1}^{K} \delta^{k}_{l^{k}_{n}} \right) \right).

The time-to-event model then becomes \begin{align*} p(t) &=\; \alpha \, \frac{1}{\sigma} \, \left( \frac{t}{\sigma} \right)^{\alpha - 1} \, \\ &\quad \cdot \exp \left( \sum_{i = 1}^{I} \beta_{i} \, (x_{ni} - x_{0i}) + \sum_{k = 1}^{K} \delta^{k}_{l^{k}_{n}} \right) \, \\ &\quad \cdot \exp \left( - \left( \frac{t}{\sigma} \right)^{\alpha} \, \exp \left( \sum_{i = 1}^{I} \beta_{i} \, (x_{ni} - x_{0i}) + \sum_{k = 1}^{K} \delta^{k}_{l^{k}_{n}} \right) \right) \\ &= \alpha \, \frac{1}{\tau} \, \left( \frac{t}{\tau} \right)^{\alpha - 1} \, \exp \left( - \left( \frac{t}{\tau} \right)^{\alpha} \, \right) \\ &= \mathrm{weibull}(t \mid \alpha, \tau) \end{align*} where \tau = \sigma \, \exp \left( \sum_{i = 1}^{I} \beta_{i} \, (x_{ni} - x_{0i}) + \sum_{k = 1}^{K} \delta^{k}_{l^{k}_{n}} \right) In this case, only the second parameter of the time-to-event model varies with the demographic variables.

We’ll take the baseline online banking configuration to be false. Because online banking features only two categories, true and false, we can implement this contribution by adding a single parameter \delta_{\mathrm{online}} for when a customer participates in online banking.

6.2 Defining External Variables

At this point we have to choose our external variables. We can always use our demographic variables directly, but we also have an opportunity to transform them to improve the performance of our model.

For example, because the observed wealth percentiles and monthly margins concentrate near their respective boundaries, any functional dependence on these variables will likely be exponential-nonlinear. In this case, the exponential-linear model will might provide a better approximation if we can first un-constrain these variables to avoid the boundaries entirely.

On the other hand, the observed age and product counts concentrate relatively far away from their lower zero boundaries. Consequently an exponential-linear model has a much better chance of performing reasonably well without any transformations.

The interval constraints of wealth percentiles can be transformed away with the logit function, while the positivity constraint of the monthly margins can be transformed away with the natural logarithm.

def logit(x):
  return math.log(x / (1 - x))

def transform_covariates(x):
  return [ x[0], x[1], logit(x[2]), math.log(x[3]) ]

z_names = [ data['x_names'][0],
            data['x_names'][1],
            f'logit(Wealth / Percentile)',
            f'log(Monthly Margin / Euro)'    ]

Z_churn = [ transform_covariates(x)
            for x in data['X_churn'] ]

Z_active = [ transform_covariates(x)
            for x in data['X_active'] ]

As expected, the transformed demographic variables no longer bunch up at their lower boundary.

bin_min = [ 0, 0, -10, -10 ]
bin_max = [ 100, 10, 10, 10 ]
bin_delta = [ 5, 1, 1, 1 ]

f, axarr = plot.subplots(2, 2, layout="constrained")

for i in range(data['I']):
  i1 = i // 2
  i2 = i % 2

  putil.plot_line_hist(axarr[i1, i2],
                       [ z[i] for z in Z_churn ],
                       bin_min[i], bin_max[i], bin_delta[i],
                       xlabel=z_names[i])
plot.show()

1 values (0.04%) fell below the histogram binning.
3 values (0.13%) fell above the histogram binning.
18 values (0.80%) fell above the histogram binning.

Our model recovers customer tenure binned in these transformed demographic variables just as poorly as it did in bins of the unprocessed demographic variables.

bin_min = [ 0, 0, -10, -10 ]
bin_max = [ 100, 15, 10, 10 ]
bin_delta = [ 10, 1, 2, 2 ]

pred_names = [ f'tenure_churn_pred[{n + 1}]'
               for n in range(data['N_churn']) ]

f, axarr = plot.subplots(2, 2, layout="constrained")

for i in range(data['I']):
  i1 = i // 2
  i2 = i % 2

  putil.plot_conditional_median_quantiles(
    axarr[i1, i2], samples, pred_names,
    [ z[i] for z in Z_churn ],
    bin_min[i], bin_max[i], bin_delta[i],
    tenure_churn, residual=False,
    xlabel=z_names[i], ylabel=""
  )

plot.show()

1 conditioning values (0.04%) fell below the histogram binning.
18 conditioning values (0.80%) fell above the histogram binning.

Having determined the external inputs to the proportional stimulus model, we just need to specify a baseline configuration. This baseline effectively defines a default customer, relative to whom we can interpret the demographic heterogeneity.

Here we’ll start with an interpretable baseline defined in terms of our nominal demographic variables.

x0 = [
 40,  # Age (years)
 2,   # Number of bank products (counts)
 0.5, # Wealth (Percentile)
 50   # Monthly Margin (euros)
]

The baseline configuration in our model is then given by transforming this nominal baseline.

z0 = transform_covariates(x0)

stan_data['z0'] = z0
stan_data['Z_churn'] = Z_churn
stan_data['Z_active'] = Z_active

6.3 Prior Model

One benefit of the proportional stimulus model is that we can motivate an informative prior model for the new parameters by reasoning about thresholds in proportional changes.

Consider, for example, a categorical demographic variable. If the value of only this variable changes from its baseline to \delta^{k}_{l}, then the stimulus function is proportionally scaled by \exp( \delta^{k}_{l} ). Constraining the reasonable proportional variables then constrains reasonable values for \delta^{k}_{l}, \begin{align*} \frac{1}{\delta} &\lessapprox \exp( \delta^{k}_{l} ) \lessapprox \delta \\ - \log \delta &\lessapprox \delta^{k}_{l} \lessapprox + \log \delta \end{align*}

For online banking participation we’ll take \delta = 5 so that \vert \delta_{\mathrm{online}} \vert < \log 5. Note that 500\% is an absolutely huge change; consequently this prior model is extremely weakly informative relative to what domain expertise would be available in most applications!

To motivate prior models for coefficients of the remaining demographic variables, we consider constraining the proportional variation when each component individually deviates from its baseline by a certain amount, \begin{align*} \frac{1}{\delta} &\lessapprox \exp( \beta_{i} \, \Delta x_{i} ) \lessapprox \delta \\ - \log \delta &\lessapprox \beta_{i} \, \Delta x_{i} \lessapprox + \log \delta \\ - \frac{ \log \delta}{ \Delta x_{i} } &\lessapprox \beta_{i} \lessapprox + \frac{ \log \delta }{ \Delta x_{i} } \end{align*}

For age we’ll take \delta = 5 and \Delta x_{i} = 10 so that \vert \beta_{\mathrm{age}} \vert < \frac{ \log 5 }{ 10 }, while for the number of banking products we’ll take \delta = 5 and \Delta x_{i} = 1 so that \vert \beta_{\mathrm{products}} \vert < \frac{ \log 5 }{ 1 }.

When working with the transformed demographic variables we’ll want to take those transformations into account. For example, log transforming a demographic variable is equivalent to the proportional variation \begin{align*} \exp \left( \beta_{i} \, (\log x_{i} - \log x_{0i}) \right) &= \exp \left( \beta_{i} \, \log \frac{ x_{i} }{ x_{0i} } \right) \\ &= \exp \left( \log \left( \frac{ x_{i} }{ x_{0i} } \right)^{\beta_{i}} \right) \\ &= \left( \frac{ x_{i} }{ x_{0i} } \right)^{\beta_{i}} \end{align*}

Consequently, the corresponding prior model is informed by reasoning about proportional output variation given proportional input variation. If x_{i} = \zeta \, x_{0i} then \begin{align*} \frac{1}{\delta} &\lessapprox \bigg( \frac{ x_{i} }{ x_{0i} } \bigg)^{\beta_{i}} \lessapprox \delta \\ \frac{1}{\delta} &\lessapprox \bigg( \zeta \bigg)^{\beta_{i}} \lessapprox \delta \\ - \log \delta &\lessapprox \beta_{i} \, \log\zeta \lessapprox + \log \delta \\ - \frac{ \log \delta }{ \log \zeta } &\lessapprox \beta_{i} \lessapprox + \frac{ \log \delta }{ \log \zeta}. \end{align*}

For monthly margin we’ll take \delta = 5 and \zeta = 2 so that \vert \beta_{i} \vert < \frac{ \log 5 }{ \log 2}.

Accounting for logit transformed demographic variables is a bit more subtle. First we have \begin{align*} \exp \left( \beta_{i} \, (\mathrm{logit} x_{i} - \mathrm{logit} x_{0i}) \right) &= \exp \left( \beta_{i} \, \left( \log \frac{ x_{i} }{ 1 - x_{i} } - \log \frac{ x_{0i} }{ 1 - x_{0i} } \right) \right) \\ &= \exp \left( \beta_{i} \, \log \left( \frac{ x_{i} }{ 1 - x_{i} } \frac{ 1 - x_{0i} }{ x_{0i} } \right) \right) \\ &= \exp \left( \log \left( \frac{ x_{i} }{ 1 - x_{i} } \frac{ 1 - x_{0i} }{ x_{0i} } \right)^{\beta_{i}} \right) \\ &= \left( \frac{ x_{i} }{ 1 - x_{i} } \frac{ 1 - x_{0i} }{ x_{0i} } \right)^{\beta_{i}} \end{align*}

In this case, we need to think about how the proportional output changes given proportional changes in the input odds, \zeta = \frac{ x_{i} }{ 1 - x_{i} } \frac{ 1 - x_{0i} }{ x_{0i} }. This gives - \frac{ \log \delta }{ \log \zeta } \lessapprox \beta_{i} \lessapprox + \frac{ \log \delta }{ \log \zeta}.

For wealth percentile we’ll take \delta = 5 and \zeta = 2 so that \vert \beta_{i} \vert < \frac{ \log 5 }{ \log 2}.

We don’t want any unexpected interaction between the demographic variables to spoil all of this careful prior model development. Specifically, we should be diligent and perform a prior check.

model4\_prior.stan

functions {
  // Log stimulus function
  real log_lambda(real t, real alpha, real sigma) {
    return log(alpha) - log(sigma) + (alpha - 1) * log(t / sigma);
  }
  
  // Cumulative stimulus function
  real Lambda(real t, real alpha, real sigma) {
    return pow(t / sigma, alpha);
  }

  // Log survival function
  real log_survival(real t, real alpha, real sigma) {
    return -Lambda(t, alpha, sigma);
  }

  // Inverse cumulative stimulus function
  real Lambda_inv(real l, real alpha, real sigma) {
    return sigma * pow(l, 1 / alpha);
  }
  
  // Inverse survival function
  real inv_survival(real u, real alpha, real sigma) {
    return Lambda_inv(-log(u), alpha, sigma);
  }
}

data {
  real<lower=0> t_lower;
  real<lower=0> t_upper;

  int I;            // Number of covariates
  row_vector[I] z0; // Covariate baselines

  // Churned customers
  int<lower=0> N_churn;
  matrix[N_churn, I] Z_churn;
  array[N_churn] int online_banking_churn;
}

parameters {
  real<lower=0> omega; // Inverse shape parameter (Unitless)
  real<lower=0> sigma; // Scale parameter (Months)

  vector[I] beta;    // Covariate slopes (1 / covariate units)
  real delta_online; // Online banking contribution
}

transformed parameters {
  real<lower=0> alpha = 1 / omega;
}

model {
  // Prior Model
  // 0 <~ omega <~ 1
  target += normal_lpdf(omega | 0, 1 / 2.57);

  // 0 <~ sigma <~ 75
  target += normal_lpdf(sigma | 0, 75 / 2.57);

  // -log(5) / 10 <~ beta[1] <~ +log(5) / 10
  target += normal_lpdf(beta[1] | 0, 0.07);

  // -log(5) / 1 <~ beta[2] <~ +log(5) / 1
  target += normal_lpdf(beta[2] | 0, 0.70);

  // -log(5) / 2 <~ beta[3] <~ +log(5) / 2
  target += normal_lpdf(beta[3] | 0, 0.35);

  // -log(5) / log(2) <~ beta[4] <~ +log(5) / log(2)
  target += normal_lpdf(beta[4] | 0, 1);

  // 0.1 <~ exp(delta_online) <~ 10
  target += normal_lpdf(delta_online | 0, 1);
}


generated quantities {
  array[N_churn] real<lower=0> tenure_pred;
  real lower_tail_p_hat = 0;
  real upper_tail_p_hat = 0;

  for (n in 1:N_churn) {
    real mu = (Z_churn[n,] - z0) * beta;
    if (online_banking_churn[n])
      mu += delta_online;
    real tau = sigma * exp(-mu / alpha);

    tenure_pred[n]
      = inv_survival(uniform_rng(0, 1), alpha, tau);

    if (tenure_pred[n] < t_lower)
      lower_tail_p_hat += 1;
    if (tenure_pred[n] > t_upper)
      upper_tail_p_hat += 1;
  }
  lower_tail_p_hat /= N_churn;
  upper_tail_p_hat /= N_churn;
}

simu_data={'t_lower': t_lower, 't_upper': t_upper,
           'I': data['I'], 'z0': z0,
           'N_churn': data['N_churn'], 'Z_churn': Z_churn,
           'online_banking_churn': data['online_banking_churn']}

with open('stan_programs/model4_prior.stan', 'r') as file:
  stan_program = file.read()
model = stan.build(stan_program,
                   random_seed=8438338,
                   data=simu_data)
fit = model.sample(num_samples=1024, refresh=0)

samples = util.extract_expectand_vals(fit)

Building...

The shape of the prior predictive distribution in this new model is similar to that from the previous models, just a bit more diffuse.

putil.plot_hist_quantiles(
  plot.gca(), samples, 'tenure_pred',
  0, t_upper + 12, 2,
  xlabel='Customer Tenure (months)'
)
plot.gca().set_xlim([0, t_upper + 10])

rect_lower = plot.Rectangle((0, 0), t_lower, 1600,
                      facecolor="#BBBBBB", alpha=0.25)
plot.gca().add_patch(rect_lower)
plot.axvline(x=t_lower, linewidth=2, color="#BBBBBB")

rect_upper = plot.Rectangle((t_upper, 0), 10, 1600,
                      facecolor="#BBBBBB", alpha=0.25)
plot.gca().add_patch(rect_upper)
plot.axvline(x=t_upper, linewidth=2, color="#BBBBBB")

plot.show()

706640 predictive values (7.69%) fell above the histogram binning.

This excessive diffusion also manifests is higher tail probabilities.

est = util.ensemble_mcmc_est(samples['lower_tail_p_hat'])

print(f'pi( t < 1 ) = {est[0]:.3f} +/- {2 * est[1]:.3f}')

pi( t < 1 ) = 0.057 +/- 0.009

est = util.ensemble_mcmc_est(samples['upper_tail_p_hat'])

print(f'pi( t > 60 ) = {est[0]:.3f} +/- {2 * est[1]:.3f}')

pi( t > 60 ) = 0.104 +/- 0.005

All of this is to be expected, however, since we’re using the same prior model for \alpha and \sigma and then incorporating additional variation in the tenure behavior with the proportional scaling. That said, the prior predictive tail probabilities are still small enough that our prior model is in reasonable agreement with our elicited tenure thresholds. Besides, we can always introduce a more informative prior model later if needed.

6.4 Posterior Quantification

Incorporating a proportional stimulus into our previous model requires demographic variables for both the churned and active customers (Figure 5).

Figure 5: A proportional stimulus survival model allows the demographic variables to suppress or amplify the stimulus function, which then modifies the time-to-event model.

Because the proportional scaling effectively modifies only the \sigma parameter of the time-to-event model, we just need to compute and then use \tau in place of \sigma.

model4.stan

functions {
  // Log stimulus function
  real log_lambda(real t, real alpha, real sigma) {
    return log(alpha) - log(sigma) + (alpha - 1) * log(t / sigma);
  }
  
  // Cumulative stimulus function
  real Lambda(real t, real alpha, real sigma) {
    return pow(t / sigma, alpha);
  }

  // Log survival function
  real log_survival(real t, real alpha, real sigma) {
    return -Lambda(t, alpha, sigma);
  }

  // Inverse cumulative stimulus function
  real Lambda_inv(real l, real alpha, real sigma) {
    return sigma * pow(l, 1 / alpha);
  }
  
  // Inverse survival function
  real inv_survival(real u, real alpha, real sigma) {
    return Lambda_inv(-log(u), alpha, sigma);
  }
}

data {
  real           T0; // Months
  real<lower=T0> T1; // Months

  int I;            // Number of covariates
  row_vector[I] z0; // Covariate baselines

  // Churned customers
  int<lower=0> N_churn;
  array[N_churn] real<lower=T0, upper=T1> t_join_churn;  // Months
  array[N_churn] real<lower=T0, upper=T1> t_leave_churn; // Months

  matrix[N_churn, I] Z_churn;
  array[N_churn] int online_banking_churn;

  // Active customers
  int<lower=0> N_active;
  array[N_active] real<lower=T0, upper=T1> t_join_active; // Months

  matrix[N_active, I] Z_active;
  array[N_active] int online_banking_active;
}

parameters {
  real<lower=0> omega; // Inverse shape parameter (Unitless)
  real<lower=0> sigma; // Scale parameter (Months)

  vector[I] beta;    // Covariate slopes (1 / covariate units)
  real delta_online; // Online banking contribution
}

transformed parameters {
  real<lower=0> alpha = 1 / omega;
}

model {
  // Prior Model
  // 0 <~ omega <~ 1
  target += normal_lpdf(omega | 0, 1 / 2.57);

  // 0 <~ sigma <~ 75
  target += normal_lpdf(sigma | 0, 75 / 2.57);

  // -log(5) / 10 <~ beta[1] <~ +log(5) / 10
  target += normal_lpdf(beta[1] | 0, 0.07);

  // -log(5) / 1 <~ beta[2] <~ +log(5) / 1
  target += normal_lpdf(beta[2] | 0, 0.70);

  // -log(5) / 2 <~ beta[3] <~ +log(5) / 2
  target += normal_lpdf(beta[3] | 0, 0.35);

  // -log(5) / log(2) <~ beta[4] <~ +log(5) / log(2)
  target += normal_lpdf(beta[4] | 0, 1);

  // 0.1 <~ exp(delta_online) <~ 10
  target += normal_lpdf(delta_online | 0, 1);

  // Observational Model
  for (n in 1:N_churn) {
    real tenure = t_leave_churn[n] - t_join_churn[n];

    real mu = (Z_churn[n,] - z0) * beta;
    if (online_banking_churn[n])
      mu += delta_online;
    real tau = sigma * exp(-mu / alpha);

    target +=   log_lambda(tenure, alpha, tau)
              + log_survival(tenure, alpha, tau);
  }

  for (n in 1:N_active) {
    real mu = (Z_active[n,] - z0) * beta;
    if (online_banking_active[n])
      mu += delta_online;
    real tau = sigma * exp(-mu / alpha);

    target += log_survival(T1 - t_join_active[n], alpha, tau);
  }
}

generated quantities {
  array[N_churn] real<lower=0> tenure_churn_pred;
  array[N_churn] real<lower=T0> t_leave_churn_pred;

  for (n in 1:N_churn) {
    real mu = (Z_churn[n,] - z0) * beta;
    if (online_banking_churn[n])
      mu += delta_online;
    real tau = sigma * exp(-mu / alpha);

    real S = exp(log_survival(T1 - t_join_churn[n],
                              alpha, tau));
    real u = uniform_rng(S, 1);
    tenure_churn_pred[n]
      = inv_survival(u, alpha, tau);
    t_leave_churn_pred[n]
      = t_join_churn[n] + tenure_churn_pred[n];
  }
}

with open('stan_programs/model4.stan', 'r') as file:
  stan_program = file.read()
model = stan.build(stan_program,
                   random_seed=8438338,
                   data=stan_data)
fit = model.sample(num_samples=1024, refresh=0)

Building...

Hamiltonian Monte Carlo remains undaunted by our increasingly-complex posterior distributions.

diagnostics = util.extract_hmc_diagnostics(fit)
util.check_all_hmc_diagnostics(diagnostics)

samples = util.extract_expectand_vals(fit)
base_samples = util.filter_expectands(samples,
                                      ['omega', 'sigma',
                                       'beta', 'delta_online'],
                                      check_arrays=True)
util.check_all_expectand_diagnostics(base_samples)

All Hamiltonian Monte Carlo diagnostics are consistent with accurate
Markov chain Monte Carlo.
 
All expectands checked appear to be behaving well enough for reliable
Markov chain Monte Carlo estimation.

6.5 Posterior Retrodictive Checks

After all of this modeling work, the payoff is pretty rewarding. By incorporating demographic heterogeneity, our new model exhibits beautiful retrodictive agreement across all of the summaries that we have considered so far.

f, axarr = plot.subplots(2, 1, layout="constrained")

putil.plot_hist_quantiles(
  axarr[0], samples, 'tenure_churn_pred',
  data['T0'] - 1, data['T1'] + 1, 1,
  baseline_values=tenure_churn,
  xlabel='Customer Tenure (months)'
)


putil.plot_hist_quantiles(
  axarr[1], samples, 't_leave_churn_pred',
  data['T0'] - 3, data['T1'] + 23, 3,
  baseline_values=data['t_leave_churn'],
  xlabel="Time Customer Leaves (months)"
)
axarr[1].axvline(x=data['T0'], linewidth=2,
                 linestyle='dashed', color="#DDDDDD")
axarr[1].axvline(x=data['T1'], linewidth=2,
                 linestyle='dashed', color="#DDDDDD")

plot.show()

pred_names = [ f'tenure_churn_pred[{n + 1}]'
               for n in range(data['N_churn']) ]

f, axarr = plot.subplots(2, 2, layout="constrained")

for i in range(data['I']):
  i1 = i // 2
  i2 = i % 2

  putil.plot_conditional_median_quantiles(
    axarr[i1, i2], samples, pred_names,
    [ z[i] for z in Z_churn ],
    bin_min[i], bin_max[i], bin_delta[i],
    tenure_churn, residual=False,
    xlabel=z_names[i], ylabel=""
  )

plot.show()

1 conditioning values (0.04%) fell below the histogram binning.
18 conditioning values (0.80%) fell above the histogram binning.

Failures of the exponential-linear proportion model would manifest as disagreement in these conditional median plots. Moreover, the shape of the residual difference between the posterior predictive and observed behaviors would directly inform the functional flexibility that we would need to add.

6.6 Posterior Inferences

Now that none of our posterior retrodictive checks suggest any model inadequacies, we can interpret the derived posterior inferences as reasonable approximations to the true data generating behavior.

f, axarr = plot.subplots(1, 3, layout="constrained")

util.plot_expectand_pushforward(axarr[0], samples['alpha'],
                                25, flim=[1.3, 1.6],
                                display_name="alpha")

util.plot_expectand_pushforward(axarr[1], samples['sigma'],
                                25, flim=[3.75, 5.25],
                                display_name="sigma")

util.plot_expectand_pushforward(axarr[2], samples['delta_online'],
                                25, flim=[-0.25, 0.25],
                                display_name="delta_online")
plot.show()

We can see, for example, that the stimulus function, and the speed at which customers leave, deceases when any of the demographic variables move above their baseline values. Moreover, the decrease is most sensitive to increases in the logit wealth percentile.

names = [ f'beta[{i + 1}]'
          for i in range(data['I']) ]
putil.plot_disc_pushforward_quantiles_vert(
  plot.gca(), samples, names,
  yticklabels=names,
  xlabel="Marginal Posterior Quantiles"
)

plot.show()

7 Customer Lifetime Value By Segment

Now the real fun begins. Our posterior inferences can be used to not only learn about the customers observed in our study, but also make predictions for the behavior of new, unobserved customers.

For example, let’s say that the bank’s potential customer base has been stratified into segments, each of which is characterized by an ensemble of representative demographic configurations.

with open("data/customer_segmentation.json", "r") as infile:
  customer_segmentation = json.load(infile)

for i, name in enumerate(customer_segmentation['segment_names']):
  print(f'Segment {i + 1}: {name}')

Segment 1: Mass Retail
Segment 2: Private Banking
Segment 3: Wealth
Segment 4: Retired
Segment 5: Young Professional
Segment 6: Student

The customers in the private banking segment are older and more wealthy than the customers in the student segment.

bin_min = [ 0, 0, 0, 0 ]
bin_max = [ 100, 15, 1.05, 1000 ]
bin_delta = [ 5, 1, 0.05, 10 ]

f, axarr = plot.subplots(2, 2, layout="constrained")

for i in range(data['I']):
  i1 = i // 2
  i2 = i % 2

  putil.plot_line_hists(axarr[i1, i2],
                        [ x[i] for s, x in
                          zip(customer_segmentation['customer_segment_pred'],
                              customer_segmentation['X_pred'])
                          if s == 1 ],
                        [ x[i] for s, x in
                          zip(customer_segmentation['customer_segment_pred'],
                              customer_segmentation['X_pred'])
                          if s == 5 ],
                       bin_min[i], bin_max[i], bin_delta[i],
                       xlabel=data['x_names'][i])

plot.show()

When can use our model to predict how loyal the customers in each of these segments might be, and hence what their lifetime value might to the bank might be. For demonstration purposes, let’s say that each customer, regardless of segment, earns the bank the same value each month. To be as robust as possible, we’ll consider lifetime value with and without future discounting.

stan_data['value_per_month'] = 50
stan_data['zero_rate'] = 0.10

7.1 Posterior Quantification

Because our inferences are the same, the basic structure of the full Bayesian model is the same. We just have to add the derived predictions (Figure 6).

Figure 6: Once we’ve learned the consistent configurations of our model, we can make predictions for new customers.

In our Stan program, we just need to passing in the segment demographics and then calculating predictions in the generated quantities block.

model5.stan

functions {
  // Log stimulus function
  real log_lambda(real t, real alpha, real sigma) {
    return log(alpha) - log(sigma) + (alpha - 1) * log(t / sigma);
  }
  
  // Cumulative stimulus function
  real Lambda(real t, real alpha, real sigma) {
    return pow(t / sigma, alpha);
  }

  // Log survival function
  real log_survival(real t, real alpha, real sigma) {
    return -Lambda(t, alpha, sigma);
  }

  // Inverse cumulative stimulus function
  real Lambda_inv(real l, real alpha, real sigma) {
    return sigma * pow(l, 1 / alpha);
  }
  
  // Inverse survival function
  real inv_survival(real u, real alpha, real sigma) {
    return Lambda_inv(-log(u), alpha, sigma);
  }
}

data {
  real           T0; // Months
  real<lower=T0> T1; // Months

  int I;            // Number of covariates
  row_vector[I] z0; // Covariate baselines

  // Churned customers
  int<lower=0> N_churn;
  array[N_churn] real<lower=T0, upper=T1> t_join_churn;  // Months
  array[N_churn] real<lower=T0, upper=T1> t_leave_churn; // Months

  matrix[N_churn, I] Z_churn;
  array[N_churn] int online_banking_churn;

  // Active customers
  int<lower=0> N_active;
  array[N_active] real<lower=T0, upper=T1> t_join_active; // Months

  matrix[N_active, I] Z_active;
  array[N_active] int online_banking_active;

  // Prospective customers
  int<lower=0> N_pred;

  matrix[N_pred, I] Z_pred;
  array[N_pred] int online_banking_pred;

  real<lower=0> value_per_month;    // Euros / Month
  real<lower=0, upper=1> zero_rate; // Percentage / Year
}

parameters {
  real<lower=0> omega; // Inverse shape parameter (Unitless)
  real<lower=0> sigma; // Scale parameter (Months)

  vector[I] beta;    // Covariate slopes (1 / covariate units)
  real delta_online; // Online banking contribution
}

transformed parameters {
  real<lower=0> alpha = 1 / omega;
}

model {
  // Prior Model
  target += normal_lpdf(omega | 0, 1 / 2.57);
  target += normal_lpdf(sigma | 0, 75 / 2.57);
  target += normal_lpdf(beta[1] | 0, 0.07);
  target += normal_lpdf(beta[2] | 0, 0.70);
  target += normal_lpdf(beta[3] | 0, 0.35);
  target += normal_lpdf(beta[4] | 0, 1);
  target += normal_lpdf(delta_online | 0, 1);

  // Observational Model
  for (n in 1:N_churn) {
    real tenure = t_leave_churn[n] - t_join_churn[n];

    real mu = (Z_churn[n,] - z0) * beta;
    if (online_banking_churn[n])
      mu += delta_online;
    real tau = sigma * exp(-mu / alpha);

    target +=   log_lambda(tenure, alpha, tau)
              + log_survival(tenure, alpha, tau);
  }

  for (n in 1:N_active) {
    real mu = (Z_active[n,] - z0) * beta;
    if (online_banking_active[n])
      mu += delta_online;
    real tau = sigma * exp(-mu / alpha);

    target += log_survival(T1 - t_join_active[n], alpha, tau);
  }
}

generated quantities {
  array[N_pred] real<lower=0> clv;
  array[N_pred] real<lower=0> clv_discount;

  for (n in 1:N_pred) {
    real mu = (Z_pred[n,] - z0) * beta;
    if (online_banking_pred[n])
      mu += delta_online;
    real tau = sigma * exp(-mu / alpha);

    real u = uniform_rng(0, 1);
    real tenure = inv_survival(u, alpha, tau);

    clv[n]
      = value_per_month * tenure;
    clv_discount[n]
      = clv[n] / pow(1 + zero_rate / 360, 12 * tenure);
  }
}

stan_data['N_pred'] = customer_segmentation['N_pred']
stan_data['Z_pred'] = [ transform_covariates(x)
                        for x in
                        customer_segmentation['X_pred'] ]
stan_data['online_banking_pred'] = \
  customer_segmentation['online_banking_pred']

with open('stan_programs/model5.stan', 'r') as file:
  stan_program = file.read()
model = stan.build(stan_program,
                   random_seed=8438338,
                   data=stan_data)
fit = model.sample(num_samples=1024, refresh=0)

Building...

Although we’re quantifying the same posterior distribution, the modified generated quantities block does affect how pseudo-random numbers are consumed. This, in turn, can affect the performance of Hamiltonian Monte Carlo. Usually this isn’t an issue, but to be safe we’ll want to check our computational diagnostics again. Fortunately, there are no indications of problems here.

diagnostics = util.extract_hmc_diagnostics(fit)
util.check_all_hmc_diagnostics(diagnostics)

samples = util.extract_expectand_vals(fit)
base_samples = util.filter_expectands(samples,
                                      ['omega', 'sigma',
                                       'beta', 'delta_online'],
                                      check_arrays=True)
util.check_all_expectand_diagnostics(base_samples)

All Hamiltonian Monte Carlo diagnostics are consistent with accurate
Markov chain Monte Carlo.
 
All expectands checked appear to be behaving well enough for reliable
Markov chain Monte Carlo estimation.

7.2 Posterior Inferences

By inferring customer lifetime value for each customer in each of the segments, we have a wealth of information that we can use for various forecasting tasks.

f, axarr = plot.subplots(2, 1, layout="constrained")

putil.plot_hist_quantiles(
  axarr[0], samples, 'clv',
  0, 1000, 50,
  xlabel='Customer Lifetime Value (Euro)'
)

putil.plot_hist_quantiles(
  axarr[1], samples, 'clv_discount',
  0, 1000, 50,
  xlabel='Future Discounted Customer Lifetime Value (Euro)'
)

plot.show

377176 predictive values (6.14%) fell above the histogram binning.
318553 predictive values (5.18%) fell above the histogram binning.

Namely, we can quantify the differences in customer lifetime value across the customer segments. Customers in the private banking, wealth, and retired segments offer much larger lifetime value than those in the mass retail, young professional, and student segments.

segments = customer_segmentation['customer_segment_pred']

f, axarr = plot.subplots(2, 3, layout="constrained")

for k in range(customer_segmentation['S']):
  k1 = k // 3
  k2 = k % 3

  filt_names = [ f'clv[{n + 1}]'
                 for n in range(customer_segmentation['N_pred'])
                 if segments[n] == k + 1 ]
  filt_samples = util.filter_expectands(samples, filt_names)

  putil.plot_hist_quantiles(
    axarr[k1, k2], filt_samples, 'clv',
    0, 2000, 125,
    xlabel='Customer Lifetime Value (Euro)',
    title=customer_segmentation['segment_names'][k]
  )

plot.show()

26744 predictive values (2.61%) fell above the histogram binning.
11704 predictive values (1.14%) fell above the histogram binning.
12047 predictive values (1.18%) fell above the histogram binning.

At this point the possibilities are endless. If, for example, we have information on the cost of acquiring customers in each segment, then we could use these posterior inferences to quantify the utility of recruiting customers in each segment. This could then be used to, for instance, optimize marketing strategies.

8 Conclusion

Accurately inferring customer churn is more subtle than it might first appear. In particular, if we don’t properly account for active customers then we will tend to overestimate churn. In turn, properly accounting for both churned and active customers requires careful modeling assumptions that we must validate. Once we’ve invested the time to build an adequate model, however, we have access to sophisticated posterior inferences than can inform a diversity of important decisions.

License

The code in this case study is copyrighted by Michael Betancourt and licensed under the new BSD (3-clause) license:

https://opensource.org/licenses/BSD-3-Clause

The text and figures in this case study are copyrighted by Michael Betancourt and licensed under the CC BY-NC 4.0 license:

https://creativecommons.org/licenses/by-nc/4.0/

Original Computing Environment

from watermark import watermark
print(watermark())

Last updated: 2026-05-04T00:38:06.062336-04:00

Python implementation: CPython
Python version       : 3.12.10
IPython version      : 9.6.0

Compiler    : Clang 13.0.0 (clang-1300.0.29.30)
OS          : Darwin
Release     : 24.6.0
Machine     : x86_64
Processor   : i386
CPU cores   : 16
Architecture: 64bit

print(watermark(packages="matplotlib, numpy, json, stan"))

matplotlib: 3.10.7
 numpy    : not installed
 json     : not installed
 stan     : not installed