In my previous case study I introduced probability theory but treated probability distributions as monolithic objects given to us by some higher mathematical power. To really take advantage of probability theory in practice we’ll need to build probability distributions from the ground up, and conditional probability theory will prove to be a vital tool in that construction.

This case study will introduce a conceptual understanding of conditional probability theory and its applications. We’ll begin with a discussion of marginal probability distributions before introducing conditional probability distributions as their complement. Then we’ll examine how different conditional probability distributions can be related to each other through Bayes’ Theorem before considering how all of these objects manifest in probability mass function and probability density function representations. Finally we’ll review some of the important practical applications of the theory.

1 Compressing Probability Into Marginal Distributions

A projection operator, \(\varpi: X \rightarrow Y\), maps points from a total space \(X\) to points in a base space \(Y \subset X\). All of the points in \(X\) that project to the same base point \(y \in Y\) form a fiber attached to \(y\), \[ F(y) = \left\{ x \in X \mid \varpi(x) = y \right\}. \]





The total space then decomposes into the union of these fibers, \[ X = \cup_{y \in Y} F(y), \]





which then collapse to the base space under the projection operator. Moreover, under reasonable conditions all the fibers, regardless of their base point, will look like the same fiber space, \(F\),
\[ F(y) = F, \forall y \in Y. \]

Under those same reasonable conditions, a \(\sigma\)-algebra on \(X\) naturally defines a \(\sigma\)-algebra on \(Y\) and the projection operator is measurable with respect to the two. Consequently any joint probability distribution on \(X\) will transform into a unique marginal probability distribution on \(Y\). More commonly we say that we marginalize out the fibers, \[ \mathbb{P}_{\pi_{*}} [ B ] = \mathbb{P}_{\pi} [ \varpi^{-1}(B)], \] where \[ \varpi^{-1}(B) = \cup_{y \in B} F(y). \]





By pushing a probability distribution on \(X\) along the projection operator we compress all of the probability along the fibers onto the corresponding base points and lose all of the information about how probability is distributed along the fibers themselves.
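To make this compression concrete, consider a minimal Python sketch of marginalization as a pushforward; the discrete joint distribution and projection here are hypothetical illustrations, and the sketch just accumulates the probability of each fiber onto its base point.

```python
# A minimal sketch of marginalization as a pushforward; the joint
# distribution and projection here are hypothetical illustrations.

# A discrete total space: points are (x, y) pairs with joint probabilities.
joint = {
    ('a', 1): 0.10, ('a', 2): 0.25,
    ('b', 1): 0.30, ('b', 2): 0.35,
}

def project(point):
    """Projection operator onto the base space: keep only the second component."""
    return point[1]

# Push the joint distribution forward: every point in the fiber attached to
# a base point contributes its probability to that base point.
marginal = {}
for point, prob in joint.items():
    base = project(point)
    marginal[base] = marginal.get(base, 0.0) + prob

print(marginal)  # {1: 0.4, 2: 0.6}
```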

This process is a bit more straightforward when we are dealing with a product space, \(X \times Y\), where each point can be identified with the components \((x, y)\). Any such product space is naturally equipped with the component projection operators \[ \begin{alignat*}{6} \varpi_{X} :\; &X \times Y& &\rightarrow& \; &X& \\ &(x, y)& &\mapsto& &x& \end{alignat*} \] and \[ \begin{alignat*}{6} \varpi_{Y} :\; &X \times Y& &\rightarrow& \; &Y& \\ &(x, y)& &\mapsto& &y&. \end{alignat*} \]

The \(X\)-projection \(\varpi_{X} : X \times Y \rightarrow X\) compresses all points in the product space with the same \(X\)-component to the corresponding point in \(X\), losing all information about the \(Y\)-components. Consequently every fiber is a copy of \(Y\), \[ F(x) = \varpi_{X}^{-1}(x) = Y. \]



In particular the product space decomposes into the union of many copies of \(Y\).



At the same time the \(Y\)-projection \(\varpi_{Y} : X \times Y \rightarrow Y\) compresses all points in the product space with the same \(Y\)-component to the corresponding point in \(Y\), losing all information about the \(X\)-components.
Every fiber with respect to the \(Y\)-projection is then a copy of \(X\), \[ F(y) = \varpi_{Y}^{-1}(y) = X. \]



Consequently the product space also decomposes into the union of many copies of \(X\)!



Pushing a distribution defined on \(X \times Y\) forwards along \(\varpi_{X}\) compresses all probability along \(Y\) to give a marginal probability distribution over \(X\) defined by \[ \mathbb{P}_{\pi_{X*}} [ B ] = \mathbb{P}_{\pi} [ \varpi^{-1}_{X}(B) ]. \] Similarly, pushing that same distribution forwards along \(\varpi_{Y}\) compresses all probability along \(X\) to give a marginal probability distribution over \(Y\) defined by \[ \mathbb{P}_{\pi_{Y*}} [ B ] = \mathbb{P}_{\pi} [ \varpi^{-1}_{Y}(B) ]. \]
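On a finite product space these two pushforwards reduce to summing a table of joint probabilities along one index or the other. A minimal sketch, assuming NumPy and a hypothetical joint probability table:

```python
import numpy as np

# A hypothetical table of joint probabilities on a finite product space,
# with joint[i, j] the probability allocated to the point (x_i, y_j).
joint = np.array([[0.10, 0.25],
                  [0.30, 0.35]])

# Marginalizing out Y compresses each fiber {x_i} x Y onto x_i; similarly
# marginalizing out X compresses each fiber X x {y_j} onto y_j.
marginal_x = joint.sum(axis=1)  # [0.35, 0.65]
marginal_y = joint.sum(axis=0)  # [0.40, 0.60]
```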





Consider, for example, the three-dimensional space, \(\mathbb{R}^{3}\), whose component spaces are typically denoted with the axis labels \(X\), \(Y\), and \(Z\), \[ \mathbb{R}^{3} = \mathbb{R} \times \mathbb{R} \times \mathbb{R} = X \times Y \times Z. \] The coordinate functions serve as projection operators onto the three axes; for example the \(x\) coordinate identifies the position in \(X\). Marginalizing out \(X\) transforms a probability distribution over \(X \times Y \times Z\) into a probability distribution over the two-dimensional space, \(Y \times Z = \mathbb{R}^{2}\). Further marginalizing out \(Y\) then gives a probability distribution over the one-dimensional space, \(Z = \mathbb{R}\).
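The same iterated compression can be demonstrated numerically for a discretized distribution; a sketch, again with a hypothetical probability table, marginalizing out \(X\) and then \(Y\):

```python
import numpy as np

rng = np.random.default_rng(8675309)

# A hypothetical discretized probability distribution over X x Y x Z.
joint_xyz = rng.random((4, 5, 6))
joint_xyz /= joint_xyz.sum()

joint_yz = joint_xyz.sum(axis=0)   # marginalize out X
marginal_z = joint_yz.sum(axis=0)  # then marginalize out Y

# Compressing one fiber at a time agrees with compressing all at once.
assert np.allclose(marginal_z, joint_xyz.sum(axis=(0, 1)))
```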

2 Conditional Probability Distributions

Projection operators allow us to transform a probability distribution over a space to a probability distribution on some lower-dimensional subspace. Is it possible, however, to go the other way? Can we take a given marginal probability distribution on a subspace and construct a joint probability distribution on the total space that projects back to that marginal? We can if we define a probability distribution over all of the fibers in order to specify the information lost in the compression to the marginal distribution.

A conditional probability distribution defines a probability distribution over each fiber, \[ \begin{alignat*}{6} \mathbb{P}_{F \mid Y} :\; &\mathcal{F} \times Y& &\rightarrow& \; &[0, 1]& \\ &(B, y)& &\mapsto& &\mathbb{P}_{F \mid Y} [B, y]&, \end{alignat*} \] where \(\mathcal{F}\) is the induced \(\sigma\)-algebra over the fiber space under the usual regularity conditions on the projection operator that we’ve been assuming.





Evaluated at any \(y \in Y\) the conditional probability distribution defines a probability distribution over the corresponding fiber. On the other hand, when evaluated at a given subset \(B \in \mathcal{F}\) the conditional probability distribution becomes a measurable function from \(Y\) into \([0, 1]\) that quantifies how the probability of that set varies as we move from one fiber to the next.

Together with a marginal distribution, \(\pi_{Y}\), we can then implicitly define a joint probability distribution over the total space by specifying the probability of any well-behaved set \(A\) in \(X\). First we decompose \(A\) along the fibers by taking its intersection with each fiber.





We can then compute the probability allocated to each of these intersections using the conditional probability distribution, \[ p(y) = \mathbb{P}_{F \mid Y} [A \cap \varpi^{-1} (y), y]. \]





This yields a real-valued function on the base space, \(p : Y \rightarrow [0, 1]\). In order to aggregate all of these probabilities together and compute the total probability allocated to the set \(A\) we take the expectation of this function, \[ \begin{align*} \mathbb{P}_{X} [ A ] &= \mathbb{E}_{Y} [ p(y) ] \\ &= \mathbb{E}_{Y} [ \mathbb{P}_{F \mid Y} [A \cap \varpi^{-1} (y), y] ]. \end{align*} \] The induced joint distribution on the total space is consistent in the sense that pushing it forward along the projection operator recovers the marginal distribution with which we started.

To reinforce these ideas let’s consider how conditional probability distributions manifest on a product space, \(X \times Y\), with respect to the \(Y\)-projection \(\varpi: X \times Y \rightarrow Y\). As we saw in the previous section, in this case the fiber space is \(X\).

The conditional probability distribution is then a collection of probability distributions over \(X\) indexed by the base point \(y \in Y\), \[ \begin{alignat*}{6} \mathbb{P}_{X \mid Y} :\; &\mathcal{X} \times Y& &\rightarrow& \; &[0, 1]& \\ &(B, y)& &\mapsto& &\mathbb{P}_{X \mid Y}[B, y]&, \end{alignat*} \] where \(\mathcal{X}\) is the \(\sigma\)-algebra on \(X\).





We can use this conditional probability distribution to implicitly lift a marginal distribution on \(Y\) to a joint distribution on \(X \times Y\) by assigning probabilities to any well-behaved subset \(A\) of \(X \times Y\) in three steps. First we decompose \(A\) into stripes formed by intersecting \(A\) with each fiber.





We then compute the probability allocated to each of these intersections using the conditional probability distribution, \[ p(y) = \mathbb{P}_{X \mid Y} [A \cap \varpi^{-1} (y), y]. \]





Finally we aggregate these fiber probabilities together into a total probability on the joint space by taking an expectation, \[ \begin{align*} \mathbb{P}_{X \times Y} [ A ] &= \mathbb{E}_{Y} [ p(y) ] \\ &= \mathbb{E}_{Y} [ \mathbb{P}_{X \mid Y} [A \cap \varpi^{-1} (y), y] ]. \end{align*} \]
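For a finite product space all three of these steps reduce to elementwise array manipulations. A minimal sketch, assuming hypothetical conditional and marginal probability tables, that also verifies the consistency property discussed above:

```python
import numpy as np

# Hypothetical conditional probability table: cond[i, j] = pi(x_i | y_j),
# so that each column is a probability distribution over the fiber X
# attached to the base point y_j.
cond = np.array([[0.2, 0.7],
                 [0.8, 0.3]])
marginal_y = np.array([0.4, 0.6])  # hypothetical marginal on Y

# A well-behaved subset A of X x Y, as an indicator table over the grid.
A = np.array([[True, False],
              [True, True]])

# Steps 1 and 2: intersect A with each fiber and evaluate its conditional
# probability, giving a function p(y_j) on the base space.
p = (cond * A).sum(axis=0)

# Step 3: aggregate with an expectation over the marginal distribution on Y.
prob_A = (p * marginal_y).sum()  # 1.0 * 0.4 + 0.3 * 0.6 = 0.58

# Consistency: the induced joint distribution pushes forward along the
# Y-projection to the marginal distribution with which we started.
joint = cond * marginal_y
assert np.allclose(joint.sum(axis=0), marginal_y)
```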

3 Representations of Conditional Probability Distributions

Because conditional probability distributions are just collections of regular probability distributions they admit similar representations. We just have to allow those representations to vary with the conditioning base point. To avoid some technical subtleties we’ll limit our consideration here to product spaces, \(X \times Y\).

3.1 Conditional Probability Mass Functions

When the component spaces \(X\) and \(Y\) are discrete we can represent the conditional probability distributions with conditional probability mass functions.

For example, a conditional probability mass function with respect to the \(X\)-projection is defined as \[ \begin{alignat*}{6} \pi_{Y \mid X} :\; &Y \times X& &\rightarrow& \; &[0, 1]& \\ &(y, x)& &\mapsto& &\pi_{Y \mid X} (y \mid x)&. \end{alignat*} \] Given a probability mass function for the marginal distribution, \(\pi_{X}(x)\), we can immediately construct a probability mass function for the corresponding joint distribution as \[ \pi_{X \times Y} (x, y) = \pi_{Y \mid X} ( y \mid x ) \, \pi_{X} (x). \]

Similarly, a conditional probability mass function with respect to the \(Y\)-projection is defined as \[ \begin{alignat*}{6} \pi_{X \mid Y} :\; &X \times Y& &\rightarrow& \; &[0, 1]& \\ &(x, y)& &\mapsto& &\pi_{X \mid Y} (x \mid y)&. \end{alignat*} \] Given a probability mass function for the marginal distribution, \(\pi_{Y}(y)\), we can immediately construct a probability mass function for the corresponding joint distribution as \[ \pi_{X \times Y} (x, y) = \pi_{X \mid Y} ( x \mid y ) \, \pi_{Y} (y). \]

We have two different ways to decompose a joint distribution over a binary product space into a conditional distribution and a marginal distribution, but these two decompositions are not independent. Bayes’ Theorem defines a relationship between them, allowing us to reconstruct one from the other.

The manifestation of Bayes’ Theorem for probability mass functions follows from identifying the two different decompositions of the joint probability mass function, \[ \begin{align*} \pi_{X \times Y} (x, y) &= \pi_{X \times Y}(x, y) \\ \pi_{Y \mid X} ( y \mid x ) \, \pi_{X} (x) &= \pi_{X \mid Y} ( x \mid y ) \, \pi_{Y} (y), \end{align*} \] which we can manipulate to give \[ \pi_{Y \mid X} ( y \mid x ) = \frac{ \pi_{X \mid Y} ( x \mid y ) \, \pi_{Y} (y) }{ \pi_{X} (x) } \] or \[ \pi_{X \mid Y} ( x \mid y ) = \frac{ \pi_{Y \mid X} ( y \mid x ) \, \pi_{X} (x) }{ \pi_{Y} (y) }. \] In words, given the two marginal probability mass functions we can recover either conditional probability mass function from the other.
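A minimal numerical sketch of this recovery, assuming hypothetical probability mass function tables on a finite grid:

```python
import numpy as np

# Hypothetical decomposition: cond_y_given_x[j, i] = pi(y_j | x_i), with
# each column a probability mass function over Y, plus a marginal on X.
cond_y_given_x = np.array([[0.9, 0.2],
                           [0.1, 0.8]])
marginal_x = np.array([0.3, 0.7])

# Joint probability mass function: pi(x, y) = pi(y | x) pi(x).
joint = cond_y_given_x * marginal_x  # joint[j, i] = pi(x_i, y_j)

# Marginalize out X, then apply Bayes' Theorem to recover the other
# conditional: pi(x | y) = pi(y | x) pi(x) / pi(y).
marginal_y = joint.sum(axis=1)
cond_x_given_y = joint / marginal_y[:, None]

# Each row of the recovered conditional is a probability mass function over X.
assert np.allclose(cond_x_given_y.sum(axis=1), 1.0)
```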

3.2 Conditional Probability Density Functions

Similarly, when the component spaces \(X\) and \(Y\) are subsets of the real numbers we can represent the conditional probability distributions with conditional probability density functions.

For example, a conditional probability density function with respect to the \(X\)-projection is defined as \[ \begin{alignat*}{6} \pi_{Y \mid X} :\; &Y \times X& &\rightarrow& \; &[0, \infty)& \\ &(y, x)& &\mapsto& &\pi_{Y \mid X} (y \mid x)&. \end{alignat*} \] Given a probability density function for the marginal distribution, \(\pi_{X}(x)\), we can immediately construct a probability density function for the corresponding joint distribution as \[ \pi_{X \times Y}(x, y) = \pi_{Y \mid X} ( y \mid x ) \, \pi_{X} (x). \]

Similarly, a conditional probability density function with respect to the \(Y\)-projection is defined as \[ \begin{alignat*}{6} \pi_{X \mid Y} :\; &X \times Y& &\rightarrow& \; &[0, \infty)& \\ &(x, y)& &\mapsto& &\pi_{X \mid Y} (x \mid y)&. \end{alignat*} \] Given a probability density function for the marginal distribution, \(\pi_{Y}(y)\), we can immediately construct a probability density function for the corresponding joint distribution as \[ \pi_{X \times Y} (x, y) = \pi_{X \mid Y} ( x \mid y ) \, \pi_{Y} (y). \]

The manifestation of Bayes’ Theorem for probability density functions follows from identifying the two different decompositions of the joint probability density function, \[ \begin{align*} \pi_{X \times Y} (x, y) &= \pi_{X \times Y} (x, y) \\ \pi_{Y \mid X} ( y \mid x ) \, \pi_{X} (x) &= \pi_{X \mid Y} ( x \mid y ) \, \pi_{Y} (y), \end{align*} \] which we can manipulate to give \[ \pi_{Y \mid X} ( y \mid x ) = \frac{ \pi_{X \mid Y} ( x \mid y ) \, \pi_{Y} (y) }{ \pi_{X} (x) } \] or \[ \pi_{X \mid Y} ( x \mid y ) = \frac{ \pi_{Y \mid X} ( y \mid x ) \, \pi_{X} (x) }{ \pi_{Y} (y) }. \] In words, given the two marginal probability density functions we can recover either conditional probability density function from the other.

For example, a joint Gaussian probability density function on \(\mathbb{R}^{2}\) is given by \[ \pi(x, y) = \frac{1}{2 \pi \sigma_{x} \sigma_{y} \sqrt{1 - \rho^{2}} } \exp \left[ - \frac{1}{2} \frac{1}{1 - \rho^{2}} \left( \left( \frac{ x - \mu_x }{\sigma_{x}} \right)^{2} - 2 \rho \frac{x - \mu_x}{\sigma_{x}} \frac{y - \mu_y}{\sigma_{y}} + \left( \frac{y - \mu_y}{\sigma_{y}} \right)^{2} \right) \right], \] where \(\mu_{x}\) and \(\mu_{y}\) are the component means, \(\sigma_{x}\) and \(\sigma_{y}\) the component standard deviations, and \(\rho\) the correlation between the two components.





The probability density function for the marginal distribution on \(x\) is given by marginalizing out \(y\), \[ \begin{align*} \pi(x) &= \int \mathrm{d}y \, \pi(x, y) \\ &= \frac{1}{\sqrt{2\pi \sigma_{x}^{2}}} \exp \left[ - \frac{1}{2} \left( \frac{x - \mu_x}{\sigma_{x}} \right)^{2} \right]. \end{align*} \]
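We can verify this marginalization numerically, and at the same time recover the conditional density function \(\pi(y \mid x) = \pi(x, y) / \pi(x)\), which for a joint Gaussian is itself Gaussian with location \(\mu_y + \rho \, \sigma_y \, (x - \mu_x) / \sigma_x\) and scale \(\sigma_y \sqrt{1 - \rho^{2}}\). A sketch with hypothetical parameter values:

```python
import numpy as np

# Hypothetical parameter values for the joint Gaussian density above.
mu_x, mu_y, sigma_x, sigma_y, rho = 1.0, -0.5, 1.5, 0.75, 0.6

def joint_density(x, y):
    """The bivariate Gaussian probability density function from above."""
    u, v = (x - mu_x) / sigma_x, (y - mu_y) / sigma_y
    norm = 2 * np.pi * sigma_x * sigma_y * np.sqrt(1 - rho**2)
    return np.exp(-0.5 * (u**2 - 2 * rho * u * v + v**2) / (1 - rho**2)) / norm

# Marginalize out y with a simple quadrature over a wide grid.
x = 0.8
ys, dy = np.linspace(mu_y - 10 * sigma_y, mu_y + 10 * sigma_y, 20001, retstep=True)
marginal_x = (joint_density(x, ys) * dy).sum()

# The numerical marginal matches the analytic Gaussian density on x.
analytic_x = np.exp(-0.5 * ((x - mu_x) / sigma_x)**2) / np.sqrt(2 * np.pi * sigma_x**2)
assert np.isclose(marginal_x, analytic_x)

# The conditional density pi(y | x) = pi(x, y) / pi(x) is again Gaussian.
mu_cond = mu_y + rho * sigma_y * (x - mu_x) / sigma_x
sigma_cond = sigma_y * np.sqrt(1 - rho**2)
numerical_cond = joint_density(x, ys) / marginal_x
analytic_cond = np.exp(-0.5 * ((ys - mu_cond) / sigma_cond)**2) \
                / np.sqrt(2 * np.pi * sigma_cond**2)
assert np.allclose(numerical_cond, analytic_cond, atol=1e-6)
```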