Conditional Probability Theory
Conditional probability theory provides a rigorous way to decompose probability distributions over a space X into a collection of probability distributions over subsets of X. This decomposition introduces two powerful new operations into probability theory. First, it allows us to reduce complicated probabilistic calculations over all of X into a sequence of potentially-simpler calculations over the smaller subsets. At the same time, it allows us to build probability distributions up from lower-dimensional, and more manageable, components.
That said, all of this power comes at the cost of subtlety. To avoid any confusion, our introduction to conditional probability theory will need to proceed carefully.
In this chapter we will first learn how to decompose spaces into subsets before discussing how probability distributions can be decomposed across those subsets. The decomposition of the probability density functions that are so critical to practical applications will be the topic of the next chapter.
1 Decomposing Spaces With Partitions
In Chapter 6, Section 1.2.1, we introduced the notion of a partition (Figure 1 (b)). To refresh our memories, a partition is a collection of subsets \mathcal{P} = \{ \mathsf{c}_{1}, \ldots, \mathsf{c}_{i}, \ldots \} that are each non-empty, \mathsf{c}_{i} \ne \emptyset, are mutually disjoint, \mathsf{c}_{i} \cap \mathsf{c}_{i' \ne i} = \emptyset, and altogether cover the entire space, \cup_{i} \mathsf{c}_{i} = X. A collection of subsets that covers X but whose members intersect with each other does not form a valid partition (Figure 1 (c)), nor does a collection of disjoint subsets that doesn’t cover all of X (Figure 1 (d)).
The individual subsets that form a partition are known as the cells of the partition. A partition can contain a finite number of cells, a countably infinite number of cells, or even an uncountably infinite number of cells. I will refer to partitions with a finite, countably infinite, and uncountably infinite number of cells as finite, countable, and uncountable partitions, respectively.
A finite partition can always be defined as an explicit list of cells. Explicit lists, however, aren’t practical for countable or uncountable partitions, as they would have to be infinitely long. Fortunately, we can also implicitly define partitions from the level sets of appropriate functions.
Consider, for example, a finite partition \mathcal{P} defined as an explicit list of I subsets, \mathcal{P} = \{ \mathsf{c}_{1}, \ldots, \mathsf{c}_{i}, \ldots, \mathsf{c}_{I} \}. In order to distinguish between the individual cells here, I have assigned them each a unique numerical label or index from the integers \{1, \ldots, I \}. Beyond a notational convenience, we can also use this indexing to define the partition itself.
The indexing implicitly defines a bijective index function that maps each cell to its corresponding integer index, \begin{alignat*}{6} \chi_{\mathcal{P}} :\; &\mathcal{P}& &\rightarrow& \; &\{1, \ldots, I \}& \\ &\mathsf{c}_{i}& &\mapsto& &i&. \end{alignat*} At the same time, we can also define a function that maps each point in the ambient space x \in X to the partition cell that contains it, \begin{alignat*}{6} \eta_{\mathcal{P}} :\; &X& &\rightarrow& \; &\mathcal{P}& \\ &x& &\mapsto& &\{ \mathsf{c}_{i} \in \mathcal{P} \mid x \in \mathsf{c}_{i} \}&. \end{alignat*} Composing these two functions together defines a third function that maps points in the ambient space to partition cell indices (Figure 2), \begin{alignat*}{6} \phi_{\mathcal{P}} = \chi_{\mathcal{P}} \circ \eta_{\mathcal{P}} :\; &X& &\rightarrow& \; &\{1, \ldots, I \}& \\ &x& &\mapsto& &\{ i \in \{1, \ldots, I \} \mid x \in \mathsf{c}_{i} \in \mathcal{P} \}&. \end{alignat*}
Because the partition cells are, by definition, disjoint and cover all of X, each point x \in X falls into one, and only one, partition cell. In other words, each point is associated with one and only one partition cell index, so \phi_{\mathcal{P}} is a well-defined function. Moreover, because every cell is non-empty, every index is the image of at least one point. Consequently, \phi_{\mathcal{P}} will always be a surjective function.
The level set of \phi_{\mathcal{P}} for a given index i is the subset of all input points that fall into the ith partition cell. This, however, is just the ith partition cell itself, \phi_{\mathcal{P}}^{-1}(i) = \{ x \in X \mid \phi_{\mathcal{P}}(x) = i \} = \mathsf{c}_{i}. As a result, we can completely reconstruct the cells of the partition \mathcal{P} from the level sets of this indexing function (Figure 3), \mathcal{P} = \{ \mathsf{c}_{1} = \phi_{\mathcal{P}}^{-1}(1), \ldots, \mathsf{c}_{i} = \phi_{\mathcal{P}}^{-1}(i), \ldots, \mathsf{c}_{I} = \phi_{\mathcal{P}}^{-1}(I) \}.
Because the cells in a partition are generally unordered, the exact indexing we use here is arbitrary. Different permutations of the labels define different index functions \chi_{\mathcal{P}}, and hence different composite functions \phi_{\mathcal{P}}. The level sets of these functions, however, are always the same. In other words, we are free to work with whichever indexing might be most convenient in any given application.
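As a minimal sketch of this construction, the following Python builds an index function for a small finite partition and recovers the cells from its level sets. The space X = \{0, \ldots, 5\}, the three cells, and the names `phi` and `level_set` are all hypothetical choices made for illustration; none of them come from the text.

```python
# Hypothetical ambient space and explicit finite partition (illustrative only).
X = set(range(6))
cells = [frozenset({0, 1}), frozenset({2}), frozenset({3, 4, 5})]

def phi(x, cells=cells):
    """Map a point to the (1-based) index of the unique cell containing it."""
    matches = [i for i, c in enumerate(cells, start=1) if x in c]
    assert len(matches) == 1, "cells must be disjoint and cover the space"
    return matches[0]

def level_set(i, cells=cells):
    """All points of X that phi maps to the index i."""
    return frozenset(x for x in X if phi(x, cells) == i)

# The level sets reproduce the cells exactly, in index order.
recovered = [level_set(i) for i in range(1, len(cells) + 1)]

# Permuting the labels changes the index function but not the
# collection of level sets.
permuted = [cells[2], cells[0], cells[1]]
recovered_permuted = {level_set(i, permuted) for i in range(1, len(permuted) + 1)}
```

Comparing `recovered` to `cells`, and `recovered_permuted` to `set(cells)`, is one way to see that the indexing is only a labeling convention.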
Let’s take a breath and summarize what we’ve done so far. Any finite partition \mathcal{P} can be explicitly defined as a list of disjoint subsets, or implicitly defined by an appropriate surjective function. The advantage of this implicit definition is that it immediately generalizes to any type of partition.
Every function f : X \rightarrow Y decomposes the input space X into level sets, f^{-1}(y). By definition, these level sets are not only disjoint but also cover all of X, X = \bigcup_{y \in Y} f^{-1}(y). If f is surjective, then every one of its level sets will also be non-empty, f^{-1}(y) \ne \emptyset for all y \in Y. Consequently, the level sets of every surjective function implicitly define a partition where each cell is indexed by a unique output value.
If the output space Y contains a finite number of points, then the level sets of f define a finite partition (Figure 4). On the other hand, when Y contains a countably infinite number of points, the level sets implicitly define a countable partition even though we cannot exhaustively list every cell in practice. Similarly, if Y contains an uncountably infinite number of points, then the level sets define an uncountable partition (Figure 5).
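The role of surjectivity can be checked directly on a small example. The sketch below assumes a hypothetical function f(x) = x mod 3 on X = \{0, \ldots, 8\}; the helpers `level_sets` and `is_partition` are illustrative names, not anything defined in the text.

```python
from itertools import combinations

def level_sets(f, X, Y):
    """Return {y: f^{-1}(y)} for every y in the output space Y."""
    return {y: frozenset(x for x in X if f(x) == y) for y in Y}

def is_partition(cells, X):
    """Check that the cells are non-empty, mutually disjoint, and cover X."""
    cells = list(cells)
    nonempty = all(len(c) > 0 for c in cells)
    disjoint = all(a.isdisjoint(b) for a, b in combinations(cells, 2))
    covering = frozenset().union(*cells) == frozenset(X)
    return nonempty and disjoint and covering

X = range(9)
f = lambda x: x % 3  # surjective onto {0, 1, 2}
surjective_cells = level_sets(f, X, Y={0, 1, 2})

# Enlarging Y so that f is no longer surjective introduces an empty
# level set, which violates the non-empty requirement.
non_surjective_cells = level_sets(f, X, Y={0, 1, 2, 3})
```

In this sketch `is_partition(surjective_cells.values(), X)` holds, while the enlarged output space fails the check only because of the empty level set at y = 3.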
To demonstrate uncountable partitions, let’s consider a few examples over the space X = \mathbb{R}^{2}. The surjective function \begin{alignat*}{6} f :\; &\mathbb{R}^{2}& &\rightarrow& \; &\mathbb{R}& \\ &(x_{1}, x_{2})& &\mapsto& &x_{1}& \end{alignat*} implicitly defines a partition that decomposes X into an uncountable number of real lines, each of which can be visualized by a vertical line (Figure 6 (a)). Similarly, the surjective function \begin{alignat*}{6} f :\; &\mathbb{R}^{2}& &\rightarrow& \; &\mathbb{R}^{+}& \\ &(x_{1}, x_{2})& &\mapsto& &r = \sqrt{x_{1}^{2} + x_{2}^{2}}& \end{alignat*} implicitly defines a partition that decomposes X into an uncountable number of concentric circles, each with a fixed radius (Figure 6 (b)).
Partitions comprised of measurable subsets are particularly important in probability theory. When the cells of a partition \mathcal{P} are all \mathcal{X}-measurable, the partition itself becomes a subset of the defining \sigma-algebra, \mathcal{P} \subset \mathcal{X}. Accordingly, these partitions are referred to as \mathcal{X}-measurable partitions, or simply measurable partitions when the relevant \sigma-algebra is unambiguous.
Even if a surjective function is measurable, it may not define a measurable partition. Only if the output space is equipped with a \sigma-algebra \mathcal{Y} that includes all of the atomic subsets, \{ y \} \in \mathcal{Y} for all y \in Y, will the level sets of a measurable function f : (X, \mathcal{X}) \rightarrow (Y, \mathcal{Y}) always be \mathcal{X}-measurable subsets of the input space, f^{-1}( y ) = f^{*}( \{ y \} ) \in \mathcal{X}. If \mathcal{Y} does contain all of the atomic subsets, however, then every surjective and (\mathcal{X}, \mathcal{Y})-measurable function will define measurable level sets, and hence a measurable partition.
A \sigma-algebra that contains all of the atomic subsets in the ambient space is known as a Hausdorff \sigma-algebra (Leão Jr, Fragoso, and Ruffino 2004). Similarly, a space paired with a Hausdorff \sigma-algebra is known as a Hausdorff measurable space.
Fortunately, all but the most pathological \sigma-algebras are Hausdorff. In practice, we can pretty safely assume that every \sigma-algebra we will encounter will satisfy this property. Because all of the functions that we will work with in practice will be measurable, we can also safely assume that the partitions implicitly defined by any surjective function will be measurable.
2 Conditioning on Countable And Explicit Partitions
Whether defined explicitly or implicitly, a partition decomposes a space X into a collection of non-empty, non-overlapping subsets. This spatial decomposition then provides the basis for decomposing probability distributions over X into a collection of probability distributions confined to those subsets. Before tackling the full generality of this procedure, we will first build up intuition in the simplest case of countable partitions explicitly defined as lists of subsets.
2.1 The Law of Total Probability
As we saw in Chapter 4, Kolmogorov’s axioms define a probability distribution as a consistent allocation of probability over measurable subsets. In order to decompose a probability distribution, we need to be able to decompose measurable subsets and the probabilities allocated to them.
Any measurable set \mathsf{x} \in \mathcal{X} can be immediately decomposed into its intersections with the cells of a given partition \mathcal{P} (Figure 7 (b)), \mathsf{x} = \bigcup_{\mathsf{c} \in \mathcal{P}} \left( \mathsf{x} \cap \mathsf{c} \right). Because the partition cells are mutually disjoint, these intersections will also be mutually disjoint. If \mathsf{c}_{1} \in \mathcal{P} and \mathsf{c}_{2} \in \mathcal{P} are two distinct partition cells, then \begin{align*} ( \mathsf{x} \cap \mathsf{c}_{1} ) \cap ( \mathsf{x} \cap \mathsf{c}_{2} ) &= ( \mathsf{x} \cap \mathsf{c}_{1} ) \cap ( \mathsf{c}_{2} \cap \mathsf{x} ) \\ &= \mathsf{x} \cap ( \mathsf{c}_{1} \cap \mathsf{c}_{2} ) \cap \mathsf{x} \\ &= \mathsf{x} \cap \emptyset \cap \mathsf{x} \\ &= \emptyset. \end{align*}
If the partition \mathcal{P} is countable, then any measurable subset \mathsf{x} \in \mathcal{X} will decompose into a countable number of components. Moreover, each of these components will also be measurable whenever the partition is measurable, because \sigma-algebras are closed under intersections.
To summarize, the intersections of a measurable subset with the cells of a countable, measurable partition define a countable collection of disjoint, measurable subsets. This is a ripe opportunity to apply the countable additivity of probability distributions (Figure 7 (c)), \begin{align*} \pi(\mathsf{x}) &= \pi \left( \bigcup_{\mathsf{c} \in \mathcal{P}} \left( \mathsf{x} \cap \mathsf{c} \right) \right) \\ &= \sum_{\mathsf{c} \in \mathcal{P}} \pi( \mathsf{x} \cap \mathsf{c} ). \end{align*}
In words, the probability allocated to \mathsf{x} \in \mathcal{X} decomposes into a sum of probabilities allocated to the partition intersections. This decomposition of probability allocations is referred to as the law of total probability.
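The law of total probability is easy to verify numerically on a finite space. The sketch below assumes a hypothetical distribution on X = \{0, \ldots, 5\} with probabilities proportional to x + 1, along with an arbitrary three-cell partition; exact rational arithmetic avoids any floating-point ambiguity.

```python
from fractions import Fraction

# Hypothetical distribution on X = {0, ..., 5}: probabilities proportional
# to x + 1, normalized by 1 + 2 + ... + 6 = 21.
X = range(6)
weights = {x: Fraction(x + 1, 21) for x in X}

def pi(subset):
    """Probability allocated to a subset of X."""
    return sum((weights[x] for x in subset), Fraction(0))

# Hypothetical partition and measurable subset.
cells = [frozenset({0, 1}), frozenset({2}), frozenset({3, 4, 5})]
x_set = frozenset({1, 2, 3})

# Decompose x_set into its intersections with the cells, then sum the
# probabilities allocated to the disjoint pieces.
pieces = [x_set & c for c in cells]
total = sum((pi(p) for p in pieces), Fraction(0))
```

Here `total` agrees with `pi(x_set)`, as the law requires.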
2.2 Conditional Probabilities
Now that we can decompose the probabilities allocated to individual measurable subsets, we can consider how to decompose entire probability distributions. To make our first steps towards this decomposition more manageable, let’s begin with a simplifying restriction on the partition.
Partitions whose cells are all allocated non-zero probabilities allow us to multiply and divide by the cell probabilities without fear of zeros. I will refer to a partition where every cell is not only measurable but also allocated a non-zero probability, \pi(\mathsf{c}) > 0 for all \mathsf{c} \in \mathcal{P}, as a \pi-non-null partition.
Because each \pi(\mathsf{c}) is non-zero, we can rearrange each term in the law of total probability to give \begin{align*} \pi( \mathsf{x} ) &= \sum_{\mathsf{c} \in \mathcal{P}} \pi( \mathsf{x} \cap \mathsf{c} ) \\ &= \sum_{\mathsf{c} \in \mathcal{P}} \pi( \mathsf{x} \cap \mathsf{c} ) \cdot \frac{ \pi( \mathsf{c} ) }{ \pi( \mathsf{c} ) } \\ &= \sum_{\mathsf{c} \in \mathcal{P}} \frac{ \pi(\mathsf{x} \cap \mathsf{c}) }{ \pi ( \mathsf{c} ) } \cdot \pi( \mathsf{c} ) \\ &\equiv \sum_{\mathsf{c} \in \mathcal{P}} \pi^{\mathcal{P}}( \mathsf{x} \mid \mathsf{c} ) \cdot \pi( \mathsf{c} ). \end{align*}
Each conditional probability \pi^{\mathcal{P}}( \mathsf{x} \mid \mathsf{c} ) = \frac{ \pi(\mathsf{x} \cap \mathsf{c}) }{ \pi (\mathsf{c}) } quantifies the proportion of the probability allocated to the intersection of \mathsf{x} and the conditioning partition cell, \pi(\mathsf{x} \cap \mathsf{c}), relative to the total probability allocated to the conditioning partition cell, \pi( \mathsf{c} ) (Figure 8).
By definition, a measurable subset \mathsf{x} \in \mathcal{X} that doesn’t overlap with the conditioning partition cell \mathsf{c} is allocated zero conditional probability, \begin{align*} \pi^{\mathcal{P}}( \mathsf{x} \mid \mathsf{c} ) &= \frac{ \pi(\mathsf{x} \cap \mathsf{c}) }{ \pi (\mathsf{c}) } \\ &= \frac{ \pi(\emptyset) }{ \pi (\mathsf{c}) } \\ &= \frac{ 0 }{ \pi (\mathsf{c}) } \\ &= 0. \end{align*} At the same time, any measurable subset that completely overlaps with the conditioning partition cell, \mathsf{x} \cap \mathsf{c} = \mathsf{c}, is allocated full conditional probability, \begin{align*} \pi^{\mathcal{P}}( \mathsf{x} \mid \mathsf{c} ) &= \frac{ \pi(\mathsf{x} \cap \mathsf{c}) }{ \pi (\mathsf{c}) } \\ &= \frac{ \pi(\mathsf{c}) }{ \pi (\mathsf{c}) } \\ &= 1. \end{align*}
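These three cases (partial overlap, no overlap, and complete overlap) can be illustrated with a short sketch. The distribution and cell below are hypothetical, and `cond_prob` is an illustrative helper name.

```python
from fractions import Fraction

# Hypothetical distribution on {0, ..., 5} with probabilities (x + 1) / 21.
weights = {x: Fraction(x + 1, 21) for x in range(6)}

def pi(subset):
    """Probability allocated to a subset."""
    return sum((weights[x] for x in subset), Fraction(0))

def cond_prob(x_set, cell):
    """pi(x | c) = pi(x intersect c) / pi(c), defined only when pi(c) > 0."""
    p_cell = pi(cell)
    if p_cell == 0:
        raise ValueError("conditioning cell has zero probability")
    return pi(frozenset(x_set) & frozenset(cell)) / p_cell

cell = frozenset({3, 4, 5})
partial_overlap = cond_prob({1, 2, 3}, cell)   # meets the cell only at 3
no_overlap = cond_prob({0, 1}, cell)           # disjoint from the cell
full_overlap = cond_prob({2, 3, 4, 5}, cell)   # contains the whole cell
```

The explicit check on `pi(cell)` mirrors the \pi-non-null assumption: the ratio is simply undefined for a cell that is allocated zero probability.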
Conditional probabilities look suspiciously like probability allocations that have been restricted to the domain of the conditioning partition cell. With a little more work we can show that this suspicion is in fact correct.
2.3 Conditional Probability Distributions Over The Ambient Space
Given a measurable subset \mathsf{x} \in \mathcal{X} and a measurable, \pi-non-null partition cell \mathsf{c} \in \mathcal{P}, we can construct a single conditional probability \pi^{\mathcal{P}}( \mathsf{x} \mid \mathsf{c}). The collection of all conditional probabilities relative to a particular partition cell \mathsf{c} defines a function from measurable subsets to conditional probabilities, \begin{alignat*}{6} \pi^{\mathcal{P}}_{\mathsf{c}} :\; &\mathcal{X}& &\rightarrow& \; &[0, 1]& \\ &\mathsf{x} & &\mapsto& &\frac{ \pi(\mathsf{x} \cap \mathsf{c}) }{ \pi (\mathsf{c}) }&. \end{alignat*}
The immediate question is whether or not this function defines a probability distribution. To answer that question, we’ll have to consider the Kolmogorov axioms.
We begin with the first Kolmogorov axiom, which requires a function that maps measurable subsets into probabilities. This matches the input and output spaces of \pi^{\mathcal{P}}_{\mathsf{c}}, so we’re good.
In order to satisfy the second Kolmogorov axiom, the probability allocated to the entire ambient set must be one. Indeed, \begin{align*} \pi^{\mathcal{P}}_{\mathsf{c}}( X ) &= \frac{ \pi(X \cap \mathsf{c}) }{ \pi(\mathsf{c}) } \\ &= \frac{ \pi(\mathsf{c}) }{ \pi(\mathsf{c}) } \\ &= 1. \end{align*}
Finally, we need \pi^{\mathcal{P}}_{\mathsf{c}} to satisfy countable additivity. For any countable collection of measurable and mutually disjoint subsets \{ \mathsf{x}_{1}, \ldots, \mathsf{x}_{j}, \ldots \}, we have \begin{align*} \pi^{\mathcal{P}}_{\mathsf{c}}( \cup_{j} \mathsf{x}_{j} ) &= \frac{ \pi( \, (\cup_{j} \mathsf{x}_{j}) \, \cap \mathsf{c}) } { \pi(\mathsf{c}) } \\ &= \frac{ \pi( \cup_{j} (\mathsf{x}_{j} \cap \mathsf{c}) ) } { \pi(\mathsf{c}) } \\ &= \frac{ \sum_{j} \pi( \mathsf{x}_{j} \cap \mathsf{c} ) } { \pi(\mathsf{c}) } \\ &= \sum_{j} \frac{ \pi( \mathsf{x}_{j} \cap \mathsf{c} ) } { \pi(\mathsf{c}) } \\ &= \sum_{j} \pi^{\mathcal{P}}_{\mathsf{c}}( \mathsf{x}_{j} ), \end{align*} as needed.
With all three Kolmogorov axioms verified, we can now formally state that, for any partition cell \mathsf{c}, the mapping defined by \pi^{\mathcal{P}}_{\mathsf{c}}( \mathsf{x} ) = \frac{ \pi( \mathsf{x} \cap \mathsf{c}) }{ \pi (\mathsf{c}) } is a probability distribution over the ambient space X. Formally, we say that \pi^{\mathcal{P}}_{\mathsf{c}} is a conditional probability distribution.
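On a finite space the axioms can also be checked by brute force. The sketch below assumes a hypothetical four-point space where every subset is measurable, so that countable additivity reduces to additivity over pairs of disjoint subsets; the names `conditional` and `powerset` are illustrative.

```python
from fractions import Fraction
from itertools import chain, combinations

# Hypothetical four-point space with every subset treated as measurable.
X = frozenset(range(4))
weights = {0: Fraction(1, 10), 1: Fraction(2, 10),
           2: Fraction(3, 10), 3: Fraction(4, 10)}

def pi(subset):
    """Probability allocated to a subset."""
    return sum((weights[x] for x in subset), Fraction(0))

def conditional(cell):
    """Return the set function x -> pi(x intersect c) / pi(c) for a non-null cell."""
    p_cell = pi(cell)
    return lambda subset: pi(frozenset(subset) & cell) / p_cell

def powerset(s):
    """All subsets of s, as frozensets."""
    s = list(s)
    subsets = chain.from_iterable(combinations(s, r) for r in range(len(s) + 1))
    return [frozenset(c) for c in subsets]

pi_c = conditional(frozenset({1, 2}))
subsets = powerset(X)

# Outputs are valid probabilities, the full space gets probability one,
# and probabilities add over disjoint subsets.
in_unit_interval = all(0 <= pi_c(a) <= 1 for a in subsets)
normalized = pi_c(X) == 1
additive = all(
    pi_c(a | b) == pi_c(a) + pi_c(b)
    for a in subsets for b in subsets if a.isdisjoint(b)
)
```

This exhaustive check is only feasible because the space is tiny; the derivation above is what establishes the result in general.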
2.4 Conditional Probability Distributions Over Partition Cells
An important feature of conditional probability distributions is that their allocations are relatively singular.
Recall that any measurable subset that doesn’t intersect with the conditioning partition cell is always allocated zero probability, \begin{align*} \pi^{\mathcal{P}}_{\mathsf{c}}( \mathsf{x} ) &= \pi^{\mathcal{P}}( \mathsf{x} \mid \mathsf{c} ) \\ &= \frac{ \pi(\mathsf{x} \cap \mathsf{c}) }{ \pi(\mathsf{c}) } \\ &= \frac{ \pi(\emptyset) }{ \pi(\mathsf{c}) } \\ &= 0. \end{align*} All of the conditional probability concentrates within the conditioning partition cell itself, \begin{align*} \pi^{\mathcal{P}}_{\mathsf{c}}( \mathsf{c} ) &= \pi^{\mathcal{P}}( \mathsf{c} \mid \mathsf{c} ) \\ &= \frac{ \pi(\mathsf{c} \cap \mathsf{c}) }{ \pi(\mathsf{c}) } \\ &= \frac{ \pi(\mathsf{c}) }{ \pi(\mathsf{c}) } \\ &= 1. \end{align*}
Intuitively, this suggests that we can interpret a conditional probability distribution as a restriction of the initial probability distribution to a particular partition cell. To formalize this intuition, however, we have to define what it means to restrict not only the elements of (X, \mathcal{X}) to a partition cell but also the measurable subsets.
Taking the intersection of any subset \mathsf{x} \subset X with a partition cell \mathsf{c} \subset X gives a subset whose elements are entirely contained within the partition cell, \mathsf{x} \cap \mathsf{c} \subset \mathsf{c}. Moreover, intersecting an entire collection of subsets \{ \mathsf{x}_{1}, \ldots, \mathsf{x}_{j}, \ldots \} \subset 2^{X} with \mathsf{c} gives a collection of subsets that are all contained within the partition cell, \{ \mathsf{x}_{1} \cap \mathsf{c}, \ldots, \mathsf{x}_{j} \cap \mathsf{c}, \ldots \} \subset 2^{\mathsf{c}}.
When the partition cell is itself an element of that initial subset collection, this restriction respects the subset operations. Formally, if \mathcal{X} \subset 2^{X} is a collection of subsets of X that contains \mathsf{c} and is closed under complements, countable unions, and countable intersections, then the collection of intersections \mathcal{X}_{\mathsf{c}} = \{ \mathsf{x} \cap \mathsf{c} \text{ for all } \mathsf{x} \in \mathcal{X} \} \subset 2^{\mathsf{c}} will also be a collection of subsets of \mathsf{c} that is closed under complements relative to \mathsf{c}, countable unions, and countable intersections. In other words, if \mathcal{X} is a \sigma-algebra over X that contains \mathsf{c}, then \mathcal{X}_{\mathsf{c}} will be a \sigma-algebra over \mathsf{c}. This restricted \sigma-algebra is known as a subspace \sigma-algebra.
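A small finite example can make the subspace construction concrete. The sketch below assumes a hypothetical \sigma-algebra on a four-point space generated by three atoms; on a finite collection, closure under pairwise unions and complements is enough to imply the countable closure properties.

```python
from itertools import combinations

X = frozenset(range(4))

# Hypothetical sigma-algebra on X: all unions of the atoms {0, 1}, {2}, {3}.
atoms = [frozenset({0, 1}), frozenset({2}), frozenset({3})]
sigma_X = {
    frozenset().union(*combo)
    for r in range(len(atoms) + 1)
    for combo in combinations(atoms, r)
}

# A measurable cell, and the collection of intersections with that cell.
c = frozenset({0, 1, 2})
sigma_c = {a & c for a in sigma_X}

def is_sigma_algebra(collection, space):
    """Finite check: contains the space, closed under complements and unions."""
    has_space = space in collection
    complements = all(space - a in collection for a in collection)
    unions = all(a | b in collection for a in collection for b in collection)
    return has_space and complements and unions
```

Note that complements in `sigma_c` are taken relative to `c` rather than `X`. Because `c` is itself measurable, every element of `sigma_c` is also an element of `sigma_X`.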
By construction, every measurable subset in a restricted \sigma-algebra \mathcal{X}_{\mathsf{c}} is also a measurable subset in the ambient \sigma-algebra \mathcal{X}. Consequently, the probabilities defined by a conditional probability distribution over X also define probabilities over \mathsf{c}. This allows us to define a new function \begin{alignat*}{6} \pi^{\mathcal{P}}_{\mathsf{c}} :\; &\mathcal{X}_{\mathsf{c}}& &\rightarrow& \; &[0, 1]& \\ &\mathsf{s}& &\mapsto& &\frac{ \pi(\mathsf{s} \cap \mathsf{c}) }{ \pi (\mathsf{c}) }& \end{alignat*} with \pi^{\mathcal{P}}_{\mathsf{c}}( \mathsf{c} ) = \pi^{\mathcal{P}}( \mathsf{c} \mid \mathsf{c} ) = 1 and \pi^{\mathcal{P}}_{\mathsf{c}}( \cup_{j} \mathsf{s}_{j} ) = \sum_{j} \pi^{\mathcal{P}}_{\mathsf{c}}( \mathsf{s}_{j} ). This, however, is exactly a probability distribution over the partition cell \mathsf{c}!
All of this demonstrates that we have two equally valid interpretations of a conditional probability distribution. First, we can interpret a conditional probability distribution as a probability distribution over the full ambient space X which completely concentrates within a conditioning partition cell \mathsf{c} \in \mathcal{P} (Figure 9 (a)). Alternatively, we can interpret a conditional probability distribution as a probability distribution over just the conditioning partition cell (Figure 9 (b)).
The former interpretation is more common in technical mathematics. As we will see in Chapter 9, however, the latter interpretation is more in line with how conditional probability distributions are interpreted and used in more practical applications of probability theory.
2.5 Conditional Probability Kernels
We can push our organization of conditional probabilities one step further by collecting all of the conditional probability distributions for all of the partition cells into a single mathematical object (Figure 10), \begin{alignat*}{6} \pi^{\mathcal{P}}( \cdot \mid \cdot ) :\; &\mathcal{X} \times \mathcal{P}& &\rightarrow& \; &[0, 1]& \\ &\mathsf{x}, \mathsf{c}& &\mapsto& &\frac{ \pi(\mathsf{x} \cap \mathsf{c}) }{ \pi (\mathsf{c}) }&. \end{alignat*} I will refer to this binary function as a conditional probability kernel.
Partially evaluating a conditional probability kernel on a measurable subset in its first argument results in a measurable, unary function from each partition cell to the corresponding conditional probability, \begin{alignat*}{6} p_{\mathsf{x}} = \pi^{\mathcal{P}}( \mathsf{x} \mid \cdot ) :\; &\mathcal{P}& &\rightarrow& \; &[0, 1]& \\ &\mathsf{c}& &\mapsto& &\frac{ \pi(\mathsf{x} \cap \mathsf{c}) }{ \pi (\mathsf{c}) }&. \end{alignat*} In words, this partial evaluation quantifies what proportion of the probability allocated to each partition cell is contributed by its intersection with \mathsf{x}. I will refer to this object as a conditional probability function.
On the other hand, partially evaluating a conditional probability kernel on a partition cell in its second argument gives the corresponding conditional probability distribution, \begin{alignat*}{6} \pi^{\mathcal{P}}_{\mathsf{c}} = \pi^{\mathcal{P}}( \cdot \mid \mathsf{c} ) :\; &\mathcal{X}& &\rightarrow& \; &[0, 1]& \\ &\mathsf{x}& &\mapsto& &\frac{ \pi(\mathsf{x} \cap \mathsf{c}) }{ \pi (\mathsf{c}) }&. \end{alignat*}
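The two partial evaluations map naturally onto partial function application. The sketch below uses Python's `functools.partial` on a hypothetical finite example; the function name `kernel` and the particular distribution and cells are illustrative assumptions.

```python
from fractions import Fraction
from functools import partial

# Hypothetical distribution on {0, ..., 5} and a three-cell partition.
weights = {x: Fraction(x + 1, 21) for x in range(6)}
cells = [frozenset({0, 1}), frozenset({2}), frozenset({3, 4, 5})]

def pi(subset):
    """Probability allocated to a subset."""
    return sum((weights[x] for x in subset), Fraction(0))

def kernel(x_set, cell):
    """Conditional probability kernel: (x, c) -> pi(x intersect c) / pi(c)."""
    return pi(frozenset(x_set) & cell) / pi(cell)

# Fixing the first argument gives a conditional probability function
# that maps partition cells to probabilities.
p_x = partial(kernel, frozenset({1, 2, 3}))
values_over_cells = [p_x(c) for c in cells]

# Fixing the second argument gives a conditional probability distribution
# that maps measurable subsets to probabilities.
pi_c = partial(kernel, cell=cells[2])
```

Only `pi_c` is a probability distribution; the values of `p_x` across cells need not sum to one.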
As is so often the case, we have to be careful with the terminology here. I have used “conditional probability distribution” to refer to a particular probability distribution associated with a particular partition cell, and “conditional probability kernel” to refer to the collection of all probability distributions defined by all of the cells in a partition. This convention, however, is by no means universal. Many references use “conditional probability distribution” to refer to the collection of probability distributions \pi^{\mathcal{P}} instead of a particular probability distribution \pi^{\mathcal{P}}_{\mathsf{c}}, and some even use it to refer to both at the same time! Needless to say, this latter overloaded and ambiguous terminology makes it very easy to confuse the two objects.
Again, I will use “conditional probability distribution” and “conditional probability kernel” to avoid as much ambiguity in this book as possible. When reading other texts, however, you will want to be careful to identify to which object an author is referring at any given time. Moreover, in your own writing there is no harm in being redundant and clarifying whether you are referring to \pi^{\mathcal{P}}( \cdot \mid \cdot ) or \pi^{\mathcal{P}}( \cdot \mid \mathsf{c} ) in any given application.
2.6 The Law of Total Expectation
One of the benefits of conditional probability kernels is that they allow us to rewrite the law of total probability, \pi( \mathsf{x} ) = \sum_{ \mathsf{c} \in \mathcal{P} } \pi^{\mathcal{P}}( \mathsf{x} \mid \mathsf{c} ) \, \pi( \mathsf{c} ) entirely in terms of expectation values.
If we write conditional probabilities as the outputs of a conditional probability function, \pi^{\mathcal{P}}( \mathsf{x} \mid \mathsf{c} ) = p_{\mathsf{x}}( \mathsf{c} ), then the law of total probability becomes a discrete expectation value, \begin{align*} \pi( \mathsf{x} ) &= \sum_{ \mathsf{c} \in \mathcal{P} } \pi^{\mathcal{P}}( \mathsf{x} \mid \mathsf{c} ) \, \pi( \mathsf{c} ) \\ &= \sum_{ \mathsf{c} \in \mathcal{P} } p_{\mathsf{x}}( \mathsf{c} ) \, \pi( \mathsf{c} ) \\ &= \mathbb{E}_{ \pi_{\mathcal{P}} } [ p_{\mathsf{x}} ]. \end{align*} Here the probability distribution \pi_{\mathcal{P}} is defined by the probability allocated to each partition cell, \pi_{\mathcal{P}}( \mathsf{c} ) = \pi( \mathsf{c} ).
At the same time, both the initial allocation and conditional probability function can be written in terms of expectation values of indicator functions, \pi( \mathsf{x} ) = \mathbb{E}_{\pi}[ I_{\mathsf{x}} ] and p_{\mathsf{x}}(\mathsf{c}) = \mathbb{E}_{\pi^{\mathcal{P}}_{\mathsf{c}} }[ I_{\mathsf{x}} ], respectively. Consequently, we can write the law of total probability as \begin{align*} \pi( \mathsf{x} ) &= \mathbb{E}_{ \pi_{\mathcal{P}} } [ p_{\mathsf{x}} ] \\ \mathbb{E}_{\pi}[ I_{\mathsf{x}} ] &= \mathbb{E}_{ \pi_{\mathcal{P}} } [ p_{\mathsf{x}} ], \end{align*} with p_{\mathsf{x}}(\mathsf{c}) = \mathbb{E}_{\pi^{\mathcal{P}}_{\mathsf{c}} }[ I_{\mathsf{x}} ].
Conveniently, this relationship between the two notions of expectation defined by an initial probability distribution \pi and a measurable partition \mathcal{P} generalizes to arbitrary expectation values. The expectation value of any function g : (X, \mathcal{X}) \rightarrow (\mathbb{R}, \mathcal{B}_{\mathbb{R}}) can be written as a nested expectation, \mathbb{E}_{\pi}[ g ] = \mathbb{E}_{ \pi_{\mathcal{P}} } [ e_{g} ], where e_{g}(\mathsf{c}) = \mathbb{E}_{ \pi^{\mathcal{P}}_{\mathsf{c}} }[ g ]. Here the inner expectations e_{g} are known as conditional expectation values. The overall equality is known as the law of total expectation or the law of iterated expectation.
With the laws of total probability and total expectation in hand, we can decompose not only probability allocations but also expectation values along an explicit partition. If expectation values with respect to the initial probability distribution are difficult to compute but the conditional expectation values are more straightforward to work out, then this iterative approach becomes a particularly-productive computational technique.
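The nested structure of the law of total expectation can be checked on a finite example. The sketch below assumes a hypothetical distribution, partition, and function g(x) = x^{2}; `conditional_expectation` is an illustrative helper name.

```python
from fractions import Fraction

# Hypothetical distribution on {0, ..., 5}, partition, and function g.
weights = {x: Fraction(x + 1, 21) for x in range(6)}
cells = [frozenset({0, 1}), frozenset({2}), frozenset({3, 4, 5})]
g = lambda x: x * x

def pi(subset):
    """Probability allocated to a subset."""
    return sum((weights[x] for x in subset), Fraction(0))

def conditional_expectation(g, cell):
    """e_g(c): expectation of g under the conditional distribution for cell c."""
    return sum((g(x) * weights[x] / pi(cell) for x in cell), Fraction(0))

# Direct expectation with respect to the initial distribution.
direct = sum((g(x) * weights[x] for x in weights), Fraction(0))

# Nested expectation: average the conditional expectations over the cells,
# weighting each by the probability allocated to that cell.
nested = sum((conditional_expectation(g, c) * pi(c) for c in cells), Fraction(0))
```

Both routes give the same value, which is the content of the law.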
3 Conditioning On Implicit Partitions
The construction, and notation, of conditional probability distributions becomes particularly elegant when we implicitly define countable partitions through the level sets of a surjective function. This also paves the way for generalizing conditional probability theory to uncountable partitions implicitly defined by functions with an uncountable number of output points.
3.1 Conditioning On Countable Implicit Partitions
In Section 1, we learned that surjective functions f : X \rightarrow Y implicitly define a partition of the input space X where the partition cells are defined by the non-empty level sets f^{-1}(y) \subset X. If f is also (\mathcal{X}, \mathcal{Y})-measurable and \mathcal{Y} is a Hausdorff \sigma-algebra, then these non-empty level sets will also be \mathcal{X}-measurable, allowing us to consistently allocate probability to them.
When Y contains a countable number of elements, the partition defined by these level sets will be countable. Moreover, if the probability allocated to each level set is non-zero, \pi( f^{-1}(y) ) > 0 for all y \in Y, then the partition will be \pi-non-null. In this case we can directly apply the conditional probability theory that we introduced in Section 2.
When working with partitions implicitly defined by a surjective function f : (X, \mathcal{X}) \rightarrow (Y, \mathcal{Y}), I will denote the conditional probability kernel as \begin{alignat*}{6} \pi^{f} :\; &\mathcal{X} \times Y& &\rightarrow& \; &[0, 1]& \\ &\mathsf{x}, y& &\mapsto& &\pi^{f} ( \mathsf{x} \mid y ) = \frac{ \pi(\mathsf{x} \cap f^{-1}(y)) }{ \pi(f^{-1}(y)) } &. \end{alignat*}
For each \mathsf{x} \in \mathcal{X}, the partial evaluation p_{\mathsf{x}} = \pi^{f} ( \mathsf{x} \mid \cdot ) : Y \rightarrow [0, 1] defines a \mathcal{Y}-measurable conditional probability function, and for each y \in Y the partial evaluation \begin{alignat*}{6} \pi^{f}_{y} = \pi^{f} ( \cdot \mid y ) :\; &\mathcal{X}& &\rightarrow& \; &[0, 1]& \\ &\mathsf{x}& &\mapsto& &\pi^{f} ( \mathsf{x} \mid y ) & \end{alignat*} defines a conditional probability distribution that concentrates entirely on the corresponding level set, \begin{align*} \pi^{f}_{y}( f^{-1}(y) ) &= \pi^{f}( f^{-1}(y) \mid y ) \\ &= \frac{ \pi(f^{-1}(y) \cap f^{-1}(y)) }{ \pi(f^{-1}(y)) } \\ &= \frac{ \pi(f^{-1}(y) ) }{ \pi(f^{-1}(y)) } \\ &= 1. \end{align*}
Equivalently, we can interpret each \pi^{f}_{y} as a probability distribution over not the entire ambient space X but rather just the corresponding level set, \pi^{f}_{y} : \mathcal{X}_{y} \rightarrow [0, 1], where \mathcal{X}_{y} denotes the subspace \sigma-algebra restricted to the level set f^{-1}(y). Again, we have the flexibility to interpret the conditional probability distributions induced by f as a collection of probability distributions over X that each concentrate on a particular level set, or as a collection of probability distributions over each level set.
Because we have assumed that f is (\mathcal{X}, \mathcal{Y})-measurable, we can also push \pi forward along f to define a marginal probability distribution f_{*} \pi over the output space. By definition, the pushforward probability allocated to the output atomic subset \{ y \} \in \mathcal{Y} is equal to the input probability allocated to the corresponding level set, f_{*} \pi( \{ y \} ) = \pi ( f^{-1}(y) ). This means that a surjective function f : (X, \mathcal{X}) \rightarrow (Y, \mathcal{Y}) induces a \pi-non-null partition if and only if the pushforward probability allocated to every output atomic subset is non-zero, f_{*} \pi( \{ y \} ) > 0 for all y \in Y.
For the countable and measurable partition implicitly defined by a sufficiently nice surjective function, the law of total probability becomes \begin{align*} \pi( \mathsf{x} ) &= \sum_{ y \in Y } \pi^{f}( \mathsf{x} \mid y ) \, \pi(f^{-1}(y)) \\ &= \sum_{ y \in Y } \pi^{f}( \mathsf{x} \mid y ) \, f_{*} \pi( \{ y \} ). \end{align*} This, however, is just a pushforward expectation value, \begin{align*} \pi( \mathsf{x} ) &= \sum_{ y \in Y } \pi^{f}( \mathsf{x} \mid y ) \, f_{*} \pi( \{ y \} ) \\ &= \mathbb{E}_{f_{*}\pi} [ p_{\mathsf{x}} ], \end{align*} where \begin{alignat*}{6} p_{\mathsf{x}} :\; &Y& &\rightarrow& \; &[0, 1]& \\ &y& &\mapsto& &\pi^{f}( \mathsf{x} \mid y ) = \frac{ \pi(\mathsf{x} \cap f^{-1}(y)) }{ \pi(f^{-1}(y)) }& \end{alignat*} is a conditional probability function.
Similarly, when the input space X is itself countable, the law of total expectation becomes \begin{align*} \mathbb{E}_{\pi}[ g ] &= \sum_{ x \in X } \pi( \{ x \} ) \, g(x) \\ &= \sum_{ x \in X } \left[ \sum_{ y \in Y } \pi^{f}( \{ x \} \mid y ) \, f_{*} \pi( \{ y \} ) \right] \, g(x) \\ &= \sum_{ y \in Y } \left[ \sum_{ x \in X } \pi^{f}( \{ x \} \mid y ) \, g(x) \right] \, f_{*} \pi( \{ y \} ) \\ &= \sum_{ y \in Y } e_{g}(y) \, f_{*} \pi( \{ y \} ) \\ &= \mathbb{E}_{ f_{*} \pi } [ e_{g} ], \end{align*} where e_{g} is the conditional expectation function \begin{alignat*}{6} e_{g} :\; &Y& &\rightarrow& \; &\mathbb{R}& \\ &y& &\mapsto& & \mathbb{E}_{\pi^{f}_{y} }[ g ] &. \end{alignat*}
Let’s take a breath and review. A sufficiently-nice surjective function gives us two ways to manipulate a probability distribution over the input space: we can not only push it forward to a probability distribution on the output space but also decompose it into a collection of probability distributions across the level sets. Moreover, the laws of total probability and total expectation show us that these operations complement each other in the sense that we can always recover any information about the initial probability distribution by combining the information from the pushforward and conditional probability distributions.
The pushforward probability distribution quantifies how much input probability is allocated to each level set, while each conditional probability distribution quantifies how those allocations are distributed across the corresponding level set (Figure 11). In other words, conditional probability kernels encode all of the information that we lose when pushing a probability distribution forward along a surjective function!
Conditioning on implicit partitions suggests a variety of terminologies. We might, for example, say that we’re conditioning the initial probability distribution \pi on a surjective function f : X \rightarrow Y. At the same time, we might say that we’re conditioning \pi on the output points y \in Y, the level set defined by each point, f^{-1}(y) \subset X, or even the subspace \sigma-algebra within that level set.
All of this language, however, is just short-hand for conditioning on the partition implied by all of these intermediate objects. For example, we condition on an output value y only in the sense that it defines a subset on which we can restrict the initial probability distribution.
3.2 Conditioning On General Implicit Partitions
Up to this point, we have been able to define conditional probability distributions for measurable, \pi-non-null, and countable partitions that are defined either explicitly as a list of subsets or implicitly as the level sets of a surjective function. Unfortunately, this construction doesn’t immediately generalize to the continuous spaces that dominate practical applications.
Consider, for example, a surjective function f: (X, \mathcal{X}) \rightarrow (Y, \mathcal{Y}) where both the input space X and output space Y are continuous spaces with an uncountably infinite number of elements. This function implicitly defines a partition of the input space X into an uncountably infinite number of level sets f^{-1}(y).
As in the countable case, we can decompose any measurable subset \mathsf{x} \in \mathcal{X} into its intersections with these level sets (Figure 12), \mathsf{x} = \bigcup_{y \in Y} \left( \mathsf{x} \cap f^{-1}(y) \right). Because there are an uncountably infinite number of intersections, however, we cannot write \pi(\mathsf{x}) as a sum over the intersection probabilities, \pi( \mathsf{x} ) = \pi \left( \bigcup_{y \in Y} \left( \mathsf{x} \cap f^{-1}(y) \right) \right) \ne \sum_{y \in Y} \pi( \mathsf{x} \cap f^{-1}(y) ). Remember that probability distributions are defined to have countable additivity, not uncountable additivity! Consequently, we cannot define a law of total probability as a sum over individual output elements.
At the same time, many probability distributions that we will encounter in practical applications of probability theory will allocate zero probability to either some or all of the level sets of the conditioning function, \pi(f^{-1}(y)) = f_{*} \pi( \{ y \} ) = 0. In this case, any attempt to directly define general conditional probabilities by the ratio \pi^{f}( \mathsf{x} \mid y ) = \pi^{f}_{y}( \mathsf{x} ) = \frac{ \pi( \mathsf{x} \cap f^{-1}(y) ) }{ \pi( f^{-1}(y) ) } will result in indefinite 0 / 0 outcomes.
Is there any hope for generalizing conditional probability to uncountable partitions? Fortunately, the answer is yes.
While we cannot sum over the individual level set probabilities, we can define expectations over them. This suggests generalizing the expectation form of the law of total probability, \pi( \mathsf{x} ) = \mathbb{E}_{f_{*}\pi} [ p_{\mathsf{x}} ], for some appropriate function p_{\mathsf{x}} : Y \rightarrow [0, 1] that we will have to define. Equivalently, we might generalize the law of total expectation, \mathbb{E}_{\pi}[ g ] = \mathbb{E}_{ f_{*} \pi } [ e_{g} ], for some appropriate function e_{g} : Y \rightarrow \mathbb{R}.
To formalize this generalization, consider a surjective function f : (X, \mathcal{X}) \rightarrow (Y, \mathcal{Y}) and a probability distribution \pi : \mathcal{X} \rightarrow [0, 1]. An intricate mathematical analysis (Chang and Pollard (1997); Leão Jr, Fragoso, and Ruffino (2004)) shows that if \pi is sufficiently well-behaved, then it defines not only the pushforward distribution f_{*} \pi : \mathcal{Y} \rightarrow [0, 1], but also a conditional probability kernel \begin{alignat*}{6} \pi^{f} ( \cdot \mid \cdot ) :\; &\mathcal{X} \times Y& &\rightarrow& \; &[0, 1] \subset \mathbb{R}& \\ &\mathsf{x}, y& &\mapsto& &\pi^{f} ( \mathsf{x} \mid y )&. \end{alignat*}
This conditional probability kernel gives a (\mathcal{Y}, \mathcal{B}_{\mathbb{R}})-measurable conditional probability function for any partial evaluation on the first argument, \pi^{f} ( \mathsf{x} \mid \cdot ) : Y \rightarrow [0, 1], and a conditional probability distribution for f_{*} \pi-almost every partial evaluation on the second argument, \begin{alignat*}{6} \pi^{f}_{y} :\; &\mathcal{X}& &\rightarrow& \; &[0, 1]& \\ &\mathsf{x}& &\mapsto& &\pi^{f} ( \mathsf{x} \mid y ) &. \end{alignat*} In particular, the conditional probability distributions entirely concentrate on the corresponding level set, \pi^{f}_{y}( f^{-1}(y) ) = \pi^{f}( f^{-1}(y) \mid y ) = 1, just as in the countable case.
These partial evaluations also satisfy a generalized law of total probability (Figure 13), \pi( \mathsf{x} ) = \mathbb{E}_{f_{*} \pi} [ p_{\mathsf{x}} ], where p_{\mathsf{x}}(y) = \pi^{f}( \mathsf{x} \mid y ) is the conditional probability function. Moreover, they also satisfy a generalized law of total expectation, \mathbb{E}_{\pi}[g] = \mathbb{E}_{f_{*} \pi} [ e_{g} ], where \begin{alignat*}{6} e_{g} :\; &Y& &\rightarrow& \; &\mathbb{R}& \\ &y& &\mapsto& &\mathbb{E}_{\pi^{f}_{y} }[ g ] & \end{alignat*} is the conditional expectation function.
In the case where the output space, and hence the number of level sets, is countable, these expectations reduce to discrete summations, and the general laws of total probability and expectation reduce to our initial laws of total probability and expectation. For example, \begin{align*} \pi( \mathsf{x} ) &= \mathbb{E}_{f_{*} \pi} [ p_{\mathsf{x}} ] \\ &= \sum_{y \in Y} f_{*} \pi( \{ y \} ) \, p_{\mathsf{x}}(y) \\ &= \sum_{y \in Y} \pi( f^{-1}(y) ) \, \pi^{f}(\mathsf{x} \mid y) \\ &= \sum_{y \in Y} \pi( f^{-1}(y) ) \, \frac{ \pi( \mathsf{x} \cap f^{-1}(y) )} { \pi( f^{-1}(y) ) }. \end{align*}
Any conditional probability kernel satisfying these properties is referred to as a disintegration of the probability distribution \pi with respect to f or, far less impressively, a regular conditional probability distribution or regular conditional probability kernel.
Personally, I find names like “regular conditional probability kernel” to be a bit of a mouthful. To streamline the terminology slightly, I will use “conditional probability kernel” to refer to disintegrations generally, and “discrete conditional probability kernel” to refer to the special case of disintegrations with respect to functions that implicitly define countable partitions.
For countable output spaces, a surjective function f and probability distribution \pi define a unique disintegration, and hence a unique discrete conditional probability kernel. More generally, there will be infinitely many disintegrations compatible with a given function and probability distribution pair. The differences between these compatible disintegrations, however, are always confined to \pi-null subsets and, consequently, they all define equivalent probabilities and expectation values.
Disintegrations completely generalize the discrete conditional probability kernels that we derived for countable partitions. We can interpret disintegrations as a collection of probability distributions that each concentrate on a particular level set or, equivalently, a collection of probability distributions defined directly on each level set. Moreover, disintegrations can also be interpreted as complementing the pushforward probability distribution, with the latter determining how much probability is allocated to each level set and the former determining how that total allocation unfurls across each level set.
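The generalized law of total probability can be checked numerically in a simple continuous setting. The sketch below is only an illustration under stated assumptions: it takes \pi to be a bivariate normal distribution over \mathbb{R}^{2} with unit variances and a hypothetical correlation rho = 0.6, conditions on the projection onto the first component, and assumes the familiar Gaussian conditional, x_2 given x_1 = y distributed as normal(rho * y, 1 - rho^2), as one valid disintegration. The subset is the positive quadrant, whose probability is known to be 1/4 + arcsin(rho) / (2 pi). All variable and function names are made up for this example.

```python
import math
import numpy as np

rho = 0.6            # hypothetical correlation, chosen only for illustration
n_samples = 200_000  # number of Monte Carlo samples
rng = np.random.default_rng(1)

def std_normal_cdf(z):
    """Standard normal cumulative distribution function via math.erf."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def p_x(y):
    """Conditional probability of the quadrant {x1 > 0, x2 > 0} given x1 = y.

    Assumes the Gaussian conditional x2 | x1 = y ~ normal(rho * y, 1 - rho^2).
    """
    if y <= 0.0:
        return 0.0
    return std_normal_cdf(rho * y / math.sqrt(1.0 - rho**2))

# The pushforward along the projection onto x1 is a standard normal.
ys = rng.standard_normal(n_samples)

# Generalized law of total probability: pi(x) = E_{f_* pi}[ p_x ].
total_prob = float(np.mean([p_x(y) for y in ys]))

# Direct Monte Carlo estimate from samples of the joint distribution.
x1 = rng.standard_normal(n_samples)
x2 = rho * x1 + math.sqrt(1.0 - rho**2) * rng.standard_normal(n_samples)
direct_prob = float(np.mean((x1 > 0) & (x2 > 0)))

# Closed-form quadrant probability for comparison.
exact_prob = 0.25 + math.asin(rho) / (2.0 * math.pi)

print(total_prob, direct_prob, exact_prob)
```

Every individual level set here is a vertical line with zero probability, so the ratio definition is unavailable, yet averaging the conditional probability function over pushforward samples should still agree with the direct estimate up to Monte Carlo error.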
There is one final technical detail that I have purposefully left ambiguous. Earlier I noted that disintegrations exist not for any probability distribution, but rather only sufficiently “well-behaved” probability distributions. For those interested in exploring these details, disintegrations can be derived only for a special class of measures known as Radon measures. Understanding what Radon measures are, and why they are needed to define disintegrations, goes far beyond the scope of this book.
Fortunately, every probability distribution we will encounter in this book will be a Radon probability distribution – indeed non-Radon probability distributions are mathematically awkward – and we can safely take this condition for granted.
4 Independence
In general, a probability distribution will induce different behavior on different level sets of the conditioning function. The exceptional case, where the conditional behavior is the same for almost all level sets, arises often enough in practical applications to be worthy of its own terminology.
That said, let’s first investigate what happens when conditioning on only a single subset. In particular, consider two measurable subsets \mathsf{x}_{1} \in \mathcal{X} and \mathsf{x}_{2} \in \mathcal{X} that have non-zero overlap with each other, \mathsf{x}_{1} \cap \mathsf{x}_{2} \ne \emptyset, and are both allocated non-zero probability, \pi(\mathsf{x}_{1}) > 0, \pi(\mathsf{x}_{2}) > 0.
The conditional probability of the first subset given the second is, by definition, \pi( \mathsf{x}_{1} \mid \mathsf{x}_{2} ) = \frac{ \pi( \mathsf{x}_{1} \cap \mathsf{x}_{2} ) } { \pi( \mathsf{x}_{2} ) }. In order for the conditioning to have no effect on how probability is allocated to \mathsf{x}_{1}, we need \begin{align*} \pi( \mathsf{x}_{1} ) &= \pi( \mathsf{x}_{1} \mid \mathsf{x}_{2} ) \\ \pi( \mathsf{x}_{1} ) &= \frac{ \pi( \mathsf{x}_{1} \cap \mathsf{x}_{2} ) } { \pi( \mathsf{x}_{2} ) }, \end{align*} or \pi( \mathsf{x}_{1} \cap \mathsf{x}_{2} ) = \pi( \mathsf{x}_{1} ) \cdot \pi( \mathsf{x}_{2} ). When this condition holds we say that the two measurable subsets are independent of each other with respect to the probability distribution \pi.
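As a small worked example of subset independence, consider a hypothetical space of two fair six-sided dice with a uniform probability distribution. The subsets and helper names below are made up for illustration; the sketch simply checks the product condition with exact fractions.

```python
from fractions import Fraction
from itertools import product

# Hypothetical space: ordered outcomes of two fair dice, uniform probabilities.
X = list(product(range(1, 7), repeat=2))

def prob(subset):
    """Uniform probability allocated to a subset of X."""
    return Fraction(len(subset), len(X))

def cond_prob(a, b):
    """Conditional probability of a given b; needs prob(b) > 0."""
    return prob(a & b) / prob(b)

def independent(a, b):
    """Check the product condition prob(a & b) == prob(a) * prob(b)."""
    return prob(a & b) == prob(a) * prob(b)

first_even = {x for x in X if x[0] % 2 == 0}
sum_seven = {x for x in X if sum(x) == 7}
sum_eight = {x for x in X if sum(x) == 8}

# Conditioning on a sum of seven leaves the allocation to first_even unchanged.
print(prob(first_even), cond_prob(first_even, sum_seven))   # 1/2 1/2
print(independent(first_even, sum_seven))                   # True

# Conditioning on a sum of eight changes it, so these subsets are dependent.
print(cond_prob(first_even, sum_eight))                     # 3/5
print(independent(first_even, sum_eight))                   # False
```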
The independence of subsets, however, doesn’t tell us anything about how entire conditional probability distributions behave. For example, we might be tempted to consider the case where every measurable subset \mathsf{x} \in \mathcal{X} is independent of \mathsf{x}_{2}, \pi ( \mathsf{x} \mid \mathsf{x}_{2} ) = \pi( \mathsf{x} ). In this case, the entire conditional probability distribution would reduce to the initial probability distribution. Unfortunately, a condition this strong is hard to satisfy. In fact, taking \mathsf{x} = \mathsf{x}_{2} gives \pi( \mathsf{x}_{2} ) = \pi( \mathsf{x}_{2} )^{2}, so the condition holds only when \pi( \mathsf{x}_{2} ) = 1, that is when \mathsf{x}_{2} is all of X up to a \pi-null subset, in which case we’re not really restricting the initial probability distribution in the first place!
A much more useful notion of independence is when almost all of the conditional probability distributions in a conditional probability kernel are equivalent, so the conditional behavior is independent of whichever partition cell, level set, or output point we consider. This behavior is often referred to as conditional independence.
To rigorously define conditional independence, however, we need the level sets to be particularly well-behaved. In general, the level sets of a function don’t need to share the same topology. Most functions, however, feature level sets with uniform or almost-uniform topologies. For example, the level sets of the projection function \begin{alignat*}{6} \varpi :\; &\mathbb{R}^{2}& &\rightarrow& \; &\mathbb{R}& \\ &(x_{1}, x_{2})& &\mapsto& &x_{1}& \end{alignat*} are all real lines. Similarly, the level sets of the radial function \begin{alignat*}{6} r :\; &\mathbb{R}^{2}& &\rightarrow& \; &\mathbb{R}^{+}& \\ &(x_{1}, x_{2})& &\mapsto& &\sqrt{ x_{1}^{2} + x_{2}^{2} }& \end{alignat*} are all circles except for the exceptional level set r(x_{1}, x_{2}) = 0 that degenerates to a single point.
When almost all of the level sets of a function share the same topology, we can treat them as equivalent representations of some common space L. Mathematically, we denote this as f^{-1}(y) \equiv L. In this case we can, at least in theory, construct conditional probability kernels such that almost all of the conditional probability distributions are equivalent to some common probability distribution over L, \pi^{f}_{y} ( \mathsf{x}_{y} ) = \rho ( \mathsf{x}_{y} ).
If the conditional probability kernel that we get by conditioning a probability distribution \pi on a function f : X \rightarrow Y behaves in this way, then we say that \pi is independent of f. This does not mean that the conditional probability distributions \pi^{f}_{y} behave exactly like \pi, but rather that f_{*} \pi-almost all of them behave exactly like each other. In other words, the behavior of \pi^{f}_{y} is independent of which level set f^{-1}(y), and hence which output point y \in Y, we consider.
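This notion of independence can be illustrated with a minimal sketch on a finite product space, where every level set of the projection onto the first component can be identified with the same common space L = B. The spaces A and B, the probability allocations, and all function names are hypothetical; the sketch assumes every pushforward probability is non-zero so that the discrete ratio definition applies.

```python
from fractions import Fraction as F

# Hypothetical finite spaces; X = A x B and f is the projection (a, b) -> a.
A = [0, 1, 2]
B = [0, 1]

def conditionals(joint):
    """Map each output point a to its conditional distribution over L = B.

    Assumes every pushforward probability is non-zero.
    """
    out = {}
    for a in A:
        push = sum(joint[(a, b)] for b in B)  # pushforward probability of {a}
        out[a] = tuple(joint[(a, b)] / push for b in B)
    return out

def independent_of_projection(joint):
    """True when every conditional distribution is the same distribution on L."""
    return len(set(conditionals(joint).values())) == 1

# A product distribution: every conditional equals the common distribution rho.
p = {0: F(1, 2), 1: F(1, 3), 2: F(1, 6)}
rho = {0: F(1, 4), 1: F(3, 4)}
product_joint = {(a, b): p[a] * rho[b] for a in A for b in B}

# A coupled distribution: the conditionals vary with the output point.
coupled_joint = {(0, 0): F(1, 4), (0, 1): F(1, 4),
                 (1, 0): F(1, 12), (1, 1): F(1, 4),
                 (2, 0): F(1, 12), (2, 1): F(1, 12)}

print(independent_of_projection(product_joint))   # True
print(independent_of_projection(coupled_joint))   # False
```

Note that in the product case the conditional distributions match each other and the common distribution rho on B; they are not copies of the joint distribution itself, which lives on a different space.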
5 Conclusion
The benefit of conditional probability theory is relatively intuitive: decomposing probability distributions and probabilistic calculations into simpler, more manageable pieces. Realizing that intuition with consistent mathematics, however, is much more complicated.
In this chapter, we have reviewed the key foundations of conditional probability theory, but we are still missing the tools needed to apply it to continuous spaces. In particular, while I have stated that conditional probability kernels exist for uncountable and null partitions, we have not yet seen how to specify them using density functions.
Defining conditional probability density functions is an awkward problem, and one to which the next chapter will be dedicated.
Acknowledgements
A very special thanks to everyone supporting me on Patreon: Adam Fleischhacker, Adriano Yoshino, Alessandro Varacca, Alexander Noll, Alexander Petrov, Alexander Rosteck, Andrea Serafino, Andrew Mascioli, Andrew Rouillard, Andrew Vigotsky, Ara Winter, Austin Rochford, Avraham Adler, Ben Matthews, Ben Swallow, Benoit Essiambre, Bradley Kolb, Brandon Liu, Brendan Galdo, Brynjolfur Gauti Jónsson, Cameron Smith, Canaan Breiss, Cat Shark, Charles Naylor, Charles Shaw, Chase Dwelle, Chris Jones, Christopher Mehrvarzi, Colin Carroll, Colin McAuliffe, Damien Mannion, dan mackinlay, Dan W Joyce, Dan Waxman, Dan Weitzenfeld, Daniel Edward Marthaler, Darshan Pandit, Darthmaluus, David Galley, David Wurtz, Denis Vlašiček, Doug Rivers, Dr. Jobo, Dr. Omri Har Shemesh, Dylan Maher, Ed Cashin, Edgar Merkle, Eric LaMotte, Ero Carrera, Eugene O’Friel, Felipe González, Fergus Chadwick, Finn Lindgren, Florian Wellmann, Geoff Rollins, Guido Biele, Håkan Johansson, Hamed Bastan-Hagh, Haonan Zhu, Hector Munoz, Henri Wallen, hs, Hugo Botha, Ian, Ian Costley, idontgetoutmuch, Ignacio Vera, Ilaria Prosdocimi, Isaac Vock, J, J Michael Burgess, jacob pine, Jair Andrade, James C, James Hodgson, James Wade, Janek Berger, Jason Martin, Jason Pekos, Jason Wong, Jeff Burnett, Jeff Dotson, Jeff Helzner, Jeffrey Erlich, Jesse Wolfhagen, Jessica Graves, Joe Wagner, John Flournoy, Jonathan H. Morgan, Jonathon Vallejo, Joran Jongerling, JU, Justin Bois, Kádár András, Karim Naguib, Karim Osman, Kejia Shi, Kristian Gårdhus Wichmann, Lars Barquist, lizzie, LOU ODETTE, Luís F, Marcel Lüthi, Marek Kwiatkowski, Mark Donoghoe, Markus P., Martin Modrák, Márton Vaitkus, Matt Moores, Matthew, Matthew Kay, Matthieu LEROY, Mattia Arsendi, Maurits van der Meer, Michael Colaresi, Michael DeWitt, Michael Dillon, Michael Lerner, Mick Cooney, N Sanders, N.S., Name, Nathaniel Burbank, Nic Fishman, Nicholas Clark, Nicholas Cowie, Nick S, Octavio Medina, Oliver Crook, Olivier Ma, Patrick Kelley, Patrick Boehnke, Pau Pereira Batlle, Peter Johnson, Pieter van den Berg, ptr, Ramiro Barrantes Reynolds, Raúl Peralta Lozada, Ravin Kumar, Rémi, Riccardo Fusaroli, Richard Nerland, Robert Frost, Robert Goldman, Robert kohn, Robin Taylor, Ryan Grossman, S Hong, Saleem Huda, Sean Wilson, Sergiy Protsiv, Seth Axen, shira, Simon Duane, Simon Lilburn, sssz, Stan_user, Stephen Lienhard, Stew Watts, Stone Chen, Susan Holmes, Svilup, Tao Ye, Tate Tunstall, Tatsuo Okubo, Teresa Ortiz, Theodore Dasher, Thomas Kealy, Thomas Vladeck, Tiago Cabaço, Tim Radtke, Tobychev, Tom McEwen, Tomáš Frýda, Tony Wuersch, Virginia Fisher, Vladimir Markov, Wil Yegelwel, Will Farr, woejozney, yolhaj, yureq, Zach A, Zad Rafi, and Zhengchen Cai.
References
License
The text and figures in this chapter are copyrighted by Michael Betancourt and licensed under the CC BY-NC 4.0 license.