Formal probability theory is a rich and sophisticated field of mathematics with a reputation for being confusing, if not outright impenetrable. Much of that intimidation, however, is due not to the abstract mathematics itself but rather to how it is haphazardly employed in practice. In particular, many introductions to probability theory sloppily confound the abstract mathematics with its practical implementations, convoluting what we can calculate in the theory with how we perform those calculations. To make matters even worse, probability theory is used to model a variety of subtly different systems, which then burdens the already confused mathematics with the distinct and often conflicting philosophical connotations of those applications.

In this case study I attempt to untangle this pedagogical knot and illuminate the basic concepts and manipulations of probability theory. Our ultimate goal is to demystify what we can calculate in probability theory and how we can perform those calculations in practice. We begin with an introduction to abstract set theory, continue to probability theory, and then move on to practical implementations without any interpretational distraction. We will spend time more thoroughly reviewing sampling-based calculation methods before finally considering the classic applications of probability theory and the interpretations of the theory that then arise.

In a few places I will dip a bit deeper into the mathematics than is strictly necessary. These sections are labeled “Extra Credit” and may be skipped without any consequence, but they do provide further unification and motivation of some of the more subtle aspects of probability theory.

Let me open with a warning that the section on abstract probability theory will be devoid of any concrete examples. This is not because of any conspiracy to confuse the reader, but rather a consequence of the fact that we cannot explicitly construct abstract probability distributions in any meaningful sense. Instead we must utilize problem-specific representations of abstract probability distributions, which means that concrete examples will have to wait until we introduce these representations in Section 4.

1 Setting A Foundation

Because formal probability theory works with sets in a given space we first need to review the relevant results from set theory. For a more complete and rigorous, yet readable, treatment of set theory see, for example, Folland (1999) and the appendix of Lee (2011).

A set is a collection of elements and a space is the collection of all elements under consideration. For example, \(A_{1} = \{1\}\), \(A_{2} = \{1, 5, 10, 12\}\), and \(A_{3} = \{30, 2, 7\}\), are all sets contained in the space of natural numbers, \(\mathbb{N} = \left\{0, 1, \ldots \right\}\).
To denote their containment we write \(A_{1} \subset \mathbb{N}\), \(A_{2} \subset \mathbb{N}\), and \(A_{3} \subset \mathbb{N}\). Because \(A_{1}\) is also contained in \(A_{2}\) we say that \(A_{1}\) is a subset of \(A_{2}\) and write \(A_{1} \subset A_{2}\).

A point set or atomic set is a set containing a single element such as \(\{1\}\) above. The entire space itself is always a valid set, as is the empty set or null set, \(\emptyset\), which contains no elements at all.

Sets are often defined implicitly via an inclusion criterion. These sets are denoted with the set builder notation \[ A = \left\{ x \in X \mid f(x) = 0 \right\}, \] which reads “the set of elements \(x\) in the space \(X\) such that the condition \(f(x) = 0\) holds”. For example, we can define the positive real numbers as \[ \mathbb{R}^{+} = \left\{ x \in \mathbb{R} \mid x > 0 \right\}. \]

Throughout this case study we will use four sets to demonstrate the operations of set theory and probability theory. Our first two sets are finite intervals within the natural numbers, \[ \begin{align*} A_{1} &= \left\{ x \in \mathbb{N} \mid 4 \le x \le 7 \right\} \\ A_{2} &= \left\{ x \in \mathbb{N} \mid 7 \le x \le 9 \right\} \end{align*} \] Because these intervals contain only a finite number of elements we can explicitly represent them with a vector in R,

A1 <- c(4, 5, 6, 7)
A2 <- c(7, 8, 9)

The latter two sets are finite intervals within the real numbers, \[ \begin{align*} B_{1} &= \left\{ x \in \mathbb{R} \mid -2 \le x \le 1 \right\} \\ B_{2} &= \left\{ x \in \mathbb{R} \mid 0 \le x \le 3 \right\} \end{align*} \] These real intervals contain an uncountably infinite number of points so we’ll have to represent them with their boundaries.

B1_min <- -2
B1_max <- 1

B2_min <- 0
B2_max <- 3

1.1 Set Operations

There are three natural operations between sets: set complement, set union, and set intersection.

1.1.1 Set Complement

The set complement is a unary operation that maps one set into another set. Given a set, \(A\), its complement, \(A^{c}\), is defined by all of the elements not in that set, \[ A^{c} = \left\{ x \in X \mid x \notin A \right\}. \]





For example, the complement of \(A_{1}\) contains all of the natural numbers below 4 and above 7, \[ A_{1}^{c} = \left\{ x \in \mathbb{N} \mid x < 4 \text{ OR } x > 7 \right\}. \] Similarly, the complement of \(B_{1}\) is given by \[ B_{1}^{c} = \left\{ x \in \mathbb{R} \mid x < -2 \text{ OR } x > 1 \right\}. \]

By definition the complement of the empty set is the entire space, \(\emptyset^{c} = X\), and vice versa, \(X^{c} = \emptyset\).
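
The complement of \(A_{1}\) contains infinitely many natural numbers and so cannot be stored exhaustively, but we can sketch the idea in R by assuming a hypothetical truncation of the natural numbers, here at 20, purely for illustration,

# Hypothetical truncation of the natural numbers for demonstration purposes only.
X_trunc <- 0:20
A1_comp_trunc <- setdiff(X_trunc, A1)
A1_comp_trunc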

1.1.2 Set Union

The set union is a binary operator that maps two sets into a third set. We can construct the union of two sets, \(A_{1} \cup A_{2}\), as the set of elements included in either set, \[ A_{1} \cup A_{2} = \left\{ x \in X \mid x \in A_{1} \text{ OR } x \in A_{2} \right\}. \]





For example, the union of \(A_{1}\) and \(A_{2}\) is \[ A_{1} \cup A_{2} = \left\{ x \in \mathbb{N} \mid 4 \le x \le 9 \right\}, \]

A_union <- union(A1, A2)
A_union
[1] 4 5 6 7 8 9

while the union of \(B_{1}\) and \(B_{2}\) is given by \[ B_{1} \cup B_{2} = \left\{ x \in \mathbb{R} \mid -2 \le x \le 3 \right\}, \]

which we can specify with the boundaries

B_union_min <- -2
B_union_max <- 3

The union of any set and the empty set is just that set, \(A \cup \emptyset = A\).

1.1.3 Set Intersection

The set intersection is another binary operator that maps two sets into a third set. The intersection of two sets, \(A_{1} \cap A_{2}\), is defined as the set of elements included in both sets, \[ A_{1} \cap A_{2} = \left\{ x \in X \mid x \in A_{1} \text{ AND } x \in A_{2} \right\}. \]





For example, the intersection of \(A_{1}\) and \(A_{2}\) is \[ A_{1} \cap A_{2} = \left\{ x \in \mathbb{N} \mid 7 \le x \le 7 \right\} = \{ 7 \}, \]

A_inter <- intersect(A1, A2)
A_inter
[1] 7

while the intersection of \(B_{1}\) and \(B_{2}\) is given by \[ B_{1} \cap B_{2} = \left\{ x \in \mathbb{R} \mid 0 \le x \le 1 \right\}, \] which we can specify with the boundaries

B_inter_min <- 0
B_inter_max <- 1

The intersection of any set and the empty set is just the empty set, \(A \cap \emptyset = \emptyset\).
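
As an aside, the intersection boundaries that we specified by hand above also follow from simple endpoint arithmetic. A minimal sketch, reusing the boundary variables that we have already defined,

# Intersection of two overlapping intervals: the larger of the minima and the
# smaller of the maxima. If the intervals were disjoint then the computed minimum
# would exceed the computed maximum, signaling an empty intersection.
max(B1_min, B2_min)  # recovers B_inter_min
min(B1_max, B2_max)  # recovers B_inter_max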

1.2 Sigma Algebras

The collection of all sets in a space, \(X\), is called the power set, \(\mathcal{P}(X)\). The power set is massive; it contains the empty set, the entire space, all of the atomic sets, and more.





Unfortunately, even if the space \(X\) is well-behaved, the corresponding power set can also contain some less mathematically savory elements. For example, under a seemingly innocent assumption known as the Axiom of Choice one can show that there exist sets in the real numbers that cannot be explicitly constructed. Moreover, these non-constructible sets feature some very odd properties.

Consequently, when dealing with sets we often want to consider not the entire power set but rather a restriction of the power set that avoids any pathological sets that might be hiding. We have to be careful, however, to apply this restriction in a self-consistent manner. In particular, we want our restriction to be closed under the three natural set operations so that we don’t have to worry about straying outside of our carefully chosen restriction.

For example, if a set \(A\) is in our restriction then we want its complement \(A^{c}\) to also be in our restriction. If \(A_{1}\) and \(A_{2}\) are in our restriction then we also want the restriction to contain the union, \(A_{1} \cup A_{2}\), and the intersection, \(A_{1} \cap A_{2}\). In fact we often will need our restrictions to be closed under a countably infinite, or countable, number of unions or intersections. More precisely, given a countable collection of sets, \(\left\{A_{1}, A_{2}, \ldots\right\}\), we want both the countable union, \[ \cup_{n = 1}^{\infty} A_{n}, \] and the countable intersection, \[ \cap_{n = 1}^{\infty} A_{n}, \] to be in the restriction.

A restricted collection of sets satisfying these closure properties is called a \(\sigma\)-algebra over the space \(X\) (Folland 1999) and a choice of \(\sigma\)-algebra over \(X\) will be denoted with a calligraphic font, \(\mathcal{X}\). A given space will typically admit many \(\sigma\)-algebras, although only a few will be of practical interest. For example, we can always take the trivial \(\sigma\)-algebra \(\mathcal{X} = \{ \emptyset, X \}\).

An important benefit of these closure properties is that they ensure that any non-constructible sets that may be lurking within the power set don’t persist into a \(\sigma\)-algebra. Consequently identifying any \(\sigma\)-algebra removes many of the pathological behaviors that arise in uncountably-large spaces.

Conveniently, many spaces are equipped with structures that define natural \(\sigma\)-algebras. In particular, all of the spaces that we typically deal with in practical problems, such as the integers and the real numbers, have natural \(\sigma\)-algebras. Consequently, in practice we don’t have to spend any time worrying about explicitly constructing a \(\sigma\)-algebra, and the technical need to restrict the power set to a \(\sigma\)-algebra has little impact on the practical use of sets. That said, because avoiding the pathologies of non-constructible sets is critical to a mathematically well-defined probability theory, \(\sigma\)-algebras will be a common occurrence.

For more on these natural \(\sigma\)-algebras see the following optional section.

1.3 Extra Credit: Topology

A topology on a space, \(X\), is a collection of sets that contains the empty set and the full space and is closed under arbitrary unions and finite intersections (Lee 2011). The sets in a given topology are called topologically open sets and the complement of any open set is called a topologically closed set. In a rough intuition, topologically closed sets contain their boundaries while topologically open sets do not. Be warned that open and closed in the topological sense are not exact opposites – by definition the empty set and the full space are both topologically open and topologically closed or, shudder, topologically clopen.

The topology of a space provides a formal way to define important properties like continuity. For example, the fundamental differences between discrete and continuous spaces arise from their different topologies. In particular, all of the common spaces, such as the integers and the real numbers, have unique topologies from which many of their more-intuitive properties arise.

Although superficially similar, a topology is distinct from a \(\sigma\)-algebra. For example, the topologically open sets are not closed under the complement operation. If we take the topologically open sets defined by a topology and close them under complements and countable unions and intersections, however, then we get a proper \(\sigma\)-algebra. \(\sigma\)-algebras constructed from the topology of a space in this way are denoted Borel \(\sigma\)-algebras (Folland 1999).

Because the typical spaces we’ll consider all have natural topologies they consequently also have natural Borel \(\sigma\)-algebras that exclude pathological sets that can cause problems when trying to define objects like probabilities.

1.4 Functions

A function is a relation that associates elements in one space to elements in another space. If a function, \(f\), maps elements in the space \(X\) to elements in the space \(Y\) then we write \[ f : X \rightarrow Y. \] If we want to be particularly explicit and denote how a particular element \(x \in X\) maps into elements of Y then we can also write \[ \begin{alignat*}{6} f :\; &X& &\rightarrow& \; &Y& \\ &x& &\mapsto& &f(x)&. \end{alignat*} \]





A function also induces a mapping from sets in the input space, \(X\), to sets in the output space, \(Y\), by applying the function to each individual element in the input set, \[ f_{*}(A) = \left\{ y \in Y \mid y = f(x) \text{ for some } x \in A \right\}. \] This induced map is also known as a pushforward map over sets.





At the same time we can also map sets from the output space, \(Y\), to sets in the input space, \(X\), by considering all of the points that map into a given set, \[ f^{*}(B) = \left\{ x \in X \mid f(x) \in B \right\}. \] This induced map is also known as a pullback map over sets.





These induced maps will, in general, be between the power sets of the two spaces, \[ \begin{align*} f_{*} &: \mathcal{P}(X) \rightarrow \mathcal{P}(Y) \\ f^{*} &: \mathcal{P}(Y) \rightarrow \mathcal{P}(X), \end{align*} \] but they need not respect any restriction of the power sets that we might be considering, in particular a choice of \(\sigma\)-algebras on the input and output spaces. When dealing with \(\sigma\)-algebras, then, we often want to limit our consideration to only those special functions that induce maps between the \(\sigma\)-algebras themselves, \[ \begin{align*} f_{*} &: \mathcal{X} \rightarrow \mathcal{Y} \\ f^{*} &: \mathcal{Y} \rightarrow \mathcal{X}. \end{align*} \]

2 Mathematical Logistics

Once the philosophy has been stripped away, probability theory is simply the study of an object, a probability distribution that assigns values to sets, and the transformations of that object. To be fair this isn’t particularly surprising as most of pure mathematics ultimately reduces to studying objects and their transformations! Fortunately, the basic concepts of probability distributions and their manipulations are reasonably intuitive, even from this more formal perspective.

Probability textbooks tend to be either too simple, ignoring many important concepts and succumbing to the pedagogical issues I mentioned in the introduction, or too complex, focusing on technical details that quickly fall beyond the proficiency of many readers. My favorite treatment of the more formal details of probability theory, and its predecessor measure theory, is Folland (1999), who spends significant time discussing concepts between the technical details.

Working through those details, however, is a serious investment in time. In this section we instead review those concepts relevant to the practical application of probability theory, and some of the details necessary to piece those concepts together and form a self-contained treatment.

2.1 Probability Distributions

From an abstract mathematical perspective, probability is surprisingly boring. Probability is simply a positive, conserved quantity that we want to distribute across a given space – in particular it does not necessarily refer to anything inherently random or uncertain. If we rescale this quantity by the total amount and work with relative proportions then, regardless of the nature of the conserved quantity, we can treat it as unit-less with a total of 1.

A probability distribution defines a mathematically self-consistent allocation of this conserved quantity across \(X\). If \(A\) is a subset of \(X\) then we write \(\mathbb{P}_{\pi}[A]\) as the probability assigned to \(A\) by the probability distribution \(\pi\).

This allocation should be exhaustive such that all of it is distributed into \(X\). In other words we require that the probability assigned to the full space is always the total probability available, \(\mathbb{P}_{\pi} [X] = 1\).





Moreover, any distribution of probability should be globally consistent over any disjoint decomposition of the total space. For example, consider the decomposition of \(X\) into the two disjoint sets \(A_{1}\) and \(A_{2}\).





We can assign an arbitrary amount of probability to \(A_{1}\),





but then we have to assign the remaining probability to \(A_{2}\),





In other words, the total probability assigned to \(A_{1}\) and \(A_{2}\) must sum to \(1\).





Indeed we’ll want to extend this behavior to local consistency of the probability distribution over the disjoint decomposition of any set. For example, consider the decomposition of \(A_{1}\) into the two disjoint sets \(A_{3}\) and \(A_{4}\).





The total probability assigned to \(A_{1}\) is given by the probability allocated to \(A_{3}\) plus that allocated to \(A_{4}\).





In particular, we can derive global consistency of probability distribution by recursively applying local consistency, \[ \begin{align*} \mathbb{P}_{\pi}[A_{3}] + \mathbb{P}_{\pi}[A_{4}] + \mathbb{P}_{\pi}[A_{2}] &= \mathbb{P}_{\pi}[A_{1}] + \mathbb{P}_{\pi}[A_{2}] \\ &= \mathbb{P}_{\pi}[X] \\ &= 1. \end{align*} \]





More formally we want the allocation of probability to any collection of disjoint sets, \(A_{n} \cap A_{m} = \emptyset, n \ne m\), to be the same as the allocation to the union of those sets, \[ \sum_{n = 1}^{N} \mathbb{P}_{\pi} [ A_{n} ] = \mathbb{P}_{\pi} [ \cup_{n = 1}^{N} A_{n} ], \] so that no matter how we decompose the space \(X\), or any subsets of \(X\), we don’t lose any probability.

Consistency over any finite collection of sets is known as finite additivity and this would be sufficient if there were only a finite number of subsets in \(X\). Finite additivity, however, is insufficient for distributing probability across spaces which feature a countably infinite number of subsets. In particular consider computing the probability assigned to a circle using only rectangles, which is how distributions across the real numbers are more formally derived. No finite number of disjoint rectangles will exactly cover the circle and hence yield the desired probability, but the limit of a countably infinite number of them will.





Consequently in general we need our probability distributions to be countably additive over any countable collection of disjoint sets, \(A_{n} \cap A_{m} = \emptyset, n \ne m\), \[ \sum_{n = 1}^{\infty} \mathbb{P}_{\pi} [ A_{n} ] = \mathbb{P}_{\pi} [ \cup_{n = 1}^{\infty} A_{n} ]. \]

This all sounds well and good, but it turns out that we cannot construct probability distributions that satisfy any kind of additivity for all subsets of every space. The problem? Those non-constructible sets to which we were introduced in Section 1.2. It turns out that when a space features non-constructible subsets we can combine a finite number of them, \(\{ \mathfrak{A}_{1}, \ldots, \mathfrak{A}_{N} \}\), in a certain way to achieve sub-additivity, \[ \sum_{n = 1}^{N} \mathbb{P}_{\pi} [ \mathfrak{A}_{n} ] > \mathbb{P}_{\pi} [ \cup_{n = 1}^{N} \mathfrak{A}_{n} ], \] even when \(\mathfrak{A}_{n} \cap \mathfrak{A}_{m} = \emptyset, n \ne m\). We can also combine a finite number of different non-constructible subsets to achieve super-additivity, \[ \sum_{n = 1}^{N} \mathbb{P}_{\pi} [ \mathfrak{A}_{n} ] < \mathbb{P}_{\pi} [ \cup_{n = 1}^{N} \mathfrak{A}_{n} ]! \]

In other words, there is no way that we can engineer a self-consistent distribution of probability across non-constructible sets. Fortunately this pathological behavior isn’t terminal because we already know how to avoid non-constructible sets in practice! All we have to do is limit our probability distributions to any well-defined \(\sigma\)-algebra, \(\mathcal{X}\).

These minimal properties needed to achieve the probability distribution behavior that we’re after are formalized in Kolmogorov’s axioms:

  • A probability distribution \(\pi\) on a space \(X\) with \(\sigma\)-algebra \(\mathcal{X}\) is a mapping from \(\mathcal{X}\) to the real numbers \([0, 1]\), \[ \begin{alignat*}{6} \mathbb{P}_{\pi} :\; &\mathcal{X}& &\rightarrow& \; &[0, 1]& \\ &A& &\mapsto& &\mathbb{P}_{\pi} [A]&. \end{alignat*} \]
  • This mapping defines a lossless allocation, \(\mathbb{P}_{\pi} [X] = 1\).
  • This mapping defines a consistent allocation for any countable collection of disjoint sets in \(\mathcal{X}\), \[ \mathbb{P}_{\pi} [ \cup_{n = 1}^{\infty} A_{n} ] = \sum_{n = 1}^{\infty} \mathbb{P}_{\pi} [ A_{n} ], \] \[ A_{n} \cap A_{m} = \emptyset, \, n \ne m. \]

The more familiar rules of probability theory can all be derived from these axioms. For example the consistency condition implies that \[ \mathbb{P}_{\pi}[A] + \mathbb{P}_{\pi}[A^{c}] = \mathbb{P}_{\pi}[X] = 1 \] or \[ \mathbb{P}_{\pi}[A] = 1 - \mathbb{P}_{\pi}[A^{c}]. \]

With a little work one can also derive the probability sum rules, \[ \mathbb{P}_{\pi}[ A_{1} \cup A_{2} ] = \mathbb{P}_{\pi}[ A_{1} ] + \mathbb{P}_{\pi}[ A_{2} ] - \mathbb{P}_{\pi}[ A_{1} \cap A_{2} ], \] and \[ \mathbb{P}_{\pi}[ A_{1} \cap A_{2} ] = \mathbb{P}_{\pi}[ A_{1} ] + \mathbb{P}_{\pi}[ A_{2} ] - \mathbb{P}_{\pi}[ A_{1} \cup A_{2} ]. \]

These rules can be used to construct explicit probability distributions when the \(\sigma\)-algebra is small enough. For example, consider the binary space \(X = \{0, 1\}\). In this case the natural \(\sigma\)-algebra is the entire power set consisting of the empty set, \(A_{1} = \emptyset\), the atomic sets, \(A_{2} = \{ 0 \}\) and \(A_{3} = \{ 1 \}\), and the entire space, \(A_{4} = X\). What probabilities can we assign to these sets? The axioms require that \(\mathbb{P}_{\pi} [ A_{4} ] = 1\), and the complement rule then requires that \(\mathbb{P}_{\pi} [ A_{1} ] = 0\). We are free to assign any probability to one of the atomic sets, so we can take \(\mathbb{P}_{\pi} [ A_{3} ] = p\) in which case the complement rule requires that \(\mathbb{P}_{\pi} [ A_{2} ] = 1 - p\). In other words, any probability distribution on the binary space \(\{0, 1\}\) is completely specified by the parameter \(p\) and the resulting allocation \[ \begin{align*} \mathbb{P}_{\pi} [ \;\, \emptyset \;\, ] &= 0 \\ \mathbb{P}_{\pi} [ \{ 0 \} ] &= 1 - p \\ \mathbb{P}_{\pi} [ \{ 1 \} ] &= p \\ \mathbb{P}_{\pi} [ \;\, X \; ] &= 1. \end{align*} \]
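
Since this \(\sigma\)-algebra contains only four sets we can record the entire allocation explicitly as a lookup table; a minimal sketch in R, assuming the hypothetical value \(p = 0.3\) purely for illustration,

p_demo <- 0.3  # hypothetical value of the parameter p
binary_probs <- c("empty" = 0, "{0}" = 1 - p_demo, "{1}" = p_demo, "X" = 1)

# Consistency check: the probabilities of the two atomic sets sum to P[X] = 1.
unname(binary_probs["{0}"] + binary_probs["{1}"])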

A probability distribution defined by Kolmogorov’s axioms is completely specified by the probability triplet \((X, \mathcal{X}, \pi)\), which is often denoted more compactly as \[ x \sim \pi. \] Here the variable \(x \in X\) serves only to denote the space \(X\), \(\pi\) denotes the probability distribution, and a valid \(\sigma\)-algebra is implicitly assumed.
Care must be taken to not over interpret this notation – in particular it says nothing about individual points \(x \in X\) but rather denotes a probability distribution over all sets in the assumed \(\sigma\)-algebra.

2.2 Measurable Transformations

Once we have defined a probability distribution on a space, \(X\), and a well-behaved collection of subsets, \(\mathcal{X}\), we can then consider how the probability distribution transforms when \(X\) transforms. In particular, let \(f: X \rightarrow Y\) be a transformation from \(X\) to another space \(Y\). Can this transformation also transform our probability distribution on \(X\) onto a probability distribution on \(Y\), and if so under what conditions?

The answer is straightforward once we have selected a \(\sigma\)-algebra for \(Y\) as well, which we will denote \(\mathcal{Y}\). In order for \(f\) to induce a probability distribution on \(Y\) we need the two \(\sigma\)-algebras to be compatible in some sense. In particular we need every subset \(B \in \mathcal{Y}\) to correspond to a unique subset \(f^{-1}(B) \in \mathcal{X}\). If this holds for all subsets in \(\mathcal{Y}\) then we say that the transformation \(f\) is measurable and we can define a pushforward distribution, \(\pi_{*}\) by \[ \mathbb{P}_{\pi_{*}} [ B ] = \mathbb{P}_{\pi} [ f^{-1} (B) ]. \] In other words, if \(f\) is measurable then a self-consistent allocation of probability over \(X\) induces a self-consistent allocation of probability over \(Y\).





2.2.1 Projections

Measurable transformations can be used to project a probability distribution over a space onto a probability distribution over a lower-dimensional subspace.
Let \(\varpi: X \rightarrow Y\) be a projection operator that maps points in a space \(X\) to points in the subspace \(Y \subset X\). It turns out that in this case a \(\sigma\)-algebra on \(X\) naturally defines a \(\sigma\)-algebra on \(Y\) and the projection operator is measurable with respect to this choice. Consequently any joint probability distribution on \(X\) will transform into a unique marginal probability distribution on \(Y\). More commonly we say that we marginalize out the complementary subspace, \(Y^{C}\).





Marginalization is a bit more straightforward when we are dealing with a product space, \(X \times Y\), which is naturally equipped with the component projection operators \(\varpi_{X} : X \times Y \rightarrow X\) and \(\varpi_{Y}: X \times Y \rightarrow Y\). In this case by pushing a distribution over \((X \times Y, \mathcal{X} \times \mathcal{Y})\) forwards along \(\varpi_{X}\) we marginalize out \(Y\) to give a probability distribution over \((X, \mathcal{X})\). At the same time by pushing that same distribution forwards along \(\varpi_{Y}\) we can marginalize out \(X\) to give a probability distribution over \((Y, \mathcal{Y})\).





Consider, for example, the three-dimensional space, \(\mathbb{R}^{3}\), where the coordinate functions serve as projection operators onto the three axes, \(X\), \(Y\), and \(Z\). Marginalizing out \(X\) transforms a probability distribution over \(X \times Y \times Z\) to give a probability distribution over the two-dimensional space, \(Y \times Z = \mathbb{R}^{2}\). Marginalizing out \(Y\) then gives a probability distribution over the one-dimensional space, \(Z = \mathbb{R}\).
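
Jumping slightly ahead to the probability mass function representations of Section 4.1, we can sketch marginalization concretely with a hypothetical joint distribution over the product of two binary spaces,

# Hypothetical joint probability masses over the product space X x Y, with rows
# indexing X and columns indexing Y; the entries sum to one.
joint <- matrix(c(0.1, 0.3, 0.2, 0.4), nrow=2,
                dimnames=list(X=c("x1", "x2"), Y=c("y1", "y2")))

rowSums(joint)  # pushforward along the projection onto X, marginalizing out Y
colSums(joint)  # pushforward along the projection onto Y, marginalizing out X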

2.2.2 Reparameterizations

Another important class of measurable functions are those one-to-one maps for which \(f(A) \in \mathcal{Y}\) for any \(A \in \mathcal{X}\) in addition to \(f^{-1}(B) \in \mathcal{X}\) for any \(B \in \mathcal{Y}\). In this case \(f\) not only transforms a probability distribution on \(X\) into a probability distribution on \(Y\) but also transforms a probability distribution on \(Y\) into a probability distribution on \(X\). Moreover, transforming a distribution on \(X\) to one on \(Y\) and back perfectly restores the original distribution – unlike projections these reparameterizations preserve all information encoded in a probability distribution.

In this case \((X, \mathcal{X}, \pi)\) and \((Y, \mathcal{Y}, \pi_{*})\) are really just two different manifestations, or parameterizations, of the same abstract probability system. The two parameterizations, for example, might correspond to different choices of coordinate system, different choices of units, or different choices of language dialect capable of the same descriptions. The reparameterizations then translate from one equivalent manifestation to another.

For discrete spaces all reparameterizations can be reduced to permutations that simply rearrange the individual elements of the space. This simple intuition, however, doesn’t carry over to continuous spaces for which reparameterizations can be significantly more subtle.
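
A minimal sketch of such a permutation, again jumping ahead to probability mass functions and assuming a hypothetical allocation over a three-element space,

# Hypothetical probability masses over the discrete space {a, b, c}.
pi_x <- c("a" = 0.2, "b" = 0.5, "c" = 0.3)

# A one-to-one relabeling of the elements; the pushforward probability mass
# function just carries each probability mass along to its new label.
relabel <- c("a" = "c", "b" = "a", "c" = "b")
pi_y <- setNames(pi_x, relabel[names(pi_x)])
pi_y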

3 Great Expectations

Conveniently the distribution of probability across a space also serves a supplementary purpose: it allows us to summarize how functions of the form \(f: X \rightarrow \mathbb{R}\) behave.

Without probability the only way that we can summarize the behavior of such a function is through topological information such as minima and maxima. The introduction of a probability distribution, however, allows us to average the behavior of a function across its input domain. For example, consider the function below along with two sets, \(A_{1}\) and \(A_{2}\).





If \(\mathbb{P}_{\pi}[ A_{1} ] > \mathbb{P}_{\pi}[ A_{2} ]\) then the behavior of \(f\) within \(A_{1}\) is more relevant to an average.

Expectation values, \(\mathbb{E}_{\pi}[f]\), reduce a function to a single real number by weighting the function output at every point by the probability distributed around that point. More formally, let \(\mathcal{C}\) be the space of all functions from our target space to the real numbers, \(f : X \rightarrow \mathbb{R}\). Expectation is then a map \[ \begin{alignat*}{6} \mathbb{E}_{\pi} :\; &\mathcal{C}& &\rightarrow& \; &\mathbb{R}& \\ &f& &\mapsto& &\mathbb{E}_{\pi} [f]&. \end{alignat*} \]

Formalizing this expectation that weights a function with probabilities is subtle, but we can build up some intuition using one class of functions for which the process is straightforward. An indicator function is a function that vanishes outside of a given set, \[ \mathbb{I}_{A} [x] = \left\{ \begin{array}{rr} 1, & x \in A \\ 0, & x \notin A \end{array} \right. . \]





Because the indicator function is constant within its defining set, \(A\), we can simply stipulate its expectation value to be the weight assigned to that set, which is just the probability it is allocated, \[ \mathbb{E}_{\pi} [ \mathbb{I}_{A} ] \equiv \mathbb{P}_{\pi} [ A ]. \] We can then build up the expectation value of an arbitrary function with a careful approximation in terms of these indicator functions in a process known as Lebesgue integration. For more detail see the following optional section.

3.1 Extra Credit: Lebesgue Integration

As we saw above, only the indicator functions have immediate expectation values in terms of probabilities. In order to define expectation values of more general functions we have to build increasingly more complex functions out of these elementary ingredients.

The countable sum of indicator functions weighted by real numbers defines a simple function, \[ \phi = \sum_{n} a_{n} \mathbb{I}_{A_{n}}. \] If we require that expectation is linear over this summation then the expectation value of any simple function is given by \[ \begin{align*} \mathbb{E}_{\pi} [ \phi ] &= \mathbb{E}_{\pi} \left[ \sum_{n} a_{n} \mathbb{I}_{A_{n}} \right] \\ &= \sum_{n} a_{n} \mathbb{E}_{\pi} [ \mathbb{I}_{A_{n}} ] \\ &= \sum_{n} a_{n} \mathbb{P}_{\pi} [ A_{n} ]. \end{align*} \] Because of the countable additivity of \(\pi\) and the boundedness of probability, the expectation of a simple function will always be finite provided that each of the coefficients \(a_{n}\) are themselves finite.

We can then use simple functions to approximate an everywhere-positive function, \(f : X \rightarrow \mathbb{R}^{+}\). A simple function with only a few terms defined over only a few sets will yield a poor approximation to \(f\), but as we consider more terms and more sets we can build an increasingly accurate approximation. In particular, because of countable additivity we can always construct a simple function that approximates \(f\) with arbitrary accuracy while remaining bounded above by \(f\).





We can then define the expectation of an everywhere-positive function as the expectation of this approximating simple function. Because we were careful to consider only simple functions bounded above by \(f\) we can define the expectation of \(f\) as the supremum of the expectations of all such bounded simple functions, \[ \mathbb{E}_{\pi} [ f ] = \sup_{\phi \le f} \mathbb{E}_{\pi} [ \phi ]. \]

For functions that aren’t everywhere-positive we can decompose \(X\) into a collection of neighborhoods where \(f\) is entirely positive, \(A^{+}_{n}\), and entirely negative, \(A^{-}_{m}\). In those neighborhoods where \(f\) is entirely positive we apply the above procedure to define \[ \mathbb{E}_{\pi} [ f \cdot \mathbb{I}_{A^{+}_{n}}], \] while in the neighborhoods where \(f\) is entirely negative we apply the above procedure to the negation of \(f\) to define \[ \mathbb{E}_{\pi} [ -f \cdot \mathbb{I}_{A^{-}_{m}}]. \] Those regions where \(f\) vanishes yield zero expectation values and can be ignored. We then define the expectation value of an arbitrary function \(f\) as the sum of these contributions, \[ \mathbb{E}_{\pi} [ f ] = \sum_{n = 0}^{\infty} \mathbb{E}_{\pi} [ f \cdot \mathbb{I}_{A^{+}_{n}}] - \sum_{m = 0}^{\infty} \mathbb{E}_{\pi} [ -f \cdot \mathbb{I}_{A^{-}_{m}}]. \]

Formally this procedure is known as Lebesgue integration and is a critical tool in the more general measure theory of which probability theory is a special case.
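
To make this construction a bit more tangible, here is a rough numerical illustration that jumps ahead to the Gaussian cumulative distribution function pnorm from Section 4.2. The simple function here takes the value of the target function at each interval midpoint rather than a strict lower bound, which is enough for a numerical demonstration; a finer grid of intervals yields a more accurate approximation.

# Approximate the expectation of f(x) = x^2 under a standard Gaussian distribution
# with a simple function that is constant on each of a grid of small intervals.
breaks <- seq(-8, 8, 0.01)
lowers <- head(breaks, -1)
uppers <- tail(breaks, -1)

a_n <- ((lowers + uppers) / 2)**2                 # constant values of the simple function
p_n <- pnorm(uppers, 0, 1) - pnorm(lowers, 0, 1)  # probability allocated to each interval
sum(a_n * p_n)                                    # approximately 1, the exact second moment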

3.2 Useful Expectations

Expectation values serve two powerful purposes. As we saw above they can be used to summarize the behavior of a function \(f : X \rightarrow \mathbb{R}\) in the context of a given probability distribution. We can also invert this perspective, however, and use the expectation value of certain functions to characterize the given probability distribution itself. In this section I’ll introduce certain functions whose expectations are particularly useful in interrogating the properties of probability distributions defined on certain spaces.

First, however, a word of caution. The structure of these special functions, and hence the interpretation of these expectation values, is specific to the context of a given space and will not, in general, push forward along a transformation to another space. In particular, the interpretation is unique to a given parameterization and will change under reparameterizations even though the abstract probabilistic system does not.

If desired we can preserve the value of an expectation value if we first push the special function forward along the transformation and only then perform an expectation. Expectation and transformation do not, in general, commute!

3.2.1 This Distribution and I Are Having a Moment

When our target space is a subset of the real line, \(X \subseteq \mathbb{R}\), there is a natural embedding of \(X\) into \(\mathbb{R}\), \[ \begin{alignat*}{4} \iota :\; &X& &\rightarrow& \mathbb{R} \\ &x& &\mapsto& x. \end{alignat*} \] For example there is a natural embedding that associates the natural numbers, \(\left\{0, 1, 2, \ldots \right\}\), with the corresponding values in the real line.





Similarly there is a natural embedding of the real interval \([0, 1]\) with the corresponding interval in the full real line.





If our target space is itself the real line then the identity function serves as an appropriate embedding.





The expectation value of this embedding function, and simple transformations of the embedding function, are extremely useful in characterizing certain properties of probability distributions. Moments are expectations of powers of the embedding function, \[ \mathbb{m}_{\pi, n} \equiv \mathbb{E}_{\pi} [ \iota^{n} ]. \] Central moments center the embedding function around the first moment, \[ \mathbb{c}_{\pi, n} \equiv \mathbb{E}_{\pi} [ (\iota - \mathbb{m}_{\pi, 1})^{n} ]. \]

The mean of a probability distribution is the corresponding first moment, \[ \mathbb{m}_{\pi} = \mathbb{m}_{\pi, 1} = \mathbb{E}_{\pi} [ \iota ], \] which quantifies a sense of where in the target space the probability distribution is concentrating its allocation.

At the same time the variance of a probability distribution is the second central moment, \[ \mathbb{V}_{\pi} = \mathbb{c}_{\pi, 2} = \mathbb{E}_{\pi} [ (\iota - \mathbb{m}_{\pi})^{2}], \] which quantifies a sense of breadth of the probability allocation around the mean. In practice the standard deviation \[ \mathbb{s}_{\pi} = \sqrt{ \mathbb{V}_{\pi} } \] is often more interpretable than the variance.

Higher-order central moments go on to quantify more intricate properties of a given probability distribution. The third central moment, for example, quantifies how asymmetrically probability is distributed around the mean.

While we can always define expectation values of a given function \(f: X \rightarrow \mathbb{R}\), a probability distribution will not always have well-defined moments. On one hand the corresponding expectation values may be infinite or simply ill-posed. On the other hand the embedding function itself may not be well-defined.

For example, if our space is a subset of the real numbers in more than one dimension, \(X \subseteq \mathbb{R}^{N}, \, N > 1\), then there is no unique embedding function and hence no well-defined moments. We can, however, define component moments as expectations of the coordinate functions, \(\hat{x}_{n}: X \rightarrow \mathbb{R}\), that project a point \(x \in X\) onto each of the component real lines. These component means and variances then provide some quantification of how probability is allocated along each axis.

On the other hand, there is no unique embedding of a sphere into the real line – any embedding turns into an equally good embedding by rotating the sphere before applying the embedding. Consequently there are no unique moments for distributions on spheres and many other non-Euclidean spaces! In practice one adds additional structure to distinguish unique moments, for example with circular moments in the case of spheres.

Unfortunately it is also common for the mean of a function to refer to its expectation value, \[ \mathbb{m}_{\pi} [f] = \mathbb{E}_{\pi} [ f ], \] and for the variance of a function to refer to the expectation \[ \mathbb{V}_{\pi} [f] = \mathbb{E}_{\pi} [ (f - \mathbb{E}_{\pi}[f])^{2}]. \] In order to avoid confusion with the mean and variance of a probability distribution, I will always specify the argument \([f]\) when these alternative notations are being used.

Finally, keep in mind that moments are defined within the context of a particular parameterization. Reparameterizing to another space introduces a different embedding function which defines different moments that quantify different properties of the abstract probability system.

3.2.2 Cumulative Distribution Functions and Quantiles

The target space is ordered when we can uniquely order its points. For example the one-dimensional real line is ordered because each point is placed above or below all others, but the two-dimensional real plane is not ordered because, for example, there is no unique way to define whether the point \((1, 0)\) is greater than or less than the point \((0, 1)\). Similarly, there’s no self-consistent way to order a circle.

When the target space is ordered we can define one-sided interval sets, \[ I(x) = \left\{ x' \in X \mid x_{\mathrm{min}} \le x' \le x \right\}, \] where \(x_{\mathrm{min}}\) is the smallest value in \(X\), or \(-\infty\) when no such smallest value exists, as well as two-sided interval sets, \[ I(x_{1}, x_{2}) = \left\{ x' \in X \mid x_{1} < x' \le x_{2} \right\}. \]

The cumulative distribution function is constructed from the expectation values of all of the indicator functions corresponding to one-sided interval sets, \[ \Pi(x) = \mathbb{E}_{\pi} [ \mathbb{I}_{I(x)} ] = \mathbb{P}_{\pi} [ I(x) ]. \]





Cumulative distribution functions not only give the probability of one-sided intervals, the difference between cumulative distribution functions also gives the probability of two-sided intervals. Because \[ I(x_{2}) = I(x_{1}) \cup I(x_{1}, x_{2}) \] we have \[ \mathbb{P}_{\pi} [ I(x_{2}) ] = \mathbb{P}_{\pi}[ I(x_{1})] + \mathbb{P}_{\pi} [ I(x_{1}, x_{2}) ] \] or \[ \begin{align*} \mathbb{P}_{\pi} [ I(x_{1}, x_{2}) ] &= \mathbb{P}_{\pi} [ I(x_{2}) ] - \mathbb{P}_{\pi} [ I(x_{1}) ] \\ &= \Pi(x_{2}) - \Pi(x_{1}). \end{align*} \]





Quantiles are inverses of the cumulative distribution function for a given target probability, \[ x_{p} = \Pi^{-1} (p). \] More descriptively, a \(p\)-quantile is the point in the target space below which we’ve accumulated the probability \(p\). The most common quantile is \(x_{0.5}\), or the median, which quantifies the point in the target space that splits the total probability into equal allocations below and above. This quantifies a sense of where a probability distribution concentrates that differs from the mean. Tail quantiles, such as \(x_{0.05}\) and \(x_{0.95}\), quantify a different sense of the breadth of a probability distribution than the variance.
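
Jumping ahead to the Gaussian family of Section 4.2, where the cumulative distribution function is implemented by pnorm and its inverse by qnorm, a quick sketch of these quantiles looks like

mu <- 1       # anticipating the location and scale used in Section 4.2
sigma <- 1.25

qnorm(0.5, mu, sigma)                     # the median
qnorm(c(0.05, 0.95), mu, sigma)           # tail quantiles
pnorm(qnorm(0.25, mu, sigma), mu, sigma)  # inverting recovers the target probability, 0.25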

Because a reparameterization can arbitrarily reorder a space, a cumulative distribution function is defined only within the context of a given parameterization. As with moments, quantiles in different parameterizations isolate different properties of the abstract probability system.

3.2.3 Histograms

A histogram is the collection of probabilities over a sequence of disjoint intervals, \(\left\{ B_{1}, \ldots, B_{N} \right\}\), \[ \left\{ \mathbb{P}_{\pi} [ B_{1} ], \ldots, \mathbb{P}_{\pi} [ B_{N} ] \right\}, \] or, equivalently, the expectation values of the corresponding indicator functions, \[ \left\{ \mathbb{E}_{\pi} [ \mathbb{I}_{B_{1}} ], \ldots, \mathbb{E}_{\pi} [ \mathbb{I}_{B_{N}} ] \right\}. \]





For one-dimensional, and occasionally two-dimensional target spaces, histograms are useful ways of visualizing the overall behavior of a probability distribution.

4 Representing Probability Distributions with Densities

As we saw in the previous section, formal probability theory is simply the study of probability distributions that distribute a finite, conserved quantity across a space, the expectation values that such a distribution induces, and how the distribution behaves under transformations of the underlying space. While there is myriad complexity in the details of that study, the basic concepts are relatively straightforward.

Still, those concepts have so far been presented in the abstract without any concrete examples to provide context. Why weren’t there any examples to accompany the discussion in the previous section? Because, unfortunately, probability distributions cannot be explicitly defined in most problems!

Recall that a probability distribution is defined as a map from the well-defined subsets in a \(\sigma\)-algebra to real numbers in the interval \([0, 1]\). For a \(\sigma\)-algebra with only a few elements we could completely specify this mapping with a lookup table that associates a probability with each set, but in practice almost all \(\sigma\)-algebras are simply too large for this to be practical. In order to implement probability distributions in practice we need a way to define this mapping algorithmically so that we can calculate probabilities and expectation values on the fly.

Fortunately most probabilistic systems admit representations that faithfully recover all probabilities and expectation values on demand and hence completely specify a given probability distribution. In particular, density representations exploit the particular structure of \(X\) to fully characterize a probability distribution with special functions.

One of the most common sources of misunderstanding in probability theory is the confusion of an abstract probability distribution with these representations. In an understandable desire to present concrete examples as early as possible, many introductory treatments of probability theory begin immediately with these representations only to end up muddling their relationship with the abstract mathematics that they are characterizing. Now that we have worked through the abstract definitions, however, we can introduce representations in a more principled way.

4.1 Probability Mass Functions

When \(X\) is a discrete space any probability distribution can be represented by a probability mass function that assigns a probability to each atomic element of the space, \(\pi : X \rightarrow [0, 1]\). From these atomic allocations we can reconstruct the probability of any subset \(A \subset X\) as \[ \mathbb{P}_{\pi} [ A ] = \sum_{x \in A} \pi(x), \] and any expectation value as \[ \mathbb{E}_{\pi} [ f ] = \sum_{x \in X} \pi(x) f(x). \]

Probability mass functions also have the elegant property of transforming quite naturally along a measurable transformation \(g : X \rightarrow Y\), \[ \pi_{*} (y) = \sum_{x \in g^{-1}(y)} \pi(x). \] This implies that the probability of a set in \(Y\) is given by \[ \mathbb{P}_{\pi_{*}} [ B ] = \sum_{y \in B } \sum_{x \in g^{-1}(y)} \pi(x), \] while expectation values of functions on \(Y\) are given by \[ \mathbb{E}_{\pi_{*}} [ f ] = \sum_{x \in X } \pi(x) \, f(g(x)). \]

For a reparameterization \(g\) is one-to-one and \(g^{-1}(y)\) identifies a single point, so the pushforward probability mass function is just a relabeling of the original probability mass function, as expected.

4.1.1 The Poisson Family of Probability Mass Functions

Consider, for example, the Poisson probability mass functions which allocate probabilities across the natural numbers, \(X = \left\{0, 1, 2, \ldots \right\}\), \[ \pi(x \mid \lambda) = \frac{e^{-\lambda} \lambda^{x}}{x!}, \] depending on the particular value of the intensity parameter \(\lambda\). In particular, there is no single Poisson probability mass function but rather a family of them, and a corresponding family of probability distributions specified by each member of that family.

In R the Poisson probability mass function is defined as dpois,

c_light <- c("#DCBCBC")
c_light_highlight <- c("#C79999")
c_mid <- c("#B97C7C")
c_mid_highlight <- c("#A25050")
c_dark <- c("#8F2727")
c_dark_highlight <- c("#7C0000")

plot_poisson <- function(l) {
  # Abuse hist() to draw the probability mass function as bars: build a histogram
  # object over unit bins centered on each natural number and then overwrite its
  # counts with the Poisson probability masses.
  p <- hist(0, breaks=0:21-0.5, plot=FALSE)
  p$counts <- dpois(0:20, l)

  par(mar = c(8, 6, 0, 0.5))
  plot(p, main="", col="white", border=c_dark_highlight,
       xlab="x", xlim=c(-0.5, 20.5), 
       ylab="Probability Mass", ylim=c(0, 0.2), yaxt='n',
       cex.lab=1.5, cex.axis=1.5, cex.main=1.5, cex.sub=1.5)
}

l <- 5
plot_poisson(l)

The corresponding cumulative distribution function has an analytic form and in R is given by ppois

B <- 21
# Duplicate each point so that the cumulative probabilities trace out a
# staircase when plotted with lines.
xs <- rep(0:B, each=2)
cdfs <- sapply(1:length(xs), function(n) ifelse(n > 1, ppois(xs[n - 1], l), 0))

par(mar = c(8, 6, 0, 0.5))
plot(xs, cdfs, type="l", main="", col=c_dark_highlight, 
     xlab="x", xlim=c(-0.5, 20.5), 
     ylab="Cumulative Probability", ylim=c(0, 1), yaxt='n',
     cex.lab=1.5, cex.axis=1.5, cex.main=1.5, cex.sub=1.5)

4.1.2 Computing Probabilities

It’s taken quite the buildup, but we are now finally ready to compute our first probability. Let’s start with the probability allocated to the set \(A_{1}\) that we defined in Section 1.

plot_poisson_probs <- function(A, l) {
  # Highlight the probability masses allocated to the set A by overlaying shaded
  # bars on an existing probability mass function plot.
  bin_edges <- c(A, A[length(A)] + 1) - 0.5
  p_sum <- hist(A[1], breaks=bin_edges, plot=FALSE)
  p_sum$counts <- dpois(A, l)

  plot(p_sum, col=c_dark, border=c_dark_highlight, add=T)
}

plot_poisson(l)
plot_poisson_probs(A1, l)

We can compute the probability by exhaustively summing over the elements in \(A_{1}\),

sum(sapply(A1, function(x) dpois(x, l)))
[1] 0.6016024

or we can use the difference of cumulative distribution functions,

poisson_prob <- function(A, l) {
  # Probability of a contiguous interval of natural numbers, A = {A[1], ..., A[N]},
  # computed as a difference of cumulative distribution functions.
  return(ppois(A[length(A)], l) - ppois(A[1] - 1, l))
}

poisson_prob(A1, l)
[1] 0.6016024

which, delightfully, yield the same answer.

Given this rousing success we can then ask: what is the probability allocated to the complement of \(A_{1}\)?

plot_poisson(l)
plot_poisson_probs(0:(A1[1] - 1), l)
plot_poisson_probs((A1[length(A1)] + 1):20, l)

Because \(A_{1}^{c}\) contains an infinite number of elements we can’t compute its probability by exhaustively summing probability masses, but we can compute it using the cumulative distribution function,

(1 - ppois(A1[length(A1)], l)) +  ppois(A1[1] - 1, l)
[1] 0.3983976

We can also use the sum rule for complements that we derived in Section 2.1,

1 - poisson_prob(A1, l)
[1] 0.3983976

Another victory for consistency!

Emboldened by our success let’s now consider the probability allocated to the union of \(A_{1}\) and \(A_{2}\),

plot_poisson(l)
plot_poisson_probs(A_union, l)

We can compute this by exhaustively summing,

sum(sapply(A_union, function(x) dpois(x, l)))
[1] 0.703146

subtracting cumulative distribution functions,

poisson_prob(A_union, l)
[1] 0.703146

or by using the general sum rule introduced in Section 2.1

poisson_prob(A1, l) + poisson_prob(A2, l) - poisson_prob(A_inter, l)
[1] 0.703146

Intuitively, the last term here corrects for the double counting of the probabilities assigned to the overlap of the two sets,

c_light_trans <- c("#DCBCBC80")
c_light_highlight_trans <- c("#C7999980")
c_mid_trans <- c("#B97C7C80")
c_mid_highlight_trans <- c("#A2505080")
c_dark_trans <- c("#8F272780")
c_dark_highlight_trans <- c("#7C000080")

plot_poisson(l)

bin_edges <- c(A1, A1[length(A1)] + 1) - 0.5
p_sum <- hist(A1[1], breaks=bin_edges, plot=FALSE)
p_sum$counts <- dpois(A1, l)

plot(p_sum, col=c_dark_trans, border=c_dark_highlight_trans, add=T)

bin_edges <- c(A2, A2[length(A2)] + 1) - 0.5
p_sum <- hist(A2[1], breaks=bin_edges, plot=FALSE)
p_sum$counts <- dpois(A2, l)

plot(p_sum, col=c_dark_trans, border=c_dark_highlight_trans, add=T)

Similarly we can consider the intersection of the two sets,

plot_poisson(l)
plot_poisson_probs(A_inter, l)

which we can compute in any of the three ways we’ve discussed: exhaustive summation,

sum(sapply(A_inter, function(x) dpois(x, l)))
[1] 0.1044449

differences in cumulative distribution functions,

poisson_prob(A_inter, l)
[1] 0.1044449

and the general sum rule,

poisson_prob(A1, l) + poisson_prob(A2, l) - poisson_prob(A_union, l)
[1] 0.1044449

4.1.3 Computing Expectations

The natural numbers have a unique embedding into the real numbers so means and variances are well-defined. In fact the mean and variance of a distribution specified by a Poisson probability mass function are both given by the parameter \(\lambda\), \[ \mathbb{m}_{\pi} = \mathbb{V}_{\pi} = \lambda. \]

We can verify this by approximating the corresponding expectation values with exhaustive sums confined to an interval that contains most of the probability mass. First we have to define the embedding function, which in R is an explicit cast from integers to double precision floating point numbers

iota <- function(x) {
  return(as.double(x))
}

Technically R will do this casting implicitly, but we’re going to be very explicit here to emphasize each step of the formal calculation.

In order to estimate the mean we sum over an interval containing the natural numbers from 0 to 100,

sum(sapply(0:100, function(x) iota(x) * dpois(x, l)))
[1] 5

which yields the desired value to floating point precision. We can observe how the mean estimate converges to the exact value as we consider longer and longer intervals

xs <- 1:100
partial_means <- sapply(xs, function(x) 
                        sum(sapply(0:x, function(y) iota(y) * dpois(y, l))))

plot(xs, partial_means, type="l", col=c_dark_highlight, lwd=2,
     xlab="x", ylab="Partial Mean",
     cex.lab=1.5, cex.axis=1.5, cex.main=1.5, cex.sub=1.5, yaxt='n')

In the same way we can estimate the variance

sum(sapply(0:100, function(x) (iota(x) - 5)**2 * dpois(x, l)))
[1] 5

which again yields the exact value to floating point precision.

4.1.4 Computing Pushforwards

We have to be ever vigilant in recognizing the limitations of a representation such as a probability mass function. Pushing our probability distribution forward along a measurable transformation yields a pushforward distribution with its own probability mass function, mean, and variance.

What happens to the representative probability mass function when we reparameterize with the one-to-one measurable transformation \(y = \sqrt{x}\)? Using the transformation rules introduced above we can readily visualize the probability mass function of the pushforward distribution,

# Transform the upper bin edges along y = sqrt(x), keeping -0.5 as the leftmost
# edge, and abuse the hist() density field, which is what gets plotted for
# unequal bins, so that the bar heights display the pushforward probability masses.
bin_edges <- c(-0.5, sapply(1:21, function(x) sqrt(x - 0.5)))
p <- hist(0, breaks=bin_edges, plot=FALSE)
p$density <- dpois(0:20, l)

par(mar = c(8, 6, 0, 0.5))
plot(p, main="", col=c_dark, border=c_dark_highlight,
     xlab="y = sqrt(x)", yaxt='n', ylab="Probability Mass",
     cex.lab=1.5, cex.axis=1.5, cex.main=1.5, cex.sub=1.5)

On this new space the probability mass function is no longer a Poissonian probability mass function!

The mean of the pushforward distribution is then given by

emp_mean <- sum(sapply(0:100, function(x) sqrt(iota(x)) * dpois(x, l)))
emp_mean
[1] 2.171154

while the variance is given by

sum(sapply(0:100, function(x) (sqrt(iota(x)) - emp_mean)**2 * dpois(x, l)))
[1] 0.2860894

After applying our transformation the mean and variance are no longer equal!

Something particularly interesting happens when we observe how the variance of the pushforward distribution changes with the parameter \(\lambda\).

emp_mean <- sum(sapply(0:100, function(x) sqrt(iota(x)) * dpois(x, 1)))
sum(sapply(0:100, function(x) (sqrt(iota(x)) - emp_mean)**2 * dpois(x, 1)))
[1] 0.4021731
emp_mean <- sum(sapply(0:100, function(x) sqrt(iota(x)) * dpois(x, 2.5)))
sum(sapply(0:100, function(x) (sqrt(iota(x)) - emp_mean)**2 * dpois(x, 2.5)))
[1] 0.3638812
emp_mean <- sum(sapply(0:100, function(x) sqrt(iota(x)) * dpois(x, 5)))
sum(sapply(0:100, function(x) (sqrt(iota(x)) - emp_mean)**2 * dpois(x, 5)))
[1] 0.2860894
emp_mean <- sum(sapply(0:100, function(x) sqrt(iota(x)) * dpois(x, 10)))
sum(sapply(0:100, function(x) (sqrt(iota(x)) - emp_mean)**2 * dpois(x, 10)))
[1] 0.2613005
emp_mean <- sum(sapply(0:100, function(x) sqrt(iota(x)) * dpois(x, 25)))
sum(sapply(0:100, function(x) (sqrt(iota(x)) - emp_mean)**2 * dpois(x, 25)))
[1] 0.2539859
emp_mean <- sum(sapply(0:100, function(x) sqrt(iota(x)) * dpois(x, 50)))
sum(sapply(0:100, function(x) (sqrt(iota(x)) - emp_mean)**2 * dpois(x, 50)))
[1] 0.2519308

For intensities above 5 or so the variance of the pushforward distribution stabilizes to a constant value independent of the intensity itself, decoupling the exact correlation between mean and variance featured in the original distribution. Indeed the square root transformation is known as a variance stabilizing transformation for distributions that admit parameterizations with Poissonian probability mass functions.
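
The repeated calculation above can also be condensed into a helper function that sweeps over intensities; a small sketch reusing the truncated sums from before,

# Variance of the pushforward distribution along y = sqrt(x), approximated with
# an exhaustive sum over the natural numbers from 0 to 100.
pushforward_var <- function(l) {
  m <- sum(sapply(0:100, function(x) sqrt(iota(x)) * dpois(x, l)))
  sum(sapply(0:100, function(x) (sqrt(iota(x)) - m)**2 * dpois(x, l)))
}

sapply(c(1, 2.5, 5, 10, 25, 50), pushforward_var)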

Reparameterizations like these are often used to simplify the use of a given probability distribution by pushing it forward to a space where the corresponding probability mass function, or even means and variances, have certain desirable properties.

4.2 Probability Density Functions

When our space is given by the real numbers or a subset thereof, \(X \subseteq \mathbb{R}^{N}\), we can no longer assign finite probability to each point \(x \in X\) with a probability mass function. The immediate problem is that any non-atomic subset of \(X\) will contain an uncountably infinite number of points and the sum of a probability mass function over that subset will explode unless we assign zero probability mass to all but a countable number of points.

Instead we must utilize probability density functions over which we can integrate to give probabilities and expectations. Given a probability density function, \(\pi: X \rightarrow \mathbb{R}\), we can reconstruct probabilities as \[ \mathbb{P}_{\pi} [ A ] = \int_{A} \mathrm{d} x \, \pi(x), \] and expectation value as \[ \mathbb{E}_{\pi} [ f ] = \int_{X} \mathrm{d} x \, \pi(x) \, f(x). \]

Unlike probability mass functions, probability densities don’t transform quite as naturally under a measurable transformation. The complication is that the differential volumes over which we integrate will in general change under such a transformation, and probability density functions have to change in the opposite way to compensate and ensure that probabilities are conserved. The change in volumes is quantified by the determinant of the Jacobian matrix of partial derivatives, \(|J(x)|\), where \[ J_{ij} (x) = \frac{ \partial g_{i}}{\partial x_{j}} (x). \]





For example, the pushforward probability density function corresponding to the reparameterization \(g : \mathbb{R}^{N} \rightarrow \mathbb{R}^{N}\) picks up an inverse factor of the Jacobian determinant, \[ \begin{align*} \pi_{*} (y) &= \pi(g^{-1}(y)) |J(g^{-1}(y))|^{-1} \\ &= \pi(g^{-1}(y)) \left| \frac{ \partial g}{\partial x} (g^{-1}(y)) \right|^{-1}. \end{align*} \]

If the measurable transformation is many-to-one then we have to take into account all of the roots of \(y = g(x)\) for a given y, \(\left\{x_{1}(y), \ldots, x_{N}(y) \right\}\), \[ \begin{align*} \pi_{*} (y) &= \sum_{n = 1}^{N} \pi(x_{n}(y)) |J(x_{n}(y))|^{-1} \\ &= \sum_{n = 1}^{N} \pi(x_{n}(y)) \left| \frac{ \partial g}{\partial x} (x_{n}(y)) \right|^{-1}. \end{align*} \]

Deriving the pushforward probability density function for transformations that change the dimensionality of the space, such as marginalizations, is more challenging and requires analytically integrating an appropriately reparameterized probability density function over the complementary spaces.
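
Before turning to a concrete family, here is a quick numerical check of the one-to-one formula that jumps slightly ahead to the Gaussian probability density function dnorm introduced below: pushing a Gaussian density forward along \(y = \exp(x)\) should recover the lognormal probability density function, which R implements as dlnorm.

ys <- seq(0.1, 10, 0.1)

# Change of variables for y = exp(x): x = log(y) and |dg/dx|^{-1} = 1 / y.
pushforward_pdf <- dnorm(log(ys), 1, 1.25) / ys

# Agreement with the lognormal density to floating point precision.
max(abs(pushforward_pdf - dlnorm(ys, 1, 1.25)))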

4.2.1 The Gaussian Family of Probability Density Functions

To demonstrate a probability density function consider the ubiquitous Gaussian probability density functions which allocate probabilities across the real line, \(X = \mathbb{R}\), \[ \pi(x \mid \mu, \sigma) = \frac{1}{\sqrt{2 \pi \sigma^{2}}} \exp \left( - \frac{1}{2} \left(\frac{x - \mu}{\sigma} \right)^{2} \right). \] Each Gaussian probability density function is specified by a location parameter, \(\mu\), and a scale parameter, \(\sigma\).

In R the Gaussian probability density function is implemented by the function dnorm,

plot_norm <- function(mu, sigma) {
  x <- seq(-8, 8, 0.001)

  plot(x, dnorm(x, mu, sigma), type="l", col=c_dark_highlight, lwd=2,
       xlab="x", ylab="Probability Density",
       cex.lab=1.5, cex.axis=1.5, cex.main=1.5, cex.sub=1.5, yaxt='n')
}

mu <- 1
sigma <- 1.25

plot_norm(mu, sigma)

The corresponding cumulative distribution function cannot be written in closed form in terms of elementary functions, but it can be readily computed numerically; in R it is given by pnorm

x <- seq(-8, 8, 0.001)

plot(x, pnorm(x, mu, sigma), type="l", col=c_dark_highlight, lwd=2,
     xlab="x", ylab="Cumulative Probability",
     cex.lab=1.5, cex.axis=1.5, cex.main=1.5, cex.sub=1.5, yaxt='n')

4.2.2 Computing Probabilities

How do we compute the probability allocated to the interval \(B_{1}\) introduced in Section 1?

plot_norm_probs <- function(mu, sigma, B_min, B_max) {
  x_int <- seq(B_min, B_max, 0.001)
  x <- c(x_int, B_max, B_min)
  y <- c(dnorm(x_int, mu, sigma), 0, 0)

  polygon(x, y, col=c_dark, border=NA)
}

plot_norm(mu, sigma)
plot_norm_probs(mu, sigma, B1_min, B1_max)

We can no longer try to exhaustively sum over the elements in \(B_{1}\) because there are uncountably many of them, but we can use the difference of cumulative distribution functions,

norm_prob <- function(B_min, B_max, mu, sigma) {
    return(pnorm(B_max, mu, sigma) - pnorm(B_min, mu, sigma))
}

norm_prob(B1_min, B1_max, mu, sigma)
[1] 0.4918025

And the probability allocated to the complement of \(B_{1}\)?

plot_norm(mu, sigma)
plot_norm_probs(mu, sigma, -8, B1_min)
plot_norm_probs(mu, sigma, B1_max, 8)

We can compute it using the cumulative distribution function,

(1 - pnorm(B1_max, mu, sigma)) + pnorm(B1_min, mu, sigma)
[1] 0.5081975

or the sum rule for complements that we derived in Section 2.1,

1 - norm_prob(B1_min, B1_max, mu, sigma)
[1] 0.5081975

What about the probability allocated to the union of \(B_{1}\) and \(B_{2}\),

plot_norm(mu, sigma)
plot_norm_probs(mu, sigma, B_union_min, B_union_max)

Once again we can appeal to cumulative distribution functions,

norm_prob(B_union_min, B_union_max, mu, sigma)
[1] 0.9370032

or by using the general sum rule introduced in Section 2.1,

norm_prob(B1_min, B1_max, mu, sigma) + 
norm_prob(B2_min, B2_max, mu, sigma) - 
norm_prob(B_inter_min, B_inter_max, mu, sigma)
[1] 0.9370032

Intuitively, the last term here corrects for the double counting of the probability allocated to the overlap of the two sets,

plot_norm(mu, sigma)

x_int <- seq(B1_min, B1_max, 0.001)
x <- c(x_int, B1_max, B1_min)
y <- c(dnorm(x_int, mu, sigma), 0, 0)

polygon(x, y, col=c_dark_trans, border=NA)

x_int <- seq(B2_min, B2_max, 0.001)
x <- c(x_int, B2_max, B2_min)
y <- c(dnorm(x_int, mu, sigma), 0, 0)

polygon(x, y, col=c_dark_trans, border=NA)

Finally we can consider the intersection of the two sets,

plot_norm(mu, sigma)
plot_norm_probs(mu, sigma, B_inter_min, B_inter_max)

which we can compute either with cumulative distribution functions,

norm_prob(B_inter_min, B_inter_max, mu, sigma)
[1] 0.2881446

or the general sum rule,

norm_prob(B1_min, B1_max, mu, sigma) + 
norm_prob(B2_min, B2_max, mu, sigma) - 
norm_prob(B_union_min, B_union_max, mu, sigma)
[1] 0.2881446

4.2.3 Computing Expectations

The real line has a unique embedding into the real line – the identity function – so means and variances are well-defined for the Gaussian family of probability density functions. In line with their names, the mean of any member is given by the location parameter, \[ \mathbb{m}_{\pi} = \mu, \] while the variance is given by the square of the scale parameter, \[ \mathbb{V}_{\pi} = \sigma^{2}. \]
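
Both values are straightforward to verify numerically with one-dimensional quadrature, using the location and scale defined above.

# Numerically verify the mean and variance for mu = 1, sigma = 1.25
integrate(function(x) x * dnorm(x, mu, sigma), -Inf, Inf)$value            # mu = 1
integrate(function(x) (x - mu)**2 * dnorm(x, mu, sigma), -Inf, Inf)$value  # sigma^2 = 1.5625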

4.2.4 Computing Pushforwards

How does a Gaussian probability density function change under the reparameterization \[ y = \mathrm{logistic}(x) = (1 + e^{-x})^{-1}? \]

g <- function(x) {
  return(1 / (1 + exp(-x)))
}

The first step in constructing the pushforward probability density function is computing the inverse of the transformation,

g_inv <- function(y) {
  return(log(y / (1 - y)))
}

We then need to quantify how volume changes under the transformation,

delta_x <- seq(-8, 8, 0.25)

plot(1, type="n", xlim=c(-4, 4), xlab="dx", ylim=c(0, 1), ylab="",
     cex.lab=1.5, cex.axis=1.5, cex.main=1.5, cex.sub=1.5, yaxt='n')
for (x in delta_x) abline(v=x, col=c_dark, lwd=2)

plot(1, type="n", xlim=c(0, 1), xlab="dy", ylim=c(0, 1), ylab="",
     cex.lab=1.5, cex.axis=1.5, cex.main=1.5, cex.sub=1.5, yaxt='n')
for (x in delta_x) abline(v=g(x), col=c_dark, lwd=2)

using the reciprocal of the Jacobian determinant, which in this case is given by

abs_J <- function(y) {
  return(1 / (y * (1 - y)))
}

Because the logistic function is one-to-one there is only a single solution, \(x = g^{-1}(y)\), to consider here.

Putting these two pieces together along with the initial probability density function gives the pushforward probability density function

ys <- seq(0, 1, 0.001)
pdfs <- sapply(ys, function(y) dnorm(g_inv(y), mu, sigma) * abs_J(y))

plot(ys, pdfs, type="l", col=c_dark_highlight, lwd=2,
     xlab="y = logistic(x)", ylab="Probability Density",
     cex.lab=1.5, cex.axis=1.5, cex.main=1.5, cex.sub=1.5, yaxt='n')

Although the Jacobian determinant correction skews the characteristic bell curve of the initial Gaussian density function, the pushforward probability density function represents the same system just in a different space. There are no Gaussian probability distributions, just probability distributions that admit Gaussian probability density functions in certain parameterizations!

Because Gaussian probability density functions are particularly convenient to work with there is a great deal of literature on engineering reparameterizations that can push a given probability density function into something more Gaussian. In particular the Box-Cox family of transformations, \[ g(x) = \frac{ x^{\phi} - 1 }{ \phi }, \] are commonly used to massage a given space and the probability density function it admits into something that is better-approximated with a Gaussian probability density function.
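
A minimal sketch of the Box-Cox transformation in R might look like the following; the \(\phi = 0\) case, which is the logarithmic limit of the transformation, is included by hand.

box_cox <- function(x, phi) {
  # The phi = 0 case is the limit of (x^phi - 1) / phi as phi approaches zero
  if (phi == 0) log(x) else (x**phi - 1) / phi
}

# phi = 0.5 recovers, up to translation and scaling, a square root transformation
box_cox(c(1, 5, 10), 0.5)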

4.3 Extra Credit: Radon-Nikodym Derivatives

Probability mass functions and probability density functions actually arise from the same construction, with their differences due to the structure of the underlying space.

First consider a measure, \(\mu\), which behaves exactly like a probability distribution except that the total measure need not be equal to one, or even finite. Intuitively we can think of measures as allocations of volume to well-defined neighborhoods in a space.

Many spaces are endowed with natural measures with useful properties. Any discrete space, for example, is endowed with a counting measure, \(\chi\), that assigns unit measure to every atomic element in the space. Conveniently the counting measure of any set is given by a straightforward summation of all elements in that set, \[ \mathbb{M}_{\chi} [ A ] = \sum_{x \in A} \chi [x] = \sum_{x \in A} 1, \] and expectations are given by sums over the corresponding function values, \[ \mathbb{E}_{\chi} [ f ] = \sum_{x \in X} f(x). \]

Similarly, a given set of real numbers is endowed with a Lebesgue measure which defines the volume of any finite rectangle to be the product of the lengths of each side. We have to be careful here as the Lebesgue measure is natural only for the given parameterization of the real numbers – nonlinear transformations between the real numbers change the elementary rectangles and hence the Lebesgue measure. Given a particular parameterization, however, the measure of any neighborhood in the real numbers and the expectation of any function are given by the standard integration introduced in calculus, \[ \mathbb{M}_{\mathcal{L}} [ A ] = \int_{A} \mathrm{d} x, \qquad \mathbb{E}_{\mathcal{L}} [ f ] = \int_{X} \mathrm{d} x \, f(x). \]

A measure \(\nu\) is absolutely continuous with respect to another measure \(\mu\) when \(\nu\) allocates zero volume only to those sets to which \(\mu\) also allocates zero volume, \[ \mathbb{M}_{\nu} [ A ] = 0 \text{ only if } \mathbb{M}_{\mu} [ A ] = 0. \] When \(\nu\) is absolutely continuous with respect to \(\mu\) we can relate the two measures together using a single function. In particular there exists some function, \(\lambda : X \rightarrow \mathbb{R}\), such that \[ \mathbb{E}_{\nu} [ f ] = \mathbb{E}_{\mu} [ \lambda \cdot f ] \] for any function \(f : X \rightarrow \mathbb{R}\). This function that “corrects” \(\mu\) into \(\nu\) is called the Radon-Nikodym derivative or density of \(\nu\) with respect to \(\mu\) and is written \[ \lambda(x) = \frac{ \mathrm{d} \nu }{ \mathrm{d} \mu }(x). \]

Radon-Nikodym derivatives allow us to break any probability distribution down into a density function and a base measure or dominating measure, which in turn allows us to exploit any computational properties of the base measure.
Probabilities and expectation values of any probability distribution on a discrete space, for example, can be computed with the summations, \[ \begin{align*} \mathbb{P}_{\pi} [ A ] &= \sum_{x \in A} \pi(x) \\ \mathbb{E}_{\pi} [ f ] &= \sum_{x \in X} \pi(x) \, f(x) \end{align*} \] where \(\pi(x)\) is the density function, \[ \pi(x) = \frac{ \mathrm{d} \pi }{ \mathrm{d} \chi }(x). \] Admittedly using \(\pi\) to refer to the abstract probability distribution and \(\pi(x)\) to the density function can be confusing, but it has become standard notation in statistics and so I’m going to stick with the convention that you’ll see everywhere else.

Correspondingly, computations for any probability distribution defined on the real numbers are given by integrals, \[ \begin{align*} \mathbb{P}_{\pi} [ A ] &= \int_{A} \mathrm{d} x \, \pi(x) \\ \mathbb{E}_{\pi} [ f ] &= \int_{X} \mathrm{d} x \, \pi(x) \, f(x), \end{align*} \] where \(\pi(x)\) is the density function, \[ \pi(x) = \frac{ \mathrm{d} \pi }{ \mathrm{d} \mathcal{L} }(x). \] Indeed integration in calculus is most formally derived as expectation with respect to the underlying Lebesgue measure over the real numbers!

Comparing to the results in the past two sections we see that probability mass functions and probability density functions are just Radon-Nikodym derivatives of a given probability distribution with respect to a natural measure on the latent space. By exploiting these natural measures we can reduce the abstract probability distribution to a density function amenable to explicit specification and computation.

This more formal perspective also helps to explain the transformation properties of density functions. Counting measures are invariant under the permutation induced by a reparameterization, and so the pushforward of a probability mass function is just the composition of the original probability mass function with the reparameterization. Lebesgue measures, however, are not necessarily invariant under reparameterization. In general the pushforward of the Lebesgue measure requires a Jacobian determinant to account for the compression or expansion of volume in the transformation. This change in the base measure then induces an inverse Jacobian determinant in the Radon-Nikodym derivatives, and hence probability density functions, as we saw in Section 4.2.

Finally, if two probability distributions are represented by the density functions \(\pi(x)\) and \(\rho(x)\), then the Radon-Nikodym derivative between them is given by the ratio of the density functions, \[ \frac{ \mathrm{d} \rho }{ \mathrm{d} \pi }(x) = \frac{ \rho(x) }{ \pi(x) }. \] This implies that, while density functions themselves might not be all that meaningful, ratios of densities are. This intuition can be extremely helpful when trying to reason about well-posed computational algorithms. For example, the importance weights in importance sampling are given by the ratio of the target probability density function to the importance probability density function, which is actually the corresponding Radon-Nikodym derivative and hence has well-defined values regardless of the parameterization.
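
For instance, the following minimal sketch estimates the mean of a target distribution specified by one Gaussian probability density function using exact samples from an importance distribution specified by a wider Gaussian probability density function; the two densities here are only illustrative assumptions. The importance weights are the ratio of the two densities, in other words the Radon-Nikodym derivative of the target with respect to the importance distribution, evaluated at each sampled point.

set.seed(58583389)

# Exact samples from the importance distribution
x <- rnorm(10000, 0, 3)

# Importance weights given by the Radon-Nikodym derivative d rho / d pi
w <- dnorm(x, 1, 1) / dnorm(x, 0, 3)

# Self-normalized importance sampling estimate of the target mean, here 1
sum(w * x) / sum(w)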

5 Representing Probability Distributions with Samples

One of the immediate difficulties with probability mass function and probability density function representations is that if a given expectation doesn’t have a closed form, as we conveniently had for the examples given above, then the computation will likely be infeasible. Exhaustive surveys of the space \(X\) needed to numerically compute a summation or integral are simply too expensive given the finite computational resources available in most problems. This is especially true when we consider complex probability distributions over high-dimensional spaces.

An intriguing alternative to these deterministic representations of a probability distribution utilizes a sequence of points in \(X\) capable of estimating expectations with more reasonable computational demands.

5.1 Sampling Processes

Consider an arbitrary sequence of points from \(X\), \(S_{N} = \{x_{1}, \ldots, x_{N}\}\), and define the empirical expectation of a function \(f : X \rightarrow \mathbb{R}\) as \[ \hat{f}(S_{N}) = \frac{1}{N} \sum_{n = 1}^{N} f(x_{n}). \] A sampling procedure is any process that generates sequences that satisfy the asymptotic property \[ \lim_{N \rightarrow \infty} \hat{f}(S_{N}) = \mathbb{E}_{\pi} [ f ], \] for all functions \(f\) with finite expectation values. The finite sequences generated by a sampling process are denoted samples. By this definition any subset of such a sequence is also a sample.
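
In R the empirical expectation is nothing more than an arithmetic average of function evaluations; a minimal sketch is

# Empirical expectation of a function f over a sample S
empirical_expectation <- function(f, S) {
  mean(sapply(S, f))
}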

As the size of a sample increases the empirical expectations converge towards the true expectation values; the same applies to probabilities through expectations of indicator functions. In this way any sampling procedure which is conceptually capable of generating arbitrarily long sequences provides a faithful representation of the corresponding probability distribution. In practice, however, we need to be able to quantify the accuracy of these estimators to understand how accurately we are representing the target probability distribution.

There is absolutely no restriction on the inner workings of a sampling procedure. Provided that it can generate sequences of arbitrary length that satisfy the asymptotic property, any mechanism for generating sequences will be a valid sampling procedure. The procedure can be generated algorithmically and be completely deterministic given a known input configuration, generated from unknown external inputs with unpredictable outputs, or anywhere in between. It also need not have any nice cryptographic properties or otherwise be hard to predict from one state to the next.

Measurable transformations are especially easy to implement with samples. If \(S_{N} = \{ x_{1}, \ldots, x_{N} \}\) is a sample from a probability distribution on \(X\) then \[ S_{N*} = \{ f(x_{1}), \ldots, f(x_{N}) \} = \{ y_{1}, \ldots, y_{N} \} \] is a sample from the pushforward probability distribution along the measurable function \(f : X \rightarrow Y\).

Samples also help to provide some intuition for why measurable transformations are often introduced as “random variables”. A variable is defined as the output of a function evaluated at a varying, but explicitly known, input, \[ y = f(x). \] If the input is distributed according to a probability distribution, however, \[ x \sim \pi \] \[ y = f(x), \] then we don’t have an explicit value at which to evaluate the function, and hence instantiate the variable. Instead we have a probability distribution over the possible variable values given by the pushforward probability distribution, \[ y \sim \pi_{*}. \] When we represent a probability distribution with a sample, however, we do have explicit values at which we can evaluate the function. Consequently we can very loosely interpret “evaluating a random variable” as “generating samples from the corresponding pushforward distribution”.

5.2 Generating Exact Samples

A sampling representation of a probability distribution is appealing, but how exactly do we generate samples whose empirical expectations converge to the true expectations of a given probability distribution? For sufficiently simple distributions we can generate exact samples.

If each point in a sample is generated independently of the others then we say that the resulting sample is exact. By this definition each subset of an exact sample, including each individual point, is also an exact sample.

Estimating expectation values using the empirical expectation of an exact sample was originally introduced as the Monte Carlo method, with the empirical expectations themselves known as Monte Carlo estimators. For a detailed history of the Monte Carlo method see Betancourt (2017). Each state in an exact sample is also known as a simulation.

Monte Carlo estimators satisfy a central limit theorem which guarantees that the empirical expectations are distributed around the true expectations, \[ \hat{f}(S_{N}) \sim \mathcal{N} \left( \mathbb{E}_{\pi} [ f ], \sqrt{\frac{ \mathbb{V}[f] }{ N } } \right). \] The square root of the variance in this central limit theorem is also known as the Monte Carlo standard error, \[ \text{MC-SE} = \sqrt{\frac{ \mathbb{V}[f] }{ N } }. \] In practice we can estimate the variance, \(\mathbb{V}[f]\), and hence the Monte Carlo standard error, using the sample itself. The error in using a variance estimate in place of the true variance scales as \(1 / N\) which is negligible compared to the dominant \(1 / \sqrt{N}\) scaling of the nominal error.
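
As a preview of the machinery developed in the examples below, here is a minimal sketch that computes a Monte Carlo estimator and its estimated standard error from a vector of function evaluations at the sampled points; Section 5.2.1 implements the same idea with a more careful single-pass accumulator.

# Monte Carlo estimator and estimated Monte Carlo standard error
# from a vector of function evaluations, fs
mc_summary <- function(fs) {
  c(mean(fs), sd(fs) / sqrt(length(fs)))
}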

Exact samples can be generated for many one-dimensional distributions using a variety of highly optimized pseudo random number generators; a comprehensive discussion of these techniques can be found in Devroye (1986). These pseudo random number generators are initialized with an input seed and then can be queried repeatedly to generate the individual elements of a sample. Technically each state of a pseudo random number generator is very weakly correlated with the previous state, but the correlations are typically weak enough to be ignored for the purposes of estimating expectations. A variety of pseudo random number generators are available in R, Python, and the Stan modeling language.

5.2.1 Playing with Poisson Samples

Let’s consider generating samples from a distribution that admits a Poisson probability mass function.

We can generate exact samples using the pseudo random number generator in R,

set.seed(8675309)

r_samples <- rpois(1000, l)

We can also generate exact samples using the pseudo random number generator in Stan (Stan Development Team 2018),

writeLines(readLines("generate_poisson.stan"))
// Values provided externally
data {
  real l;
}

generated quantities {
  // Generate an exact sample from a distribution
  // specified by a Poisson probability mass function
  int x = poisson_rng(l);
}
simu_data <- list("l" = l)

library(rstan)
Loading required package: ggplot2
Loading required package: StanHeaders
rstan (Version 2.18.1, GitRev: 2e1f913d3ca3)
For execution on a local, multicore CPU with excess RAM we recommend calling
options(mc.cores = parallel::detectCores()).
To avoid recompilation of unchanged Stan programs, we recommend calling
rstan_options(auto_write = TRUE)
rstan_options(auto_write = TRUE)

fit <- stan(file='generate_poisson.stan', data=simu_data, 
            seed=194838, algorithm="Fixed_param",
            iter=1000, warmup=0, chains=1)

SAMPLING FOR MODEL 'generate_poisson' NOW (CHAIN 1).
Iteration:   1 / 1000 [  0%]  (Sampling)
Iteration: 100 / 1000 [ 10%]  (Sampling)
Iteration: 200 / 1000 [ 20%]  (Sampling)
Iteration: 300 / 1000 [ 30%]  (Sampling)
Iteration: 400 / 1000 [ 40%]  (Sampling)
Iteration: 500 / 1000 [ 50%]  (Sampling)
Iteration: 600 / 1000 [ 60%]  (Sampling)
Iteration: 700 / 1000 [ 70%]  (Sampling)
Iteration: 800 / 1000 [ 80%]  (Sampling)
Iteration: 900 / 1000 [ 90%]  (Sampling)
Iteration: 1000 / 1000 [100%]  (Sampling)

 Elapsed Time: 0 seconds (Warm-up)
               0.001143 seconds (Sampling)
               0.001143 seconds (Total)
stan_samples <- extract(fit)$x[]

In order to compare these two samples to each other and to the exact Poisson probability mass function we can construct an estimator for the histogram over a given set of bins, \[ \mathbb{P}_{\pi} [ B_{b} ] \approx \frac{1}{N} \sum_{n = 1}^{N} \mathbb{I}_{B_{b}} (x_{n}), \] which is just the relative number of samples that fall into each bin.
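
Equivalently, the probability of a single bin can be estimated directly as the proportion of sampled points that fall into that bin; a minimal sketch, assuming inclusive bin edges, is

# Estimate the probability of a bin as the proportion of samples it contains
bin_prob_est <- function(samples, bin_min, bin_max) {
  mean(bin_min <= samples & samples <= bin_max)
}

bin_prob_est(stan_samples, -0.5, 0.5)  # estimated probability of the atomic set {0}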

plot_poisson(l)
hist(r_samples, breaks=0:21-0.5, 
     col=c_dark_trans, border=c_mid_highlight_trans, probability=T, add=T)
hist(stan_samples, breaks=0:21-0.5, 
     col=c_mid_trans, border=c_light_highlight_trans, probability=T, add=T)

Indeed there is no indication of any discrepancy between the two samples, or between either sample and the exact probability mass function.

Similarly we can consider the empirical cumulative distribution functions of the two samples, each of which estimates the true cumulative distribution function.

B <- 21
xs <- rep(0:B, each=2)

cdfs <- sapply(1:length(xs), function(n) ifelse(n > 1, ppois(xs[n - 1], l), 0))

par(mar = c(8, 6, 0, 0.5))
plot(xs, cdfs, type="l", main="", col=c_dark_highlight, 
     xlab="x", xlim=c(-0.5, 20.5), 
     ylab="Cumulative Probability", ylim=c(0, 1), yaxt='n',
     cex.lab=1.5, cex.axis=1.5, cex.main=1.5, cex.sub=1.5)

ecdfs <- sapply(1:length(xs), function(n) 
                ifelse(n > 1, length(r_samples[r_samples <= xs[n - 1]]), 0)) / length(r_samples)
lines(xs, ecdfs, col=c_mid_highlight, lwd=2)

ecdfs <- sapply(1:length(xs), function(n) 
                ifelse(n > 1, length(stan_samples[stan_samples <= xs[n - 1]]), 0)) / length(stan_samples)
lines(xs, ecdfs, col=c_light_highlight, lwd=2)

To facilitate the computation of Monte Carlo estimators let’s define a Welford accumulator that accurately computes the empirical means and variances of a sample in a single pass.

welford_summary <- function(x) {
  summary <- c(0, 0)
  for (n in 1:length(x)) {
    # Update the running mean and the accumulated squared deviations in one pass
    delta <- x[n] - summary[1]
    summary[1] <- summary[1] + delta / n
    summary[2] <- summary[2] + delta * (x[n] - summary[1])
  }
  # Convert the accumulated squared deviations into an empirical variance
  summary[2] <- summary[2] / (length(x) - 1)
  return(summary)
}

We can then use the Welford accumulator to compute the Monte Carlo estimator of a function and an estimate of its Monte Carlo Standard Error.

compute_mc_stats <- function(x) {
  summary <- welford_summary(x)
  return(c(summary[1], sqrt(summary[2] / length(x))))
}

In order to estimate the probability of a set we estimate the expectation value of the corresponding indicator function.

indicator <- function(x, A) {
  return(ifelse(A[1] <= x & x <= A[length(A)], 1, 0))
}

For example, to compute the probability of the set \(A_{1}\) we evaluate the corresponding indicator function at each of our samples and then compute the Monte Carlo estimator and standard error,

pushforward_samples = sapply(stan_samples, function(x) indicator(x, A1))
compute_mc_stats(pushforward_samples)
[1] 0.59640360 0.01553024

This is consistent with the exact computation,

poisson_prob(A1, l)
[1] 0.6016024

And we can readily visualize how the Monte Carlo estimator converges to the exact value as the size of the sample increases. The bands here in red cover the Monte Carlo estimator plus/minus 1, 2, and 3 standard errors to demonstrate the variation expected from the Monte Carlo Central Limit Theorem.

iter <- 2:1000
mc_stats <- sapply(iter, function(n) compute_mc_stats(pushforward_samples[0:n]))
     
plot_mc_evo <- function(iter, mc_stats, truth) {
  plot(1, type="n", main="", 
       xlab="Iteration", xlim=c(0, max(iter)),
       ylab="Monte Carlo Estimator",
       ylim=c(min(mc_stats[1,] - 3 * mc_stats[2,]), max(mc_stats[1,] + 3 * mc_stats[2,])))
  
  polygon(c(iter, rev(iter)), 
          c(mc_stats[1,] - 3 * mc_stats[2,], 
            rev(mc_stats[1,] + 3 * mc_stats[2,])),
          col = c_light_highlight, border = NA)
  polygon(c(iter, rev(iter)),
          c(mc_stats[1,] - 2 * mc_stats[2,], 
            rev(mc_stats[1,] + 2 * mc_stats[2,])),
          col = c_mid, border = NA)
  polygon(c(iter, rev(iter)), 
          c(mc_stats[1,] - 1 * mc_stats[2,], 
            rev(mc_stats[1,] + 1 * mc_stats[2,])),
          col = c_mid_highlight, border = NA)
  lines(iter, mc_stats[1,], col=c_dark, lwd=2)
  abline(h=truth, col="grey", lty="dashed", lwd=2)
}

plot_mc_evo(iter, mc_stats, poisson_prob(A1, l))

Now we can apply this machinery to any desired probabilistic computation. The probability of the complement of \(A_{1}\)?

pushforward_samples = sapply(stan_samples, function(x) 1 - indicator(x, A1))
compute_mc_stats(pushforward_samples)
[1] 0.40259740 0.01552399
1 - poisson_prob(A1, l)
[1] 0.3983976

The probability of the union of \(A_{1}\) and \(A_{2}\)?

pushforward_samples = sapply(stan_samples, function(x) indicator(x, A_union))
compute_mc_stats(pushforward_samples)
[1] 0.70029970 0.01450173
poisson_prob(A_union, l)
[1] 0.703146

The probability of the intersection of \(A_{1}\) and \(A_{2}\)?

pushforward_samples = sapply(stan_samples, function(x) indicator(x, A_inter))
compute_mc_stats(pushforward_samples)
[1] 0.09790210 0.00940713
poisson_prob(A_inter, l)
[1] 0.1044449

The mean of the distribution specified by this Poisson probability mass function?

pushforward_samples = sapply(stan_samples, function(x) iota(x))
iter <- 2:1000
mc_stats <- sapply(iter, function(n) compute_mc_stats(pushforward_samples[0:n]))
plot_mc_evo(iter, mc_stats, l)

And the variance?

pushforward_samples = sapply(stan_samples, function(x) (iota(x) - 5)**2)
compute_mc_stats(pushforward_samples)
[1] 4.7692308 0.2327141

Both are consistent with the exact value of 5.

Finally we can use our samples to quantify the pushforward distribution under the reparameterization \(g : x \mapsto \sqrt{x}\). The mean, for example, is estimated as

pushforward_samples = sapply(stan_samples, function(x) sqrt(x))
compute_mc_stats(pushforward_samples)
[1] 2.15803502 0.01668851

and is consistent with the exact (up to floating point precision) value,

sum(sapply(0:100, function(x) sqrt(iota(x)) * dpois(x, l)))
[1] 2.171154

We can also verify the pushforward probability mass function by histogramming the pushforward samples.

bin_edges <- c(-0.5, sapply(1:21, function(x) sqrt(x - 0.5)))

ps <- hist(pushforward_samples, breaks=bin_edges, plot=F)
ps$density <- ps$counts / 1000
plot(ps, col=c_mid_trans, border=c_mid_highlight_trans, 
     main="", xlab="y = sqrt(x)", yaxt='n', ylab="Probability Mass",
     cex.lab=1.5, cex.axis=1.5, cex.main=1.5, cex.sub=1.5)
           
p <- hist(0, breaks=bin_edges, plot=F)
p$density <- dpois(0:20, l)
plot(p, col=c_dark_trans, border=c_dark_highlight_trans, add=T)

The ease with which we can construct this histogram makes it an invaluable tool for investigating pushforward distributions and verifying analytic calculations.

5.2.2 Playing with Gaussian Samples

The sampling methodology is identical when using distributions on the real line. We can generate exact samples using the pseudo random number generator in R,

r_samples <- rnorm(1000, mu, sigma)

or that in Stan,

writeLines(readLines("generate_normal.stan"))
// Values provided externally
data {
  real mu;
  real<lower=0> sigma;
}

generated quantities {
  // Generate an exact sample from a distribution
  // specified by a normal probability density function
  real x = normal_rng(mu, sigma);
}
simu_data <- list("mu" = mu, "sigma" = sigma)

fit <- stan(file='generate_normal.stan', data=simu_data, 
            seed=194838, algorithm="Fixed_param",
            iter=1000, warmup=0, chains=1)

SAMPLING FOR MODEL 'generate_normal' NOW (CHAIN 1).
Iteration:   1 / 1000 [  0%]  (Sampling)
Iteration: 100 / 1000 [ 10%]  (Sampling)
Iteration: 200 / 1000 [ 20%]  (Sampling)
Iteration: 300 / 1000 [ 30%]  (Sampling)
Iteration: 400 / 1000 [ 40%]  (Sampling)
Iteration: 500 / 1000 [ 50%]  (Sampling)
Iteration: 600 / 1000 [ 60%]  (Sampling)
Iteration: 700 / 1000 [ 70%]  (Sampling)
Iteration: 800 / 1000 [ 80%]  (Sampling)
Iteration: 900 / 1000 [ 90%]  (Sampling)
Iteration: 1000 / 1000 [100%]  (Sampling)

 Elapsed Time: 0 seconds (Warm-up)
               0.001199 seconds (Sampling)
               0.001199 seconds (Total)
stan_samples <- extract(fit)$x[]

We can compare these two samples to each other, and the exact Gaussian probability density function, by constructing their empirical histograms,

plot_norm(mu, sigma)
hist(r_samples, breaks=seq(-8, 8, 0.5), 
     col=c_dark_trans, border=c_mid_highlight_trans, probability=T, add=T)
hist(stan_samples, breaks=seq(-8, 8, 0.5), 
     col=c_mid_trans, border=c_light_highlight_trans, probability=T, add=T)

or their empirical cumulative distribution functions,

xs <- seq(-8, 8, 0.1)
cdfs <- sapply(xs, function(x) pnorm(x, mu, sigma))

plot(xs, cdfs, type="l", lwd=2, main="", col=c_dark_highlight, xlab="x", xlim=c(-8, 8), 
     ylab="Cumulative Probability", ylim=c(0, 1), yaxt='n',
     cex.lab=1.5, cex.axis=1.5, cex.main=1.5, cex.sub=1.5)

os <- sort(r_samples)
ecdfs <- sapply(os, function(x) length(r_samples[r_samples <= x])) / length(r_samples)
xs <- c(-8, rep(os, each=2), 8)
ys <- c(0, 0, rep(ecdfs, each=2))
lines(xs, ys, col=c_mid_highlight, lwd=2)

os <- sort(stan_samples)
ecdfs <- sapply(os, function(x) length(stan_samples[stan_samples <= x])) / length(stan_samples)
xs <- c(-8, rep(os, each=2), 8)
ys <- c(0, 0, rep(ecdfs, each=2))
lines(xs, ys, col=c_light_highlight, lwd=2)

Probabilities can be estimated with expectation values of the corresponding indicator functions,

indicator <- function(x, B_min, B_max) {
  return(ifelse(B_min <= x & x <= B_max, 1, 0))
}

For example, the probability of the set \(B_{1}\) is estimated as

pushforward_samples = sapply(stan_samples, function(x) indicator(x, B1_min, B1_max))
compute_mc_stats(pushforward_samples)
[1] 0.48151848 0.01581639

which compares nicely to the exact value

norm_prob(B1_min, B1_max, mu, sigma)
[1] 0.4918025

We can also study the convergence of this estimator,

iter <- 2:1000
mc_stats <- sapply(iter, function(n) compute_mc_stats(pushforward_samples[0:n]))
plot_mc_evo(iter, mc_stats, norm_prob(B1_min, B1_max, mu, sigma))

The probability of the complement of \(B_{1}\)?

pushforward_samples = sapply(stan_samples, function(x) 1 - indicator(x, B1_min, B1_max))
compute_mc_stats(pushforward_samples)
[1] 0.51748252 0.01581753

which is consistent with the exact value,

1 - norm_prob(B1_min, B1_max, mu, sigma)
[1] 0.5081975

The union of \(B_{1}\) and \(B_{2}\)?

pushforward_samples = 
  sapply(stan_samples, function(x) indicator(x, B_union_min, B_union_max))
compute_mc_stats(pushforward_samples)
[1] 0.927072927 0.008230678

which is consistent with the exact value,

norm_prob(B_union_min, B_union_max, mu, sigma)
[1] 0.9370032

The intersection of \(B_{1}\) and \(B_{2}\)?

pushforward_samples = 
  sapply(stan_samples, function(x) indicator(x, B_inter_min, B_inter_max))
compute_mc_stats(pushforward_samples)
[1] 0.27972028 0.01420846

which at this point is, unsurprisingly, consistent with the exact value!

norm_prob(B_inter_min, B_inter_max, mu, sigma)
[1] 0.2881446

How about a Monte Carlo estimator of the mean? We need to estimate the expectation value of the identity function,

identity <- function(x) {
  return(x)
}

which yields the desired result,

pushforward_samples = sapply(stan_samples, function(x) identity(x))
compute_mc_stats(pushforward_samples)
[1] 1.0110324 0.0407888

Similarly for the variance,

pushforward_samples = sapply(stan_samples, function(x) (identity(x) - mu)**2)
compute_mc_stats(pushforward_samples)
[1] 1.65952472 0.07279322

Now let’s consider the reparameterization \(y = \mathrm{logistic}(x)\). This transformation is straightforward to analyze by considering the pushforward samples,

pushforward_samples = sapply(stan_samples, function(x) g(x))
compute_mc_stats(pushforward_samples)
[1] 0.683492936 0.007127057

In particular we can readily study the pushforward probability density function by histogramming the pushforward samples,

ys <- seq(0, 1, 0.001)
pdfs <- sapply(ys, function(y) dnorm(g_inv(y), mu, sigma) * abs_J(y))

plot(ys, pdfs, type="l", col=c_dark_highlight, lwd=2,
     xlab="y = logistic(x)", ylab="Probability Density",
     cex.lab=1.5, cex.axis=1.5, cex.main=1.5, cex.sub=1.5, yaxt='n')
     
hist(pushforward_samples, breaks=seq(0, 1, 0.05), probability=T,
     col=c_mid_trans, border=c_mid_highlight_trans, add=T)

Pushing forward probability density functions is frustrating even in simple cases like an initial Gaussian, and even experts routinely make mistakes when attempting these calculations. Sampling is a critical tool for verifying calculations or identifying errors and it has saved your humble author more times than he wishes to admit.

6 Interpretations of Probability

Now that we have a basic understanding of the abstract definition of probability theory and some of the ways in which we can implement probabilistic calculations in practice we are finally ready to consider the applications of probability theory and the corresponding philosophical interpretations of this very generic, conserved quantity that we’ve been distributing. Congratulate yourself for getting this far!

6.1 Physical Distributions

An immediate application of probability theory is modeling the conserved quantities that arise in physical models of the world, such as physical mass. Physical mass can be distributed across a solid object in myriad ways, and the exact distribution affects how that object interacts with the world. At the same time mass flows dynamically through liquids and gases as they interact with the surrounding environment. Time-dependent probability distributions that quantify the distribution of mass, charge, and other conserved quantities are a critical element of the corresponding mathematical physics.

6.2 Population Distributions

Another application of probability theory arises when we select individuals from a large population. As we select individuals we observe varying behaviors, such as political preference, or properties, such as height or age. If we assume that any process that selects individuals from the population is ignorant of these characteristics then we can model the heterogeneity of the characteristics themselves using probability distributions.

6.3 Frequentist Statistics

Frequentist statistics uses probability theory to model the variability inherent to the observations that we make in order to learn about a particular system. The approach presumes that this variability is sufficiently simple that we can model it with a probability distribution – the more probability allocated to a neighborhood of the observation space, \(Y\), the more likely we are to observe points in that neighborhood. Probability becomes a quantification of the apparent randomness of observational outcomes. In practice this mathematical assumption cannot distinguish between ontological randomness that is inherent to the world and epistemological randomness that is due to our limited ability to resolve the world, but philosophically many frequentists presume the former.

From this perspective the outcomes of the observational process are quantified with a probability distribution over the observation space called the true data generating process. Observations are similarly idealized as exact samples from this probability distribution.

Although we don’t know the true data generating process, we can model it with a collection of mathematically convenient data generating processes, \(\pi(y ; \theta)\), parameterized by \(\theta \in \Theta\). Inferences about this model take the form of estimators that map observations into some selection of this model configuration space, such as points or intervals, that is considered to be consistent with the input observation. The performance of these estimators can be defined as expectations of carefully chosen loss functions with respect to the data generating processes in the model.

6.4 Bayesian Inference

Bayesian inference uses probability theory to model any self-consistent quantification of uncertainty. This general perspective subsumes the frequentist perspective but also allows us to use probability to quantify which model configurations are most consistent with an observation. Using the model we update a prior distribution over the model configuration space that quantifies our domain expertise into a posterior distribution over the model configuration space that incorporates both our domain expertise and the information encoded in the observed data.

6.5 Everyone Play Nicely

One of the reasons why probability theory is so confusing is that these applications are not mutually exclusive and in practice are often used at the same time.

For example, consider a binary space \(X = \{0, 1\}\) that corresponds to the two states of a coin with 0 denoted tails and 1 denoted heads. Any probability distribution over \(X\) can be quantified with the probability allocated to heads, \(p\), which gives the consistent probability allocations \[ \begin{align*} \mathbb{P}_{\pi} [ \emptyset ] &= 0 \\ \mathbb{P}_{\pi} [ \{0\} ] &= 1 - p \\ \mathbb{P}_{\pi} [ \{1\} ] &= p \\ \mathbb{P}_{\pi} [ X ] &= 1. \end{align*} \]

If we flip an actual coin then this distribution could be applied in a frequentist manner to model the variability of outcomes given the parameter \(p\), which denotes the unknown bias for heads. But then we can also introduce Bayesian inference to quantify which biases are consistent with an observed sequence of flips. So at once we have a probability as a parameter, a probability distribution modeling the frequentist variation in an observable process, and a probability distribution quantifying which parameters are consistent with previously observed data.

Moreover, if coins are being flipped by an ensemble of individual experimenters, then we might also introduce a probability distribution to model the variation in biases from experimenter to experimenter. It’s probability all the way down.

Probability theory has many applications and it’s only by considering them all that we take full advantage of its potential!

7 Conclusion

Probability theory is a powerful mathematical tool, but often that power is undercut by confusion about exactly what objects and computations probability theory defines, and how we can implement those computations in practice. By very carefully untangling the abstract theory from the practical implementations and the many applications, we can take a more principled path through the theory and ensure that we wield it responsibly.

8 Acknowledgements

I warmly thank Maggie Lieu, Junpeng Lao, Deen Abiola, Jeremy Mason, Victor Salgado, xcurry@github, Sean Talts, Elea McDonnell Feit, Christoph Bannwarth, Boris Shilov, ssp3nc3r@github, Max Kesin, and Ravin Kumar for helpful comments as well as the feedback from those who have attended my courses.

A very special thanks to everyone supporting me on Patreon: Aki Vehtari, Austin Rochford, Bo Schwartz Madsen, Cat Shark, Charles Naylor, Colin Carroll, Daniel Simpson, David Pascall, David Roher, Elad Berkman, Finn Lindgren, Granville Matheson, Hernan Bruno, J Michael Burgess, Joel Kronander, Jonas Beltoft Gehrlein, Joshua Mayer, Justin Bois, Lars Barquist, Marek Kwiatkowski, Maurits van der Meer, Maxim Kesin, Michael Dillon, Ole Rogeberg, Oscar Olvera, Riccardo Fusaroli, Richard Torkar, Sam Petulla, Sam Zorowitz, Seo-young Kim, Seth Axen, Simon Duane, Stephen Oates, Stijn, and Vladislavs Dovgalecs.

9 Original Computing Environment

writeLines(readLines(file.path(Sys.getenv("HOME"), ".R/Makevars")))
CC=clang

CXXFLAGS=-O3 -mtune=native -march=native -Wno-unused-variable -Wno-unused-function -Wno-macro-redefined -Wno-unneeded-internal-declaration
CXX=clang++ -arch x86_64 -ftemplate-depth-256

CXX14FLAGS=-O3 -mtune=native -march=native -Wno-unused-variable -Wno-unused-function -Wno-macro-redefined -Wno-unneeded-internal-declaration -Wno-unknown-pragmas
CXX14=clang++ -arch x86_64 -ftemplate-depth-256

References

Betancourt, Michael. 2017. “The Convergence of Markov Chain Monte Carlo Methods: From the Metropolis Method to Hamiltonian Monte Carlo,” June.

Devroye, L. 1986. Non-Uniform Random Variate Generation. Springer-Verlag.

Folland, G. B. 1999. Real Analysis: Modern Techniques and Their Applications. New York: John Wiley; Sons, Inc.

Lee, John M. 2011. Introduction to Topological Manifolds. Springer.

Stan Development Team. 2018. “Stan: A C++ Library for Probability and Sampling, Version 2.17.1.” http://mc-stan.org/.