Rare-event simulation

Patrick Laub

Lecture 2

Agenda

Now:

  • Improved cross-entropy method
  • Cross-entropy method
  • Markov chain Monte Carlo

Later:

  • Start assignment 2 and check out code demonstrations
  • I will do some marking of assignment 1

Recap

The goal is to estimate $\ell = \mathbb{P}(X > \gamma)$ where $X \sim f(\cdot)$, for small probabilities $\ell \le 10^{-4}$ (i.e. $\gamma \gg 0$).

We can use Crude Monte Carlo to get estimates $\hat{\ell}_{\mathrm{CMC}}$ for anything, but as $\ell \searrow 0$ the estimate collapses to $\hat{\ell}_{\mathrm{CMC}} = 0$ (none of the samples ever hit the rare event).

Instead, use Monte Carlo with importance sampling, $\hat{\ell}_{\mathrm{IS}}$: rather than sampling $X \sim f(\cdot)$, take samples from a proposal $X \sim g(\cdot)$, and hope that $\mathbb{V}\mathrm{ar}_g(\hat{\ell}_{\mathrm{IS}}) < \mathbb{V}\mathrm{ar}_f(\hat{\ell}_{\mathrm{CMC}})$.

If we could sample from

$$ g^*(x) = \frac{1\{x > \gamma\} f(x)}{\ell} $$

then we'd have $\mathbb{V}\mathrm{ar}_{g^*}(\hat{\ell}_{\mathrm{IS}}) = 0$.

Narrow our focus

Example: Want to find $\mathbb{P}(Z > 6)$ for $Z \sim \mathsf{Normal}(0, 1)$ using importance sampling.

Restriction: Will only consider proposal distributions which are $\mathsf{Normal}(\mu, \sigma^2)$, with p.d.f. $f(\cdot, \boldsymbol{v})$ where $\boldsymbol{v} = [\mu, \sigma^2]$.

Question: What is the best proposal distribution — equiv. the best $\boldsymbol{v} = (\mu, \sigma^2)$ parameter vector — for our IS estimation problem?

Bonus: Here we have $f(\cdot) = f(\cdot, \boldsymbol{u})$ for $\boldsymbol{u}=[0, 1]$, so the CMC estimator is itself one of the candidate IS estimators in the family. The best choice over the family is therefore guaranteed to do no worse than CMC.
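As a sanity check that a good proposal helps enormously here, the following sketch runs IS with the proposal shifted into the rare region (the choice $\mathsf{Normal}(6,1)$, the sample size, and the seed are illustrative, not from the lecture):

```python
import numpy as np
from math import erfc, sqrt

rng = np.random.default_rng(42)
gamma, R = 6.0, 100_000

# Proposal g = Normal(gamma, 1): put the samples where the rare event lives.
Y = rng.normal(gamma, 1.0, size=R)

# Likelihood ratio f(y)/g(y) for f = Normal(0,1) and g = Normal(6,1):
# exp(-y^2/2) / exp(-(y-6)^2/2) = exp(gamma^2/2 - gamma*y).
weights = np.exp(gamma**2 / 2 - gamma * Y)
ell_is = np.mean((Y > gamma) * weights)

ell_exact = 0.5 * erfc(gamma / sqrt(2.0))   # P(Z > 6), about 9.9e-10
print(ell_is, ell_exact)
```

With the same $R$, crude Monte Carlo would almost surely return $\hat{\ell}_{\mathrm{CMC}} = 0$ here, while the IS estimate lands within a few percent of the truth.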

"Best is closest to perfect"

We know that

$$ g^*(x) = \frac{1\{x > \gamma\} f(x)}{\ell} $$

would be the best proposal distribution (zero variance). Try to find the distribution in the family $f(\cdot, \boldsymbol{v})$ which is closest to $g^*$.

How do we measure distance between distributions?

$$ \text{MagicDistance}(f_1, f_2)\ =\ {-} \int f_1(x) \log \bigl[ f_2(x) \bigr] \, \mathrm{d} x $$

Write the goal as an expectation

$$ \newcommand{\argmin}{\operatorname*{arg min}} \newcommand{\argmax}{\operatorname*{arg max}} \begin{aligned} \boldsymbol{v}^* &= \argmin_{\boldsymbol{v}}\ \text{MagicDistance}\bigl(g^*, f(\cdot, \boldsymbol{v})\bigr) \\ &= \argmin_{\boldsymbol{v}} - \int g^*(x) \log \bigl[ f(x, \boldsymbol{v})\bigr] \,\mathrm{d}x \\ &= \argmax_{\boldsymbol{v}} \int g^*(x) \log \bigl[ f(x, \boldsymbol{v})\bigr] \,\mathrm{d}x \\ &= \argmax_{\boldsymbol{v}} \mathbb{E}_{g^*}\bigl[ \log \bigl[ f(X, \boldsymbol{v}) \bigr] \bigr] \\ \end{aligned} $$

Maths is hard so we approximate

Replace expectation by crude Monte Carlo estimate and get

$$ \begin{aligned} \boldsymbol{v}^* &= \argmax_{\boldsymbol{v}} \mathbb{E}_{g^*}\bigl\{ \log \bigl[ f(X, \boldsymbol{v}) \bigr] \bigr\} \\ &\approx \argmax_{\boldsymbol{v}} \frac{1}{R} \sum_{r=1}^R \log \bigl[ f(X_r, \boldsymbol{v}) \bigr] \end{aligned} $$

for $X_r \overset{\mathrm{i.i.d.}}{\sim} g^*$.

This is maximum likelihood estimation! For normal distributions, we have $\boldsymbol{v}^* = (\mu_*, \sigma^2_*) $ where

$$ \mu_* = \frac{1}{R} \sum_{r=1}^R X_r \quad \text{and} \quad \sigma^2_* = \frac{1}{R} \sum_{r=1}^R (X_r - \mu_*)^2 \,. $$

Improved cross entropy

To estimate $\ell = \mathbb{P}(X > \gamma)$, whose optimal IS density is $g^*(x) \propto 1\{x > \gamma\} f(x)$:

  1. Choose a family $f( \,\cdot\, ; \boldsymbol{v})$, $R$ (e.g. $R=10^6$).
  2. Simulate $X_r \overset{\mathrm{i.i.d.}}{\sim} g^*( \,\cdot\, )$ for $r=1,\dots,R$.
  3. Set $\boldsymbol{v}_*$ to be the MLE estimate of fitting $\{X_1,\dots, X_R\}$ to $f( \,\cdot\, ; \boldsymbol{v})$. That is, $$ \boldsymbol{v}_* = \argmax_{\boldsymbol{v}} \frac{1}{R} \sum_{r=1}^R \log \bigl[ f(X_r; \boldsymbol{v}) \bigr] . $$
  4. Return the result of IS with $f( \,\cdot\, ; \boldsymbol{v}_*)$ proposal.

This is great, if we can sample from $g^*$...
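... and for the one-dimensional normal example we actually can: $g^*$ is a $\mathsf{Normal}(0,1)$ truncated to $(\gamma, \infty)$, which inverse-transform sampling handles exactly. A sketch of steps 1-4 (assuming SciPy is available; $R$ and the seed are illustrative):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
gamma, R = 6.0, 100_000

# Step 2: exact samples from g*(x) = 1{x > gamma} f(x) / ell by inverting the
# tail: if U ~ Uniform(0,1), then X = Phi^{-1}(1 - U * Phibar(gamma)) > gamma.
tail = norm.sf(gamma)
X = norm.isf(rng.uniform(size=R) * tail)

# Step 3: MLE of a Normal(mu, sigma^2) fitted to the g* samples.
mu_star, sigma_star = X.mean(), X.std()

# Step 4: importance sampling with the fitted proposal.
Y = rng.normal(mu_star, sigma_star, size=R)
w = norm.pdf(Y) / norm.pdf(Y, loc=mu_star, scale=sigma_star)
ell_is = np.mean((Y > gamma) * w)
print(mu_star, sigma_star, ell_is)
```

The fitted proposal concentrates just above $\gamma$ (roughly $\mu_* \approx 6.16$, $\sigma_* \approx 0.16$), and the IS run recovers $\ell \approx 9.9 \times 10^{-10}$.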

"Ambitious" cross-entropy method

Replace expectation by importance sampled Monte Carlo estimate

$$ \begin{aligned} \boldsymbol{v}^* &= \argmax_{\boldsymbol{v}} \mathbb{E}_{g^*}\bigl[ \log \bigl[ f(X, \boldsymbol{v}) \bigr] \bigr] \\ &= \argmax_{\boldsymbol{v}} \mathbb{E}_{f}\bigl[ \frac{g^*(X)}{f(X)} \log \bigl[ f(X, \boldsymbol{v}) \bigr] \bigr] \\ &= \argmax_{\boldsymbol{v}} \mathbb{E}_{f}\bigl[ 1\{ X > \gamma \} \log \bigl[ f(X, \boldsymbol{v}) \bigr] \bigr] \\ &\approx \argmax_{\boldsymbol{v}} \frac{1}{R} \sum_{r=1}^R 1\{X_r > \gamma\} \log \bigl[ f(X_r, \boldsymbol{v}) \bigr] \end{aligned} $$

for $X_r \overset{\mathrm{i.i.d.}}{\sim} f$.

"Ambitious" cross-entropy method

$$ \boldsymbol{v}^* \approx \argmax_{\boldsymbol{v}} \frac{1}{R} \sum_{r=1}^R 1\{X_r > \gamma\} \log \bigl[ f(X_r, \boldsymbol{v}) \bigr] $$

for $X_r \overset{\mathrm{i.i.d.}}{\sim} f$.

Define the elite set $\mathcal{E} = \{ X_r : X_r > \gamma \}$; then this is just maximum likelihood estimation on the elites! For normal distributions, we have $\boldsymbol{v}^* = (\mu_*, \sigma^2_*) $ where

$$ \mu_* = \frac{1}{|\mathcal{E}|} \sum_{X_r \in \mathcal{E}} X_r \quad \text{and} \quad \sigma^2_* = \frac{1}{|\mathcal{E}|} \sum_{X_r \in \mathcal{E}} (X_r - \mu_*)^2 \,. $$

Anyone see a problem here?
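A quick experiment makes the problem concrete: the elites are the samples that already exceed $\gamma$, but for a genuinely rare event essentially no sample from $f$ ever does. A sketch (the thresholds, $R$, and seed are chosen for illustration):

```python
import numpy as np

rng = np.random.default_rng(7)
R = 100_000
X = rng.normal(size=R)   # sample from f = Normal(0,1), as the method requires

for gamma in (3.0, 6.0):
    elites = X[X > gamma]
    if len(elites) == 0:
        print(f"gamma={gamma}: elite set is empty, the MLE is undefined")
    else:
        print(f"gamma={gamma}: {len(elites)} elites, "
              f"mu*={elites.mean():.3f}, sigma*={elites.std():.3f}")
```

For $\gamma = 3$ the fit works fine; for $\gamma = 6$ we expect around $R \cdot 10^{-9} \approx 0$ elites, so the "ambitious" method has nothing to fit.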

Cross-entropy method

Iteratively finds better importance sampling distributions, by minimising the distance from the current guess to the optimal IS distribution.

Cross-entropy method: high level

  1. Choose a family of distributions to search inside for the best IS proposal, $f( \,\cdot\, ; \boldsymbol{v})$.
  2. Choose some distribution in the family as the starting point, $f( \,\cdot\, ; \boldsymbol{v}_{0})$.
  3. Iteratively construct a better proposal $\boldsymbol{v}_i$ based on the previous $\boldsymbol{v}_{i-1}$ by stepping closer to the optimal $g^*$.
  4. When this converges to some $\boldsymbol{v}_*$, return the result of IS with $f( \,\cdot\, ; \boldsymbol{v}_*)$ proposal.

What CE gives you is a distribution; then you go off and run IS with it to get the rare-event estimate!

How to measure closeness of distributions?

One way to measure "distances" between distributions is the Kullback-Leibler divergence.

$$ \mathrm{KL}(f_1, f_2) = \int f_1(x) \log \bigl[ \frac{f_1(x)}{f_2(x)} \bigr] \, \mathrm{d} x $$

Have $\mathrm{KL}(f_1, f_1) = \mathbb{E}_{f_1} \bigl\{ \log \bigl[ \frac{f_1(X)}{f_1(X)} \bigr] \bigr\} = \mathbb{E}_{f_1} [ \log(1) ] = \mathbb{E}_{f_1} [ 0 ] = 0$, and

$$ \begin{aligned} \mathrm{KL}(f_1, f_2) &= \mathbb{E}_{f_1} \bigl\{ \log \bigl[ \frac{f_1(X)}{f_2(X)} \bigr] \bigr\} = \mathbb{E}_{f_1} \bigl\{ {-}\log \bigl[ \frac{f_2(X)}{f_1(X)} \bigr] \bigr\} \\ &\ge -\log \bigl\{ \mathbb{E}_{f_1} \bigl[ \frac{f_2(X)}{f_1(X)} \bigr] \bigr\} = -\log \bigl[ \int f_2(x) \mathrm{d} x \bigr] = -\log ( 1 ) = 0 \end{aligned} $$

because $\varphi(\mathbb{E}[X]) \le \mathbb{E}[ \varphi(X) ]$ when $\varphi$ is a convex function (Jensen's inequality).
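Both properties are easy to check numerically: estimate the KL divergence by Monte Carlo and compare against the closed form for two normals. A sketch (the two distributions and the sample size are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
R = 200_000

# KL(f1 || f2) = E_{f1}[ log f1(X) - log f2(X) ], estimated by Monte Carlo.
def log_normal_pdf(x, mu, sigma):
    return -0.5 * np.log(2 * np.pi * sigma**2) - (x - mu)**2 / (2 * sigma**2)

X = rng.normal(0.0, 1.0, size=R)          # X ~ f1 = Normal(0, 1)
kl_self = np.mean(log_normal_pdf(X, 0, 1) - log_normal_pdf(X, 0, 1))
kl_12 = np.mean(log_normal_pdf(X, 0, 1) - log_normal_pdf(X, 1, 2))

# Closed form: KL(N(m1,s1^2) || N(m2,s2^2))
#            = log(s2/s1) + (s1^2 + (m1-m2)^2) / (2 s2^2) - 1/2.
kl_exact = np.log(2) + (1 + 1) / (2 * 4) - 0.5
print(kl_self, kl_12, kl_exact)           # 0, and two values near 0.443
```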

Same as minimising the cross-entropy loss

The definition of the cross-entropy loss is similar (KL is also known as the relative entropy):

$$ \mathrm{CE}(f_1, f_2) = -\int f_1(x) \log \bigl[ f_2(x) \bigr] \, \mathrm{d} x $$
$$ \mathrm{KL}(f_1, f_2) = \int f_1(x) \log \bigl[ \frac{f_1(x)}{f_2(x)} \bigr] \, \mathrm{d} x $$

Main difference is that $\mathrm{CE}(f_1,f_1) = - \int f_1(x) \log[f_1(x)] \,\mathrm{d}x = \mathrm{Entropy}(f_1)$, which is usually $> 0$.

If keeping $f_1$ constant and searching for an $f_2$ to minimise KL$(f_1,f_2)$, then note $$ \begin{aligned} \argmin_{f_2} \mathrm{KL}(f_1,f_2) &= \argmin_{f_2} \Bigl\{ \int f_1(x) \log \bigl[ f_1(x) \bigr] \, \mathrm{d}x - \int f_1(x) \log \bigl[ f_2(x) \bigr] \, \mathrm{d}x \Bigr\} \\ &= \argmin_{f_2} \Bigl\{ - \int f_1(x) \log \bigl[ f_2(x) \bigr] \, \mathrm{d}x \Bigr\} = \argmin_{f_2} \mathrm{CE}(f_1,f_2) \,. \end{aligned}$$

During CE iterations, finding the next proposal

Overall goal: $$ \argmin_{\boldsymbol{v}} \mathrm{CE}(g^*, f(\cdot; \boldsymbol{v})) = \argmax_{\boldsymbol{v}} \mathbb{E}_f \bigl\{ 1\{ X > \gamma\} \log \bigl[ f(X; \boldsymbol{v}) \bigr] \bigr\} \,. $$

That's too ambitious, so define an easier intermediate goal $\gamma_i < \gamma$.

$$ \begin{aligned} \boldsymbol{v}_i &= \argmax_{\boldsymbol{v}} \mathbb{E}_f \bigl\{ 1\{ X > \gamma_i \} \log \bigl[ f(X; \boldsymbol{v}) \bigr] \bigr\} \\ &= \argmax_{\boldsymbol{v}} \mathbb{E}_{f(\cdot;\boldsymbol{v}_{i-1})} \bigl\{ 1\{ X > \gamma_i \} \frac{f(X)}{f(X;\boldsymbol{v}_{i-1})} \log \bigl[ f(X; \boldsymbol{v}) \bigr] \bigr\} \end{aligned} $$

During CE iterations, finding the next proposal

Sample from our previous guess

$$X_r \overset{\mathrm{i.i.d.}}{\sim} f(\cdot, \boldsymbol{v}_{i-1}) \text{ for } r=1,\dots,R$$

and set $\gamma_i = \text{Quantile}( \{ X_1, \dots, X_R\} , 1-\rho)$ for some small-ish $\rho$ like $\rho = 0.10$.

$$ \begin{aligned} \boldsymbol{v}_i &= \argmax_{\boldsymbol{v}} \mathbb{E}_{f(\cdot;\boldsymbol{v}_{i-1})} \bigl\{ 1\{ X > \gamma_i \} \frac{f(X)}{f(X;\boldsymbol{v}_{i-1})} \log \bigl[ f(X; \boldsymbol{v}) \bigr] \bigr\} \\ &\approx \argmax_{\boldsymbol{v}} \frac{1}{R} \sum_{r=1}^R \Bigl\{ 1\{ X_r > \gamma_i \} \frac{f(X_r)}{f(X_r; \boldsymbol{v}_{i-1})} \log \bigl[ f(X_r; \boldsymbol{v}) \bigr] \Bigr\} \,. \end{aligned} $$

Cross-entropy method: with all the gory details

  1. Choose a family $f( \,\cdot\, ; \boldsymbol{v})$, a starting point $\boldsymbol{v}_{0}$, $R$, and $\rho$.
  2. For $i=1, 2, \dots$:
    1. Simulate $X_r \overset{\mathrm{i.i.d.}}{\sim} f( \,\cdot\, ; \boldsymbol{v}_{i-1})$ for $r=1,\dots,R$.
    2. Find the quantile $\gamma_i = \mathrm{Quantile}( \{ X_1, \dots, X_R\} , 1-\rho)$.
    3. If $\gamma_i \ge \gamma$ set $\boldsymbol{v}^* = \boldsymbol{v}_{i-1}$, and quit the loop.
    4. Set $$ \boldsymbol{v}_{i} = \argmax_{\boldsymbol{v}} \frac{1}{R} \sum_{r=1}^R 1\{ X_r > \gamma_i \} \frac{f(X_r)}{f(X_r; \boldsymbol{v}_{i-1})} \log \bigl[ f(X_r; \boldsymbol{v}) \bigr] $$ by letting $\boldsymbol{v}_{i}$ be the solution to $$ \sum_{r=1}^R 1\{ X_r > \gamma_i \} \frac{f(X_r)}{f(X_r; \boldsymbol{v}_{i-1})} \nabla_{\hspace{-5px}\boldsymbol{v}} \Bigl\{ \log \bigl[ f(X_r; \boldsymbol{v}) \bigr] \Bigr\} = \boldsymbol{0} .$$
  3. Return the result of IS with $f( \,\cdot\, ; \boldsymbol{v}^*)$ proposal.
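The full loop can be sketched for the running example $\mathbb{P}(Z > 6)$, $Z \sim \mathsf{Normal}(0,1)$, with a $\mathsf{Normal}(\mu, \sigma^2)$ family, where the weighted MLE is just a weighted mean and weighted standard deviation (assuming SciPy; $R$, $\rho$, the seed, and the iteration cap are illustrative):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(3)
gamma, R, rho = 6.0, 100_000, 0.10

mu, sigma = 0.0, 1.0             # v_0: start from f itself
for i in range(50):
    X = rng.normal(mu, sigma, size=R)
    gamma_i = np.quantile(X, 1 - rho)
    if gamma_i >= gamma:         # the easier goal has reached the real goal
        break
    elite = X[X > gamma_i]
    # Weights f(x) / f(x; v_{i-1}): likelihood ratio back to the target f.
    w = norm.pdf(elite) / norm.pdf(elite, loc=mu, scale=sigma)
    mu = np.sum(w * elite) / np.sum(w)                       # weighted MLE
    sigma = np.sqrt(np.sum(w * (elite - mu)**2) / np.sum(w))

# Final importance-sampling run with the tuned proposal.
Y = rng.normal(mu, sigma, size=R)
lr = norm.pdf(Y) / norm.pdf(Y, loc=mu, scale=sigma)
ell = np.mean((Y > gamma) * lr)
print(mu, sigma, ell, norm.sf(gamma))
```

Each iteration drags the proposal's $(1-\rho)$-quantile a little further up, until it passes $\gamma = 6$ after a handful of iterations.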

Worked example: Exponential distributions

Example, $f(x ; v) = \frac{1}{v} \mathrm{e}^{-\frac{x}{v}}$, i.e., $f( \,\cdot\, ; v)$ is $\mathsf{Exponential}(\frac{1}{v})$.

Say that we've simulated $X_r \overset{\mathrm{i.i.d.}}{\sim} f( \,\cdot\, ; v_{i-1})$ for $r=1,\dots,R$. We want $v_i$ to solve $$ v_{i} = \argmax_{v} \frac{1}{R} \sum_{r=1}^R 1\{ X_r > \gamma_i \} \frac{f(X_r)}{f(X_r; v_{i-1})} \log \bigl[ f(X_r; v) \bigr] $$

in other words, $v_i$ is the solution to

$$ \sum_{r=1}^R 1\{ X_r > \gamma_i \} \frac{f(X_r)}{f(X_r; v_{i-1})} \frac{\mathrm{d}}{\mathrm{d} v} \Bigl\{ \log \bigl[ f(X_r; v) \bigr] \Bigr\} = 0 .$$

Boring calculus and boring algebra

$$ \frac{\mathrm{d}}{\mathrm{d} v} \Bigl\{ \log \bigl[ \frac{1}{v} \mathrm{e}^{-\frac{X_r}{v}} \bigr] \Bigr\} = \frac{\mathrm{d}}{\mathrm{d} v} \Bigl\{ -\log v - X_r v^{-1} \Bigr\} = -\frac{1}{v} + \frac{X_r}{v^2} = \frac{X_r}{v^2} -\frac{1}{v} . $$

Problem to solve is then

$$ \sum_{r=1}^R 1\{ X_r > \gamma_i \} \frac{f(X_r)}{f(X_r; v_{i-1})} \bigl( \frac{X_r}{v_i^2} -\frac{1}{v_i} \bigr) = 0 $$$$ \frac{1}{v_i} \sum_{r=1}^R 1\{ X_r > \gamma_i \} \frac{f(X_r)}{f(X_r; v_{i-1})} = \frac{1}{v_i^2} \sum_{r=1}^R 1\{ X_r > \gamma_i \} \frac{f(X_r)}{f(X_r; v_{i-1})} X_r $$$$ v_i = \frac{ \sum_{r=1}^R 1\{ X_r > \gamma_i \} \frac{f(X_r)}{f(X_r; v_{i-1})} X_r }{ \sum_{r=1}^R 1\{ X_r > \gamma_i \} \frac{f(X_r)}{f(X_r; v_{i-1})}} . $$

Remember, the MLE estimate for the mean of an exponential distribution is

$$ \hat{\mu} = \frac{1}{R} \sum_{r=1}^R X_r \quad \Rightarrow \quad \hat{\mu}_w = \frac{ \sum_{r=1}^R w_r X_r }{ \sum_{r=1}^R w_r } . $$

Example: CE with exponential proposals

  1. Let $f( \,\cdot\, ; v)$ be the family of $\mathsf{Exponential}(\frac{1}{v})$ distributions, take a starting point $v_0$, $R$, and $\rho$.
  2. For $i=1, 2, \dots$:
    1. Simulate $X_r \overset{\mathrm{i.i.d.}}{\sim} f( \,\cdot\, ; v_{i-1})$ for $r=1,\dots,R$.
    2. Find the quantile $\gamma_i = \mathrm{Quantile}( \{ X_1, \dots, X_R\} , 1-\rho)$.
    3. If $\gamma_i \ge \gamma$ set $v^* = v_{i-1}$, and quit the loop.
    4. Set $$ v_i = \frac{ \sum_{r=1}^R 1\{ X_r > \gamma_i \} \frac{f(X_r)}{f(X_r; v_{i-1})} X_r }{ \sum_{r=1}^R 1\{ X_r > \gamma_i \} \frac{f(X_r)}{f(X_r; v_{i-1})}} . $$
  3. Return the result of IS with $f( \,\cdot\, ; v^*)$ proposal.
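The exponential recipe above in code, estimating $\mathbb{P}(X > 30)$ for $X \sim \mathsf{Exponential}(1)$, where the exact answer $\mathrm{e}^{-30}$ is available for checking (a sketch; the threshold, $R$, $\rho$, and seed are illustrative):

```python
import numpy as np

rng = np.random.default_rng(5)
gamma, R, rho = 30.0, 100_000, 0.10

def pdf(x, v):                     # Exponential(1/v) density, i.e. mean v
    return np.exp(-x / v) / v

v = 1.0                            # v_0 = 1, i.e. start from f = Exponential(1)
for i in range(50):
    X = rng.exponential(scale=v, size=R)
    gamma_i = np.quantile(X, 1 - rho)
    if gamma_i >= gamma:
        break
    elite = X[X > gamma_i]
    w = pdf(elite, 1.0) / pdf(elite, v)     # f(x) / f(x; v_{i-1})
    v = np.sum(w * elite) / np.sum(w)       # the weighted-mean update

# IS with the tuned Exponential(1/v) proposal.
Y = rng.exponential(scale=v, size=R)
ell = np.mean((Y > gamma) * pdf(Y, 1.0) / pdf(Y, v))
print(v, ell, np.exp(-gamma))      # exact answer: P(X > 30) = e^{-30}
```

The method settles on $v^* \approx \gamma$, which matches the intuition that the optimal $g^*$ here is a shifted exponential with mean $\gamma + 1$.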

Multidimensional cross-entropy method

If instead of estimating $\mathbb{P}(X > \gamma)$ we want to estimate $\mathbb{P}(X_1 + X_2 > \gamma)$, what do we do?

Basically the same: write the variables as a vector $\boldsymbol{X}$ and the rare event in terms of a performance function, $\{ S(\boldsymbol{X}) > \gamma \}$. In this example, $S(\boldsymbol{X}) = \sum_i X_i $.

Multidimensional cross-entropy method

  1. Choose a family $f( \,\cdot\, ; \boldsymbol{v})$, a starting point $\boldsymbol{v}_{0}$, $R$, and $\rho$.
  2. For $i=1, 2, \dots$:
    1. Simulate $\boldsymbol{X}_r \overset{\mathrm{i.i.d.}}{\sim} f( \,\cdot\, ; \boldsymbol{v}_{i-1})$ for $r=1,\dots,R$.
    2. Find the quantile $\gamma_i = \mathrm{Quantile}( \{ S(\boldsymbol{X}_1), \dots, S(\boldsymbol{X}_R) \} , 1-\rho)$.
    3. If $\gamma_i \ge \gamma$ set $\boldsymbol{v}^* = \boldsymbol{v}_{i-1}$, and quit the loop.
    4. Set $$ \boldsymbol{v}_{i} = \argmax_{\boldsymbol{v}} \frac{1}{R} \sum_{r=1}^R 1\{ S(\boldsymbol{X}_r) > \gamma_i \} \frac{f(\boldsymbol{X}_r)}{f(\boldsymbol{X}_r; \boldsymbol{v}_{i-1})} \log \bigl[ f(\boldsymbol{X}_r; \boldsymbol{v}) \bigr] $$ by letting $\boldsymbol{v}_{i}$ be the solution to $$ \sum_{r=1}^R 1\{ S(\boldsymbol{X}_r) > \gamma_i \} \frac{f(\boldsymbol{X}_r)}{f(\boldsymbol{X}_r; \boldsymbol{v}_{i-1})} \nabla_{\hspace{-5px}\boldsymbol{v}} \Bigl\{ \log \bigl[ f(\boldsymbol{X}_r; \boldsymbol{v}) \bigr] \Bigr\} = \boldsymbol{0} .$$
  3. Return the result of IS with $f( \,\cdot\, ; \boldsymbol{v}^*)$ proposal.
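A sketch of the multidimensional version for $\mathbb{P}(X_1 + X_2 > \gamma)$ with i.i.d. standard normal coordinates. To keep the weighted MLE simple, this sketch restricts the family (an assumption, not from the slide) to unit-variance normal coordinates with mean vector $\boldsymbol{\mu}$, so the update is a weighted mean per coordinate:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(11)
gamma, R, rho = 8.0, 100_000, 0.10
S = lambda X: X.sum(axis=1)        # performance function S(X) = X_1 + X_2

mu = np.zeros(2)                   # v_0 = (0, 0), i.e. the original f
for i in range(50):
    X = rng.normal(loc=mu, size=(R, 2))
    s = S(X)
    gamma_i = np.quantile(s, 1 - rho)
    if gamma_i >= gamma:
        break
    elite = X[s > gamma_i]
    # f(x) / f(x; mu) for a unit-variance mean shift: exp(-mu.x + |mu|^2 / 2).
    w = np.exp(-elite @ mu + mu @ mu / 2)
    mu = (w[:, None] * elite).sum(axis=0) / w.sum()

Y = rng.normal(loc=mu, size=(R, 2))
lr = np.exp(-Y @ mu + mu @ mu / 2)
ell = np.mean((S(Y) > gamma) * lr)
print(mu, ell, norm.sf(gamma / np.sqrt(2)))   # exact: X1 + X2 ~ Normal(0, 2)
```

By symmetry the tuned mean shift splits evenly across the two coordinates.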

Post-script: CE is a genetic optimization algorithm

Some cool plots

Relationship to MLE

Imagine we just want to fit the data $(X_r)_{r=1,\dots,R}$ with a model $f(\cdot; \boldsymbol{v})$; we could maximise the (log-) likelihood:

$$ \boldsymbol{v}^* = \argmax_{\boldsymbol{v}} \prod_{r=1}^R f(X_r; \boldsymbol{v}) = \argmax_{\boldsymbol{v}} \sum_{r=1}^R \log \bigl[ f(X_r; \boldsymbol{v}) \bigr] $$

Say that some samples are more important than others (or are duplicates), we weight them by $w_r$:

$$ \boldsymbol{v}^* = \argmax_{\boldsymbol{v}} \sum_{r=1}^R w_r \log \bigl[ f(X_r; \boldsymbol{v}) \bigr] $$

That's exactly what the CE update does, with $w_r = 1\{ X_r > \gamma\} \frac{f(X_r)}{f(X_r; \boldsymbol{v}_{i-1})}$:

$$ \boldsymbol{v}^* \approx \argmax_{\boldsymbol{v}} \frac{1}{R} \sum_{r=1}^R 1\{ X_r > \gamma\} \frac{f(X_r)}{f(X_r; \boldsymbol{v}_{i-1})} \log \bigl[ f(X_r; \boldsymbol{v}) \bigr] $$

Cross-entropy method: weighted MLE

  1. Choose a family $f( \,\cdot\, ; \boldsymbol{v})$, a starting point $\boldsymbol{v}_{0}$, $R$ (e.g. $R=10^6$), and $\rho$ (e.g. $\rho = 0.10$).
  2. For $i=1,2,\dots$:
    1. Simulate $X_r \overset{\mathrm{i.i.d.}}{\sim} f( \,\cdot\, ; \boldsymbol{v}_{i-1})$ for $r=1,\dots,R$.
    2. Find the quantile $\gamma_i = \mathrm{Quantile}( \{ X_1, \dots, X_R\} , 1-\rho)$.
    3. If $\gamma_i \ge \gamma$ set $\boldsymbol{v}_* = \boldsymbol{v}_{i-1}$, and quit the loop.
    4. Create the elite set, $\mathcal{E} = \{ X_r : X_r \ge \gamma_i \}$.
    5. Choose the next $\boldsymbol{v}_{i}$ to be the weighted MLE of the elites $\mathcal{E}$ where the weights are $w_r = \frac{f(X_r)}{ f(X_r; \boldsymbol{v}_{i-1}) }$.
  3. Return the result of IS with $f( \,\cdot\, ; \boldsymbol{v}_*)$ proposal.

Random walk (discrete time, continuous values)

$$ X_n = X_{n-1} + \epsilon_n \quad \text{for } \epsilon_n \overset{\mathrm{i.i.d.}}{\sim} \mathsf{Normal}(0,1)\,. $$
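This is one line to simulate (a sketch; the horizon $N$ is arbitrary). Since the increments are i.i.d. $\mathsf{Normal}(0,1)$ and $X_0 = 0$, we know $X_n \sim \mathsf{Normal}(0, n)$:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 1000
eps = rng.normal(size=N)      # i.i.d. Normal(0,1) increments
X = np.cumsum(eps)            # X_n = X_{n-1} + eps_n, with X_0 = 0
print(X[:5])
```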

Markov chains

Discrete time process $(X_n)_{n \ge 0}$, where $X_n \in \mathbb{R}$, with the Markov property:

$$ \mathbb{P}(X_n = x \mid X_0 = x_0, X_1 = x_1, \dots, X_{n-1} = x_{n-1}) = \mathbb{P}(X_n = x \mid X_{n-1} = x_{n-1}) \,. $$

Right-hand side is the transition density $p(y \mid x) = \mathbb{P}(X_n = y \mid X_{n-1} = x)$.

Stationary distributions

Some MCs have a stationary distribution, $\pi(\,\cdot\,)$: once the chain is distributed according to $\pi$, it stays that way, i.e.

$$ X_0 \sim \pi(\,\cdot\,) \quad\Rightarrow\quad X_1 \sim \pi(\,\cdot\,) \quad\Rightarrow\quad X_2 \sim \pi(\,\cdot\,) \dots $$

If we want to simulate from $\pi(\,\cdot\,)$, we can generate

$$ X_1, \dots, X_R \overset{\mathrm{i.d.}}{\sim} \pi(\,\cdot\,) $$

so long as we can generate the first $X_0$. (The $\overset{\mathrm{i.d.}}{\sim}$ above means identically distributed: every sample has distribution $\pi$, but the samples are not independent.)

Big idea

Let's create a Markov chain whose stationary distribution is something we want to simulate from. E.g.

$$ \pi(x) = g^*(x) = \frac{1\{ x > \gamma \} f(x)}{\ell} \,. $$

Result is a group of algorithms called Markov chain Monte Carlo (MCMC).

Metropolis algorithm (random walk sampler)

For $n = 1, 2, \dots$

  • $Y = X_{n-1} + \epsilon_n$ is a proposal, where $\epsilon_n \overset{\mathrm{i.i.d.}}{\sim} \mathsf{Normal}(0,\sigma^2)$,
  • We either accept or reject the proposal, accepting with probability $\alpha = \min\bigl\{ 1, \pi(Y) / \pi(X_{n-1}) \bigr\}$, so $$ X_n = \begin{cases} Y & \text{if } U \le \alpha,\\ X_{n-1} & \text{otherwise} \end{cases} $$ where $U \sim \mathsf{Uniform}(0,1)$.

Then $X_1, X_2, \dots$ are (eventually) samples from our target distribution $\pi$.
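The algorithm in code, targeting the optimal IS density $g^*(x) \propto 1\{x > \gamma\} f(x)$ for $f = \mathsf{Normal}(0,1)$ and a modest $\gamma = 3$ (the step size $\sigma$, chain length, and starting point are illustrative guesses):

```python
import numpy as np

rng = np.random.default_rng(8)
gamma, N, sigma = 3.0, 50_000, 0.5

# Unnormalised target: g*(x) is 1{x > gamma} exp(-x^2/2) up to a constant.
# The unknown constant ell cancels in the acceptance ratio.
def pi_unnorm(x):
    return np.exp(-x**2 / 2) if x > gamma else 0.0

x = gamma + 1.0                  # start somewhere inside the support
chain = np.empty(N)
for n in range(N):
    y = x + rng.normal(0, sigma)             # symmetric random-walk proposal
    alpha = pi_unnorm(y) / pi_unnorm(x)      # accept with prob min(1, alpha)
    if rng.uniform() <= alpha:
        x = y
    chain[n] = x

print(chain.mean())
```

The sample mean can be compared with the exact conditional mean $\mathbb{E}[Z \mid Z > 3] = \varphi(3)/\overline{\Phi}(3) \approx 3.28$.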

Metropolis algorithm notes

  • We can easily have $\pi(x) = g^*(x)$, because the ratio $\pi(x)/\pi(y) = \frac{1\{ x > \gamma \} f(x)}{1\{ y > \gamma \} f(y)}$ is computable (the unknown constant $\ell$ cancels)
  • The jumps can come from any symmetric distribution
  • The jumps can come from an asymmetric distribution (Metropolis–Hastings algorithm)

MCMC in action

Why does this work? Why is this $\alpha$ so special?

The secret relates to the fact that $\alpha$ makes the chain reversible with respect to the target density $\pi$.

A Markov chain with transition probability $p(y \mid x)$ is reversible with respect to a density $g$ if

$$ g(x) p(y \mid x) = g(y) p(x \mid y) $$

for all $x,y \in \mathcal{S}$. These are called the detailed balance equations.

This next part is a bit complicated:

  • Not every MC has a stationary distribution
  • Of the MCs which do have a stationary distribution, not all are reversible w.r.t. it
  • But if an MC is reversible w.r.t. some distribution $g$, then $g$ must be a stationary distribution for that MC.
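These bullet points can be checked mechanically on a small discrete chain: build the Metropolis transition matrix for an arbitrary target $g$ and verify detailed balance, hence stationarity. A sketch (the 5-state target is made up for illustration):

```python
import numpy as np

# Target on states {0,...,4}, normalised from arbitrary positive weights.
g = np.array([1.0, 2.0, 4.0, 2.0, 1.0])
g = g / g.sum()
K = len(g)

# Metropolis transition matrix: propose a +/-1 step (symmetric), accept with
# probability min(1, g(y)/g(x)); invalid or rejected proposals stay put.
P = np.zeros((K, K))
for x in range(K):
    for y in (x - 1, x + 1):
        if 0 <= y < K:
            P[x, y] = 0.5 * min(1.0, g[y] / g[x])
    P[x, x] = 1.0 - P[x].sum()

flux = g[:, None] * P                 # flux[x, y] = g(x) p(y | x)
print(np.allclose(flux, flux.T))      # detailed balance: True
print(np.allclose(g @ P, g))          # hence g is stationary: True
```

The symmetry of `flux` is exactly the detailed balance equations, and $g P = g$ is stationarity.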

Ensure reversibility

The goal is

$$ \pi(x) p(y \mid x) = \pi(y) p(x \mid y) \,. $$

Trivial if $x=y$. For $x \not= y$ the transition density is

$$ p(y \mid x) = \phi(y - x) \min\Bigl\{ \frac{\pi(y)}{\pi(x)} , 1 \Bigr\} \,, $$

where $\phi$ is the (symmetric) density of the jump $\epsilon_n$.

Therefore,

$$ \begin{aligned} \pi(x) p(y \mid x) &= \pi(x) \phi(y - x) \min\Bigl\{ \frac{\pi(y)}{\pi(x)} , 1 \Bigr\} \\ &= \phi(y - x) \min\{ \pi(y) , \pi(x) \} \\ &= \pi(y) \phi(x - y) \min\Bigl\{ 1 , \frac{\pi(x)}{\pi(y)} \Bigr\} \\ &= \pi(y) p(x \mid y) \,, \end{aligned} $$

using the symmetry $\phi(y - x) = \phi(x - y)$ of the jump density in the third line.

Reversibility implies stationarity

Want to show that if a MC is reversible w.r.t. $\pi$, i.e.

$$ \pi(x) p(y \mid x) = \pi(y) p(x \mid y)$$

then $\pi$ must be a stationary distribution for the MC, i.e. $X_0 \sim \pi \Rightarrow X_1 \sim \pi$.

Proof: Say $X_0 \sim \pi$ then

$$ \begin{aligned} f_{X_1}(x_1) &= \int_{x_0 \in \mathcal{S}} f_{X_0}(x_0) p(x_1 \mid x_0) \, \mathrm{d} x_0 = \int_{x_0 \in \mathcal{S}} \pi(x_0) p(x_1 \mid x_0) \, \mathrm{d} x_0 \\ &= \int_{x_0 \in \mathcal{S}} \pi(x_1) p(x_0 \mid x_1) \, \mathrm{d} x_0 = \pi(x_1) \int_{x_0 \in \mathcal{S}} p(x_0 \mid x_1) \, \mathrm{d} x_0 = \pi(x_1) \end{aligned} $$

so $X_1 \sim \pi$.

How to start

So now we have

$$ X_0 \sim \pi(\,\cdot\,) \quad\Rightarrow\quad X_1 \sim \pi(\,\cdot\,) \quad\Rightarrow\quad X_2 \sim \pi(\,\cdot\,) \dots $$

where $\pi$ is the distribution we want. How do we simulate $X_0$? Chicken & egg.

It turns out that if we just start $X_0$ anywhere and run the Markov chain for an $\infty$ amount of time, then we have

$$ X_0 \sim f_{X_0} \Rightarrow X_1 \sim f_{X_1} \Rightarrow \dots \Rightarrow X_{\infty} \sim \pi \Rightarrow X_{\infty+1} \sim \pi \dots$$

Some fundamental results

Weak theorem of computational mathematics:

$$ 1000 = \infty \,. $$

Strong theorem of computational mathematics:

$$ 10^6 \equiv \infty \,. $$

Implications: To simulate from $\pi$, just start $X_0$ anywhere, run the Markov chain for a while, then after $N \approx \infty$ we have

$$ X_{N} \overset{\mathrm{approx.}}{\sim} \pi(\,\cdot\,) \quad\Rightarrow\quad X_{N+1} \overset{\mathrm{approx.}}{\sim} \pi(\,\cdot\,) \quad\Rightarrow\quad X_{N+2} \overset{\mathrm{approx.}}{\sim} \pi(\,\cdot\,) \dots$$
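To see the "theorems" in action, take a chain with a known stationary distribution, start it absurdly far away, and discard a burn-in. A sketch using an AR(1) chain $X_n = \rho X_{n-1} + \epsilon_n$, whose stationary distribution is $\mathsf{Normal}(0, 1/(1-\rho^2))$ (the chain, constants, and seed are illustrative, not from the lecture):

```python
import numpy as np

rng = np.random.default_rng(4)
rho, N, burn = 0.9, 100_000, 1000

# Start absurdly far from stationarity; the chain forgets this quickly.
x, chain = 500.0, np.empty(N)
for n in range(N):
    x = rho * x + rng.normal()
    chain[n] = x

kept = chain[burn:]                  # applying "1000 = infinity"
print(kept.mean(), kept.var(), 1 / (1 - rho**2))
```

After the burn-in, the empirical mean and variance match the stationary $\mathsf{Normal}(0, 1/(1-\rho^2)) \approx \mathsf{Normal}(0, 5.26)$, despite the wild starting point.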

Hamiltonian MCMC