The goal is to estimate $\ell = \mathbb{P}(X > \gamma)$ where $X \sim f(\cdot)$, in the rare-event regime $\ell \le 10^{-4}$ (i.e. $\gamma \gg 0$).
We can use Crude Monte Carlo (CMC) to get estimates $\hat{\ell}_{\mathrm{CMC}}$ of anything, but as $\ell \searrow 0$ the estimate collapses to $\hat{\ell}_{\mathrm{CMC}} = 0$ unless the sample size grows like $1/\ell$.
Instead, use Monte Carlo with importance sampling (IS), $\hat{\ell}_{\mathrm{IS}}$: rather than sampling $X \sim f(\cdot)$, take samples $X \sim g(\cdot)$ and reweight. The hope is that $\mathbb{V}\mathrm{ar}_g(\hat{\ell}_{\mathrm{IS}}) < \mathbb{V}\mathrm{ar}_f(\hat{\ell}_{\mathrm{CMC}})$.
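A minimal sketch of the CMC failure mode, using only the standard library (the helper `cmc_estimate` is illustrative, not from any particular package):

```python
import random

def cmc_estimate(gamma, R, seed=0):
    """Crude Monte Carlo estimate of P(Z > gamma) for Z ~ Normal(0, 1)."""
    rng = random.Random(seed)
    hits = sum(1 for _ in range(R) if rng.gauss(0.0, 1.0) > gamma)
    return hits / R

# P(Z > 6) is about 1e-9, so with only 10**5 samples the estimate
# is almost surely exactly 0 -- useless for a rare event.
print(cmc_estimate(6.0, 10**5))
```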
If we could sample from
$$ g^*(x) = \frac{1\{x > \gamma\} f(x)}{\ell} $$then we'd have $\mathbb{V}\mathrm{ar}_{g^*}(\hat{\ell}_{\mathrm{IS}}) = 0$.
Example: Want to find $\mathbb{P}(Z > 6)$ for $Z \sim \mathsf{Normal}(0, 1)$ using importance sampling.
Restriction: Will only consider proposal distributions which are $\mathsf{Normal}(\mu, \sigma^2)$, with p.d.f. $f(\cdot, \boldsymbol{v})$ where $\boldsymbol{v} = [\mu, \sigma^2]$.
Question: What is the best proposal distribution — equiv. the best $\boldsymbol{v} = (\mu, \sigma^2)$ parameter vector — for our IS estimation problem?
Bonus: here the nominal density is itself in the family, $f(\cdot) = f(\cdot, \boldsymbol{u})$ with $\boldsymbol{u}=[0, 1]$, so the CMC estimator is one of the candidate IS estimators; the best member of the family is therefore guaranteed to do no worse than CMC.
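To see what an IS estimator in this family looks like, here is a sketch with a hypothetical (not yet optimised) proposal $\mathsf{Normal}(6, 1)$ centred on the rare region; the helper names are illustrative:

```python
import math
import random

def normal_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def is_estimate(gamma, mu, sigma, R, seed=0):
    """IS estimate of P(Z > gamma), Z ~ Normal(0,1), proposal Normal(mu, sigma^2)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(R):
        x = rng.gauss(mu, sigma)
        if x > gamma:
            # likelihood ratio f(x) / g(x) corrects for sampling from g
            total += normal_pdf(x, 0.0, 1.0) / normal_pdf(x, mu, sigma)
    return total / R

# With the proposal centred at 6, roughly half the samples hit {x > 6},
# so the estimate is non-degenerate even though the event is rare.
est = is_estimate(6.0, 6.0, 1.0, 10**5)
truth = 0.5 * math.erfc(6.0 / math.sqrt(2.0))  # exact normal tail, ~9.87e-10
print(est, truth)
```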
We know that
$$ g^*(x) = \frac{1\{x > \gamma\} f(x)}{\ell} $$would be the best proposal distribution (zero variance). Try to find closest distribution inside $f(\cdot, \boldsymbol{v})$ to $g^*$.
How do we measure distance between distributions?
$$ \text{MagicDistance}(f_1, f_2)\ =\ {-} \int f_1(x) \log \bigl[ f_2(x) \bigr] \, \mathrm{d} x $$Minimising $\text{MagicDistance}(g^*, f(\cdot, \boldsymbol{v})) = -\mathbb{E}_{g^*}\bigl\{ \log \bigl[ f(X, \boldsymbol{v}) \bigr] \bigr\}$ over $\boldsymbol{v}$, then replacing the expectation by a crude Monte Carlo estimate, we get
$$ \begin{aligned} \boldsymbol{v}^* &= \argmax_{\boldsymbol{v}} \mathbb{E}_{g^*}\bigl\{ \log \bigl[ f(X, \boldsymbol{v}) \bigr] \bigr\} \\ &\approx \argmax_{\boldsymbol{v}} \frac{1}{R} \sum_{r=1}^R \log \bigl[ f(X_r, \boldsymbol{v}) \bigr] \end{aligned} $$for $X_r \overset{\mathrm{i.i.d.}}{\sim} g^*$.
This is maximum likelihood estimation! For normal distributions, we have $\boldsymbol{v}^* = (\mu_*, \sigma^2_*) $ where
$$ \mu_* = \frac{1}{R} \sum_{r=1}^R X_r \quad \text{and} \quad \sigma^2_* = \frac{1}{R} \sum_{r=1}^R (X_r - \mu_*)^2 \,. $$So, to estimate $\ell = \mathbb{P}(X > \gamma)$ with the optimal IS density $g^*(x) \propto 1\{x > \gamma\} f(x)$:
This is great, if we can sample from $g^*$...
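For this one-dimensional normal tail we actually *can* sample from $g^*$, by inverse-CDF sampling of the truncated normal (`statistics.NormalDist` provides `cdf`/`inv_cdf`); the MLE $(\mu_*, \sigma^2_*)$ then comes straight from the formulas above. A sketch (helper name is illustrative):

```python
import random
from statistics import NormalDist

def sample_gstar(gamma, R, seed=0):
    """Inverse-CDF samples from g*(x) ∝ 1{x > gamma} φ(x), φ the N(0,1) pdf."""
    nd = NormalDist()
    rng = random.Random(seed)
    lo = nd.cdf(gamma)  # all mass below gamma is excluded
    return [nd.inv_cdf(lo + rng.random() * (1.0 - lo)) for _ in range(R)]

xs = sample_gstar(6.0, 10**5)
mu_star = sum(xs) / len(xs)
var_star = sum((x - mu_star) ** 2 for x in xs) / len(xs)
# mean lands just above 6, variance well below 1: the conditional
# distribution given {X > 6} is far narrower than Normal(0, 1)
print(mu_star, var_star)
```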
Replace the expectation by an importance-sampled Monte Carlo estimate:
$$ \begin{aligned} \boldsymbol{v}^* &= \argmax_{\boldsymbol{v}} \mathbb{E}_{g^*}\bigl[ \log \bigl[ f(X, \boldsymbol{v}) \bigr] \bigr] \\ &= \argmax_{\boldsymbol{v}} \mathbb{E}_{f}\bigl[ \frac{g^*(X)}{f(X)} \log \bigl[ f(X, \boldsymbol{v}) \bigr] \bigr] \\ &= \argmax_{\boldsymbol{v}} \mathbb{E}_{f}\bigl[ 1\{ X > \gamma \} \log \bigl[ f(X, \boldsymbol{v}) \bigr] \bigr] \\ &\approx \argmax_{\boldsymbol{v}} \frac{1}{R} \sum_{r=1}^R 1\{X_r > \gamma\} \log \bigl[ f(X_r, \boldsymbol{v}) \bigr] \end{aligned} $$for $X_r \overset{\mathrm{i.i.d.}}{\sim} f$ (in the third line the constant factor $1/\ell$ from $g^*(X)/f(X) = 1\{X > \gamma\}/\ell$ has been dropped, since it doesn't change the $\argmax$).
Define elite set $\mathcal{E} = \{ X_r : X_r \ge \gamma \}$, then this is just maximum likelihood estimation of the elites! For normal distributions, we have $\boldsymbol{v}^* = (\mu_*, \sigma^2_*) $ where
$$ \mu_* = \frac{1}{|\mathcal{E}|} \sum_{X_r \in \mathcal{E}} X_r \quad \text{and} \quad \sigma^2_* = \frac{1}{|\mathcal{E}|} \sum_{X_r \in \mathcal{E}} (X_r - \mu_*)^2 \,. $$Anyone see a problem here?
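A quick numeric check of the problem (illustrative script, standard library only):

```python
import random

rng = random.Random(0)
gamma = 6.0
samples = [rng.gauss(0.0, 1.0) for _ in range(10**5)]
elites = [x for x in samples if x >= gamma]
# The chance that even one of 10**5 N(0,1) draws exceeds 6 is about 1e-4,
# so the elite set is almost surely empty and mu*, sigma*^2 cannot be formed.
print(len(elites))
```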
The Cross-Entropy (CE) method iteratively finds better importance sampling distributions, by minimising the distance from the current guess to the optimal IS distribution.
What CE gives you is a distribution, then you go off and run IS to get a rare-event estimate!
One way to measure "distances" between distributions is the Kullback-Leibler divergence.
$$ \mathrm{KL}(f_1, f_2) = \int f_1(x) \log \bigl[ \frac{f_1(x)}{f_2(x)} \bigr] \, \mathrm{d} x $$Have $\mathrm{KL}(f_1, f_1) = \mathbb{E}_{f_1} \bigl\{ \log \bigl[ \frac{f_1(X)}{f_1(X)} \bigr] \bigr\} = \mathbb{E}_{f_1} [ \log(1) ] = \mathbb{E}_{f_1} [ 0 ] = 0$, and
$$ \begin{aligned} \mathrm{KL}(f_1, f_2) &= \mathbb{E}_{f_1} \bigl\{ \log \bigl[ \frac{f_1(X)}{f_2(X)} \bigr] \bigr\} = \mathbb{E}_{f_1} \bigl\{ {-}\log \bigl[ \frac{f_2(X)}{f_1(X)} \bigr] \bigr\} \\ &\ge -\log \bigl\{ \mathbb{E}_{f_1} \bigl[ \frac{f_2(X)}{f_1(X)} \bigr] \bigr\} = -\log \bigl[ \int f_2(x) \mathrm{d} x \bigr] = -\log ( 1 ) = 0 \end{aligned} $$because $\varphi(\mathbb{E}[X]) \le \mathbb{E}[ \varphi(X) ]$ when $\varphi$ is a convex function (Jensen's inequality).
The definition of the cross-entropy "distance" is similar (the KL divergence is also known as relative entropy):
$$ \mathrm{CE}(f_1, f_2) = - \int f_1(x) \log \bigl[ f_2(x) \bigr] \, \mathrm{d} x \,. $$This is exactly the MagicDistance from before.
Main difference is that $\mathrm{CE}(f_1,f_1) = - \int f_1(x) \log[f_1(x)] \,\mathrm{d}x = \mathrm{Entropy}(f_1)$, which is generally $\not= 0$ (unlike $\mathrm{KL}(f_1,f_1) = 0$).
If keeping $f_1$ constant and searching for an $f_2$ to minimise KL$(f_1,f_2)$, then note $$ \begin{aligned} \argmin_{f_2} \mathrm{KL}(f_1,f_2) &= \argmin_{f_2} \Bigl\{ \int f_1(x) \log \bigl[ f_1(x) \bigr] \, \mathrm{d}x - \int f_1(x) \log \bigl[ f_2(x) \bigr] \, \mathrm{d}x \Bigr\} \\ &= \argmin_{f_2} \Bigl\{ - \int f_1(x) \log \bigl[ f_2(x) \bigr] \, \mathrm{d}x \Bigr\} = \argmin_{f_2} \mathrm{CE}(f_1,f_2) \,. \end{aligned}$$
Overall goal: $$ \argmin_{\boldsymbol{v}} \mathrm{CE}(g^*, f(\cdot; \boldsymbol{v})) = \argmax_{\boldsymbol{v}} \mathbb{E}_f \bigl\{ 1\{ X > \gamma\} \log \bigl[ f(X; \boldsymbol{v}) \bigr] \bigr\} \,. $$
That's too ambitious when $\gamma$ is deep in the tail (almost no samples land in $\{X > \gamma\}$), so define a sequence of easier intermediate levels $\gamma_i < \gamma$.
$$ \begin{aligned} \boldsymbol{v}_i &= \argmax_{\boldsymbol{v}} \mathbb{E}_f \bigl\{ 1\{ X > \gamma_i \} \log \bigl[ f(X; \boldsymbol{v}) \bigr] \bigr\} \\ &= \argmax_{\boldsymbol{v}} \mathbb{E}_{f(\cdot;\boldsymbol{v}_{i-1})} \bigl\{ 1\{ X > \gamma_i \} \frac{f(X)}{f(X;\boldsymbol{v}_{i-1})} \log \bigl[ f(X; \boldsymbol{v}) \bigr] \bigr\} \end{aligned} $$Sample from our previous guess
$$X_r \overset{\mathrm{i.i.d.}}{\sim} f(\cdot, \boldsymbol{v}_{i-1}) \text{ for } r=1,\dots,R$$and set $\gamma_i = \text{Quantile}( \{ X_1, \dots, X_R\} , 1-\rho)$ for some small-ish $\rho$ like $\rho = 0.10$.
$$ \begin{aligned} \boldsymbol{v}_i &= \argmax_{\boldsymbol{v}} \mathbb{E}_{f(\cdot;\boldsymbol{v}_{i-1})} \bigl\{ 1\{ X > \gamma_i \} \frac{f(X)}{f(X;\boldsymbol{v}_{i-1})} \log \bigl[ f(X; \boldsymbol{v}) \bigr] \bigr\} \\ &\approx \argmax_{\boldsymbol{v}} \frac{1}{R} \sum_{r=1}^R \Bigl\{ 1\{ X_r > \gamma_i \} \frac{f(X_r)}{f(X_r; \boldsymbol{v}_{i-1})} \log \bigl[ f(X_r; \boldsymbol{v}) \bigr] \Bigr\} \,. \end{aligned} $$Example: $f(x ; v) = \frac{1}{v} \mathrm{e}^{-\frac{x}{v}}$, i.e., $f( \,\cdot\, ; v)$ is $\mathsf{Exponential}(\frac{1}{v})$, with mean $v$.
Say that we've simulated $X_r \overset{\mathrm{i.i.d.}}{\sim} f( \,\cdot\, ; v_{i-1})$ for $r=1,\dots,R$. We want $v_i$ to be the solution to $$ v_{i} = \argmax_{v} \frac{1}{R} \sum_{r=1}^R 1\{ X_r > \gamma_i \} \frac{f(X_r)}{f(X_r; v_{i-1})} \log \bigl[ f(X_r; v) \bigr] $$
in other words, $v_i$ is the solution to
$$ \sum_{r=1}^R 1\{ X_r > \gamma_i \} \frac{f(X_r)}{f(X_r; v_{i-1})} \frac{\mathrm{d}}{\mathrm{d} v} \Bigl\{ \log \bigl[ f(X_r; v) \bigr] \Bigr\} = 0 .$$Problem to solve is then
$$ \sum_{r=1}^R 1\{ X_r > \gamma_i \} \frac{f(X_r)}{f(X_r; v_{i-1})} \bigl( \frac{X_r}{v_i^2} -\frac{1}{v_i} \bigr) = 0 $$$$ \frac{1}{v_i} \sum_{r=1}^R 1\{ X_r > \gamma_i \} \frac{f(X_r)}{f(X_r; v_{i-1})} = \frac{1}{v_i^2} \sum_{r=1}^R 1\{ X_r > \gamma_i \} \frac{f(X_r)}{f(X_r; v_{i-1})} X_r $$$$ v_i = \frac{ \sum_{r=1}^R 1\{ X_r > \gamma_i \} \frac{f(X_r)}{f(X_r; v_{i-1})} X_r }{ \sum_{r=1}^R 1\{ X_r > \gamma_i \} \frac{f(X_r)}{f(X_r; v_{i-1})}} . $$Remember, the MLE estimate for the mean of an exponential distribution is
$$ \hat{\mu} = \frac{1}{R} \sum_{r=1}^R X_r \quad \Rightarrow \quad \hat{\mu}_w = \frac{ \sum_{r=1}^R w_r X_r }{ \sum_{r=1}^R w_r } . $$If instead of estimating $\mathbb{P}(X > \gamma)$ we want to estimate $\mathbb{P}(X_1 + X_2 > \gamma)$, what do we do?
Basically the same. Write variables as vector $\boldsymbol{X}$ and rare event in terms of a performance function $\{ S(\boldsymbol{X}) > \gamma \}$. In this example, $S(\boldsymbol{X}) = \sum_i X_i $.
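Putting the scalar pieces together: a minimal sketch of the full multilevel CE iteration for the exponential example, estimating $\ell = \mathbb{P}(X > 20)$ for $X \sim \mathsf{Exponential}(1)$ (true value $\mathrm{e}^{-20}$). The sample size $R$, level parameter $\rho = 0.1$, and iteration cap are illustrative choices:

```python
import math
import random

def f(x, v):
    """Density of the Exponential with mean v: f(x; v) = exp(-x/v) / v."""
    return math.exp(-x / v) / v

def ce_estimate(gamma, R=10**5, rho=0.1, seed=0):
    rng = random.Random(seed)
    v = 1.0  # nominal parameter u = 1, i.e. X ~ Exponential(1)
    for _ in range(50):
        xs = sorted(rng.expovariate(1.0 / v) for _ in range(R))
        # level gamma_i: the (1 - rho)-quantile, capped at the real target
        gamma_i = min(xs[int((1.0 - rho) * R)], gamma)
        # weighted MLE over the elites: v_i = sum(w*x) / sum(w)
        ws = [(f(x, 1.0) / f(x, v), x) for x in xs if x > gamma_i]
        v = sum(w * x for w, x in ws) / sum(w for w, _ in ws)
        if gamma_i >= gamma:
            break  # the real level has been reached
    # final importance-sampling run with the tuned proposal f(.; v)
    xs = [rng.expovariate(1.0 / v) for _ in range(R)]
    est = sum(f(x, 1.0) / f(x, v) for x in xs if x > gamma) / R
    return est, v

est, v_star = ce_estimate(20.0)
print(est, math.exp(-20.0), v_star)  # v* ends up near gamma + 1 = 21
```

The final `v_star` matches the theory: under $\mathsf{Exponential}(1)$, the conditional mean given $\{X > \gamma\}$ is $\gamma + 1$.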
Imagine we just want to fit the data $(X_r)_{r=1,\dots,R}$ with a model $f(\cdot; \boldsymbol{v})$; we could maximise the (log-) likelihood:
$$ \boldsymbol{v}^* = \argmax_{\boldsymbol{v}} \prod_{r=1}^R f(X_r; \boldsymbol{v}) = \argmax_{\boldsymbol{v}} \sum_{r=1}^R \log \bigl[ f(X_r; \boldsymbol{v}) \bigr] $$Say that some samples are more important than others (or are duplicates), we weight them by $w_r$:
$$ \boldsymbol{v}^* = \argmax_{\boldsymbol{v}} \sum_{r=1}^R w_r \log \bigl[ f(X_r; \boldsymbol{v}) \bigr] $$That's exactly what the CE update does, with weights $w_r = 1\{ X_r > \gamma\} \frac{f(X_r)}{f(X_r; \boldsymbol{v}_{i-1})}$:
$$ \boldsymbol{v}^* \approx \argmax_{\boldsymbol{v}} \frac{1}{R} \sum_{r=1}^R 1\{ X_r > \gamma\} \frac{f(X_r)}{f(X_r; \boldsymbol{v}_{i-1})} \log \bigl[ f(X_r; \boldsymbol{v}) \bigr] $$Discrete time process $(X_n)_{n \ge 0}$, where $X_n \in \mathbb{R}$, with the Markov property:
$$ \mathbb{P}(X_n = x \mid X_0 = x_0, X_1 = x_1, \dots, X_{n-1} = x_{n-1}) = \mathbb{P}(X_n = x \mid X_{n-1} = x_{n-1}) \,. $$Right-hand side is the transition density $p(y \mid x) = \mathbb{P}(X_n = y \mid X_{n-1} = x)$ (for a continuous state space like $\mathbb{R}$, read these conditional probabilities as densities).
Some MCs have a stationary distribution, $\pi(\,\cdot\,)$, so if
$$ X_0 \sim \pi(\,\cdot\,) \quad\Rightarrow\quad X_1 \sim \pi(\,\cdot\,) \quad\Rightarrow\quad X_2 \sim \pi(\,\cdot\,) \dots $$If we want to simulate from $\pi(\,\cdot\,)$, we can generate
$$ X_1, \dots, X_R \overset{\mathrm{i.d.}}{\sim} \pi(\,\cdot\,) $$(identically distributed, though not independent) so long as we can generate the first $X_0$.
Let's create a Markov chain whose stationary distribution is something we want to simulate from. E.g.
$$ \pi(x) = g^*(x) = \frac{1\{ x > \gamma \} f(x)}{\ell} \,. $$Result is a group of algorithms called Markov chain Monte Carlo (MCMC).
Start from any $X_0$ in the support of $\pi$. For $n = 1, 2, \dots$
- propose $Y = X_{n-1} + Z$, where $Z \sim \phi(\,\cdot\,)$ for some symmetric step density $\phi$ (e.g. $\mathsf{Normal}(0, \sigma^2)$);
- with probability $\alpha = \min\bigl\{ \frac{\pi(Y)}{\pi(X_{n-1})}, 1 \bigr\}$ accept the proposal and set $X_n = Y$; otherwise set $X_n = X_{n-1}$.
Then $X_1, X_2, \dots$ are samples from our target distribution $\pi$.
The secret relates to the fact that $\alpha$ makes the chain reversible with respect to the target density $\pi$.
A Markov chain with transition probability $p(y \mid x)$ is reversible with respect to a density $g$ if
$$ g(x) p(y \mid x) = g(y) p(x \mid y) $$for all $x,y \in \mathcal{S}$. These are called the detailed balance equations.
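A numeric sanity check of detailed balance (assuming, for concreteness, a standard-normal target $\pi$ and a standard-normal step density $\phi$, with the Metropolis transition density $p(y \mid x) = \phi(y-x)\min\{\pi(y)/\pi(x), 1\}$ for $x \not= y$; all names are illustrative):

```python
import math

def phi(z):
    """Symmetric proposal step density: standard normal."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def pi_target(x):
    """Target density: standard normal (any positive density would do here)."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def p(y, x):
    """Metropolis transition density for x != y."""
    return phi(y - x) * min(pi_target(y) / pi_target(x), 1.0)

# detailed balance: pi(x) p(y|x) == pi(y) p(x|y) for every pair x, y
for x, y in [(0.3, 1.7), (-2.0, 0.5), (1.0, -1.0)]:
    lhs = pi_target(x) * p(y, x)
    rhs = pi_target(y) * p(x, y)
    assert math.isclose(lhs, rhs, rel_tol=1e-12)
print("detailed balance holds on the test pairs")
```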
This next part is a bit complicated:
The goal is
$$ \pi(x) p(y \mid x) = \pi(y) p(x \mid y) \,. $$Trivial if $x=y$. For $x \not= y$ the transition density is
$$ p(y \mid x) = \phi(y - x) \min\Bigl\{ \frac{\pi(y)}{\pi(x)} , 1 \Bigr\} \,. $$Therefore,
$$ \begin{aligned} \pi(x) p(y \mid x) &= \pi(x) \phi(y - x) \min\Bigl\{ \frac{\pi(y)}{\pi(x)} , 1 \Bigr\} \\ &= \phi(y - x) \min\{ \pi(y) , \pi(x) \} \\ &= \pi(y) \phi(x - y) \min\Bigl\{ 1 , \frac{\pi(x)}{\pi(y)} \Bigr\} \\ &= \pi(y) p(x \mid y) \,, \end{aligned} $$using the symmetry $\phi(y - x) = \phi(x - y)$ of the step density in the third line. Want to show that if a MC is reversible w.r.t. $\pi$, i.e.
$$ \pi(x) p(y \mid x) = \pi(y) p(x \mid y)$$then $\pi$ must be a stationary distribution for the MC, i.e. $X_0 \sim \pi \Rightarrow X_1 \sim \pi$.
Proof: Say $X_0 \sim \pi$ then
$$ \begin{aligned} f_{X_1}(x_1) &= \int_{x_0 \in \mathcal{S}} f_{X_0}(x_0) p(x_1 \mid x_0) \, \mathrm{d} x_0 = \int_{x_0 \in \mathcal{S}} \pi(x_0) p(x_1 \mid x_0) \, \mathrm{d} x_0 \\ &= \int_{x_0 \in \mathcal{S}} \pi(x_1) p(x_0 \mid x_1) \, \mathrm{d} x_0 = \pi(x_1) \int_{x_0 \in \mathcal{S}} p(x_0 \mid x_1) \, \mathrm{d} x_0 = \pi(x_1) \end{aligned} $$so $X_1 \sim \pi$.
So now we have
$$ X_0 \sim \pi(\,\cdot\,) \quad\Rightarrow\quad X_1 \sim \pi(\,\cdot\,) \quad\Rightarrow\quad X_2 \sim \pi(\,\cdot\,) \dots $$where $\pi$ is the distribution we want. How do we simulate $X_0$? Chicken & egg.
It turns out that if we just start $X_0$ anywhere and run the (ergodic) Markov chain "forever", then we have
$$ X_0 \sim f_{X_0} \Rightarrow X_1 \sim f_{X_1} \Rightarrow \dots \Rightarrow X_{\infty} \sim \pi \Rightarrow X_{\infty+1} \sim \pi \dots$$Weak theorem of computational mathematics:
$$ 1000 = \infty \,. $$Strong theorem of computational mathematics:
$$ 10^6 \equiv \infty \,. $$Implications: To simulate from $\pi$, just start $X_0$ anywhere, run the Markov chain for a while, then after $N \approx \infty$ we have
$$ X_{N} \overset{\mathrm{approx.}}{\sim} \pi(\,\cdot\,) \quad\Rightarrow\quad X_{N+1} \overset{\mathrm{approx.}}{\sim} \pi(\,\cdot\,) \quad\Rightarrow\quad X_{N+2} \overset{\mathrm{approx.}}{\sim} \pi(\,\cdot\,) \dots$$
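Tying it back to the rare-event target: a random-walk Metropolis sketch for $g^*(x) \propto 1\{x > \gamma\}\,\mathrm{e}^{-x^2/2}$ with $\gamma = 3$. The step size, burn-in length $N$, and chain length are illustrative choices; note MCMC yields dependent samples from $g^*$ but not the normalising constant $\ell$:

```python
import math
import random

def metropolis_gstar(gamma, N_burn=1000, N=20000, step=0.5, seed=0):
    """Random-walk Metropolis targeting g*(x) ∝ 1{x > gamma} exp(-x^2/2)."""
    rng = random.Random(seed)

    def target(x):  # unnormalised: the unknown 1/ell cancels in the ratio
        return math.exp(-0.5 * x * x) if x > gamma else 0.0

    x = gamma + 1.0  # start anywhere inside the support
    chain = []
    for n in range(N_burn + N):
        y = x + rng.gauss(0.0, step)              # symmetric proposal
        if rng.random() < target(y) / target(x):  # accept w.p. min{ratio, 1}
            x = y
        if n >= N_burn:                           # discard burn-in ("N ≈ ∞")
            chain.append(x)
    return chain

chain = metropolis_gstar(3.0)
mean = sum(chain) / len(chain)
print(mean)  # E[Z | Z > 3] is about 3.28
```

Proposals with $y \le \gamma$ get $\pi(y) = 0$ and are always rejected, so the chain never leaves the support of $g^*$.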