Distributional Forecasting

ACTL3143 & ACTL5111 Deep Learning for Actuaries

Author

Eric Dong & Patrick Laub

Show the package imports
import random
import matplotlib.pyplot as plt
import numpy as np
import numpy.random as rnd

from sklearn import set_config
set_config(transform_output="pandas")

import tf_keras
import tensorflow as tf
from tf_keras.callbacks import EarlyStopping
from tf_keras.layers import Input, Dense, Concatenate
from tf_keras.models import Sequential, Model
from tf_keras.initializers import Constant

from sklearn.model_selection import train_test_split
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OrdinalEncoder

Uncertainty

Uncertainty in deep learning refers to the level of doubt one has about the predictions made by an AI-driven algorithm. Identifying and quantifying the different sources of uncertainty that can exist in AI-driven algorithms is therefore important to ensure a credible application.

Quiz

Question: If you decide to predict the claim amount of Bob using a deep learning model, which source(s) of uncertainty are you confronting?

  1. The inherent variability of the data-generating process.
  2. Parameter error.
  3. Model error.
  4. Data uncertainty.
  5. All of the above.

Answer

All of the above!

Parameter error stems primarily from a lack of data. Model error stems from assuming the wrong distributional properties of the data. Data uncertainty arises from a lack of confidence in the quality of the collected data: noisy or inconsistent data, missing values, or missing important variables can all result in data uncertainty.

There are two major types of uncertainty in statistical or machine learning:

  • Aleatoric uncertainty
  • Epistemic uncertainty

Since there is no consensus on the definitions of aleatoric and epistemic uncertainty, we provide the most acknowledged definitions in the following slides.

Aleatoric Uncertainty

Qualitative Definition

Aleatoric uncertainty refers to the statistical variability and inherent noise in the data distribution that modelling cannot explain.

Quantitative Definition

\text{Ale}(Y|\boldsymbol{x}) = \mathbb{V}[Y|\boldsymbol{x}], i.e., if Y|\boldsymbol{x} \sim \mathcal{N}(\mu, \sigma^2), the aleatoric uncertainty would be \sigma^2. Simply put, it is the conditional variance of the response variable Y given features/covariates \boldsymbol{x}.
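As a quick sketch (with made-up parameters), we can verify by simulation that the aleatoric uncertainty of a Gaussian response is just its conditional variance:

import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data-generating process: Y | x ~ N(mu(x), sigma^2)
x = 1.5
mu_x, sigma = 2 + 3 * x, 4.0

# Simulate many responses for the same x and estimate V[Y | x]
y = rng.normal(mu_x, sigma, size=100_000)
print(y.var())  # close to sigma^2 = 16, i.e. the aleatoric uncertainty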

Epistemic Uncertainty

Qualitative Definition

Epistemic uncertainty refers to uncertainty arising from a lack of knowledge: limited data, parameter error and model error.

Quantitative Definition

\text{Epi}(Y|\boldsymbol{x}) = \text{Uncertainty}(Y|\boldsymbol{x}) - \text{Ale}(Y|\boldsymbol{x}),

i.e., the total uncertainty subtracting the aleatoric uncertainty \mathbb{V}[Y|\boldsymbol{x}] would be the epistemic uncertainty.

Uncertainty

Let’s go back to the question at the beginning:

If you decide to predict the claim amount of an individual using a deep learning model, which source(s) of uncertainty are you dealing with?

  1. The inherent variability of the data-generating process \rightarrow aleatoric uncertainty.
  2. Parameter error \rightarrow epistemic uncertainty.
  3. Model error \rightarrow epistemic uncertainty.
  4. Data uncertainty \rightarrow epistemic uncertainty.

The following sections show how we prepare the datasets for studying uncertainty.

Code: Data

import pandas as pd
1sev_df = pd.read_csv('freMTPL2sev.csv')
2freq_df = pd.read_csv('freMTPL2freq.csv')

# Create a copy of the freq dataframe without the 'ClaimNb' column
3freq_without_claimfreq = freq_df.drop(columns=['ClaimNb'])

# Merge severity dataframe with freq_without_claimfreq dataframe
new_sev_df = pd.merge(sev_df, freq_without_claimfreq, on='IDpol', 
4                                                      how='left')
5new_sev_df = new_sev_df.dropna()
6new_sev_df = new_sev_df.drop("IDpol", axis=1)
7new_sev_df[:2]
1
Imports freMTPL2sev.csv dataset
2
Imports freMTPL2freq.csv dataset
3
Drops ClaimNb column
4
Merges the two datasets, sev_df and freq_without_claimfreq, by matching on the IDpol column. Assigning how='left' ensures that all rows from the left dataset sev_df are kept, and only the matching rows from freq_without_claimfreq are joined on
5
Drops rows with missing (NaN) values
6
Drops the IDpol column from new_sev_df
7
Retrieves the first two rows of the dataset
ClaimAmount Exposure VehPower VehAge DrivAge BonusMalus VehBrand VehGas Area Density Region
0 995.20 0.59 11.0 0.0 39.0 56.0 B12 Diesel D 778.0 Picardie
1 1128.12 0.95 4.0 1.0 49.0 50.0 B12 Regular E 2354.0 Ile-de-France

Code: Preprocessing

Next we carry out some basic preprocessing

X_train, X_test, y_train, y_test = train_test_split(
  new_sev_df.drop("ClaimAmount", axis=1),
  new_sev_df["ClaimAmount"],
  random_state=2023)

# Reset each index to start at 0 again.
X_train = X_train.reset_index(drop=True)
X_test = X_test.reset_index(drop=True)
y_train = y_train.reset_index(drop=True)
y_test = y_test.reset_index(drop=True)

Code: Preprocessing

Next we define the column transformer. The column transformer first applies ordinal encoding to the VehBrand, Region, Area and VehGas variables, and applies standard scaling to all remaining numerical variables. We then fit the column transformer to the training set and apply the fitted transformation to both the training and test sets. (Note that the fitting is carried out only on the training set, and the same fitted transformation is applied to the training, validation and test sets.)

Since this task does not use entity embeddings, the VehBrand and Region variables are dropped from the dataframe.

# Transformation
ct = make_column_transformer(
  (OrdinalEncoder(), ["VehBrand", "Region", "Area", "VehGas"]),
  remainder=StandardScaler(),
   verbose_feature_names_out=False
)

# We don't apply entity embedding 
X_train_ct = ct.fit_transform(X_train)
X_test_ct = ct.transform(X_test)
X_train = X_train_ct.drop(["VehBrand", "Region"], axis=1)
X_test = X_test_ct.drop(["VehBrand", "Region"], axis=1)
  • \texttt{VehGas=1} if the car gas is regular.
  • \texttt{Area=0} represents the rural area, and \texttt{Area=5} represents the urban center.
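To check how these codes were assigned, we can inspect the fitted OrdinalEncoder inside the column transformer (a small sketch; "ordinalencoder" is the name auto-generated by make_column_transformer):

# The fitted OrdinalEncoder sits inside the column transformer; its
# categories_ attribute lists the levels in the order they were encoded
# (the category at position 0 is encoded as 0, position 1 as 1, and so on).
encoder = ct.named_transformers_["ordinalencoder"]
for feature, levels in zip(["VehBrand", "Region", "Area", "VehGas"],
                           encoder.categories_):
    print(feature, levels)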

Histogram of the ClaimAmount

Plotting the empirical distribution of the target variable helps us understand the inherent variability associated with the data.

plt.hist(y_train[y_train < 5000], bins=30);

Aleatoric Uncertainty

Aleatoric uncertainty refers to the inherent variability associated with the data-generating process. Among the many ways to capture aleatoric uncertainty, (i) combining neural networks with probabilistic models and (ii) considering mixture models are two useful methods for quantifying this inherent variability.

The following section illustrates how embedding a GLM in a neural network architecture can help quantify the uncertainty in the predictions coming from the neural network. The idea is to first fit a GLM, then use the predictions from the GLM together with the predictions from the neural network part to define a custom loss function. This embedding also presents an opportunity to compute the dispersion parameter \phi_{\text{CANN}} for the neural network. The dispersion parameter provides insight into whether the model accurately captures the inherent variability (aleatoric uncertainty) in the data.

GLM

The generalised linear model (GLM) is a statistical regression model that estimates the conditional mean of the response variable Y given an instance \boldsymbol{x} via a link function g: \mathbb{E}[Y|\boldsymbol{x}] = \mu(\boldsymbol{x}; \boldsymbol{\beta}_{\text{GLM}}) = g^{-1} \big(\big \langle \boldsymbol{\beta}_{\text{GLM}}, \boldsymbol{x} \big \rangle\big), where

  • \boldsymbol{x} \in \mathbb{R}^{d_{\boldsymbol{x}}} is the vector of explanatory variables, with d_{\boldsymbol{x}} denoting its dimension.
  • \boldsymbol{\beta}_{\text{GLM}} represents the vector of regression coefficients.
  • \langle \boldsymbol{a}, \boldsymbol{b}\rangle represents the inner product of \boldsymbol{a} and \boldsymbol{b}.

The idea of a GLM is to find a linear combination of the independent variables \boldsymbol{x} and coefficients \boldsymbol{\beta}, apply a non-linear transformation (g^{-1}) to that linear combination, and set it equal to the conditional mean of the response variable Y given an instance \boldsymbol{x}. The non-linear transformation provides added flexibility.

Gamma GLM

Suppose a fitted gamma GLM model has

  • a log link function g(x)=\log(x) and
  • regression coefficients \boldsymbol{\beta}_{\text{GLM}}=(\beta_0, \beta_1, \beta_2, \beta_3).

Then, it estimates the conditional mean of Y given a new instance \boldsymbol{x}=(1, x_1, x_2, x_3) as follows: \mathbb{E}[Y|\boldsymbol{x}]=g^{-1}(\langle \boldsymbol{\beta}_{\text{GLM}}, \boldsymbol{x}\rangle)=\exp\big(\beta_0+ \beta_1x_1+\beta_2x_2+\beta_3x_3\big).

A GLM can model any other exponential family distribution using an appropriate link function g.
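As a small numerical sketch of this formula (the coefficients and the instance below are hypothetical, not from a fitted model):

import numpy as np

# Hypothetical coefficients (beta_0, beta_1, beta_2, beta_3) and instance x = (1, x_1, x_2, x_3)
beta = np.array([7.5, 0.1, -0.02, 0.3])
x = np.array([1.0, 2.0, 45.0, 0.5])

# Conditional mean under the log link: E[Y | x] = exp(<beta, x>)
mu = np.exp(beta @ x)
print(mu)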

“Loss Function” for a Gamma GLM

If Y|\boldsymbol{x} is a gamma r.v., we can parameterise its density by its mean \mu(\boldsymbol{x}; \boldsymbol{\beta}) and dispersion parameter \phi: f_{Y|\boldsymbol{X}}(y|\boldsymbol{x}, \boldsymbol{\beta}, \phi) = \frac{(\mu (\boldsymbol{x}; \boldsymbol{\beta})\cdot \phi)^{-1/\phi}}{{\Gamma(1/\phi)}} \cdot y^{1/\phi - 1} \cdot \mathrm{e}^{-y/(\mu (\boldsymbol{x}; \boldsymbol{\beta})\cdot\phi)}. The “loss function” for a gamma GLM is typically the negative log-likelihood (NLL): \sum_{i=1}^{N}-\log f_{Y|\boldsymbol{X}}(y_i|\boldsymbol{x}_i, \boldsymbol{\beta},\phi) \propto \sum_{i=1}^{N}\log \mu (\boldsymbol{x}_i; \boldsymbol{\beta})+\frac{y_i}{\mu (\boldsymbol{x}_i; \boldsymbol{\beta})} + \text{const}, i.e., we ignore the dispersion parameter \phi while estimating the regression coefficients.
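In code, the part of the NLL that matters for estimating the regression coefficients is simply the following (a minimal sketch with made-up numbers):

import numpy as np

def gamma_glm_nll_core(y, mu):
    # Terms of the gamma NLL that depend on mu; pieces involving only phi or y are dropped
    return np.sum(np.log(mu) + y / mu)

# Toy observations and fitted means (hypothetical)
y = np.array([200.0, 1500.0, 800.0])
mu = np.array([500.0, 1200.0, 900.0])
print(gamma_glm_nll_core(y, mu))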

Fitting Steps

Step 1. Use a second-order iterative method (e.g., Fisher scoring / iteratively reweighted least squares) to find the regression coefficients: \boldsymbol{\beta}_{\text{GLM}} = \underset{\boldsymbol{\beta}}{\text{arg min}} \ \sum_{i=1}^{N}\log \mu (\boldsymbol{x}_i; \boldsymbol{\beta})+\frac{y_i}{\mu (\boldsymbol{x}_i; \boldsymbol{\beta})}

Step 2. Estimate the dispersion parameter: \phi_{\text{GLM}}=\frac{1}{N-d_{\boldsymbol{x}}}\sum_{i=1}^{N}\frac{(y_i-\mu(\boldsymbol{x}_i; \boldsymbol{\beta}_{\text{GLM}} ))^2}{\mu(\boldsymbol{x}_i; \boldsymbol{\beta}_{\text{GLM}} )^2}

Code: Gamma GLM

In Python, we can fit a gamma GLM as follows:

import statsmodels.api as sm

# Add a column of ones to include an intercept in the model
X_train_design = sm.add_constant(X_train)

# Create a Gamma GLM with a log link function
gamma_GLM = sm.GLM(y_train, X_train_design,                   
            family=sm.families.Gamma(sm.families.links.Log()))

# Fit the model
gamma_GLM = gamma_GLM.fit()

# Dispersion Parameter
mus = gamma_GLM.predict(X_train_design)
residuals = mus-y_train
variance = mus**2
dof = (len(y_train)-X_train.shape[1])
phi_GLM =  np.sum(residuals**2/variance)/dof
print(phi_GLM)
59.6306232357824

The above example of fitting a gamma GLM assumes a constant dispersion, meaning that the dispersion of the claim amount is the same for all policyholders. If we believe the constant-dispersion assumption is too strong, we can use a double GLM, which also models the dispersion as a function of the covariates (a rough sketch follows). Fitting a GLM is the traditional way of modelling claim amounts.
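For illustration only, here is a one-pass sketch of the double GLM idea: after fitting the mean model, regress the squared Pearson residuals on the covariates so that the dispersion can vary with \boldsymbol{x}. The full algorithm alternates between the mean and dispersion models; this simplified version reuses gamma_GLM and X_train_design from above.

import statsmodels.api as sm

# Stage 1: fitted means from the gamma GLM above
mus = gamma_GLM.predict(X_train_design)

# Squared Pearson residuals act as a rough response for the dispersion model
pearson_sq = (y_train - mus) ** 2 / mus ** 2

# Stage 2: model the dispersion as a function of the covariates (log link)
dispersion_GLM = sm.GLM(pearson_sq, X_train_design,
                        family=sm.families.Gamma(sm.families.links.Log())).fit()

# Policyholder-specific dispersion estimates
phi_i = dispersion_GLM.predict(X_train_design)
print(phi_i[:5])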

CANN

The Combined Actuarial Neural Network (CANN) is a novel actuarial neural network architecture proposed by Schelldorfer and Wüthrich (2019). We summarise the CANN approach as follows:

  • Find the coefficients \boldsymbol{\beta}_{\text{GLM}} of the GLM with a link function g(\cdot).
  • Find the weights \boldsymbol{w}_{\text{CANN}} of a neural network \mathcal{M}_{\text{CANN}}:\mathbb{R}^{d_{\boldsymbol{x}}}\to\mathbb{R}.
  • Given a new instance \boldsymbol{x}, we have \mathbb{E}[Y|\boldsymbol{x}] = g^{-1}\Big( \langle\boldsymbol{\beta}_{\text{GLM}}, \boldsymbol{x}\rangle + \mathcal{M}_{\text{CANN}}(\boldsymbol{x};\boldsymbol{w}_{\text{CANN}})\Big).

Architecture

Figure: CANN approach.

Code: Architecture

gamma_GLM.params
const         7.786576
Area         -0.073226
VehGas        0.082292
                ...   
DrivAge      -0.022147
BonusMalus    0.157204
Density       0.010539
Length: 9, dtype: float64
# Ensure reproducibility
1random.seed(1); tf.random.set_seed(1)

# Pre-defined constants
2glm_weights = gamma_GLM.params.iloc[1:]
3glm_bias = gamma_GLM.params.iloc[0]

# Define model inputs
4inputs = Input(shape=X_train.shape[1:])

# Non-trainable GLM linear part
glm_logmu = Dense(1, activation='linear', trainable=False,
                     kernel_initializer=Constant(glm_weights),
5                     bias_initializer=Constant(glm_bias))(inputs)

# Neural network layers
6x = Dense(64, activation='relu')(inputs)
7x = Dense(64, activation='relu')(x)
8cann_logmu = Dense(1, activation='linear')(x)
1
Sets the random seed for reproducibility
2
Stores weights computed from GLM in glm_weights
3
Stores bias computed from GLM in glm_bias
4
Specifies the model input features
5
Adds a Dense layer with just one neuron to store the model output from the GLM. The linear activation is used to make sure that the output is a linear combination of the inputs. The weights are set to be non-trainable, hence the values obtained during GLM fitting will not change during the neural network training process. kernel_initializer=Constant(glm_weights) and bias_initializer=Constant(glm_bias) ensure that the weights are initialized with the coefficient estimates from the GLM fit.
6
Adds another Dense layer
7
Adds another Dense layer
8
Adds the output layer with linear activation

Code: Loss Function

# Combine GLM and CANN estimates
1CANN = Model(inputs, Concatenate(axis=1)([cann_logmu, glm_logmu]))
1
Since the output of the model is evaluated by combining the outputs from both branches, the model is constructed by concatenating the outputs of cann_logmu and glm_logmu. Note that the model therefore returns two predicted values per observation: one from the glm_logmu component and one from the cann_logmu component.

We need to customise the loss function for CANN.

def CANN_negative_log_likelihood(y_true, y_pred):
    #the new mean estimate
1    CANN_logmu = y_pred[:, 0]
2    GLM_logmu = y_pred[:, 1]
3    mu = tf.math.exp(CANN_logmu + GLM_logmu)

    # Compute the negative log likelihood of the Gamma distribution
4    nll = tf.reduce_mean(CANN_logmu + GLM_logmu + y_true/mu)
    
    return nll
1
Stores the first column of the y_pred matrix as CANN_logmu (the prediction from the CANN)
2
Stores the second column of the y_pred matrix as GLM_logmu (the prediction from the GLM)
3
Computes the exponential of the sum of them as mu
4
Computes the negative log likelihood of a Gamma distribution where log(\mu) is now the sum CANN_logmu + GLM_logmu

Code: Model Training

1CANN.compile(optimizer="adam", loss=CANN_negative_log_likelihood)
hist = CANN.fit(X_train, y_train,
    epochs=300, 
    callbacks=[EarlyStopping(patience=30)],  
    verbose=0,
    batch_size=64, 
2    validation_split=0.2)
1
Compiles the model with adam optimizer and the custom loss function
2
Fits the model (with a validation split defined inside the fit function)

Find the dispersion parameter.

mus = np.exp(np.sum(CANN.predict(X_train, verbose=0), axis = 1))
residuals = mus-y_train
variance = mus**2
dof = (len(y_train)-X_train.shape[1])
phi_CANN =  np.sum(residuals**2/variance) / dof
print(phi_CANN)
98.60976911896634

Mixture Distribution

One intuitive way to capture uncertainty using neural networks is to estimate the parameters of the target distribution instead of predicting the value itself. For example, suppose we want to predict y coming from a Gaussian distribution. The most common method would be to predict \hat{y} directly using a single neuron at the output layer. An alternative is to estimate the parameters (\mu and \sigma) of the distribution of y using two neurons at the output layer (see the sketch below). Estimating the parameters of the distribution rather than a point estimate of y gives us an idea of the uncertainty. However, assuming a particular distributional form can sometimes be too restrictive. For example, the actual distribution of y may be bimodal or multimodal; in such situations, assuming a mixture distribution is more natural.
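As a minimal sketch of the two-output idea mentioned above (not one of the models fitted later in this lecture), a network that outputs the \mu and \sigma of a Gaussian and is trained with the Gaussian NLL could look like this:

# A sketch: two output neurons give the mu and sigma of a Gaussian
random.seed(1); tf.random.set_seed(1)

inputs = Input(shape=X_train.shape[1:])
x = Dense(64, activation="relu")(inputs)
mu = Dense(1, activation="linear")(x)       # conditional mean
sigma = Dense(1, activation="softplus")(x)  # standard deviation, must be positive
gaussian_net = Model(inputs, Concatenate(axis=1)([mu, sigma]))

def gaussian_nll(y_true, y_pred):
    # Gaussian negative log-likelihood, up to the constant 0.5 * log(2 * pi)
    y_true = tf.reshape(y_true, [-1])
    mu, sigma = y_pred[:, 0], y_pred[:, 1] + 1e-6
    return tf.reduce_mean(tf.math.log(sigma) + 0.5 * (y_true - mu) ** 2 / sigma ** 2)

gaussian_net.compile(optimizer="adam", loss=gaussian_nll)

Training would then proceed exactly as for the other models, e.g. gaussian_net.fit(X_train, y_train, ...).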

Given a finite set of component random variables (Y_1, \ldots, Y_{K}) and a vector of mixing weights \boldsymbol{\pi}=(\pi_{1}, \ldots, \pi_{K}) such that \pi_k \ge 0 for k \in \{1, \ldots, K\} and \sum_{k=1}^{K}\pi_k=1, we can construct a mixture Y of Y_1, \ldots, Y_{K} by selecting component k with probability \pi_k (the component indicator follows a \text{Multinomial}(1, \boldsymbol{\pi}) distribution), i.e., Y = \begin{cases} Y_1 & \text{w.p. } \pi_1, \\ \vdots & \vdots\\ Y_K & \text{w.p. } \pi_K. \\ \end{cases}

Mixture Distribution

Let f_{Y_k|\boldsymbol{X}} and F_{Y_k|\boldsymbol{X}} be the probability density function and the cumulative distribution function, respectively, of Y_k|\boldsymbol{X} for all k\in \{1, ..., K\}. The random variable Y|\boldsymbol{X}, which mixes the Y_k|\boldsymbol{X}’s with weights \pi_k’s, has the density function f_{Y|\boldsymbol{X}}(y|\boldsymbol{x}) = \sum_{k=1}^{K}\pi_k(\boldsymbol{x}) f_{Y_k|\boldsymbol{X}}(y|\boldsymbol{x}), and the cumulative distribution function F_{Y|\boldsymbol{X}}(y|\boldsymbol{x}) = \sum_{k=1}^{K}\pi_k(\boldsymbol{x}) F_{Y_k|\boldsymbol{X}}(y|\boldsymbol{x}).
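A small numerical sketch of these two formulas, mixing two gamma components with scipy (the weights and parameters are made up):

import numpy as np
from scipy.stats import gamma

# Hypothetical mixing weights and gamma (shape, rate) parameters
pis = np.array([0.7, 0.3])
alphas = np.array([2.0, 6.0])
betas = np.array([0.01, 0.002])  # rate parameters, so scipy's scale = 1 / beta

y = 1000.0

# f_Y(y) = sum_k pi_k f_k(y) and F_Y(y) = sum_k pi_k F_k(y)
density = np.sum(pis * gamma.pdf(y, a=alphas, scale=1 / betas))
cdf = np.sum(pis * gamma.cdf(y, a=alphas, scale=1 / betas))
print(density, cdf)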

Mixture Density Network

A mixture density network (MDN) \mathcal{M}_{\boldsymbol{w}^*} outputs each mixture component’s mixing weight and parameters for Y given the input features \boldsymbol{x}, i.e., \mathcal{M}_{\boldsymbol{w}^*}(\boldsymbol{x})=(\boldsymbol{\pi}(\boldsymbol{x};\boldsymbol{w}^*), \boldsymbol{\theta}(\boldsymbol{x};\boldsymbol{w}^*)), where \boldsymbol{w}^* are the network’s weights found by minimising the negative log-likelihood loss function \mathcal{L}(\mathcal{D}, \boldsymbol{w})= - \sum_{i=1}^{N} \log f_{Y|\boldsymbol{X}}(y_i|\boldsymbol{x}_i, \boldsymbol{w}), where \mathcal{D}=\{(\boldsymbol{x}_i,y_i)\}_{i=1}^{N} is the training dataset.

Mixture Density Network

Figure: An MDN that outputs the parameters for a K component mixture distribution. \boldsymbol{\theta}_k(\boldsymbol{x}; \boldsymbol{w}^*)= (\theta_{k,1}(\boldsymbol{x}; \boldsymbol{w}^*), ..., \theta_{k,|\boldsymbol{\theta}_k|}(\boldsymbol{x}; \boldsymbol{w}^*)) consists of the parameter estimates for the kth mixture component.

Model Specification

Suppose there are two types of claims:

  • Type I: Y_1|\boldsymbol{x}\sim \text{Gamma}(\alpha_1(\boldsymbol{x}), \beta_1(\boldsymbol{x})) and,
  • Type II: Y_2|\boldsymbol{x}\sim \text{Gamma}(\alpha_2(\boldsymbol{x}), \beta_2(\boldsymbol{x})).

The density of the actual claim amount Y|\boldsymbol{x} follows \begin{align*} f_{Y|\boldsymbol{X}}(y|\boldsymbol{x}) &= \pi_1(\boldsymbol{x})\cdot \frac{\beta_1(\boldsymbol{x})^{\alpha_1(\boldsymbol{x})}}{\Gamma(\alpha_1(\boldsymbol{x}))}\mathrm{e}^{-\beta_1(\boldsymbol{x})y}y^{\alpha_1(\boldsymbol{x})-1} \\ &\quad + (1-\pi_1(\boldsymbol{x}))\cdot \frac{\beta_2(\boldsymbol{x})^{\alpha_2(\boldsymbol{x})}}{\Gamma(\alpha_2(\boldsymbol{x}))}\mathrm{e}^{-\beta_2(\boldsymbol{x})y}y^{\alpha_2(\boldsymbol{x})-1}. \end{align*} where \pi_1(\boldsymbol{x}) is the probability of a Type I claim given \boldsymbol{x}.

Output

The aim is to find the optimum weights \boldsymbol{w}^* = \underset{w}{\text{arg min}} \ \mathcal{L}(\mathcal{D}, \boldsymbol{w}) for the Gamma mixture density network \mathcal{M}_{\boldsymbol{w}^*} that outputs the mixing weights, shapes and rates of Y given the input features \boldsymbol{x}, i.e., \begin{align*} \mathcal{M}_{\boldsymbol{w}^*}(\boldsymbol{x}) = ( &\pi_1(\boldsymbol{x}; \boldsymbol{w}^*), \pi_2(\boldsymbol{x}; \boldsymbol{w}^*), \\ &\alpha_1(\boldsymbol{x}; \boldsymbol{w}^*), \alpha_2(\boldsymbol{x}; \boldsymbol{w}^*),\\ &\beta_1(\boldsymbol{x}; \boldsymbol{w}^*), \beta_2(\boldsymbol{x}; \boldsymbol{w}^*) ). \end{align*}

Architecture

Figure: We demonstrate the structure of a gamma MDN that outputs the parameters for a gamma mixture with two components.

Code: Architecture

The following code implements the architecture of the gamma MDN from the previous slide.

# Ensure reproducibility
1random.seed(1); tf.random.set_seed(1)

2inputs = Input(shape=X_train.shape[1:])

# Two hidden layers 
3x = Dense(64, activation='relu')(inputs)
x = Dense(64, activation='relu')(x)

4pis = Dense(2, activation='softmax')(x) #mixing weights
alphas = Dense(2, activation='exponential')(x) #shape parameters
betas = Dense(2, activation='exponential')(x) #rate parameters

# `y_pred` will now have 6 columns
5gamma_mdn = Model(inputs, Concatenate(axis=1)([pis, alphas, betas]))
1
Sets the random seeds for reproducibility
2
Defines the input layer with the number of neurons being equal to the number of input features
3
Specifies the hidden layers of the neural network
4
Specifies the neurons of the output layer. Here, softmax is used for the \pi values as they must sum to 1. The exponential activation is used for both the \alpha’s and \beta’s as they must be strictly positive.
5
Defines the model by specifying the inputs and outputs

Loss Function

The negative log-likelihood loss function is given by

\mathcal{L}(\mathcal{D}, \boldsymbol{w}) = - \sum_{i=1}^{N} \log \ f_{Y|\boldsymbol{X}}(y_i|\boldsymbol{x}_i, \boldsymbol{w}) where f_{Y|\boldsymbol{X}}(y_i|\boldsymbol{x}_i, \boldsymbol{w}) is given by \begin{align*} &\pi_1(\boldsymbol{x}_i;\boldsymbol{w})\cdot \frac{\beta_1(\boldsymbol{x}_i;\boldsymbol{w})^{\alpha_1(\boldsymbol{x}_i;\boldsymbol{w})}}{\Gamma(\alpha_1(\boldsymbol{x}_i;\boldsymbol{w}))}\mathrm{e}^{-\beta_1(\boldsymbol{x}_i;\boldsymbol{w})y_i}y_i^{\alpha_1(\boldsymbol{x}_i;\boldsymbol{w})-1} \\ & \quad + (1-\pi_1(\boldsymbol{x}_i;\boldsymbol{w}))\cdot \frac{\beta_2(\boldsymbol{x}_i;\boldsymbol{w})^{\alpha_2(\boldsymbol{x}_i;\boldsymbol{w})}}{\Gamma(\alpha_2(\boldsymbol{x}_i;\boldsymbol{w}))}\mathrm{e}^{-\beta_2(\boldsymbol{x}_i;\boldsymbol{w})y_i}y_i^{\alpha_2(\boldsymbol{x}_i;\boldsymbol{w})-1} \end{align*}

Code: Loss Function

We employ functions from tensorflow_probability to code the loss function for the gamma MDN. The MixtureSameFamily distribution facilitates defining a mixture whose components all come from the same family but have different parameters.

1import tensorflow_probability as tfp
2tfd = tfp.distributions
3K = 2 # number of mixture components

def gamma_mixture_NLL(y_true, y_pred):                                      
4    K = y_pred.shape[1] // 3
    pis =  y_pred[:, :K]                                                    
    alphas = y_pred[:, K:2*K]                                               
    betas = y_pred[:, 2*K:3*K]                                              

    # The mixture distribution is a MixtureSameFamily distribution
    mixture_distribution = tfd.MixtureSameFamily(
        mixture_distribution=tfd.Categorical(probs=pis),
5        components_distribution=tfd.Gamma(alphas, betas))

    # The loss is the negative log-likelihood of the data
6    return -mixture_distribution.log_prob(y_true)
1
Imports tfp class from tensorflow_probability
2
Stores statistical distributions in the tfp class as tfd
3
Specifies the number of components in the mixture model
4
Extracts predicted values for all model components and stores them in separate matrices
5
Specifies the mixture distribution using computed model components
6
Returns the negative log-likelihood of the observed data under the specified mixture distribution
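To see the column layout the loss expects, we can evaluate it on a toy batch (a sketch with made-up numbers):

# Toy batch of two observations;
# columns are [pi_1, pi_2, alpha_1, alpha_2, beta_1, beta_2]
y_pred_toy = tf.constant([[0.6, 0.4, 2.0, 5.0, 0.010, 0.005],
                          [0.3, 0.7, 1.5, 4.0, 0.020, 0.004]])
y_true_toy = tf.constant([800.0, 1500.0])

# Returns one negative log-likelihood value per observation
print(gamma_mixture_NLL(y_true_toy, y_pred_toy))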

Code: Model Training

# Employ the loss function from previous slide
1gamma_mdn.compile(optimizer="adam", loss=gamma_mixture_NLL)

hist = gamma_mdn.fit(X_train, y_train,
    epochs=300, 
    callbacks=[EarlyStopping(patience=30)],  
    verbose=0,
    batch_size=64, 
2    validation_split=0.2)
1
Compiles the model using the adam optimizer and gamma_mixture_NLL (the negative log-likelihood) as the loss function
2
Fits the model using the training data, with a validation split

Proper Scoring Rules

Proper scoring rules provide a summary measure for the performance of the probabilistic predictions. They are useful in comparing performances across models.

Definition

The scoring rule S : \mathcal{F} \times \mathbb{R} \to \bar{\mathbb{R}} is proper relative to the class \mathcal{F} if S(G, G)\le S(F, G) for all F,G\in \mathcal{F}, where S(F, G) = \mathbb{E}_{Y\sim G}[S(F, Y)] denotes the expected score of the forecast F when the observations follow G. It is strictly proper if equality holds only if F = G.

Examples:

  • Logarithmic Score (NLL)
  • Continuous Ranked Probability Score (CRPS)

Proper Scoring Rules

Logarithmic Score (NLL)

The logarithmic score is defined as \mathrm{LogS}(f, y) = - \log f(y), where f is the predictive density.

Continuous Ranked Probability Score (CRPS)

The continuous ranked probability score is defined as \mathrm{crps}(F, y) = \int_{-\infty}^{\infty} (F(t) - {1}_{t\ge y})^2 \ \mathrm{d}t, where F is the cumulative distribution function.
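For example, we could approximate the CRPS of a gamma predictive distribution by numerical integration (a rough sketch; the predictive distribution, observation and integration grid below are arbitrary):

import numpy as np
from scipy.stats import gamma

def crps_numeric(dist, y, grid):
    # Approximates crps(F, y) = int (F(t) - 1{t >= y})^2 dt on a finite grid
    indicator = (grid >= y).astype(float)
    return np.trapz((dist.cdf(grid) - indicator) ** 2, grid)

# Hypothetical gamma predictive distribution and observed claim amount
dist = gamma(a=2.0, scale=500.0)
grid = np.linspace(0, 50_000, 100_000)
print(crps_numeric(dist, 1200.0, grid))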

Code: NLL

from scipy.stats import gamma

def gamma_nll(mean, dispersion, y):
    # Calculate shape and scale parameters from mean and dispersion
    shape = 1 / dispersion; scale = mean * dispersion

    # Create a gamma distribution object
    gamma_dist = gamma(a=shape, scale=scale)
    
    return -np.mean(gamma_dist.logpdf(y))

# GLM
X_test_design = sm.add_constant(X_test)
mus = gamma_GLM.predict(X_test_design)
NLL_GLM = gamma_nll(mus, phi_GLM, y_test)

# CANN
mus = np.exp(np.sum(CANN.predict(X_test, verbose=0), axis = 1))
NLL_CANN = gamma_nll(mus, phi_CANN, y_test)

# MDN
NLL_MDN = gamma_mdn.evaluate(X_test, y_test, verbose=0)

Model Comparisons

print(f'GLM: {round(NLL_GLM, 2)}')
print(f'CANN: {round(NLL_CANN, 2)}')
print(f'MDN: {round(NLL_MDN, 2)}')
GLM: 11.02
CANN: 11.5
MDN: 8.67

The above results show that the MDN provides the lowest value of the logarithmic score (NLL). Lower NLL values indicate a better probabilistic fit. One possible reason for the better performance of the MDN (compared with the single-component gamma models) is the added flexibility from having multiple mixture components, which together can better capture the inherent variation in the data.

Epistemic Uncertainty

Dropout

Dropout is one of the most popular methods for handling epistemic uncertainty. However, this method does not directly quantify the epistemic uncertainty; rather, it helps reduce the risk of overfitting. Dropout randomly selects a proportion of neurons and deactivates them during each training iteration. It is a regularization technique that aims to reduce overfitting and improve the generalization ability of the model.

Figure: An example of neurons dropped during training.

Dropout quote #1

It’s surprising at first that this destructive technique works at all. Would a company perform better if its employees were told to toss a coin every morning to decide whether or not to go to work? Well, who knows; perhaps it would! The company would be forced to adapt its organization; it could not rely on any single person to work the coffee machine or perform any other critical tasks, so this expertise would have to be spread across several people. Employees would have to learn to cooperate with many of their coworkers, not just a handful of them.

Dropout quote #2

The company would become much more resilient. If one person quit, it wouldn’t make much of a difference. It’s unclear whether this idea would actually work for companies, but it certainly does for neural networks. Neurons trained with dropout cannot co-adapt with their neighboring neurons; they have to be as useful as possible on their own. They also cannot rely excessively on just a few input neurons; they must pay attention to each of their input neurons. They end up being less sensitive to slight changes in the inputs. In the end, you get a more robust network that generalizes better.

Code: Dropout

Dropout is just another layer in Keras.

The following code shows how we can apply dropout to each hidden layer in the neural network. The dropout rate for each layer is 0.2. There is also a seed argument in the Dropout layer, which can be used to ensure reproducibility.

from tf_keras.layers import Dropout
# Ensure reproducibility
random.seed(2); tf.random.set_seed(2)

model = Sequential([
    Dense(30, activation="leaky_relu", name="hidden1"),
    Dropout(0.2),
    Dense(30, activation="leaky_relu", name="hidden2"),
    Dropout(0.2),
    Dense(1, activation="exponential", name="output")
])

model.compile("adam", "mse")
model.fit(X_train, y_train, epochs=4, verbose=0);

Code: Dropout after training

Making predictions is the same as any other model:

Dropout has no impact on model predictions here because dropout is applied only during the training stage. Once the model has finished training (once the weights and biases have been computed), all neurons contribute to the predictions (there is no dropping out during the prediction stage). Therefore, predictions from the model will not change across different runs.

model.predict(X_test.head(3),
                  verbose=0)
array([[ 53.365997],
       [149.5073  ],
       [ 84.2315  ]], dtype=float32)
model.predict(X_test.head(3),
                  verbose=0)
array([[ 53.365997],
       [149.5073  ],
       [ 84.2315  ]], dtype=float32)

We can make the model think it is still training:

By setting training=True, we can let dropout happen during the prediction stage as well. The predictions for the same input will then differ from run to run. This is known as Monte Carlo dropout.

model(X_test.head(3).to_numpy(),
    training=True).numpy()
array([[ 45.215286],
       [506.83798 ],
       [ 80.71608 ]], dtype=float32)
model(X_test.head(3).to_numpy(),
    training=True).numpy()
array([[170.87773],
       [140.37846],
       [231.01816]], dtype=float32)
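Repeating such stochastic forward passes many times and summarising the spread of the results gives a simple Monte Carlo dropout uncertainty estimate; a minimal sketch:

# Monte Carlo dropout: repeat stochastic forward passes and summarise them
n_passes = 100
mc_preds = np.stack([model(X_test.head(3).to_numpy(), training=True).numpy()
                     for _ in range(n_passes)])

# Per-policy predictive mean and standard deviation across the passes
print(mc_preds.mean(axis=0).flatten())
print(mc_preds.std(axis=0).flatten())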

Dropout Limitation

  • Increased training time: since dropout introduces noise into the training process, it can make training slower.
  • Sensitivity to dropout rates: the performance of dropout is highly dependent on the chosen dropout rate.
  • Uncertainty quantification: dropout only provides a crude approximation to the theoretically justified Bayesian approach for quantifying uncertainty.

Bayesian Neural Network

Bayesian neural networks (BNNs) provide a systematic way to quantify uncertainty about the model predictions. BNNs assume a distribution for each parameter (weight and bias) of the model. Since the parameters have distributions, the model ends up with a distribution over its outputs as well, and this output distribution helps quantify the uncertainty in the model’s predictions.

The weights \boldsymbol{w} of a Bayesian neural network (BNN) have their posterior distribution: p(\boldsymbol{w}|\mathcal{D})\propto \mathcal{L}(\mathcal{D}|\boldsymbol{w})p(\boldsymbol{w}) according to the Bayes’ theorem.

  • \mathcal{L}(\mathcal{D}|\boldsymbol{w}) represents the likelihood of data given the weights.
  • p(\boldsymbol{w}) represents the density of the prior distribution of the weights.

Tractability of Posterior Distribution

Let \boldsymbol{\theta}_0=(\boldsymbol{\mu}_{\boldsymbol{w}_0},\boldsymbol{\sigma}_{\boldsymbol{w}_0}) be the parameters of the prior distribution of weights: \boldsymbol{w}\sim \mathcal{N}(\boldsymbol{\mu}_{\boldsymbol{w}_0},\boldsymbol{\sigma}_{\boldsymbol{w}_0}). The derivation of the true posterior p(\boldsymbol{w}|\mathcal{D}) \propto \mathcal{L}(\mathcal{D}|\boldsymbol{w})p(\boldsymbol{w}) is non-trivial due to the complexity of the model. We cannot compute the true posterior distribution efficiently.

Variational Approximation

The variational approximation is a potential solution. Intuitively, we approximate the true posterior distribution with a variational distribution that is more tractable: \underbrace{p(\boldsymbol{w}|\mathcal{D})}_{\text{True Posterior Distribution}}\approx \underbrace{q(\boldsymbol{w}|\boldsymbol{\theta})}_{\text{Variational Distribution}} \sim\mathcal{N}(\boldsymbol{\mu}_{\boldsymbol{w}},\boldsymbol{\sigma}_{\boldsymbol{w}}), i.e., a normal distribution with parameters \boldsymbol{\theta}= (\boldsymbol{\mu}_{\boldsymbol{w}},\boldsymbol{\sigma}_{\boldsymbol{w}}) is used to approximate the true posterior distribution of \boldsymbol{w}|\mathcal{D}.

Demonstration

Figure: The idea is to use the blue curve (variational distribution) to approximate the purple curve (true posterior).

Code: Variational Layers

1import tensorflow_probability as tfp
2tfd = tfp.distributions

3def prior(kernel_size, bias_size, dtype=None):
    n = kernel_size + bias_size
    return lambda t: tfd.Independent(
        tfd.Normal(loc=tf.zeros(n, dtype=dtype),
                   scale=1),
4                   reinterpreted_batch_ndims=1)

5def posterior(kernel_size, bias_size, dtype=None):
    n = kernel_size + bias_size
    return Sequential([
      tfp.layers.VariableLayer(2 * n, dtype=dtype),
      tfp.layers.DistributionLambda(lambda t: tfd.Independent(
          tfd.Normal(loc=t[..., :n],
                     scale=1e-5 + tf.nn.softplus(0.01 * t[..., n:])),
6          reinterpreted_batch_ndims=1)),
    ])
1
Imports tensorflow_probability using the shortened name tfp
2
Stores statistical distributions in the tfp class as tfd
3
Specifies the prior which takes in the number of weights and biases (their sum would be the total number of parameters to estimate)
4
Specifies the prior for each parameter (a normal distribution with mean 0 and standard deviation 1). reinterpreted_batch_ndims=1 specifies that the distributions of the weights and biases should be treated as independent distributions.
5
Specifies the (variational) posterior distribution, which takes in the number of weights and biases.
6
Builds a sequential model with (i) a VariableLayer which manages the variational parameters (since each of the n model parameters has its own normal distribution with a mean and a scale, there are 2*n variational parameters) and (ii) a DistributionLambda layer which wraps the independent normal distributions as a Keras layer.

Architecture

Figure: We demonstrate the typical structure of a Bayesian neural network (BNN).

Loss Function

The KL divergence between the true posterior and variational distribution is given by: D_{\text{KL}}\left[q(\boldsymbol{w}|\boldsymbol{\theta}) || p(\boldsymbol{w}|\mathcal{D})\right] =\mathbb{E}_{\boldsymbol{w} \sim q(\boldsymbol{w}|\boldsymbol{\theta})}\left[\log\left(\frac{q(\boldsymbol{w}|\boldsymbol{\theta})}{p(\boldsymbol{w}|\mathcal{D})}\right) \right] After some algebra, we acknowledge the final representation: \begin{align*} D_{\text{KL}}\left[q(\boldsymbol{w}|\boldsymbol{\theta}) || p(\boldsymbol{w}|\mathcal{D})\right] &=\underbrace{D_{\text{KL}}\left[{q(\boldsymbol{w}|\boldsymbol{\theta})} || {p(\boldsymbol{w})}\right]}_{{\text{Complexity Loss}}} \underbrace{-\mathbb{E}_{\boldsymbol{w} \sim q(\boldsymbol{w}|\boldsymbol{\theta})}\left[\log{p(\mathcal{D}|\boldsymbol{w})}\right]}_{{\text{Error Loss}}} \\ & \quad\quad\quad\quad\quad\quad+ \ \text{const}. \end{align*}

The error loss here corresponds to the NLL. The average loss is obtained by calculating the loss for each random sample of the weights and then averaging over the samples.
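As a small aside (a sketch with made-up variational parameters for just three weights), the complexity loss for independent normal distributions has a closed form, which tensorflow_probability can compute directly:

import tensorflow_probability as tfp
tfd = tfp.distributions

# Hypothetical variational posterior q(w | theta) and standard normal prior p(w)
q = tfd.Independent(tfd.Normal(loc=[0.2, -0.1, 0.5], scale=[0.3, 0.2, 0.4]),
                    reinterpreted_batch_ndims=1)
p = tfd.Independent(tfd.Normal(loc=[0.0, 0.0, 0.0], scale=[1.0, 1.0, 1.0]),
                    reinterpreted_batch_ndims=1)

# Complexity loss: KL[q || p]
print(tfd.kl_divergence(q, p))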

Evaluation of Loss

In practice, we estimate the loss function \mathcal{L}(\mathcal{D}, \boldsymbol{\theta}) =\underbrace{D_{\text{KL}}\left[{q(\boldsymbol{w}|\boldsymbol{\theta})} || {p(\boldsymbol{w})}\right]}_{{\text{Complexity Loss}}} \underbrace{-\mathbb{E}_{\boldsymbol{w} \sim q(\boldsymbol{w}|\boldsymbol{\theta})}\left[\log{p(\mathcal{D}|\boldsymbol{w})}\right]}_{{\text{Error Loss}}} through the Monte Carlo estimate \mathcal{L}(\mathcal{D}, \boldsymbol{\theta})\approx\frac{1}{M}\sum_{m=1}^{M}\underbrace{\log\Bigg({\frac{q\left(\boldsymbol{w}^{(m)}|\boldsymbol{\theta}\right) }{ p\left(\boldsymbol{w}^{(m)}\right)}}\Bigg)}_{\text{Complexity Loss}} \underbrace{-\log{p\left(\mathcal{D}|\boldsymbol{w}^{(m)}\right)}}_{\text{Error Loss}} where \left\{\boldsymbol{w}^{(m)}\right\}_{m=1}^{M} are random samples from q(\boldsymbol{w}|\boldsymbol{\theta}).

“Bayesian-Gamma” Loss

If the output consists of the shape and rate parameters of a gamma distribution, the loss function would be \mathcal{L}(\mathcal{D}, \boldsymbol{\theta})\approx\frac{1}{M}\sum_{m=1}^{M}\underbrace{\log\Bigg({\frac{q\left(\boldsymbol{w}^{(m)}|\boldsymbol{\theta}\right) }{ p\left(\boldsymbol{w}^{(m)}\right)}}\Bigg)}_{\text{Complexity Loss}} \underbrace{-\sum_{i=1}^{N}\log \ f(y_i|\boldsymbol{x}_i,\boldsymbol{w}^{(m)})}_{\text{Error Loss}}, where f(y_i|\boldsymbol{x}_i,\boldsymbol{w}^{(m)}) denotes the density value of y_i given \boldsymbol{x}_i, under the mth Monte Carlo sample \boldsymbol{w}^{(m)}, i.e., f(y_i|\boldsymbol{x}_i,\boldsymbol{w}^{(m)})=\frac{\beta(\boldsymbol{x}_i;\boldsymbol{w}^{(m)})^{\alpha(\boldsymbol{x}_i;\boldsymbol{w}^{(m)})}}{\Gamma(\alpha(\boldsymbol{x}_i;\boldsymbol{w}^{(m)}))}\mathrm{e}^{-\beta(\boldsymbol{x}_i;\boldsymbol{w}^{(m)})y_i}y_i^{\alpha(\boldsymbol{x}_i;\boldsymbol{w}^{(m)})-1}.

Architecture

Figure: The output of our Bayesian neural network now consists of the shape parameter \alpha(\boldsymbol{x}; \boldsymbol{w}) and the rate parameter \beta(\boldsymbol{x}; \boldsymbol{w}).

Code: Architecture

The tfp.layers module lets the network output a gamma distribution object directly, from which the parameters can be extracted.

# Ensure reproducibility
1random.seed(1); tf.random.set_seed(1)

2inputs = Input(shape=X_train.shape[1:])

# DenseVariational layer
x = tfp.layers.DenseVariational(64, posterior, prior,
3                kl_weight=1/X_train.shape[0])(inputs)
outputs = Dense(2, activation = 'softplus')(x)

# Construct the Gamma distribution on the last layer
distributions = tfp.layers.DistributionLambda(
      lambda t: tfd.Gamma(concentration=t[..., 0:1], 
4                          rate=t[..., 1:2]))(outputs)
# Define the model
5gamma_bnn = Model(inputs, distributions)
1
Sets the random seed for reproducibility
2
Specifies the input layer with dimensions equal to the number of features
3
Specifies the DenseVariational layer with 64 neurons, the posterior, the prior and the KL divergence weight. In the above example, the KL divergence term in the loss function is scaled by the inverse of the training set size. This scaling helps stabilize the training process.
4
Takes the outputs from the previous layers and specifies a gamma distribution with first component as the concentration parameter and second component as the rate parameter
5
Specifies the model architecture

Code: Loss Function and Training

1def gamma_loss(y_true, y_pred):
    return -y_pred.log_prob(y_true)

# Then use the loss function when compiling the model
gamma_bnn.compile(optimizer=tf_keras.optimizers.Adam(learning_rate=0.001),
2                loss=gamma_loss)

hist = gamma_bnn.fit(X_train, y_train,
    epochs=300,
    callbacks=[EarlyStopping(patience=30)],
    verbose=0,
    batch_size=64,
3    validation_split=0.2)
1
Defines the loss function which computes the negative log likelihood of observing the true data under the predicted probability distribution
2
Compiles the model with the optimizer, loss function and a custom learning rate
3
Fits the BNN model using train set with a validation split defined inside the function

Code: Output Sampling

In practice, we can further increase the number of samples.

# Define the number of samples
n_samples = 1000

# Store all predictions in a list
alphas = []; betas = []

# Run the model `n_samples` times and store the predicted parameters
for i in range(n_samples):
  # Predict the distributions
  predicted_distributions = gamma_bnn(X_test[9:10].values)
  # Get the parameters
  alphas.append(predicted_distributions.concentration.numpy())
  betas.append(predicted_distributions.rate.numpy())

Sampled Density Functions

We plot some of the sampled posterior density functions. The variability of the sampled density functions is one critical consideration for epistemic uncertainty.
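A sketch of how such a plot could be produced from the alphas and betas sampled on the previous slide (the plotting range is arbitrary):

from scipy.stats import gamma

y_grid = np.linspace(1, 10_000, 500)

# Plot a subset of the sampled predictive gamma densities
for alpha, beta in zip(alphas[:50], betas[:50]):
    a, b = np.squeeze(alpha), np.squeeze(beta)  # concentration (shape) and rate
    plt.plot(y_grid, gamma.pdf(y_grid, a=a, scale=1 / b), color="C0", alpha=0.2)
plt.xlabel("Claim amount"); plt.ylabel("Density");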

Uncertainty Quantification (UQ)

We analyse the total variance formula: \begin{align*} \mathbb{V}[Y]&=\mathbb{E}[\mathbb{V}[Y|\boldsymbol{x}]] + \mathbb{V}[\mathbb{E}[Y|\boldsymbol{x}]]\\ &\approx \underbrace{\frac{1}{M}\sum_{m=1}^{M}\mathbb{V}\big[Y|\boldsymbol{x},\boldsymbol{w}^{(m)}\big]}_{\text{Aleatoric}} \\ &\quad \quad +\underbrace{\frac{1}{M}\sum_{m=1}^{M}\bigg(\mathbb{E}\big[Y|\boldsymbol{x},\boldsymbol{w}^{(m)}\big]-\frac{1}{M}\sum_{m=1}^{M}\mathbb{E}\big[Y|\boldsymbol{x},\boldsymbol{w}^{(m)}\big]\bigg)^2}_{\text{Epistemic}}, \end{align*} where M is the number of posterior samples generated.

Code: Applying UQ

# Convert to numpy array for easier manipulation
alphas = np.array(alphas); betas = np.array(betas)

# Aleatoric uncertainty: Mean of the variances of the predicted Gamma distributions
aleatoric_uncertainty = np.mean(alphas/betas**2)

# Epistemic uncertainty: Variance of the means of the model's predictions
epistemic_uncertainty = np.var(alphas/betas)

print(f"Aleatoric uncertainty: {aleatoric_uncertainty}")
print(f"Epistemic uncertainty: {epistemic_uncertainty}")
Aleatoric uncertainty: 12227515.0
Epistemic uncertainty: 1425106.75

Deep Ensembles

Lakshminarayanan et al. (2017) proposed deep ensembles as another prominent approach to obtaining epistemic uncertainty. Such a technique can be an alternative to BNNs. It’s simple to implement and requires very little hyperparameter tuning.

We summarise the deep ensemble approach for uncertainty quantification as follows:

  1. Train D neural networks with different random weight initialisations independently in parallel. The trained weights are \boldsymbol{w}^{(1)}, ..., \boldsymbol{w}^{(D)}.

Code: Deep Ensembles I

K = 1 # number of mixtures

def MDN_DE(num_ensembles):
  models = []
  for k in range(num_ensembles):
    # Ensure reproducibility
    random.seed(k); tf.random.set_seed(k)
    inputs = Input(shape=X_train.shape[1:])

    # Two hidden layers 
    x = Dense(64, activation='relu')(inputs)
    x = Dense(64, activation='relu')(x)

    pis = Dense(1, activation='softmax')(x) # mixing weights
    alphas = Dense(1, activation='softplus')(x) # shape parameters
    betas = Dense(1, activation='softplus')(x) # rate parameters

    # Concatenate by columns: `y_pred` will now have 3 columns
    gamma_mdn_new = Model(inputs, Concatenate(axis=1)([pis, alphas, betas]))
    gamma_mdn_new.compile(optimizer="adam",
                          loss=gamma_mixture_NLL)
    gamma_mdn_new.fit(X_train, y_train,
        epochs=100, callbacks=[EarlyStopping(patience=10)],  
        verbose=0, batch_size=64, validation_split=0.2)
    models.append(gamma_mdn_new)

  return(models)

Code: Deep Ensembles II

  2. For a new instance \boldsymbol{x}, obtain \Big\{\big(\mathbb{E}\big[Y|\boldsymbol{x},\boldsymbol{w}^{(d)}\big],\mathbb{V}\big[Y|\boldsymbol{x},\boldsymbol{w}^{(d)}\big]\big)\Big\}_{d=1}^{D},
D = 10 # number of MDNs
MDN_models = MDN_DE(D)

# Store all predictions in a list
weights = [0]*D; alphas = [0]*D; betas = [0]*D

# Store the parameters
for i in range(D):
  weights[i], alphas[i], betas[i] = MDN_models[i].predict(X_test[9:10], verbose=0)[0]

# Predict the means and variances
means = np.array(alphas)/np.array(betas)
variances = np.array(alphas)/np.array(betas)**2

Code: Deep Ensembles III

  3. Apply the variance decomposition \mathbb{V}[Y]=\mathbb{E}[\mathbb{V}[Y|\boldsymbol{x}]] + \mathbb{V}[\mathbb{E}[Y|\boldsymbol{x}]]
aleatoric_uncertainty = np.mean(variances)
epistemic_uncertainty = np.var(means)

print(f"Aleatoric uncertainty: {aleatoric_uncertainty}")
print(f"Epistemic uncertainty: {epistemic_uncertainty}")
Aleatoric uncertainty: 75940600.0
Epistemic uncertainty: 16657899.0

Package Versions

from watermark import watermark
print(watermark(python=True, packages="keras,matplotlib,numpy,pandas,seaborn,scipy,torch,tensorflow,tensorflow_probability,tf_keras"))
Python implementation: CPython
Python version       : 3.11.9
IPython version      : 8.24.0

keras                 : 3.3.3
matplotlib            : 3.8.4
numpy                 : 1.26.4
pandas                : 2.2.2
seaborn               : 0.13.2
scipy                 : 1.11.0
torch                 : 2.0.1
tensorflow            : 2.16.1
tensorflow_probability: 0.24.0
tf_keras              : 2.16.0

Glossary

  • aleatoric and epistemic uncertainty
  • Bayesian neural network
  • deep ensembles
  • dropout
  • CANN
  • GLM
  • MDN
  • mixture distribution
  • posterior sampling
  • proper scoring rule
  • uncertainty quantification
  • variational approximation