Distributional Regression

ACTL3143 & ACTL5111 Deep Learning for Actuaries

Patrick Laub

Introduction

Lecture Outline

  • Introduction

  • Traditional Regression

  • Stochastic Forecasts

  • GLMs and Neural Networks

  • Combined Actuarial Neural Network

  • Mixture Density Network

  • Metrics for Distributional Regression

  • Aleatoric and Epistemic Uncertainty

  • Avoiding Overfitting

  • Dropout

  • Ensembles

Neural networks and confidence

Say we have a neural network that distinguishes ducks from rabbits.

A duck in the training set

A rabbit in the training set

New data can be different

Predict this is a duck

Misclassify this as a rabbit

We could do better if we collected more images of ducks with rabbit ears.

New data can be challenging

Predict this is a rabbit

Predict this is a ???

This is inherently difficult; extra training data won’t help.

Classifiers give us a probability

This is already a big step up compared to regression models.

However, neural networks’ “probabilities” can be overconfident.

We already saw a case of this.

See Guo et al. (2017), On Calibration of Modern Neural Networks.

Key idea

An example of distributional forecasting over the All Ordinaries Index
  • Earlier machine learning models focused on point estimates.
  • However, in many applications, we need to understand the distribution of the response variable.
  • Each prediction becomes a distribution over the possible outcomes.

Traditional Regression

Lecture Outline

  • Introduction

  • Traditional Regression

  • Stochastic Forecasts

  • GLMs and Neural Networks

  • Combined Actuarial Neural Network

  • Mixture Density Network

  • Metrics for Distributional Regression

  • Aleatoric and Epistemic Uncertainty

  • Avoiding Overfitting

  • Dropout

  • Ensembles

Notation

  • scalars are denoted by lowercase letters, e.g., y,
  • vectors are denoted by bold lowercase letters, e.g., \boldsymbol{y} = (y_1, \ldots, y_n) ,
  • random variables are denoted by capital letters, e.g., Y
  • random vectors are denoted by bold capital letters, e.g., \boldsymbol{X} = (X_1, \ldots, X_p) ,
  • matrices are denoted by bold uppercase non-italics letters, e.g., \mathbf{X} = \begin{pmatrix} x_{11} & \cdots & x_{1p} \\ \vdots & \ddots & \vdots \\ x_{n1} & \cdots & x_{np} \end{pmatrix} .

Regression notation

  • n is the number of observations, p is the number of features,
  • the true coefficients are \boldsymbol{\beta} = (\beta_0, \beta_1, \ldots, \beta_p),
  • \beta_0 is the intercept, \beta_1, \ldots, \beta_p are the coefficients,
  • \widehat{\boldsymbol{\beta}} is the estimated coefficient vector,
  • \boldsymbol{x}_i = (1, x_{i1}, x_{i2}, \ldots, x_{ip}) is the feature vector for the ith observation,
  • y_i is the response variable for the ith observation,
  • \hat{y}_i is the predicted value for the ith observation,
  • probability density functions (p.d.f.), probability mass functions (p.m.f.), cumulative distribution functions (c.d.f.).

Traditional Regression

Multiple linear regression assumes the data-generating process is

Y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \ldots + \beta_p x_{ip} + \varepsilon_i

where \varepsilon_i \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0, \sigma^2).

We estimate the coefficients \beta_0, \beta_1, \ldots, \beta_p by minimising the sum of squared residuals or mean squared error

\text{RSS} := \sum_{i=1}^n (y_i - \hat{y}_i)^2 , \quad \text{MSE} := \frac{1}{n} \sum_{i=1}^n (y_i - \hat{y}_i)^2 ,

where \hat{y}_i is the predicted value for the ith observation.

Visualising the distribution of each Y

Code
# Generate sample data for linear regression
np.random.seed(0)
X_toy = np.linspace(0, 10, 10)
np.random.shuffle(X_toy)

beta_0 = 2
beta_1 = 3
y_toy = beta_0 + beta_1 * X_toy + np.random.normal(scale=2, size=X_toy.shape)
sigma_toy = 2  # Assuming a standard deviation for the normal distribution

# Fit a simple linear regression model
coefficients = np.polyfit(X_toy, y_toy, 1)
predicted_y = np.polyval(coefficients, X_toy)

# Plot the data points and the fitted line
plt.scatter(X_toy, y_toy, label='Data Points')
plt.plot(X_toy, predicted_y, color='red', label='Fitted Line')

# Draw the normal distribution bell curve sideways at each data point
for i in range(len(X_toy)):
    mu = predicted_y[i]
    y_values = np.linspace(mu - 4*sigma_toy, mu + 4*sigma_toy, 100)
    x_values = stats.norm.pdf(y_values, mu, sigma_toy) + X_toy[i]
    plt.plot(x_values, y_values, color='blue', alpha=0.5)

plt.xlabel('$x$')
plt.ylabel('$y$')
plt.legend()

The probabilistic view

Y_i \sim \mathcal{N}(\mu_i, \sigma^2)

where \mu_i = \beta_0 + \beta_1 x_{i1} + \ldots + \beta_p x_{ip}, and \sigma^2 is assumed known.

The \mathcal{N}(\mu, \sigma^2) normal distribution has p.d.f.

f(y) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(y - \mu)^2}{2\sigma^2}\right) .

The likelihood function is

L(\boldsymbol{\beta}) = \prod_{i=1}^n \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(y_i - \mu_i)^2}{2\sigma^2}\right) \Rightarrow \ell(\boldsymbol{\beta}) = -\frac{n}{2}\log(2\pi) - \frac{n}{2}\log(\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^n (y_i - \mu_i)^2 .

Perform maximum likelihood estimation to find \boldsymbol{\beta}.
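As a sketch of that estimation step (not part of the original lecture code, and assuming the X_toy and y_toy variables from the earlier plotting code are in scope), numerically minimising this NLL recovers the same coefficients as ordinary least squares:

import numpy as np
from scipy import optimize, stats

def linear_nll(beta, x, y, sigma=2.0):
    mu = beta[0] + beta[1] * x
    return -np.sum(stats.norm.logpdf(y, loc=mu, scale=sigma))

mle = optimize.minimize(linear_nll, x0=[0.0, 0.0], args=(X_toy, y_toy))
print(mle.x)                        # MLE of (beta_0, beta_1)
print(np.polyfit(X_toy, y_toy, 1))  # least squares returns (slope, intercept)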

The predicted distributions

Code
y_pred = np.polyval(coefficients, X_toy[:4])

fig, axes = plt.subplots(4, 1, figsize=(5.0, 3.0))

x_min = y_pred[:4].min() - 4*sigma_toy
x_max = y_pred[:4].max() + 4*sigma_toy
x_grid = np.linspace(x_min, x_max, 100)

# Plot each normal distribution with different means vertically
for i, ax in enumerate(axes):
    mu = y_pred[i]
    y_grid = stats.norm.pdf(x_grid, mu, sigma_toy)
    ax.plot(x_grid, y_grid)
    ax.set_ylabel(f'$f(y ; \\boldsymbol{{x}}_{{{i+1}}})$')
    ax.set_xticks([y_pred[i]], labels=[r'$\mu_{' + str(i+1) + r'}$'])
    ax.plot(y_toy[i], 0, 'r|')

plt.tight_layout();

The machine learning view

The negative log-likelihood \text{NLL}(\boldsymbol{\beta}) := -\ell(\boldsymbol{\beta}) is to be minimised:

\text{NLL}(\boldsymbol{\beta}) = \frac{n}{2}\log(2\pi) + \frac{n}{2}\log(\sigma^2) + \frac{1}{2\sigma^2}\sum_{i=1}^n (y_i - \mu_i)^2 .

As \sigma^2 is fixed, minimising NLL is equivalent to minimising MSE:

\begin{aligned} \widehat{\boldsymbol{\beta}} &= \underset{\boldsymbol{\beta}}{\operatorname{arg\,min}}\,\, \text{NLL}(\boldsymbol{\beta}) \\ &= \underset{\boldsymbol{\beta}}{\operatorname{arg\,min}}\,\, \frac{n}{2}\log(2\pi) + \frac{n}{2}\log(\sigma^2) + \frac{1}{2\sigma^2}\sum_{i=1}^n (y_i - \mu_i)^2 \\ &= \underset{\boldsymbol{\beta}}{\operatorname{arg\,min}}\,\, \frac{1}{n} \sum_{i=1}^n \Bigl( y_i - \hat{y}_i(\boldsymbol{x}_i; \boldsymbol{\beta}) \Bigr)^2 \\ &= \underset{\boldsymbol{\beta}}{\operatorname{arg\,min}}\,\, \text{MSE}\bigl( \boldsymbol{y}, \hat{\boldsymbol{y}}(\boldsymbol{\mathbf{X}}; \boldsymbol{\beta}) \bigr). \end{aligned}

Generalised Linear Model (GLM)

The GLM is often characterised by the mean prediction:

\mu(\boldsymbol{x}; \boldsymbol{\beta}) = g^{-1} \left(\left\langle \boldsymbol{\beta}, \boldsymbol{x} \right\rangle\right)

where g is the link function.

Common GLM distributions for the response variable include:

  • Normal distribution with identity link (just MLR)
  • Bernoulli distribution with logit link (logistic regression)
  • Poisson distribution with log link (Poisson regression)
  • Gamma distribution with log link

Logistic regression

A Bernoulli distribution with parameter p has p.m.f.

f(y)\ =\ \begin{cases} p & \text{if } y = 1 \\ 1 - p & \text{if } y = 0 \end{cases} \ =\ p^y (1 - p)^{1 - y}.

Our model is that Y|\boldsymbol{X}=\boldsymbol{x} follows a Bernoulli distribution with parameter

\mu(\boldsymbol{x}; \boldsymbol{\beta}) = \frac{1}{1 + \exp\left(-\left\langle \boldsymbol{\beta}, \boldsymbol{x} \right\rangle\right)} = \mathbb{P}(Y=1|\boldsymbol{X}=\boldsymbol{x}).

The likelihood function, using \mu_i := \mu(\boldsymbol{x}_i; \boldsymbol{\beta}), is

L(\boldsymbol{\beta}) \ =\ \prod_{i=1}^n \begin{cases} \mu_i & \text{if } y_i = 1 \\ 1 - \mu_i & \text{if } y_i = 0 \end{cases} \ =\ \prod_{i=1}^n \mu_i^{y_i} (1 - \mu_i)^{1 - y_i} .

Binary cross-entropy loss

L(\boldsymbol{\beta}) = \prod_{i=1}^n \mu_i^{y_i} (1 - \mu_i)^{1 - y_i} \Rightarrow \ell(\boldsymbol{\beta}) = \sum_{i=1}^n \Bigl( y_i \log(\mu_i) + (1 - y_i) \log(1 - \mu_i) \Bigr).

The negative log-likelihood is

\text{NLL}(\boldsymbol{\beta}) = -\sum_{i=1}^n \Bigl( y_i \log(\mu_i) + (1 - y_i) \log(1 - \mu_i) \Bigr).

The binary cross-entropy loss is essentially the same, just averaged over the n observations: \text{BCE}(\boldsymbol{y}, \boldsymbol{\mu}) = - \frac{1}{n} \sum_{i=1}^n \Bigl( y_i \log(\mu_i) + (1 - y_i) \log(1 - \mu_i) \Bigr).
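As a quick check, the formula above can be evaluated directly and compared with Keras’ built-in binary cross-entropy; the labels and probabilities below are made up purely for illustration.

import numpy as np
import keras

y = np.array([1.0, 0.0, 1.0, 1.0])   # made-up labels
mu = np.array([0.9, 0.2, 0.6, 0.4])  # made-up predicted probabilities

bce_manual = -np.mean(y * np.log(mu) + (1 - y) * np.log(1 - mu))
bce_keras = keras.losses.BinaryCrossentropy()(y, mu)
print(bce_manual, bce_keras)  # agree up to Keras' internal clipping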

Poisson regression

A Poisson distribution with rate \lambda has p.m.f. f(y) = \frac{\lambda^y \exp(-\lambda)}{y!}.

Our model is that Y|\boldsymbol{X}=\boldsymbol{x} is Poisson distributed with rate

\mu(\boldsymbol{x}; \boldsymbol{\beta}) = \exp\left(\left\langle \boldsymbol{\beta}, \boldsymbol{x} \right\rangle\right) .

The likelihood function is

L(\boldsymbol{\beta}) = \prod_{i=1}^n \frac{ \mu_i^{y_i} \exp(-\mu_i) }{y_i!} \Rightarrow \ell(\boldsymbol{\beta}) = \sum_{i=1}^n \Bigl( -\mu_i + y_i \log(\mu_i) - \log(y_i!) \Bigr).

Poisson loss

The negative log-likelihood is

\text{NLL}(\boldsymbol{\beta}) = \sum_{i=1}^n \Bigl( \mu_i - y_i \log(\mu_i) + \log(y_i!) \Bigr) .

The Poisson loss drops the \log(y_i!) term (which is constant in \boldsymbol{\beta}) and averages over the observations:

\text{Poisson}(\boldsymbol{y}, \boldsymbol{\mu}) = \frac{1}{n} \sum_{i=1}^n \Bigl( \mu_i - y_i \log(\mu_i) \Bigr).
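As with the BCE loss, we can check this form against Keras’ built-in Poisson loss; the counts and means below are made up purely for illustration.

import numpy as np
import keras

y = np.array([0.0, 2.0, 1.0, 4.0])   # made-up observed counts
mu = np.array([0.5, 1.5, 1.0, 3.0])  # made-up predicted Poisson means

poisson_manual = np.mean(mu - y * np.log(mu))
poisson_keras = keras.losses.Poisson()(y, mu)
print(poisson_manual, poisson_keras)  # agree up to Keras' internal epsilon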

Gamma regression

A gamma distribution with mean \mu and dispersion \phi has p.d.f. f(y; \mu, \phi) = \frac{(\mu \phi)^{-\frac{1}{\phi}}}{\Gamma\left(\frac{1}{\phi}\right)} y^{\frac{1}{\phi} - 1} \mathrm{e}^{-\frac{y}{\mu \phi}}.

Our model is that Y|\boldsymbol{X}=\boldsymbol{x} is gamma distributed with dispersion \phi and mean \mu(\boldsymbol{x}; \boldsymbol{\beta}) = \exp\left(\left\langle \boldsymbol{\beta}, \boldsymbol{x} \right\rangle\right).

The likelihood function is L(\boldsymbol{\beta}) = \prod_{i=1}^n \frac{(\mu_i \phi)^{-\frac{1}{\phi}}}{\Gamma\left(\frac{1}{\phi}\right)} y_i^{\frac{1}{\phi} - 1} \exp\left(-\frac{y_i}{\mu_i \phi}\right)

\Rightarrow \ell(\boldsymbol{\beta}) = \sum_{i=1}^n \left[ -\frac{1}{\phi} \log(\mu_i \phi) - \log \Gamma\left(\frac{1}{\phi}\right) + \left(\frac{1}{\phi} - 1\right) \log(y_i) - \frac{y_i}{\mu_i \phi} \right].

Gamma loss

The negative log-likelihood is

\text{NLL}(\boldsymbol{\beta}) = \sum_{i=1}^n \left[ \frac{1}{\phi} \log(\mu_i \phi) + \log \Gamma\left(\frac{1}{\phi}\right) - \left(\frac{1}{\phi} - 1\right) \log(y_i) + \frac{y_i}{\mu_i \phi} \right].

Since \phi is a nuisance parameter, \text{NLL}(\boldsymbol{\beta}) = \sum_{i=1}^n \left[ \frac{1}{\phi} \log(\mu_i) + \frac{y_i}{\mu_i \phi} \right] + \text{const} \propto \sum_{i=1}^n \left[ \log(\mu_i) + \frac{y_i}{\mu_i} \right].

Note

As \log(\mu_i) = \log(y_i) - \log(y_i / \mu_i), we could write an alternative version \text{NLL}(\boldsymbol{\beta}) \propto \sum_{i=1}^n \left[ \log(y_i) - \log\Bigl(\frac{y_i}{\mu_i}\Bigr) + \frac{y_i}{\mu_i} \right] \propto \sum_{i=1}^n \left[ \frac{y_i}{\mu_i} - \log\Bigl(\frac{y_i}{\mu_i}\Bigr) \right].
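A quick numerical check (with made-up values) that these two proportional forms differ only by a constant in \mu, so they lead to the same \widehat{\boldsymbol{\beta}}:

import numpy as np

y = np.array([800.0, 1200.0, 500.0])   # made-up claim sizes
mu = np.array([900.0, 1000.0, 700.0])  # made-up fitted means

form_1 = np.sum(np.log(mu) + y / mu)      # log(mu_i) + y_i / mu_i
form_2 = np.sum(y / mu - np.log(y / mu))  # y_i / mu_i - log(y_i / mu_i)
print(form_1, form_2, form_2 - form_1)    # difference is -sum(log(y)), constant in mu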

Why do actuaries use GLMs?

  • GLMs are interpretable.
  • GLMs are flexible (can handle different types of response variables).
  • We get the full distribution of the response variable, not just the mean.

This last point is particularly important for analysing worst-case scenarios.

Stochastic Forecasts

Lecture Outline

  • Introduction

  • Traditional Regression

  • Stochastic Forecasts

  • GLMs and Neural Networks

  • Combined Actuarial Neural Network

  • Mixture Density Network

  • Metrics for Distributional Regression

  • Aleatoric and Epistemic Uncertainty

  • Avoiding Overfitting

  • Dropout

  • Ensembles

Stock price forecasting

Code
def lagged_timeseries(df, target, window=30):
    lagged = pd.DataFrame()
    for i in range(window, 0, -1):
        lagged[f"T-{i}"] = df[target].shift(i)
    lagged["T"] = df[target].values
    return lagged


stocks = pd.read_csv("../Time-Series-And-Recurrent-Neural-Networks/aus_fin_stocks.csv")
stocks["Date"] = pd.to_datetime(stocks["Date"])
stocks = stocks.set_index("Date")
_ = stocks.pop("ASX200")
stock = stocks[["CBA"]]
stock = stock.ffill()

df_lags = lagged_timeseries(stock, "CBA", 40)

# Split the data in time
X_train = df_lags.loc[:"2018"]
X_val = df_lags.loc["2019"]
X_test = df_lags.loc["2020":]

# Remove any with NAs and split into X and y
X_train = X_train.dropna()
X_val = X_val.dropna()
X_test = X_test.dropna()

y_train = X_train.pop("T")
y_val = X_val.pop("T")
y_test = X_test.pop("T")

X_train = X_train / 100
X_val = X_val / 100
X_test = X_test / 100
y_train = y_train / 100
y_val = y_val / 100
y_test = y_test / 100

lr = LinearRegression()
lr.fit(X_train, y_train);

stocks.plot()
plt.ylabel("Stock Price")
plt.legend(loc="upper center", bbox_to_anchor=(0.5, -0.5), ncol=4);

Noisy auto-regressive forecast

def noisy_autoregressive_forecast(model, X_val, sigma, suppress=False):
    """
    Generate a multi-step forecast using the given model.
    """
    multi_step = pd.Series(index=X_val.index, name="Multi Step")

    # Initialize the input data for forecasting
    input_data = X_val.iloc[0].values.reshape(1, -1)

    for i in range(len(multi_step)):
        # Ensure input_data has the correct feature names
        input_df = pd.DataFrame(input_data, columns=X_val.columns)
        if suppress:
            next_value = model.predict(input_df, verbose=0)
        else:
            next_value = model.predict(input_df)

        next_value += np.random.normal(0, sigma)

        multi_step.iloc[i] = next_value

        # Append that prediction to the input for the next forecast
        if i + 1 < len(multi_step):
            input_data = np.append(input_data[:, 1:], next_value).reshape(1, -1)

    return multi_step

Original forecast

lr_forecast = noisy_autoregressive_forecast(lr, X_val, 0)
Code
stock.loc[lr_forecast.index, "AR Linear"] = 100 * lr_forecast

def plot_forecasts(stock):
    stock.loc["2018-12":"2019"].plot()
    plt.axvline("2019", color="black", linestyle="--")
    plt.ylabel("Stock Price")
    plt.legend(loc="center left", bbox_to_anchor=(1, 0.5))

plot_forecasts(stock)
residuals = y_train.loc["2015":] - lr.predict(X_train.loc["2015":])
sigma = np.std(residuals)

With noise

np.random.seed(1)
lr_noisy_forecast = noisy_autoregressive_forecast(lr, X_val, sigma)
Code
stock.loc[lr_noisy_forecast.index, "AR Noisy Linear"] = 100 * lr_noisy_forecast
plot_forecasts(stock)

With noise

np.random.seed(2)
lr_noisy_forecast = noisy_autoregressive_forecast(lr, X_val, sigma)
Code
stock.loc[lr_noisy_forecast.index, "AR Noisy Linear"] = 100 * lr_noisy_forecast
plot_forecasts(stock)

With noise

np.random.seed(3)
lr_noisy_forecast = noisy_autoregressive_forecast(lr, X_val, sigma)
Code
stock.loc[lr_noisy_forecast.index, "AR Noisy Linear"] = 100 * lr_noisy_forecast
plot_forecasts(stock)

Many noisy forecasts

num_forecasts = 300
forecasts = []
for i in range(num_forecasts):
    forecasts.append(noisy_autoregressive_forecast(lr, X_val, sigma) * 100)
noisy_forecasts = pd.concat(forecasts, axis=1)
noisy_forecasts.index = X_val.index
Code
noisy_forecasts.loc["2018-12":"2019"].plot(legend=False, alpha=0.4)
plt.ylabel("Stock Price");

95% “prediction intervals”

# Calculate quantiles for the forecasts
lower_quantile = noisy_forecasts.quantile(0.025, axis=1)
upper_quantile = noisy_forecasts.quantile(0.975, axis=1)
mean_forecast = noisy_forecasts.mean(axis=1)
Code
# Plot the mean forecast
plt.figure(figsize=(8, 3))

plt.plot(stock.loc["2018-12":"2019"].index, stock.loc["2018-12":"2019"]["CBA"], label="CBA")

plt.plot(mean_forecast, label="Mean")

# Plot the quantile-based shaded area
plt.fill_between(mean_forecast.index, 
                 lower_quantile, 
                 upper_quantile, 
                 color="grey", alpha=0.2)

# Plot settings
plt.axvline(pd.Timestamp("2019-01-01"), color="black", linestyle="--")
plt.legend(loc="center left", bbox_to_anchor=(1, 0.5))
plt.xlabel("Date")
plt.ylabel("Stock Price")
plt.tight_layout();

Residuals

y_pred = lr.predict(X_train)
residuals = y_train - y_pred
residuals -= np.mean(residuals)
residuals /= np.std(residuals)
stats.shapiro(residuals)
/home/plaub/miniconda3/envs/ai2024/lib/python3.11/site-packages/scipy/stats/_morestats.py:1882: UserWarning: p-value may not be accurate for N > 5000.
  warnings.warn("p-value may not be accurate for N > 5000.")
ShapiroResult(statistic=0.9038059115409851, pvalue=0.0)

Note

We should probably model the log-returns rather than the raw stock prices.

Code
plt.hist(residuals, bins=40, density=True)
x = np.linspace(-3, 3, 100)
plt.xlim(-3, 3)
plt.plot(x, stats.norm.pdf(x, 0, 1));

Q-Q plot and P-P plot

Code
sm.qqplot(residuals, line="45");

Code
sm.ProbPlot(residuals).ppplot(line="45");

Residuals against time

Code
plt.plot(y_train.index, residuals)
plt.xlabel("Date")
plt.ylabel("Standardised Residuals")
plt.tight_layout();

Heteroskedasticity!

GLMs and Neural Networks

Lecture Outline

  • Introduction

  • Traditional Regression

  • Stochastic Forecasts

  • GLMs and Neural Networks

  • Combined Actuarial Neural Network

  • Mixture Density Network

  • Metrics for Distributional Regression

  • Aleatoric and Epistemic Uncertainty

  • Avoiding Overfitting

  • Dropout

  • Ensembles

French motor claim sizes

sev = pd.read_csv('freMTPL2sev.csv')
cov = pd.read_csv('freMTPL2freq.csv').drop(columns=['ClaimNb'])
sev = pd.merge(sev, cov, on='IDpol', how='left').drop(columns=["IDpol"]).dropna()
sev
ClaimAmount Exposure VehPower VehAge DrivAge BonusMalus VehBrand VehGas Area Density Region
0 995.20 0.59 11.0 0.0 39.0 56.0 B12 Diesel D 778.0 Picardie
1 1128.12 0.95 4.0 1.0 49.0 50.0 B12 Regular E 2354.0 Ile-de-France
... ... ... ... ... ... ... ... ... ... ... ...
26637 767.55 0.43 6.0 0.0 67.0 50.0 B2 Diesel C 142.0 Languedoc-Roussillon
26638 1500.00 0.28 7.0 2.0 36.0 60.0 B12 Diesel D 1732.0 Rhone-Alpes

26444 rows × 11 columns

Preprocessing

X_train, X_test, y_train, y_test = train_test_split(
  sev.drop("ClaimAmount", axis=1), sev["ClaimAmount"], random_state=2023)
ct = make_column_transformer((OrdinalEncoder(), ["Area", "VehGas"]),
    ("drop", ["VehBrand", "Region"]), remainder=StandardScaler())
X_train = ct.fit_transform(X_train)
X_test = ct.transform(X_test)
plt.hist(y_train[y_train < 5000], bins=30);

Doesn’t prove that Y | \boldsymbol{X} = \boldsymbol{x} is multimodal

Code
# Example where the marginal distribution is multimodal because a binary covariate separates the means of two unimodal conditional distributions
np.random.seed(1)

fig, axes = plt.subplots(3, 1, figsize=(5.0, 3.0), sharex=True)

x_min = 0
x_max = y_train.max()
x_grid = np.linspace(x_min, x_max, 100)

# Simulate some data from an exponential distribution which has Pr(X < 1000) = 0.9
n = 100
p = 0.1
lambda_ = -np.log(p) / 1000 
mu = 1 / lambda_
y_1 = np.random.exponential(scale=mu, size=n)

# Pick a truncated normal distribution with a mean of 1100 and std of 100 (truncated to be positive)
mu = 1100
sigma = 100
y_2 = stats.truncnorm.rvs((0 - mu) / sigma, (np.inf - mu) / sigma, loc=mu, scale=sigma, size=n)

# Combine y_1 and y_2 for the final histogram
y = np.concatenate([y_1, y_2])

# Determine common bins
bins = np.histogram_bin_edges(y, bins=30)


# Plot each normal distribution with different means vertically
for i, ax in enumerate(axes):
    if i == 0:
        ax.hist(y_1, bins=bins, density=True, color=colors[i+1])
        ax.set_ylabel(f'$f(y | x = 1)$')

    elif i == 1:
        ax.hist(y_2, bins=bins, density=True, color=colors[i+1])
        ax.set_ylabel(f'$f(y | x = 2)$')

    else:
        ax.hist(y, bins=bins, density=True)
        ax.set_ylabel(f'$f(y)$')

plt.tight_layout();

Gamma GLM

Suppose a fitted gamma GLM model has

  • a log link function g(x)=\log(x) and
  • regression coefficients \boldsymbol{\beta}=(\beta_0, \beta_1, \beta_2, \beta_3).

Then, it estimates the conditional mean of Y given a new instance \boldsymbol{x}=(1, x_1, x_2, x_3) as follows: \mathbb{E}[Y|\boldsymbol{X}=\boldsymbol{x}] = g^{-1}(\langle \boldsymbol{\beta}, \boldsymbol{x}\rangle) = \exp\big(\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 \big).

A GLM can model any other exponential family distribution using an appropriate link function g.

Gamma GLM loss

If Y|\boldsymbol{X}=\boldsymbol{x} is a gamma r.v. with mean \mu(\boldsymbol{x}; \boldsymbol{\beta}) and dispersion parameter \phi, we can minimise the negative log-likelihood (NLL) \text{NLL} \propto \sum_{i=1}^{n}\left[\log \mu (\boldsymbol{x}_i; \boldsymbol{\beta})+\frac{y_i}{\mu (\boldsymbol{x}_i; \boldsymbol{\beta})}\right], i.e., we ignore the dispersion parameter \phi while estimating the regression coefficients.

Fitting Steps

Step 1. Use an iterative second-order method (e.g., Newton–Raphson/Fisher scoring) to find the regression coefficients: \widehat{\boldsymbol{\beta}} = \underset{\boldsymbol{\beta}}{\text{arg\,min}} \ \sum_{i=1}^{n}\left[\log \mu (\boldsymbol{x}_i; \boldsymbol{\beta})+\frac{y_i}{\mu (\boldsymbol{x}_i; \boldsymbol{\beta})}\right]

Step 2. Estimate the dispersion parameter: \phi = \frac{1}{n-p}\sum_{i=1}^{n}\frac{\bigl(y_i-\mu(\boldsymbol{x}_i; \boldsymbol{\beta})\bigr)^2}{\mu(\boldsymbol{x}_i; \boldsymbol{\beta} )^2}

(Here, p is the number of coefficients in the model. If p does not count the intercept, use \frac{1}{n-(p+1)} instead.)

Code: Gamma GLM

In Python, we can fit a gamma GLM as follows:

import statsmodels.api as sm

# Add a column of ones to include an intercept in the model
X_train_design = sm.add_constant(X_train)

# Create a Gamma GLM with a log link function
gamma_glm = sm.GLM(y_train, X_train_design,                   
            family=sm.families.Gamma(sm.families.links.Log()))

# Fit the model
gamma_glm = gamma_glm.fit()
gamma_glm.params
const                    7.786576
ordinalencoder__Area    -0.073226
                           ...   
remainder__BonusMalus    0.157204
remainder__Density       0.010539
Length: 9, dtype: float64
# Dispersion Parameter
mus = gamma_glm.predict(X_train_design)
residuals = y_train - mus
dof = (len(y_train)-X_train_design.shape[1])
phi_glm = np.sum(residuals**2/mus**2)/dof
print(phi_glm)
59.63363123735805

ANN can feed into a GLM

Combining GLM & ANN.

Combined Actuarial Neural Network

Lecture Outline

  • Introduction

  • Traditional Regression

  • Stochastic Forecasts

  • GLMs and Neural Networks

  • Combined Actuarial Neural Network

  • Mixture Density Network

  • Metrics for Distributional Regression

  • Aleatoric and Epistemic Uncertainty

  • Avoiding Overfitting

  • Dropout

  • Ensembles

CANN

The Combined Actuarial Neural Network is a novel actuarial neural network architecture proposed by Schelldorfer and Wüthrich (2019). We summarise the CANN approach as follows:

  • Find the coefficients \boldsymbol{\beta} of the GLM with a link function g(\cdot).
  • Find the weights \boldsymbol{w}_{\text{CANN}} of a neural network \mathcal{M}_{\text{CANN}}:\mathbb{R}^{p}\to\mathbb{R}.
  • Given a new instance \boldsymbol{x}, we have \mathbb{E}[Y|\boldsymbol{X}=\boldsymbol{x}] = g^{-1}\Big( \langle\boldsymbol{\beta}, \boldsymbol{x}\rangle + \mathcal{M}_{\text{CANN}}(\boldsymbol{x};\boldsymbol{w}_{\text{CANN}})\Big).

Shifting the predicted distributions

Code
# Ensure reproducibility
random.seed(1)

# Make a 4x1 grid of plots
fig, axes = plt.subplots(4, 1, figsize=(5.0, 3.0), sharex=True)

# Define the x-axis
x_min = 0
x_max = 5000
x_grid = np.linspace(x_min, x_max, 100)

# Plot a few gamma distribution pdfs with different means.
# Then plot gamma distributions with shifted means and the same dispersion parameter.
glm_means = [1000, 3000, 2000, 4000]
cann_means = [1500, 1400, 3000, 5000]
for i, ax in enumerate(axes):
    ax.plot(x_grid, stats.gamma.pdf(x_grid, a=2, scale=glm_means[i]/2), label=f'GLM')
    ax.plot(x_grid, stats.gamma.pdf(x_grid, a=2, scale=cann_means[i]/2), label=f'CANN')
    ax.set_ylabel(f'$f(y | x_{i+1})$')
    if i == 0:
        ax.legend(["GLM", "CANN"], loc="upper right", ncol=2)

Architecture

The CANN architecture.

Code: Architecture

random.seed(1)
inputs = Input(shape=X_train.shape[1:])

# GLM part (won't be updated during training)
glm_weights = gamma_glm.params.iloc[1:].values.reshape((-1, 1))
glm_bias = gamma_glm.params.iloc[0]  
glm_part = Dense(1, activation='linear', trainable=False,
                     kernel_initializer=Constant(glm_weights),
                     bias_initializer=Constant(glm_bias))(inputs)

# Neural network part
x = Dense(64, activation='leaky_relu')(inputs)
nn_part = Dense(1, activation='linear')(x) 

# Combine GLM and CANN estimates
mu = keras.ops.exp(glm_part + nn_part)
cann = Model(inputs, mu)

Since this CANN predicts gamma distributions, we use the gamma NLL loss function.

def cann_negative_log_likelihood(y_true, y_pred):
    return keras.ops.mean(keras.ops.log(y_pred) + y_true/y_pred)

Code: Model Training

cann.compile(optimizer="adam", loss=cann_negative_log_likelihood)
hist = cann.fit(X_train, y_train,
    epochs=100, 
    callbacks=[EarlyStopping(patience=10)],  
    verbose=0,
    batch_size=64, 
    validation_split=0.2)

Find the dispersion parameter.

mus = cann.predict(X_train, verbose=0).flatten()
residuals = y_train - mus
dof = (len(y_train)-(X_train.shape[1] + 1))
phi_cann = np.sum(residuals**2/mus**2) / dof
print(phi_cann)
31.171623242378978

Mixture Density Network

Lecture Outline

  • Introduction

  • Traditional Regression

  • Stochastic Forecasts

  • GLMs and Neural Networks

  • Combined Actuarial Neural Network

  • Mixture Density Network

  • Metrics for Distributional Regression

  • Aleatoric and Epistemic Uncertainty

  • Avoiding Overfitting

  • Dropout

  • Ensembles

Mixture Distribution

Given a finite collection of random variables (Y_1, \ldots, Y_{K}) and a set of mixing weights \boldsymbol{\pi}=(\pi_{1}, \ldots, \pi_{K}) such that \pi_k \ge 0 for k \in \{1, \ldots, K\} and \sum_{k=1}^{K}\pi_k=1, we can build a new random variable Y by drawing a component label from \text{Multinomial}(1, \boldsymbol{\pi}) and taking the corresponding Y_k. This Y is called a mixture of Y_1, \ldots, Y_{K}, i.e., Y = \begin{cases} Y_1 & \text{w.p. } \pi_1, \\ \vdots & \vdots\\ Y_K & \text{w.p. } \pi_K. \end{cases}

Mixture Distribution

Let f_{Y_k|\boldsymbol{X}} and F_{Y_k|\boldsymbol{X}} be the p.d.f. and the c.d.f. of Y_k|\boldsymbol{X} for all k \in \{1, \ldots, K\}.

The random variable Y|\boldsymbol{X}, which mixes the Y_k|\boldsymbol{X}’s with weights \pi_k’s, has the density function f_{Y|\boldsymbol{X}}(y|\boldsymbol{x}) = \sum_{k=1}^{K}\pi_k(\boldsymbol{x}) f_{Y_k|\boldsymbol{X}}(y|\boldsymbol{x}), and the cumulative distribution function F_{Y|\boldsymbol{X}}(y|\boldsymbol{x}) = \sum_{k=1}^{K}\pi_k(\boldsymbol{x}) F_{Y_k|\boldsymbol{X}}(y|\boldsymbol{x}).
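A minimal sketch (with illustrative weights and components only) of sampling from and evaluating the density of a two-component mixture:

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
pis = np.array([0.3, 0.7])                         # mixing weights
components = [stats.norm(0, 1), stats.norm(5, 2)]  # Y_1 and Y_2

# Sampling: draw a component label with probabilities pi, then take that component's draw
labels = rng.choice(len(pis), size=10_000, p=pis)
draws = np.column_stack([c.rvs(size=10_000, random_state=rng) for c in components])
samples = draws[np.arange(10_000), labels]

# Density: the weighted sum of the component densities
y_grid = np.linspace(-4, 12, 200)
mixture_pdf = sum(pi * c.pdf(y_grid) for pi, c in zip(pis, components))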

Mixture Density Network

A mixture density network (MDN) \mathcal{M}_{\boldsymbol{w}^*} outputs each mixture component’s mixing weight and distribution parameters for Y given the input features \boldsymbol{x}, i.e., \mathcal{M}_{\boldsymbol{w}^*}(\boldsymbol{x})=(\boldsymbol{\pi}(\boldsymbol{x};\boldsymbol{w}^*), \boldsymbol{\theta}(\boldsymbol{x};\boldsymbol{w}^*)), where \boldsymbol{w}^* is the network’s weights found by minimising the negative log-likelihood loss function \mathcal{L}(\mathcal{D}, \boldsymbol{w})= - \sum_{i=1}^{n} \log f_{Y|\boldsymbol{X}}(y_i|\boldsymbol{x}_i, \boldsymbol{w}), where \mathcal{D}=\{(\boldsymbol{x}_i,y_i)\}_{i=1}^{n} is the training dataset.

Mixture Density Network

An MDN that outputs the parameters for a K component mixture distribution. \boldsymbol{\theta}_k(\boldsymbol{x}; \boldsymbol{w}^*)= (\theta_{k,1}(\boldsymbol{x}; \boldsymbol{w}^*), \ldots, \theta_{k,|\boldsymbol{\theta}_k|}(\boldsymbol{x}; \boldsymbol{w}^*)) consists of the parameter estimates for the kth mixture component.

Model Specification

Suppose there are two types of claims:

  • Type I: Y_1|\boldsymbol{X}=\boldsymbol{x}\sim \text{Gamma}(\alpha_1(\boldsymbol{x}), \beta_1(\boldsymbol{x})) and,
  • Type II: Y_2|\boldsymbol{X}=\boldsymbol{x}\sim \text{Gamma}(\alpha_2(\boldsymbol{x}), \beta_2(\boldsymbol{x})).

The density of the actual claim amount Y|\boldsymbol{X}=\boldsymbol{x} follows \begin{align*} f_{Y|\boldsymbol{X}}(y|\boldsymbol{x}) &= \pi_1(\boldsymbol{x})\cdot \frac{\beta_1(\boldsymbol{x})^{\alpha_1(\boldsymbol{x})}}{\Gamma(\alpha_1(\boldsymbol{x}))}\mathrm{e}^{-\beta_1(\boldsymbol{x})y}y^{\alpha_1(\boldsymbol{x})-1} \\ &\quad + (1-\pi_1(\boldsymbol{x}))\cdot \frac{\beta_2(\boldsymbol{x})^{\alpha_2(\boldsymbol{x})}}{\Gamma(\alpha_2(\boldsymbol{x}))}\mathrm{e}^{-\beta_2(\boldsymbol{x})y}y^{\alpha_2(\boldsymbol{x})-1}. \end{align*} where \pi_1(\boldsymbol{x}) is the probability of a Type I claim given \boldsymbol{x}.

Output

The aim is to find the optimum weights \boldsymbol{w}^* = \underset{\boldsymbol{w}}{\text{arg\,min}} \ \mathcal{L}(\mathcal{D}, \boldsymbol{w}) for the Gamma mixture density network \mathcal{M}_{\boldsymbol{w}^*} that outputs the mixing weights, shape and rate parameters of Y given the input features \boldsymbol{x}, i.e., \begin{align*} \mathcal{M}_{\boldsymbol{w}^*}(\boldsymbol{x}) = ( &\pi_1(\boldsymbol{x}; \boldsymbol{w}^*), \pi_2(\boldsymbol{x}; \boldsymbol{w}^*), \\ &\alpha_1(\boldsymbol{x}; \boldsymbol{w}^*), \alpha_2(\boldsymbol{x}; \boldsymbol{w}^*), \\ &\beta_1(\boldsymbol{x}; \boldsymbol{w}^*), \beta_2(\boldsymbol{x}; \boldsymbol{w}^*) ). \end{align*}

Architecture

We demonstrate the structure of a gamma MDN that outputs the parameters for a gamma mixture with two components.

Code: Import “legacy” Keras (for now)

import tf_keras

Code: Architecture

The following code resembles the architecture of the gamma MDN from the previous slide.

# Ensure reproducibility
random.seed(1);

inputs = tf_keras.layers.Input(shape=X_train.shape[1:])

# Two hidden layers 
x = tf_keras.layers.Dense(64, activation='relu')(inputs)
x = tf_keras.layers.Dense(64, activation='relu')(x)

pis = tf_keras.layers.Dense(2, activation='softmax')(x) # Mixing weights
alphas = tf_keras.layers.Dense(2, activation='exponential')(x) # Shape parameters
betas = tf_keras.layers.Dense(2, activation='exponential')(x) # Rate parameters
out = tf_keras.layers.Concatenate(axis=1)([pis, alphas, betas]) # shape = (None, 6)

gamma_mdn = tf_keras.Model(inputs, out)

Loss Function

The negative log-likelihood loss function is given by

\mathcal{L}(\mathcal{D}, \boldsymbol{w}) = - \frac{1}{n} \sum_{i=1}^{n} \log f_{Y|\boldsymbol{X}}(y_i|\boldsymbol{x}_i, \boldsymbol{w}), where f_{Y|\boldsymbol{X}}(y|\boldsymbol{x}, \boldsymbol{w}) is defined by \begin{align*} &\pi_1(\boldsymbol{x};\boldsymbol{w})\cdot \frac{\beta_1(\boldsymbol{x};\boldsymbol{w})^{\alpha_1(\boldsymbol{x};\boldsymbol{w})}}{\Gamma(\alpha_1(\boldsymbol{x};\boldsymbol{w}))}\mathrm{e}^{-\beta_1(\boldsymbol{x};\boldsymbol{w})y}y^{\alpha_1(\boldsymbol{x};\boldsymbol{w})-1} \\ & \quad + (1-\pi_1(\boldsymbol{x};\boldsymbol{w}))\cdot \frac{\beta_2(\boldsymbol{x};\boldsymbol{w})^{\alpha_2(\boldsymbol{x};\boldsymbol{w})}}{\Gamma(\alpha_2(\boldsymbol{x};\boldsymbol{w}))}\mathrm{e}^{-\beta_2(\boldsymbol{x};\boldsymbol{w})y}y^{\alpha_2(\boldsymbol{x};\boldsymbol{w})-1}. \end{align*}

Code: Loss & training

tensorflow_probability to the rescue.

import tensorflow_probability as tfp
tfd = tfp.distributions

def gamma_mixture_nll(y_true, y_pred):   
    K = y_pred.shape[1] // 3
    pis = y_pred[:, :K]                                                    
    alphas = y_pred[:, K:2*K]                                               
    betas = y_pred[:, 2*K:3*K]
    mixture_distribution = tfd.MixtureSameFamily(
        mixture_distribution=tfd.Categorical(probs=pis),
        components_distribution=tfd.Gamma(alphas, betas))
    return -tf_keras.backend.mean(mixture_distribution.log_prob(y_true))
gamma_mdn.compile(optimizer="adam", loss=gamma_mixture_nll)

hist = gamma_mdn.fit(X_train, y_train,
    epochs=100, 
    callbacks=[tf_keras.callbacks.EarlyStopping(patience=10)],  
    verbose=0,
    batch_size=64, 
    validation_split=0.2)

Metrics for Distributional Regression

Lecture Outline

  • Introduction

  • Traditional Regression

  • Stochastic Forecasts

  • GLMs and Neural Networks

  • Combined Actuarial Neural Network

  • Mixture Density Network

  • Metrics for Distributional Regression

  • Aleatoric and Epistemic Uncertainty

  • Avoiding Overfitting

  • Dropout

  • Ensembles

Proper Scoring Rules

Definition

A scoring rule is the equivalent of a loss function for distributional regression.

Denote by S(F, y) the score assigned to the forecast distribution F and an observation y \in \mathbb{R}.

Definition

A scoring rule is called proper if \mathbb{E}_{Y \sim Q} S(Q, Y) \le \mathbb{E}_{Y \sim Q} S(F, Y) for all distributions F and Q.

It is called strictly proper if equality holds only if F = Q.

Example Proper Scoring Rules

Logarithmic Score (NLL)

The logarithmic score is defined as \mathrm{LogS}(f, y) = - \log f(y), where f is the predictive density.

Continuous Ranked Probability Score (CRPS)

The continuous ranked probability score is defined as \mathrm{crps}(F, y) = \int_{-\infty}^{\infty} (F(t) - {1}_{t\ge y})^2 \ \mathrm{d}t, where F is the predicted c.d.f.
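Both scores are easy to evaluate numerically for a normal predictive distribution; the sketch below uses a made-up forecast \mathcal{N}(1, 2^2) and observation y = 0.5, and approximates the CRPS integral with the trapezoidal rule.

import numpy as np
from scipy import stats

mu, sigma, y_obs = 1.0, 2.0, 0.5  # made-up forecast and observation

log_score = -stats.norm.logpdf(y_obs, mu, sigma)

t = np.linspace(mu - 10 * sigma, mu + 10 * sigma, 10_001)
crps = np.trapz((stats.norm.cdf(t, mu, sigma) - (t >= y_obs)) ** 2, t)
print(log_score, crps)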

Likelihoods

Code
y_pred = np.polyval(coefficients, X_toy[:4])
y_pred[2] *= 1.1
sigma_preds = sigma_toy * np.array([1.0, 3.0, 0.5, 0.5])
fig, axes = plt.subplots(1, 4, figsize=(5.0, 3.0), sharey=True)

x_min = y_pred[:4].min() - 4*sigma_toy
x_max = y_pred[:4].max() + 4*sigma_toy
x_grid = np.linspace(x_min, x_max, 100)

# Plot each normal distribution with different means vertically
for i, ax in enumerate(axes):
    y_grid = stats.norm.pdf(x_grid, y_pred[i], sigma_preds[i])
    ax.plot(x_grid, y_grid)
    ax.plot([y_toy[i], y_toy[i]], [0, stats.norm.pdf(y_toy[i], y_pred[i], sigma_preds[i])], color='red', linestyle='--')
    ax.scatter([y_toy[i]], [stats.norm.pdf(y_toy[i], y_pred[i], sigma_preds[i])], color='red', zorder=10)
    ax.set_title(f'$f(y ; \\boldsymbol{{x}}_{{{i+1}}})$')
    ax.set_xticks([y_pred[i]], labels=[r'$\mu_{' + str(i+1) + r'}$'])
    # ax.set_ylim(0, 0.25)

    # Turn off the y axes
    ax.yaxis.set_visible(False)
    
plt.tight_layout();

Code: NLL

def gamma_nll(mean, dispersion, y):
    # Calculate shape and scale parameters from mean and dispersion
    shape = 1 / dispersion; scale = mean * dispersion

    # Create a gamma distribution object
    gamma_dist = stats.gamma(a=shape, scale=scale)
    
    return -np.mean(gamma_dist.logpdf(y))

# GLM
X_test_design = sm.add_constant(X_test)
mus = gamma_glm.predict(X_test_design)
nll_glm = gamma_nll(mus, phi_glm, y_test)

# CANN
mus = cann.predict(X_test, verbose=0)
nll_cann = gamma_nll(mus, phi_cann, y_test)

# MDN
nll_mdn = gamma_mdn.evaluate(X_test, y_test, verbose=0)

Model Comparisons

print(f'GLM: {round(nll_glm, 2)}')
print(f'CANN: {round(nll_cann, 2)}')
print(f'MDN: {round(nll_mdn, 2)}')
GLM: 11.02
CANN: 10.44
MDN: 8.67

Aleatoric and Epistemic Uncertainty

Lecture Outline

  • Introduction

  • Traditional Regression

  • Stochastic Forecasts

  • GLMs and Neural Networks

  • Combined Actuarial Neural Network

  • Mixture Density Network

  • Metrics for Distributional Regression

  • Aleatoric and Epistemic Uncertainty

  • Avoiding Overfitting

  • Dropout

  • Ensembles

Categories of uncertainty

There are two major categories of uncertainty in statistical and machine learning modelling:

  • Aleatoric uncertainty
  • Epistemic uncertainty

Since there is no consensus on the definitions of aleatoric and epistemic uncertainty, we provide the most widely accepted definitions in the following slides.

Aleatoric Uncertainty

Qualitative Definition

Aleatoric uncertainty refers to the statistical variability and inherent noise in the data distribution that modelling cannot explain.

Quantitative Definition

\text{Ale}(Y|\boldsymbol{X}=\boldsymbol{x}) = \mathbb{V}[Y|\boldsymbol{X}=\boldsymbol{x}], i.e., if Y|\boldsymbol{X}=\boldsymbol{x} \sim \mathcal{N}(\mu, \sigma^2), the aleatoric uncertainty would be \sigma^2. Simply put, it is the conditional variance of the response variable Y given the features/covariates \boldsymbol{x}.

Epistemic Uncertainty

Qualitative Definition

Epistemic uncertainty refers to uncertainty arising from a lack of knowledge: limited data, parameter error and model error.

Quantitative Definition

\text{Epi}(Y|\boldsymbol{X}=\boldsymbol{x}) = \text{Uncertainty}(Y|\boldsymbol{X}=\boldsymbol{x}) - \text{Ale}(Y|\boldsymbol{X}=\boldsymbol{x}),

i.e., the epistemic uncertainty is the total uncertainty minus the aleatoric uncertainty \mathbb{V}[Y|\boldsymbol{X}=\boldsymbol{x}].
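For example, if an ensemble gives M Gaussian predictive distributions \mathcal{N}(\mu_m, \sigma_m^2) for the same \boldsymbol{x}, the law of total variance splits the total uncertainty into these two parts (a sketch with made-up numbers):

import numpy as np

mus = np.array([2.1, 1.9, 2.3])      # made-up ensemble means
sigmas = np.array([0.5, 0.6, 0.55])  # made-up ensemble standard deviations

aleatoric = np.mean(sigmas ** 2)  # average within-member variance
epistemic = np.var(mus)           # disagreement between the members' means
total = aleatoric + epistemic     # variance of the equally weighted mixture
print(aleatoric, epistemic, total)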

Sources of uncertainty

If you decide to predict the claim amount of an individual using a deep learning model, which source(s) of uncertainty are you dealing with?

  1. The inherent variability of the data-generating process \rightarrow aleatoric uncertainty.
  2. Parameter error \rightarrow epistemic uncertainty.
  3. Model error \rightarrow epistemic uncertainty.
  4. Data uncertainty \rightarrow epistemic uncertainty.

Avoiding Overfitting

Lecture Outline

  • Introduction

  • Traditional Regression

  • Stochastic Forecasts

  • GLMs and Neural Networks

  • Combined Actuarial Neural Network

  • Mixture Density Network

  • Metrics for Distributional Regression

  • Aleatoric and Epistemic Uncertainty

  • Avoiding Overfitting

  • Dropout

  • Ensembles

Traditional regularisation

Say all the m weights (excluding biases) are in the vector \boldsymbol{\theta}. If we change the loss function to \text{Loss}_{1:n} = \frac{1}{n} \sum_{i=1}^n \text{Loss}_i + \lambda \sum_{j=1}^{m} \left| \theta_j \right|

this would be using L^1 regularisation. A loss like

\text{Loss}_{1:n} = \frac{1}{n} \sum_{i=1}^n \text{Loss}_i + \lambda \sum_{j=1}^{m} \theta_j^2

is called L^2 regularisation.

Regularisation in Keras

Code
from sklearn.datasets import fetch_california_housing
features, target = fetch_california_housing(as_frame=True, return_X_y=True)

NUM_FEATURES = len(features.columns)

X_main, X_test, y_main, y_test = train_test_split(
    features, target, test_size=0.2, random_state=1
)
X_train, X_val, y_train, y_val = train_test_split(
    X_main, y_main, test_size=0.25, random_state=1
)

scaler = StandardScaler()
X_train_sc = scaler.fit_transform(X_train)
X_val_sc = scaler.transform(X_val)
X_test_sc = scaler.transform(X_test)
from keras.regularizers import L1, L2

def l1_model(regulariser_strength=0.01):
  random.seed(123)
  model = Sequential([
      Dense(30, activation="leaky_relu",
        kernel_regularizer=L1(regulariser_strength)),
      Dense(1, activation="exponential")
  ])

  model.compile("adam", "mse")
  model.fit(X_train_sc, y_train, epochs=4, verbose=0)
  return model

def l2_model(regulariser_strength=0.01):
  random.seed(123)
  model = Sequential([
      Dense(30, activation="leaky_relu",
        kernel_regularizer=L2(regulariser_strength)),
      Dense(1, activation="exponential")
  ])

  model.compile("adam", "mse")
  model.fit(X_train_sc, y_train, epochs=10, verbose=0)
  return model

Weights before & after L^2

model = l2_model(0.0)
weights = model.layers[0].get_weights()[0].flatten()
print(f"Number of weights almost 0: {np.sum(np.abs(weights) < 1e-5)}")
plt.hist(weights, bins=100);
Number of weights almost 0: 0

model = l2_model(1.0)
weights = model.layers[0].get_weights()[0].flatten()
print(f"Number of weights almost 0: {np.sum(np.abs(weights) < 1e-5)}")
plt.hist(weights, bins=100);
Number of weights almost 0: 0

Weights before & after L^1

model = l1_model(0.0)
weights = model.layers[0].get_weights()[0].flatten()
print(f"Number of weights almost 0: {np.sum(np.abs(weights) < 1e-5)}")
plt.hist(weights, bins=100);
Number of weights almost 0: 0

model = l1_model(1.0)
weights = model.layers[0].get_weights()[0].flatten()
print(f"Number of weights almost 0: {np.sum(np.abs(weights) < 1e-5)}")
plt.hist(weights, bins=100);
Number of weights almost 0: 36

Early-stopping regularisation

A very different way to regularize iterative learning algorithms such as gradient descent is to stop training as soon as the validation error reaches a minimum. This is called early stopping… It is such a simple and efficient regularization technique that Geoffrey Hinton called it a “beautiful free lunch”.

Alternatively, you can try building a model with slightly more layers and neurons than you actually need, then use early stopping and other regularization techniques to prevent it from overfitting too much. Vincent Vanhoucke, a scientist at Google, has dubbed this the “stretch pants” approach: instead of wasting time looking for pants that perfectly match your size, just use large stretch pants that will shrink down to the right size.

Dropout

Lecture Outline

  • Introduction

  • Traditional Regression

  • Stochastic Forecasts

  • GLMs and Neural Networks

  • Combined Actuarial Neural Network

  • Mixture Density Network

  • Metrics for Distributional Regression

  • Aleatoric and Epistemic Uncertainty

  • Avoiding Overfitting

  • Dropout

  • Ensembles

Dropout

An example of neurons dropped during training.

Dropout quote #1

It’s surprising at first that this destructive technique works at all. Would a company perform better if its employees were told to toss a coin every morning to decide whether or not to go to work? Well, who knows; perhaps it would! The company would be forced to adapt its organization; it could not rely on any single person to work the coffee machine or perform any other critical tasks, so this expertise would have to be spread across several people. Employees would have to learn to cooperate with many of their coworkers, not just a handful of them.

Dropout quote #2

The company would become much more resilient. If one person quit, it wouldn’t make much of a difference. It’s unclear whether this idea would actually work for companies, but it certainly does for neural networks. Neurons trained with dropout cannot co-adapt with their neighboring neurons; they have to be as useful as possible on their own. They also cannot rely excessively on just a few input neurons; they must pay attention to each of their input neurons. They end up being less sensitive to slight changes in the inputs. In the end, you get a more robust network that generalizes better.

Code: Dropout

Dropout is just another layer in Keras.

from keras.layers import Dropout

random.seed(2); 

model = Sequential([
    Dense(30, activation="leaky_relu"),
    Dropout(0.2),
    Dense(30, activation="leaky_relu"),
    Dropout(0.2),
    Dense(1, activation="exponential")
])

model.compile("adam", "mse")
model.fit(X_train_sc, y_train, epochs=4, verbose=0);

Code: Dropout after training

Making predictions is the same as any other model:

model.predict(X_train_sc.head(3),
                  verbose=0)
array([[1.0587903],
       [1.2814349],
       [0.9994641]], dtype=float32)
model.predict(X_train_sc.head(3),
                  verbose=0)
array([[1.0587903],
       [1.2814349],
       [0.9994641]], dtype=float32)

We can make the model think it is still training:

model(X_train_sc.head(3),
    training=True).numpy()
array([[1.082524  ],
       [0.74211466],
       [1.1583111 ]], dtype=float32)
model(X_train_sc.head(3),
    training=True).numpy()
array([[1.0132376],
       [1.2697867],
       [0.7800578]], dtype=float32)
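This is the idea behind Monte Carlo dropout: keep dropout active at prediction time, average many stochastic passes, and use their spread as a rough uncertainty estimate. A sketch reusing the dropout model above:

# 100 stochastic forward passes with dropout still active
mc_preds = np.stack([
    model(X_train_sc.head(3), training=True).numpy() for _ in range(100)
])  # shape: (100, 3, 1)

mc_mean = mc_preds.mean(axis=0)  # MC dropout predictive mean
mc_std = mc_preds.std(axis=0)    # spread across the stochastic passes
print(mc_mean.flatten(), mc_std.flatten())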

Dropout Limitation

  • Increased training time: since dropout introduces noise into training, convergence can be slower.
  • Sensitivity to the dropout rate: the performance of dropout depends heavily on the chosen dropout rate.

Accidental dropout (“dead neurons”)

My first ANN for California housing


random.seed(123)

model = Sequential([
    Dense(30, activation="relu"),
    Dense(1)
])

model.compile("adam", "mse")
hist = model.fit(X_train, y_train,
        epochs=5, verbose=0)
hist.history["loss"]
[25089.478515625,
 12.956829071044922,
 13.395614624023438,
 7.074806213378906,
 5.800335884094238]

Find dead ReLU neurons

acts = model.layers[0](X_train).numpy()
print(X_train.shape, acts.shape)
acts[:3]
(12384, 8) (12384, 30)
array([[261.458   , 502.33704 ,  93.64283 , ..., 537.54865 , 325.7366  ,
        398.99435 ],
       [ 18.983932,  52.9067  ,   0.      , ...,  28.361092,  10.988864,
         58.194595],
       [266.2954  , 517.58154 ,  98.64309 , ..., 553.68005 , 336.69986 ,
        411.61124 ]], dtype=float32)
dead = acts.mean(axis=0) == 0
np.sum(dead)
7
idx = np.where(dead)[0][0]
acts[:, idx-1:idx+2]
array([[ 0.      ,  0.      ,  0.      ],
       [18.991873,  0.      ,  0.      ],
       [ 0.      ,  0.      ,  0.      ],
       ...,
       [ 0.      ,  0.      ,  0.      ],
       [ 0.      ,  0.      ,  0.      ],
       [ 0.      ,  0.      ,  0.      ]], dtype=float32)

Trying different seeds

Create a function which counts the number of dead ReLU neurons in the first hidden layer for a given seed:

def count_dead(seed):
    random.seed(seed)
    hidden = Dense(30, activation="relu")
    acts = hidden(X_train).numpy()
    return np.sum(acts.mean(axis=0) == 0)

Then we can try out different seeds:

num_dead = [count_dead(seed) for seed in range(1_000)]
np.median(num_dead)
5.0

Look at distribution of dead ReLUs

labels, counts = np.unique(num_dead, return_counts=True)
plt.bar(labels, counts, align='center');

Ensembles

Lecture Outline

  • Introduction

  • Traditional Regression

  • Stochastic Forecasts

  • GLMs and Neural Networks

  • Combined Actuarial Neural Network

  • Mixture Density Network

  • Metrics for Distributional Regression

  • Aleatoric and Epistemic Uncertainty

  • Avoiding Overfitting

  • Dropout

  • Ensembles

Ensembles

Combine many models to get better predictions.

Deep Ensembles

Train M neural networks with different random initial weights independently (even in parallel).

def build_model(seed):
    random.seed(seed)
    model = Sequential([
        Dense(30, activation="leaky_relu"),
        Dense(1, activation="exponential")
    ])
    model.compile("adam", "mse")

    es = EarlyStopping(restore_best_weights=True, patience=5)
    model.fit(X_train_sc, y_train, epochs=1_000,
        callbacks=[es], validation_data=(X_val_sc, y_val), verbose=False)
    return model
M = 3
seeds = range(M)
models = []
for seed in seeds:
    models.append(build_model(seed))

Deep Ensembles II

Denote the trained weights by \boldsymbol{w}^{(1)}, \ldots, \boldsymbol{w}^{(M)}; then we get the predictions \bigl\{ \hat{y}(\boldsymbol{x}; \boldsymbol{w}^{(m)}) \bigr\}_{m=1}^{M}.

y_preds = []
for model in models:
    y_preds.append(model.predict(X_test_sc, verbose=0))

y_preds = np.array(y_preds)
y_preds
array([[[3.2801466 ],
        [0.76298356],
        [2.4068608 ],
        ...,
        [2.3385763 ],
        [2.1730225 ],
        [1.096715  ]],

       [[3.1832185 ],
        [0.72296774],
        [2.5727806 ],
        ...,
        [2.3812106 ],
        [2.27971   ],
        [1.06247   ]],

       [[3.0994337 ],
        [0.77855957],
        [2.6037261 ],
        ...,
        [2.5923505 ],
        [2.354589  ],
        [1.2019893 ]]], dtype=float32)
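To turn the ensemble into a single forecast, one common approach (sketched below) is to average the members’ predictions and treat their disagreement as a rough indication of epistemic uncertainty.

ensemble_mean = y_preds.mean(axis=0)  # average prediction, shape (n_test, 1)
ensemble_std = y_preds.std(axis=0)    # disagreement between the M networks
print(ensemble_mean[:3].flatten(), ensemble_std[:3].flatten())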

Package Versions

from watermark import watermark
print(watermark(python=True, packages="keras,matplotlib,numpy,pandas,seaborn,scipy,torch,tensorflow,tensorflow_probability,tf_keras"))
Python implementation: CPython
Python version       : 3.11.9
IPython version      : 8.24.0

keras                 : 3.3.3
matplotlib            : 3.9.0
numpy                 : 1.26.4
pandas                : 2.2.2
seaborn               : 0.13.2
scipy                 : 1.11.0
torch                 : 2.3.1
tensorflow            : 2.16.1
tensorflow_probability: 0.24.0
tf_keras              : 2.16.0

Glossary

  • aleatoric and epistemic uncertainty
  • combined actuarial neural network
  • dead ReLU
  • deep ensembles
  • distributional forecasts
  • dropout
  • generalised linear model
  • mixture density network
  • mixture distribution
  • Monte Carlo dropout
  • proper scoring rule