Lab: Backpropagation

ACTL3143 & ACTL5111 Deep Learning for Actuaries

Backpropagation performs a backward pass through the network to compute the gradient of the loss with respect to every parameter; gradient descent then uses these gradients to update the neural network weights.

Linear Regression via Batch Gradient Descent

Let $\boldsymbol{\theta}^{(t)}=(w^{(t)}, b^{(t)})$ be the parameter estimates at the $t$th iteration, let $\mathcal{D}= \{(x_i, y_i)\}_{i=1}^{N}$ represent the training batch, and let the loss/cost function $\mathcal{L}$ be the mean squared error (MSE).

Finding the Gradients

  • Step 1: Write down $\mathcal{L}(\mathcal{D}, \boldsymbol{\theta}^{(t)})$ and $\hat{y}(x_i; \boldsymbol{\theta}^{(t)})$:
$$\begin{align*} \mathcal{L}(\mathcal{D},\boldsymbol{\theta}^{(t)}) &=\frac{1}{N} \sum_{i=1}^{N} \big(\hat{y}(x_i; \boldsymbol{\theta}^{(t)}) - y_i \big)^2 \\ \hat{y}(x_i; \boldsymbol{\theta}^{(t)}) &= w^{(t)}x_i + b^{(t)} \end{align*}$$
  • Step 2: Derive $\frac{\partial \mathcal{L}(\hat{y}(x_i; \boldsymbol{\theta}^{(t)}), y_i)}{\partial \hat{y}(x_i; \boldsymbol{\theta}^{(t)})}$ and $\frac{\partial\hat{y}(x_i; \boldsymbol{\theta}^{(t)})}{\partial \boldsymbol{\theta}^{(t)}}$:
$$\begin{align*} \frac{\partial \mathcal{L}(\hat{y}(x_i; \boldsymbol{\theta}^{(t)}), y_i)}{\partial \hat{y}(x_i; \boldsymbol{\theta}^{(t)})} & = 2 \big(\hat{y}(x_i; \boldsymbol{\theta}^{(t)}) - y_i \big) \\ \frac{\partial\hat{y}(x_i; \boldsymbol{\theta}^{(t)})}{\partial w^{(t)}} & = x_i \\ \frac{\partial\hat{y}(x_i; \boldsymbol{\theta}^{(t)})}{\partial b^{(t)}} & = 1 \end{align*}$$
  • Step 3: Derive $\frac{\partial \mathcal{L}(\mathcal{D}, \boldsymbol{\theta}^{(t)})}{\partial \boldsymbol{\theta}^{(t)}}$:
$$\frac{\partial \mathcal{L}(\mathcal{D}, \boldsymbol{\theta}^{(t)})}{\partial w^{(t)}} = \frac{1}{N}\sum_{i=1}^{N}\frac{\partial \mathcal{L}(\hat{y}(x_i; \boldsymbol{\theta}^{(t)}), y_i)}{\partial \hat{y}(x_i; \boldsymbol{\theta}^{(t)})} \frac{\partial\hat{y}(x_i; \boldsymbol{\theta}^{(t)})}{\partial w^{(t)}} = \frac{2}{N} \sum_{i=1}^{N} \big(\hat{y}(x_i; \boldsymbol{\theta}^{(t)}) - y_i \big) \cdot x_i \tag{1}$$
$$\frac{\partial \mathcal{L}(\mathcal{D}, \boldsymbol{\theta}^{(t)})}{\partial b^{(t)}} = \frac{1}{N}\sum_{i=1}^{N}\frac{\partial \mathcal{L}(\hat{y}(x_i; \boldsymbol{\theta}^{(t)}), y_i)}{\partial \hat{y}(x_i; \boldsymbol{\theta}^{(t)})} \frac{\partial\hat{y}(x_i; \boldsymbol{\theta}^{(t)})}{\partial b^{(t)}} = \frac{2}{N} \sum_{i=1}^{N} \big(\hat{y}(x_i; \boldsymbol{\theta}^{(t)}) - y_i \big) \cdot 1 \tag{2}$$

Then, we initialise $\boldsymbol{\theta}^{(0)} = (w^{(0)}, b^{(0)})$ and apply gradient descent for $t=0, 1, 2, \ldots$:
$$\begin{align*} w^{(t+1)} &= w^{(t)} - \eta \cdot \frac{\partial \mathcal{L}(\mathcal{D}, \boldsymbol{\theta}^{(t)})}{\partial w}\bigg|_{w^{(t)}} \\ b^{(t+1)} &= b^{(t)} - \eta \cdot \frac{\partial \mathcal{L}(\mathcal{D}, \boldsymbol{\theta}^{(t)})}{\partial b}\bigg|_{b^{(t)}} \end{align*}$$
using the derivatives in Equations (1) and (2), where $\eta$ is a chosen learning rate.

Exercise

  1. Use the backpropagation algorithm to find $\boldsymbol{\theta}^{(3)}$ with $\boldsymbol{\theta}^{(0)}= (w^{(0)} = 1, b^{(0)} = 0)$. The dataset $\mathcal{D}$ is as follows:

That is, the true model is $y_i = 3 x_i + 1$, i.e., $w = 3$ and $b = 1$. Implement batch gradient descent.
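For reference, here is a minimal NumPy sketch of the batch gradient descent updates in Equations (1) and (2). Since the dataset table is not reproduced in these notes, the $x_i$ values below are assumed and the responses are generated from the true model $y_i = 3x_i + 1$; the learning rate is also an assumed value.

import numpy as np

# Assumed inputs (the lab's dataset table is not reproduced here);
# responses follow the true model y = 3x + 1
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 3 * x + 1

# Initial estimates theta^(0) = (w^(0) = 1, b^(0) = 0)
w, b = 1.0, 0.0
eta = 0.05  # assumed learning rate

for t in range(3):  # three iterations give theta^(3)
    y_hat = w * x + b                      # predictions under theta^(t)
    grad_w = np.mean(2 * (y_hat - y) * x)  # Equation (1)
    grad_b = np.mean(2 * (y_hat - y))      # Equation (2)
    w -= eta * grad_w                      # gradient descent update for w
    b -= eta * grad_b                      # gradient descent update for b

print(w, b)  # theta^(3)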

Neural Network

For a neural network with $H$ hidden layers:

  • $L_0$ is the input layer (the zeroth layer). $L_k$ represents the $k$th hidden layer for $k\in \{1, 2, \ldots, H\}$. $L_{H+1}$ is the output layer (the $(H+1)$th layer).
  • $\phi^{(k)}$ represents the activation function for the $k$th hidden layer, with $k\in \{1, 2, \ldots, H\}$. $\phi^{(H+1)}$ represents the activation function for the output layer.
  • $\boldsymbol{w}^{(k)}_j$ represents the weights connecting the activated neurons $\boldsymbol{a}^{(k-1)}$ from the $(k-1)$th layer to the $j$th neuron in the $k$th layer, where $k\in \{1, \ldots, H+1\}$ and $j\in \{1, \ldots, q_{k}\}$; here $q_{k}$ denotes the number of neurons in the $k$th layer. By definition, $\boldsymbol{a}^{(0)} = \boldsymbol{z}^{(0)} = \boldsymbol{x}$.
  • $b^{(k)}_j$ represents the bias for the $j$th neuron in the $k$th layer.
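Putting this notation together, the forward pass computes, for each layer $k \in \{1, \ldots, H+1\}$ and each neuron $j \in \{1, \ldots, q_k\}$,
$$ z^{(k)}_j = \langle \boldsymbol{a}^{(k-1)}, \boldsymbol{w}^{(k)}_j \rangle + b^{(k)}_j, \qquad a^{(k)}_j = \phi^{(k)}\big(z^{(k)}_j\big). $$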

Gradients For the Output Layer

The gradient for $\boldsymbol{w}_1^{(H+1)}$, i.e., the weights connecting the neurons in the $H$th (last) hidden layer to the first neuron of the output layer, is given by
$$ \frac{\partial \mathcal{L}(\mathcal{D}, \boldsymbol{\theta})}{\partial \boldsymbol{w}^{(H+1)}_1} = \frac{\partial \mathcal{L}(\mathcal{D}, \boldsymbol{\theta})}{\partial \hat{y}_1} \frac{\partial \hat{y}_1}{\partial z^{(H+1)}_1 } \frac{\partial z^{(H+1)}_1}{\partial \boldsymbol{w}^{(H+1)}_1} \tag{3} $$
where

  • $\hat{y}_1 = a^{(H+1)}_1 = \phi^{(H+1)} (z^{(H+1)}_1)$,
  • $z^{(H+1)}_1 = \langle \boldsymbol{a}^{(H)}, \boldsymbol{w}_1^{(H+1)} \rangle + b^{(H+1)}_1$,
  • $\langle \cdot, \cdot \rangle$ represents the inner product.
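Since $\frac{\partial z^{(H+1)}_1}{\partial \boldsymbol{w}^{(H+1)}_1} = \boldsymbol{a}^{(H)}$, writing $\delta^{(H+1)}_1 = \frac{\partial \mathcal{L}(\mathcal{D}, \boldsymbol{\theta})}{\partial \hat{y}_1} \frac{\partial \hat{y}_1}{\partial z^{(H+1)}_1}$ (the same $\delta$ notation used for the hidden layers below), Equation (3) can be written compactly as
$$ \frac{\partial \mathcal{L}(\mathcal{D}, \boldsymbol{\theta})}{\partial \boldsymbol{w}^{(H+1)}_1} = \delta^{(H+1)}_1 \, \boldsymbol{a}^{(H)}. $$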

Gradients For the Hidden Layers

The gradient for $\boldsymbol{w}_1^{(k)}$, i.e., the weights connecting the activated neurons $\boldsymbol{a}^{(k-1)}$ to the first neuron of the $k$th hidden layer $a^{(k)}_1$, is given by
$$ \begin{aligned} \frac{\partial \mathcal{L}(\mathcal{D}, \boldsymbol{\theta})}{\partial \boldsymbol{w}^{(k)}_1} &= \underbrace{\textcolor{blue}{\frac{\partial \mathcal{L}(\mathcal{D}, \boldsymbol{\theta})}{\partial a^{(k)}_{1}}} \frac{\partial a^{(k)}_{1}}{\partial z^{(k)}_1 }}_{\delta_{1}^{(k)}} \frac{\partial z^{(k)}_1}{\partial \boldsymbol{w}^{(k)}_1} \\ &= \underbrace{\textcolor{blue}{\sum_{l\in \{1,\ldots,q_{k+1}\}} \frac{\partial \mathcal{L}(\mathcal{D}, \boldsymbol{\theta})}{ \partial z^{(k+1)}_l} \frac{\partial z^{(k+1)}_l }{\partial a_{1}^{(k)}}}}_{\textcolor{blue}{\text{Total Derivative}}} \frac{\partial a^{(k)}_{1}}{\partial z_1^{(k)}} \frac{\partial z^{(k)}_1}{\partial \boldsymbol{w}^{(k)}_1} \\ & = \underbrace{\textcolor{blue}{\sum_{l\in \{1,\ldots,q_{k+1}\}} \delta_l^{(k+1)} w_{1,l}^{(k+1)}} \frac{\partial a^{(k)}_{1}}{\partial z_1^{(k)}}}_{\delta^{(k)}_1} \boldsymbol{a}^{(k-1)} \end{aligned} \tag{4} $$

Based on Equation (4), the derivative of the loss function with respect to the pre-activated value of the $i$th neuron in the $k$th hidden layer is given by
$$ \delta^{(k)}_i = \frac{\partial \mathcal{L}(\mathcal{D}, \boldsymbol{\theta})}{\partial a^{(k)}_{i}} \frac{\partial a^{(k)}_{i}}{\partial z^{(k)}_i} = \sum_{l\in \{1,\ldots,q_{k+1}\}} \delta_l^{(k+1)} w_{i,l}^{(k+1)} \frac{\partial a^{(k)}_{i}}{\partial z^{(k)}_i}. $$
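This recursion can be implemented in vectorised form. The sketch below is a minimal illustration only, not part of the lab code: the function name, the argument names, and the conventions that observations are stored row-wise and that the layer-$(k+1)$ weight matrix has entry $[i, l] = w^{(k+1)}_{i,l}$ are all assumptions.

import numpy as np

def backprop_hidden_layer(delta_next, W_next, z_k, a_prev, phi_prime):
    # delta_next: (N, q_{k+1}) array of deltas for layer k+1
    # W_next:     (q_k, q_{k+1}) weight matrix, entry [i, l] = w_{i,l}^{(k+1)}
    # z_k:        (N, q_k) pre-activations of layer k
    # a_prev:     (N, q_{k-1}) activations of layer k-1
    # phi_prime:  derivative of the layer-k activation function
    delta_k = (delta_next @ W_next.T) * phi_prime(z_k)  # the delta recursion in Equation (4)
    grad_W_k = a_prev.T @ delta_k / len(a_prev)         # batch-averaged gradients, one column per neuron
    return delta_k, grad_W_k

The "From Scratch" section below carries out the same computation, writing out each neuron's delta explicitly.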

Example 1

  • From the input layer $L_0$ to the first hidden layer $L_1$:
$$\begin{align*} a^{(1)}_1 &= \phi^{(1)}\big(w^{(1)}_{1, 1}x_1 + w^{(1)}_{2, 1}x_2 + w^{(1)}_{3, 1} x_3 + b^{(1)}_1\big) = \phi^{(1)} \big(\langle \boldsymbol{w}^{(1)}_{1}, \boldsymbol{x} \rangle + b^{(1)}_1 \big)\\ a^{(1)}_2 &= \phi^{(1)}\big(w^{(1)}_{1, 2}x_1 + w^{(1)}_{2, 2}x_2 + w^{(1)}_{3, 2} x_3 + b^{(1)}_2\big) = \phi^{(1)} \big(\langle \boldsymbol{w}^{(1)}_{2}, \boldsymbol{x} \rangle + b^{(1)}_2\big) \end{align*}$$
  • From the first hidden layer $L_1$ to the output layer $L_2$:
$$\hat{y} = \phi^{(2)}\big(w^{(2)}_{1, 1} a^{(1)}_1 + w^{(2)}_{2, 1} a^{(1)}_2 + b^{(2)}_1\big) = \phi^{(2)}\big( \langle \boldsymbol{w}^{(2)}_{1}, \boldsymbol{a}^{(1)} \rangle + b^{(2)}_1\big)$$
  • $\phi^{(1)}(z)= S(z)$ (the sigmoid function) and $\phi^{(2)}(z) = \exp(z)$ (the exponential function).

Let $\boldsymbol{\theta}^{(t)}=(\boldsymbol{w}^{(t)}, \boldsymbol{b}^{(t)})= \Big(\boldsymbol{w}^{(t, 1)}_1, \boldsymbol{w}^{(t, 1)}_2, \boldsymbol{w}^{(t, 2)}_1, b^{(t,1)}_1, b^{(t,1)}_2, b^{(t,2)}_1\Big)$ be the parameter estimates at the $t$th iteration. For illustration, we assume the bias terms $\big(b^{(t,1)}_1, b^{(t,1)}_2, b^{(t,2)}_1\big)$ are all zero.

  • For $\boldsymbol{w}_1^{(2)}$, apply Equation (3).
  • For $\boldsymbol{w}^{(1)}_1$, apply Equation (4).
  • For $\boldsymbol{w}^{(1)}_2$, apply Equation (4). The resulting deltas and gradients are written out below.
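Applying Equations (3) and (4) to this example with the squared-error loss for a single observation, and using $S'(z) = S(z)\big(1-S(z)\big)$ and $\tfrac{\mathrm{d}}{\mathrm{d}z}\exp(z) = \exp(z)$, gives
$$\begin{align*} \delta^{(2)}_1 &= 2\big(\hat{y} - y\big)\exp\big(z^{(2)}_1\big), & \frac{\partial \mathcal{L}}{\partial \boldsymbol{w}^{(2)}_1} &= \delta^{(2)}_1 \, \boldsymbol{a}^{(1)}, \\ \delta^{(1)}_1 &= \delta^{(2)}_1 \, w^{(2)}_{1,1} \, S\big(z^{(1)}_1\big)\big(1 - S(z^{(1)}_1)\big), & \frac{\partial \mathcal{L}}{\partial \boldsymbol{w}^{(1)}_1} &= \delta^{(1)}_1 \, \boldsymbol{x}, \\ \delta^{(1)}_2 &= \delta^{(2)}_1 \, w^{(2)}_{2,1} \, S\big(z^{(1)}_2\big)\big(1 - S(z^{(1)}_2)\big), & \frac{\partial \mathcal{L}}{\partial \boldsymbol{w}^{(1)}_2} &= \delta^{(1)}_2 \, \boldsymbol{x}. \end{align*}$$
Averaging these per-observation gradients over the batch gives the batch gradients computed in the code below.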

Implementing Backpropagation in Python

See Week_4_Lab_Notebook.ipynb for more details. The required packages/functions are as follows:

import os
os.environ["CUDA_VISIBLE_DEVICES"] = ""

import random
import numpy as np
import pandas as pd

from keras.models import Sequential
from keras.models import Model
from keras.layers import Input
from keras.layers import Dense
from keras.initializers import Constant

True weights:

w1_1 = np.array([[0.25], [0.5], [0.75]])
w1_2 = np.array([[0.75], [0.5], [0.25]])
w2_1 = np.array([[2.0], [3.0]])

Some synthetic data to work with:

# Generate 10000 random observations of 3 numerical features
np.random.seed(0)
X = np.random.randn(10000, 3)

# Sigmoid activation function
def sigmoid(z):
  return 1 / (1 + np.exp(-z))

# Hidden Layer 1
z1_1 = X @ w1_1 # The first neuron before activation
z1_2 = X @ w1_2 # The second neuron before activation
a1_1 = sigmoid(z1_1) # The first neuron after activation
a1_2 = sigmoid(z1_2) # The second neuron after activation

# Output Layer
z2_1 = np.concatenate((a1_1, a1_2), axis = 1) @ w2_1 # Pre-activation of the output
a2_1 = np.exp(z2_1) # Output

# The actual values
y = a2_1

From Scratch

# Initialised weights
w1_1_hat = np.array([[0.2], [0.6], [1.0]])
w1_2_hat = np.array([[0.4], [0.8], [1.2]])
w2_1_hat = np.array([[1.0], [2.0]])

losses = []
num_iterations = 5000
for _ in range(num_iterations):
  # Compute Forward Passes
  # Hidden Layer 1
  z1_1_hat = X @ w1_1_hat  # The first neuron before activation
  z1_2_hat = X @ w1_2_hat  # The second neuron before activation
  a1_1_hat = sigmoid(z1_1_hat) # The first neuron after activation
  a1_2_hat = sigmoid(z1_2_hat) # The second neuron after activation
  a1_hat = np.concatenate((a1_1_hat, a1_2_hat), axis = 1)

  # Output Layer
  z2_1_hat = a1_hat @ w2_1_hat # The output before activation
  y_hat = np.exp(z2_1_hat).reshape(len(y), 1) # The output

  # Track the Losses
  loss = (y_hat - y)**2
  losses.append(np.mean(loss))

  # Compute Deltas
  delta2_1 = 2 * (y_hat - y) * np.exp(z2_1_hat)
  delta1_1 = w2_1_hat[0] * delta2_1 * sigmoid(z1_1_hat) * (1-sigmoid(z1_1_hat))
  delta1_2 = w2_1_hat[1] * delta2_1 * sigmoid(z1_2_hat) * (1-sigmoid(z1_2_hat))

  # Compute Gradients
  d2_1_hat = delta2_1 * a1_hat
  d1_1_hat = delta1_1 * X
  d1_2_hat = delta1_2 * X

  # Learning Rate
  eta = 0.0005

  # Apply Batch Gradient Descent
  w2_1_hat -= eta * np.mean(d2_1_hat, axis = 0).reshape(2, 1)
  w1_1_hat -= eta * np.mean(d1_1_hat, axis = 0).reshape(3, 1)
  w1_2_hat -= eta * np.mean(d1_2_hat, axis = 0).reshape(3, 1)

print(w1_1_hat)
print(w1_2_hat)
print(w2_1_hat)
[[0.24985576]
 [0.5000211 ]
 [0.75018656]]
[[0.74987578]
 [0.49998626]
 [0.25009692]]
[[1.99874327]
 [3.00125615]]
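After 5,000 iterations of batch gradient descent, the estimated weights are close to the true weights w1_1, w1_2 and w2_1 defined above.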

From Keras

# An initialiser for the weights in the neural network 
init1 = Constant([[0.2, 0.4], [0.6, 0.8], [1.0, 1.2]])
init2 = Constant([[1.0, 2.0]])

# Build a neural network 
# `use_bias` (whether to include bias terms for the neurons or not) is True by default
# `kernel_initializer` adjusts the initialisations of the weights 
x = Input(shape=X.shape[1:], name="Inputs")
a1 = Dense(2, "sigmoid", use_bias=False,
          kernel_initializer=init1)(x)
y_hat = Dense(1, "exponential", use_bias=False,
            kernel_initializer=init2)(a1)
model = Model(x, y_hat)

# Choosing the optimiser and the loss function
model.compile(optimizer="adam", loss="mse")

# Model Training
# We don't implement early stopping to make the results comparable to the previous section
hist = model.fit(X, y, epochs=5000, verbose=0, batch_size = len(y))

# Print out the weights
print(model.get_weights())
[array([[0.3025748 , 0.80548114],
       [0.49333417, 0.5067073 ],
       [0.6842524 , 0.2076197 ]], dtype=float32), array([[2.5133712, 2.5152776],
       [2.4867477, 2.4848893]], dtype=float32)]