Recurrent Neural Networks

ACTL3143 & ACTL5111 Deep Learning for Actuaries

Author

Patrick Laub

Show the package imports
import random
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

from keras.models import Sequential
from keras.layers import Dense, Input
from keras.callbacks import EarlyStopping

Tensors & Time Series

Shapes of data

Illustration of tensors of different rank.

The axis argument in numpy

Starting with a (3, 4)-shaped matrix:

X = np.arange(12).reshape(3,4)
X
array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])

The above code creates an array with values from 0 to 11 and converts that array into a matrix with 3 rows and 4 columns.

axis=0: (3, 4) \leadsto (4,).

X.sum(axis=0)
array([12, 15, 18, 21])

The above code returns the column sums, changing the shape from (3, 4) to (4,). Similarly, X.sum(axis=1) returns the row sums and changes the shape from (3, 4) to (3,).
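To make the row-sum case concrete:

X.sum(axis=1)
array([ 6, 22, 38])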

axis=1: (3, 4) \leadsto (3,).

X.prod(axis=1)
array([   0,  840, 7920])

The return value’s rank is one less than the input’s rank.

Important

The axis parameter tells us which dimension is removed.

Using axis & keepdims

With keepdims=True, the rank doesn’t change.

X = np.arange(12).reshape(3,4)
X
array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])

axis=0: (3, 4) \leadsto (1, 4).

X.sum(axis=0, keepdims=True)
array([[12, 15, 18, 21]])

axis=1: (3, 4) \leadsto (3, 1).

X.prod(axis=1, keepdims=True)
array([[   0],
       [ 840],
       [7920]])
X / X.sum(axis=1)
ValueError: operands could not be broadcast together with shapes (3,4) (3,) 
X / X.sum(axis=1, keepdims=True)
array([[0.  , 0.17, 0.33, 0.5 ],
       [0.18, 0.23, 0.27, 0.32],
       [0.21, 0.24, 0.26, 0.29]])

The rank of a time series

Say we had n observations of a time series x_1, x_2, \dots, x_n.

This \mathbf{x} = (x_1, \dots, x_n) would have shape (n,) & rank 1.

If instead we had a batch of b time series

\mathbf{X} = \begin{pmatrix} x_7 & x_8 & \dots & x_{7+n-1} \\ x_2 & x_3 & \dots & x_{2+n-1} \\ \vdots & \vdots & \ddots & \vdots \\ x_3 & x_4 & \dots & x_{3+n-1} \\ \end{pmatrix} \,,

the batch \mathbf{X} would have shape (b, n) & rank 2.
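As a small numpy sketch (with window offsets chosen to match the rows above), stacking b length-n windows of one series gives a rank-2 batch:

x = np.arange(20, dtype=np.float32)               # one long time series: shape (20,), rank 1
n, starts = 5, [7, 2, 3]                          # window length and illustrative starting indices
X_batch = np.stack([x[s:s + n] for s in starts])  # a batch of b = 3 windows
x.shape, X_batch.shape                            # ((20,), (3, 5)): ranks 1 and 2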

Multivariate time series

A multivariate time series has more than one variable observed at each time point. The following example has two variables, x and y.

t x y
0 x_0 y_0
1 x_1 y_1
2 x_2 y_2
3 x_3 y_3

Then n observations of the m time series form a matrix of shape (n, m), which has rank 2.

In Keras, a batch of b of these time series has shape (b, n, m) and has rank 3.

Note

Use \mathbf{x}_t \in \mathbb{R}^{1 \times m} to denote the vector of all time series at time t. Here, \mathbf{x}_t = (x_t, y_t).
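A tiny numpy sketch with made-up values, showing the ranks involved:

n, m, b = 4, 2, 5                  # time steps, variables, batch size
xy = np.random.rand(n, m)          # one multivariate series: shape (4, 2), rank 2
x_t = xy[2:3, :]                   # the 1 x m vector of all variables at time t = 2
batch = np.random.rand(b, n, m)    # a batch of such series: shape (5, 4, 2), rank 3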

Some Recurrent Structures

Recurrence relation

A recurrence relation is an equation that expresses each element of a sequence as a function of the preceding ones. More precisely, in the case where only the immediately preceding element is involved, a recurrence relation has the form

u_n = \psi(n, u_{n-1}) \quad \text{ for } \quad n > 0.

Example: Factorial n! = n (n-1)! for n > 0 given 0! = 1.
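In code, this factorial recurrence looks like:

def factorial(n):
    u = 1                       # u_0 = 0! = 1
    for i in range(1, n + 1):
        u = i * u               # u_i = psi(i, u_{i-1}) = i * u_{i-1}
    return u

[factorial(n) for n in range(6)]
[1, 1, 2, 6, 24, 120]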

Diagram of an RNN cell

The RNN processes each element of the sequence one at a time, while keeping a memory of what came before.

The following figure shows how the recurrent neural network combines an input X_l with the previously processed state A_l to produce the output O_l. RNNs have a cyclic information-processing structure that lets them pass information forward from earlier inputs. This ability to capture dependencies and patterns in sequential data makes them useful for analysing time series.

Schematic of a simple recurrent neural network. E.g. SimpleRNN, LSTM, or GRU.

A SimpleRNN cell.

Diagram of a SimpleRNN cell.

All the outputs before the final one are often discarded.

SimpleRNN

Say each prediction is a vector of size d, so \mathbf{y}_t \in \mathbb{R}^{1 \times d}.

Then the main equation of a SimpleRNN, given \mathbf{y}_0 = \mathbf{0}, is

\mathbf{y}_t = \psi\bigl( \mathbf{x}_t \mathbf{W}_x + \mathbf{y}_{t-1} \mathbf{W}_y + \mathbf{b} \bigr) .

Here, \begin{aligned} &\mathbf{x}_t \in \mathbb{R}^{1 \times m}, \mathbf{W}_x \in \mathbb{R}^{m \times d}, \\ &\mathbf{y}_{t-1} \in \mathbb{R}^{1 \times d}, \mathbf{W}_y \in \mathbb{R}^{d \times d}, \text{ and } \mathbf{b} \in \mathbb{R}^{d}. \end{aligned}

At each time step, a simple recurrent neural network (RNN) takes an input vector \mathbf{x}_t, combines it with information from the previous hidden state \mathbf{y}_{t-1}, and produces an output vector \mathbf{y}_t. The hidden state helps the network remember the context of earlier inputs, enabling it to make informed predictions about what comes next in the sequence. In a SimpleRNN, the output and the hidden state coincide: the output at time t-1 serves as the hidden state at time t.
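A minimal numpy sketch of this recurrence for a single sequence, with made-up shapes and random weights (Keras' SimpleRNN uses tanh as the default \psi):

n, m, d = 5, 2, 3                                  # time steps, input size, output size
x = np.random.rand(n, m)                           # one input sequence
W_x, W_y = np.random.rand(m, d), np.random.rand(d, d)
b = np.random.rand(d)

y = np.zeros((1, d))                               # y_0 = 0
for t in range(n):
    y = np.tanh(x[t:t + 1] @ W_x + y @ W_y + b)    # y_t has shape (1, d)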

SimpleRNN (in batches)

The difference between a plain RNN and an RNN with batch processing lies in how the network handles sequences of input data. With batch processing, the model processes multiple (b) input sequences simultaneously. The training data is grouped into batches, and the weights are updated based on the average error across the batch. Batch processing often gives more stable weight updates, since the model learns from a diverse set of examples in each batch, reducing the impact of noise in individual sequences.

Say we operate on batches of size b, then \mathbf{Y}_t \in \mathbb{R}^{b \times d}.

The main equation of a SimpleRNN, given \mathbf{Y}_0 = \mathbf{0}, is \mathbf{Y}_t = \psi\bigl( \mathbf{X}_t \mathbf{W}_x + \mathbf{Y}_{t-1} \mathbf{W}_y + \mathbf{b} \bigr) . Here, \begin{aligned} &\mathbf{X}_t \in \mathbb{R}^{b \times m}, \mathbf{W}_x \in \mathbb{R}^{m \times d}, \\ &\mathbf{Y}_{t-1} \in \mathbb{R}^{b \times d}, \mathbf{W}_y \in \mathbb{R}^{d \times d}, \text{ and } \mathbf{b} \in \mathbb{R}^{d}. \end{aligned}

Simple Keras demo

num_obs = 4          # the number of observations
num_time_steps = 3   # the number of time steps
num_time_series = 2  # the number of time series

X = np.arange(num_obs*num_time_steps*num_time_series).astype(np.float32) \
        .reshape([num_obs, num_time_steps, num_time_series])  # reshape to a rank-3 tensor of shape (4, 3, 2)

output_size = 1
y = np.array([0, 0, 1, 1])
X[:2]

Since the tensor has shape (4, 3, 2), X[:2] selects the first two slices (indices 0 and 1) along the first dimension and returns a sub-tensor of shape (2, 3, 2).
array([[[ 0.,  1.],
        [ 2.,  3.],
        [ 4.,  5.]],

       [[ 6.,  7.],
        [ 8.,  9.],
        [10., 11.]]], dtype=float32)
X[2:]

The first dimension (axis=0) has size 4, so X[2:] selects the last two slices (indices 2 and 3) along it and returns a sub-tensor of shape (2, 3, 2).
array([[[12., 13.],
        [14., 15.],
        [16., 17.]],

       [[18., 19.],
        [20., 21.],
        [22., 23.]]], dtype=float32)

Keras’ SimpleRNN

As usual, the SimpleRNN is just a layer in Keras.

from keras.layers import SimpleRNN    # import the SimpleRNN layer from Keras

random.seed(1234)                     # set the random seed for reproducibility
model = Sequential([
    # A simple RNN with one output node and a sigmoid activation function.
    SimpleRNN(output_size, activation="sigmoid")
])
# Binary cross-entropy loss (usual for classification problems),
# with "accuracy" as the metric monitored during training.
model.compile(loss="binary_crossentropy", metrics=["accuracy"])

# Train the model for 500 epochs, saving the training history as `hist`,
# then evaluate it to obtain the loss and accuracy.
hist = model.fit(X, y, epochs=500, verbose=False)
model.evaluate(X, y, verbose=False)
[3.1845884323120117, 0.5]

The predicted probabilities on the training set are:

model.predict(X, verbose=0)
array([[0.97],
       [1.  ],
       [1.  ],
       [1.  ]], dtype=float32)

SimpleRNN weights

To verify the predicted probabilities, we can extract the weights of the fitted model and compute the outputs manually as follows.

model.get_weights()
[array([[0.68],
        [0.21]], dtype=float32),
 array([[0.49]], dtype=float32),
 array([-0.51], dtype=float32)]
def sigmoid(x):
  return 1 / (1 + np.exp(-x))

W_x, W_y, b = model.get_weights()

Y = np.zeros((num_obs, output_size), dtype=np.float32)
for t in range(num_time_steps):
    X_t = X[:, t, :]
    z = X_t @ W_x + Y @ W_y + b
    Y = sigmoid(z)

Y
array([[0.97],
       [1.  ],
       [1.  ],
       [1.  ]], dtype=float32)

LSTM internals

Simple RNN structures suffer from the vanishing gradient problem and hence struggle to learn long-term dependencies. LSTMs are designed to overcome this. They have a more complex structure (a memory cell plus gating mechanisms) and can better regulate the flow of information.

Diagram of an LSTM cell. Notation for the diagram.
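The slides show only the cell diagram; as a rough numpy sketch of a single LSTM step (the standard textbook form, with made-up sizes and random weights; Keras' exact parameterisation may differ slightly):

m, d = 2, 3                                   # input size and state size (made up)
rng = np.random.default_rng(0)
x_t = rng.random((1, m))                      # input at time t
h_prev = np.zeros((1, d))                     # previous hidden state
c_prev = np.zeros((1, d))                     # previous cell (long-term) state

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# One set of input weights, recurrent weights and bias per gate.
weights = {g: (rng.random((m, d)), rng.random((d, d)), rng.random(d))
           for g in ["forget", "input", "output", "candidate"]}

def gate(name, activation):
    W_x, W_h, b = weights[name]
    return activation(x_t @ W_x + h_prev @ W_h + b)

f_t = gate("forget", sigmoid)                 # what to erase from the cell state
i_t = gate("input", sigmoid)                  # what to write to the cell state
o_t = gate("output", sigmoid)                 # what to reveal as the hidden state
c_tilde = gate("candidate", np.tanh)          # proposed new content
c_t = f_t * c_prev + i_t * c_tilde            # updated cell state
h_t = o_t * np.tanh(c_t)                      # updated hidden state / output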

GRU internals

GRUs are simpler than LSTMs and hence computationally more efficient.

Diagram of a GRU cell.
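Likewise, a rough sketch of a single GRU step (one common convention; references differ on whether the update gate weights the old state or the candidate):

m, d = 2, 3
rng = np.random.default_rng(1)
x_t = rng.random((1, m))                      # input at time t
h_prev = np.zeros((1, d))                     # previous hidden state

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

weights = {g: (rng.random((m, d)), rng.random((d, d)), rng.random(d))
           for g in ["update", "reset", "candidate"]}

def gate(name, activation, h):
    W_x, W_h, b = weights[name]
    return activation(x_t @ W_x + h @ W_h + b)

z_t = gate("update", sigmoid, h_prev)                 # how much of the old state to keep
r_t = gate("reset", sigmoid, h_prev)                  # how much of the old state to consult
h_tilde = gate("candidate", np.tanh, r_t * h_prev)    # proposed new state
h_t = z_t * h_prev + (1 - z_t) * h_tilde              # blend old state and candidate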

Recurrent Neural Networks

Basic facts of RNNs

  • A recurrent neural network is a type of neural network that is designed to process sequences of data (e.g. time series, sentences).
  • A recurrent neural network is any network that contains a recurrent layer.
  • A recurrent layer is a layer that processes data in a sequence.
  • An RNN can have one or more recurrent layers.
  • Weights are shared over time; this allows the model to be used on arbitrary-length sequences.
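A quick sketch of the last point, with made-up sizes: because the same weights are applied at every time step, a single recurrent layer accepts sequences of any length.

from keras.layers import SimpleRNN

m, d = 2, 4
var_len_model = Sequential([Input((None, m)), SimpleRNN(d)])  # None = any sequence length

short_batch = np.random.rand(8, 3, m)                  # 8 sequences of 3 time steps
long_batch = np.random.rand(8, 10, m)                  # 8 sequences of 10 time steps
var_len_model.predict(short_batch, verbose=0).shape    # (8, 4)
var_len_model.predict(long_batch, verbose=0).shape     # (8, 4)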

Applications

  • Forecasting: revenue forecast, weather forecast, predict disease rate from medical history, etc.
  • Classification: given a time series of the activities of a visitor on a website, classify whether the visitor is a bot or a human.
  • Event detection: given a continuous data stream, identify the occurrence of a specific event. Example: Detect utterances like “Hey Alexa” from an audio stream.
  • Anomaly detection: given a continuous data stream, detect anything unusual happening. Example: Detect unusual activity on the corporate network.

Input and output sequences

Categories of recurrent neural networks: sequence to sequence, sequence to vector, vector to sequence, encoder-decoder network.

Input and output sequences

  • Sequence to sequence: useful for predicting time series, e.g. using prices over the last N days to output the prices shifted one day into the future (i.e. from N-1 days ago to tomorrow).
  • Sequence to vector: ignore all outputs except the one at the final time step. Example: give a sentiment score to a sequence of words corresponding to a movie review. (Both cases are sketched below.)
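In Keras, these two cases correspond to the return_sequences flag of a recurrent layer; a small sketch with made-up sizes:

from keras.layers import SimpleRNN

b, n, m, d = 8, 5, 2, 3
X_batch = np.random.rand(b, n, m)

seq_to_seq = Sequential([Input((n, m)), SimpleRNN(d, return_sequences=True)])
seq_to_vec = Sequential([Input((n, m)), SimpleRNN(d)])

seq_to_seq.predict(X_batch, verbose=0).shape   # (8, 5, 3): an output at every time step
seq_to_vec.predict(X_batch, verbose=0).shape   # (8, 3): only the final output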

Input and output sequences

  • Vector to sequence: feed the network the same input vector at every time step and let it output a sequence. Example: given an image as input, produce a caption for it. The image is treated as an input vector (the pixels of an image do not form a sequence), and the caption is a sequence of words describing the image. A dataset of images and their descriptions is used to train the RNN.
  • Encoder-decoder: the encoder is a sequence-to-vector network and the decoder is a vector-to-sequence network. Example: feed the encoder a sentence in one language; it converts the sentence into a single vector representation, and the decoder then decodes this vector into the corresponding sentence in another language.

Recurrent layers can be stacked.

Deep RNN unrolled through time.

CoreLogic Hedonic Home Value Index

Australian House Price Indices

Percentage changes

changes = house_prices.pct_change().dropna()
changes.round(2)
Brisbane East_Bris North_Bris West_Bris Melbourne North_Syd Sydney
Date
1990-02-28 0.03 -0.01 0.01 0.01 0.00 -0.00 -0.02
1990-03-31 0.01 0.03 0.01 0.01 0.02 -0.00 0.03
... ... ... ... ... ... ... ...
2021-04-30 0.03 0.01 0.01 -0.00 0.01 0.02 0.02
2021-05-31 0.03 0.03 0.03 0.03 0.03 0.02 0.04

376 rows × 7 columns

Percentage changes

changes.plot();

The size of the changes

changes.mean()
Brisbane      0.005496
East_Bris     0.005416
North_Bris    0.005024
West_Bris     0.004842
Melbourne     0.005677
North_Syd     0.004819
Sydney        0.005526
dtype: float64
changes *= 100
changes.mean()
Brisbane      0.549605
East_Bris     0.541562
North_Bris    0.502390
West_Bris     0.484204
Melbourne     0.567700
North_Syd     0.481863
Sydney        0.552641
dtype: float64
changes.plot(legend=False);

Splitting time series data

Split without shuffling

num_train = int(0.6 * len(changes))
num_val = int(0.2 * len(changes))
num_test = len(changes) - num_train - num_val
print(f"# Train: {num_train}, # Val: {num_val}, # Test: {num_test}")
# Train: 225, # Val: 75, # Test: 76

Subsequences of a time series

Keras has a built-in method for converting a time series into subsequences/chunks.

from keras.utils import timeseries_dataset_from_array

integers = range(10)                                
dummy_dataset = timeseries_dataset_from_array(
    data=integers[:-3],                                 
    targets=integers[3:],                               
    sequence_length=3,                                      
    batch_size=2,                                           
)

for inputs, targets in dummy_dataset:
    for i in range(inputs.shape[0]):
        print([int(x) for x in inputs[i]], int(targets[i]))
[0, 1, 2] 3
[1, 2, 3] 4
[2, 3, 4] 5
[3, 4, 5] 6
[4, 5, 6] 7

On time series splits

If you have a lot of time series data, then use:

from keras.utils import timeseries_dataset_from_array
data = range(20); seq = 3; ts = data[:-seq]; target = data[seq:]
nTrain = int(0.5 * len(ts)); nVal = int(0.25 * len(ts))
nTest = len(ts) - nTrain - nVal
print(f"# Train: {nTrain}, # Val: {nVal}, # Test: {nTest}")
# Train: 8, # Val: 4, # Test: 5
trainDS = \
  timeseries_dataset_from_array(
    ts, target, seq,
    end_index=nTrain)
valDS = \
  timeseries_dataset_from_array(
    ts, target, seq,
    start_index=nTrain,
    end_index=nTrain+nVal)
testDS = \
  timeseries_dataset_from_array(
    ts, target, seq,
    start_index=nTrain+nVal)
Training dataset
[0, 1, 2] 3
[1, 2, 3] 4
[2, 3, 4] 5
[3, 4, 5] 6
[4, 5, 6] 7
[5, 6, 7] 8
Validation dataset
[8, 9, 10] 11
[9, 10, 11] 12
Test dataset
[12, 13, 14] 15
[13, 14, 15] 16
[14, 15, 16] 17

On time series splits II

If you don’t have a lot of time series data, consider:

X = []; y = []
for i in range(len(data)-seq):
    X.append(data[i:i+seq])
    y.append(data[i+seq])
X = np.array(X); y = np.array(y);
nTrain = int(0.5 * X.shape[0])
X_train = X[:nTrain]
y_train = y[:nTrain]
nVal = int(np.ceil(0.25 * X.shape[0]))
X_val = X[nTrain:nTrain+nVal]
y_val = y[nTrain:nTrain+nVal]
nTest = X.shape[0] - nTrain - nVal
X_test = X[nTrain+nVal:]
y_test = y[nTrain+nVal:]
Training dataset
[0, 1, 2] 3
[1, 2, 3] 4
[2, 3, 4] 5
[3, 4, 5] 6
[4, 5, 6] 7
[5, 6, 7] 8
[6, 7, 8] 9
[7, 8, 9] 10
Validation dataset
[8, 9, 10] 11
[9, 10, 11] 12
[10, 11, 12] 13
[11, 12, 13] 14
[12, 13, 14] 15
Test dataset
[13, 14, 15] 16
[14, 15, 16] 17
[15, 16, 17] 18
[16, 17, 18] 19

Predicting Sydney House Prices

Creating dataset objects

# Num. of input time series.
num_ts = changes.shape[1]

# How many prev. months to use.
seq_length = 6

# Predict the next month ahead.
ahead = 1

# The index of the first target.
delay = (seq_length+ahead-1)
# Which suburb to predict.
target_suburb = changes["Sydney"]

train_ds = \
  timeseries_dataset_from_array(
    changes[:-delay],
    targets=target_suburb[delay:],
    sequence_length=seq_length,
    end_index=num_train)
val_ds = \
  timeseries_dataset_from_array(
    changes[:-delay],
    targets=target_suburb[delay:],
    sequence_length=seq_length,
    start_index=num_train,
    end_index=num_train+num_val)
test_ds = \
  timeseries_dataset_from_array(
    changes[:-delay],
    targets=target_suburb[delay:],
    sequence_length=seq_length,
    start_index=num_train+num_val)
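As a quick sanity check (a sketch, assuming the default batch_size of 128): with seq_length = 6 and ahead = 1, the first input window should cover months 0 to 5, and its target should be the Sydney change at index delay = 6.

for X_batch, y_batch in train_ds.take(1):
    print(X_batch.shape, y_batch.shape)                        # e.g. (128, 6, 7) and (128,)
    print(np.allclose(X_batch[0], changes.iloc[:seq_length]))  # first window = months 0-5
    print(np.isclose(y_batch[0], target_suburb.iloc[delay]))   # its target = month 6's Sydney change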

Converting Dataset to numpy

The Dataset object can be handed to Keras directly, but if we really need a numpy array, we can run:

X_train = np.concatenate(list(train_ds.map(lambda x, y: x)))
y_train = np.concatenate(list(train_ds.map(lambda x, y: y)))

The shape of our training set is now:

X_train.shape
(220, 6, 7)
y_train.shape
(220,)

Converting the rest to numpy arrays:

X_val = np.concatenate(list(val_ds.map(lambda x, y: x)))
y_val = np.concatenate(list(val_ds.map(lambda x, y: y)))
X_test = np.concatenate(list(test_ds.map(lambda x, y: x)))
y_test = np.concatenate(list(test_ds.map(lambda x, y: y)))

A dense network

from keras.layers import Input, Flatten
random.seed(1)
model_dense = Sequential([
    Input((seq_length, num_ts)),
    Flatten(),
    Dense(50, activation="leaky_relu"),
    Dense(20, activation="leaky_relu"),
    Dense(1, activation="linear")
])
model_dense.compile(loss="mse", optimizer="adam")
print(f"This model has {model_dense.count_params()} parameters.")

es = EarlyStopping(patience=50, restore_best_weights=True, verbose=1)
%time hist = model_dense.fit(X_train, y_train, epochs=1_000, \
  validation_data=(X_val, y_val), callbacks=[es], verbose=0);
This model has 3191 parameters.
Epoch 57: early stopping
Restoring model weights from the end of the best epoch: 7.
CPU times: user 3.29 s, sys: 286 ms, total: 3.57 s
Wall time: 6.92 s

Plot the model

from keras.utils import plot_model
plot_model(model_dense, show_shapes=True)

Assess the fits

model_dense.evaluate(X_val, y_val, verbose=0)
1.1644608974456787

Plotting the predictions
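A minimal sketch of how the predictions could be plotted against the observed Sydney changes (the original plotting code is not included in these notes; this assumes the matplotlib import and the validation arrays defined above):

y_pred = model_dense.predict(X_val, verbose=0).flatten()
plt.plot(y_val, label="Observed")
plt.plot(y_pred, label="Predicted")
plt.xlabel("Month (validation set)")
plt.ylabel("Monthly change (%)")
plt.legend();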

A SimpleRNN layer

random.seed(1)

model_simple = Sequential([
    Input((seq_length, num_ts)),
    SimpleRNN(50),
    Dense(1, activation="linear")
])
model_simple.compile(loss="mse", optimizer="adam")
print(f"This model has {model_simple.count_params()} parameters.")

es = EarlyStopping(patience=50, restore_best_weights=True, verbose=1)
%time hist = model_simple.fit(X_train, y_train, epochs=1_000, \
  validation_data=(X_val, y_val), callbacks=[es], verbose=0);
This model has 2951 parameters.
Epoch 62: early stopping
Restoring model weights from the end of the best epoch: 12.
CPU times: user 4.53 s, sys: 426 ms, total: 4.96 s
Wall time: 8.11 s

Assess the fits

model_simple.evaluate(X_val, y_val, verbose=0)
1.2507916688919067

Plot the model

plot_model(model_simple, show_shapes=True)

Plotting the predictions


An LSTM layer

from keras.layers import LSTM

random.seed(1)

model_lstm = Sequential([
    Input((seq_length, num_ts)),
    LSTM(50),
    Dense(1, activation="linear")
])

model_lstm.compile(loss="mse", optimizer="adam")

es = EarlyStopping(patience=50, restore_best_weights=True, verbose=1)

%time hist = model_lstm.fit(X_train, y_train, epochs=1_000, \
  validation_data=(X_val, y_val), callbacks=[es], verbose=0);
Epoch 59: early stopping
Restoring model weights from the end of the best epoch: 9.
CPU times: user 4.97 s, sys: 435 ms, total: 5.4 s
Wall time: 8.87 s

Assess the fits

model_lstm.evaluate(X_val, y_val, verbose=0)
0.8353261947631836

Plotting the predictions


A GRU layer

from keras.layers import GRU

random.seed(1)

model_gru = Sequential([
    Input((seq_length, num_ts)),
    GRU(50),
    Dense(1, activation="linear")
])

model_gru.compile(loss="mse", optimizer="adam")

es = EarlyStopping(patience=50, restore_best_weights=True, verbose=1)

%time hist = model_gru.fit(X_train, y_train, epochs=1_000, \
  validation_data=(X_val, y_val), callbacks=[es], verbose=0)
Epoch 57: early stopping
Restoring model weights from the end of the best epoch: 7.
CPU times: user 5.36 s, sys: 418 ms, total: 5.78 s
Wall time: 4.37 s

Assess the fits

model_gru.evaluate(X_val, y_val, verbose=0)
0.7435100078582764

Plotting the predictions

Two GRU layers

random.seed(1)

model_two_grus = Sequential([
    Input((seq_length, num_ts)),
    GRU(50, return_sequences=True),
    GRU(50),
    Dense(1, activation="linear")
])

model_two_grus.compile(loss="mse", optimizer="adam")

es = EarlyStopping(patience=50, restore_best_weights=True, verbose=1)

%time hist = model_two_grus.fit(X_train, y_train, epochs=1_000, \
  validation_data=(X_val, y_val), callbacks=[es], verbose=0)
Epoch 56: early stopping
Restoring model weights from the end of the best epoch: 6.
CPU times: user 7.82 s, sys: 860 ms, total: 8.68 s
Wall time: 5.52 s

Assess the fits

model_two_grus.evaluate(X_val, y_val, verbose=0)
0.7989509105682373

Plotting the predictions

Compare the models

Model MSE
1 SimpleRNN 1.250792
0 Dense 1.164461
2 LSTM 0.835326
4 2 GRUs 0.798951
3 GRU 0.743510

By validation MSE, the single-GRU network actually comes out best here, with the two-GRU network close behind. Evaluating the two-GRU network on the test set:

model_two_grus.evaluate(test_ds, verbose=0)
1.8552547693252563

Test set

Predicting Multiple Time Series

Creating dataset objects

Change the targets argument to include all the suburbs.

train_ds = \
  timeseries_dataset_from_array(
    changes[:-delay],
    targets=changes[delay:],
    sequence_length=seq_length,
    end_index=num_train)
val_ds = \
  timeseries_dataset_from_array(
    changes[:-delay],
    targets=changes[delay:],
    sequence_length=seq_length,
    start_index=num_train,
    end_index=num_train+num_val)
test_ds = \
  timeseries_dataset_from_array(
    changes[:-delay],
    targets=changes[delay:],
    sequence_length=seq_length,
    start_index=num_train+num_val)

Converting Dataset to numpy

The shape of our training set is now:

X_train = np.concatenate(list(train_ds.map(lambda x, y: x)))
X_train.shape
(220, 6, 7)
y_train = np.concatenate(list(train_ds.map(lambda x, y: y)))
y_train.shape
(220, 7)

Converting the rest to numpy arrays:

X_val = np.concatenate(list(val_ds.map(lambda x, y: x)))
y_val = np.concatenate(list(val_ds.map(lambda x, y: y)))
X_test = np.concatenate(list(test_ds.map(lambda x, y: x)))
y_test = np.concatenate(list(test_ds.map(lambda x, y: y)))

A dense network

random.seed(1)
model_dense = Sequential([
    Input((seq_length, num_ts)),
    Flatten(),
    Dense(50, activation="leaky_relu"),
    Dense(20, activation="leaky_relu"),
    Dense(num_ts, activation="linear")
])
model_dense.compile(loss="mse", optimizer="adam")
print(f"This model has {model_dense.count_params()} parameters.")

es = EarlyStopping(patience=50, restore_best_weights=True, verbose=1)
%time hist = model_dense.fit(X_train, y_train, epochs=1_000, \
  validation_data=(X_val, y_val), callbacks=[es], verbose=0);
This model has 3317 parameters.
Epoch 75: early stopping
Restoring model weights from the end of the best epoch: 25.
CPU times: user 4.68 s, sys: 352 ms, total: 5.03 s
Wall time: 5.31 s

Plot the model

plot_model(model_dense, show_shapes=True)

Assess the fits

model_dense.evaluate(X_val, y_val, verbose=0)
1.4294650554656982

Plotting the predictions

A SimpleRNN layer

random.seed(1)

model_simple = Sequential([
    Input((seq_length, num_ts)),
    SimpleRNN(50),
    Dense(num_ts, activation="linear")
])
model_simple.compile(loss="mse", optimizer="adam")
print(f"This model has {model_simple.count_params()} parameters.")

es = EarlyStopping(patience=50, restore_best_weights=True, verbose=1)
%time hist = model_simple.fit(X_train, y_train, epochs=1_000, \
  validation_data=(X_val, y_val), callbacks=[es], verbose=0);
This model has 3257 parameters.
Epoch 70: early stopping
Restoring model weights from the end of the best epoch: 20.
CPU times: user 6.74 s, sys: 670 ms, total: 7.41 s
Wall time: 7.42 s

Assess the fits

model_simple.evaluate(X_val, y_val, verbose=0)
1.4916820526123047

Plot the model

plot_model(model_simple, show_shapes=True)

Plotting the predictions

An LSTM layer

random.seed(1)

model_lstm = Sequential([
    Input((seq_length, num_ts)),
    LSTM(50),
    Dense(num_ts, activation="linear")
])

model_lstm.compile(loss="mse", optimizer="adam")

es = EarlyStopping(patience=50, restore_best_weights=True, verbose=1)

%time hist = model_lstm.fit(X_train, y_train, epochs=1_000, \
  validation_data=(X_val, y_val), callbacks=[es], verbose=0);
Epoch 74: early stopping
Restoring model weights from the end of the best epoch: 24.
CPU times: user 6.59 s, sys: 660 ms, total: 7.25 s
Wall time: 6.51 s

Assess the fits

model_lstm.evaluate(X_val, y_val, verbose=0)
1.331125020980835

Plotting the predictions

A GRU layer

random.seed(1)

model_gru = Sequential([
    Input((seq_length, num_ts)),
    GRU(50),
    Dense(num_ts, activation="linear")
])

model_gru.compile(loss="mse", optimizer="adam")

es = EarlyStopping(patience=50, restore_best_weights=True, verbose=1)

%time hist = model_gru.fit(X_train, y_train, epochs=1_000, \
  validation_data=(X_val, y_val), callbacks=[es], verbose=0)
Epoch 70: early stopping
Restoring model weights from the end of the best epoch: 20.
CPU times: user 6.11 s, sys: 574 ms, total: 6.68 s
Wall time: 4.85 s

Assess the fits

model_gru.evaluate(X_val, y_val, verbose=0)
1.344503402709961

Plotting the predictions

Two GRU layers

random.seed(1)

model_two_grus = Sequential([
    Input((seq_length, num_ts)),
    GRU(50, return_sequences=True),
    GRU(50),
    Dense(num_ts, activation="linear")
])

model_two_grus.compile(loss="mse", optimizer="adam")

es = EarlyStopping(patience=50, restore_best_weights=True, verbose=1)

%time hist = model_two_grus.fit(X_train, y_train, epochs=1_000, \
  validation_data=(X_val, y_val), callbacks=[es], verbose=0)
Epoch 67: early stopping
Restoring model weights from the end of the best epoch: 17.
CPU times: user 8.92 s, sys: 862 ms, total: 9.78 s
Wall time: 6.04 s

Assess the fits

model_two_grus.evaluate(X_val, y_val, verbose=0)
1.358651041984558

Plotting the predictions

Compare the models

Model MSE
1 SimpleRNN 1.491682
0 Dense 1.429465
4 2 GRUs 1.358651
3 GRU 1.344503
2 LSTM 1.331125

The network with an LSTM layer is the best.

model_lstm.evaluate(test_ds, verbose=0)
1.932026982307434

Test set

Package Versions

from watermark import watermark
print(watermark(python=True, packages="keras,matplotlib,numpy,pandas,seaborn,scipy,torch,tensorflow,tf_keras"))
Python implementation: CPython
Python version       : 3.11.9
IPython version      : 8.24.0

keras     : 3.3.3
matplotlib: 3.8.4
numpy     : 1.26.4
pandas    : 2.2.2
seaborn   : 0.13.2
scipy     : 1.11.0
torch     : 2.0.1
tensorflow: 2.16.1
tf_keras  : 2.16.0

Glossary

  • GRU
  • LSTM
  • recurrent neural networks
  • SimpleRNN