Deep Learning with Keras

ACTL3143 & ACTL5111 Deep Learning for Actuaries

Patrick Laub

California House Price Prediction

Lecture Outline

  • California House Price Prediction

  • EDA & Baseline Model

  • Our First Neural Network

  • Force positive predictions

  • Preprocessing

  • Early Stopping

  • Quiz

Import the data

from sklearn.datasets import fetch_california_housing

features, target = fetch_california_housing(
    as_frame=True, return_X_y=True)
features                                                                        
MedInc HouseAge AveRooms AveBedrms Population AveOccup Latitude Longitude
0 8.3252 41.0 6.984127 1.023810 322.0 2.555556 37.88 -122.23
1 8.3014 21.0 6.238137 0.971880 2401.0 2.109842 37.86 -122.22
2 7.2574 52.0 8.288136 1.073446 496.0 2.802260 37.85 -122.24
... ... ... ... ... ... ... ... ...
20637 1.7000 17.0 5.205543 1.120092 1007.0 2.325635 39.43 -121.22
20638 1.8672 18.0 5.329513 1.171920 741.0 2.123209 39.43 -121.32
20639 2.3886 16.0 5.254717 1.162264 1387.0 2.616981 39.37 -121.24

20640 rows × 8 columns

What is the target?

target
0        4.526
1        3.585
2        3.521
         ...  
20637    0.923
20638    0.847
20639    0.894
Name: MedHouseVal, Length: 20640, dtype: float64

The dataset

The target variable is the median house value for California districts, expressed in hundreds of thousands of dollars ($100,000).

This dataset was derived from the 1990 U.S. census, using one row per census block group. A block group is the smallest geographical unit for which the U.S. Census Bureau publishes sample data (a block group typically has a population of 600 to 3,000 people).
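Since the target is measured in units of $100,000, converting a prediction back to dollars is a single multiplication (an illustrative snippet, not from the slides):

# The target is in units of $100,000, so rescale a value to get dollars.
predicted_value = 4.526                        # e.g. the first target value above
print(f"${predicted_value * 100_000:,.0f}")    # $452,600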

Columns

  • MedInc median income in block group
  • HouseAge median house age in block group
  • AveRooms average number of rooms per household
  • AveBedrms average # of bedrooms per household
  • Population block group population
  • AveOccup average number of household members
  • Latitude block group latitude
  • Longitude block group longitude

An entire ML project

ML life cycle

Questions to answer in an ML project

You fit a few models to the training set, then ask:

  1. (Selection) Which of these models is the best?
  2. (Future Performance) How good should we expect the final model to be on unseen data?

Set aside a fraction for a test set

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    features, target, random_state=42
)

Illustration of a typical training/test split.

Note the naming convention: uppercase X for the feature matrices, lowercase y for the target vectors, each split with _train and _test suffixes.

Our use of sklearn.

Basic ML workflow

Splitting the data.

  1. For each model, fit it to the training set.
  2. Compute the error for each model on the validation set.
  3. Select the model with the lowest validation error.
  4. Compute the error of the final model on the test set.

Split three ways

# Thanks https://datascience.stackexchange.com/a/15136
X_main, X_test, y_main, y_test = train_test_split(
    features, target, test_size=0.2, random_state=1
)

# As 0.25 x 0.8 = 0.2
X_train, X_val, y_train, y_val = train_test_split(
    X_main, y_main, test_size=0.25, random_state=1
)

X_train.shape, X_val.shape, X_test.shape
((12384, 8), (4128, 8), (4128, 8))
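A minimal sketch of steps 1–4 from the basic ML workflow above, using the three-way split just created; the candidate models here (plain and ridge linear regressions) are illustrative choices, not from the lecture:

from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_squared_error

# 1. Fit each candidate model on the training set.
candidates = {"Linear": LinearRegression(), "Ridge": Ridge(alpha=1.0)}
for model in candidates.values():
    model.fit(X_train, y_train)

# 2. Compute each model's error on the validation set.
val_errors = {
    name: mean_squared_error(y_val, model.predict(X_val))
    for name, model in candidates.items()
}

# 3. Select the model with the lowest validation error.
best_name = min(val_errors, key=val_errors.get)
best_model = candidates[best_name]

# 4. Report the final model's error on the test set.
print(best_name, mean_squared_error(y_test, best_model.predict(X_test)))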

Why not use test set for both?

Thought experiment: suppose we have m classifiers f_1(\mathbf{x}), \dots, f_m(\mathbf{x}).

They are all equally good in the long run: \mathbb{P}(\, f_i(\mathbf{X}) = Y \,) = 90\% \quad \text{for } i=1,\dots,m .

Evaluate each model on the test set; by chance, some will appear better than others.

Take the best-performing one, and you might conclude it has \approx 98\% accuracy, even though every model is truly 90\% accurate!
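A quick simulation of this thought experiment (the numbers here are illustrative): even though every classifier is truly 90% accurate, the best test-set accuracy among many of them looks much higher.

import numpy as np

rng = np.random.default_rng(0)
m, n, p = 50, 100, 0.9   # 50 equally good classifiers, 100 test points, true accuracy 90%

# Each classifier's observed test accuracy is Binomial(n, p) / n.
test_accuracies = rng.binomial(n, p, size=m) / n
print(f"Best observed accuracy: {test_accuracies.max():.0%}")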

EDA & Baseline Model

Lecture Outline

  • California House Price Prediction

  • EDA & Baseline Model

  • Our First Neural Network

  • Force positive predictions

  • Preprocessing

  • Early Stopping

  • Quiz

The training set

X_train
MedInc HouseAge AveRooms AveBedrms Population AveOccup Latitude Longitude
9107 4.1573 19.0 6.162630 1.048443 1677.0 2.901384 34.63 -118.18
13999 0.4999 10.0 6.740000 2.040000 108.0 2.160000 34.69 -116.90
5610 2.0458 27.0 3.619048 1.062771 1723.0 3.729437 33.78 -118.26
... ... ... ... ... ... ... ... ...
8539 4.0727 18.0 3.957845 1.079625 2276.0 2.665105 33.90 -118.36
2155 2.3190 41.0 5.366265 1.113253 1129.0 2.720482 36.78 -119.79
13351 5.5632 9.0 7.241087 0.996604 2280.0 3.870968 34.02 -117.62

12384 rows × 8 columns

Location

Python’s matplotlib package \approx R’s basic plots.

import matplotlib.pyplot as plt

plt.scatter(features["Longitude"], features["Latitude"])

Location #2

Python’s seaborn package \approx R’s ggplot2.

import seaborn as sns

sns.scatterplot(x="Longitude", y="Latitude", data=features);

Features

print(list(features.columns))
['MedInc', 'HouseAge', 'AveRooms', 'AveBedrms', 'Population', 'AveOccup', 'Latitude', 'Longitude']

How many?

num_features = len(features.columns)
num_features
8

Or

num_features = features.shape[1]
features.shape
(20640, 8)

Linear Regression

\hat{y} = w_0 + \sum_{i=1}^N w_i x_i .

from sklearn.linear_model import LinearRegression

lr = LinearRegression()
lr.fit(X_train, y_train);

The intercept w_0 is stored in lr.intercept_, and the other weights are in

print(lr.coef_)
[ 4.34267965e-01  9.88284781e-03 -9.39592954e-02  5.86373944e-01
 -1.58360948e-06 -3.59968968e-03 -4.26013498e-01 -4.41779336e-01]

Make some predictions

X_train.head(3)
MedInc HouseAge AveRooms AveBedrms Population AveOccup Latitude Longitude
9107 4.1573 19.0 6.162630 1.048443 1677.0 2.901384 34.63 -118.18
13999 0.4999 10.0 6.740000 2.040000 108.0 2.160000 34.69 -116.90
5610 2.0458 27.0 3.619048 1.062771 1723.0 3.729437 33.78 -118.26
y_pred = lr.predict(X_train.head(3))
y_pred
array([1.81699287, 0.0810446 , 1.62089363])
prediction = lr.intercept_
for w_i, x_i in zip(lr.coef_, X_train.iloc[0]):
    prediction += w_i * x_i
prediction                                              
1.8169928680677785

Plot the predictions

Calculate mean squared error

import pandas as pd

y_pred = lr.predict(X_train)
df = pd.DataFrame({"Predictions": y_pred, "True values": y_train})
df["Squared Error"] = (df["Predictions"] - df["True values"]) ** 2
df.head(4)
Predictions True values Squared Error
9107 1.816993 2.281 0.215303
13999 0.081045 0.550 0.219919
5610 1.620894 1.745 0.015402
13533 1.168949 1.199 0.000903
df["Squared Error"].mean()
0.5291948207479792

Using mean_squared_error

df["Squared Error"].mean()
0.5291948207479792
from sklearn.metrics import mean_squared_error as mse

mse(y_train, y_pred)
0.5291948207479792

Store the results in a dictionary:

mse_lr_train = mse(y_train, lr.predict(X_train))
mse_lr_val = mse(y_val, lr.predict(X_val))

mse_train = {"Linear Regression": mse_lr_train}
mse_val = {"Linear Regression": mse_lr_val}

Our First Neural Network

Lecture Outline

  • California House Price Prediction

  • EDA & Baseline Model

  • Our First Neural Network

  • Force positive predictions

  • Preprocessing

  • Early Stopping

  • Quiz

What are Keras and TensorFlow?

Keras is a common way of specifying, training, and using neural networks. It provides a simple interface to various backend libraries, including TensorFlow.

Keras as an independent interface, and Keras as part of TensorFlow.
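A quick way to check which Keras version you have and which backend it is running on (assuming Keras 3, where keras.backend.backend() reports the active backend):

import keras

print(keras.__version__)          # e.g. 3.x
print(keras.backend.backend())    # "tensorflow", "jax" or "torch"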

Create a Keras ANN model

Decide on the architecture: a simple fully-connected network with one hidden layer of 30 neurons.

Create the model:

from keras.models import Sequential
from keras.layers import Dense, Input

model = Sequential(
    [Input((num_features,)),
     Dense(30, activation="leaky_relu"),
     Dense(1, activation="leaky_relu")]
)

Inspect the model

model.summary()
Model: "sequential"
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Layer (type)                     Output Shape                  Param # ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ dense (Dense)                   │ (None, 30)             │           270 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dense_1 (Dense)                 │ (None, 1)              │            31 │
└─────────────────────────────────┴────────────────────────┴───────────────┘
 Total params: 301 (1.18 KB)
 Trainable params: 301 (1.18 KB)
 Non-trainable params: 0 (0.00 B)

The model is initialised randomly

model = Sequential([Dense(30, activation="leaky_relu"), Dense(1, activation="leaky_relu")])
model.predict(X_val.head(3), verbose=0)
array([[-91.88699  ],
       [-57.336792 ],
       [ -1.2164348]], dtype=float32)
model = Sequential([Dense(30, activation="leaky_relu"), Dense(1, activation="leaky_relu")])
model.predict(X_val.head(3), verbose=0)
array([[-63.595753],
       [-34.14082 ],
       [ 17.690414]], dtype=float32)

Controlling the randomness

import random

random.seed(123)

model = Sequential([Dense(30, activation="leaky_relu"), Dense(1, activation="leaky_relu")])

display(model.predict(X_val.head(3), verbose=0))

random.seed(123)
model = Sequential([Dense(30, activation="leaky_relu"), Dense(1, activation="leaky_relu")])

display(model.predict(X_val.head(3), verbose=0))
array([[ 1.3595750e+03],
       [ 8.2818079e+02],
       [-1.2993939e+00]], dtype=float32)
array([[ 1.3595750e+03],
       [ 8.2818079e+02],
       [-1.2993939e+00]], dtype=float32)
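An alternative (not used in these slides) is keras.utils.set_random_seed, which seeds Python's random, NumPy, and the backend framework in one call; results can still differ across machines and GPUs.

import keras
from keras.models import Sequential
from keras.layers import Dense

# Seed Python's `random`, NumPy and the backend (e.g. TensorFlow) in one go.
keras.utils.set_random_seed(123)

model = Sequential([Dense(30, activation="leaky_relu"), Dense(1, activation="leaky_relu")])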

Fit the model

random.seed(123)

model = Sequential([
    Dense(30, activation="leaky_relu"),
    Dense(1, activation="leaky_relu")
])

model.compile("adam", "mse")
%time hist = model.fit(X_train, y_train, epochs=5, verbose=False)
hist.history["loss"]
CPU times: user 1.84 s, sys: 162 ms, total: 2 s
Wall time: 1.59 s
[18765.189453125,
 178.23837280273438,
 103.30640411376953,
 48.04053497314453,
 18.110933303833008]


Make predictions

y_pred = model.predict(X_train[:3], verbose=0)
y_pred
WARNING:tensorflow:5 out of the last 5 calls to <function TensorFlowTrainer.make_predict_function.<locals>.one_step_on_data_distributed at 0x79bd896f7ce0> triggered tf.function retracing. Tracing is expensive and the excessive number of tracings could be due to (1) creating @tf.function repeatedly in a loop, (2) passing tensors with different shapes, (3) passing Python objects instead of tensors. For (1), please define your @tf.function outside of the loop. For (2), @tf.function has reduce_retracing=True option that can avoid unnecessary retracing. For (3), please refer to https://www.tensorflow.org/guide/function#controlling_retracing and https://www.tensorflow.org/api_docs/python/tf/function for  more details.
array([[ 0.5477159 ],
       [-1.525452  ],
       [-0.25848356]], dtype=float32)

Note

The .predict method gives us a ‘matrix’ (a 2-D array with one column), not a ‘vector’. Calling .flatten() converts it to a 1-D ‘vector’.

print(f"Original shape: {y_pred.shape}")
y_pred = y_pred.flatten()
print(f"Flattened shape: {y_pred.shape}")
y_pred
Original shape: (3, 1)
Flattened shape: (3,)
array([ 0.5477159 , -1.525452  , -0.25848356], dtype=float32)

Plot the predictions


Assess the model

y_pred = model.predict(X_val, verbose=0)
mse(y_val, y_pred)
8.391657291598232
mse_train["Basic ANN"] = mse(
    y_train, model.predict(X_train, verbose=0)
)
mse_val["Basic ANN"] = mse(y_val, model.predict(X_val, verbose=0))

Some predictions are negative:

y_pred = model.predict(X_val, verbose=0)
y_pred.min(), y_pred.max()
(-5.371005, 16.863848)
y_val.min(), y_val.max()
(0.225, 5.00001)

Force positive predictions

Lecture Outline

  • California House Price Prediction

  • EDA & Baseline Model

  • Our First Neural Network

  • Force positive predictions

  • Preprocessing

  • Early Stopping

  • Quiz

Try running for longer

random.seed(123)

model = Sequential([
    Dense(30, activation="leaky_relu"),
    Dense(1, activation="leaky_relu")
])

model.compile("adam", "mse")

%time hist = model.fit(X_train, y_train, \
    epochs=50, verbose=False)
CPU times: user 13.4 s, sys: 818 ms, total: 14.3 s
Wall time: 9.54 s

Loss curve

plt.plot(range(1, 51), hist.history["loss"])
plt.xlabel("Epoch")
plt.ylabel("MSE");

Loss curve

plt.plot(range(2, 51), hist.history["loss"][1:])
plt.xlabel("Epoch")
plt.ylabel("MSE");

Predictions

y_pred = model.predict(X_val, verbose=0)
print(f"Min prediction: {y_pred.min():.2f}")
print(f"Max prediction: {y_pred.max():.2f}")
Min prediction: -0.79
Max prediction: 12.92
plt.scatter(y_pred, y_val)
plt.xlabel("Predictions")
plt.ylabel("True values")
add_diagonal_line()
mse_train["Long run ANN"] = mse(
    y_train, model.predict(X_train, verbose=0)
)
mse_val["Long run ANN"] = mse(y_val, model.predict(X_val, verbose=0))

Try different activation functions

Enforce positive outputs (softplus)

random.seed(123)

model = Sequential([
    Dense(30, activation="leaky_relu"),
    Dense(1, activation="softplus")
])

model.compile("adam", "mse")

%time hist = model.fit(X_train, y_train, epochs=50, \
    verbose=False)

import numpy as np
losses = np.round(hist.history["loss"], 2)
print(losses[:5], "...", losses[-5:])
CPU times: user 13.2 s, sys: 790 ms, total: 14 s
Wall time: 13.2 s
[1.856457e+04 5.640000e+00 5.640000e+00 5.640000e+00 5.640000e+00] ... [5.64 5.64 5.64 5.64 5.64]
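For reference, both of these output activations map any real pre-activation to a positive number; softplus is \log(1 + \mathrm{e}^{\,x}), which behaves like x for large x and decays towards zero for very negative x. A quick NumPy check (illustrative, not from the slides):

import numpy as np

x = np.array([-5.0, 0.0, 5.0])
softplus = np.log1p(np.exp(x))   # log(1 + e^x), always positive
print(softplus)                  # ≈ [0.0067, 0.693, 5.0067]
print(np.exp(x))                 # also positive, but grows much faster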

Plot the predictions

Enforce positive outputs (\mathrm{e}^{\,x})

random.seed(123)

model = Sequential([
    Dense(30, activation="leaky_relu"),
    Dense(1, activation="exponential")
])

model.compile("adam", "mse")

%time hist = model.fit(X_train, y_train, epochs=5, verbose=False)

losses = hist.history["loss"]
print(losses)
CPU times: user 1.65 s, sys: 112 ms, total: 1.76 s
Wall time: 2.46 s
[nan, nan, nan, nan, nan]

The losses are all NaN: with unscaled inputs, the exponential output easily overflows, which motivates the preprocessing in the next section.

Preprocessing

Lecture Outline

  • California House Price Prediction

  • EDA & Baseline Model

  • Our First Neural Network

  • Force positive predictions

  • Preprocessing

  • Early Stopping

  • Quiz

Re-scaling the inputs

from sklearn.preprocessing import StandardScaler, MinMaxScaler

scaler = StandardScaler()
scaler.fit(X_train)

X_train_sc = scaler.transform(X_train)
X_val_sc = scaler.transform(X_val)
X_test_sc = scaler.transform(X_test)
plt.hist(X_train.iloc[:, 0])
plt.hist(X_train_sc[:, 0])
plt.legend(["Original", "Scaled"]);

Same model with scaled inputs

random.seed(123)

model = Sequential([
    Dense(30, activation="leaky_relu"),
    Dense(1, activation="exponential")
])

model.compile("adam", "mse")

%time hist = model.fit( \
    X_train_sc, \
    y_train, \
    epochs=50, \
    verbose=False)
CPU times: user 12.3 s, sys: 612 ms, total: 12.9 s
Wall time: 16.5 s

Loss curve

plt.plot(range(1, 51), hist.history["loss"])
plt.xlabel("Epoch")
plt.ylabel("MSE");

Loss curve

plt.plot(range(2, 51), hist.history["loss"][1:])
plt.xlabel("Epoch")
plt.ylabel("MSE");

Predictions

y_pred = model.predict(X_val_sc, verbose=0)
print(f"Min prediction: {y_pred.min():.2f}")
print(f"Max prediction: {y_pred.max():.2f}")
Min prediction: 0.00
Max prediction: 18.45
plt.scatter(y_pred, y_val)
plt.xlabel("Predictions")
plt.ylabel("True values")
add_diagonal_line()
mse_train["Exp ANN"] = mse(
    y_train, model.predict(X_train_sc, verbose=0)
)
mse_val["Exp ANN"] = mse(y_val, model.predict(X_val_sc, verbose=0))

Comparing MSE (smaller is better)

On training data:

mse_train
{'Linear Regression': 0.5291948207479792,
 'Basic ANN': 8.374382131620425,
 'Long run ANN': 0.9770473035600079,
 'Exp ANN': 0.3182808342909683}

On validation data (expect worse, i.e. bigger):

mse_val
{'Linear Regression': 0.5059420205381367,
 'Basic ANN': 8.391657291598232,
 'Long run ANN': 0.9279673788287134,
 'Exp ANN': 0.36969620817676596}

Comparing models (train)

train_results = pd.DataFrame(
    {"Model": mse_train.keys(), "MSE": mse_train.values()}
)
train_results.sort_values("MSE", ascending=False)
Model MSE
1 Basic ANN 8.374382
2 Long run ANN 0.977047
0 Linear Regression 0.529195
3 Exp ANN 0.318281

Comparing models (validation)

val_results = pd.DataFrame(
    {"Model": mse_val.keys(), "MSE": mse_val.values()}
)
val_results.sort_values("MSE", ascending=False)
Model MSE
1 Basic ANN 8.391657
2 Long run ANN 0.927967
0 Linear Regression 0.505942
3 Exp ANN 0.369696

Early Stopping

Lecture Outline

  • California House Price Prediction

  • EDA & Baseline Model

  • Our First Neural Network

  • Force positive predictions

  • Preprocessing

  • Early Stopping

  • Quiz

Choosing when to stop training

Illustrative loss curves over time.

Try early stopping

Hinton calls it a “beautiful free lunch”

from keras.callbacks import EarlyStopping

random.seed(123)
model = Sequential([
    Dense(30, activation="leaky_relu"),
    Dense(1, activation="exponential")
])
model.compile("adam", "mse")

es = EarlyStopping(restore_best_weights=True, patience=15)

%time hist = model.fit(X_train_sc, y_train, epochs=1_000, \
    callbacks=[es], validation_data=(X_val_sc, y_val), verbose=False)
print(f"Keeping model at epoch #{len(hist.history['loss'])-10}.")
CPU times: user 6.24 s, sys: 471 ms, total: 6.71 s
Wall time: 4.73 s
Keeping model at epoch #14.
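The epoch with the best validation loss can also be read straight off the training history, rather than inferred from the patience (a small sketch, not from the slides):

import numpy as np

# hist.history["val_loss"] has one entry per epoch (0-indexed, so add 1).
best_epoch = np.argmin(hist.history["val_loss"]) + 1
print(f"Best validation loss at epoch #{best_epoch}.")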

Loss curve

plt.plot(hist.history["loss"])
plt.plot(hist.history["val_loss"])
plt.legend(["Training", "Validation"]);

Loss curve II

plt.plot(hist.history["loss"])
plt.plot(hist.history["val_loss"])
plt.ylim([0, 8])
plt.legend(["Training", "Validation"]);

Predictions

Comparing models (validation)

Model MSE
1 Basic ANN 8.391657
2 Long run ANN 0.927967
0 Linear Regression 0.505942
4 Early stop ANN 0.386975
3 Exp ANN 0.369696

The test set

Evaluate only the final/selected model on the test set.

mse(y_test, model.predict(X_test_sc, verbose=0))
0.4026048522207643
model.evaluate(X_test_sc, y_test, verbose=False)
0.4026048183441162

The tiny difference between the two numbers is just 32-bit floating-point rounding inside Keras.

Another useful callback

from pathlib import Path
from keras.callbacks import ModelCheckpoint

random.seed(123)
model = Sequential(
    [Dense(30, activation="leaky_relu"), Dense(1, activation="exponential")]
)
model.compile("adam", "mse")
mc = ModelCheckpoint(
    "best-model.keras", monitor="val_loss", save_best_only=True
)
es = EarlyStopping(restore_best_weights=True, patience=5)
hist = model.fit(
    X_train_sc,
    y_train,
    epochs=100,
    validation_split=0.1,
    callbacks=[mc, es],
    verbose=False,
)
Path("best-model.keras").stat().st_size
19215
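To reuse the checkpointed model later, it can be loaded back from disk (a sketch, not from the original slides):

from keras.models import load_model

# Restore the best model (lowest val_loss) saved by ModelCheckpoint.
best_model = load_model("best-model.keras")
best_model.evaluate(X_test_sc, y_test, verbose=False)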

Quiz

Lecture Outline

  • California House Price Prediction

  • EDA & Baseline Model

  • Our First Neural Network

  • Force positive predictions

  • Preprocessing

  • Early Stopping

  • Quiz

Critique this 💩 regression code

X_train = features[:80]; X_test = features[81:]
y_train = targets[:80]; y_test = targets[81:]
model = Sequential([
   Input((2,)),
  Dense(32, activation='relu'),
   Dense(32, activation='relu'),
  Dense(1, activation='sigmoid')
])
model.compile(optimizer="adam", loss='mse')
es = EarlyStopping(patience=10)
fitted_model = model.fit(X_train, y_train, epochs=5,
  callbacks=[es], verbose=False)
trainMAE = model.evaluate(X_train, y_train, verbose=False)
hist = model.fit(X_test, y_test, epochs=5,
  callbacks=[es], verbose=False)
hist.history["loss"]
testMAE = model.evaluate(X_test, y_test, verbose=False)
f"Train MAE: {testMAE:.2f} Test MAE: {trainMAE:.2f}"
'Train MAE: 4.82 Test MAE: 4.32'

The data

sns.scatterplot(
  x="$x_1$", y="$x_2$",
  c=targets, data=features);

sns.displot(targets, kde=True, stat="density");

Package Versions

from watermark import watermark
print(watermark(python=True, packages="keras,matplotlib,numpy,pandas,seaborn,scipy,torch,tensorflow,tf_keras"))
Python implementation: CPython
Python version       : 3.11.9
IPython version      : 8.24.0

keras     : 3.3.3
matplotlib: 3.8.4
numpy     : 1.26.4
pandas    : 2.2.2
seaborn   : 0.13.2
scipy     : 1.11.0
torch     : 2.0.1
tensorflow: 2.16.1
tf_keras  : 2.16.0

Glossary

  • callbacks
  • cost/loss function
  • early stopping
  • epoch
  • Keras, Tensorflow, PyTorch
  • matplotlib, seaborn
  • neural network architecture
  • targets
  • training/test split
  • validation set