Exercise: Police Reports of US Car Crashes

ACTL3143 & ACTL5111 Deep Learning for Actuaries

Your task follows on from the lecture example, where we looked at the National Motor Vehicle Crash Causation Survey dataset. You will use the police reports (text data) alongside the weather-related boolean variables to predict the boolean INJSEVB variable, that is, the presence of bodily injury.

DALL-E’s rendition of this car crash police report dataset.

The data

The data is sourced from the Swiss Association of Actuaries’ Actuarial Data Science tutorials; see this article for a longer treatment of this dataset using transformers. The dataset is hosted on GitHub.

Encoding the text

Implement two (or more) models by preprocessing the text data in your choice of the following options:

  1. A basic bag-of-words approach like in the lectures.
  2. Try instead using TF-IDF values, e.g. from sklearn.feature_extraction.text.TfidfVectorizer.
  3. Either of the above but using both unigrams and bigrams.
  4. Any of the above but after lemmatising the data first.

For example, you may fit one model on the bag-of-words representation of the original text and another on the bag-of-words representation of the lemmatised text. The sketches below illustrate these preprocessing options.
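Here is a minimal sketch of options 1–3, assuming scikit-learn; the toy `reports` list is a stand-in for the actual police report column in the data:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Toy stand-in for the police report text; use the real report column instead.
reports = [
    "V1 was travelling northbound when it crossed the centreline.",
    "The driver of V2 was distracted and struck V1 from behind.",
]

# Option 1: basic bag-of-words counts (cap the vocabulary size as you see fit).
bow = CountVectorizer(max_features=1_000)
X_bow = bow.fit_transform(reports).toarray()

# Option 2: TF-IDF weights instead of raw counts.
tfidf = TfidfVectorizer(max_features=1_000)
X_tfidf = tfidf.fit_transform(reports).toarray()

# Option 3: either vectoriser, using both unigrams and bigrams.
tfidf_12 = TfidfVectorizer(max_features=1_000, ngram_range=(1, 2))
X_tfidf_12 = tfidf_12.fit_transform(reports).toarray()
```

For option 4, one possibility (among several; spaCy is a common alternative) is NLTK’s WordNet lemmatiser, applied before vectorising:

```python
import nltk
from nltk.stem import WordNetLemmatizer

nltk.download("wordnet", quiet=True)  # one-off download of the lemma database

lemmatiser = WordNetLemmatizer()

def lemmatise(text):
    # Lowercase, split on whitespace, and lemmatise each token; a vectoriser's
    # own tokeniser would handle punctuation more carefully than this.
    return " ".join(lemmatiser.lemmatize(token) for token in text.lower().split())

lemmatised_reports = [lemmatise(report) for report in reports]
X_bow_lemma = CountVectorizer(max_features=1_000).fit_transform(lemmatised_reports)
```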

Neural network

Use any deep learning architecture to process these inputs, and report the final accuracy each model achieves.

Notes: do not predict the number of vehicles in the accident; the target here is INJSEVB. Make sure you don’t include INJSEVA as an input to your models, as it is closely related to the target variable. Also, the top-k accuracy metric is unnecessary for this binary classification task.
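As one possible starting point, here is a minimal Keras sketch with two inputs (text features and weather booleans) and a sigmoid output for the binary target. The random placeholder arrays, layer sizes, and epoch count are assumptions to be replaced with your own data and tuning:

```python
import numpy as np
import keras
from keras import layers

# Placeholder arrays with plausible shapes; replace them with your vectorised
# reports, the weather-related booleans, and the INJSEVB target.
rng = np.random.default_rng(42)
X_text = rng.random((500, 1_000))                           # bag-of-words or TF-IDF matrix
X_weather = rng.integers(0, 2, (500, 3)).astype("float32")  # weather booleans
y = rng.integers(0, 2, 500)                                 # INJSEVB (bodily injury)

# Concatenate the two inputs and pass them through a small dense network.
text_input = keras.Input(shape=(X_text.shape[1],), name="text")
weather_input = keras.Input(shape=(X_weather.shape[1],), name="weather")
x = layers.Concatenate()([text_input, weather_input])
x = layers.Dense(64, activation="relu")(x)
x = layers.Dropout(0.3)(x)
output = layers.Dense(1, activation="sigmoid")(x)           # P(bodily injury)

model = keras.Model([text_input, weather_input], output)
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit([X_text, X_weather], y, epochs=5, validation_split=0.2)
```

Evaluate each preprocessing variant on the same held-out test set (e.g. with `model.evaluate`) so the reported accuracies are comparable.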

Permutation importance

Run the permutation importance algorithm on your best model from the previous part, and report the 3 most important input variables/features (i.e. words or bigrams) for your model.
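One way to do this by hand for a Keras model, continuing from the arrays and `model` in the previous sketch (`sklearn.inspection.permutation_importance` implements the same idea for scikit-learn-compatible estimators). The `feature_names` placeholder stands in for the fitted vectoriser’s `get_feature_names_out()`:

```python
import numpy as np

# Placeholder names; with a fitted vectoriser, use get_feature_names_out().
feature_names = [f"feature_{j}" for j in range(X_text.shape[1])]

def accuracy(X_text, X_weather, y):
    probs = model.predict([X_text, X_weather], verbose=0).ravel()
    return np.mean((probs > 0.5) == y)

baseline = accuracy(X_text, X_weather, y)

rng = np.random.default_rng(0)
importances = np.zeros(X_text.shape[1])
X_perm = X_text.copy()
for j in range(X_text.shape[1]):
    original = X_perm[:, j].copy()
    # Shuffle one column to break its link with the target; the drop in
    # accuracy measures how much the model relied on that feature.
    X_perm[:, j] = rng.permutation(original)
    importances[j] = baseline - accuracy(X_perm, X_weather, y)
    X_perm[:, j] = original  # restore the column before the next feature

top3 = np.argsort(importances)[::-1][:3]
print([feature_names[j] for j in top3])
```

In practice, compute the importances on a held-out set and average over several shuffles per feature, since a single permutation can be noisy.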