Car Insurance Modelling using Python
Car insurance modelling refers to the use of statistical and mathematical models to predict events such as the frequency of claims, the severity of claims, or the total cost associated with claims. These predictions help in pricing insurance products, managing risk, and optimizing business strategies. If you want to learn how to perform car insurance modelling, this article is for you. In this article, I’ll take you through the task of Car Insurance Modelling using Python.
Car Insurance Modelling: Process We Can Follow
The end goal of Car Insurance Modelling is to create a model that minimizes prediction errors and provides valuable insights into the factors influencing claim frequencies or claim amounts.
Below is the process we can follow for the task of Car Insurance Modelling:
- Understand what the business aims to achieve with the model, whether it’s reducing risk, optimizing pricing, increasing competitiveness, or a combination of objectives.
- Collect data from internal systems (claims history, customer records) and external sources (public vehicle records, geographic information systems).
- Create new features that might be predictive of risk or claim cost, such as aggregating past claims to calculate claim frequency or severity.
- Choose appropriate statistical or machine learning algorithms based on the problem characteristics. Common choices might include linear regression, generalized linear models, decision trees, or ensemble methods.
- Ensure that the model meets business objectives and risk thresholds.
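The feature-engineering step above can be sketched in code. The claims-history table, driver IDs, and column names here are hypothetical stand-ins, but the groupby aggregation pattern is a common way to turn raw claims records into frequency and severity features:

```python
import pandas as pd

# Hypothetical claims-history table (stand-in for internal claims records)
claims_history = pd.DataFrame({
    "driver_id":    [1, 1, 2, 3, 3, 3],
    "claim_amount": [500.0, 1200.0, 300.0, 800.0, 0.0, 2500.0],
    "policy_years": [4, 4, 2, 6, 6, 6],
})

# Aggregate past claims per driver into frequency and severity features
features = (
    claims_history
    .groupby("driver_id")
    .agg(n_claims=("claim_amount", "size"),
         avg_severity=("claim_amount", "mean"),
         policy_years=("policy_years", "first"))
)
features["claim_frequency"] = features["n_claims"] / features["policy_years"]
print(features)
```

Features like `claim_frequency` (claims per policy year) and `avg_severity` can then be joined back onto the customer records as model inputs.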
To get started with the task of Car Insurance Modelling, we need a dataset based on car insurance claims. I found an ideal dataset for this task. You can download the dataset from here.
Car Insurance Modelling using Python
Now, let’s get started with the task of Car Insurance Modelling by importing the necessary Python libraries and the dataset:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm
# Load the data
insurance_data = pd.read_csv("insurance_claims.csv")
# Displaying the first few rows of the dataset
print(insurance_data.head())
age_of_driver car_age region number_of_claims
0 30 7 Urban 0
1 33 10 Rural 2
2 39 11 Suburban 1
3 18 12 Urban 0
4 21 8 Urban 0
The dataset consists of the following columns:
- age_of_driver: The age of the driver.
- car_age: The age of the car.
- region: The region where the driver is located (Urban, Rural, Suburban).
- number_of_claims: The number of claims made.
So, the objective of this Car Insurance Modelling task will be to predict the frequency of insurance claims made by drivers.
Now, let’s analyze the summary statistics and then move on to missing values and distribution analysis:
# Summary statistics of the dataset
summary_stats = insurance_data.describe(include='all')
# Checking for missing values
missing_values = insurance_data.isnull().sum()
print(summary_stats)
age_of_driver car_age region number_of_claims
count 1000.000000 1000.000000 1000 1000.000000
unique NaN NaN 3 NaN
top NaN NaN Rural NaN
freq NaN NaN 343 NaN
mean 33.112000 6.673000 NaN 0.675000
std 9.253598 4.377583 NaN 0.822223
min 18.000000 0.000000 NaN 0.000000
25% 25.000000 3.000000 NaN 0.000000
50% 33.000000 6.000000 NaN 0.000000
75% 41.000000 11.000000 NaN 1.000000
max 49.000000 14.000000 NaN 5.000000
print(missing_values)
age_of_driver 0
car_age 0
region 0
number_of_claims 0
dtype: int64
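Our dataset turns out to be complete, but if gaps did appear, two common options are imputing numeric columns and dropping rows with missing categories. A minimal sketch, using a small hypothetical frame:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with gaps, to show the options if missing values appear
sample = pd.DataFrame({
    "age_of_driver": [30.0, np.nan, 41.0],
    "region": ["Urban", "Rural", None],
})

# Numeric gaps: impute with the median; categorical gaps: drop the rows
sample["age_of_driver"] = sample["age_of_driver"].fillna(sample["age_of_driver"].median())
sample = sample.dropna(subset=["region"])
print(sample)
```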
Now, let’s proceed with visualizing the distributions and analyzing the categorical variables:
# Set the aesthetic style of the plots
sns.set_style("whitegrid")
# Plotting distributions of numerical variables
fig, ax = plt.subplots(1, 3, figsize=(18, 5))
sns.histplot(insurance_data['age_of_driver'], kde=True, bins=15, ax=ax[0])
ax[0].set_title('Age of Driver Distribution')
sns.histplot(insurance_data['car_age'], kde=True, bins=15, ax=ax[1])
ax[1].set_title('Car Age Distribution')
sns.histplot(insurance_data['number_of_claims'], kde=False, bins=range(6), ax=ax[2])
ax[2].set_title('Number of Claims Distribution')
plt.tight_layout()
plt.show()

Observations:
- Age of Driver: Appears approximately normally distributed around the 30s.
- Car Age: Also seems fairly normally distributed with a slight right skew.
- Number of Claims: Most drivers have 0 claims, indicating a right-skewed distribution which is common in count data like insurance claims.
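The right skew in the claim counts can also be quantified directly. A quick sketch, using a synthetic series as a stand-in for `insurance_data['number_of_claims']`:

```python
import pandas as pd

# Synthetic stand-in for insurance_data['number_of_claims']
claims = pd.Series([0, 0, 0, 1, 0, 2, 1, 0, 0, 3])

# Share of drivers with zero claims, and the skewness of the distribution;
# count data like this typically concentrates mass at zero
zero_share = (claims == 0).mean()
print(f"share with zero claims: {zero_share:.0%}")
print(f"skewness: {claims.skew():.2f}")
```

A large zero share and positive skewness are exactly what motivate count-data models like Poisson regression later on.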
Data Preprocessing
The ‘region’ variable in the dataset is categorical and should be converted to a numerical format suitable for modelling. One-hot encoding is a common approach we can use here. Once preprocessing is complete, we’ll proceed to build the GLM model focusing on a Poisson or Negative Binomial distribution, as these are common for modelling count data like insurance claims.
Let’s start with preprocessing the data:
from sklearn.model_selection import train_test_split
# One-Hot Encoding for 'region' variable
insurance_data_encoded = pd.get_dummies(insurance_data, columns=['region'], drop_first=True)
# Splitting the data into training and testing sets
train, test = train_test_split(insurance_data_encoded, test_size=0.2, random_state=42)
print(train.head())
age_of_driver car_age number_of_claims region_Suburban region_Urban
29 26 13 1 0 0
535 42 9 2 1 0
695 44 8 1 0 0
557 41 3 1 1 0
836 31 12 0 0 1
The ‘region’ variable has been one-hot encoded, resulting in two new variables region_Suburban and region_Urban (with region_Rural being the baseline category implicitly). The data has also been split into training (80%) and testing (20%) sets.
Car Insurance Modelling for Event Frequency
For modelling event frequency, especially for count data like number of claims, Poisson or Negative Binomial regression is typically used. The choice between them usually depends on the variance of the count data:
- Poisson Regression: Assumes the mean and variance of the count data are equal. Suitable when data does not exhibit overdispersion.
- Negative Binomial Regression: More flexible, can handle overdispersion (variance greater than mean) in the count data.
We’ll start with Poisson Regression and check if it’s a suitable model based on its fit and diagnostics. If needed, we can then consider the Negative Binomial Regression.
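A quick way to choose between the two is to compare the mean and variance of the claim counts before fitting anything. A sketch, using a synthetic series as a stand-in for the training counts:

```python
import pandas as pd

# Synthetic stand-in for the training counts; with the real data, use
# y_train = train['number_of_claims']
y_train = pd.Series([0, 0, 1, 0, 1, 2, 0, 1, 0, 0, 1, 2])

mean_claims = y_train.mean()
var_claims = y_train.var()          # sample variance
dispersion = var_claims / mean_claims

print(f"mean={mean_claims:.3f}, variance={var_claims:.3f}, ratio={dispersion:.2f}")
# A ratio near 1 supports Poisson; a ratio well above 1 indicates
# overdispersion and points toward Negative Binomial instead
```

For our dataset, the summary statistics above give a mean of 0.675 and a variance of about 0.822² ≈ 0.68, so the ratio is close to 1 and Poisson is a reasonable starting point.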
Let’s proceed to build the Poisson GLM model using the training data. We’ll consider ‘age_of_driver’, ‘car_age’, ‘region_Suburban’, and ‘region_Urban’ as predictors and ‘number_of_claims’ as the response variable:
# Preparing the data for modeling
X_train = train.drop('number_of_claims', axis=1)
y_train = train['number_of_claims']
# Adding constant to the predictor variables
X_train_const = sm.add_constant(X_train)
# Building the Poisson GLM model
poisson_glm = sm.GLM(y_train, X_train_const, family=sm.families.Poisson()).fit()
# Displaying the model summary
print(poisson_glm.summary())
Generalized Linear Model Regression Results
Dep. Variable: number_of_claims No. Observations: 800
Model: GLM Df Residuals: 795
Model Family: Poisson Df Model: 4
Link Function: Log Scale: 1.0000
Method: IRLS Log-Likelihood: -845.44
Date: Sun, 07 Jan 2024 Deviance: 829.41
Time: 11:17:53 Pearson chi2: 770.
No. Iterations: 5 Pseudo R-squ. (CS): 0.02749
Covariance Type: nonrobust
coef std err z P>|z| [0.025 0.975]
const 0.0906 0.181 0.502 0.616 -0.263 0.444
age_of_driver -0.0205 0.005 -4.321 0.000 -0.030 -0.011
car_age 0.0152 0.010 1.539 0.124 -0.004 0.035
region_Suburban 0.0799 0.106 0.755 0.450 -0.127 0.287
region_Urban 0.0861 0.108 0.794 0.427 -0.126 0.298
The Poisson GLM has been fit with the following results:
- const: Represents the intercept. The estimate is 0.0906, but it’s not statistically significant (p-value: 0.616).
- age_of_driver: The coefficient is -0.0205, indicating that with each additional year of the driver’s age, the log of expected claims decreases. It’s statistically significant (p-value < 0.001), suggesting a strong relationship between the age of the driver and the number of claims.
- car_age: The coefficient is 0.0152, suggesting that older cars tend to have slightly higher expected claims, although this effect is not statistically significant (p-value: 0.124).
- region_Suburban and region_Urban: The coefficients for these categories are 0.0799 and 0.0861, respectively. Both are compared to the baseline “Rural” category and indicate slightly higher expected claims in suburban and urban areas. However, neither is statistically significant (p-values: 0.450 and 0.427, respectively).
The model suggests that as drivers get older, the expected number of claims decreases significantly. There’s some indication that car age might affect the number of claims, with older cars having more claims, but this isn’t statistically significant in this model. The model does not find statistically significant evidence that being in suburban or urban regions affects the number of claims compared to being in a rural area.
Summary
So, car insurance modelling is the use of statistical and mathematical models to predict events such as the frequency of claims, the severity of claims, or the total cost associated with claims. These predictions help in pricing insurance products, managing risk, and optimizing business strategies. I hope you liked this article on Car Insurance Modelling using Python. Feel free to ask valuable questions in the comments section below.