Car Insurance Modelling using Python
Car insurance modelling refers to the use of statistical and mathematical models to predict events such as the frequency of claims, the severity of claims, or the total cost associated with claims. These predictions help in pricing insurance products, managing risk, and optimizing business strategies. If you want to learn how to perform car insurance modelling, this article is for you. In this article, I’ll take you through the task of Car Insurance Modelling using Python.
Car Insurance Modelling: Process We Can Follow
The end goal of Car Insurance Modelling is to create a model that minimizes prediction errors and provides valuable insights into the factors influencing claim frequencies or claim amounts.
Below is the process we can follow for the task of Car Insurance Modelling:
- Understand what the business aims to achieve with the model, whether it’s reducing risk, optimizing pricing, increasing competitiveness, or a combination of objectives.
- Collect data from internal systems (claims history, customer records) and external sources (public vehicle records, geographic information systems).
- Create new features that might be predictive of risk or claim cost, such as aggregating past claims to calculate claim frequency or severity.
- Choose appropriate statistical or machine learning algorithms based on the problem characteristics. Common choices might include linear regression, generalized linear models, decision trees, or ensemble methods.
- Ensure that the model meets business objectives and risk thresholds.
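The feature-engineering step above can be sketched in code. The claims-history table, driver IDs, and column names here are hypothetical stand-ins, but the groupby aggregation pattern is a common way to turn raw claims records into frequency and severity features:

```python
import pandas as pd

# Hypothetical claims-history table (stand-in for internal claims records)
claims_history = pd.DataFrame({
    "driver_id":    [1, 1, 2, 3, 3, 3],
    "claim_amount": [500.0, 1200.0, 300.0, 800.0, 0.0, 2500.0],
    "policy_years": [4, 4, 2, 6, 6, 6],
})

# Aggregate past claims per driver into frequency and severity features
features = (
    claims_history
    .groupby("driver_id")
    .agg(n_claims=("claim_amount", "size"),
         avg_severity=("claim_amount", "mean"),
         policy_years=("policy_years", "first"))
)
features["claim_frequency"] = features["n_claims"] / features["policy_years"]
print(features)
```

Features like `claim_frequency` (claims per policy year) and `avg_severity` can then be joined back onto the customer records as model inputs.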
To get started with the task of Car Insurance Modelling, we need a dataset based on car insurance claims. I found an ideal dataset for this task. You can download the dataset from here.
Car Insurance Modelling using Python
Now, let’s get started with the task of Car Insurance Modelling by importing the necessary Python libraries and the dataset:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm
# Load the data
insurance_data = pd.read_csv("insurance_claims.csv")
# Displaying the first few rows of the dataset
print(insurance_data.head())
age_of_driver car_age region number_of_claims
0 30 7 Urban 0
1 33 10 Rural 2
2 39 11 Suburban 1
3 18 12 Urban 0
4 21 8 Urban 0
The dataset consists of the following columns:
- age_of_driver: The age of the driver.
- car_age: The age of the car.
- region: The region where the driver is located (Urban, Rural, Suburban).
- number_of_claims: The number of claims made.
So, the objective of this Car Insurance Modelling task will be to predict the frequency of insurance claims made by drivers.
Now, let’s analyze the summary statistics and then move on to missing values and distribution analysis:
# Summary statistics of the dataset
summary_stats = insurance_data.describe(include='all')
# Checking for missing values
missing_values = insurance_data.isnull().sum()
print(summary_stats)
age_of_driver car_age region number_of_claims
count 1000.000000 1000.000000 1000 1000.000000
unique NaN NaN 3 NaN
top NaN NaN Rural NaN
freq NaN NaN 343 NaN
mean 33.112000 6.673000 NaN 0.675000
std 9.253598 4.377583 NaN 0.822223
min 18.000000 0.000000 NaN 0.000000
25% 25.000000 3.000000 NaN 0.000000
50% 33.000000 6.000000 NaN 0.000000
75% 41.000000 11.000000 NaN 1.000000
max 49.000000 14.000000 NaN 5.000000
print(missing_values)
age_of_driver 0
car_age 0
region 0
number_of_claims 0
dtype: int64
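Our dataset turns out to be complete, but if gaps did appear, two common options are imputing numeric columns and dropping rows with missing categories. A minimal sketch, using a small hypothetical frame:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with gaps, to show the options if missing values appear
sample = pd.DataFrame({
    "age_of_driver": [30.0, np.nan, 41.0],
    "region": ["Urban", "Rural", None],
})

# Numeric gaps: impute with the median; categorical gaps: drop the rows
sample["age_of_driver"] = sample["age_of_driver"].fillna(sample["age_of_driver"].median())
sample = sample.dropna(subset=["region"])
print(sample)
```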
Now, let’s proceed with visualizing the distributions and analyzing the categorical variables:
# Set the aesthetic style of the plots
sns.set_style("whitegrid")
# Plotting distributions of numerical variables
fig, ax = plt.subplots(1, 3, figsize=(18, 5))
sns.histplot(insurance_data['age_of_driver'], kde=True, bins=15, ax=ax[0])
ax[0].set_title('Age of Driver Distribution')
sns.histplot(insurance_data['car_age'], kde=True, bins=15, ax=ax[1])
ax[1].set_title('Car Age Distribution')
sns.histplot(insurance_data['number_of_claims'], kde=False, bins=range(6), ax=ax[2])
ax[2].set_title('Number of Claims Distribution')
plt.tight_layout()
plt.show()

Observations:
- Age of Driver: Appears approximately normally distributed around the 30s.
- Car Age: Also seems fairly normally distributed with a slight right skew.
- Number of Claims: Most drivers have 0 claims, indicating a right-skewed distribution which is common in count data like insurance claims.
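The right skew in the claim counts can also be quantified directly. A quick sketch, using a synthetic series as a stand-in for `insurance_data['number_of_claims']`:

```python
import pandas as pd

# Synthetic stand-in for insurance_data['number_of_claims']
claims = pd.Series([0, 0, 0, 1, 0, 2, 1, 0, 0, 3])

# Share of drivers with zero claims, and the skewness of the distribution;
# count data like this typically concentrates mass at zero
zero_share = (claims == 0).mean()
print(f"share with zero claims: {zero_share:.0%}")
print(f"skewness: {claims.skew():.2f}")
```

A large zero share and positive skewness are exactly what motivate count-data models like Poisson regression later on.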
Data Preprocessing
The ‘region’ variable in the dataset is categorical and should be converted to a numerical format suitable for modelling. One-hot encoding is a common approach we can use here. Once preprocessing is complete, we’ll proceed to build the GLM model focusing on a Poisson or Negative Binomial distribution, as these are common for modelling count data like insurance claims.
Let’s start with preprocessing the data:
from sklearn.model_selection import train_test_split
# One-Hot Encoding for 'region' variable
insurance_data_encoded = pd.get_dummies(insurance_data, columns=['region'], drop_first=True)
# Splitting the data into training and testing sets
train, test = train_test_split(insurance_data_encoded, test_size=0.2, random_state=42)
print(train.head())
age_of_driver car_age number_of_claims region_Suburban region_Urban
29 26 13 1 0 0
535 42 9 2 1 0
695 44 8 1 0 0
557 41 3 1 1 0
836 31 12 0 0 1
The ‘region’ variable has been one-hot encoded, resulting in two new variables region_Suburban and region_Urban (with region_Rural being the baseline category implicitly). The data has also been split into training (80%) and testing (20%) sets.
Car Insurance Modelling for Event Frequency
For modelling event frequency, especially for count data like number of claims, Poisson or Negative Binomial regression is typically used. The choice between them usually depends on the variance of the count data:
- Poisson Regression: Assumes the mean and variance of the count data are equal. Suitable when data does not exhibit overdispersion.
- Negative Binomial Regression: More flexible, can handle overdispersion (variance greater than mean) in the count data.
We’ll start with Poisson Regression and check if it’s a suitable model based on its fit and diagnostics. If needed, we can then consider the Negative Binomial Regression.
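A quick way to choose between the two is to compare the mean and variance of the claim counts before fitting anything. A sketch, using a synthetic series as a stand-in for the training counts:

```python
import pandas as pd

# Synthetic stand-in for the training counts; with the real data, use
# y_train = train['number_of_claims']
y_train = pd.Series([0, 0, 1, 0, 1, 2, 0, 1, 0, 0, 1, 2])

mean_claims = y_train.mean()
var_claims = y_train.var()          # sample variance
dispersion = var_claims / mean_claims

print(f"mean={mean_claims:.3f}, variance={var_claims:.3f}, ratio={dispersion:.2f}")
# A ratio near 1 supports Poisson; a ratio well above 1 indicates
# overdispersion and points toward Negative Binomial instead
```

For our dataset, the summary statistics above give a mean of 0.675 and a variance of about 0.822² ≈ 0.68, so the ratio is close to 1 and Poisson is a reasonable starting point.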
Let’s proceed to build the Poisson GLM model using the training data. We’ll consider ‘age_of_driver’, ‘car_age’, ‘region_Suburban’, and ‘region_Urban’ as predictors and ‘number_of_claims’ as the response variable:
# Preparing the data for modeling
X_train = train.drop('number_of_claims', axis=1)
y_train = train['number_of_claims']
# Adding constant to the predictor variables
X_train_const = sm.add_constant(X_train)
# Building the Poisson GLM model
poisson_glm = sm.GLM(y_train, X_train_const, family=sm.families.Poisson()).fit()
# Displaying the model summary
print(poisson_glm.summary())
Generalized Linear Model Regression Results
Dep. Variable: number_of_claims No. Observations: 800
Model: GLM Df Residuals: 795
Model Family: Poisson Df Model: 4
Link Function: Log Scale: 1.0000
Method: IRLS Log-Likelihood: -845.44
Date: Sun, 07 Jan 2024 Deviance: 829.41
Time: 11:17:53 Pearson chi2: 770.
No. Iterations: 5 Pseudo R-squ. (CS): 0.02749
Covariance Type: nonrobust
coef std err z P>|z| [0.025 0.975]
const 0.0906 0.181 0.502 0.616 -0.263 0.444
age_of_driver -0.0205 0.005 -4.321 0.000 -0.030 -0.011
car_age 0.0152 0.010 1.539 0.124 -0.004 0.035
region_Suburban 0.0799 0.106 0.755 0.450 -0.127 0.287
region_Urban 0.0861 0.108 0.794 0.427 -0.126 0.298
The Poisson GLM has been fit with the following results:
- const: Represents the intercept. The estimate is 0.0906, but it’s not statistically significant (p-value: 0.616).
- age_of_driver: The coefficient is -0.0205, indicating that with each additional year of the driver’s age, the log of expected claims decreases. It’s statistically significant (p-value < 0.001), suggesting a strong relationship between the age of the driver and the number of claims.
- car_age: The coefficient is 0.0152, suggesting that older cars tend to have slightly higher expected claims, although this effect is not statistically significant (p-value: 0.124).
- region_Suburban and region_Urban: The coefficients for these categories are 0.0799 and 0.0861, respectively. Both are compared to the baseline “Rural” category and indicate slightly higher expected claims in suburban and urban areas. However, neither is statistically significant (p-values: 0.450 and 0.427, respectively).
The model suggests that as drivers get older, the expected number of claims decreases significantly. There’s some indication that car age might affect the number of claims, with older cars having more claims, but this isn’t statistically significant in this model. The model does not find statistically significant evidence that being in suburban or urban regions affects the number of claims compared to being in a rural area.
Summary
So, car insurance modelling is the use of statistical and mathematical models to predict events such as the frequency of claims, the severity of claims, or the total cost associated with claims. These predictions help in pricing insurance products, managing risk, and optimizing business strategies. I hope you liked this article on Car Insurance Modelling using Python. Feel free to ask valuable questions in the comments section below.