Introduction

Session Settings
# Graphs----
face_text='plain'
face_title='plain'
size_title = 14
size_text = 11
legend_size = 11

global_theme <- function() {
  theme_minimal() %+replace%
    theme(
      text = element_text(size = size_text, face = face_text),
      legend.position = "bottom",
      legend.direction = "horizontal", 
      legend.box = "vertical",
      legend.key = element_blank(),
      legend.text = element_text(size = legend_size),
      axis.text = element_text(size = size_text, face = face_text), 
      plot.title = element_text(
        size = size_title, 
        hjust = 0.5
      ),
      plot.subtitle = element_text(hjust = 0.5)
    )
}

# Outputs
options("digits" = 2)

In Brief

This vignette illustrates the practical application of Generalized Additive Models (GAM) to insurance data, using the beMTPL16 dataset from Charpentier (2014). The data describe insurance contracts and claims from a Belgian motor third-party liability portfolio, that is, public liability cover for drivers. The goal is to build a model of the factors that affect claim occurrence, with a special focus on elderly drivers.

Required Packages

required_libraries <- c(
  "tidyverse", 
  "CASdatasets",
  "wesanderson",
  "mgcv",
  "broom",
  "knitr"
)
invisible(lapply(required_libraries, library, character.only = TRUE))

Data

The data used in this vignette come from a Belgian motor third-party liability insurance portfolio.

The dataset, beMTPL16, contains details on contracts and clients obtained from a Belgian insurance company and relates to a public liability insurance portfolio.

For convenience, the beMTPL16 table will be referred to as CLAIMS.

Dictionaries

The variables of the beMTPL16 dataset are listed in Table 1.

Table 1: Content of the beMTPL16 dataset (CLAIMS)
Attribute                                  Type       Description
insurance_contract                         Numeric    Unique identifier for the contract
policy_year                                Numeric    Year of study or observation for the insured person
insured_birth_year                         Numeric    Insured's year of birth
exposure                                   Numeric    Exposure duration in years
vehicle_age                                Numeric    Age of the vehicle in years
policy_holder_age                          Numeric    Seniority of the insured at the insurance agency
driver_license_age                         Numeric    Age of the driver's licence
vehicle_brand                              Character  Brand of the vehicle
mileage                                    Numeric    Mileage of the vehicle
vehicle_power                              Numeric    Power value of the vehicle
catalog_value                              Numeric    Catalog value of the vehicle
claim_value                                Numeric    Value of the claim
number_of_liability_claims                 Numeric    Number of liability claims
number_of_bodily_injury_liability_claims   Numeric    Number of bodily injury liability claims
claim_time                                 Numeric    Time of the accident
claim_responsibility_rate                  Numeric    Rate of responsibility for the claim (100% = full responsibility, 0% = no responsibility)
driving_training_label                     Boolean    Boolean indicating driving training
signal                                     Boolean    1 = warning, 0 = no warning

Importation

Code for importing our datasets
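# Load the Belgian MTPL portfolio shipped with CASdatasets
data(beMTPL16)

CLAIMS <- beMTPL16

# Age of the insured in 2016, derived from the year of birth
CLAIMS <- CLAIMS |>
  mutate(insured_age = 2016 - insured_birth_year)

# Per contract and policy year, count the records with exactly one liability claim
CLAIMS <- CLAIMS |>
  group_by(insurance_contract, policy_year) |>
  mutate(ClaimNB = sum(number_of_liability_claims == 1)) |>
  ungroup()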
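
To make the definition of ClaimNB concrete, here is a minimal sketch on a toy table. The values, the assumption of several records per contract and year, and the names toy, toy_counts and ClaimTotal are hypothetical and only serve the illustration. It contrasts the count used in this vignette (records with exactly one liability claim) with a plain sum of claims per group.

```r
library(dplyr)

# Toy data (hypothetical values, not taken from beMTPL16)
toy <- tibble(
  insurance_contract = c("A", "A", "A", "B", "B"),
  policy_year = c(2016, 2016, 2016, 2016, 2016),
  number_of_liability_claims = c(1, 1, 0, 0, 2)
)

toy_counts <- toy |>
  group_by(insurance_contract, policy_year) |>
  mutate(
    # Number of records in the group with exactly one claim (as in the vignette)
    ClaimNB = sum(number_of_liability_claims == 1),
    # Total number of claims in the group (alternative definition, for comparison)
    ClaimTotal = sum(number_of_liability_claims)
  ) |>
  ungroup()

print(toy_counts)
```

The two definitions agree for contract A but differ for contract B, whose single record with two claims is not counted by the `== 1` condition.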

Models

Purpose

In the domain of public liability for automobile accidents, particularly with a focus on elderly drivers, Generalized Additive Models (GAM) are a reliable tool for understanding and predicting accident frequencies, repair costs, and claim patterns.

By employing GAM, insurers can better anticipate future challenges, refine pricing strategies, and enhance their resilience in an ever-evolving risk environment, specifically addressing the unique risks associated with elderly drivers.

In this analysis, we explore the relationship between the response variable ClaimNB and the explanatory variables insured_age and vehicle_age. This modeling framework follows Agresti (2013), who emphasizes the importance of considering multiple explanatory factors in regression analysis.

To model the frequency of insurance claims, we employ a Generalized Additive Model (GAM) approach for the response variable ClaimNB, which represents the count of insurance claims and is assumed to follow a Quasi-Poisson distribution:

$$\text{ClaimNB}_i \sim \text{QuasiPoisson}(\lambda_i),$$

where $\lambda_i$ is the mean rate of claims for observation $i$. The GAM approach allows for flexible, nonlinear relationships between $\lambda_i$ and the predictor variables through the use of smooth functions. Specifically, we express the natural logarithm of $\lambda_i$ as a combination of these smooth functions and an additional term accounting for exposure:

\begin{equation} \log(\lambda_i) = \beta_0 + f_1(\text{insured age}_i) + f_2(\text{vehicle age}_i) + \log(\text{exposure}_i), \end{equation}

where $f_1(\text{insured age}_i)$ and $f_2(\text{vehicle age}_i)$ are smooth functions of the predictor variables.

In this model, insured_age represents the age of the insured individual, vehicle_age denotes the age of the vehicle, and $\log(\text{exposure}_i)$ enters as an offset that adjusts for the exposure duration. The intercept $\beta_0$ and the smooth functions $f_1$ and $f_2$ are estimated from the data to quantify their impact on the expected rate of claims. Because $f_1$ and $f_2$ are not restricted to straight lines, the model can capture complex, nonlinear relationships between the predictors and the response variable.
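
The role of the exposure term becomes clearer once the log-linear predictor is exponentiated. The step below is plain algebra on the model equation, with exposure indexed by observation $i$, and adds no modelling assumption:

```latex
\begin{equation}
\lambda_i = \text{exposure}_i \times \exp\!\big(\beta_0 + f_1(\text{insured age}_i) + f_2(\text{vehicle age}_i)\big)
\end{equation}
```

The expected number of claims is therefore proportional to exposure: a contract observed for half a year has half the expected count of an otherwise identical contract observed for a full year.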

The estimated λ parameter, taken as the empirical mean of ClaimNB, is 0.37.

set.seed(1234) 

theoretic_count <- rpois(nrow(CLAIMS), mean(CLAIMS$ClaimNB))

tc_df <- tibble(theoretic_count)

freq_theoretic <- prop.table(table(tc_df$theoretic_count))

freq_claim <- prop.table(table(CLAIMS$ClaimNB))

freq_theoretic_df <- tibble(
  Count = as.numeric(names(freq_theoretic)),
  Frequency = as.numeric(freq_theoretic),
  Source = "Theoretical Count"
)

freq_claim_df <- tibble(
  Count = as.numeric(names(freq_claim)),
  Frequency = as.numeric(freq_claim),
  Source = "Empirical Count"
)

freq_combined <- freq_theoretic_df |> 
  rbind(freq_claim_df)
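
The simulated theoretical counts above depend on the random seed. As a cross-check, the exact Poisson probabilities can be computed directly with dpois(). The sketch below assumes λ = 0.37, the rounded empirical mean reported in the text; lambda_hat, counts, exact_freq and tail_mass are illustrative names that do not appear elsewhere in the vignette.

```r
# Assumed value: the rounded empirical mean of ClaimNB reported in the text
lambda_hat <- 0.37

# Claim counts for which the exact probabilities are evaluated
counts <- 0:4

exact_freq <- data.frame(
  Count = counts,
  Frequency = dpois(counts, lambda = lambda_hat),
  Source = "Exact Poisson"
)

# Probability mass beyond the largest count listed above
tail_mass <- ppois(max(counts), lambda = lambda_hat, lower.tail = FALSE)

print(exact_freq)
```

With a large portfolio, the simulated frequencies should be close to these exact values, so a visible gap would point to a problem in the simulation rather than in the Poisson assumption.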

The theoretical and empirical histograms associated with a Poisson distribution are shown in Figure 1.

Code for the following graph
ggplot(freq_combined, aes(x = Count, y = Frequency, fill = Source)) +
  # Side-by-side bars for the empirical and theoretical frequencies
  geom_bar(stat = "identity", position = "dodge2", width = 0.3) +
  # NULL drops the legend title
  scale_fill_manual(
    NULL,
    values = c("Empirical Count" = "black", "Theoretical Count" = "#1E88E5")
  ) +
  labs(x = "Claim Number", y = NULL) +
  # global_theme() is a complete theme, so it also sets the legend position (bottom)
  global_theme()
Figure 1: Theoretical and empirical relative frequencies of claim counts

reg <- gam(
  ClaimNB ~ -1 + s(insured_age) + s(vehicle_age) + offset(log(exposure)),
  family = quasipoisson,
  data = CLAIMS
)

summary(reg)

Family: quasipoisson
Link function: log

Formula:
ClaimNB ~ -1 + s(insured_age) + s(vehicle_age) + offset(log(exposure))

Approximate significance of smooth terms:
                edf Ref.df   F p-value
s(insured_age) 4.80   5.78 105  <2e-16 ***
s(vehicle_age) 3.94   4.71 468  <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

R-sq.(adj) =  -0.0915   Deviance explained =   -5%
GCV = 0.79694  Scale est. = 0.75161   n = 70791

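This generalized additive model (GAM) predicts the number of claims from insured_age and vehicle_age. Both smooth terms are statistically significant (p < 2e-16), which indicates that insured_age and vehicle_age are both associated with the number of claims. Note, however, that the fitted formula contains -1, so the intercept β0 is not estimated, and that the negative adjusted R-squared (-0.0915) and deviance explained (-5%) point to a poor overall fit. The results should therefore be interpreted with caution.

Smooth terms do not have a single signed coefficient; the summary reports their effective degrees of freedom (edf) instead. The edf of 4.80 for s(insured_age) and 3.94 for s(vehicle_age) indicate that both effects are nonlinear on the log scale, so the direction of each effect at a given age has to be read from the partial-effect plots rather than from the summary table.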
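
One way to inspect the shape of a smooth effect numerically is predict() with type = "terms", which returns each term's contribution on the log scale. The sketch below is only an illustration on simulated data, not on beMTPL16: the data frame sim, the assumed true log-rate and the evaluation grid are all hypothetical. It also keeps the default intercept, in line with β0 in the model equation.

```r
library(mgcv)

set.seed(1234)

# Simulated portfolio (hypothetical data, not beMTPL16)
n <- 5000
sim <- data.frame(
  insured_age = runif(n, 18, 90),
  vehicle_age = runif(n, 0, 20),
  exposure = runif(n, 0.1, 1)
)

# Assumed true log-rate: U-shaped in insured age, decreasing in vehicle age
log_rate <- -1.5 + 0.0008 * (sim$insured_age - 50)^2 - 0.03 * sim$vehicle_age
sim$ClaimNB <- rpois(n, sim$exposure * exp(log_rate))

# Same model structure as in the vignette, with the intercept kept
fit <- gam(
  ClaimNB ~ s(insured_age) + s(vehicle_age) + offset(log(exposure)),
  family = quasipoisson,
  data = sim
)

# Evaluate the smooth contributions on a grid of insured ages
grid <- data.frame(
  insured_age = seq(20, 90, by = 10),
  vehicle_age = 5,
  exposure = 1
)
effects <- predict(fit, newdata = grid, type = "terms")

# Contribution of s(insured_age) to the log expected count at each grid age
insured_age_effect <- effects[, "s(insured_age)"]
print(round(insured_age_effect, 2))
```

Values that rise along the grid mean a higher expected claim count at those ages, all else equal; plot(fit, select = 1) displays the same curve with confidence bands.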

Graphs

Code to create the following graph
plot(reg, select = 2)
Figure 4: Estimated effects of vehicle age

References

Agresti, Alan. 2013. Categorical Data Analysis, 3rd Edition.
Charpentier, Arthur. 2014. Computational Actuarial Science with R. The R Series. Chapman & Hall/CRC. https://www.routledge.com/Computational-Actuarial-Science-with-R/Charpentier/p/book/9781138033788.

See also

For similar claim frequency datasets with a Poisson-like distribution, see freMTPL (French automobile dataset, import with data("freMTPLfreq")), norauto (Norwegian automobile dataset, import with data("norauto")), ausprivauto0405 (Australian automobile dataset, import with data("ausprivauto0405")), or pg17trainpol (import with data("pg17trainpol")).