French Motor Third-Part Liability datasets used for 100 percent Data Science game
pricingame.Rdpg15training, pg15pricing are the two datasets used for the 2015 pricing game of the
French institute of Actuaries organized on November 5, 2015.
pg15training contains 100,000 TPL policies for private motor insurance used
to fit the models, whereas pg15pricing contains 36,311 policies of the same
guarantee for which the premium is computed.
Each record has been observed at most one year and contains risk features of the
policyholder and the insured vehicle.
For confidentiality reasons, most categorical levels have unknown meaning.
pg16trainpol, pg16trainclaim, pg16test
are the three datasets used for the
2016 pricing game of the French institute of Actuaries organized on November 8, 2016.
pg16trainpol contains 87,228 policies for private motor insurance
and pg16trainclaim contains 4,568 claims of those 87,228 TPL policies.
Policies are guaranteed for all kinds of material damages, but not bodily injuries.
Both datasets are used to fit the models, whereas pg16test is used for
training.
For confidentiality reasons, most categorical levels have unknown meaning.
pg17trainpol, pg17trainclaim are the two training datasets used for the
2017 pricing game of the French institute of Actuaries organized on November 16, 2017.
pg17trainpol contains 100,000 policies for private motor insurance
and pg17trainclaim contains 14,243 claims of those 100,00 TPL policies.
These training sets correspond to year \(t=0\).
pg17testyear1, pg17testyear2, pg17testyear3,
pg17testyear4 are the
four test datasets used for the pricing game: each has 100,000 rows of new policies
(drivers willing to purchase insurance for Year \(t\) with \(t=1,2,3,4\)).
Format
pg15training and pg15pricing are two dataframes with the same columns:
PolNumThe policy number.
CalYearThe underwriting year.
GenderThe gender of the car driver.
TypeThe car type (a single letter).
CategoryThe car category (a string character).
OccupationThe occupation of the driver (a string character).
AgeThe driver age, in years (in France, people can drive a car at 18).
Group1The group of the car.
BonusThe bonus-malus (French no-claim discount):
-30means a 30 percent bonus while+20means a 20 percent malus; see details below.PoldurThe policy age (in year).
ValueThe car value (in euro).
AdindA dummy variable indicating a material cover.
SubGroup2The subregion of the driver home (unknown category).
Group2The region of the driver home (unknown category).
DensityThe density of inhabitants (number of inhabitants per km2) in the city the driver of the car lives in.
ExpdaysExposure in days.
NumtppdThe number of third-party material claims.
NumtpbiThe number of third-party bodily injury claims.
IndtppdThe total cost of third-party material claims (euro).
IndtpbiThe total cost of third-party bodily injury claims (euro).
pg16trainpol, pg16trainclaim, pg16test are dataframes with the following columns:
YearThe coverage year.
BeginDate,EndDateBeginning date and ending date of the coverage period (of class
"Date").ExposureThe exposure as a fraction of year, computed as the difference between
EndDateandBeginDatedivided by 365.PolicyIDThe identification number of the policy.
PolicyAgeCategThe category of the policy age.
PolicyCategThe category of the policy.
CompanyCreationA dummy indicating if the company has been created.
FleetMgtThe fleet management category.
FleetSizeCategThe fleet size category
AreaThe geographical area.
PayFreqThe payment frequency.
VehiclAgeThe vehicle age category.
VehiclNbThe number of vehicles
VehiclCategThe vehicle category.
VehiclPowerThe vehicle power
LicNbThe license number of the vehicle.
DeducThe deductible category
SumInsuredThe category of the sum insured.
BusinessTypeThe business type.
ChannelDistThe distribution channel.
ClaimNbThe claim number.
ClaimChargeThe claim charge.
DirectCompAs claims correspond only to material damage, the French claim convention (IDA) was applied. So the insurer may directly refund the insured (when
DirectComp=TRUE) even if the insurer will sue the third-party insurer to recover the indemnity afterwards.CompRateThe rate of compensation (in percent).
SettlYearThe settlement year.
pg17*** are dataframes with the following columns:
id_clientThe client identification number: a string of the form
Annnnnnnn(Afollowed by an 8-digit number). First client ID isA00000001and last isA00091488.id_vehicleThe vehicle identification number: a string of the form
Vnn(aVfollowed by a 2-digit number). First vehicle is always numbered V01. If a client has multiple vehicles, then the numeration increases by 1. There is no particular ordering in the vehicles, so their rank should not represent anything valuable.id_policyThe policy identification number, a string of the form
Annnnnnnn-Vnnresulting from appendingid_clientandid_vehicle.id_yearThe year of coverage, Year ID begins at
"Year 0"and ends at"Year 4".pol_bonusThe policy bonus (French no-claim discount):
0.5means a 30 percent bonus while1.2means a 20 percent malus; see details below.pol_coverageThe coverage category: The coverage are of 4 types : Mini, Median1, Median2 and Maxi, in this order. As you can guess, Mini policies covers only Third Party Liability claims, whereas Maxi policies covers all claims, including Damage, Theft, Windshield Breaking, Assistance, etc.
pol_durationThe policy duration: Policy duration represents how old the policy is. It is expressed in year, accounted from the beginning of the current year i. Oldest policies in this portfolio can last since prehistoric ages of 45 years.
pol_sit_durationThe policy current endorsement duration: Situation duration represent how old the current policy caracteristics are. It can be different from pol duration, because the same insurance policy could have evolved in the past (e.g. by changing coverage, or vehicle, or drivers, ...).
pol_pay_freqThe payment frequency: The price of the insurance coverage can be paid annually, bi-annually, quarterly or monthly.
pol_paydA dummy indicating pay as you drive: a string with Yes or No, which indicates whether our client has subscribed a mileage-based policy or not. In those early ages of Year 0, Pay As You Drive was not that current, so they represent a minority in the portfolio.
pol_usageThe policy usage: it describes what usage the driver makes from his vehicle, most of time. There are 4 possible values :
"WorkPrivate"which is the most common,"Retired"which is presumed to be aimed at retired people (who also are presumed driving less kilometers),"Professional"which denotes a professional usage of the vehicle, and"AllTrips"which is quite similar to Professional (including pro tours). As for the coverage, it would be very surprising that this variable had no effect on frequency.pol_insee_codeThe INSEE code of the French city/municipality where the policyholder lives: it is a 5-digits alphanumeric code used by the French National Institute for Statistics and Economic Studies (hence INSEE) to identify "communes" and departments in France. There are about 36,000 "communes" in France, but not every one of them is present in the dataset (there are only 18,000 of them). The first 2 digits of insee code identifies the department (they are 96, not including overseas departments). The insee code or department code can be used to possibly merge external data to the datasets: population density, OSM data, etc.
drv_drv2A character string indicating if there is a secondary driver: there is always a first driver, which characteristics (age, sex, licence) are provided, but a secondary driver is optional, and is present 1 time out of 3.
drv_age1,drv_age2The driver age of the ith driver: it is expressed in years counted from the beginning of the considered year. Then,
drv_age1increases by 1 every year, like in real world... Legal age to drive is 18, so you shouldn't find any age below that limit. Due to the fact that the database is built on existing situations before Year 0, in fact the minimum age is 19 in Year 0 dataset. On the other side, you'll also find quite old drivers.drv_sex1,drv_sex2The driver sex of the ith driver. European rules force insurers to charge the same price for women and men. But driver's gender can still be used in academic studies, and that's why drv sex1 is still available in the datasets, and can be used as discriminatory variable in this pricing game.
drv_age_lic1,drv_age_lic2The age of the driving license of the ith driver. As for the other ages, it is expressed in integer years from the beginning of the current year.
vh_ageThe vehicle age: This variable is the vehicle's age, the difference between the year of release and the current year.
vh_cylThe engine cylinder displacement is expressed in ml in a continuous scale. This variable should be highly correlated with din power of the vehicle.
vh_dinThe
vh_dinis a representation of the motor power. Highly correlated with din power, cylinder, speed and even value of the vehicle.vh_fuelThe vehicle fuel type: with mainly two values
"Diesel"and"Gasoline". Very few Hybrid vehicles can also be found, but, 6 years ago, the hybrid market was still at its beginning.vh_makeThe vehicle carmaker. As the database is built from a French insurance, the three major brands are Renault, Peugeot and Citroen.
vh_modelThe vehicle model. As a subdivision of the carmake, vehicle is identified by its model name.
vh_sale_begin,vh_sale_endvh_sale_beginandvh_sale_endare the dates (in fact: ages) from the beginning of the current year of the beginning and the end of marketing years of the vehicle. This could for instance identify policies that covers very new vehicles or second-hand ones.vh_speedThe vehicle maximum speed (km/h), as stated by the manufacturer.
vh_typeThe vehicle type, either
"Tourism"or"Commercial". There are more"Commercial"types for"Professional"policy usage than for"WorkPrivate".vh_valueThe vehicle's value (replacement value) is expressed in euros, without inflation so it should be stable from a year to another.
vh_weightThe vehicle weight (kg).
id_claimThe claim identification number: a string of the form
CLnn(CLfollowed by a 2-digit number). Numbering of the claims begins at 1 for every policy and each year. Then, the last value of id claim is the maximum number of claims for a vehicle in a year. Two-digits representation is sufficient : this maximum doesn't exceed 7 (but not on Year 0, where the maximum is 6).claim_nbThe claim number, as we are talking about individual claims, each claim nb has a value of 1.
claim_amountThe claim amount: amounts range from (approx.) -2,000 to +300,000. Yes, there are negative values, they come from claims where our driver's liability is not engaged, so there's a legal recourse.
The bonus/malus system is compulsory in France, but we will only use it here as a possible feature. The coefficient is attached to the driver. It starts at 1 for young drivers (i.e. first year of insurance). Then, every year without claim, the bonus decreases by 5 percent until it reaches its minimum of 0.5. Without any claim, the bonus evolution would then be : 1 \(->\) 0.95 \(->\) 0.9 \(->\) 0.85 \(->\) 0.8 \(->\) 0.76 \(->\) 0.72 \(->\) 0.68 \(->\) 0.64 \(->\) 0.6 \(->\) 0.57 \(->\) 0.54 \(->\) 0.51 \(->\) 0.5. Every time the driver causes a claim (only certain types of claims are taken into account), the coefficient increases by 25 percent, with a maximum of 3.5. Thus, the range of bonus/malus coefficient extends from 0.5 to 3.5 in the datasets.
Source
Datasets from unknown private insurers.
See https://freakonometrics.hypotheses.org/20034 for the first pricing game.
See https://actinfo.hypotheses.org/69 for the second pricing game.
See https://actinfo.hypotheses.org/86 for the third pricing game.
Examples
# (1) load of data
#
data(pg15training)
data(pg15pricing)
data(pg16trainpol)
data(pg16trainclaim)
data(pg16test)
data(pg17trainpol)
data(pg17trainclaim)
data(pg17testyear1)
# (2) some check
# should be zero
sum(!pg16trainclaim$PolicyID %in% pg16trainpol$PolicyID)
#> [1] 0
# should be true
NROW(pg16trainclaim) == sum(pg16trainpol$ClaimNb)
#> [1] TRUE