AI Use Cases for Insurance: Part II

August 5, 2021

This post was originally part of the DataRobot Community. Visit now to browse discussions and ask questions about DataRobot, AI Platform, data science, and more.

(Make sure to check out Part I in this Insurance Series to understand the use case we’re addressing.)

Before building any models, you will need to ensure your data is set up properly. Let’s take a look at how to structure our dataset so that it’s ready for predictive modeling.

Policy Lifecycle

To start, we’ll examine an insurance policy from a timeline perspective (Figure 1). When a customer purchases an insurance policy, it’s designed to cover their losses over a defined period of time; 1-year and 6-month terms are both standard.

The first time an individual purchases a policy with an insurance company is called the policy inception.

At the end of each policy term (1 year, in our use case), the customer decides whether or not they’ll continue doing business with their insurance company through a policy renewal. If the customer decides to purchase a subsequent policy, the insurance company will issue a new price to the customer at the renewal date.

Figure 1. Term

But out in the real world, changes sometimes happen. The policyholder may decide they wish to cancel their policy early (Figure 2). This could be for an assortment of reasons; for example, they might no longer need coverage or maybe another insurance company is offering a more attractive price. This type of event is referred to as a mid-term cancellation.

We’ll want to ensure our dataset can properly reflect mid-term cancellations as well as renewals.

Figure 2. Cancellation

The final piece to this puzzle—and the main reason anyone purchases insurance—is financial protection from losses (Figure 3). During all policy periods, a customer may experience any number of losses or accidents.

Losses are what we’re most interested in having our pricing model predict.

Figure 3. Losses

In this example, the policyholder had two losses in the first policy term, for $500 and $2,000, none in the second policy term, and then one large loss of $3,000 in the third and final policy term.

Row Identifiers

The next step is figuring out how we can take all this information and present it in a tabular data format appropriately set up for modeling (Figure 4).

We need our data to match the format in which it will be used at a real-world insurance company for calculating an insurance premium. The insurance company would run our model whenever it issues a new policy or renews an existing one.

This means we’ll need three rows in our dataset to represent the policy timeline shown above, in Figure 3. These three rows equate to the single policy inception and then each of the subsequent policy renewals.

Each row has the same policy number but different policy effective dates. We can refer to these columns as our row identifiers.

Figure 4. Setup
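
As a rough illustration of the row identifiers, the sketch below builds one row per policy term. The policy number, the inception date, and the function name `build_term_rows` are assumptions made for this example; they are not taken from the figures.

```python
from datetime import date

# Hypothetical policy number and inception date, chosen only for illustration.
POLICY_NUMBER = "P-1001"
INCEPTION = date(2015, 1, 1)
NUM_TERMS = 3  # one policy inception plus two renewals


def build_term_rows(policy_number, inception, num_terms):
    """Return one row per 1-year policy term, keyed by policy number and effective date."""
    rows = []
    for term in range(num_terms):
        # Each term starts on the anniversary of the inception date.
        # Note: this simple approach would fail for a February 29 inception date.
        effective = inception.replace(year=inception.year + term)
        rows.append({"policy_number": policy_number, "effective_date": effective})
    return rows


rows = build_term_rows(POLICY_NUMBER, INCEPTION, NUM_TERMS)
for row in rows:
    print(row["policy_number"], row["effective_date"].isoformat())
```

Together, the policy number and effective date identify each row uniquely, which is what lets later columns (exposure, claim count, losses) be attached to the right term.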

Model Inputs

Next, we’ll want our data to indicate how long the customer was covered by their insurance policy during each policy term (Figure 5). We refer to this as the Earned Exposure. In our example, for the first two policy terms, this policy was covered for the full year, so those first two rows will receive an earned exposure amount of 1.00.

Figure 5. Earned Exposure

For the third policy term, this policyholder cancelled early, so we calculate the fraction of the full year for which they were insured; in this case, it’s 0.625 (7.5 of 12 months), which equates to the time period between January 1st and August 15th.
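
One way to compute earned exposure is sketched below, assuming a day-count convention and hypothetical years; the function name `earned_exposure` is also an assumption. The 0.625 quoted above is consistent with counting 7.5 of 12 months, so an exact day count for January 1st to August 15th comes out slightly lower (about 0.619 in a non-leap year). Which convention applies depends on the insurer.

```python
from datetime import date


def earned_exposure(effective, term_end, cancel_date=None):
    """Fraction of the policy term during which coverage was in force (day-count basis)."""
    # Coverage ends at cancellation if there was one, otherwise at the end of the term.
    coverage_end = term_end if cancel_date is None else min(cancel_date, term_end)
    covered_days = max((coverage_end - effective).days, 0)
    term_days = (term_end - effective).days
    return covered_days / term_days


# Hypothetical dates: a full first term and a third term cancelled on August 15th.
full_term = earned_exposure(date(2015, 1, 1), date(2016, 1, 1))
cancelled = earned_exposure(date(2017, 1, 1), date(2018, 1, 1), cancel_date=date(2017, 8, 15))

print(full_term)            # 1.0
print(round(cancelled, 3))  # 0.619
```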

We’ll also need a column indicating how many losses the policyholder experienced during each term (Figure 6). We refer to this as the claim count. There were 2 losses in the first term, none in the second term, and then 1 loss in the third policy term. We’ll add this column to our dataset as well.

Figure 6. Claim Count

Claim count will capture the frequency component, but we’ll also want to know claim severity (i.e., the financial loss). To capture this information, we’ll sum up the dollar amount for all claims during each policy term; in our example, $2,500 was paid out in the first term, $0 in the second, and $3,000 in the third term.

Figure 7. Losses
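
The claim count and losses columns can both be derived from a list of individual claims. The sketch below is one possible way to do that; the term boundaries and loss dates are made up for illustration, and only the dollar amounts come from the example above.

```python
from datetime import date

# Hypothetical term boundaries (start inclusive, end exclusive).
terms = [
    (date(2015, 1, 1), date(2016, 1, 1)),
    (date(2016, 1, 1), date(2017, 1, 1)),
    (date(2017, 1, 1), date(2018, 1, 1)),
]

# Hypothetical loss dates paired with the loss amounts from the example.
claims = [
    (date(2015, 3, 10), 500.0),
    (date(2015, 9, 2), 2000.0),
    (date(2017, 5, 20), 3000.0),
]


def summarize_claims(terms, claims):
    """Return one dict per term with the claim count and total losses in that term."""
    summary = []
    for start, end in terms:
        # A claim belongs to the term whose [start, end) window contains its loss date.
        amounts = [amount for loss_date, amount in claims if start <= loss_date < end]
        summary.append({
            "effective_date": start,
            "claim_count": len(amounts),
            "losses": sum(amounts),
        })
    return summary


summary = summarize_claims(terms, claims)
for row in summary:
    print(row["effective_date"].isoformat(), row["claim_count"], row["losses"])
```

Terms with no claims still get a row, with a claim count of 0 and losses of 0, so the model sees loss-free periods as well.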

We’ll refer to these three columns—exposure, claim count, and losses—as our model inputs. DataRobot will be looking for us to indicate these columns when setting up our model in the DataRobot platform.

Model Features

Finally, we’ll incorporate model features into our dataset (Figure 8). These features will vary based upon the type of insurance product for which you’re building pricing models. Some classic examples of model features for insurance pricing models include the age of the policyholder, geographic information such as the territory in which they live, prior claim history, and maybe even available discounts.

Figure 8. Modeling Features

There’s a very important component to consider when setting up your model features: you’ll need to ensure the features only contain information which was known prior to the policy effective date. For example, you should not use information discovered in 2016 in order to predict losses on a policy in 2015. Violating this principle is called target leakage.
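
A simple guard against this kind of leakage is to look up each feature as of the policy effective date. The sketch below assumes a hypothetical history of one feature (territory) in which every value carries the date it became known; the names `territory_history` and `value_known_before` are illustrative only.

```python
from datetime import date

# Hypothetical feature history: (date the value became known, value).
territory_history = [
    (date(2014, 6, 1), "North"),
    (date(2016, 3, 15), "South"),
]


def value_known_before(history, effective_date):
    """Return the latest value recorded strictly before the effective date, or None."""
    known = [(known_on, value) for known_on, value in history if known_on < effective_date]
    if not known:
        return None
    # Pick the most recent value among those already known at the effective date.
    return max(known, key=lambda item: item[0])[1]


# The 2015 term only sees information known before 2015-01-01.
print(value_known_before(territory_history, date(2015, 1, 1)))  # North
# The 2017 term may use the value learned in 2016.
print(value_known_before(territory_history, date(2017, 1, 1)))  # South
```

Here the value learned in 2016 is never attached to the 2015 row, which matches the rule described above.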

By now, if you’ve been following this series, you have a very good understanding of the use case and insurance business problem (from Part I), and the data we’ll want for building predictive models (from this article). When ready, check out Part III in this series.

About the author
Linda Haviland

Community Manager
