What is Target Leakage in Machine Learning and How Do I Avoid It?
We should all be familiar with the data science best practices that keep models from “overfitting” to the data, such as validating with a train-test split or k-fold cross-validation (CV). However, AI practitioners should also pay close attention to the content of their training data, or they risk introducing target leakage (sometimes also called data leakage). In this blog post, I’ll explain what target leakage is, show you an example, and then provide a few suggestions for avoiding it.
Let’s say, for example, you want to build a predictive model that will be used to approve/deny a customer’s application for a loan. After the raw data collection and aggregation work, you obtain a dataset that is data rich, meaning it contains various data types such as numerical, categorical, text, etc., and the dataset is also wide (i.e., contains many columns). Using this dataset, you end up building many remarkably accurate models.
Great! Give yourself a pat on the back for those superb accuracy scores! But the scores look suspiciously good: the training and CV scores are all extremely strong, and the models appear to make very few errors. In the DataRobot platform, this is evident from the Leaderboard validation scores where, for the LogLoss metric, a lower score means a better model. In Figure 1, you see examples of very low validation/CV/holdout scores alongside quite nonsensical predictions. The surprisingly low scores and unusual predictions should raise red flags.
What could have happened? When you look back at this example’s training dataset, “Loan applications [5-year-contracts] (Year: 2015),” you discover a column called “loan_status.” You ask yourself, “Would a new candidate’s loan application have ‘loan_status’ in the input form?” The answer? No, because the loan hasn’t even been approved/denied yet! The “loan_status” feature is a proxy for the target, so including it leaks the target into the built models.
Basically, target leakage happens when you train your model on a dataset that includes information that would NOT be available at the time of prediction, i.e., when you apply the trained model in the field. As a consequence, for certain types of target leakage, the bad model behavior shows up only once the model is deployed in production. Because the leaky information is present in every partition of the data, this kind of leakage is difficult to detect during model development, even if we use k-fold CV and related validation strategies.
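One way to make this rule concrete is to record, for each column, whether its value is known when the application is submitted, and to train only on those columns. The snippet below is a minimal sketch of that idea, not a feature of any particular tool: apart from `loan_status`, the column names are invented, and the `KNOWN_AT_APPLICATION` mapping and `split_features` helper are hypothetical. In practice the mapping has to come from someone who knows the loan workflow.

```python
# Hypothetical mapping: is each column known when the application is submitted?
# Only `loan_status` comes from the example above; the other names are invented.
KNOWN_AT_APPLICATION = {
    "annual_income": True,
    "employment_length": True,
    "loan_amount": True,
    "loan_status": False,  # only exists after the loan has been decided
}


def split_features(columns, known_at_prediction):
    """Return (usable, leaky) column lists.

    Columns missing from the mapping are treated as leaky so that
    nothing unreviewed slips into training.
    """
    usable, leaky = [], []
    for name in columns:
        if known_at_prediction.get(name, False):
            usable.append(name)
        else:
            leaky.append(name)
    return usable, leaky


columns = ["annual_income", "employment_length", "loan_amount", "loan_status"]
usable, leaky = split_features(columns, KNOWN_AT_APPLICATION)
print("train on:", usable)
print("exclude: ", leaky)
```

Treating unmapped columns as leaky is a deliberate choice in this sketch: a newly added column has to be reviewed before it can reach the model.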
Let’s say we have a loan applicant named CustomerX. The loan outcome for CustomerX wouldn’t be known until after the prediction point, i.e., until after the loan is approved and provided.
Although you may have outcome data from previous loans, training on all available information (including those outcomes) can produce models that look overly optimistic during development and prove unreliable when applied to data in the real world.
By now you can imagine what target leakage looks like for temporal (aka time series) data. Put simply, if you include data “from the future” in your training dataset, your model is cheating: it already knows the answer “from the future” when it makes predictions “in the future.” That is almost like building a time machine, which is impossible.
To build accurate AND robust predictive models, what are some ways to avoid target leakage?
Leakage is not confined to one stage; it can take many forms and be introduced at different stages of building an AI system. One type of leakage can be inadvertently caused by the chosen validation strategy. To guard against this type of leakage, take extra care when partitioning your data, for example by setting aside a holdout portion that is used for final testing only. (To learn more about leakage that appears in testing and validation areas, see Yuriy Guts’s presentation.)
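For time-ordered data such as loan applications, one simple partitioning scheme is an out-of-time split: train on the oldest records, validate on later ones, and keep the most recent slice as a holdout. The sketch below illustrates the idea under stated assumptions; the `out_of_time_split` helper, the `applied_on` field, and the dates are all made up for illustration, and this is not how any specific platform partitions data.

```python
from datetime import date


def out_of_time_split(rows, date_key, train_end, validation_end):
    """Partition rows by time instead of at random.

    train:      date <  train_end
    validation: train_end <= date < validation_end
    holdout:    date >= validation_end (touch only for the final check)
    """
    train, validation, holdout = [], [], []
    for row in rows:
        when = row[date_key]
        if when < train_end:
            train.append(row)
        elif when < validation_end:
            validation.append(row)
        else:
            holdout.append(row)
    return train, validation, holdout


# Toy records; the field names and dates are made up for illustration.
applications = [
    {"id": 1, "applied_on": date(2015, 2, 10)},
    {"id": 2, "applied_on": date(2015, 5, 3)},
    {"id": 3, "applied_on": date(2015, 8, 21)},
    {"id": 4, "applied_on": date(2015, 11, 30)},
]

train, validation, holdout = out_of_time_split(
    applications,
    date_key="applied_on",
    train_end=date(2015, 7, 1),
    validation_end=date(2015, 10, 1),
)

# Every training row predates every validation row, which predates the holdout.
print([r["id"] for r in train], [r["id"] for r in validation], [r["id"] for r in holdout])
```

Because no training row is newer than any validation or holdout row, the model is never scored on data older than what it learned from.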
Leakage detection is tricky because it’s more about the workflow: how and when the predictions of a model will be used. As illustrated in the above loan example, leakage caused by certain “leaky” features is a difficult challenge. Determining whether one or more features are at risk of causing target leakage usually requires subject matter expertise and domain knowledge. Here are some suggestions to keep in mind when determining potential target leakage:
- If a trained model is extremely accurate, you should be suspicious of the results; as the saying goes, if it seems too good to be true, it probably is.
- Inspect your training dataset and perform exploratory data analysis (EDA) to determine whether any features are highly correlated with the target (or may completely, or almost completely, predict the target).
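The second suggestion can be roughed out as a quick screening pass: for each categorical feature, measure how well that feature alone predicts the target, and flag anything that comes close to perfect. This is only a sketch under stated assumptions. The helper names, the 0.95 threshold, the `target` and `home` columns, and every data value are invented, and it is not the detection method DataRobot uses.

```python
from collections import Counter, defaultdict


def single_feature_accuracy(rows, feature, target):
    """Accuracy of predicting `target` from `feature` alone.

    For each feature value, predict the most common target seen with it.
    Measured in-sample, so it is only a rough screening score.
    """
    by_value = defaultdict(Counter)
    for row in rows:
        by_value[row[feature]][row[target]] += 1
    correct = sum(counts.most_common(1)[0][1] for counts in by_value.values())
    return correct / len(rows)


def flag_suspicious(rows, features, target, threshold=0.95):
    """Return features that almost perfectly determine the target on their own."""
    return [f for f in features if single_feature_accuracy(rows, f, target) >= threshold]


# Toy data; every value here is invented for illustration.
rows = [
    {"home": "rent", "loan_status": "defaulted", "target": 1},
    {"home": "own", "loan_status": "repaid", "target": 0},
    {"home": "rent", "loan_status": "repaid", "target": 0},
    {"home": "own", "loan_status": "defaulted", "target": 1},
    {"home": "rent", "loan_status": "repaid", "target": 0},
    {"home": "own", "loan_status": "repaid", "target": 0},
]

print(flag_suspicious(rows, ["home", "loan_status"], "target"))
```

A flag is a prompt for review rather than a verdict: a unique identifier column would also score near 1.0 here, and a legitimately strong predictor can cross the threshold too, which is why the final call still needs domain knowledge.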
The DataRobot platform has built-in target leakage detection that runs during the EDA stage after target selection. Since target leakage can lead to serious problems for models put in production, the GUI shows several visual indicators that prompt you to take action (Figure 4). For a feature that is flagged as “high risk of target leakage,” DataRobot automatically removes the “leaky” feature during the Autopilot model building stage.
As you can see in Figure 5, the validation scores and predictions for the resulting built models look much better now!