What is Target Leakage in Machine Learning and How Do I Avoid It?
This post was originally part of the DataRobot Community. Visit now to browse discussions and ask questions about DataRobot, AI Platform, data science, and more.
We should all be familiar with data science best practices that prevent models from “overfitting” to the data, such as validating with a train-test split or k-fold cross validation (CV). However, AI practitioners should also pay close attention to the content of their training data, or else they risk causing target leakage (sometimes also known as data leakage). In this blog post, I’ll explain what target leakage is, show you an example, and then provide a few suggestions to avoid potential leakage.
Let’s say, for example, you want to build a predictive model that will be used to approve/deny a customer’s application for a loan. After the raw data collection and aggregation work, you obtain a dataset that is data rich, meaning it contains various data types such as numerical, categorical, text, etc., and the dataset is also wide (i.e., contains many columns). Using this dataset, you end up building many remarkably accurate models.
Great! Give yourself a pat on the back for the superb accuracy scores of your models! But the scores appear suspiciously accurate: the training and CV scores all look extremely good, and the number of model errors appears to be very low. In the DataRobot platform, this is evident from the Leaderboard validation scores where, for the LogLoss metric, the lower the score, the better the model. In Figure 1, you see examples of very low validation/CV/holdout scores and quite nonsensical predictions. The surprisingly low scores and unusual predictions should raise red flags.
What could have happened? When you look back at this example’s training dataset, “Loan applications [5-year-contracts] (Year: 2015),” you discover a column called “loan_status.” You ask yourself, “Would a new candidate’s loan application have ‘loan_status’ in the input form?” The answer? No, because the loan hasn’t even been approved or denied yet! The “loan_status” feature is a proxy for the target, and so it causes target leakage in the built models.
Basically, target leakage can happen when you train your model on a dataset that includes information that would NOT be available at the time of prediction, i.e., when you apply the trained model in the field. The consequence of this is that for certain types of target leakage, bad model behavior will only be spotted when the model is deployed in production; this means it’s difficult to detect leakage during model development, even if we use k-fold CV and related validation strategies.
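The “available at the time of prediction” rule can be written down as a simple check. Here is a minimal sketch, not DataRobot’s method: apart from “loan_status,” the column names, the target name `is_bad`, and the `AVAILABLE_AT_PREDICTION` set are hypothetical, and deciding what belongs in that set is exactly the domain-knowledge step that code cannot do for you.

```python
# Hypothetical set of columns known when a loan application is submitted.
# Which columns belong here is a domain decision, not something code can infer.
AVAILABLE_AT_PREDICTION = {"annual_income", "loan_amount", "employment_length"}


def split_features_by_availability(columns, available, target):
    """Return (usable, suspect) feature lists for the given column names.

    A column is 'suspect' if it is neither the target nor known to exist
    at prediction time, so it may leak information about the outcome.
    """
    usable, suspect = [], []
    for name in columns:
        if name == target:
            continue  # the target itself is never a feature
        (usable if name in available else suspect).append(name)
    return usable, suspect


columns = ["annual_income", "loan_amount", "employment_length",
           "loan_status", "is_bad"]
usable, suspect = split_features_by_availability(
    columns, AVAILABLE_AT_PREDICTION, target="is_bad"
)
print("usable:", usable)
print("suspect:", suspect)
```

In this toy setup, “loan_status” is the only column that lands in the suspect list, which is the cue to ask whether it could really be known when the model is applied in the field.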
Let’s say we have a loan applicant named CustomerX. The loan outcome for CustomerX wouldn’t be known until after the prediction point, i.e., until after the loan is approved and provided.
You may well have outcome data from previous loans. However, if you include all of that information when training your models (including the outcome information from previous loans), you could build models that look overly optimistic during development and prove unreliable when applied to data in the real world.
By now you can imagine what target leakage looks like for temporal (aka time series) data. To put it simply, if you include data “from the future” in your training dataset, then your model would be cheating by already knowing the answer “from the future” when making predictions “in the future”; this is almost like building a time machine, which is impossible. In order to build accurate AND robust predictive models, what are some ways to avoid target leakage?
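For temporal data, one way to keep “the future” out of training is to split by time rather than at random. Here is a minimal sketch under stated assumptions: the `application_date` field, the sample records, and the cutoff date are all hypothetical and only illustrate the idea.

```python
from datetime import date


def time_based_split(rows, cutoff, date_key="application_date"):
    """Split rows so that training data strictly precedes the cutoff.

    Rows dated on or after the cutoff go to the test set, which mimics
    predicting on data the model could not have seen during training.
    """
    train = [r for r in rows if r[date_key] < cutoff]
    test = [r for r in rows if r[date_key] >= cutoff]
    return train, test


# Hypothetical loan records; field names are illustrative only.
rows = [
    {"application_date": date(2015, 1, 10), "loan_amount": 5000},
    {"application_date": date(2015, 3, 2), "loan_amount": 12000},
    {"application_date": date(2015, 7, 19), "loan_amount": 8000},
    {"application_date": date(2015, 11, 5), "loan_amount": 3000},
]

train, test = time_based_split(rows, cutoff=date(2015, 7, 1))

# Every training row must be older than every test row.
assert max(r["application_date"] for r in train) < min(
    r["application_date"] for r in test
)
print(len(train), "training rows,", len(test), "test rows")
```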
Leakage is not confined to a single stage; it can take many forms and be introduced at different stages of building an AI system. One type of leakage can be inadvertently caused by the chosen validation strategy. To resolve this type of leakage, you should take extra care when partitioning your data, such as setting aside a holdout portion for final testing purposes only. (To learn more about leakage that appears in testing and validation areas, see Yuriy Guts’s presentation.)
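Setting aside a holdout can be as simple as the following sketch. The function name `partition_with_holdout`, the 20% fraction, and the fixed seed are my own illustrative choices, not a DataRobot default.

```python
import random


def partition_with_holdout(rows, holdout_fraction=0.2, seed=0):
    """Shuffle once, then carve off a holdout set that is never used for
    training, tuning, or feature selection; only for the final check."""
    if not 0 < holdout_fraction < 1:
        raise ValueError("holdout_fraction must be between 0 and 1")
    shuffled = list(rows)  # copy so the caller's data is left untouched
    random.Random(seed).shuffle(shuffled)
    n_holdout = max(1, round(len(shuffled) * holdout_fraction))
    return shuffled[n_holdout:], shuffled[:n_holdout]


rows = list(range(100))  # stand-in for 100 training records
development, holdout = partition_with_holdout(rows)
print(len(development), "development rows,", len(holdout), "holdout rows")
```

All training and cross validation would then happen on `development` only. Note that a random shuffle like this is not suitable for temporal data, where the partition should follow time order instead.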
Leakage detection is tricky because it’s more about the workflow: how and when the predictions of a model will be used. As illustrated in the above loan example, leakage caused by certain “leaky” features is a difficult challenge. Determining whether one or more features are at risk of causing target leakage usually requires subject matter expertise and domain knowledge. Here are some suggestions to keep in mind when determining potential target leakage:
- If a trained model is extremely accurate, you should be suspicious of the results; as the saying goes, if it seems too good to be true, then it probably is.
- Inspect your training dataset and perform exploratory data analysis (EDA) to determine whether any features are highly correlated with the target (or may perfectly, or near-perfectly, predict the target).
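The EDA suggestion above can be sketched as a quick correlation screen. This is a minimal illustration and not how DataRobot detects leakage: the toy data, the `loan_status_encoded` column, and the 0.95 threshold are assumptions made up for the example, and Pearson correlation only captures linear relationships between numeric columns.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no linear signal
    return cov / math.sqrt(var_x * var_y)


def flag_suspicious_features(features, target, threshold=0.95):
    """Return names of features whose |correlation| with the target
    meets or exceeds the threshold. The threshold is an arbitrary choice."""
    return [
        name for name, values in features.items()
        if abs(pearson(values, target)) >= threshold
    ]


# Toy data: 1 = loan went bad, 0 = loan repaid.
target = [0, 0, 1, 0, 1, 1, 0, 1]
features = {
    "annual_income": [52, 61, 48, 75, 50, 66, 58, 45],  # in thousands
    # Numeric encoding of a leaky column: 1 = charged off, 0 = fully paid.
    "loan_status_encoded": [0, 0, 1, 0, 1, 1, 0, 1],
}
print(flag_suspicious_features(features, target))
```

A flagged feature is a prompt for a closer look rather than proof of leakage; whether it would truly be unavailable at prediction time is still a question for someone who knows the domain.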
The DataRobot platform already has built-in target leakage detection which runs during the EDA stage after target selection. Since target leakage can lead to serious problems for models put in production, the GUI shows many visual indicators which prompt you to take action (Figure 4). For a feature that is flagged as “high risk of target leakage,” DataRobot automatically removes the “leaky” feature during the Autopilot model building stage.
As you can see in Figure 5, the validation scores and predictions for the resulting built models look much better now!
Software Engineer II, DataRobot
Alex Shoop is a Software Engineer II at DataRobot and a member of the MLDev team. He joined DataRobot right out of grad school, after finishing a BS in Actuarial Mathematics and Computer Science, an MS in Financial Mathematics, and an MS in Data Science. Shoop speaks Japanese and French fluently and has a passion for video production.