Data Science Fails: If It Looks Too Good To Be True
You’ve probably seen amazing AI news headlines such as: AI can predict earthquakes. Using just a single heartbeat, an AI achieved 100% accuracy predicting congestive heart failure. A new marketing model is promising to increase the response rate tenfold. It all seems too good to be true. But as the modern proverb says, “If it seems too good to be true, it probably is.”
Target leakage, sometimes called data leakage, is one of the most challenging problems when developing the machine learning models that power modern AI. It happens when you train your algorithm on a dataset that includes information that will not be available at prediction time, when you apply the model to data you collect in the future. Because the training data effectively contains the actual outcomes, the model’s results will be unrealistically accurate on that data, like bringing an answer sheet into an exam.
Several years ago, I gained firsthand experience of target leakage. I was helping a bank cross-sell insurance policies to their customers, and we were brainstorming the types of input features that might predict which customers would be interested in the product and likely to respond to a targeted marketing campaign. My idea was to add the number of banking and insurance products the customer had already purchased before each marketing campaign commenced, as that would be a good proxy for the brand value to that customer.
My team didn’t have direct access to the data (after all, you should limit the number of people who have access to people’s banking data). So, we wrote specifications for the data extract, and IT supplied the data. A talented data scientist in my team created a model that predicted a tenfold increase in campaign response rate versus previous campaigns but was suspicious that much of the predictive power came from the product count. After IT reassured us that they had used the correct definition, we proceeded with the marketing campaign. It was a dismal failure. We achieved a response rate that was lower than in the past!
We held a post-mortem to find the problem and correct it. After many emails and investigations, we discovered that IT had taken a shortcut. They decided it was too difficult to get the historical product count at the required date before the campaign commenced and instead had provided the current product count. This choice created target leakage because the current product count included the results of the historical marketing campaign. If the current product count equaled one, the customer could not have bought anything new, so we knew for sure that the customer had not responded to the campaign.
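The difference between the two definitions of product count can be sketched in a few lines of Python. This is only an illustration: the purchase records, customer ids, campaign date, and function names below are all hypothetical and not taken from the bank's systems.

```python
from datetime import date

# Hypothetical purchase history: (customer_id, product, purchase_date).
purchases = [
    ("c1", "savings", date(2018, 1, 10)),
    ("c1", "home_insurance", date(2019, 6, 20)),  # bought in response to the campaign
    ("c2", "savings", date(2018, 3, 5)),
]

campaign_start = date(2019, 6, 1)  # assumed campaign start date


def product_count_as_of(customer_id, cutoff, history):
    """Point-in-time feature: products bought strictly before the cutoff date."""
    return sum(1 for cid, _, bought in history if cid == customer_id and bought < cutoff)


def product_count_current(customer_id, history):
    """Leaky feature: counts every purchase, including campaign responses."""
    return sum(1 for cid, _, _ in history if cid == customer_id)


for cid in ("c1", "c2"):
    as_of = product_count_as_of(cid, campaign_start, purchases)
    current = product_count_current(cid, purchases)
    print(cid, "as of campaign start:", as_of, "current:", current)
```

In this toy data, the two customers look identical as of the campaign start, but the current count separates the responder from the non-responder, which is exactly the information the model should not have.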
Case Study: Predicting Earthquakes
In the first blog in this series, I discussed a Nature article describing a deep learning model that predicted earthquakes with surprising accuracy and then showed how the choice of algorithm was flawed. This blog focuses on the surprising accuracy achieved by the model. Predicting earthquakes is difficult. So it was surprising that a naïve model, knowing nothing about physics and geology, was able to outperform established, specialized earthquake models. It seemed too good to be true.
One of my colleagues, Rajiv Shah, became suspicious and decided to investigate further and replicate the research results. The data was publicly available, and the authors had described the methodology with sufficient detail to replicate the results. One conspicuous warning sign was that the accuracy on the testing dataset was significantly higher than on the training set. That isn’t normal. Usually, a machine learning model has slightly higher accuracy on the data it learned from versus independent test data because it memorizes some of the answers for the examples it has seen before but may get slightly confused by new cases.
Rajiv found what has been described as a “glaring schoolboy error.” There was some overlap in the data used to both train and test the model. Its accuracy was artificially high. After correcting the analysis, using best practice data partitioning, Rajiv found only weak predictive power.
“Data leakage is bad, because the goal of a predictive model is to generalize to new examples,” Shah explained to The Register. “For a model to generalize, it should be tested on data that resembles the ‘actual world’.”
“Typically this is done with a random sample of your data – the test set – which you never expose to the model,” he added. “This ensures your model has not learned from this data and provides a strong measure to ascertain generalizability. With data leakage, the test set is not really independent and any metrics, therefore, cannot generalize to performance in the ‘actual world’.”
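One simple sanity check for this kind of problem is to look for rows that appear in both partitions. The sketch below is a minimal illustration with made-up rows and a hypothetical helper name. It only catches exact duplicates, which is the crudest form of overlap, and it is not a reconstruction of the analysis Rajiv performed.

```python
def overlapping_rows(train_rows, test_rows):
    """Return the set of rows that appear in both partitions."""
    return set(map(tuple, train_rows)) & set(map(tuple, test_rows))


# Made-up rows of (feature_1, feature_2, label) for illustration only.
train_rows = [(1.0, 2.0, 0), (3.0, 4.0, 1), (5.0, 6.0, 0)]
test_rows = [(3.0, 4.0, 1), (7.0, 8.0, 1)]

shared = overlapping_rows(train_rows, test_rows)
print(f"{len(shared)} of {len(test_rows)} test rows also appear in the training data")
```

A non-empty result means the test set is not independent of the training set, so any accuracy measured on it is likely to be optimistic.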
Case Study: Congestive Heart Failure
A new research paper “A convolutional neural network approach to detect congestive heart failure”, scheduled to be published in Biomedical Signal Processing and Control in January 2020, boasts of achieving “100% CHF detection accuracy” using a new deep learning model that predicts congestive heart failure “on the basis of one raw electrocardiogram (ECG) heartbeat only.” Such a fantastic result caught the attention of news media around the world, prompting headlines such as “Medical marvel: AI can now help predict heart failure risk with 100% accuracy.” It seemed too good to be true, and it was. Two warning signs were already triggering suspicion.
The first warning sign was achieving 100% accuracy. Perfect accuracy is usually an indication of a problem. Very few things in life can be known with 100% certainty. Either the outcome is trivially easy to predict, or you’ve accidentally used data that exploits knowledge of the outcomes.
The second warning sign was achieving that accuracy from a single heartbeat. That isn’t much data from which to infer a health outcome. ECG data is notoriously noisy, and to compound the problem, the study used only “lead 1” values (i.e., data sourced from only a single electrical lead).
The data was sourced from publicly available datasets containing almost half a million heartbeats. All data rows for each subject (person) were allocated to a single data partition (training/validation/test). So far, so good. Half a million data rows are enough data to find patterns, and it is best practice to allocate all data rows for each subject to a single data partition (to avoid target leakage problems such as those seen in the earthquake prediction study).
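Partitioning by subject, rather than by row, can be sketched as follows. This is a minimal illustration using only the standard library; the function name, the row layout, and the toy data are assumptions, not the method used in the paper.

```python
import random


def split_by_subject(rows, test_fraction=0.25, seed=0):
    """Split rows of (subject_id, features, label) so that no subject
    appears in both the training and the test partition."""
    subjects = sorted({subject for subject, _, _ in rows})
    rng = random.Random(seed)
    rng.shuffle(subjects)
    n_test = max(1, round(len(subjects) * test_fraction))
    test_subjects = set(subjects[:n_test])
    train = [r for r in rows if r[0] not in test_subjects]
    test = [r for r in rows if r[0] in test_subjects]
    return train, test


# Toy data: 8 hypothetical subjects with 5 rows each.
rows = [(f"s{i % 8}", [float(i)], i % 2) for i in range(40)]
train, test = split_by_subject(rows)

train_subjects = {r[0] for r in train}
test_subjects = {r[0] for r in test}
print("subjects in both partitions:", train_subjects & test_subjects)
```

The key design choice is to shuffle and split the list of subject ids first, and only then assign rows, so every row follows its subject into a single partition.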
However, the 100% accuracy headline, used in the abstract and news articles, is misleading. It was calculated on the training data. Whenever you measure the accuracy on the training data, you overestimate the accuracy. It is quite common for a model to overfit, memorizing the training data and then performing worse on new data. In this case, the accuracy on test data was only 97.8%.
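To see why training accuracy overstates performance, consider a deliberately extreme toy model that simply memorizes its training examples. The class, the random labels, and the split below are all invented for illustration and have nothing to do with the ECG data.

```python
import random
from collections import Counter


class Memorizer:
    """Toy model: recalls training labels, guesses the majority class otherwise."""

    def fit(self, X, y):
        self.lookup = dict(zip(X, y))
        self.default = Counter(y).most_common(1)[0][0]
        return self

    def predict(self, X):
        return [self.lookup.get(x, self.default) for x in X]


def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


rng = random.Random(42)
X = list(range(200))                 # unique ids as inputs, so there is no real signal
y = [rng.randint(0, 1) for _ in X]   # random labels

X_train, y_train = X[:150], y[:150]
X_test, y_test = X[150:], y[150:]

model = Memorizer().fit(X_train, y_train)
train_acc = accuracy(y_train, model.predict(X_train))
test_acc = accuracy(y_test, model.predict(X_test))
print(f"training accuracy: {train_acc:.2f}, test accuracy: {test_acc:.2f}")
```

The memorizer is perfect on the rows it has seen even though the labels are pure noise, which is why only the out-of-sample number says anything about future performance.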
After delving more deeply into the data, we discovered problems with the experimental design. Even though there are half a million rows of data, there are only 33 subjects. That means there are only 33 independent outcomes, not enough to trust that the model has generalized to work for most of the population. The risk is that, instead of predicting congestive heart failure, the model learned to identify individual subjects, as a cheating shortcut to predicting the outcome.
Of the 33 subjects in the study, 18 were normal healthy subjects, while 15 subjects had severe congestive heart failure. However, the healthy subjects came from a different database than the unhealthy subjects. The two databases were not directly comparable. They collected data from different machines at different times. Furthermore, the two data sources stored the heartbeat data at different sampling frequencies, requiring the higher-frequency data to be downsampled to be comparable to the lower-frequency data, which creates pre-processing artifacts in the data. It is quite likely that the algorithm is cheating by using the characteristics of the transformed data source, rather than the attributes of heartbeats from healthy versus unhealthy hearts.
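A quick way to spot this kind of confounding is to cross-tabulate the data source against the outcome before training anything. The sketch below assumes hypothetical source names and a hypothetical helper function; only the subject counts (18 healthy, 15 with heart failure) come from the study as described above.

```python
from collections import Counter


def source_predicts_label(records):
    """Return True if every data source maps to exactly one label,
    i.e. knowing the source alone is enough to know the outcome."""
    labels_per_source = {}
    for source, label in records:
        labels_per_source.setdefault(source, set()).add(label)
    return all(len(labels) == 1 for labels in labels_per_source.values())


# One record per subject: (source database, outcome). Source names are made up.
records = [("database_a", "healthy")] * 18 + [("database_b", "chf")] * 15

crosstab = Counter(records)
for (source, label), count in sorted(crosstab.items()):
    print(source, label, count)

confounded = source_predicts_label(records)
print("source alone determines the label:", confounded)
```

When the check returns True, any artifact that distinguishes the sources (machine, sampling rate, pre-processing) can stand in for the outcome, so high accuracy does not show that the model learned anything about hearts.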
To summarize, the study results are too good to be true because:
- The headline accuracy wasn’t measured on out-of-sample data.
- The heartbeats came from only 33 people.
- Data for unhealthy and healthy subjects came from different sources, with different characteristics.
The tools to build machine learning and AI models are becoming easier to access and use. But errors are still prevalent, and those errors can be counterintuitive and subtle. They can easily lead you to the wrong conclusions if you’re not careful. For more examples of how target leakage occurs, I highly recommend this presentation and tutorial on target leakage.
Be suspicious of your model if the results seem too good to be true. Follow best practices to reduce the chance of errors:
- Measure accuracy only on out-of-sample data, such as the validation data or the holdout data, never from the training data.
- Partition your data carefully. Don’t allow data from the same subject (whether that be a person or a geological fault line) to be allocated across multiple partitions.
- Check for target leakage. When possible, use tools that have guardrails that detect potential target leakage.
- Use the latest generation of explainability tools, checking for common sense behavior. Check which input features are the most important. For the most important input features, check which values lead to yes or no decisions, or high or low predictions.
- Take extra care with data that spans multiple periods. It is very easy for partial target leakage to creep in via your feature engineering or hyperparameter tuning. When possible, use tools that are capable of time-aware model training.
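For data that spans multiple periods, the simplest form of time-aware partitioning is an out-of-time split: train on everything before a cutoff date and hold out everything on or after it. The sketch below is a minimal illustration with an assumed row layout, an assumed cutoff, and a hypothetical function name. It is not a description of how any particular tool implements time-aware training.

```python
from datetime import date


def out_of_time_split(rows, cutoff):
    """Split rows of (observation_date, features, label) so that the
    holdout contains only observations on or after the cutoff date."""
    train = [r for r in rows if r[0] < cutoff]
    holdout = [r for r in rows if r[0] >= cutoff]
    return train, holdout


# Toy data: one observation per month of 2019.
rows = [(date(2019, month, 1), [month], month % 2) for month in range(1, 13)]
train, holdout = out_of_time_split(rows, cutoff=date(2019, 10, 1))

print("latest training date:", max(r[0] for r in train))
print("earliest holdout date:", min(r[0] for r in holdout))
```

Any feature engineering or hyperparameter tuning should then use the training rows only, so that nothing from the holdout period leaks backwards into the model.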
Are your current methods and tools at risk of creating AIs that are too good to be true?