Data Science Fails: There’s No Such Thing As A Free Lunch
When I was young, I took a packed lunch to school every day, and since I grew up in Australia, my packed lunch would include a couple of Vegemite sandwiches. Unless you grew up in Australia, you’ve probably never tasted Vegemite. And judging by one American’s first-taste reaction of “Oh, that’s bad!”, you probably wouldn’t like it if you tried it. But I loved my Vegemite sandwiches, and they were my one-and-only lunchtime choice, no matter what the circumstances.
While this blog isn’t about Vegemite, it is related to lunch, specifically the no free lunch theorem. In short, the theorem states that, averaged over all possible problems, no learning algorithm performs better than any other, which means that you can’t know in advance which algorithm will work best on your data. For readers with a technical background who want to dig into the theory, I recommend David Wolpert’s paper “The Lack of A Priori Distinctions between Learning Algorithms.” However, despite this well-established theorem, it is common practice for data scientists to rely on only a limited number of modeling methods. Typically, companies are trying to roll out projects quickly, leading to tight timeframes, and all too often people end up trying only one algorithm and/or a limited range of preprocessing steps and tuning parameters.
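To make the idea concrete, here is a toy sketch rather than a proof of the theorem. It pits two deliberately simple learners (a nearest-centroid classifier and a 1-nearest-neighbour classifier) against two made-up tasks. The tasks, sample sizes, noise level, and function names are all assumptions chosen for illustration.

```python
import random

def make_data(n, rule, noise, rng):
    """Sample n points in the unit square; label with `rule`, flipping each label with probability `noise`."""
    data = []
    for _ in range(n):
        x, y = rng.random(), rng.random()
        label = rule(x, y)
        if rng.random() < noise:
            label = 1 - label
        data.append(((x, y), label))
    return data

def sq_dist(p, q):
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2

def fit_centroid(train):
    """Nearest-centroid learner: predict the class whose mean point is closest."""
    centroids = {}
    for cls in (0, 1):
        pts = [p for p, label in train if label == cls]
        centroids[cls] = (sum(p[0] for p in pts) / len(pts),
                          sum(p[1] for p in pts) / len(pts))
    return lambda p: min(centroids, key=lambda c: sq_dist(p, centroids[c]))

def fit_1nn(train):
    """1-nearest-neighbour learner: copy the label of the closest training point."""
    return lambda p: min(train, key=lambda row: sq_dist(p, row[0]))[1]

def accuracy(predict, test):
    return sum(predict(p) == label for p, label in test) / len(test)

def compare(seed=0):
    rng = random.Random(seed)
    # Two hypothetical tasks: a noisy straight-line boundary and a clean XOR pattern.
    tasks = {
        "noisy_linear": (lambda x, y: int(x > 0.5), 0.2),
        "xor": (lambda x, y: int((x > 0.5) != (y > 0.5)), 0.0),
    }
    learners = {"centroid": fit_centroid, "1nn": fit_1nn}
    results = {}
    for task, (rule, noise) in tasks.items():
        train = make_data(400, rule, noise, rng)
        test = make_data(1000, rule, noise, rng)  # held-out data from the same process
        results[task] = {name: accuracy(fit(train), test) for name, fit in learners.items()}
    return results

results = compare()
for task, scores in results.items():
    print(task, scores)
```

In this setup the centroid learner should come out ahead on the noisy linear task, because 1-nearest-neighbour memorizes the flipped labels, while the ranking reverses on XOR, where no single straight boundary can separate the classes. Neither learner wins on both tasks, which is the practical message of the theorem.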
Humans can be biased in choosing algorithms. Sometimes they prefer to run a particular type of algorithm that they are comfortable with, or they will always choose to use a specific approach to treat missing values during data preparation. These are examples of cognitive biases such as status-quo bias and anchoring. In the data science community, there’s often a lot of hype surrounding the latest algorithm, whether it be about “deep learning” or a “decision jungle.” All of this hype and attention can also trigger human bias. The people building AI solutions are human, and just like everyone else, they are subject to biases such as the availability heuristic, the bandwagon effect, and appeal to novelty.
Case Study: Earthquake Aftershocks
The following case study demonstrates the bandwagon effect and appeal to novelty biases. It is a real-life example where the researchers used a hyped algorithm instead of checking which algorithm worked best. By the end of this case study, you’ll surely realize there is no such thing as a free lunch!
In August 2018, Nature published the article “Deep learning of aftershock patterns following large earthquakes.” Using a training dataset containing more than 130,000 mainshock–aftershock pairs, the authors trained a deep learning algorithm to “identify a static-stress-based criterion that forecasts aftershock locations without prior assumptions about fault orientation.” The paper describes the details of the deep learning algorithm, including its inputs and six-layer architecture. After comparing the model’s accuracy against that of well-known physics-based models, the authors concluded that “the neural-network forecast can explain aftershock locations better than can widely used metrics.” The results looked surprisingly accurate: the authors reported an AUC (area under the ROC curve) of almost 0.85.
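For readers unfamiliar with the metric, AUC can be read as the probability that a model scores a randomly chosen positive example above a randomly chosen negative one, so 0.5 is chance and 1.0 is a perfect ranking. The sketch below is a minimal pairwise version run on made-up labels and scores. It is not the evaluation code from the paper, and the function name is my own.

```python
def roc_auc(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count as half."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Made-up example: three of the four positive/negative pairs are ordered correctly.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(roc_auc(labels, scores))  # 0.75
```

The pairwise loop is quadratic in the number of examples, which is fine for illustration; production libraries compute the same quantity from sorted ranks.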
Very quickly, news of this article went viral. Headlines included “Google and Harvard team up to use deep learning to predict earthquake aftershocks” and “Artificial intelligence nails predictions of earthquake aftershocks.” The research was even included in the release notes for TensorFlow as an example of what deep learning could do.
However, for me, the article and subsequent hype raised a couple of red flags:
- Forecasting seismic activity is challenging. The accuracy looks too good to be true.
- The paper uses only a single machine learning algorithm, applying only one architecture.
Due to space limitations, we will return to the first identified red flag in a later blog. Here we will focus on the second point, the lack of algorithmic diversity.
At the same time that the Nature article was published, industry analysts were commenting on the hype associated with certain technologies. One industry analyst listed “Deep Neural Networks” as being at the peak of the hype cycle. The authors of the Nature article had only used an algorithm that industry analysts categorized as overhyped!
A year later, Nature published a follow-up article, “One neuron versus deep learning in aftershock prediction,” written by different authors but using the same data. This new article compared the original model against a simple linear logistic regression model, concluding that the simpler model “provides comparable or better accuracy.” The authors further concluded that “deep learning does not offer new insights or better accuracy in predicting aftershock patterns.”
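The “one neuron” in that title makes sense once you notice that a logistic regression is mathematically a single sigmoid unit: a weighted sum of the inputs plus a bias, squashed into a probability. Here is a minimal sketch of that idea, trained by plain gradient descent on synthetic two-feature data. The data, learning rate, and epoch count are assumptions for illustration and have nothing to do with the aftershock dataset.

```python
import math
import random

def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def neuron_output(w, b, x):
    """One neuron: weighted sum of the inputs plus a bias, passed through a sigmoid."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)

def train_one_neuron(X, y, lr=0.5, epochs=300):
    """Fit the weights and bias by batch gradient descent on the log loss."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = neuron_output(w, b, xi) - yi  # log-loss gradient w.r.t. the weighted sum
            for j in range(d):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def sample(n, rng):
    """Hypothetical data: two features in [-1, 1]; the label is 1 when their sum is positive."""
    X = [[rng.uniform(-1, 1), rng.uniform(-1, 1)] for _ in range(n)]
    y = [int(x0 + x1 > 0) for x0, x1 in X]
    return X, y

rng = random.Random(0)
X_train, y_train = sample(200, rng)
X_test, y_test = sample(200, rng)
w, b = train_one_neuron(X_train, y_train)
test_accuracy = sum(
    int(neuron_output(w, b, x) >= 0.5) == t for x, t in zip(X_test, y_test)
) / len(y_test)
print(f"weights={w}, bias={b:.3f}, test accuracy={test_accuracy:.2f}")
```

A model this small has only three numbers to inspect, which is part of its appeal as a baseline: if it matches a deep network, the extra layers are not earning their keep.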
Humans can be biased in choosing algorithms. You may have your favorites, or you may be excited to try the coolest and latest algorithms. But to avoid algorithm bias, step aside and let a competition between champion and challenger models decide which method is superior. A lack of diversity in model-building usually leads to suboptimal results. A recent benchmarking exercise on a wide range of business use cases concluded: “The diversity of algorithms earning top accuracy rankings demonstrates the need to test as many different algorithms as possible to find the best one for your data.”
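A champion/challenger competition can be as simple as scoring every candidate on the same held-out folds and letting the numbers pick the winner. The sketch below assumes a made-up one-feature dataset and two hypothetical candidates, a majority-class incumbent and a single-threshold challenger. In practice you would plug in many more models and a metric that suits your problem.

```python
import random

def make_threshold_data(n, seed=1):
    """Hypothetical data: one feature x in [0, 1); label is 1 when x > 0.6, with 10% label noise."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x = rng.random()
        y = int(x > 0.6)
        if rng.random() < 0.1:
            y = 1 - y
        data.append((x, y))
    return data

def fit_majority(train):
    """Champion (incumbent): always predict the most common training label."""
    majority = int(2 * sum(y for _, y in train) >= len(train))
    return lambda x: majority

def fit_stump(train):
    """Challenger: predict 1 when x exceeds the cut that scores best on the training rows."""
    def train_hits(cut):
        return sum(int(x > cut) == y for x, y in train)
    best_cut = max((x for x, _ in train), key=train_hits)
    return lambda x: int(x > best_cut)

def cross_val_accuracy(fit, data, k=5, seed=0):
    """Mean held-out accuracy over k folds; a fixed seed gives every model the same folds."""
    order = list(range(len(data)))
    random.Random(seed).shuffle(order)
    scores = []
    for fold in range(k):
        held_out = set(order[fold::k])
        train = [row for i, row in enumerate(data) if i not in held_out]
        test = [row for i, row in enumerate(data) if i in held_out]
        predict = fit(train)
        scores.append(sum(predict(x) == y for x, y in test) / len(test))
    return sum(scores) / k

data = make_threshold_data(300)
candidates = {"champion_majority": fit_majority, "challenger_stump": fit_stump}
leaderboard = sorted(
    ((cross_val_accuracy(fit, data), name) for name, fit in candidates.items()),
    reverse=True,
)
for score, name in leaderboard:
    print(f"{name}: {score:.3f}")
winner = leaderboard[0][1]
```

The design choice that matters here is that every candidate sees identical folds, so the ranking reflects the models rather than the luck of the split. Adding a new challenger is one more entry in the `candidates` dictionary.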