AI in Financial Markets, Part 2: A Whole New World. No Prior Needed.
“I define nothing. Not beauty, not patriotism. I take each thing as it is, without prior rules about what it should be.” — Bob Dylan
It’s entirely possible that someone, somewhere, has successfully built one of the amazing market-predicting magic boxes from the previous post in this series, but if so, they’re (sensibly) keeping very quiet about it; the fact that we’re called “DataRobot, Inc.,” rather than, say, “DataRobot Capital Partners,” is evidence that we certainly haven’t built one yet¹. To recap, predicting share prices is actually one of the most difficult tasks in machine learning: machine learning succeeds where there are complex behaviors that can be described in data and where consistent inputs lead to consistent outcomes. All too often, outcomes in financial markets are anything but consistent.
Examples Are Better When They Involve Small Furry Animals.
Let’s illustrate this with a ridiculously oversimplified example². Imagine you’re training a puppy to fetch you your morning newspaper³. On Monday morning, your puppy runs to the front door when it hears the paperboy, grabs the newspaper, and brings it to you. You give the puppy a treat and lots of praise for being a good boy. Tuesday morning, the puppy does the same, but doesn’t get a treat, although he does receive a pat on the head. On Wednesday, the puppy again fetches the newspaper (because he’s an enthusiastic fictitious little dog and doesn’t understand humans’ strange ways) and is rewarded with three dog treats. On Thursday, the puppy fetches the newspaper yet again — but you scold him, take away the rest of his food, and lock him up⁴.
The puppy spends the rest of Thursday pondering what he did wrong and decides on Friday that it might have been better for him if he’d been born a kitten, because at least there’s no expectation that kittens learn from such bizarre, inconsistent behaviors—they can just do their own thing. And you give up on the idea of canine-assisted newspaper deliveries in the morning.
This is hugely oversimplified, but the point is that it’s very hard for anyone or anything to learn anything when the same inputs and actions can produce very different outcomes.
No Free Lunches Here. But the Food Is Very Tasty.
So machine learning techniques cannot magically make “signal” appear out of thin air, nor can they necessarily make unstable factors more stable. So far, so disappointing, so unsurprising. Nevertheless, there are a number of useful characteristics that set machine learning techniques apart from the traditional quantitative and ‘quantamental’ investing toolbox:
Machine learning models are (often) non-linear, non-parametric and don’t assume a prior statistical distribution. Machine learning models are typically built algorithmically rather than using statistical techniques. This makes such models better able to detect discontinuities in outcomes as a result of input variables crossing certain thresholds (step functions or changes in behaviors), and frees them from the tyranny of assuming that the normal (or log-normal) distribution⁵ underlies the phenomenon to be modeled.
Machine learning models (usually) focus on quality of actual predictions rather than statistical measures of fit and significance. Statistical measures of fit are often concerned with how closely a given model conforms to an assumed, prior distribution, and are optimized towards this goal. Machine learning models are optimized and evaluated in terms of how good their predictions are compared to what actually happened, with best practice requiring that machine learning models are scored on their ability to make predictions on previously unseen (out-of-sample) data with known outcomes.
Data science techniques are (often) very good at extracting value from non-traditional forms of data. Evaluating the sentiment of central bank statements? Scoring the risk of credit downgrades from company filings? Using satellite imagery to understand economic activity levels? There’s been a boom in alternative sources of data in the last decade; while they are probably not the investing panacea that some might suggest, adding text or image data, processed with modern natural language processing and deep learning techniques, can result in model performance improvements of some 10-20%.
Machine learning models are (often) more robust in the face of poor data quality. Certain families of machine learning algorithms, notably those based on decision trees, interpret missing values as just another ‘branch’ of a tree and don’t require interpolating or otherwise backfilling missing data. Indeed, in some cases, if there is a systematic pattern to which values are missing, these types of algorithms can identify this and use it as a signal in itself.
Machine learning models are (often) better at dealing with multiple highly correlated variables and don’t necessarily require (or benefit from) parsimony. This means that, rather than carrying out time-consuming univariate analyses before building a multivariate model, model building can precede feature selection; while high correlations between features can still affect the interpretability of a model, this can be dealt with by taking collinear groups of variables into account when calculating outcomes-based interpretability.
Machine learning models are (usually) much easier to interpret than you might expect. For a long time, machine learning has had a bad rap, with models decried as mysterious black boxes where it’s impossible to see what drives them. This is unjustified and outdated: modern outcomes-based interpretability techniques test machine learning models on how they react to certain stimuli being applied. Done systematically, these techniques are conceptually similar to a risk manager subjecting a portfolio to multiple scenarios and stress tests; they answer questions such as what the most important drivers of model performance are, how sensitive predictions are to changes in these drivers, and what the reasons are behind the individual predictions made by a machine learning model.
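To make the "apply a stimulus and watch the reaction" idea a little more concrete, here is a minimal sketch of one such outcomes-based technique, permutation importance: shuffle one input column at a time and measure how much the model's accuracy drops. Everything in it is an assumption made for illustration: the `predict` function is a hypothetical stand-in for a trained model, the data is synthetic, and this is not a description of how any particular platform computes feature impact.

```python
import random


def predict(row):
    # Hypothetical stand-in for a trained model: it only ever looks at feature 0.
    return 1 if row[0] > 0.5 else 0


def accuracy(rows, labels):
    # Share of rows where the model's prediction matches the known outcome.
    hits = sum(predict(r) == y for r, y in zip(rows, labels))
    return hits / len(rows)


def permutation_importance(rows, labels, n_features, rng):
    # For each feature: shuffle that column, re-score, record the accuracy drop.
    baseline = accuracy(rows, labels)
    importances = []
    for j in range(n_features):
        column = [r[j] for r in rows]
        rng.shuffle(column)
        shuffled = [r[:j] + [v] + r[j + 1:] for r, v in zip(rows, column)]
        importances.append(baseline - accuracy(shuffled, labels))
    return importances


rng = random.Random(0)
# Synthetic data: feature 0 drives the outcome, feature 1 is pure noise.
rows = [[rng.random(), rng.random()] for _ in range(1000)]
labels = [1 if r[0] > 0.5 else 0 for r in rows]

importances = permutation_importance(rows, labels, 2, rng)
print(importances)
```

Scrambling the noise feature costs the model nothing, while scrambling the feature that actually drives the outcome leaves it guessing. That is the stress test in miniature: the model is treated purely in terms of how its outputs respond, without opening it up.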
It’s Not All Rainbows and Unicorns…
So, there are a number of excellent reasons to consider machine learning techniques for inclusion in an investor’s or quant’s arsenal. That said, everything has its limitations, and it’s important to be aware of these.
Garbage in, garbage out is still very much a thing. While machine learning models can be excellent at modeling highly complex behaviors and systems which are driven by hundreds or thousands of variables, ultimately, if there’s no signal (i.e., the input data doesn’t describe the underlying behaviors well and thus isn’t predictive of the outcome you are modeling), even the best machine learning algorithm won’t be able to turn this data into a usable model. Incidentally, this is why it’s important to be able to “fail fast” in data science; you want to avoid spending too much time trying to solve the intractable.
Non-parametric machine learning models (sometimes) aren’t boundless. Certain machine learning algorithms⁶ are great for many things but can struggle to deal with values of input variables they’ve not seen before. A linear model or neural network could simply extrapolate by applying a coefficient to an extreme value and making an “educated guess”⁷; a tree-based model won’t have coefficients and may end up effectively saying, “uh, I don’t know, what happened last time the inputs went to an extreme in this direction?”
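A toy sketch of the difference, under loudly stated assumptions: the data follows a made-up straight line (y = 2x), the "tree" is a tiny hand-rolled one-dimensional regression tree rather than any production algorithm, and all function names are invented for this example.

```python
def fit_line(xs, ys):
    # Ordinary least squares for a single input: returns (slope, intercept).
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx


def sse(points):
    # Sum of squared errors around the mean of the y values.
    mean = sum(y for _, y in points) / len(points)
    return sum((y - mean) ** 2 for _, y in points)


def fit_tree(points, depth):
    # points must be sorted by x. A leaf is a float; a split is a tuple.
    if depth == 0 or len(points) < 2:
        return sum(y for _, y in points) / len(points)
    # Pick the split position that minimizes the total squared error.
    i = min(range(1, len(points)),
            key=lambda k: sse(points[:k]) + sse(points[k:]))
    threshold = (points[i - 1][0] + points[i][0]) / 2
    return (threshold,
            fit_tree(points[:i], depth - 1),
            fit_tree(points[i:], depth - 1))


def predict_tree(node, x):
    # Walk down the splits until a leaf value is reached.
    while isinstance(node, tuple):
        threshold, left, right = node
        node = left if x <= threshold else right
    return node


# Training data only covers x from 0 to 10, so y never exceeds 20.
xs = [float(x) for x in range(11)]
ys = [2.0 * x for x in xs]

slope, intercept = fit_line(xs, ys)
tree = fit_tree(list(zip(xs, ys)), depth=3)

# Ask both models about an input far outside anything seen in training.
linear_guess = slope * 20.0 + intercept
tree_guess = predict_tree(tree, 20.0)
print(linear_guess, tree_guess, predict_tree(tree, 10.0))
```

The linear model carries its slope out to x = 20, while the tree gives the same answer for x = 20 as for x = 10, the largest input it has ever seen, and can never predict above the highest outcome in its training data. Whether the linear guess is any good out there is, as noted, a separate question.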
Machine learning models aren’t immune to overfitting (either). At some point in their careers (hopefully early on), most quants will have seen a model which looked amazing on paper and backtested incredibly well, only to stop working pretty much immediately once in production. For some, this may have caused an involuntary career change; others will appreciate the importance of making sure that machine learning models aren’t overfitted: you don’t want models to learn the training data so well that they don’t actually work on new data. Certain best practices need to be followed in order to avoid overfitting: ensure that models are properly validated, selected and calibrated on previously unseen data (out-of-sample validation and testing) and that they do not learn from information that would not be available at the time of prediction (target leakage).
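For a sense of how convincing an overfitted model can look, here is a deliberately extreme sketch. It assumes synthetic data with no signal at all (a random feature and coin-flip labels) and uses a 1-nearest-neighbour "model" that simply memorizes its training set; both are illustrative choices, not a recipe. Target leakage is not shown here.

```python
import random

rng = random.Random(42)


def make_noise(n):
    # One random feature and a coin-flip label: there is no signal to find.
    return [(rng.random(), rng.randint(0, 1)) for _ in range(n)]


train = make_noise(200)
test = make_noise(1000)  # previously unseen data with known outcomes


def predict(x):
    # 1-nearest-neighbour: return the label of the closest training point.
    return min(train, key=lambda p: abs(p[0] - x))[1]


def accuracy(data):
    # Share of points where the prediction matches the known label.
    return sum(predict(x) == y for x, y in data) / len(data)


in_sample = accuracy(train)
out_of_sample = accuracy(test)
print(f"in-sample accuracy:     {in_sample:.2f}")
print(f"out-of-sample accuracy: {out_of_sample:.2f}")
```

Scored on its own training data, the memorizer looks flawless, because every point's nearest neighbour is itself. Scored on data it has not seen, it does about as well as tossing a coin, which is all that was ever available. Only the second number says anything about how the model would behave in production.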
Artificial intelligence is no match for human laziness or stupidity; domain expertise and data understanding are as important as ever. With AI, a good model is a necessary but not sufficient condition to generate value; in order to really add value from predictive modeling—no matter what the techniques or algorithms used—an understanding of the domain being modeled and the data used to describe it is crucial.
…But Automating Machine Learning Helps.
Barriers to entry are falling fast. You no longer need armies of rocket scientists or PhDs in order to participate in the AI revolution. Modern automated machine learning platforms such as DataRobot hold benefits for users at a variety of technological skill levels, whether or not they are able to code; a lack of programming skills is no barrier to building sophisticated, powerful machine learning models with DataRobot⁸.
Automation accelerates governance and compliance processes. In heavily regulated industries such as the financial markets, the use of a standardized, automated process by which machine learning models are built, deployed, and managed can substantially reduce approval times and compliance overhead. This allows model validators to focus on the specifics of a given use case, secure in the knowledge that a reproducible end-to-end data science process has been followed, irrespective of the model builder’s technical abilities.
But it’s not just about making the technology accessible to a wider audience. With DataRobot, many repetitive parts of the process of building machine learning models are automated, while ensuring that best practices are designed in; this allows users to become far more productive. As a result, experienced quants and financial data scientists who understand their data well and are comfortable with coding in Python or R are finding that automated machine learning gives them the ability to massively expand the problem space they can explore.
We’ll get into what that actually means, and why it matters, in more detail in part three of this series.
¹ Further evidence, if it were needed: the author of this post drives a Volvo. Not a Ferrari.
² Here’s one for all the reinforcement learning fans out there.
³ Feel free to go ahead and imagine it’s 1995 while you’re at it.
⁴ No actual animals were harmed in the writing of this blog post.
⁵ Yes, even if it’s fat-tailed, leptokurtic, tautological or any other kind of distribution, for that matter. As we know, financial markets tend to appear log-normally distributed, until it really matters, at which point the distribution can look more like a toddler’s crayon drawing.
⁶ Especially decision trees and those algorithms based on them — random forests and gradient-boosted trees, for instance. They’re great when it’s unlikely that extrapolation will be needed, though, and can outperform even sophisticated neural networks in many ‘vanilla’ use cases.
⁷ How good such extrapolations actually are is a separate question; at extremes, it’s quite likely that the behavior(s) being modeled have changed and the educated guesses are pretty bad, too.
⁸ You should still be able to understand the data and generally find your way around a computer, though; if you struggle with Excel, maybe best leave machine learning to the youngsters.