AI in Financial Markets, Part 4: Best Practice Makes Perfect
So far in this series, we’ve discussed how the securities industry uses AI in general (part one), compared machine learning in the front office to “traditional” quant (part two), and described how automated machine learning in particular can bring enormous power and efficiency gains to the quantitative investment process (part three). In this installment, we’ll describe our recommendations for ensuring that your machine learning models won’t just look great on paper but will actually work on new data in production. We’ll also look at how DataRobot’s integrated best practices and guardrails can help you with this.
It’s a trap!
“It is trivial to produce a historical walk-forward backtest with a high Sharpe ratio, by trying thousands of alternative model specifications. […] As a result of this selection bias, most published discoveries in finance are false.” — Prof. Marcos López de Prado, Advances in Financial Machine Learning
Whoa. Ouch. Don’t hold back, Professor. Practitioners will have a great deal of sympathy with this view; it’s all too easy to cross the line and end up doing the dreaded data mining rather than building a robust investment strategy. The annals of investment management are littered with stories of finance academics’ investment vehicles that failed to monetize very promising “discoveries” as they were unable to replicate their stellar “paper” results with real money. At the same time, it’s important that we don’t confuse data mining (bad, and likely to fool us with randomness) with the process of repeated, iterative calibration that’s usually needed in order to make a machine learning model work well; even the most sophisticated automated machine learning solution may require some rounds of tinkering before being ready for the world.
Philosophy, not Foolosophy
So, how do we avoid this trap, while still constructing models that are robust and workable? It’s important to approach the problem thoughtfully and pragmatically. We’ve found that keeping the following common-sense considerations in mind greatly helps our quant clients when building their machine learning models:
What are you actually testing? Be as specific as possible and don’t take the “I’ve got a bunch of data, let’s see what I can find” approach. Set up one or more hypotheses. “Stocks that experience earnings expectations downgrades and weak price momentum will outperform over the following three months” is a good hypothesis to test; “there’s got to be something in the text of earnings releases which will allow us to predict forward returns” isn’t. Keep simple probability theory in mind and don’t allow yourself to be fooled by randomness: assuming 50/50 odds of a successful backtest, each setup has a one-in-eight chance of passing three backtests by luck alone, so running as few as eight different problem setups across three backtests is likely to return one that looks good purely by chance.
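The arithmetic behind that warning is easy to check. The following is a minimal sketch, assuming each backtest is an independent coin flip with the 50/50 odds quoted above and that a setup only "looks good" if it passes every backtest; the function name is ours and purely illustrative.

```python
def prob_at_least_one_false_positive(n_setups, n_backtests, p_pass=0.5):
    """Chance that at least one setup passes every backtest by luck alone."""
    # Probability that a single setup passes all of its backtests by chance.
    p_setup_passes_all = p_pass ** n_backtests
    # Complement of "every setup fails at least one backtest".
    return 1.0 - (1.0 - p_setup_passes_all) ** n_setups

p = prob_at_least_one_false_positive(n_setups=8, n_backtests=3)
print(f"{p:.1%}")  # prints 65.6%
```

In other words, under these assumptions roughly two times out of three you would "discover" something from pure noise. Real backtests are rarely independent, so treat the number as an order-of-magnitude illustration.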
Unseen data is your best friend.
There really is no such thing as too much out-of-sample testing; too little happens all the time. When you’re evaluating or calibrating multiple variants of the same hypothesis, or comparing multiple machine learning approaches, you also need to be very careful that you don’t accidentally turn your “unseen” data into “seen” data (this can happen very easily and tends to end badly). Test the stability of your preferred model’s performance by exposing it to more, previously unseen data once you’ve chosen it; DataRobot keeps a further holdout set locked away specifically for this purpose and allows users to further evaluate stability on more, ‘external’ unseen data if desired. Remember that models’ in-sample performance should look great by design and, therefore, holds no information whatsoever; this is why DataRobot does not expose in-sample performance metrics or analytics numbers as a rule.
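One simple way to keep unseen data unseen is to carve off the most recent slice of history before any modeling starts and not touch it until a final model has been chosen. The sketch below is our own illustration of that idea for time-ordered data, not a description of how DataRobot partitions data internally; the function name and the 20% fractions are assumptions.

```python
def chronological_split(rows, validation_frac=0.2, holdout_frac=0.2):
    """Split time-ordered rows into train, validation and holdout, oldest first."""
    n = len(rows)
    n_holdout = int(n * holdout_frac)
    n_validation = int(n * validation_frac)
    train_end = n - n_holdout - n_validation
    train = rows[:train_end]
    validation = rows[train_end:n - n_holdout]   # used while comparing candidates
    holdout = rows[n - n_holdout:]               # looked at once, at the very end
    return train, validation, holdout

# Stand-in for 100 observations already sorted by date (assumption).
observations = list(range(100))
train, validation, holdout = chronological_split(observations)
print(len(train), len(validation), len(holdout))  # prints 60 20 20
```

Because the split is by position rather than at random, every validation row is later than every training row, and every holdout row is later still.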
“One model for all seasons” is not an actual thing.
The economy evolves over time; expectations about the business cycle evolve even faster, as do financial markets environments. Don’t expect your models to work over the long term; better to have the agility to quickly identify when a régime change is on the horizon — and to react quickly. Automated machine learning gives you the power to efficiently develop a veritable arsenal of models suited to particular market régimes and/or to quickly build new models in response to a change in environment. DataRobot’s MLOps data drift tracking allows users to identify phase shifts in deployed models’ input data, even before prediction outcomes (actuals) are known.
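To make the idea of input drift concrete, here is a minimal sketch of one widely used drift statistic, the Population Stability Index (PSI), which compares the binned distribution of a feature at training time with its distribution in recent scoring data. We are not claiming this is the statistic DataRobot MLOps uses; the function, bin count and synthetic data are all assumptions for illustration. Note that it needs only the inputs, not the actuals.

```python
import math
import random

def population_stability_index(reference, current, n_bins=10, eps=1e-6):
    """PSI between two samples, using equal-width bins over the reference range."""
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / n_bins  # assumes the reference sample is not constant

    def bin_shares(values):
        counts = [0] * n_bins
        for v in values:
            idx = int((v - lo) / width)
            idx = min(max(idx, 0), n_bins - 1)  # clamp outliers into the edge bins
            counts[idx] += 1
        # Floor each share at eps so the logarithm below is always defined.
        return [max(c / len(values), eps) for c in counts]

    ref, cur = bin_shares(reference), bin_shares(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref, cur))

rng = random.Random(0)
training = [rng.gauss(0.0, 1.0) for _ in range(5000)]
same_regime = [rng.gauss(0.0, 1.0) for _ in range(5000)]
new_regime = [rng.gauss(1.0, 1.5) for _ in range(5000)]  # shifted mean, higher volatility

psi_same = population_stability_index(training, same_regime)
psi_shift = population_stability_index(training, new_regime)
print(f"same regime: {psi_same:.3f}, new regime: {psi_shift:.3f}")
```

A commonly quoted rule of thumb treats a PSI below 0.1 as stable and above 0.25 as a material shift, but any alert threshold should be calibrated to your own data.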
Backtest for stability and decay.
Related: simulate how your model maintenance and refresh process would look in production. Would you have reached the same conclusions if you’d built your models six months, a year, two years, five years ago? Use simulations to understand how often you will need to retrain your model (i.e. how quickly its performance would have decayed after training and what scheduled refresh frequency will make sense). DataRobot’s powerful Python and R API integrations allow users to add another layer of automation to the automated machine learning process to very efficiently carry out these kinds of simulations. Once your model is deployed, DataRobot MLOps model accuracy and data drift monitoring keeps a watchful eye on it — also flagging when an unscheduled refresh might be in order due to unexpected performance deterioration or data changes.
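The decay question can be explored with a small walk-forward simulation: retrain at regular intervals, leave each model untouched for a while, and record how its error grows with age. The sketch below uses only the Python standard library, a deliberately trivial one-factor model and synthetic data whose true relationship drifts over time; all of these are assumptions for illustration, and none of it is DataRobot API code.

```python
import random
from statistics import mean

def fit_slope(xs, ys):
    """Least-squares slope of y on x for a line through the origin."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

def decay_profile(xs, ys, train_len, horizon, step):
    """Mean absolute error, indexed by periods elapsed since training."""
    errors_by_age = [[] for _ in range(horizon)]
    # Move the training cut-off forward through history, refitting each time.
    for train_end in range(train_len, len(xs) - horizon + 1, step):
        window = slice(train_end - train_len, train_end)
        slope = fit_slope(xs[window], ys[window])
        # Score the frozen model on each of the following `horizon` periods.
        for age in range(horizon):
            t = train_end + age
            errors_by_age[age].append(abs(ys[t] - slope * xs[t]))
    return [mean(errs) for errs in errors_by_age]

# Synthetic data (assumption): the true slope drifts upward over time.
rng = random.Random(42)
n = 600
xs = [rng.gauss(0.0, 1.0) for _ in range(n)]
ys = [(1.0 + 0.01 * t) * x + rng.gauss(0.0, 0.1) for t, x in enumerate(xs)]

profile = decay_profile(xs, ys, train_len=50, horizon=60, step=10)
fresh = mean(profile[:10])    # average error in the first 10 periods after training
stale = mean(profile[-10:])   # average error 50 to 59 periods after training
print(f"error when fresh: {fresh:.2f}, error when stale: {stale:.2f}")
```

Plotting `profile` against model age gives a rough picture of how long a model stays usable, which in turn suggests a sensible scheduled refresh frequency.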
Sometimes less data is more; sometimes more data is more, but less often than you’d think.
Sometimes reducing the amount of data that you’re training your model with by shortening the timespan used in training can yield better results, especially in time-series problems and other fields where behaviors are unstable over time. To identify the period over which the observed behaviors are strongest, test multiple different model training durations and evaluate how these affect model performance. As before, use API automation to make this process as painless as possible.
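As a toy illustration of testing several training durations, the sketch below scores a trailing-mean forecast with different lookback windows on a synthetic series whose level shifts every 100 periods. The model, the data and the candidate windows are assumptions chosen to keep the example self-contained; in practice you would swap in your own model and drive the loop through whatever API automation you use.

```python
import random
from statistics import mean

def walk_forward_mae(series, window, start):
    """One-step-ahead MAE of a trailing-mean forecast, scored from `start` onward."""
    errors = []
    for t in range(start, len(series)):
        forecast = mean(series[t - window:t])  # uses only data before time t
        errors.append(abs(series[t] - forecast))
    return mean(errors)

# Synthetic series (assumption): the level jumps to a new regime every 100 periods.
rng = random.Random(7)
levels = [0.0, 2.0, -1.0, 3.0, 1.0]
series = [level + rng.gauss(0.0, 1.0) for level in levels for _ in range(100)]

candidate_windows = [5, 20, 60, 200]
start = max(candidate_windows)  # score every window on the same periods
scores = {w: walk_forward_mae(series, w, start) for w in candidate_windows}
best_window = min(scores, key=scores.get)

for w in candidate_windows:
    print(f"window={w:>3}  MAE={scores[w]:.3f}")
print("best window:", best_window)
```

Scoring every candidate over the same evaluation periods matters: otherwise a shorter window would be judged on a different, possibly easier, stretch of history.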
Smell tests and sense checks are important.
Have I bored you enough about the importance of domain expertise yet? If something looks great but you can’t fully understand or explain the underlying behavior, chances are you’ve found a statistical artefact that you won’t be able to replicate in production. Granted, many financial markets behaviors are complex and may seem counterintuitive at first glance, but always ask yourself whether your findings make intuitive sense — or can, at least, be explained.
Algorithmic diversity is useful for sense checking.
We’ve already mentioned that there’s no such thing as an algorithm that can magically create “signal” out of thin air or random data. Conversely, if you’ve found something that models well when using one family of machine learning models but shows zero promise with any other approach: yep, you’ve probably found yourself another statistical artefact. To avoid this, use automated machine learning to try out as many different machine learning approaches to a given problem as you can lay your hands on. Only proceed if you can replicate your modeling success, at least to an extent, with multiple different machine learning techniques. (Did I mention that DataRobot’s automated machine learning allows users to evaluate a wide variety of machine learning algorithms, based on about a dozen open source libraries?)
Naïveté is a virtue.
While we’re on the subject of sense checking: don’t forget to check the models you build against naïve baselines. These are, simply put, “stupid models” that always predict the same thing, such as the average of the target variable or the most frequent response. In time-series modeling, a common baseline is the most recent value of the target variable. It’s important to calculate the error metrics for these and compare them to your candidate models; if you can’t outperform them, it’s a sure sign that you need to do some more work before you actually have something ready to go.
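A baseline comparison takes only a few lines. The sketch below computes the mean absolute error of two naive forecasts, the training-set average and the most recent observed value, and checks a candidate model against both. The helper names, the toy numbers and the candidate predictions are hypothetical.

```python
from statistics import mean

def mae(actual, predicted):
    """Mean absolute error between two equal-length sequences."""
    return mean(abs(a - p) for a, p in zip(actual, predicted))

def naive_baseline_errors(train, test):
    """MAE on `test` for two naive forecasts built only from past data."""
    # Baseline 1: always predict the average of the training target.
    mean_forecast = [mean(train)] * len(test)
    # Baseline 2: predict the previous observed value (one step ahead).
    last_value_forecast = [train[-1]] + list(test[:-1])
    return {
        "train_mean": mae(test, mean_forecast),
        "last_value": mae(test, last_value_forecast),
    }

train = [1.0, 2.0, 3.0]
test = [4.0, 5.0, 6.0]
candidate_predictions = [4.5, 5.5, 6.5]  # hypothetical output of your model

baselines = naive_baseline_errors(train, test)
candidate_error = mae(test, candidate_predictions)
beats_all = all(candidate_error < err for err in baselines.values())
print(baselines, candidate_error, beats_all)
# prints {'train_mean': 3.0, 'last_value': 1.0} 0.5 True
```

If `beats_all` comes back `False`, the candidate is not yet adding anything over a forecast that requires no modeling at all.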
Don’t expect to shoot the lights out.
Financial markets are noisy, financial markets data even more so. Just like with other investment decisions, it pays to be realistic about what’s a good outcome: the old adage suggests that you’re doing well if you’re right 55% of the time, and that you can make money from being right less than 50% of the time as long as you are disciplined at managing risk. This applies to machine learning in quant finance, too: an R² of 0.2 to 0.3 can be a good outcome in this context, with 0.4 or 0.5 high enough to seem deeply suspicious. (We’ll save the discussion on why R² is generally not a good metric for machine learning models for some other time.)
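The adage about making money while being right less than half the time comes down to simple expectancy arithmetic. Here is a minimal sketch; the 45% hit rate and the 2:1 ratio of average win to average loss are illustrative assumptions, not figures from any real strategy.

```python
def expectancy(hit_rate, avg_win, avg_loss):
    """Expected profit per trade, in the same units as avg_win and avg_loss."""
    return hit_rate * avg_win - (1.0 - hit_rate) * avg_loss

def breakeven_hit_rate(avg_win, avg_loss):
    """Hit rate at which expected profit per trade is exactly zero."""
    return avg_loss / (avg_win + avg_loss)

# Right only 45% of the time, but winners are twice the size of losers.
edge = expectancy(hit_rate=0.45, avg_win=2.0, avg_loss=1.0)
needed = breakeven_hit_rate(avg_win=2.0, avg_loss=1.0)
print(f"expectancy: {edge:.2f}, breakeven hit rate: {needed:.1%}")
# prints expectancy: 0.35, breakeven hit rate: 33.3%
```

With symmetric wins and losses the breakeven hit rate is 50%, which is why a 55% hit rate already counts as doing well; disciplined risk management works by keeping the average loss small relative to the average win.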
Self-awareness is paramount.
That amazing, super-accurate model you just built which is making you look like some kind of quant investing genius and will let you retire before you hit 30? Sorry, but it’s much more probable that you’re not actually the quant Einstein: acknowledge this and look for errors in your model — it’s likely that you’re experiencing data leakage of some sort. If it looks too good to be true, it usually is. Surprisingly often, you will end up spending time on actively making your model performance worse — but realistic and replicable. A good understanding of the data is crucial, as it can be hard to detect the root causes of these types of errors (data leakage and target leakage over time); DataRobot’s automated target leakage detection capabilities and sophisticated analytics support this by highlighting where to look for such leakage.
Automatic for all the people
“We’re functioning automatic / And we are dancing mechanic / We are the robots” — Kraftwerk (1978)
One of the great advantages of automated machine learning is that, as we saw, a lot of what constitutes best practice can be incorporated into the automated workflow. DataRobot’s product design philosophy embodies this: we provide an AI Cloud platform that bakes in a lot of expertise, best practices and guardrails. This can be particularly useful for those occasions where you’re simply not aware that you’re doing things that aren’t a great idea, or haven’t the time to get into the detail of what constitutes best practice for a given machine learning approach.
Turbocharge your expertise and self-awareness by outsourcing best practices.
Our quant investment management clients tell us that what they love about DataRobot is how it allows them to focus on their value add — their deep expertise in financial markets and understanding of financial data — by automating much of the model-specific data engineering, training and selection processes. DataRobot’s built-in guardrails such as automated target leakage detection, data quality checking, redundant feature detection and feature reduction, amongst many others, complete the picture.
Incidentally, while you’re at it, outsource (some of) your compliance documentation workload: if you’re reading this, you’re probably working in a highly regulated environment, where model validation and/or model risk management is becoming increasingly important. This often requires detailed documentation of the model building process and its results. DataRobot automatically generates compliance documentation; our customers speak of a 50%-70% reduction in overhead as a result. And that’s before we even start talking about the need for challenger models, which get built automatically in DataRobot as a by-product of the automated machine learning process.
Not a miracle drug, but strong medicine
So, there we have it. Machine learning in quant finance: not Miracle On Wall Street, nor a panacea, but a versatile, useful addition to the traditional quant’s toolbox of mathematical and statistical techniques. Used with a modicum of good sense and an awareness of good practice, coupled with realistic expectations (and maybe some Python code), automated machine learning has the power to act as a force multiplier for quant analysts.
We hope that this series of blogs has illustrated why and how, despite the hype, increasing numbers of sell-side and buy-side professionals are harnessing the power of automated machine learning in their daily work and using DataRobot to build, deploy, and monitor models that generate millions of dollars annually. Will you join them?