AI in Financial Markets, Part 5 of 4: Flippantly Answered Questions
This was going to be a four-part blog series. I figured that I’d covered most of what financial markets participants might be interested in hearing about when it comes to AI in the markets (and had probably bored the audience for long enough, all things considered). Then we did a webinar on the topic together with FactSet, had a great turnout, and got asked lots of great questions from our listeners — so many, in fact, that we had to answer them by email as we ran out of time. And the questions were sufficiently good that we thought you might be interested in reading the Q&A, too, so here goes.
Financial markets are known to have an extremely non-stable structure. How do you find the right balance within your model between having a model that reacts quickly using mostly recent data (to have a better fit) vs. having a more stable model that can use longer (hence, more) data?
In one word: empirically. Remember that in financial markets, in particular, and time-series modeling in general, more history doesn’t automatically mean that you get better models. If a longer history means that you enter a different behavioral régime, your model performance will deteriorate. So trying out different lengths of the training period can be a valuable way of discovering how persistent a behavior actually is. Don’t go nuts, though. You need to avoid trying out so many variations that you end up with something that, by pure randomness, looks good but doesn’t generalize. You can avoid this kind of overfitting by being specific in your hypothesis formulation and rigorously testing your candidate models on multiple out-of-sample, out-of-time datasets; at all costs, avoid selection and look-ahead biases.
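To make "empirically" a little more concrete, here is a minimal sketch of comparing training-window lengths with a walk-forward, out-of-time test. Everything in it is an assumption for illustration: the synthetic data (a relationship whose sign flips halfway through), the one-variable least-squares model, and the two candidate windows. It is not how DataRobot does it internally.

```python
import random

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    var = sum((x - mx) ** 2 for x in xs)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = cov / var if var > 0 else 0.0
    return my - b * mx, b

def walk_forward_mae(xs, ys, window, start, step=1):
    """Mean absolute error of one-step-ahead predictions.

    At each time t >= start, fit only on the `window` observations
    strictly before t, then predict observation t (out-of-time).
    """
    errors = []
    for t in range(start, len(xs), step):
        a, b = fit_line(xs[t - window:t], ys[t - window:t])
        errors.append(abs(ys[t] - (a + b * xs[t])))
    return sum(errors) / len(errors)

# Synthetic data (assumption): the slope linking x to y flips sign at
# t = 300, mimicking a change of behavioral regime.
rng = random.Random(0)
xs = [rng.gauss(0, 1) for _ in range(600)]
ys = [(1.0 if t < 300 else -1.0) * x + rng.gauss(0, 0.2)
      for t, x in enumerate(xs)]

# Keep the candidate list short and fixed in advance to limit overfitting.
candidate_windows = [50, 300]
start = 400  # every candidate is scored on the same out-of-time span
results = {w: walk_forward_mae(xs, ys, w, start) for w in candidate_windows}
for w, mae in results.items():
    print(f"window={w:4d}  out-of-time MAE={mae:.3f}")
```

On this toy data the longer window reaches back into the old régime and does worse. On real data the answer could go either way, which is exactly why you test it.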
Are there particular areas of the markets where you see machine learning working better or worse than in others?
We see machine learning models adding value in lots of different areas of the markets and many different asset classes. As previously discussed, we see particular power in predicting second-order variables (i.e. modeling variables that influence returns, instead of directly predicting returns) and in a variety of use cases that form part of the business of financial markets (see blog post 1).
One interesting pattern that we have noticed is that generally, the longer the prediction horizon or the lower the data frequency, the more difficult it is to build good machine learning models. So quarterly numbers tend to be very difficult to work with, and monthly numbers can be a struggle sometimes, too. With such infrequent data, there is a trade-off between using a very limited number of data points in order to not go beyond the current behavioral environment (limited data often makes it hard to build good models), or working with a longer history that spans multiple behavioral régimes (thus being a worse fit for the individual ones, so again it’s hard to build good models). On the other hand, there are a lot of great use cases using tick and market microstructure data, which is a bit like a firehose: if you need some more data to work with, you just need to open the tap for a little while.
Another pattern we see is not a surprise: the less efficient the price discovery mechanism is, the more value that machine learning models can add—as long as there’s enough training data. So machine learning models probably won’t add much value on the U.S. Treasury Bond future, nor will they on a corporate bond that trades once every six months: the sweet spot will be somewhere in between.
One other thing that’s worth mentioning: simulation-type problems can be a bit of a tricky fit for supervised machine learning, as there’s often no clear concept of a “target variable” to predict. That said, you can use machine learning models to make the predictions which enter the simulations, instead of linear or parametric models; this generally doesn’t make the process any faster, but allows the simulator to take advantage of non-linearity, which can help. In some cases, machine learning can also be used to predict simulation outcomes as a function of the simulation input parameters; this can, for instance, make it much faster to price certain exotic derivatives.
You mentioned the difficulties that certain machine learning algorithms have with extrapolating beyond the bounds of the training data. If need be, can you focus on those algorithms that are able to extrapolate in DataRobot?
Yes, very easily — DataRobot’s leaderboard ranks the different candidate blueprints, models and algorithms by their out-of-sample performance. If you don’t want to use a model based on, say, decision trees, you would simply select a model trained using another algorithm family. The leaderboard comparison will show you whether there’s a trade-off between that model and the tree-based models in terms of out-of-sample performance.
As a reminder: even if an algorithm is able to make predictions that go beyond the limits of the training data, those predictions won’t necessarily make any sense, as you have no experience on whether the behaviors inside the training data are consistent in environments beyond it. Proceed with extreme caution!
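A small sketch of the extrapolation point, using a piecewise-constant "binned means" predictor as a simple stand-in for how a decision tree predicts from its leaves (this substitution, the training data and the bin count are all assumptions for illustration). The training inputs run from 0 to 10; we then ask both models about x = 20.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    var = sum((x - mx) ** 2 for x in xs)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = cov / var if var > 0 else 0.0
    return my - b * mx, b

def fit_binned_means(xs, ys, n_bins):
    """Piecewise-constant model: the mean of y inside equal-width bins of x."""
    lo, hi = min(xs), max(xs)
    width = (hi - lo) / n_bins
    overall = sum(ys) / len(ys)
    sums, counts = [0.0] * n_bins, [0] * n_bins
    for x, y in zip(xs, ys):
        i = min(int((x - lo) / width), n_bins - 1)
        sums[i] += y
        counts[i] += 1
    # Empty bins fall back to the overall mean.
    means = [s / c if c else overall for s, c in zip(sums, counts)]

    def predict(x):
        # Inputs outside the training range land in the first or last bin.
        i = int((x - lo) / width)
        return means[max(0, min(i, n_bins - 1))]

    return predict

# Training data (assumption): a noise-free upward trend on 0 <= x <= 10.
train_x = [i / 10 for i in range(101)]
train_y = [2 * x + 1 for x in train_x]

a, b = fit_line(train_x, train_y)
binned = fit_binned_means(train_x, train_y, n_bins=10)

x_new = 20.0  # well beyond anything seen in training
linear_pred = a + b * x_new
binned_pred = binned(x_new)
print("largest target seen in training:", max(train_y))
print("linear prediction at x=20:", round(linear_pred, 2))
print("binned prediction at x=20:", round(binned_pred, 2))
```

The binned model can never predict above the largest value it saw in training, while the linear model happily extends the trend. Neither is automatically right: the linear model is just assuming the same relationship holds out there.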
How would you handle scenarios where certain predictors are missing data for a few years in a longitudinal dataset, maybe because the data gathering did not begin for that predictor until recently?
First, I’d check whether the predictors actually add any value to the model by building a few different variants of the model: one trained on the full dataset including the variables with limited history (let’s call this Model A), trained over the period for which the full data is available, and another trained over the same period of time but excluding the limited-history variables (Model B). I’d also train a third model covering the full history that excludes the limited-history variables (Model B*). If Model A performs better than Model B, I probably would take that result and not investigate further; if it doesn’t, the comparison between Model B and Model B* will tell me whether adding further history actually helps model performance. If it does, and Model B* is better than Model A, I’d look for proxies for the limited-history variables with a longer history; if not, Model A is good to go.
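The decision rule for Models A, B and B* can be written down in a few lines. This is only a sketch: the function name and the returned strings are made up for illustration, and it assumes all three errors are measured on the same out-of-sample, out-of-time holdout, with lower meaning better.

```python
def next_step(err_a: float, err_b: float, err_b_star: float) -> str:
    """Pick the next action from three out-of-sample errors (lower is better).

    err_a      : Model A  (recent period, all variables)
    err_b      : Model B  (recent period, limited-history variables excluded)
    err_b_star : Model B* (full history, limited-history variables excluded)
    """
    if err_a < err_b:
        # The limited-history variables earn their keep: stop here.
        return "use Model A"
    if err_b_star < err_b and err_b_star < err_a:
        # More history helps and beats Model A: try to get the best of both.
        return "look for longer-history proxies"
    return "use Model A"

print(next_step(err_a=0.10, err_b=0.12, err_b_star=0.11))
print(next_step(err_a=0.12, err_b=0.10, err_b_star=0.08))
```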
If you’re referring to a scenario where you’re looking to backtest a strategy over a longer period of time and some of the data in your current model wouldn’t have been available in past periods, the solution is even simpler: evaluate a model built on just the longer history data for the period when the shorter history data isn’t available, then use a model built on the full dataset once it’s available.
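As a rough sketch of that stitched backtest: for each date, use whichever model could actually have been built at that point. The two toy models, the feature names (`momentum`, `sentiment`) and the cutoff date are all hypothetical.

```python
from datetime import date

def stitched_backtest(rows, full_model_ready, long_history_model, full_model):
    """Predict each row with the model that was usable on that date.

    rows             : list of (as_of_date, features_dict)
    full_model_ready : first date on which the full-dataset model could
                       have been trained and used
    """
    results = []
    for as_of, features in rows:
        if as_of >= full_model_ready:
            results.append((as_of, "full", full_model(features)))
        else:
            results.append((as_of, "long_history", long_history_model(features)))
    return results

# Toy stand-ins for two fitted models (assumptions for illustration).
def long_history_model(features):
    return 0.5 * features["momentum"]

def full_model(features):
    return 0.5 * features["momentum"] + 0.1 * features["sentiment"]

rows = [
    (date(2015, 1, 30), {"momentum": 0.2}),
    (date(2019, 1, 31), {"momentum": -0.1}),
    (date(2021, 1, 29), {"momentum": 0.4, "sentiment": 1.0}),
]
backtest = stitched_backtest(rows, date(2020, 1, 1), long_history_model, full_model)
for as_of, model_name, prediction in backtest:
    print(as_of, model_name, round(prediction, 3))
```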
So, tl;dr: try out different variants, empiricism wins again. Don’t go crazy with the different variants, as you don’t want to do the data science version of p-value hacking (quants will know this as the dreaded “data mining”). But comparing multiple models built in different ways usually gives good insights, especially when using DataRobot’s standardized analytics.
We hear a lot about the hybrid approach in machine learning. What is it, and does DataRobot support it?
Generally, hybrid approaches in machine learning combine two or more different types of algorithms in order to reduce model error and potentially solve problems which the individual algorithms would be less suited to, or less performant at. DataRobot has quite a few blueprints (machine learning pipelines) which use such approaches, typically combining a supervised machine learning algorithm (one that is designed to predict a target variable by learning from historical observations) with one or more unsupervised learning techniques (clustering, dimensionality reduction). We find that adding clustering, in particular, to a supervised machine learning algorithm like XGBoost can reduce prediction error by 10-15%, depending on the use case.
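Here is a toy illustration of the idea of pairing clustering with a supervised model: cluster the inputs first, then fit one simple model per cluster. It is a generic sketch, not a DataRobot blueprint. The data is synthetic and deliberately built so that the approach helps, so it says nothing about the 10-15% figure above; the tiny 1-D k-means and the one-variable linear model are stand-ins chosen to keep the example dependency-free.

```python
import random

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    var = sum((x - mx) ** 2 for x in xs)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = cov / var if var > 0 else 0.0
    return my - b * mx, b

def nearest(x, centers):
    """Index of the cluster center closest to x."""
    return min(range(len(centers)), key=lambda i: abs(x - centers[i]))

def kmeans_1d(xs, k, iters=20):
    """Tiny 1-D k-means with evenly spread starting centers."""
    lo, hi = min(xs), max(xs)
    centers = [lo + (i + 0.5) * (hi - lo) / k for i in range(k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for x in xs:
            groups[nearest(x, centers)].append(x)
        centers = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
    return centers

def mae(preds, ys):
    return sum(abs(p - y) for p, y in zip(preds, ys)) / len(ys)

# Synthetic data (assumption): two regimes with opposite slopes.
rng = random.Random(1)
xs = [rng.uniform(-3, 3) for _ in range(600)]
ys = [abs(x) + rng.gauss(0, 0.1) for x in xs]
train_x, test_x = xs[:400], xs[400:]
train_y, test_y = ys[:400], ys[400:]

# Baseline: one linear model for everything.
a, b = fit_line(train_x, train_y)
global_mae = mae([a + b * x for x in test_x], test_y)

# Hybrid: unsupervised clustering first, then one linear model per cluster.
centers = kmeans_1d(train_x, k=2)
models = []
for c in range(len(centers)):
    members = [(x, y) for x, y in zip(train_x, train_y) if nearest(x, centers) == c]
    models.append(fit_line([m[0] for m in members], [m[1] for m in members]))

def hybrid_predict(x):
    a_c, b_c = models[nearest(x, centers)]
    return a_c + b_c * x

hybrid_mae = mae([hybrid_predict(x) for x in test_x], test_y)
print(f"single model MAE: {global_mae:.3f}")
print(f"cluster + model MAE: {hybrid_mae:.3f}")
```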
How does the greedy search algorithm to populate DataRobot’s leaderboard work?
In a nutshell: we first identify the set of all the machine learning pipelines (“blueprints”) that can be applied to the problem at hand, then use a combination of heuristics (to ensure algorithm diversity) and recommendation (to identify those blueprints that are likely to be performant) to identify the initial algorithms. Multiple rounds of model training ensue, starting with a large spectrum of blueprints that are trained on small amounts of data, gradually reducing the number of blueprints trained (filtering on out-of-sample performance), while increasing the size of the training data, finally cross-validating the best-performing algorithms and trying out some ensembles to see whether this will further improve the performance.
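The "many candidates on little data, fewer candidates on more data" part of that description resembles what is often called successive halving. Below is a generic sketch of that pattern only. It is not DataRobot's actual implementation, it leaves out the initial heuristics, the recommendation step, cross-validation and ensembling, and the candidate names and the `evaluate` function (a made-up learning curve) are hypothetical.

```python
def staged_search(candidates, evaluate, sample_sizes, keep_fraction=0.5):
    """Train many candidates on little data, keep the best, repeat on more.

    candidates   : list of candidate identifiers
    evaluate     : callable (candidate, n_rows) -> out-of-sample error
    sample_sizes : increasing training sizes, one per round
    """
    survivors = list(candidates)
    for n_rows in sample_sizes:
        ranked = sorted(survivors, key=lambda c: evaluate(c, n_rows))
        n_keep = max(1, int(len(ranked) * keep_fraction))
        survivors = ranked[:n_keep]
    return survivors

# Hypothetical learning curves: error = floor + k / sqrt(n_rows).
LEARNING_CURVES = {
    "linear": (0.30, 0.5),
    "tree": (0.28, 1.5),
    "boosted": (0.20, 1.8),
    "knn": (0.40, 1.0),
}

def evaluate(candidate, n_rows):
    floor, k = LEARNING_CURVES[candidate]
    return floor + k / n_rows ** 0.5

winners = staged_search(list(LEARNING_CURVES), evaluate, [100, 400, 1600])
print(winners)
```

One design caveat worth knowing: a candidate that only shines on lots of data can be filtered out in an early round, which is one reason to seed the search with a diverse set of candidates.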
Please elaborate on the different types of feature extraction that DataRobot does.
DataRobot does four main kinds of feature extraction and selection automatically:
- Transforming features to match a particular machine learning algorithm or make it more performant (automated feature engineering), including dimensionality reduction using techniques such as principal-component analysis or singular value decomposition
- Evaluating differences, ratios and other transformations and combinations in datasets where observations are independent (automated feature discovery)
- Constructing rolling transformations and evaluating different lags in time series problems where autoregressiveness is present (time series feature engineering)
- Automatically generating a reduced feature list on a modeling project’s best model and retraining it (automated feature selection)
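To give a flavor of the lag and rolling transformations mentioned in the time series bullet above, here is a hand-rolled sketch. The function names, the toy price series and the chosen lags and window are assumptions for illustration; this is not DataRobot's feature engineering code.

```python
def lag(series, k):
    """Value k steps earlier; None where no earlier value exists."""
    return ([None] * k + list(series))[:len(series)]

def rolling_mean(series, window):
    """Mean of the `window` values ending at each position (inclusive)."""
    out = []
    for i in range(len(series)):
        if i + 1 < window:
            out.append(None)
        else:
            out.append(sum(series[i + 1 - window:i + 1]) / window)
    return out

prices = [10.0, 11.0, 12.0, 11.0, 13.0, 14.0]

lag_1 = lag(prices, 1)
lag_2 = lag(prices, 2)
# Lag the rolling mean by one step so the feature for time t only uses
# data up to t-1 (no look-ahead).
roll_mean_3 = lag(rolling_mean(prices, 3), 1)

rows = [
    {
        "target": prices[t],
        "lag_1": lag_1[t],
        "lag_2": lag_2[t],
        "diff_1": lag_1[t] - lag_2[t],
        "roll_mean_3": roll_mean_3[t],
    }
    for t in range(len(prices))
    if roll_mean_3[t] is not None
]
for row in rows:
    print(row)
```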
Additionally, users have the flexibility to build a wide range of feature transformations using the DataRobot Paxata data preparation platform before pushing the data to DataRobot MLdev. The MLdev API also integrates seamlessly with Python and R’s powerful data preparation capabilities, as well as providing connectivity to other databases such as KDB.
What are the advantages of an enterprise solution like DataRobot compared to open platforms like scikit-learn or TensorFlow?
Cutting-edge data science and machine learning are simply unthinkable without open-source packages such as TensorFlow; this is where the innovation lies these days. That said, DataRobot is not built by the crowd. We have some 350 incredibly talented engineers and data scientists on our team, whose job it is to engineer our platform to enterprise grade and work with our customers to ensure that it meets their needs. This includes a number of contributors to popular open-source libraries such as numpy, pandas, scikit-learn, keras, caret, pic2vec, urllib3 and many others.
So we take the best of what’s out there in the open-source data science community and ensure that it’s suitable for enterprise use — contributing to the open source community itself when needed to make this happen. For example, recently members of our team were building some modeling pipelines, including elements from an open-source machine learning library which we had not previously supported. Their testing revealed some critical bugs under the hood and development efforts were then refocused towards fixing the actual open-source library and pushing those changes out to the community.
With a “best of both worlds” solution such as DataRobot, there’s still someone at the end of a phone to shout at if there’s an issue. And you don’t have to worry about making sure that all the parts of the open source stack are up-to-date either.
Does the DataRobot engine run on my desktop computer? How is performance managed, CPU vs GPU selection, etc?
DataRobot is a powerful platform whose requirements exceed the capabilities of a single desktop computer. There are various ways of running the DataRobot platform:
- On DataRobot’s managed AI cloud
- Via the FactSet Workstation, with the backend running on FactSet’s AWS cloud
- Inside your enterprise’s firewall, on a Virtual Private Cloud such as Microsoft Azure, Amazon Web Services or Google Cloud
- Inside your enterprise’s firewall, on a data lake/cluster running Hadoop; and
- Inside your enterprise’s firewall, on a bare-metal Linux cluster
Performance is managed dynamically by the DataRobot app engine, with the user being able to choose how much compute to apply to a modeling project by selecting the number of workers (each worker being able to train one machine learning model at one time). DataRobot runs entirely on CPUs; no expensive GPUs are needed.
Would you say that DataRobot’s learning curve is digestible for a portfolio manager or analyst, or is it targeted at in-house data analysts and quants who would live in the app?
I’d say that a certain amount of data literacy is important — I wouldn’t expect great results from giving DataRobot to an “old school” portfolio manager who struggles with Excel, for instance. We have two target user groups: first, people who understand the data well but aren’t quants or machine learning experts and want to be able to harness the power of machine learning without needing to get technical or learn how to code. We greatly automate the process with smart default settings and a variety of guardrails for this “democratization” audience. Through the process of using DataRobot and its built-in explainability and documentation, such users learn a lot about machine learning and how to frame machine learning problems, often quickly moving on to complex problem sets.
Our other target group is, of course, sophisticated quants and data scientists, who use DataRobot’s automation as a force multiplier for their productivity, by automating the boring, repetitive stuff where they don’t necessarily have an edge.
Is there a course designed around DataRobot to give us hands-on experience?
A wide variety of instructor-led and self-paced training programmes for different skill levels are available at https://university.datarobot.com/, with further resources and self-paced learning at DataRobot Community’s learning center: https://community.datarobot.com/t5/learning-center/ct-p/Learning
There’s also the DataRobot free trial, details at:
In your demo, you built 72 different models to solve this binary classification problem. Some users may not have the expertise in machine learning to make the choice between models, and blindly using a model can be dangerous. What do you do to avoid giving a machine gun to a 3-year-old?
Great question. It’s a combination of several things that work together.
First, make sure that the machine gun has “safety mechanisms” such as built-in best practices and guardrails. For example, rank models strictly on their out-of-sample performance, never expose any in-sample performance data, and combine the appropriate data engineering with each algorithm in the machine learning pipelines.
Second, train the users in “gun safety.” This doesn’t have to take that long (for instance, our Citizen Data Scientist Starter Quest takes an afternoon and is self-paced, and our AutoML I course consists of three four-hour sessions), but it provides valuable context in how to frame machine learning problems and evaluate the models.
Third, make sure that the gun’s “scope” shows the users what they’re pointing at: provide users with sophisticated, standardized analytics that allow them to evaluate each model’s performance in-depth and understand the model’s drivers and how the model will respond in different scenarios.
And finally, support the users with experienced data scientists, a wealth of self-service content, and a growing online user community. (Sorry, ran out of gun metaphors.)
What, over 2,000 words and you still haven’t answered my question?
Hot damn. Come on over to community.datarobot.com and we’ll do our best to answer it there.