Using Machine Learning to Predict Out-Of-Sample Performance of Trading Algorithms
A guest blog by Thomas Wiecki, Lead Data Scientist, Quantopian
Earlier this year, we used DataRobot, a machine learning platform, to test a large number of preprocessing, imputation, and classifier combinations to predict out-of-sample performance. In this blog post, I’ll first walk through the results, which come from a unique data set assembled from strategies run on Quantopian. These results made it clear that while the Sharpe ratio of a backtest is a very weak predictor of a trading strategy’s future performance, we could instead use DataRobot to train a model on a variety of features and predict out-of-sample performance with much higher accuracy.
What is Backtesting?
Backtesting is ubiquitous in algorithmic trading. Quants run backtests to assess the merit of a strategy, academics publish papers showing phenomenal backtest results, and asset allocators at hedge funds take backtests into account when deciding where to deploy capital and who to hire. This makes sense if (and only if) a backtest is predictive of future performance. But is it?
This question is of critical importance to Quantopian. As we announced at QuantCon 2016, we have started to deploy some of our own capital to a select number of trading strategies. We found these algorithms by analyzing thousands of candidates using pyfolio, our open-source financial analysis library. As a data scientist, it is my job to come up with better (and more automatic) methods to identify promising trading algorithms.
What makes this possible (and a whole lot of fun) is that we have access to a unique dataset of millions of backtests being run on the Quantopian platform. While our terms-of-use prohibit us from looking at code, we do know when code was last edited. Knowing this, we can rerun the strategy up until yesterday and evaluate how it continued to perform after the last code edit, i.e. how it performed on data the quant did not have access to at the time of writing the code. Using this approach, we assembled a data set of 888 strategies, selected according to various criteria, each with at least 6 months of out-of-sample data.
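To make the setup concrete, here is a minimal sketch of splitting a strategy's daily returns at the date of the last code edit. This is not Quantopian's actual pipeline; the function name `split_is_oos` and the synthetic returns are purely illustrative.

```python
import numpy as np
import pandas as pd

def split_is_oos(daily_returns: pd.Series, last_edit: pd.Timestamp):
    """Split a strategy's daily returns at the date of the last code edit.

    Everything up to and including the edit date is in-sample (the quant
    could have seen it); everything after is true out-of-sample.
    """
    is_returns = daily_returns.loc[:last_edit]
    oos_returns = daily_returns.loc[last_edit + pd.Timedelta(days=1):]
    return is_returns, oos_returns

# Toy example with synthetic daily returns on business days
idx = pd.bdate_range("2015-01-01", "2016-06-30")
rng = np.random.default_rng(0)
rets = pd.Series(rng.normal(0.0005, 0.01, len(idx)), index=idx)
is_r, oos_r = split_is_oos(rets, pd.Timestamp("2015-12-31"))
```

In the real setting, the out-of-sample leg is produced by rerunning the unmodified strategy forward on market data that arrived after the edit.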
The first thing I looked at was how predictive the backtest or in-sample (IS) Sharpe ratio was of the out-of-sample (OOS) Sharpe ratio:
With a disappointing R²=0.02, this shows that we can’t learn much about how well a strategy will continue to perform from a backtest. This is a real problem and might cause us to throw our hands in the air and give up. Instead, we wondered if perhaps all the other information we have about an algorithm (like other performance metrics, or position and transactions statistics) would allow us to do a better job.
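For readers who want to reproduce this kind of check on their own data, here is a small sketch of the two quantities involved: an annualized Sharpe ratio computed from daily returns, and the R² of a simple linear fit (the squared Pearson correlation). The function names are my own; this is not the code behind the figure above.

```python
import numpy as np

def sharpe_ratio(daily_returns, periods_per_year=252):
    """Annualized Sharpe ratio of daily returns (risk-free rate assumed 0)."""
    r = np.asarray(daily_returns, dtype=float)
    return np.sqrt(periods_per_year) * r.mean() / r.std(ddof=1)

def r_squared(x, y):
    """R² of a simple linear fit of y on x: the squared Pearson correlation."""
    return np.corrcoef(x, y)[0, 1] ** 2
```

Given arrays of in-sample and out-of-sample Sharpe ratios, one per strategy, `r_squared(is_sharpes, oos_sharpes)` yields the kind of statistic quoted above.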
Using Machine Learning to Predict Out-Of-Sample Performance
Our team created 58 features based on backtest results, including:
- Tail-ratio (the 95th percentile divided by the 5th percentile of the daily returns distribution)
- Average monthly turnover
- How many backtests a user ran (which is indicative of overfitting)
- The minimum, maximum and standard deviation of the rolling Sharpe ratio
- …and many more (for a full list, see our SSRN manuscript).
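Two of the features above can be sketched in a few lines. Note the hedge on the tail-ratio: since the 5th percentile of daily returns is typically negative, the implementation below takes absolute values so the ratio is positive; the rolling window of 63 trading days (one quarter) is likewise my own illustrative choice.

```python
import numpy as np
import pandas as pd

def tail_ratio(returns):
    """95th percentile over (absolute) 5th percentile of daily returns."""
    return abs(np.percentile(returns, 95)) / abs(np.percentile(returns, 5))

def rolling_sharpe_stats(returns: pd.Series, window=63):
    """Min, max, and std of the rolling annualized Sharpe ratio."""
    rs = (returns.rolling(window).mean()
          / returns.rolling(window).std()) * np.sqrt(252)
    rs = rs.dropna()
    return {"rolling_sharpe_min": rs.min(),
            "rolling_sharpe_max": rs.max(),
            "rolling_sharpe_std": rs.std()}

# Toy example on synthetic daily returns
rng = np.random.default_rng(1)
rets = pd.Series(rng.normal(0.0005, 0.01, 500),
                 index=pd.bdate_range("2020-01-01", periods=500))
feats = rolling_sharpe_stats(rets)
```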
We then tried various machine learning techniques to predict the out-of-sample Sharpe ratio from these features. This is where DataRobot came in really handy. We uploaded our data set, and from there a huge number of combinations of preprocessing, feature extraction, and regressors were tried in parallel, each evaluated with cross-validation. Most of this could be done using scikit-learn, but DataRobot was a huge time saver and produced better results than my own humble attempts.
Using an ExtraTrees regressor (similar to a Random Forest), DataRobot achieved an R² of 0.17 on the 20% hold-out set not used for training. The lift-chart below shows that the regressor does a pretty good job at predicting OOS Sharpe ratio.
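The same model family is available in scikit-learn, so here is a rough sketch of the setup: a 20% hold-out split and an `ExtraTreesRegressor`. The data below is synthetic (we obviously cannot share the real 888-strategy feature matrix), so the hold-out R² it produces is not meaningful; only the shape of the workflow is.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Stand-in data: one row per strategy, 58 backtest-derived features,
# target = OOS Sharpe ratio with a weak nonlinear signal plus noise.
rng = np.random.default_rng(42)
X = rng.normal(size=(888, 58))
y = 0.3 * X[:, 0] + np.sin(X[:, 1]) + rng.normal(scale=1.0, size=888)

# 20% hold-out set, never shown to the model during training
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
model = ExtraTreesRegressor(n_estimators=200, random_state=0)
model.fit(X_tr, y_tr)
holdout_r2 = r2_score(y_te, model.predict(X_te))

# Feature importances indicate which inputs the ensemble relied on
top_features = np.argsort(model.feature_importances_)[::-1][:5]
```

The `feature_importances_` attribute is what drives feature-importance plots like the one discussed next.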
While it is generally difficult to intuit what the machine learning regressor has learned, we can look at which features it identified as most predictive. Of note, an important feature does not have to be predictive by itself or in relation to an OOS Sharpe ratio in a linear way. These methods are so powerful because they can learn extremely complex, nonlinear interactions of multiple features that are predictive in the aggregate.
The most informative features are shown below.
It is interesting to point out that tail-ratio and kurtosis are both measures that assess the tails of the returns distribution. Moreover, the number of backtests a user ran (“user_backtest_days”) is also very informative of how a strategy will perform going forward.
While we showed that we can do a much better job of predicting the OOS Sharpe ratio than the backtest Sharpe ratio alone, it is not yet clear whether this makes a difference in practice. After all, no one earns money from a good R² value. We therefore asked whether a portfolio constructed from the 10 strategies ranked highest by our machine learning regressor would do well. We compared this to selecting the top 10 strategies ranked by their in-sample Sharpe ratio, as well as to many simulations of choosing 10 strategies at random.
The below image shows our results when applying this portfolio construction method on the hold-out set.
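The selection procedure behind that comparison can be sketched as follows. The helper names (`top_k_portfolio`, `random_portfolio`) and the equal-weighting scheme are my own simplifying assumptions; `scores` can be either the regressor's predictions or the in-sample Sharpe ratios, which is exactly what makes the two rankings comparable.

```python
import numpy as np
import pandas as pd

def top_k_portfolio(oos_returns: pd.DataFrame, scores: pd.Series, k=10):
    """Equal-weight the k strategies with the highest scores.

    oos_returns: one column of daily OOS returns per strategy.
    scores: one score per strategy (ML prediction or IS Sharpe ratio).
    """
    picks = scores.nlargest(k).index
    return oos_returns[picks].mean(axis=1)

def random_portfolio(oos_returns: pd.DataFrame, k=10, seed=0):
    """Baseline: equal-weight k strategies chosen at random."""
    rng = np.random.default_rng(seed)
    picks = rng.choice(oos_returns.columns, size=k, replace=False)
    return oos_returns[picks].mean(axis=1)
```

Running `random_portfolio` many times with different seeds gives the distribution of randomly selected portfolios against which the ML-ranked and Sharpe-ranked portfolios are judged.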
We found it interesting that even at highly quantitative hedge funds, asset allocation is still done in a largely discretionary way. I believe a long-term goal for this research would be to further automate this aspect using machine learning, as depicted in the figure below.
It would also be useful to make more of the OOS data itself. Deployment decisions are usually not made on backtest performance alone; at Quantopian, we require at least 6 months of OOS data. This cutoff is rather arbitrary, however: there might be strategies where we gain confidence more quickly, and others where we would need to wait longer.
For more on how we compared backtest and out-of-sample performance across this cohort of algorithms, and how the best results were achieved with DataRobot, you can view the full report of our test here. You can also view my presentation on this subject from QuantCon here.
About the Author:
Thomas Wiecki is the lead data scientist at Quantopian, where he uses probabilistic programming and machine learning to help build the world’s first crowdsourced hedge fund. Among other open source projects, he is involved in the development of PyMC, a probabilistic programming framework written in Python. Thomas holds a PhD from Brown University. Follow him at @twiecki and https://twiecki.github.io/