Using Small Datasets to Build Models
The world is going through extremely turbulent times. With the ongoing disruption of our lives, communities, and businesses from the COVID-19 pandemic, predictions from existing machine learning models trained prior to the pandemic become less reliable. There is plenty of historical data, but historical examples from before the pandemic may not provide the relevant examples needed to train a model that is useful today. We are left with only training data gathered in the most recent weeks, data which may, in turn, prove to be irrelevant even a few weeks from now as the situation continues to evolve.
The patterns our existing models have uncovered have changed and are likely to continue to change. Many data science professionals need a plan to update their models, with a focus on small datasets and datasets that have limited breadth of data available. This requires bringing a thorough understanding of the limited data and involving their subject matter experts in the analysis. In this post, we cover some of the tools and techniques data scientists can use to extract signals from small datasets.
There are inherent limitations when fitting machine learning models to smaller datasets. As the training datasets get smaller, the models have fewer examples to learn from, increasing the risk of overfitting. An overfit model is a model that is too specific to the training data and will not generalize well to new examples. In the example above, the model on the left doesn’t really capture a meaningful signal in the training data; it hasn’t learned much at all. The model on the right is too specific to the training data; it has learned every training point really well. But if we were to generate a new prediction with the model on the right, there is a good chance it would be far off. The model in the middle strikes a balance between being too general to be useful and being too specific, or overfit.
Finding this balance is trickier with smaller datasets, because the model has fewer examples to learn from, and because it is more difficult to check if it is overfit. A reliable way to check for overfitting is to score your model on data the model has never seen before, often referred to as a validation, or holdout, set. An overfit model is likely to have a greater error on unseen data than on the data it was trained on. With small datasets, this check costs you twice: setting validation data aside leaves your model with even fewer examples to learn from, and the holdout set itself is too small to verify with much confidence that the model isn’t overfit.
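To make the holdout check concrete, here is a minimal sketch using NumPy. The data is synthetic, and the polynomial degrees (1, 3, and 12) are our own illustrative stand-ins for a too-general, a balanced, and a too-specific model; none of this comes from a real project.

```python
import numpy as np

def holdout_errors(degree, x_train, y_train, x_hold, y_hold):
    """Fit a polynomial of the given degree; return (train MSE, holdout MSE)."""
    coefs = np.polyfit(x_train, y_train, degree)
    train_mse = float(np.mean((np.polyval(coefs, x_train) - y_train) ** 2))
    hold_mse = float(np.mean((np.polyval(coefs, x_hold) - y_hold) ** 2))
    return train_mse, hold_mse

# Synthetic small dataset: a smooth signal plus noise (an assumption for illustration).
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 30)
y = np.sin(3 * x) + rng.normal(0, 0.3, 30)

# Set aside 10 of the 30 rows as a holdout the model never sees during fitting.
idx = rng.permutation(30)
train, hold = idx[:20], idx[20:]

results = {
    degree: holdout_errors(degree, x[train], y[train], x[hold], y[hold])
    for degree in (1, 3, 12)
}
for degree, (train_mse, hold_mse) in results.items():
    print(f"degree {degree:2d}: train MSE {train_mse:.3f}, holdout MSE {hold_mse:.3f}")
```

Training error can only go down as the polynomial gets more flexible, so it is the gap between training and holdout error that signals overfitting. Note that with only 10 holdout rows the holdout error is itself noisy, which is exactly the difficulty described above.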
Take the example of a clinical trial. We’re looking to predict a response to a new treatment, and have quite a few predictors to work with. We have plenty of predictive features from each participant’s medical record, but the training sample is limited by the number of patients who participated in the clinical trial. It becomes challenging to tease out which predictors are actually associated with the response to the treatment, and which predictors are simply associated with the outcome due to chance. In a large study, we could set aside, or hold out, enough patients to test our model: we know what the true outcome was from the results of the trial, and we can check if our model consistently arrives at the same correct conclusion having never seen the holdout patient records during training. However, with a small study, each patient record removed as a holdout both eliminates a valuable datapoint with which to train the model and provides only limited verification that the model has fit to the actual underlying physiology.
To handle these challenges, we offer a few suggestions here and many more in the full video.
Types of Models for Smaller Datasets: Developing Intuition
One technique to get around these problems is to use simpler models, since they can be less prone to overfitting. With DataRobot, you automatically get a ranking of what models work best based on out-of-sample scores, and you will often see simpler model types making it to the top of the leaderboard when modeling with small datasets. Models like the elastic net classifier, support vector machines, or Eureqa models often do well. Regularized linear models, for example, tend to work well because they are relatively simple and because the regularization penalty does a good job of feature selection, or feature reduction. Forcing the model to use fewer features reduces the risk that the model fits to noise or spurious correlations.
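As a rough illustration of how regularization reduces features, the sketch below fits scikit-learn’s ElasticNet regressor next to an unregularized linear regression. We use a regression target rather than a classifier to keep the example short. The dataset is synthetic (40 rows, 30 columns, of which only the first 3 carry signal), and the alpha and l1_ratio values are arbitrary choices for illustration, not tuned recommendations.

```python
import numpy as np
from sklearn.linear_model import ElasticNet, LinearRegression

# Synthetic small, wide-ish dataset: only the first 3 columns drive the target.
rng = np.random.default_rng(0)
n_rows, n_cols = 40, 30
X = rng.normal(size=(n_rows, n_cols))
true_coef = np.zeros(n_cols)
true_coef[:3] = [3.0, -2.0, 1.5]
y = X @ true_coef + rng.normal(scale=0.5, size=n_rows)

# Unregularized fit: every column receives some weight, signal or not.
ols = LinearRegression().fit(X, y)

# Elastic net: the L1 part of the penalty pushes weak coefficients to exactly zero.
enet = ElasticNet(alpha=0.2, l1_ratio=0.9).fit(X, y)

n_ols = int(np.sum(np.abs(ols.coef_) > 1e-8))
n_enet = int(np.sum(enet.coef_ != 0))
print(f"features used by plain linear regression: {n_ols} of {n_cols}")
print(f"features used by elastic net:             {n_enet} of {n_cols}")
```

In practice the penalty strength would be chosen by cross-validation rather than fixed by hand.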
Cross-validation is a technique that increases the amount of out-of-sample validation data available, which makes it useful for modeling with small datasets. When tuning model hyperparameters, our recommendation is to use a more sophisticated version of cross-validation, called nested cross-validation. Nested cross-validation uses a series of train, validation, and test splits within each fold. We can think of it as having an inner loop and an outer loop. In the inner loop, the score is approximately maximized by fitting the model to each training set, then directly maximized by selecting the hyperparameters over the validation set. In the outer loop, the out-of-sample error is estimated by averaging test scores over the different cross-validation folds. Because the test folds play no part in choosing the hyperparameters, this estimate is less optimistic than one taken from the same folds used for tuning. Studies have shown that this gives more reliable results when building models on small datasets, which is why it’s been built into DataRobot.
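Outside of DataRobot, the inner and outer loops can be sketched with scikit-learn by wrapping a GridSearchCV (the inner loop) inside cross_val_score (the outer loop). This is a generic illustration, not a description of DataRobot’s internal implementation; the synthetic dataset, the logistic regression model, the grid of C values, and the fold counts are all assumptions made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score

# Synthetic small classification dataset (an assumption for illustration).
X, y = make_classification(
    n_samples=120, n_features=20, n_informative=4, random_state=0
)

inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=1)
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=2)

# Inner loop: for each candidate C, fit on the inner training folds and
# score on the inner validation folds; keep the best-scoring C.
search = GridSearchCV(
    estimator=LogisticRegression(max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1.0, 10.0]},
    cv=inner_cv,
    scoring="roc_auc",
)

# Outer loop: rerun the whole search on each outer training split, then
# score the tuned model on the outer test fold it has never seen.
outer_scores = cross_val_score(search, X, y, cv=outer_cv, scoring="roc_auc")

print("AUC per outer fold:", outer_scores.round(3))
print("mean AUC:", round(float(outer_scores.mean()), 3))
```

The spread of the per-fold scores is worth reading alongside the mean: on a small dataset, a wide spread is a hint that the estimate itself is uncertain.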
For wider datasets, feature selection becomes more important. While techniques like regularization help reduce features, if there are far more columns than rows, the problem of overfitting can still persist. As an additional measure, we suggest a strategy where you build multiple projects based on different reshuffled partitions of the data by rerunning autopilot in DataRobot with different random seeds. The random seed is the number used to randomly shuffle your training data into different cross-validation folds.
By refitting models on reshuffled cross-validation and holdout folds, you can better understand which features hold up consistently across the dataset, and your conclusions are less likely to hinge on a few outliers that happen to land in a particular fold. The amount of variation in results across projects built on reshuffled cross-validation folds provides insight into the robustness of the models on the dataset. If the Feature Impact, Feature Effects, or Prediction Explanations look completely different depending on how you shuffle the training data, it’s a sign your model is likely to be unreliable. If this happens, consider aggregating insights such as the Feature Impact scores across multiple projects built with different random seeds, and use the aggregate to iteratively remove noisy features.
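The same idea can be approximated outside of DataRobot. The sketch below does not use the DataRobot API or its Feature Impact scores; as a stand-in, it refits a scikit-learn ElasticNet on folds reshuffled with different random seeds and records how often each feature keeps a nonzero coefficient. The helper name selection_frequency, the synthetic data, and the penalty settings are all hypothetical choices for illustration.

```python
import numpy as np
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import KFold

def selection_frequency(X, y, seeds, n_splits=5, alpha=0.2, l1_ratio=0.9):
    """Fraction of (seed, fold) fits in which each feature has a nonzero coefficient."""
    counts = np.zeros(X.shape[1])
    n_fits = 0
    for seed in seeds:
        # A different seed shuffles the rows into different folds.
        folds = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
        for train_idx, _ in folds.split(X):
            model = ElasticNet(alpha=alpha, l1_ratio=l1_ratio)
            model.fit(X[train_idx], y[train_idx])
            counts += model.coef_ != 0
            n_fits += 1
    return counts / n_fits

# Synthetic data: 40 rows, 30 columns, only the first 3 columns carry signal.
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 30))
y = X[:, :3] @ np.array([3.0, -2.0, 1.5]) + rng.normal(scale=0.5, size=40)

freq = selection_frequency(X, y, seeds=range(10))
stable = np.flatnonzero(freq >= 0.9)    # kept in at least 90% of fits
unstable = np.flatnonzero((freq > 0) & (freq < 0.5))
print("consistently selected features:", stable)
print("features selected only under some shuffles:", unstable)
```

Features that appear only under some shuffles are candidates for removal before the next round of modeling. The 90% and 50% cutoffs are arbitrary, and a subject matter expert may have good reasons to keep or drop a feature regardless of where it falls.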
To think about how this works in practice, let’s return to the example of the clinical trial. We can’t increase the number of patients in the study, but, to provide more confidence that our model is reliable, we can repeatedly shuffle our training and validation folds. By repeatedly putting different combinations of patient records into training and validation folds, we can check whether both the predictions of the outcome and the features used by the model to generate the predictions are consistent. If the same predictors can be reliably used no matter how we shuffle the patient data, we can be more confident our model has found a real pattern rather than a correlation due to chance.
It’s easy to overlook the difficulty of working with small datasets. However, there are many different tools and techniques you can use to help improve the accuracy of your models, as well as gain insights on your smaller datasets. Some of the examples discussed here include model selection, cross-validation, and running repeated trials. As always, involving subject matter experts in the development and validation of models is important.