How to Tackle Imbalanced Data with DataRobot
This post was originally part of the DataRobot Community.
Even with imbalanced data, DataRobot can build a good model. In this article, we explain what imbalanced data is and how best to handle it with DataRobot.
Imbalanced data occurs when one value is overrepresented in the target variable. In binary classification projects, an imbalanced dataset has an overrepresentation of the majority class relative to the minority class; for example, the target variable graph in Figure 1 shows the imbalance in is_bad. In regression projects, an imbalanced dataset typically has an overrepresentation of zeros in the target variable, as shown in the target variable graph for IncurredClaims (Figure 2). When the minority class is significantly more important than the majority class, as it often is in these cases, you need to take the imbalance into account when building models.
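As a quick illustration outside DataRobot, a few lines of Python quantify the imbalance in a hypothetical binary target resembling the is_bad column:

```python
from collections import Counter

# Hypothetical binary target: 1 marks the minority ("bad") class,
# 0 the majority class. A 90:10 split is a moderate imbalance.
is_bad = [0] * 90 + [1] * 10

counts = Counter(is_bad)
majority = max(counts.values())
minority = min(counts.values())

imbalance_ratio = majority / minority    # nine majority rows per minority row
minority_share = minority / len(is_bad)  # the minority class is 10% of the data
```

Checking these two numbers before modeling tells you whether the techniques below are worth applying.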
DataRobot automatically handles imbalanced data during its model building process by picking the best model optimization metric given the distribution of the values in the target variable.
For imbalanced binary classification datasets, DataRobot uses the LogLoss metric, which is known to be robust to imbalanced data (Figure 3). For zero-inflated regression datasets, it uses Tweedie Deviance (Figure 4).
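To make these two metrics concrete, here is a minimal pure-Python sketch of both, using their standard formulas; the sample values in the test are made up, and DataRobot computes these metrics internally for you:

```python
import math

def log_loss(y_true, y_prob, eps=1e-15):
    """Mean binary log loss; probabilities are clipped away from 0 and 1."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

def tweedie_deviance(y_true, y_pred, power=1.5):
    """Mean Tweedie deviance for 1 < power < 2 (the compound Poisson-gamma
    range suited to zero-inflated targets). y_pred values must be positive."""
    p = power
    total = 0.0
    for y, mu in zip(y_true, y_pred):
        total += 2 * (y ** (2 - p) / ((1 - p) * (2 - p))
                      - y * mu ** (1 - p) / (1 - p)
                      + mu ** (2 - p) / (2 - p))
    return total / len(y_true)
```

Both metrics score the full predicted distribution rather than hard class labels, which is what makes them less sensitive to imbalance than accuracy.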
Additionally, DataRobot includes a variety of modeling algorithms that are robust to imbalanced data, such as tree-based models (Figure 5) and frequency-severity models (Figure 6).
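The frequency-severity idea can be sketched in a few lines: split a zero-inflated target into an occurrence probability and a conditional cost, then multiply. The claims values below are hypothetical; in practice the frequency part is modeled by a classifier and the severity part by a regressor fit only on nonzero rows:

```python
# A zero-inflated claims target, like IncurredClaims in Figure 2.
claims = [0, 0, 0, 0, 0, 0, 100, 250, 0, 150]

nonzero = [c for c in claims if c > 0]
frequency = len(nonzero) / len(claims)  # P(claim occurs): a classifier's job
severity = sum(nonzero) / len(nonzero)  # E[cost | claim occurred]: a regressor's job
expected_cost = frequency * severity    # combined prediction per policy
```

On this toy data the decomposed estimate equals the plain mean of the target, but the two-model structure lets each component handle its part of the imbalance.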
You can also use the Smart Downsampling technique (from the Advanced Options tab) to intentionally downsample the majority class in order to build faster models with similar accuracy (Figure 7). This works for both classification and zero-inflated regression problems.
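Outside DataRobot, majority-class downsampling plus reweighting can be sketched as follows; the 9:1 dataset is hypothetical, and Smart Downsampling handles the equivalent reweighting for you automatically:

```python
import random

random.seed(0)  # for a reproducible sample

# Hypothetical 9:1 binary target: 900 majority rows, 100 minority rows.
rows = [0] * 900 + [1] * 100

majority = [y for y in rows if y == 0]
minority = [y for y in rows if y == 1]

# Keep every minority row; sample the majority down to the same size.
kept_majority = random.sample(majority, len(minority))

# Reweight the surviving majority rows so training still "sees" the
# original class balance despite the smaller dataset.
majority_weight = len(majority) / len(kept_majority)

sample = ([(y, majority_weight) for y in kept_majority]
          + [(y, 1.0) for y in minority])
```

The model trains on 200 weighted rows instead of 1,000, which is where the speedup comes from, while the weights preserve the original class proportions.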
Finally, you can leverage the information in the ROC curve to minimize the effect of imbalanced data on model performance. For instance, depending on how much tolerance your business has for false positives versus false negatives, you can use the Prediction Distribution graph (Figure 8) to adjust your classifier's prediction threshold. This can improve predictions for the minority class.
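This is the kind of search the Prediction Distribution graph lets you do visually. As a sketch with made-up scores, using F1 as the selection criterion (one common choice when the default 0.5 threshold underserves the minority class):

```python
# Hypothetical labels and predicted probabilities for a 10-row validation set.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_prob = [0.05, 0.10, 0.15, 0.20, 0.25, 0.30, 0.35, 0.60, 0.40, 0.70]

def f1_at(threshold):
    """F1 score when predicting positive for scores >= threshold."""
    tp = sum(p >= threshold and y == 1 for y, p in zip(y_true, y_prob))
    fp = sum(p >= threshold and y == 0 for y, p in zip(y_true, y_prob))
    fn = sum(p < threshold and y == 1 for y, p in zip(y_true, y_prob))
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

# Try every observed score as a candidate threshold and keep the best one.
best_threshold = max(sorted(set(y_prob)), key=f1_at)
```

Here lowering the threshold below the default 0.5 recovers a minority-class positive that would otherwise be missed.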
You can also pay more attention to the Matthews Correlation Coefficient (Figure 9) for model evaluation, because it gives a better sense of the model's performance on imbalanced data than the other metrics on that tab. The Matthews Correlation Coefficient was specifically designed to evaluate models built from imbalanced datasets.
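The Matthews Correlation Coefficient is computed directly from the confusion matrix. This sketch, with a hypothetical 95:5 dataset, shows why it is more informative than accuracy under imbalance:

```python
import math

def matthews_corrcoef(y_true, y_pred):
    """MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN))."""
    tp = sum(y == 1 and p == 1 for y, p in zip(y_true, y_pred))
    tn = sum(y == 0 and p == 0 for y, p in zip(y_true, y_pred))
    fp = sum(y == 0 and p == 1 for y, p in zip(y_true, y_pred))
    fn = sum(y == 1 and p == 0 for y, p in zip(y_true, y_pred))
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0  # 0 by convention

# With 95:5 imbalance, always predicting the majority class looks accurate...
y_true = [0] * 95 + [1] * 5
all_zero = [0] * 100
accuracy = sum(y == p for y, p in zip(y_true, all_zero)) / len(y_true)
```

The degenerate all-majority classifier scores 95% accuracy but an MCC of 0, correctly exposing it as no better than a constant guess.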
Wikipedia: Matthews correlation coefficient
DataRobot documentation portal: Smart Downsampling