Modeling Options in DataRobot
This article covers the options you have when creating a project in DataRobot. There are several tabs within Advanced Options that allow you to customize modeling.
After you upload your data and select a target variable, DataRobot automatically chooses settings suited to your dataset. To adjust these settings, click the Show advanced options link at the bottom of the page.
When you select the Data tab, the Partitioning tab opens. Partitioning matters because it determines which rows of data belong to each training, validation, and holdout partition. For example, you wouldn’t want all of the rows with the positive target class to land in the same partition, because the other partitions would then have no examples of that class for training and validation.
DataRobot supports the following partitioning methods:
- Random
- Partition Feature
- Group
- Stratified
You can also use OTV (out-of-time validation) mode to do Date/Time partitioning. This allows you to train on earlier data and then test on later data.
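The idea of training on earlier data and testing on later data can be sketched in plain Python. This is only an illustration of the concept, not DataRobot's implementation; the function name `out_of_time_split`, the `sale_date` column, and the cutoff date are all hypothetical.

```python
from datetime import date

def out_of_time_split(rows, date_key, cutoff):
    """Split rows so training data is strictly earlier than the cutoff.

    Hypothetical helper for illustration; rows are plain dicts.
    """
    train = [r for r in rows if r[date_key] < cutoff]
    test = [r for r in rows if r[date_key] >= cutoff]
    return train, test

# Made-up example rows.
rows = [
    {"sale_date": date(2023, 1, 15), "target": 0},
    {"sale_date": date(2023, 6, 1), "target": 1},
    {"sale_date": date(2024, 2, 10), "target": 0},
    {"sale_date": date(2024, 5, 20), "target": 1},
]
train, test = out_of_time_split(rows, "sale_date", date(2024, 1, 1))
```

Every row in `train` predates every row in `test`, so the model is never evaluated on data older than what it learned from.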
With random partitioning, DataRobot randomly assigns observations (rows) to the training, validation, and holdout sets.
Partition Feature partitioning
With the partition feature option, you can choose a partition feature and DataRobot will create a distinct partition for each unique value of that feature. This is useful when you want DataRobot to respect some partitioning you made outside of DataRobot.
With group partitioning, you choose a group feature and DataRobot ensures that all of the observations with the same value are in the same partition. This sounds similar to Partition Feature partitioning, but the difference is that with group partitioning a single partition can contain multiple values. The same value will never appear in two partitions, though.
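A minimal sketch of the group constraint, assuming a simple round-robin assignment of group values to partitions. The helper `group_partition` and the `customer_id` column are hypothetical, and a real tool would typically randomize the assignment rather than use a fixed order.

```python
from collections import defaultdict

def group_partition(rows, group_key, n_partitions):
    """Return one partition index per row, keeping each group value together."""
    groups = sorted({r[group_key] for r in rows})
    # Round-robin over group values: several groups may share a partition,
    # but a single group value never spans two partitions.
    partition_of = {g: i % n_partitions for i, g in enumerate(groups)}
    return [partition_of[r[group_key]] for r in rows]

rows = [{"customer_id": c} for c in ["a", "b", "a", "c", "d", "b", "e"]]
assignments = group_partition(rows, "customer_id", n_partitions=2)

# Collect the set of partitions each group value ended up in.
partitions_per_group = defaultdict(set)
for row, p in zip(rows, assignments):
    partitions_per_group[row["customer_id"]].add(p)
```

Here five customers are spread over two partitions, so partitions hold more than one customer, yet each customer's rows sit in exactly one partition.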
With stratified partitioning, rows are assigned in a way that ensures similar target distribution across each partition. This is important if you have an imbalanced target feature.
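To illustrate what "similar target distribution" means, the sketch below deals the rows of each class out to partitions in turn. It is a simplified stand-in for stratified assignment, not DataRobot's algorithm; `stratified_partition` is a hypothetical name and the 20% positive rate is made up.

```python
from collections import defaultdict

def stratified_partition(targets, n_partitions):
    """Return a partition index per row, dealing each class out round-robin."""
    seen = defaultdict(int)  # how many rows of each class were assigned so far
    assignments = []
    for t in targets:
        assignments.append(seen[t] % n_partitions)
        seen[t] += 1
    return assignments

targets = [1] * 4 + [0] * 16  # 20% positive class
assignments = stratified_partition(targets, n_partitions=4)

# Share of positive rows inside each partition.
positive_rate = {}
for p in range(4):
    in_p = [t for t, a in zip(targets, assignments) if a == p]
    positive_rate[p] = sum(in_p) / len(in_p)
```

Each of the four partitions ends up with one positive and four negative rows, matching the 20% rate of the full dataset.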
Smart downsampling is used when you have a large and imbalanced dataset. For example, if you have a large dataset and the minority class only makes up 1% of your target, then it may make sense to reduce the overall sample size in a way that also reduces your class imbalance.
Simply toggle Downsample Data and adjust the slider to downsample the majority class.
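The sketch below shows one way downsampling the majority class can work, under the assumption that the kept majority rows are given a weight so the weighted class totals still match the original data. The function `smart_downsample`, the `weight` field, and the 10% keep fraction are hypothetical choices for illustration.

```python
import random

def smart_downsample(rows, target_key, majority_value, keep_fraction, seed=0):
    """Keep a random fraction of majority rows and re-weight them (assumed scheme)."""
    rng = random.Random(seed)
    majority = [r for r in rows if r[target_key] == majority_value]
    minority = [r for r in rows if r[target_key] != majority_value]
    n_keep = max(1, round(len(majority) * keep_fraction))
    kept = rng.sample(majority, n_keep)
    # Each kept majority row stands in for several original rows.
    weight = len(majority) / n_keep
    sampled = [dict(r, weight=weight) for r in kept]
    sampled += [dict(r, weight=1.0) for r in minority]
    return sampled

# 1% minority class, as in the example above.
rows = [{"target": 0} for _ in range(990)] + [{"target": 1} for _ in range(10)]
sampled = smart_downsample(rows, "target", majority_value=0, keep_fraction=0.1)
```

The dataset shrinks from 1,000 rows to 109, and the minority share rises from 1% to roughly 9%, while the weights preserve the original class totals.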
Feature Discovery is a supervised approach to reducing the training dataset to only informative features. It is enabled by default and removes fields that are unlikely to contribute to the model. This shortens processing time and makes models easier to interpret while keeping accuracy about the same. For example, if a feature has the same value in every row, there is no pattern to learn from, so DataRobot omits that feature from modeling. You can turn off Feature Discovery if you prefer.
Feature Constraints lets you introduce monotonicity into your modeling. You might do this if you know the direction of the relationship between a feature and the target. For example, a higher-valued home or car should always lead to higher insurance rates.
To use this, create a feature list that contains only numerical features. Features with a positive relationship to the target should be set as Monotonic Increasing, and those with an inverse relationship should be set as Monotonic Decreasing.
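To make the monotonic property concrete, here is a small check of whether a prediction function never moves in the wrong direction as a feature grows. The two pricing functions and the helper `is_monotonic` are invented for illustration and have nothing to do with how DataRobot enforces the constraint internally.

```python
def is_monotonic(predict, values, increasing=True):
    """Check that predictions never move against the expected direction."""
    preds = [predict(v) for v in sorted(values)]
    pairs = list(zip(preds, preds[1:]))
    if increasing:
        return all(a <= b for a, b in pairs)
    return all(a >= b for a, b in pairs)

def rate_from_home_value(home_value):
    # Hypothetical pricing rule: the rate always rises with home value.
    return 200 + 0.002 * home_value

def rate_with_dip(home_value):
    # Hypothetical unconstrained model with a dip at one value.
    return 200 + 0.002 * home_value - (300 if home_value == 300_000 else 0)

values = [100_000, 200_000, 300_000, 400_000]
constrained_ok = is_monotonic(rate_from_home_value, values)
unconstrained_ok = is_monotonic(rate_with_dip, values)
```

The first function satisfies a Monotonic Increasing constraint on home value; the second does not, because the predicted rate drops at 300,000 before rising again.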
Bias and Fairness
Bias and Fairness provides tools for evaluating, understanding, and mitigating bias in your models. Use this feature to test binary classification models for bias against specified protected classes. You identify the protected features and the value of the target that counts as the favorable outcome. For example, if the model is used by HR to predict the best candidates for a position, features you may want to protect include age, gender, religion, and race. For a target of Hired, the favorable outcome would be “Yes” (since it is clearly better to be hired). You also specify the primary fairness metric for the model. If you don’t know which metric to use, DataRobot walks you through a series of questions that guide you to the most appropriate metric for your use case.
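One simple way to look at bias is to compare the rate of favorable predictions across the values of a protected feature. The sketch below computes that rate per group and scales it against the most favored group. It is an illustrative calculation only; the function names, the `gender` and `hired` fields, and the records are hypothetical, and it is not presented as the metric DataRobot computes.

```python
from collections import defaultdict

def favorable_rates(records, protected_key, prediction_key, favorable):
    """Share of favorable predictions for each value of the protected feature."""
    totals = defaultdict(int)
    favorable_counts = defaultdict(int)
    for r in records:
        totals[r[protected_key]] += 1
        if r[prediction_key] == favorable:
            favorable_counts[r[protected_key]] += 1
    return {g: favorable_counts[g] / totals[g] for g in totals}

def relative_fairness(rates):
    """Scale each group's rate by the most favored group's rate."""
    best = max(rates.values())
    if best == 0:
        return {g: 0.0 for g in rates}
    return {g: rate / best for g, rate in rates.items()}

# Made-up predictions for a hiring model.
records = (
    [{"gender": "F", "hired": "Yes"}] * 3 + [{"gender": "F", "hired": "No"}] * 7
    + [{"gender": "M", "hired": "Yes"}] * 6 + [{"gender": "M", "hired": "No"}] * 4
)
rates = favorable_rates(records, "gender", "hired", favorable="Yes")
scores = relative_fairness(rates)
```

In this made-up data, one group receives the favorable outcome at half the rate of the other, which is the kind of gap a fairness metric is meant to surface.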
There are a number of customizations you can carry out under the Additional tab.
You can change the optimization metric used in your project to validate the models and tune the hyperparameters.
Automation Settings allows you to customize some of the processes that take place during Autopilot.
These include options around:
- Searching for interactions.
- Including only blueprints with Scoring Code.
- Automatically creating Blenders.
- Including only models with SHAP feature impact and prediction explanations rather than permutation importance and XEMP.
- Recommending and preparing the top model for Deployment (train on 100% of the data).
- Including Blenders in the recommendation.
- Using Accuracy Optimized blueprints, which can be much slower but are very accurate.
- Using the informative features-leakage removed feature list by default.
If you are using a dataset with more than 50,000 rows, DataRobot automatically decides how many models to run cross-validation on. You can override this here. If cross-validation was not run automatically on a model, you can still run it manually from the Leaderboard.
Upper Bound Running Time
You can set a limit for how long a single model can take to run (in hours).
You can limit the value of the response (target) to a percentile of the original values (between 0.5 and 1).
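As an illustration of capping a target at a percentile, the sketch below clips every value above the chosen percentile down to that percentile. It assumes a nearest-rank percentile definition, which may differ from the one DataRobot uses; `cap_response` and the claim amounts are hypothetical.

```python
import math

def cap_response(targets, percentile):
    """Clip target values above the given percentile (nearest-rank, assumed)."""
    if not 0.5 <= percentile <= 1:
        raise ValueError("percentile must be between 0.5 and 1")
    ordered = sorted(targets)
    # Round before ceil to avoid floating-point noise such as 9.000000001.
    rank = max(1, math.ceil(round(percentile * len(ordered), 9)))
    cap = ordered[rank - 1]
    return [min(t, cap) for t in targets]

# Made-up claim amounts with one extreme value.
claims = [100, 120, 90, 110, 105, 95, 130, 115, 125, 5000]
capped = cap_response(claims, percentile=0.9)
```

With a cap at the 90th percentile, the single extreme claim of 5000 is reduced to 130 and the remaining values are left unchanged.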
Random Seed & Positive Class Assignment
You can set the random seed and indicate the positive class.
You can indicate a feature to be used as weights for the different rows. This is especially useful if you are doing smart downsampling.
Exposure, Count of Events, and Offset
You can set the Exposure, which transforms the feature to add value to predictions. The target must be a positive numeric with a cardinality greater than 2.
You can also specify the Count of Events. This contains the frequency of non-zero events that contributed to the target value. This is used in frequency-severity blueprints only when the target is zero-inflated.
The Offset setting allows you to add the value listed for each feature to the prediction prior to applying the link function.
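A common way to picture offsets and exposure is a model with a log link, where the offset is added to the linear score before the inverse link (the exponential) is applied, and exposure enters as an offset of log(exposure). The sketch below assumes exactly that setup; the function names and numbers are hypothetical, and other link functions would behave differently.

```python
import math

def predict_with_offset(linear_score, offset=0.0):
    """Log-link prediction: add the offset to the linear score, then exponentiate."""
    return math.exp(linear_score + offset)

def predict_with_exposure(linear_score, exposure):
    """Treat exposure as an offset of log(exposure) (assumed log-link behavior)."""
    return predict_with_offset(linear_score, offset=math.log(exposure))

# A policy observed for a full year versus one observed for half a year.
full_year = predict_with_offset(linear_score=-2.0)
half_year = predict_with_exposure(linear_score=-2.0, exposure=0.5)
```

Under these assumptions, halving the exposure halves the prediction, which is why exposure suits observations that cover different lengths of time.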
There are four modeling modes to choose from:
Autopilot mode is the most computationally thorough mode. When selected, DataRobot will start with a small percentage of the data and apply a variety of different algorithms to the problem. The algorithms that do well will survive to the next round of modeling and will be provided more data. The models from that round that perform the best will get even more data, and so on. This mode is useful when you want to know what the best approach is to solve your specific modeling problem.
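The survival process described above can be sketched as a loop in which each round scores the remaining candidates on a larger sample and keeps only the top performers. Everything in this example is hypothetical: the model names, the toy scoring function, the sample fractions, and the rule of halving the number of survivors each round.

```python
def survival_rounds(candidates, score, sample_fractions, keep_top):
    """Run successive rounds, keeping the best candidates at each sample size."""
    survivors = list(candidates)
    history = []
    for fraction in sample_fractions:
        ranked = sorted(survivors, key=lambda name: score(name, fraction), reverse=True)
        survivors = ranked[:keep_top]
        history.append((fraction, survivors))
        keep_top = max(1, keep_top // 2)  # fewer survivors in each later round
    return history

# Toy scores: a base skill plus a gain that grows with the sample fraction.
BASE_SKILL = {"glm": 0.70, "gbm": 0.80, "rf": 0.76, "nn": 0.72, "svm": 0.68}
DATA_GAIN = {"glm": 0.02, "gbm": 0.10, "rf": 0.06, "nn": 0.12, "svm": 0.01}

def toy_score(name, fraction):
    return BASE_SKILL[name] + DATA_GAIN[name] * fraction

history = survival_rounds(
    BASE_SKILL, toy_score, sample_fractions=[0.16, 0.32, 0.64], keep_top=4
)
```

Each entry in `history` records the sample fraction and the models that survived that round, so you can see the field narrow as more data is used.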
Quick mode is a narrower run of the Autopilot mode. This mode (the default mode) starts out using a larger percentage of the data and focuses only on those algorithms that tend to rise to the top of the Leaderboard.
Manual mode allows you to select individual models from the Repository to run. This is useful if you already know which kind of modeling approach you want to use.
Comprehensive mode runs all blueprints in the Repository, using the maximum Autopilot sample size to improve model accuracy. This mode is useful if you want to find the most accurate model for your use case, regardless of the time required. (Note that Comprehensive mode is not supported for time series or unsupervised projects.)
See the DataRobot public documentation for Advanced options.