Cross-Validation

What is Cross-Validation?

Cross-validation is an extension of the training, validation, and holdout (TVH) process that reduces the sampling bias in estimates of machine learning model performance. Data partitioning for the regular TVH process looks like this:

[Figure: TVH partitioning, with the data split into training, validation, and holdout portions]
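As a concrete illustration, here is a minimal sketch of a TVH split using scikit-learn's train_test_split. The dataset is a synthetic stand-in, and the one-fifth proportions mirror the example above:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic stand-in dataset (hypothetical; substitute your own data).
X, y = make_classification(n_samples=1000, random_state=42)

# Carve off one fifth of the data as the holdout set.
X_rest, X_holdout, y_rest, y_holdout = train_test_split(
    X, y, test_size=0.20, random_state=42
)

# One fifth of the original data (0.25 of the remaining 80%) becomes
# the validation set; the rest is used for training.
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=42
)
```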

With cross-validation, we still have our holdout data, but rather than using one fifth of the data for the holdout, one fifth for validation, and the remainder for training as in the example above, we use several different portions of the data for validation. Because the non-holdout data is split into five folds, we call this "5-fold cross-validation"; if it had been split into four, it would be 4-fold cross-validation.

[Figure: 5-fold cross-validation, shown as five rows in which the validation fold (blue) rotates across the training data (gray)]

Next, we build five iterations of the model (represented by the five different rows), training each on the gray portion of the data and validating it on the blue portion. We then average the model's performance across the five validation partitions to estimate how well it performs.
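Here is a minimal sketch of that procedure in scikit-learn, assuming a synthetic stand-in dataset and a logistic regression model. The holdout set is carved off first, and cross_val_score then trains five models, one per fold, returning one validation score each:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score, train_test_split

# Synthetic stand-in dataset (hypothetical).
X, y = make_classification(n_samples=1000, random_state=42)

# Set aside the holdout data first; cross-validation only touches the rest.
X_rest, X_holdout, y_rest, y_holdout = train_test_split(
    X, y, test_size=0.20, random_state=42
)

# Train five models, each validated on a different fold, then
# average the fold scores to estimate model performance.
cv = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(LogisticRegression(max_iter=1000), X_rest, y_rest, cv=cv)
print(f"Fold accuracies: {scores}")
print(f"Mean cross-validation accuracy: {scores.mean():.3f}")
```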

Why is Cross-Validation important?

If the original validation partition is not representative of the overall population, the resulting model may appear to have high accuracy when in reality it simply happens to fit an unusual validation set well. This can lead you to deploy a model that performs poorly on future data. With cross-validation, you check the model's accuracy on several different subsets of the data, giving you greater confidence that it will generalize well to the data you collect in the future.

Cross-Validation + DataRobot

DataRobot automatically uses 5-fold cross-validation, but also allows you to manually partition your data.

[Screenshot: DataRobot cross-validation partitioning options]

Alternatively, rather than using TVH or cross-validation, you can specify group partitioning, which keeps all records that share a group identifier in the same partition, or out-of-time partitioning, which trains models on data from one time period and validates them on data from a later period. For the maximum level of control, you can specify which partition each record in your dataset is assigned to.
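DataRobot configures these schemes through its platform, but as a conceptual sketch, scikit-learn's GroupKFold and TimeSeriesSplit splitters enforce the same guarantees. The data and group identifiers below are hypothetical:

```python
import numpy as np
from sklearn.model_selection import GroupKFold, TimeSeriesSplit

X = np.arange(20).reshape(-1, 1)      # hypothetical features, in time order
y = np.arange(20) % 2                 # hypothetical labels
groups = np.repeat(np.arange(5), 4)   # hypothetical group ids (e.g., customer)

# Group partitioning: all rows sharing a group id land in the same fold,
# so no group appears in both training and validation.
for train_idx, val_idx in GroupKFold(n_splits=5).split(X, y, groups):
    assert not set(groups[train_idx]) & set(groups[val_idx])

# Out-of-time partitioning: train on earlier rows, validate on later ones.
for train_idx, val_idx in TimeSeriesSplit(n_splits=4).split(X):
    assert train_idx.max() < val_idx.min()
```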