Assessing Accuracy of Your Models
Accuracy in machine learning is multidimensional: it refers to a subset of model performance indicators, each of which measures the model’s aggregated errors in a different way. Although you do need to decide which error metric to optimize, fixating on a single score is a myopic way to build production-ready AI systems. You can best evaluate accuracy through multiple tools and visualizations, along with explainability features and bias and fairness testing when appropriate.
How Do I Evaluate the Accuracy of My Model?
The characteristics of your predictive target help determine the best error metric by which to optimize your model. As a classic example, consider naive accuracy in a binary classification problem: that is, the percentage of examples for which your model predicts the correct class, true or false. If your target distribution is heavily imbalanced such that 97% of your training sample has the value of false, a model that predicts false every single time would have a naive accuracy score of 97%. Although impressive on paper, such a model is useless for your actual goal of finding those rare cases of true.
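The imbalance example can be reproduced in a few lines. This is a minimal sketch on made-up data: the 100-example sample and the helper names `naive_accuracy` and `recall` are assumptions for illustration, not part of any particular library.

```python
def naive_accuracy(y_true, y_pred):
    """Fraction of examples whose predicted class equals the actual class."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def recall(y_true, y_pred):
    """Fraction of the actual True cases that the model also predicted True."""
    predictions_for_positives = [p for t, p in zip(y_true, y_pred) if t]
    return sum(predictions_for_positives) / len(predictions_for_positives)


# Hypothetical sample: 97 False and 3 True out of 100 examples.
y_true = [False] * 97 + [True] * 3
# A "model" that predicts False every single time.
y_pred = [False] * 100

accuracy = naive_accuracy(y_true, y_pred)
found = recall(y_true, y_pred)
print(f"naive accuracy = {accuracy:.2f}, recall on True = {found:.2f}")
```

The accuracy comes out at 0.97 while the recall on the rare class is 0.00, which is why a second metric is needed to expose the problem.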
Binary classification models are often optimized using an error metric called LogLoss, which scores the predicted probabilities themselves and penalizes confident wrong predictions most heavily. In binary classification, it is also typical to drill down into the accuracy of the class assignments with a confusion matrix and a set of metrics and visualizations derived from it. For a given classification threshold, the confusion matrix gives you counts of true positives, false positives, true negatives, and false negatives, from which you can compute values such as a model’s sensitivity, specificity, precision, and F1 score.
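To make these quantities concrete, here is a small sketch that computes LogLoss, the four confusion-matrix counts, and the derived metrics from scratch. The labels, predicted probabilities, and function names are hypothetical, and the 0.5 default threshold is an assumption chosen for illustration.

```python
import math


def log_loss(y_true, probs, eps=1e-15):
    """Average negative log-likelihood of the true labels (lower is better)."""
    total = 0.0
    for t, p in zip(y_true, probs):
        p = min(max(p, eps), 1 - eps)  # clip so log(0) never occurs
        total -= t * math.log(p) + (1 - t) * math.log(1 - p)
    return total / len(y_true)


def confusion_counts(y_true, probs, threshold=0.5):
    """Return (TP, FP, TN, FN), labeling scores >= threshold as positive."""
    tp = fp = tn = fn = 0
    for t, p in zip(y_true, probs):
        predicted = 1 if p >= threshold else 0
        if predicted and t:
            tp += 1
        elif predicted and not t:
            fp += 1
        elif not predicted and not t:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def derived_metrics(tp, fp, tn, fn):
    """Sensitivity, specificity, precision, and F1 from confusion counts."""
    sensitivity = tp / (tp + fn) if tp + fn else 0.0  # true positive rate
    specificity = tn / (tn + fp) if tn + fp else 0.0  # true negative rate
    precision = tp / (tp + fp) if tp + fp else 0.0
    denom = precision + sensitivity
    f1 = 2 * precision * sensitivity / denom if denom else 0.0
    return {"sensitivity": sensitivity, "specificity": specificity,
            "precision": precision, "f1": f1}


# Hypothetical labels (1 = true, 0 = false) and predicted probabilities.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
probs = [0.9, 0.2, 0.6, 0.4, 0.7, 0.1, 0.8, 0.3]

counts = confusion_counts(y_true, probs)
metrics = derived_metrics(*counts)
loss = log_loss(y_true, probs)
print(counts, metrics, round(loss, 4))
```

Re-running `confusion_counts` with a different `threshold` shows how the counts, and every metric derived from them, shift as the cutoff moves, whereas LogLoss does not depend on a threshold at all.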
Because these values change with the classification threshold, performance across all thresholds can be visualized with a Receiver Operating Characteristic (ROC) curve or a Precision-Recall curve. The ROC curve can also be summarized in a single metric: the Area Under the Curve (AUC). Which of these metrics you prioritize depends on the use case. Tools such as a Profit Curve, which attaches a business payoff to each type of outcome, can also be helpful in identifying the optimal classification threshold.
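The sketch below traces ROC points by sweeping the threshold, integrates them into an AUC with the trapezoid rule, and then picks the threshold with the highest profit. The data, the function names, and the payoff values (`gain_per_tp`, `cost_per_fp`) are all hypothetical; a real profit curve would use payoffs taken from your own business case.

```python
def roc_points(y_true, probs):
    """(false positive rate, true positive rate) at each distinct threshold."""
    pos = sum(y_true)
    neg = len(y_true) - pos
    points = [(0.0, 0.0)]  # a threshold above every score flags nothing
    for thr in sorted(set(probs), reverse=True):
        tp = sum(1 for t, p in zip(y_true, probs) if p >= thr and t == 1)
        fp = sum(1 for t, p in zip(y_true, probs) if p >= thr and t == 0)
        points.append((fp / neg, tp / pos))
    return points


def auc(points):
    """Area under the ROC points using the trapezoid rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


def best_threshold(y_true, probs, gain_per_tp, cost_per_fp):
    """Return (threshold, profit) for the most profitable cutoff."""
    best = None
    for thr in sorted(set(probs), reverse=True):
        tp = sum(1 for t, p in zip(y_true, probs) if p >= thr and t == 1)
        fp = sum(1 for t, p in zip(y_true, probs) if p >= thr and t == 0)
        profit = tp * gain_per_tp - fp * cost_per_fp
        if best is None or profit > best[1]:
            best = (thr, profit)
    return best


# Hypothetical labels (1 = true, 0 = false) and predicted probabilities.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
probs = [0.9, 0.2, 0.6, 0.4, 0.7, 0.1, 0.8, 0.3]

area = auc(roc_points(y_true, probs))
# Assumed payoffs: each true positive earns 100, each false positive costs 30.
threshold, profit = best_threshold(y_true, probs, gain_per_tp=100, cost_per_fp=30)
print(f"AUC = {area:.3f}, best threshold = {threshold}, profit = {profit}")
```

Changing the assumed payoffs moves the most profitable threshold even though the AUC stays the same, which is one reason the metric to prioritize depends on the use case.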
In regression analysis, metrics such as RMSE or MAE measure the discrepancy between predicted and actual values. Because RMSE squares each error before averaging, it penalizes large misses more heavily, whereas MAE grows linearly with the size of the error. The lift chart is a simple visual representation of the model’s ability to correctly rank examples from low to high. Inspecting the distribution of residuals provides another avenue for evaluating model performance. It can reveal whether the model habitually under- or overpredicts and, depending on the underlying distribution, whether metrics such as Poisson or Tweedie deviance might be better suited for optimizing your model.
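A short sketch shows how RMSE and MAE diverge on the same total amount of error, and how the mean residual hints at systematic bias. The values and function names are hypothetical, and residuals are defined here as actual minus predicted, which is an assumed sign convention.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error: squares each error before averaging."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def mae(actual, predicted):
    """Mean absolute error: grows linearly with the size of each error."""
    absolute = [abs(a - p) for a, p in zip(actual, predicted)]
    return sum(absolute) / len(absolute)


def mean_residual(actual, predicted):
    """Average of (actual - predicted); negative means overpredicting."""
    residuals = [a - p for a, p in zip(actual, predicted)]
    return sum(residuals) / len(residuals)


actual = [10, 12, 14, 16]
# Hypothetical model A: every prediction is too high by 1.
spread_out = [11, 13, 15, 17]
# Hypothetical model B: three perfect predictions, one too high by 4.
one_big_miss = [10, 12, 14, 20]

for name, predicted in [("A", spread_out), ("B", one_big_miss)]:
    print(name, mae(actual, predicted), rmse(actual, predicted),
          mean_residual(actual, predicted))
```

Both models have the same MAE, but model B's single large miss doubles its RMSE. The negative mean residual for both indicates that they overpredict on this sample.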
How Do I Explain How Accurately a Model Is Performing?
When reporting the accuracy of a model, visualizations like the ones described above can be instrumental in communicating to both technical and nontechnical audiences. Comparisons to other modeling approaches are also valuable. For instance, you can always build a naive baseline model, such as a Majority Class Classifier, against which the predictive lift of the chosen approach becomes easier to see. Ideally, you should attempt and compare multiple competitive modeling approaches. The DataRobot leaderboard is populated with each end-to-end modeling approach built against a given dataset and enables direct comparisons between diverse methods. Ultimately, the model you select might not be the most accurate one, because the other dimensions of performance and trust also weigh in, but it is still critical to understand accuracy holistically.
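A baseline comparison can be as simple as the following sketch. It swaps the Majority Class Classifier for a closely related naive baseline that predicts the training base rate as a constant probability for every example, so that LogLoss stays finite. The data and the candidate model's probabilities are made up for illustration.

```python
import math


def log_loss(y_true, probs, eps=1e-15):
    """Average negative log-likelihood of the true labels (lower is better)."""
    total = 0.0
    for t, p in zip(y_true, probs):
        p = min(max(p, eps), 1 - eps)  # clip so log(0) never occurs
        total -= t * math.log(p) + (1 - t) * math.log(1 - p)
    return total / len(y_true)


# Hypothetical labels: 2 positives out of 10 examples.
y_true = [1, 0, 0, 0, 1, 0, 0, 0, 0, 0]

# Naive baseline: always predict the observed base rate of the positive class.
base_rate = sum(y_true) / len(y_true)
baseline_probs = [base_rate] * len(y_true)

# Hypothetical probabilities from a candidate model.
model_probs = [0.7, 0.1, 0.2, 0.1, 0.6, 0.3, 0.1, 0.2, 0.1, 0.1]

baseline_loss = log_loss(y_true, baseline_probs)
model_loss = log_loss(y_true, model_probs)

# Relative improvement over the baseline; 0 means no better than naive.
improvement = 1 - model_loss / baseline_loss
print(f"baseline={baseline_loss:.3f} model={model_loss:.3f} "
      f"improvement={improvement:.1%}")
```

Reporting the improvement over a naive baseline alongside the raw score gives nontechnical audiences a reference point for what the number means.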
Accuracy Is Just a Piece of the Puzzle
Accuracy is only one dimension of model performance that directly contributes to the trustworthiness of generated predictive models. The full list includes the following:
Robustness and Stability
A model in production encounters all sorts of unclean, chaotic data—from typos to anomalous events—which can trigger unintended behavior. Find out how to test whether your model is ready for the real world.
Speed
The speed of model scoring directly impacts how it can be used in your business process. Find out how the speed of predictions influences their trustworthiness.