Evaluating machine learning models with a confusion matrix
This article was originally published on Algorithmia’s website. The company was acquired by DataRobot in 2021. This article may not be entirely up to date and may refer to products and offerings that no longer exist.
One of the final (and arguably most important) steps in developing a machine learning model is evaluating its accuracy. You can’t trust a model to make good predictions about new and unknown data if it’s struggling with training data. For regression models, evaluating accuracy usually means calculating metrics like mean squared error, R-squared, or root mean squared error. All of these metrics are variations on measuring how far a predicted value is from an actual value.
Accuracy measurements used in regression models don’t make much sense with classification models. As a refresher, classification algorithms (e.g., logistic regression, k-nearest neighbors, decision trees) predict whether or not a new observation will belong to a particular group or class. In the binary case, the prediction is a “yes” or “no” rather than a continuous value.
The purpose of this blog is to help you evaluate the accuracy of classification models by:
- Summarizing model results
- Organizing results into a table known as a confusion matrix
- Calculating metrics with data in the confusion matrix
Evaluating the accuracy of classification models
Evaluating the accuracy of a classification model starts with summarizing the results into the following groups:
- True positives (TP): Observations predicted to be part of a class that are actually part of the class.
- True negatives (TN): Observations predicted not to be part of a class that are actually not part of the class.
- False positives (FP): Observations predicted to be part of a class that are actually not part of the class.
- False negatives (FN): Observations predicted not to be part of a class that are actually part of the class.
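The four groups above can be tallied directly from a list of actual labels and a list of predicted labels. The following is a minimal sketch, not a reference implementation; the helper name `count_outcomes` and the toy labels are invented for illustration.

```python
def count_outcomes(actual, predicted, positive=1):
    """Tally TP, TN, FP, and FN, treating `positive` as the class of interest."""
    counts = {"TP": 0, "TN": 0, "FP": 0, "FN": 0}
    for a, p in zip(actual, predicted):
        if p == positive:
            # Predicted to be in the class: correct (TP) or not (FP).
            counts["TP" if a == positive else "FP"] += 1
        else:
            # Predicted not to be in the class: missed (FN) or correct (TN).
            counts["FN" if a == positive else "TN"] += 1
    return counts


# Toy labels made up for this sketch (1 = in the class, 0 = not in the class).
actual = [1, 1, 1, 0, 0, 0, 1, 0]
predicted = [1, 0, 1, 0, 1, 0, 1, 0]

outcomes = count_outcomes(actual, predicted)
print(outcomes)  # {'TP': 3, 'TN': 3, 'FP': 1, 'FN': 1}
```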
You can visually represent these findings in a confusion matrix.
What is a confusion matrix?
A confusion matrix is a table that allows you to visualize the performance of a classification model. You can also use the information in it to calculate measures that can help you determine the usefulness of the model.
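In abstract terms, a 2×2 confusion matrix has one cell for each of the four groups above. Rows represent predicted classifications, while columns represent the true classes from the data, so the predicted-positive row holds the true positives and false positives, and the predicted-negative row holds the false negatives and true negatives.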
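To make the layout concrete, here is a small sketch that arranges the four groups with rows as predictions and columns as true classes, as described above. The helper names `layout_matrix` and `format_matrix` are hypothetical, and putting the positive class first is an assumption. Note that some libraries use the opposite orientation: scikit-learn’s `confusion_matrix`, for example, puts true classes in rows and predictions in columns.

```python
def layout_matrix(tp, fp, fn, tn):
    """Return a 2x2 list: rows = predicted class, columns = actual class."""
    return [
        [tp, fp],  # predicted positive: actual positive, actual negative
        [fn, tn],  # predicted negative: actual positive, actual negative
    ]


def format_matrix(matrix):
    """Render the 2x2 matrix as aligned plain text with row and column labels."""
    header = f"{'':<20}{'Actual positive':>17}{'Actual negative':>17}"
    row_labels = ["Predicted positive", "Predicted negative"]
    lines = [header]
    for label, (pos, neg) in zip(row_labels, matrix):
        lines.append(f"{label:<20}{pos!s:>17}{neg!s:>17}")
    return "\n".join(lines)


abstract = layout_matrix("TP", "FP", "FN", "TN")
print(format_matrix(abstract))
```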
Confusion matrix example
Let’s look at how a confusion matrix could be used in a business context. The matrix for this example represents the results of a model predicting whether a customer will purchase an item after receiving a coupon. The training data consists of 1000 observations.
The matrix shows us that 525 customers actually made a purchase, while 475 did not. The model predicted, however, that 600 would make a purchase and 400 would not. Of the 600 predicted purchases, 450 were correct (true positives) and 150 were not (false positives); of the 400 predicted non-purchases, 325 were correct (true negatives) and 75 were not (false negatives). So, is this a good model? Let’s figure that out by calculating important classification metrics with the confusion matrix.
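As a sanity check on this example, the four cells can be recovered from the totals stated above together with the true positive and true negative counts (450 and 325) that the accuracy calculation uses. This is only a sketch; the variable names are illustrative.

```python
# Totals stated in the text.
actual_purchase, actual_no_purchase = 525, 475
predicted_purchase, predicted_no_purchase = 600, 400

# Correct predictions, as used in the accuracy calculation.
tp, tn = 450, 325

# The remaining cells follow from the predicted (row) totals.
fp = predicted_purchase - tp      # predicted a purchase that did not happen
fn = predicted_no_purchase - tn   # predicted no purchase, but one happened

# The actual (column) totals must agree with the same four cells.
assert tp + fn == actual_purchase
assert fp + tn == actual_no_purchase

# Rows = predicted class, columns = actual class.
matrix = [[tp, fp], [fn, tn]]
print(matrix)  # [[450, 150], [75, 325]]
```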
In the simplest terms, accuracy tells you generally how “right” your predictions are. It is the sum of true positives and true negatives divided by the total number of observations. In this case that’s 0.775 ((450+325)/1000), meaning that 77.5% of the model’s predictions were correct.
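The accuracy calculation can be written as a one-line function. This sketch uses the coupon example’s counts; the false negative count of 75 is not quoted directly in the text but follows from 525 actual purchasers minus 450 true positives.

```python
def accuracy(tp, tn, fp, fn):
    """Share of all predictions that were correct."""
    return (tp + tn) / (tp + tn + fp + fn)


coupon_accuracy = accuracy(tp=450, tn=325, fp=150, fn=75)
print(coupon_accuracy)  # 0.775
```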
Is that good or bad? It depends. One issue with accuracy is that it treats false positives and false negatives as equally costly. We know that’s not always the case. Depending on the problem you’re trying to solve, you may care more about minimizing false positives than false negatives, or vice versa.
The next measures help you deal with this issue.
Also known as recall or the true positive rate, sensitivity tells you how often the model chooses the positive class when the observation is in fact in the positive class. It is calculated by dividing the number of true positives in the matrix by the total number of real positives in the data.
In our example, sensitivity is 0.857 (450/525), meaning that the model correctly identified 85.7% of the customers who actually made a purchase.
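In code, sensitivity only needs the true positives and false negatives. A minimal sketch with the example’s counts (the 75 false negatives are the 525 actual purchasers minus the 450 true positives):

```python
def sensitivity(tp, fn):
    """True positive rate (recall): correctly predicted positives / actual positives."""
    return tp / (tp + fn)


coupon_sensitivity = sensitivity(tp=450, fn=75)
print(round(coupon_sensitivity, 3))  # 0.857
```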
Precision measures how often a model is correct when it predicts the positive class. It is calculated by dividing the number of true positives in the matrix by the total number of predicted positives.
In our example, precision is 0.75 (450/600). In other words, when the model predicted the positive class, it was correct 75% of the time.
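Precision swaps the denominator from actual positives to predicted positives. A short sketch using the example’s counts, where the 600 predicted purchases split into 450 true positives and 150 false positives:

```python
def precision(tp, fp):
    """Correctly predicted positives / all predicted positives."""
    return tp / (tp + fp)


coupon_precision = precision(tp=450, fp=150)
print(coupon_precision)  # 0.75
```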
Also known as the true negative rate, specificity measures how often the model chooses the negative class when the observation is in fact in the negative class. It is calculated by dividing the number of true negatives by the total number of real negatives in the data.
The true negative rate in our example is 0.684 (325/(150+325)), meaning that the model correctly identified 68.4% of the customers who did not make a purchase.
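Specificity mirrors sensitivity on the negative class. A minimal sketch with the example’s counts:

```python
def specificity(tn, fp):
    """True negative rate: correctly predicted negatives / actual negatives."""
    return tn / (tn + fp)


coupon_specificity = specificity(tn=325, fp=150)
print(round(coupon_specificity, 3))  # 0.684
```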
What can you do with classification metrics?
So now that you have all of these metrics, how do you make sense of them? As we said earlier, general accuracy is often not enough information to allow you to decide on a model’s value. One useful thing that a confusion matrix allows you to see is the prevalence of false positives and false negatives. If you’re familiar with statistics, you may know them as Type I and Type II errors, respectively.
Our coupon scenario may not be the best example for explaining this, but consider a computer vision machine learning model that classifies parts of images from CT scans as cancerous or non-cancerous. Presumably, you would want the model to minimize the number of false negatives, on the assumption that it is better to be flagged as cancerous and not actually have cancer than to have the condition go undetected and untreated. In this case, you might be fine with a model that has high recall but low precision.
Data scientists often test multiple models to answer a question or solve a problem. Confusion matrices can help with side-by-side comparisons of different classification methods. You can see not only how accurate one model is compared with another, but also, at a more granular level, how each model does on sensitivity or specificity, which might matter more than general accuracy.
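A side-by-side comparison might look like the sketch below. The first set of counts is the coupon example from this article; the second, "model B", is entirely hypothetical and was made up to show a model with higher accuracy but lower sensitivity. The function name `summarize` is also illustrative.

```python
def summarize(tp, fp, fn, tn):
    """Return the four metrics discussed in this article for one confusion matrix."""
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "sensitivity": tp / (tp + fn),
        "precision": tp / (tp + fp),
        "specificity": tn / (tn + fp),
    }


models = {
    "coupon model (from the article)": summarize(tp=450, fp=150, fn=75, tn=325),
    # Hypothetical counts on the same 1000 observations, for illustration only.
    "model B (hypothetical)": summarize(tp=400, fp=60, fn=125, tn=415),
}

for name, scores in models.items():
    row = "  ".join(f"{metric}={value:.3f}" for metric, value in scores.items())
    print(f"{name}: {row}")
```

With these made-up counts, model B is more accurate overall but misses more actual purchasers, so which model is preferable depends on whether false negatives or false positives are more costly for the problem at hand.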