Trusted AI Cornerstones: Performance Evaluation
AI has moved from the realms of theoretical mathematics and advanced hardware to an everyday aspect of life, informing decisions big and small. As its development has accelerated and its use has proliferated, our requirements for AI systems have grown more sophisticated.
At DataRobot, we define the benchmark of AI maturity as AI you can trust. Unlike an intrinsic quality such as accuracy or even fairness, trust is a characteristic of the human-machine relationship. Trust must be established between an AI user and the system.
This is the first in a series of three blog posts that shares our perspective on AI you can trust. In this installment, I’ll cover four key elements of trusted AI that relate to the performance of a model: data quality, accuracy, robustness and stability, and speed.
Quality Input Means Quality Output
Trustworthy AI is built on a foundation of high-quality data. The performance of any machine learning model is tightly linked to the data it was trained on and validated against.
Depending on your use case, you might have a mix of data in your enterprise that includes open source public data and third-party data, in addition to internal, private data. It’s important to maintain a traceable record of the provenance of your data: which data sources were used, when they were accessed, and how they were verified.
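As a minimal sketch of what a traceable provenance record could look like, the snippet below stores one entry per data source. The class name, field names, and sample entries are all hypothetical and only illustrate the three things mentioned above: which source was used, when it was accessed, and how it was verified.

```python
from dataclasses import dataclass, asdict
from datetime import date


@dataclass(frozen=True)
class DataSourceRecord:
    """One provenance entry. All field names are illustrative assumptions."""
    name: str          # which data source was used
    kind: str          # "internal", "third-party", or "open-source"
    accessed_on: date  # when it was accessed
    verified_by: str   # how it was verified


# Hypothetical entries, for illustration only.
records = [
    DataSourceRecord("crm_export", "internal", date(2021, 3, 1),
                     "row counts reconciled with the source system"),
    DataSourceRecord("public_census_extract", "open-source", date(2021, 2, 15),
                     "checksum compared with the publisher's value"),
]


def sources_of_kind(records, kind):
    """Return the names of all recorded sources of a given kind."""
    return [r.name for r in records if r.kind == kind]


# Plain dictionaries are easy to write to an audit log or a metadata store.
audit_log = [asdict(r) for r in records]
print(sources_of_kind(records, "open-source"))
```

Keeping the record immutable (`frozen=True`) is one possible design choice: an entry describes what was true at access time, so a later change is better logged as a new entry.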
When you use third-party or open source data, it’s vital to find as much information as possible about how the data was collected, including its recency.
When you have the data in hand, assess its quality. This includes the basics:
- Computing summary statistics on each feature
- Measuring associations between features
- Observing feature distributions and their correlation with the predictive target
- Identifying outliers
You also need to identify missing or disguised missing values, duplicated rows, unique identifiers that do not carry any information for the model, and target leakage (the use of information in the model training process that would not be expected to be available at prediction time).
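Several of these checks can be sketched in a few lines of plain Python. The toy rows, the column names, and the set of sentinel values treated as disguised missing values are assumptions made for illustration; real sentinels vary by dataset, and the unique-identifier test here is only a rough heuristic.

```python
from collections import Counter
from statistics import mean

# Assumption: these sentinels mark disguised missing values in this toy data.
DISGUISED_MISSING = {-999, "N/A", "", "unknown"}

# Hypothetical rows for illustration.
rows = [
    {"id": 1, "age": 34, "income": 52000},
    {"id": 2, "age": 41, "income": None},
    {"id": 3, "age": -999, "income": 61000},
    {"id": 4, "age": 34, "income": 52000},
    {"id": 5, "age": 29, "income": 48000},
    {"id": 5, "age": 29, "income": 48000},  # exact duplicate of the row above
]


def is_missing(value):
    return value is None or value in DISGUISED_MISSING


def missing_counts(rows):
    """Count missing and disguised missing values per column."""
    counts = Counter()
    for row in rows:
        for col, value in row.items():
            if is_missing(value):
                counts[col] += 1
    return dict(counts)


def row_key(row):
    """Hashable representation of a row, independent of key order."""
    return tuple(sorted(row.items()))


def duplicate_rows(rows):
    """Number of rows that repeat an earlier row exactly."""
    seen = Counter(row_key(r) for r in rows)
    return sum(n - 1 for n in seen.values())


def identifier_columns(rows):
    """Columns whose value is distinct on every deduplicated row."""
    unique = [dict(k) for k in {row_key(r) for r in rows}]
    return sorted(c for c in unique[0]
                  if len({r[c] for r in unique}) == len(unique))


def summarize(rows, col):
    """Basic summary statistics for one numeric column, skipping missing values."""
    values = [r[col] for r in rows if not is_missing(r[col])]
    return {"n": len(values), "mean": mean(values),
            "min": min(values), "max": max(values)}


print(missing_counts(rows))
print(duplicate_rows(rows))
print(identifier_columns(rows))
print(summarize(rows, "age"))
```

Target leakage is not covered by this sketch; spotting it usually requires knowing when each feature becomes available relative to prediction time.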
Use Multiple Tools and Visualizations
Accuracy is a subset of model performance indicators that measure the model’s aggregated errors in different ways. Accuracy is best evaluated through multiple tools and visualizations, alongside explainability features, and bias and fairness testing.
Binary classification models are often optimized using an error metric called LogLoss. Within binary classification, it is typical to drill down into the accuracy of the class assignment of your examples with a confusion matrix and a set of metrics and visualizations derived from it. The confusion matrix allows you to access counts of true positives, false positives, true negatives, and false negatives based on a classification threshold and additionally compute values such as a model’s sensitivity, specificity, precision, and F1 score.
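To make these quantities concrete, here is a small sketch that computes LogLoss, the confusion matrix counts at a chosen threshold, and the metrics derived from them. The labels and predicted probabilities are made up for illustration, and the function names are hypothetical rather than part of any particular product or library.

```python
import math


def log_loss(y_true, y_prob, eps=1e-15):
    """Average negative log-likelihood of the true labels (binary LogLoss)."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(y_true)


def confusion_counts(y_true, y_prob, threshold=0.5):
    """Return (tp, fp, tn, fn) for a given classification threshold."""
    tp = fp = tn = fn = 0
    for y, p in zip(y_true, y_prob):
        predicted = 1 if p >= threshold else 0
        if predicted == 1 and y == 1:
            tp += 1
        elif predicted == 1 and y == 0:
            fp += 1
        elif predicted == 0 and y == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def derived_metrics(tp, fp, tn, fn):
    """Sensitivity, specificity, precision, and F1 from confusion counts."""
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    denom = precision + sensitivity
    f1 = 2 * precision * sensitivity / denom if denom else 0.0
    return {"sensitivity": sensitivity, "specificity": specificity,
            "precision": precision, "f1": f1}


# Made-up labels and predicted probabilities.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_prob = [0.9, 0.2, 0.6, 0.4, 0.7, 0.1, 0.8, 0.3]

print(log_loss(y_true, y_prob))
print(confusion_counts(y_true, y_prob, threshold=0.5))   # (3, 1, 3, 1)
print(confusion_counts(y_true, y_prob, threshold=0.35))  # (4, 1, 3, 0)
print(derived_metrics(*confusion_counts(y_true, y_prob)))
```

Lowering the threshold from 0.5 to 0.35 turns the one false negative into a true positive in this toy data, which shows why the confusion matrix is always read relative to a threshold.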
The DataRobot leaderboard is populated with each end-to-end modeling approach built against a given dataset. It enables direct comparisons of accuracy between diverse machine learning approaches.
Small Changes Should Result in Small Changes
A production AI model confronts all manner of unclean, chaotic data—from typos to anomalous events—that might trigger unintended behavior from the model. Identifying those weaknesses in advance and tailoring your modeling approach to minimize them is critical to robust and stable model performance.
For model stability, the relevant changes in your data are typically very small disturbances, against which you ideally want to see stable and consistent behavior from your model. In most cases, the resulting prediction should change by only a small amount.
Model robustness is measured against large disturbances or values the model has never seen before. It’s worth understanding how your model’s predictions will respond to these data disturbances. Will your prediction blow up and assume a very high value? Will it plateau asymptotically or be driven to zero?
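One simple way to probe both properties is to perturb the inputs and watch the prediction. The sketch below uses a toy logistic scorer with made-up weights as a stand-in for a trained model; the noise level, number of trials, and extreme values are arbitrary assumptions, not recommended settings.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function (no overflow for large |z|)."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def toy_model(features):
    """Stand-in for a trained model; the weights are hypothetical."""
    return sigmoid(0.8 * features[0] - 0.5 * features[1] + 0.1)


def max_prediction_shift(model, row, noise=0.01, trials=200, seed=0):
    """Stability: largest prediction change under small random input jitter."""
    rng = random.Random(seed)
    base = model(row)
    worst = 0.0
    for _ in range(trials):
        jittered = [x + rng.uniform(-noise, noise) for x in row]
        worst = max(worst, abs(model(jittered) - base))
    return worst


def probe_extremes(model, row, index, values):
    """Robustness: predictions when one feature takes extreme, unseen values."""
    results = {}
    for v in values:
        probe = list(row)
        probe[index] = v
        results[v] = model(probe)
    return results


row = [1.0, 2.0]
print(max_prediction_shift(toy_model, row))
print(probe_extremes(toy_model, row, index=0, values=[-1e6, 1e6]))
```

For this toy scorer the output plateaus between 0 and 1 at the extremes rather than blowing up. An unbounded regression model would behave differently, which is the kind of difference this probe is meant to surface.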
Get Useful Conclusions Quickly
Of course, you want to develop models quickly, but the focus shouldn’t be only on getting a model up and running; it should also be on how long the model takes to make a prediction. At DataRobot, that’s our goal.
There are two ways to score models. The first, called batch prediction, sends new records as a batch for the model to score all at once. The size of the batch is extremely relevant, because there are massive differences between scoring a few hundred records with numeric and categorical data and scoring gigabytes of data. Depending on the situation, the speed to score might be more or less significant in your model selection. If the process runs only monthly and takes an hour, shaving off ten minutes is unlikely to be worth the tradeoff in accuracy. If the process runs daily on a large dataset, using a computationally less intensive modeling approach that returns values faster might be desirable.
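Scoring speed is easy to measure before you commit to a modeling approach. The following sketch times a toy scoring function on batches of different sizes; the stand-in model, the batch sizes, and the helper names are assumptions for illustration, and real timings depend heavily on the model, the hardware, and the data.

```python
import time


def toy_model(row):
    """Stand-in scoring function; a real model would be far more expensive."""
    return 0.3 * row[0] + 0.7 * row[1]


def score_batch(model, batch):
    """Score every record in the batch in one pass."""
    return [model(row) for row in batch]


def time_scoring(model, batch, repeats=5):
    """Return (best total seconds, seconds per record) over several runs."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        score_batch(model, batch)
        best = min(best, time.perf_counter() - start)
    return best, best / len(batch)


for size in (100, 10_000):
    batch = [(float(i), float(i % 7)) for i in range(size)]
    total, per_record = time_scoring(toy_model, batch)
    print(f"{size} records: {total:.6f}s total, {per_record:.9f}s per record")
```

Taking the best of several runs reduces the influence of one-off system noise. The same per-record figure is a rough first check against a latency budget when predictions must be returned one at a time.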
You might also require that predictions be returned in real time. For example, in digital advertising, sometimes the span of a click is all the time you have to allocate a particular ad to a particular user. For real-time prediction scoring, your model might have only milliseconds to produce a score.
Fostering trust in AI systems will bring transformative AI technologies to fruition, such as autonomous vehicles or the large-scale integration of machine intelligence into medicine. But building a trusted AI model takes time and planning. The four elements I shared in this post will put you on the road to trustworthy performance in modeling.