Assessing Trustworthiness of Your Data
When designing trustworthy AI, you should start with the underlying data before building a model. Data quality is the first dimension and the foundation of AI trust. As the old adage goes, “Garbage in, garbage out.” Although this is an oversimplification of the problem, the core concept remains: the performance of any machine learning model is intimately tied to the data it was trained on and validated against.
What Kinds of Data Sources Should I Use in Machine Learning?
Data provenance and integrity are essential elements of understanding your model. Depending on your use case, you might end up with a mix of the following:
- Data that is internal and private to your enterprise
- Open-source public data
- Third-party data
It is very important to maintain a traceable record of the provenance of your data: what data sources were used, when they were accessed, and how they were verified.
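As a minimal sketch of what such a record could look like, the snippet below logs the source, access time, checksum, and verification note for one dataset. The `ProvenanceRecord` class, its field names, and the example source path are hypothetical choices made for illustration, not part of any particular tool.

```python
import hashlib
from dataclasses import dataclass, asdict
from datetime import datetime, timezone


@dataclass(frozen=True)
class ProvenanceRecord:
    """One entry in a hypothetical provenance log."""
    source: str        # where the data came from (URL, vendor, table name)
    source_type: str   # "internal", "open-source", or "third-party"
    accessed_at: str   # ISO 8601 timestamp (UTC) of when the data was pulled
    sha256: str        # checksum of the exact bytes that were used
    verified_by: str   # short note on how the data was verified


def record_provenance(raw: bytes, source: str, source_type: str,
                      verified_by: str) -> ProvenanceRecord:
    """Build a provenance entry for the raw bytes of a dataset."""
    return ProvenanceRecord(
        source=source,
        source_type=source_type,
        accessed_at=datetime.now(timezone.utc).isoformat(),
        sha256=hashlib.sha256(raw).hexdigest(),
        verified_by=verified_by,
    )


# Illustrative data only; the source name is made up.
raw = b"store_id,revenue\n1,100\n2,250\n"
entry = record_provenance(
    raw,
    source="example-vendor/sales.csv",
    source_type="third-party",
    verified_by="row counts matched vendor report",
)
print(asdict(entry))
```

Storing a checksum alongside the timestamp makes it possible to confirm later that a model was trained on exactly the bytes that were reviewed.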
When using third-party or open-source data, you should find as much information as possible on how the data was collected, including its recency. This can inform whether the data is ultimately relevant to your use case. All these points also apply to your internal data, where you are likely to have more transparency and might find it easier to identify any sampling bias introduced during data collection.
How Do You Assess the Quality of Your Data?
The next step, before any modeling takes place, is exploratory data analysis, including data quality assessments. It is important to understand your data firsthand and to explore for yourself what kinds of relationships may be present. This includes the basics:
- Computing summary statistics on each of your features
- Measuring associations between features
- Observing feature distributions and their correlation with the predictive target
- Identifying any outliers
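The basics listed above can be sketched with pandas. This is one possible approach, assuming a small numeric table with a `revenue` target; the column names, the toy values, and the 1.5 × IQR outlier rule are illustrative assumptions, and other association measures or outlier rules may suit your data better.

```python
import pandas as pd

# Toy data for illustration; "revenue" plays the role of the predictive target.
df = pd.DataFrame({
    "store_size": [10, 12, 11, 13, 12, 95],
    "staff": [3, 4, 3, 5, 4, 6],
    "revenue": [100, 130, 110, 150, 125, 160],
})

# Summary statistics for each feature (count, mean, std, quartiles, ...).
summary = df.describe()

# Pairwise associations between features (Pearson correlation by default).
associations = df.corr()

# Correlation of each feature with the target.
target_corr = associations["revenue"].drop("revenue")


def iqr_outliers(s: pd.Series, k: float = 1.5) -> pd.Series:
    """Flag values more than k * IQR outside the interquartile range."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return (s < q1 - k * iqr) | (s > q3 + k * iqr)


# Boolean mask with the same shape as df; True marks a flagged value.
outlier_mask = df.apply(iqr_outliers)

print(summary)
print(target_corr)
print(df[outlier_mask.any(axis=1)])
```

A flagged value is a prompt for investigation rather than an automatic deletion: it may be a data entry error or a genuinely unusual store.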
Other aspects of data quality assessment include detecting missing or disguised missing values, duplicated rows, unique identifiers that carry no information for the model, and target leakage. Before deciding how to deal with missing data (for instance, whether certain rows or features should be excluded from modeling, or whether a value should be imputed), you must understand whether any systematic behaviors are correlated with the missing values. For example, did specific locations of your retail chain habitually fail to report certain information? You might be able to request additional data and fill those gaps proactively.
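These checks might be sketched as follows. The column names, the `-999` sentinel standing in for a disguised missing value, and the toy data are assumptions made for illustration; which sentinels your own data uses is something only its documentation or its owners can confirm.

```python
import numpy as np
import pandas as pd

# Toy retail data for illustration only.
df = pd.DataFrame({
    "row_id": [1, 2, 3, 4, 5, 6],
    "store": ["A", "A", "B", "B", "B", "A"],
    "footfall": [120.0, 135.0, np.nan, -999.0, np.nan, 120.0],
    "revenue": [100, 130, 110, 150, 125, 100],
})

# Hypothetical sentinel values that disguise missing entries in "footfall".
SENTINELS = [-999.0]

# Convert disguised missing values to real NaN so they are counted.
clean = df.copy()
clean["footfall"] = clean["footfall"].mask(clean["footfall"].isin(SENTINELS))

# Fraction of missing values per column.
missing_rate = clean.isna().mean()

# Is missingness concentrated in particular stores?
missing_by_store = clean["footfall"].isna().groupby(clean["store"]).mean()

# Columns whose values are all distinct behave like identifiers.
id_like = [c for c in clean.columns if clean[c].nunique() == len(clean)]

# Duplicated rows, ignoring identifier-like columns.
duplicate_rows = clean.drop(columns=id_like).duplicated()

print(missing_rate)
print(missing_by_store)
print(id_like, int(duplicate_rows.sum()))
```

In this toy data every missing `footfall` value comes from store B, which is the kind of systematic pattern worth raising with the data owner before choosing to drop or impute.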
Target leakage is a data quality issue unique to machine learning. It refers to exposing the model to information that it should not have at the time of prediction. This enables your model to “cheat” and seemingly perform better than it possibly could in production. For example, historical records such as late fees levied on a loan would not be available when you are predicting, at the time of a loan application, whether the applicant will make their payments on time. You should regard a feature with a high univariate correlation with the target with suspicion, though subject matter expertise is essential in identifying subtler kinds of target leakage, which might be masked in a composite feature.
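A simple screen for the univariate case could look like the sketch below, which flags numeric features whose absolute correlation with the target exceeds a threshold. The loan columns, the toy values, the `flag_suspicious` helper, and the 0.95 cutoff are all hypothetical; a flagged feature is only a candidate for review, and an unflagged one is not proof that no leakage exists.

```python
import pandas as pd

# Toy loan data; "missed_payment" is the target. "late_fees_charged" is only
# known after the loan is running, so it would leak the outcome.
loans = pd.DataFrame({
    "income": [40, 85, 60, 30, 75, 50, 90, 35],
    "loan_amount": [10, 12, 20, 9, 15, 11, 25, 14],
    "late_fees_charged": [1, 0, 0, 1, 0, 1, 0, 1],
    "missed_payment": [1, 0, 0, 1, 0, 1, 0, 1],
})


def flag_suspicious(df: pd.DataFrame, target: str,
                    threshold: float = 0.95) -> pd.Series:
    """Return numeric features whose |correlation| with the target is high."""
    numeric = df.select_dtypes("number")
    corr = numeric.corr()[target].drop(target).abs()
    return corr[corr >= threshold].sort_values(ascending=False)


suspects = flag_suspicious(loans, target="missed_payment")
print(suspects)
```

Here only `late_fees_charged` is flagged. `income` is also strongly related to the target but stays under the cutoff, which illustrates why the threshold is a judgment call and why domain knowledge has the final say.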
How Do You Improve Your Training Data?
Data cleaning includes the handling of missing values and outliers or inliers, along with the dropping of duplicate rows or columns. You might want to remove whole rows or features from the data due to the prevalence of missing values or outliers. Imputation might also be an appropriate method to handle missing values, though the type of imputation depends on the modeling approach used.
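As a rough sketch of the row and column removal described above, the snippet below drops exact duplicate rows and then drops features that are mostly missing. The 50% cutoff (`MAX_MISSING`) and the toy columns are arbitrary assumptions for illustration; an appropriate threshold depends on the dataset and the modeling approach.

```python
import numpy as np
import pandas as pd

# Toy data for illustration only.
df = pd.DataFrame({
    "a": [1.0, 2.0, 2.0, 4.0],
    "b": [np.nan, np.nan, np.nan, 1.0],
    "c": [5.0, 6.0, 6.0, np.nan],
})

# Hypothetical cutoff: drop features with more than 50% missing values.
MAX_MISSING = 0.5

# Remove exact duplicate rows (pandas treats NaN values as equal here).
deduped = df.drop_duplicates()

# Keep only the columns whose missing fraction is at or below the cutoff.
kept = deduped.loc[:, deduped.isna().mean() <= MAX_MISSING]

print(kept)
```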
Certain industries might have specific, preferred methods of imputation, as with gene expression data. It’s also important to perform imputation after partitioning the data, so that statistical information about the validation data is not encoded in the training data. DataRobot runs a series of data quality checks and then handles missing values automatically before modeling.
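The ordering matters because the imputation statistic itself is learned from data. The sketch below, which assumes median imputation and a simple positional split purely for illustration (it is not a description of how DataRobot imputes), computes the median on the training partition only and reuses it for the validation partition.

```python
import numpy as np
import pandas as pd

# Toy data for illustration; the last two rows form the validation partition.
data = pd.DataFrame({"footfall": [100.0, np.nan, 120.0, 140.0, np.nan, 900.0]})

# Partition first (a simple positional split, assumed for illustration).
train = data.iloc[:4]
valid = data.iloc[4:]

# Learn the imputation value from the training partition only.
train_median = train["footfall"].median()

# Apply the same training-derived value to both partitions.
train_imputed = train.fillna({"footfall": train_median})
valid_imputed = valid.fillna({"footfall": train_median})

# For comparison: the median over all rows is pulled upward by the
# validation value of 900, which is information the model should not see.
leaky_median = data["footfall"].median()

print(train_median, leaky_median)
print(valid_imputed)
```

Imputing before the split would have filled the training gaps with a value influenced by the validation rows, which is exactly the encoding of validation statistics that partitioning first avoids.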
Feature engineering is often a critical step of data preparation. A foundational understanding of your raw data enables you to derive robust, trustworthy new features. In DataRobot, the Paxata data preparation tool allows you to perform version control and track snapshots of your data, so you always know what data was used to train a model and what feature engineering or data cleaning steps were applied to it first.