Blending Datasets with DataRobot AI Catalog
Today we are going to share how you can modify a single dataset or blend multiple datasets using the DataRobot AI Catalog.
The AI Catalog is a centralized hub to store materialized and virtual datasets that can be used for model training and batch predictions.
You can access the AI Catalog by clicking on that tab located on the main page (Figure 1). DataRobot will take you to the AI Catalog where you can browse all the datasets to which you have access to (Figure 2).
Figure 1. Accessing the AI Catalog from the AI Catalog tab
Figure 2. The AI Catalog showing some datasets
If you want to use a modified version of the data available in the AI Catalog or you want to combine multiple files into one, you can use the AI Catalog Spark SQL functionality. The results of these combinations are called “Blended Datasets.”
The Spark SQL queries used can range from a simple single table manipulation query to complex multi-table manipulation queries.
Let’s first use a Spark SQL query to perform a simple single table manipulation query. The data we are going to use has information about Lending Club customers and their credit risk. It includes a number of attributes of these customers as shown in Figure 3.
Figure 3. A sample of the Lending Club dataset highlighting the annual_inc variable
We want to create a DataRobot project from this dataset, after removing all borrowers with an annual income (annual_inc) less than $40,000.
- Click Add to catalog and select Prepare data with — Spark SQL (Figure 4).
Figure 4. Select Add to catalog (1) and then Prepare Data with Spark SQL (2)
- In the Select tables from the Catalog for blending pop-up window, you can search for the dataset you want to modify (Figure 5).
Figure 5. Selecting the data to be used in the Spark SQL query
- Click the Select button next to the dataset you want, then click Add selected data.
This opens the Spark SQL command window where you can define your query (Figure 6).
Figure 6. The Define Spark SQL Query pop-up windowIn Figure 6, you see: (1) editor to define query, (2) Run button to execute the query, (3) the Result button to get a preview of resulting data, (4) the Settings button, and (5) the button for materializing the new dataset.
- Write your desired Spark SQL query in the SQL editor. For this example, our query is:
SELECT * from Lending_Club WHERE annual_inc > 40000
- Click Run.
- You should see some results come up under the Result tab.
- Click Settings and give the new dataset a name.
- Add this new dataset to the AI Catalog by clicking Create blended dataset.
In this second example, we will blend two separate datasets: a Customer Spending dataset and an associated Customer Profile dataset. The Customer Spending dataset contains information about the spending habits of each customer in terms of the retail category they shopped in and the amount of money they spent there (Figure 7).
Figure 7. Sample from the Customer Spending dataset
The Customer Profile dataset contains information that some of these customers entered when they were applying for a loan (Figure 8).
Figure 8. Sample from the Customer Profile dataset
In order to blend these two datasets, follow the steps provided for Example 1 but use this Spark SQL query instead (as shown in Figure 9):
‘SELECT cp.CustomerID, Amount, loan_amnt, zip_code, addr_state FROM Customer_Spending cs LEFT JOIN Customer_Profile cp ON cs.CustomerID=cp.CustomerID’
Figure 9. Sample query for merging multiple datasets using the Spark SQL query
Also make sure you select the “Customer Spending” and “Customer Profile” datasets when identifying which datasets to include for blending (Figure 10).
Figure 10. Selecting multiple datasets for blending
If you’re a licensed DataRobot customer, search in-app documentation for AI Catalog.