Blending Datasets with DataRobot AI Catalog

April 17, 2020
by
· 3 min read

This post was originally part of the DataRobot Community. Visit now to browse discussions and ask questions about DataRobot, AI Platformdata science, and more.

Today we are going to share how you can modify a single dataset or blend multiple datasets using the DataRobot AI Catalog.

The AI Catalog is a centralized hub to store materialized and virtual datasets that can be used for model training and batch predictions.

You can access the AI Catalog by clicking on that tab located on the main page (Figure 1). DataRobot will take you to the AI Catalog where you can browse all the datasets to which you have access to (Figure 2).

Figure 1. Accessing the AI Catalog from the AI Catalog tabFigure 1. Accessing the AI Catalog from the AI Catalog tab

Figure 2. The AI Catalog showing some datasetsFigure 2. The AI Catalog showing some datasets

If you want to use a modified version of the data available in the AI Catalog or you want to combine multiple files into one, you can use the AI Catalog Spark SQL functionality. The results of these combinations are called “Blended Datasets.”

The Spark SQL queries used can range from a simple single table manipulation query to complex multi-table manipulation queries.

Example 1

Let’s first use a Spark SQL query to perform a simple single table manipulation query. The data we are going to use has information about Lending Club customers and their credit risk. It includes a number of attributes of these customers as shown in Figure 3.

Figure 3. A sample of the Lending Club dataset highlighting the annual_inc variableFigure 3. A sample of the Lending Club dataset highlighting the annual_inc variable

We want to create a DataRobot project from this dataset, after removing all borrowers with an annual income (annual_inc) less than $40,000.

  1. Click Add to catalog and select Prepare data with — Spark SQL (Figure 4).

    Figure 4. Select Add to catalog (1) and then Prepare Data with Spark SQL (2)Figure 4. Select Add to catalog (1) and then Prepare Data with Spark SQL (2)
  2. In the Select tables from the Catalog for blending pop-up window, you can search for the dataset you want to modify (Figure 5).

    Figure 5. Selecting the data to be used in the Spark SQL queryFigure 5. Selecting the data to be used in the Spark SQL query
  3. Click the Select button next to the dataset you want, then click Add selected data.

    This opens the Spark SQL command window where you can define your query (Figure 6).

    Figure 6. The Define Spark SQL Query pop-up windowFigure 6. The Define Spark SQL Query pop-up windowIn Figure 6, you see: (1) editor to define query, (2) Run button to execute the query, (3) the Result button to get a preview of resulting data, (4) the Settings button, and (5) the button for materializing the new dataset.
  4. Write your desired Spark SQL query in the SQL editor. For this example, our query is:
SELECT * from Lending_Club WHERE annual_inc > 40000​
  1. Click Run.
  2. You should see some results come up under the Result tab.
  3. Click Settings and give the new dataset a name.
  4. Add this new dataset to the AI Catalog by clicking Create blended dataset.

Example 2

In this second example, we will blend two separate datasets: a Customer Spending dataset and an associated Customer Profile dataset. The Customer Spending dataset contains information about the spending habits of each customer in terms of the retail category they shopped in and the amount of money they spent there (Figure 7).

Figure 7. Sample from the Customer Spending datasetFigure 7. Sample from the Customer Spending dataset

The Customer Profile dataset contains information that some of these customers entered when they were applying for a loan (Figure 8).

Figure 8. Sample from the Customer Profile datasetFigure 8. Sample from the Customer Profile dataset

In order to blend these two datasets, follow the steps provided for Example 1 but use this Spark SQL query instead (as shown in Figure 9):

‘SELECT cp.CustomerID, Amount, loan_amnt, zip_code, addr_state
FROM Customer_Spending cs
LEFT JOIN Customer_Profile cp ON cs.CustomerID=cp.CustomerID’

Figure 9. Sample query for merging multiple datasets using the Spark SQL queryFigure 9. Sample query for merging multiple datasets using the Spark SQL query

Also make sure you select the “Customer Spending” and “Customer Profile” datasets when identifying which datasets to include for blending (Figure 10).

Figure 10. Selecting multiple datasets for blendingFigure 10. Selecting multiple datasets for blending

More information

If you’re a licensed DataRobot customer, search in-app documentation for AI Catalog.

Documentation
Prepare data in AI Catalog with Spark SQL
Learn more

About the author
Linda Haviland
Linda Haviland

Community Manager

Meet Linda Haviland
  • Listen to the blog
     
  • Share this post
    Subscribe to DataRobot Blog
    Newsletter Subscription
    Subscribe to our Blog