DataRobot & Snowflake Data Marketplace: The Perfect Complement
DataRobot enables organizations to leverage the transformational power of machine learning by delivering the world’s most trusted enterprise AI platform that spans the full lifecycle from data to value.
DataRobot created the AutoML category and has since expanded its capabilities to accept heterogeneous data formats, including image, geospatial, text, time series, and many others. New algorithms are constantly being added to the platform, from classic linear regression to adaptive neural networks whose architecture is configured automatically via intelligent search. As new algorithms prove their merit, they are incorporated into DataRobot's algorithm repository, and the platform continues to adapt and evolve even after models are deployed, through a concept called continuous learning. DataRobot wraps all of this into an easy-to-use, enterprise-grade platform.
However, we'd like to let you in on a little secret. Although the DataRobot platform makes leading-edge AI technology available to customers, the biggest bang-for-the-buck strategy for improving model accuracy is often not the shiny new algorithm. Rather, the best strategy is to enrich your modeling data by incorporating additional data sources that are predictive and not captured in your existing methodology.
With support for all data types and Secure Data Sharing capabilities, Snowflake's Data Cloud is designed to effortlessly store and provide your models with all relevant data, even data outside your four walls. Secure Data Sharing allows you to share data across your ecosystem of partners, suppliers, and customers, and powers the Snowflake Data Marketplace, where users can discover and access third-party data and data services from over 150 providers (and growing).
Snowflake Data Marketplace
The Snowflake Data Marketplace has a rich ecosystem of third-party data and services providing ready-to-query data that can enrich a predictive model's feature set, with DataRobot serving as a rapid prototyping platform. When integrated with your primary dataset, the new features from these third-party data sources can often significantly improve the predictive signal and overall accuracy of the machine learning model in development.
Forecasting In-Store Foot Traffic
To illustrate this, we will look at a real example.
Here, we have a model that predicts foot traffic for a chain of coffee shops. The ability to forecast ahead of time how many customers will visit a coffee shop allows managers to better allocate staff and supplies to accommodate demand and increase profits.
In this case, the base data (or primary dataset) contains the history of every customer's visits to the chain and includes features such as city, latitude/longitude, date and time, and so on. To supplement this data, we conveniently discovered that the Snowflake Data Marketplace has corresponding weather information that can easily be joined by date and location. Because the marketplace data records the actual weather realized on each day (information that would not yet exist at forecast time), joining it as-is would introduce target leakage. We therefore lag the weather features; using lagged or forecasted weather ensures the model is trained only on information that would actually be available when the prediction is made.
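The lagging step described above can be sketched in a few lines of pandas. Note that the column names (`store_id`, `temperature`) and the sample values here are purely illustrative, not taken from the actual coffee-shop dataset:

```python
import pandas as pd

# Illustrative daily weather observations per store location.
weather = pd.DataFrame({
    "store_id": ["A", "A", "A", "B", "B", "B"],
    "date": pd.to_datetime(["2021-03-01", "2021-03-02", "2021-03-03"] * 2),
    "temperature": [12.0, 15.0, 9.0, 20.0, 18.0, 22.0],
})

# Shift each store's weather forward one day, so the row for date d
# carries the weather observed on d-1: information that would have
# been available when forecasting foot traffic for day d.
weather = weather.sort_values(["store_id", "date"])
weather["temperature_lag1"] = weather.groupby("store_id")["temperature"].shift(1)

# Join the lagged value (never the same-day value) onto the visits data.
visits = pd.DataFrame({
    "store_id": ["A", "B"],
    "date": pd.to_datetime(["2021-03-02", "2021-03-03"]),
    "visits": [140, 210],
})
enriched = visits.merge(
    weather[["store_id", "date", "temperature_lag1"]],
    on=["store_id", "date"],
    how="left",
)
```

Joining on the lagged column rather than the raw same-day observation is what removes the leak: the model never sees weather it could not have known at prediction time.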
To create a baseline, we first run a DataRobot project to forecast foot traffic using the base dataset only. When DataRobot’s Autopilot process is complete, we end up with a root mean square error (RMSE) of 11.6.
Data Source Comparison
Although the RMSE of 11.6 is good, we want to improve the overall accuracy of our machine learning model. To do this, we turn to the weather dataset from Snowflake’s Data Marketplace and join it to our base dataset. The process is simple, and if you have a Snowflake account, getting data from the Snowflake Data Marketplace involves only a few clicks. Powered by Snowflake’s data-sharing technology, data from the Snowflake Data Marketplace allows users to query data instantaneously without additional ETL processing steps.
After joining our base dataset with the weather forecasts and rerunning our DataRobot project, we find that the RMSE has now dropped to 9.7.
This means that, by enriching our original dataset using Snowflake's Data Marketplace, we are able to get a roughly 16% improvement in accuracy. To highlight this, we leverage DataRobot's ability to create multiple feature lists so we can easily mix and match models run on different data sources. Notice that the two models above use the same modeling algorithm and the same training and test data but different lists of features. This allows us to change only the variables fed into the DataRobot model, thus creating a measurable apples-to-apples comparison between the base data and the enriched data with weather added.
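Because the two runs hold the algorithm, training data, and test data fixed and vary only the feature list, the accuracy gain can be read directly off the two RMSE values:

```python
# Relative improvement in RMSE from enriching the base data with weather.
baseline_rmse = 11.6   # base dataset only
enriched_rmse = 9.7    # base dataset + lagged weather features

improvement_pct = (baseline_rmse - enriched_rmse) / baseline_rmse * 100
print(f"{improvement_pct:.1f}% lower RMSE")  # roughly 16.4%
```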
We can also leverage DataRobot’s model comparison to give us a more in-depth view of how the model compares across multiple metrics.
DataRobot and Snowflake Feature Discovery
When enriching your data with data from the Snowflake Data Marketplace, you are most likely performing the joins, aggregations, and feature generation upstream of the modeling process, typically by hand in SQL.
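For readers doing this step manually today, the upstream work described above often looks something like the following pandas sketch: aggregate raw records to the modeling grain, then hand-build trailing features that a SQL user would express as window functions. All table and column names here are hypothetical:

```python
import pandas as pd

# Hypothetical raw per-visit records for one store.
visits = pd.DataFrame({
    "store_id": ["A", "A", "A", "A"],
    "date": pd.to_datetime(
        ["2021-03-01", "2021-03-01", "2021-03-02", "2021-03-03"]
    ),
})

# Aggregate to the modeling grain: one row per store per day.
daily = (
    visits.groupby(["store_id", "date"])
          .size()
          .rename("visits")
          .reset_index()
          .sort_values(["store_id", "date"])
)

# Hand-built derived features (previous-day count, trailing mean).
# Both are shifted by one day so they use only past information.
daily["visits_prev_day"] = daily.groupby("store_id")["visits"].shift(1)
daily["visits_3d_mean"] = (
    daily.groupby("store_id")["visits"]
         .transform(lambda s: s.shift(1).rolling(3, min_periods=1).mean())
)
```

Multiply this by hundreds of candidate features across several joined tables and the time cost of the manual approach becomes clear, which is the burden the next capability addresses.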
DataRobot and Snowflake can relieve the burden of this time-intensive, manual activity. For instance, if you're not completely sure how to join your datasets, or lack the time to manually generate and test hundreds of derived features, DataRobot's powerful automation can help. Working with the Snowflake engineering team, the DataRobot team built a revolutionary capability called Automated Feature Discovery. Automated Feature Discovery allows you to connect to datasets from the Snowflake Data Marketplace while you are building your model, rather than doing so manually upstream. Using automated and AI-assisted capabilities, DataRobot identifies the relationships between your datasets and then automatically joins them together, creating and testing a staggering variety of model features for you. All you need to do is kick off the DataRobot project using Snowflake source data; a subsequent call is then made to Snowflake whereby the feature engineering processing is pushed down to use Snowflake's compute resources.
When the Automated Feature Discovery process finishes, the new dataset contains the features from all of the original sources, along with all of the newly engineered features identified by DataRobot for predictive modeling. Not only can you bring in numerous additional data sources via the Snowflake Data Marketplace, but you can also keep all of the processing required to build the new features in the same environment (Snowflake) that stores the data. This removes the need to move data back and forth between your data management and machine learning environments. It also increases productivity by allowing DataRobot to do the heavy lifting of discovering and generating all of the features you need for modeling.
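To build intuition for the relationship-identification idea, here is a deliberately simplified toy heuristic. It is not DataRobot's actual algorithm, just an illustration of how candidate join keys between two tables might be proposed automatically (same-named columns whose value sets overlap substantially):

```python
import pandas as pd

def candidate_join_keys(df1: pd.DataFrame, df2: pd.DataFrame,
                        min_overlap: float = 0.5) -> list[str]:
    """Toy heuristic, not DataRobot's algorithm: propose join keys as
    same-named columns whose value sets overlap by at least min_overlap."""
    keys = []
    for col in set(df1.columns) & set(df2.columns):
        left, right = set(df1[col].dropna()), set(df2[col].dropna())
        if left and right:
            overlap = len(left & right) / min(len(left), len(right))
            if overlap >= min_overlap:
                keys.append(col)
    return sorted(keys)

# Hypothetical base and marketplace tables sharing store_id and date.
base = pd.DataFrame({
    "store_id": ["A", "B"],
    "date": pd.to_datetime(["2021-03-01", "2021-03-01"]),
    "visits": [120, 95],
})
weather = pd.DataFrame({
    "store_id": ["A", "B"],
    "date": pd.to_datetime(["2021-03-01", "2021-03-01"]),
    "temperature": [14.0, 11.0],
})

print(candidate_join_keys(base, weather))  # ['date', 'store_id']
```

A production system has to do far more (fuzzy name matching, type checks, temporal alignment, many-to-one validation), which is exactly the work Automated Feature Discovery automates and pushes down to Snowflake.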
DataRobot and Snowflake: Getting the Most Value Out of Your Data
We've shown you just one of the benefits of a DataRobot and Snowflake integration, using a simple foot-traffic forecasting example that incorporates weather enrichment data. But the power of enriching data for improved predictive models extends to nearly all industries and use cases. For instance, portfolio managers at investment management firms are enriching market data with fundamental data from FactSet or with risk factor data from MSCI. Healthcare providers are leveraging HIPAA-compliant health data for improved clinical effectiveness. Software developers are combining new forms of user experience data with machine learning to identify drivers of customer behavior and inform product decisions. Companies everywhere are beginning to use alternative data to better understand their customers.
In business, speed is of the essence, and companies need to quickly evaluate whether value can be extracted from the vast amount of data available. In combination with Snowflake and the Snowflake Data Marketplace, DataRobot can serve as the perfect rapid prototyping platform.
Learn more about what can be accomplished with DataRobot and Snowflake here!