Enabling Tomorrow’s Business Analysts with Predictive Analytics: Part Two
Thanks to Dr. Kai R. Larsen, Associate Professor of Information Management, University of Colorado, for contributing this two-part guest blog.
Part 2: Taking Advantage of DataRobot and Alteryx to Overcome Barriers to Success
As I laid out in Part 1 of this blog, being a university professor attempting to equip students across all disciplines with predictive analytics skills has its challenges. Operationalizing predictive analytics is fraught with complexities, and getting data ready for predictive analytics in the first place comes with its own set of challenges. But I’ve found success overcoming these challenges by bringing two tools into the classroom: DataRobot and Alteryx.
The conceptual goals of predictive analytics and its complexities can be addressed through the use of DataRobot. I believe that at this point, it is the only tool in the industry delivering automatic algorithm-specific preprocessing, understanding of algorithms, and automatic evaluation and selection of algorithms. While DataRobot provides advanced functionality that would likely stump and outperform most recent graduates of MS in Analytics programs, its surface characteristics are such that it could easily be used to teach predictive analytics to all undergraduate students.
The tool takes a square matrix of data (a single flat table with one row per case and one column per feature), so it assumes that data blending has already taken place. The data is uploaded into the cloud environment, and DataRobot immediately goes to work calculating basic statistics. The user is then asked to select the target variable and to click the big round “easy-button.” The system then sets aside a random holdback sample of 20% of the data for eventual evaluation of solution accuracy, splits the remaining data into folds for 5-fold cross-validation, characterizes the target variable, pre-processes the data in a number of different ways depending on the class of algorithm, and runs through a large sample of top-performing algorithms with 16% of the available data. The algorithms, each paired with its most common pre-processing step, are then ranked in a leaderboard based on their performance on the fifth fold of the data (the part not used to build the model in one round of five-fold cross-validation). Once this round is over, the tool automatically selects algorithms to continue on with 32% of the data, and if, for example, one of the random forest algorithms outperformed other algorithms, then multiple versions of that algorithm are instantiated with different pre-processing steps. This is repeated before moving on to 64% of the data.
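The percentages above fit together arithmetically: a 20% holdback leaves 80% of the rows, five folds of that are 16% each, and the four folds not used for validation add up to 64%. The following is a minimal Python sketch of that partitioning, not DataRobot's actual implementation; the function name `partition`, the row count of 1,000, and the fixed seed are assumptions made purely for illustration.

```python
import random

def partition(n_rows, holdout_frac=0.20, n_folds=5, seed=0):
    """Split row indices into a holdback sample and n_folds cross-validation folds."""
    rng = random.Random(seed)
    indices = list(range(n_rows))
    rng.shuffle(indices)
    n_holdout = round(n_rows * holdout_frac)
    holdout = indices[:n_holdout]
    rest = indices[n_holdout:]
    # Deal the remaining rows round-robin into equally sized folds.
    folds = [rest[i::n_folds] for i in range(n_folds)]
    return holdout, folds

# Hypothetical dataset of 1,000 rows.
holdout, folds = partition(1000)

# In one round of cross-validation, the fifth fold is used for ranking
# and the other four folds form the training pool (64% of all rows).
validation = folds[-1]
pool = [i for fold in folds[:-1] for i in fold]

# Training samples for the three rounds: 16%, 32%, and 64% of all rows.
rounds = {pct: pool[: len(pool) * pct // 64] for pct in (16, 32, 64)}

for pct, rows in rounds.items():
    print(f"{pct}% round trains on {len(rows)} rows, "
          f"validates on {len(validation)} rows")
```

With 1,000 rows this yields a 200-row holdback sample, a 160-row validation fold, and training samples of 160, 320, and 640 rows.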
Throughout this process, the user gets to observe the gladiatorial sport of the different algorithms jockeying for supremacy on the Amazon cloud-powered platform. Finally, with 64% of the data used for training (the remaining 16% forms the validation fold and 20% the holdback sample), the best algorithms are “blended” together using four different auto-blenders, which are themselves ranked in the leaderboard. The tool now allows use of the best algorithms to predict new cases, e.g., “What is the probability that John, whose cell phone contract expires tomorrow, will actually take his business somewhere else?” The models may be explored in terms of their key attributes, with opportunities to discuss confusion matrices, precision and recall, the area under the ROC curve, textual features driving the target variable, and the most predictive features with their curves and patterns.
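For classroom discussion of these evaluation measures, it can help to compute them by hand on a tiny example. The sketch below is illustrative only and does not reflect how DataRobot computes its metrics; the churn labels and predicted probabilities are made up, and the 0.5 threshold is an assumption.

```python
def confusion_matrix(y_true, y_prob, threshold=0.5):
    """Return (tp, fp, tn, fn) after thresholding predicted probabilities."""
    tp = fp = tn = fn = 0
    for y, p in zip(y_true, y_prob):
        predicted = 1 if p >= threshold else 0
        if predicted == 1 and y == 1:
            tp += 1
        elif predicted == 1 and y == 0:
            fp += 1
        elif predicted == 0 and y == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn

def precision_recall(tp, fp, fn):
    """Precision: share of predicted churners who churned. Recall: share of churners caught."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

def roc_auc(y_true, y_prob):
    """Area under the ROC curve, computed as the probability that a random
    positive case is scored above a random negative case (ties count half)."""
    pos = [p for y, p in zip(y_true, y_prob) if y == 1]
    neg = [p for y, p in zip(y_true, y_prob) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Made-up data: 1 = customer churned, 0 = customer stayed.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_prob = [0.9, 0.2, 0.6, 0.4, 0.7, 0.1, 0.8, 0.3]

tp, fp, tn, fn = confusion_matrix(y_true, y_prob)
precision, recall = precision_recall(tp, fp, fn)
auc = roc_auc(y_true, y_prob)
print(tp, fp, tn, fn, precision, recall, auc)
```

On this toy data the confusion matrix has 3 true positives, 1 false positive, 3 true negatives, and 1 false negative, giving precision and recall of 0.75 each and an AUC of 0.875.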
Simply put, my goal with DataRobot is to create two-hour analysts: workers who can take a square matrix, conduct predictive analytics, and create a presentation for management, all in the space of two hours. Potentially, three such projects and their presentations can, under the best of circumstances, be completed in a day, each analysis comparable with what used to take months of effort by the most expensive business analytics experts. This is the conceptual part of the puzzle. Students will not be experts on any one algorithm beyond what they may have already learned in their statistics class(es), and such expertise becomes meaningless anyway when 50 algorithms are involved. Let’s all agree that as long as proper cross-validation methodology and evaluation of metrics are taught, black boxes are our friends.
Now comes the tricky part. I never thought I’d say that the actual predictive analytics would be the easy part, but such is the new world of predictive analytics. I suggested that there were three challenges to overcome when data isn’t already in neat square matrix form. Connecting to the vast majority of data should be easy and mostly automatic beyond pointing the tool to the location of the data and authenticating if necessary. Several tools now do that; after trying a number of them, my favorite is Alteryx. It provides a workflow-based process with every tool you could ever wish for, ranging from data access and transformation to predictive analytics, geographic evaluations, and customer data access. I like it for its easygoing, quiet competence and ability to handle data ranging from small to truly large. It provides a visual interface for joining data, which with properly developed hands-on exercises is reasonably straightforward (though absolutely harder to learn in four weeks). It allows inner, left, right, and full outer joins as well as summarization through “group by” approaches, which should take care of the vast majority of functionality not already automated by DataRobot.
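For readers who have not met joins and “group by” summarization before, the ideas can be sketched in a few lines of plain Python. This is not how Alteryx works internally (there the same steps are built visually); the `customers` and `calls` tables, their column names, and the helper functions are all hypothetical. Only inner and left joins are shown; right and full outer joins follow the same pattern.

```python
from collections import defaultdict

# Hypothetical tables: one row per customer, many rows per customer in calls.
customers = [
    {"customer_id": 1, "plan": "basic"},
    {"customer_id": 2, "plan": "premium"},
    {"customer_id": 3, "plan": "basic"},
]
calls = [
    {"customer_id": 1, "minutes": 12},
    {"customer_id": 1, "minutes": 30},
    {"customer_id": 2, "minutes": 5},
    {"customer_id": 4, "minutes": 8},
]

def summarize(rows, key, value):
    """'Group by' key and sum value, returning {key: total}."""
    totals = defaultdict(int)
    for row in rows:
        totals[row[key]] += row[value]
    return dict(totals)

def join(left_rows, lookup, key, value_name, how="inner"):
    """Attach lookup[key] to each left row. An inner join drops rows
    without a match; a left join keeps them with a missing value."""
    if how not in ("inner", "left"):
        raise ValueError("how must be 'inner' or 'left'")
    joined = []
    for row in left_rows:
        if row[key] in lookup:
            joined.append({**row, value_name: lookup[row[key]]})
        elif how == "left":
            joined.append({**row, value_name: None})
    return joined

# Summarize first so that each customer ends up as exactly one row.
minutes = summarize(calls, "customer_id", "minutes")
inner = join(customers, minutes, "customer_id", "total_minutes")
left = join(customers, minutes, "customer_id", "total_minutes", how="left")
print(inner)
print(left)
```

Summarizing before joining matters here: it keeps one row per customer, which is the flat-table shape that the modeling step expects. The inner join keeps customers 1 and 2; the left join also keeps customer 3, with no call total.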
The fact that Alteryx now comes with two DataRobot tools (the Alteryx Connector for DataRobot), one for training a model and one for scoring new instances, should make enterprises take note. For the first time, there exists a tool that makes A-Z analytics easy enough for any undergraduate to be ready to perform in line with all but the best data science competitors. Here is something for everyone.
The value of DataRobot became very clear to me when I used it in my own research work. I had a highly technical PhD student work with me on one predictive analytics task for a project that took two to three months. Then we pulled the same data into DataRobot to compare results, and in one hour, DataRobot had outperformed the PhD student by a factor of two, simply because he had missed a class of algorithms that worked really well for the data in question and had not thought to balance the training data. In this case, I was able to effectively use Alteryx for downsampling before uploading the data to DataRobot. Since then, I’ve noticed that DataRobot has gotten better at appropriately downsampling, thereby making my role as an expert less important.
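Balancing training data by downsampling means randomly discarding rows from the larger class until the classes are the same size. The sketch below shows the idea in plain Python; it is neither the Alteryx workflow used in the project nor DataRobot's own procedure, and the `churned` column, the 90/10 class split, and the fixed seed are assumptions for illustration.

```python
import random
from collections import defaultdict

def downsample(rows, label_key, seed=0):
    """Randomly shrink every class to the size of the smallest class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for row in rows:
        by_class[row[label_key]].append(row)
    n_smallest = min(len(group) for group in by_class.values())
    balanced = []
    for group in by_class.values():
        # Sample without replacement so that no row is duplicated.
        balanced.extend(rng.sample(group, n_smallest))
    rng.shuffle(balanced)
    return balanced

# Hypothetical imbalanced training data: 90 stayers and 10 churners.
rows = [{"id": i, "churned": 0} for i in range(90)]
rows += [{"id": 90 + i, "churned": 1} for i in range(10)]

balanced = downsample(rows, "churned")
counts = defaultdict(int)
for row in balanced:
    counts[row["churned"]] += 1
print(dict(counts))
```

A common practice is to balance only the training rows and leave the evaluation sample at its natural class proportions, so that the reported accuracy reflects the data the model will actually encounter.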
Therein lies the value of both Alteryx and DataRobot: they represent an opportunity to endow current students and future leaders with decades of analytical know-how at a fraction of the cost. There is little doubt that predictive analytics will find a way into the core of most business schools. What remains is simply a question of which colleges will be first to empower their students with outsized advantages as they move, as analytics-trained employees, into enterprises with many more targets of opportunity.