DataRobot at PyData Silicon Valley 2014: Highlights, Tutorial and Slides

May 7, 2014


We just returned from a great weekend in Menlo Park, CA (at Facebook’s HQ) where we attended PyData Silicon Valley 2014. The conference brought together top users and developers of data analysis tools in Python. It was a great place to share ideas on how to best apply the language and tools to address challenges in data management, processing, analytics and visualization.

We’d like to first thank NumFOCUS and all the volunteers for organizing this great event, and Facebook for hosting it at its headquarters.

This was our third time attending PyData, and this time we were especially looking forward to hearing how companies like Facebook use Python to analyze petabytes of data.

 

Here are some of our highlights from the conference:

Jason Sundram of Facebook gave a fantastic talk on “A Full Stack Approach to Data Visualization: Terabytes (and Beyond) at Facebook”. Jason demonstrated how he turns terabytes of Facebook data into compelling, interactive, data-driven applications. He showed the audience the different types of visualizations his team creates and, in particular, how some of them are displayed on a massive panel wall at Facebook.

Another talk we really enjoyed was by Greg Lamp from YHat on “ggplot for python”. Greg explained how ggplot provides a high-level grammar that allows users to quickly and easily make plots that are actually visually appealing. He showed how ggplot works by analyzing a dataset of baseball pitches and revealing the vulnerabilities of certain players (such as a low hit rate in certain regions of the strike zone) in an intuitive plot. You can view his whole tutorial here.

 

DataRobot and PyData

One of our data scientists, Peter Prettenhofer, led a tutorial on Friday on his favorite data-science algorithm, Gradient Boosted Regression Trees (GBRT). GBRT is a powerful statistical learning technique with applications in a variety of areas, ranging from web page ranking to environmental niche modeling. This algorithm is a key ingredient of many winning solutions in data-mining competitions such as the Netflix Prize, the GE Flight Quest, and the Heritage Health Prize.

Peter is the primary author of a popular implementation of Gradient Boosted Regression Trees in the Python machine learning toolkit, scikit-learn. He began his talk with a brief introduction to the GBRT model [slides here] and continued with an in-depth tutorial dedicated to applying GBRT successfully in practice using scikit-learn. He covered topics including regularization, model tuning, and model interpretation, all of which can significantly improve your score on Kaggle.

Peter ran the tutorial through an IPython notebook, which you can download here.
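To give a flavor of the three topics named above (regularization, model tuning, and model interpretation), here is a minimal sketch using scikit-learn's `GradientBoostingRegressor`. It is not taken from Peter's notebook: the synthetic Friedman #1 dataset, the parameter grid, and all variable names are assumptions chosen purely for illustration.

```python
import numpy as np
from sklearn.datasets import make_friedman1
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic regression data (an assumption for illustration, not the
# tutorial's dataset). Only the first 5 of the 10 features carry signal.
X, y = make_friedman1(n_samples=500, n_features=10, noise=1.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Regularization: subsample < 1.0 fits each tree on a random fraction of the
# training rows (stochastic gradient boosting). Shrinkage (learning_rate)
# and tree depth (max_depth) are tuned in the grid search below.
base = GradientBoostingRegressor(n_estimators=200, subsample=0.8, random_state=0)

# Model tuning: a small, hypothetical grid searched with 3-fold cross-validation.
param_grid = {"learning_rate": [0.05, 0.1], "max_depth": [2, 3]}
search = GridSearchCV(base, param_grid, cv=3)
search.fit(X_train, y_train)

model = search.best_estimator_
test_r2 = model.score(X_test, y_test)  # R^2 on the held-out data

# Model interpretation: impurity-based feature importances (they sum to 1).
importances = model.feature_importances_
ranking = np.argsort(importances)[::-1]  # feature indices, most important first

print("best params:", search.best_params_)
print("held-out R^2: %.3f" % test_r2)
print("features ranked by importance:", ranking.tolist())
```

In a real project the grid would usually be wider and would also cover `n_estimators`, since a lower learning rate generally needs more trees to reach the same fit.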

 

In addition to leading this tutorial session, we also sponsored and exhibited. We were thrilled to give demos of the DataRobot beta product to many attendees. If you attended the event, saw our demo and would like to try it yourself, request an invite to our beta program.

 

Until next time, PyData.