Journey to Apache Spark

September 27, 2016

In a previous blog post, I mentioned that Shachar Harussi and I discussed the lessons we learned building the Apache Spark-based architecture for the Paxata platform at DataVersity in Chicago this week. I can’t wait to talk about them this week at Strata NYC. Come by booth #301 and ask me more about these topics!

Here is a second peek into what we talked about – refer to this post to get some context.

Why use Apache Spark for data preparation?

Data preparation includes gathering and exploring millions of records, cleaning up missing data and invalid formats, transforming and combining data with other datasets, correcting mistakes and understanding outliers, and filtering and segmenting datasets down to the data that matters.
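To make a few of these steps concrete, here is a minimal sketch in plain Python (not Spark, and not the Paxata implementation). The sample records, field names, and the revenue threshold of 100 are all hypothetical, chosen only to illustrate cleaning missing values, flagging invalid formats, and filtering down to a segment.

```python
from datetime import datetime

# Hypothetical sample records; field names are illustrative only.
records = [
    {"customer": "Acme", "signup": "2016-09-01", "revenue": "1200"},
    {"customer": "", "signup": "2016-09-03", "revenue": "300"},        # missing name
    {"customer": "Globex", "signup": "09/05/2016", "revenue": "n/a"},  # invalid formats
    {"customer": "Initech", "signup": "2016-09-07", "revenue": "50"},
]

def parse_date(text):
    """Return a date for YYYY-MM-DD text, or None if the format is invalid."""
    try:
        return datetime.strptime(text, "%Y-%m-%d").date()
    except ValueError:
        return None

def parse_number(text):
    """Return a float, or None if the text is not numeric."""
    try:
        return float(text)
    except ValueError:
        return None

def clean(record):
    """Fill in a missing name and convert text fields to typed values."""
    return {
        "customer": record["customer"] or "UNKNOWN",
        "signup": parse_date(record["signup"]),
        "revenue": parse_number(record["revenue"]),
    }

cleaned = [clean(r) for r in records]

# Filter down to the data that matters: valid rows with revenue of at least 100.
segment = [
    r for r in cleaned
    if r["signup"] is not None and r["revenue"] is not None and r["revenue"] >= 100
]
```

On four rows this is trivial. The challenge described in the rest of this post is doing the same kind of work interactively across millions of rows.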

So, why use Spark? When Paxata was founded, we knew that there were important things to consider for how people want to prepare their data:

  1. Business analysts need to see their data in rows and columns, not a bunch of boxes and arrows.
  2. Business analysts need to prepare all of their data for their eventual analysis, not just some of it.
  3. As business analysts prepare their data (cleaning, combining, fixing, re-shaping) they need to see changes reflected in the data immediately.

So, we needed to build an experience that would show lots of data, react quickly, calculate changes across huge datasets, and scale to millions (billions!) of rows. We designed an interactive spreadsheet-like experience to address the first point. Now, to be smart, scalable, and speedy, we needed to investigate available open-source projects in distributed computing.

The image below demonstrates the interactive experience of the Paxata self-service data preparation app with semantic column profiles, numerical range filtering, and text searching for millions of records.

[Image: search view of the Paxata app]

Spark vs the Other Distributed Systems Projects

We considered projects along the spectrum of storage to pure computing.

[Image: projects along the spectrum from storage to pure computing]

Distributed file systems like Alluxio (previously Tachyon) and Ceph addressed scalability concerns for “big data” problems, but did not address an analyst’s need to interact with the data and make changes.

Databases like Apache HBase and Apache Cassandra offered scalable data storage as well as real-time querying and filtering, which was an improvement on pure storage. These projects showed potential; they could accommodate an analyst zooming around their data, scrolling across millions of records with relative ease. But what would happen if data had to be reshaped, aggregated, and pivoted?

The image below demonstrates the interactive experience of the Paxata self-service data preparation app with a number of shaping options, including deduplicate, group by, transpose, pivot, and depivot. Depivot is shown here, with new values calculated on the fly based on the chosen parameters.

[Image: shaping options in the Paxata app, with depivot selected]
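For readers unfamiliar with depivot, the operation turns selected columns of a wide table into rows of key/value pairs. The following is a minimal plain-Python sketch of that reshaping; the function name `depivot`, its parameters, and the sample data are hypothetical and are not taken from the Paxata product or from Spark.

```python
def depivot(rows, id_columns, value_columns, key_name="attribute", value_name="value"):
    """Turn each selected column of a wide row into its own (key, value) row."""
    result = []
    for row in rows:
        for column in value_columns:
            # Carry the identifying columns over unchanged.
            long_row = {name: row[name] for name in id_columns}
            # The former column name becomes a value in the key column.
            long_row[key_name] = column
            long_row[value_name] = row[column]
            result.append(long_row)
    return result

# Hypothetical wide table: one row per region, one column per quarter.
wide = [
    {"region": "East", "q1": 10, "q2": 15},
    {"region": "West", "q1": 7, "q2": 9},
]

long_rows = depivot(wide, id_columns=["region"], value_columns=["q1", "q2"])
```

Each input row produces one output row per depivoted column, so the result grows with the number of columns chosen. That growth is one reason reshaping was a harder test for the candidate systems than scrolling or filtering.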

After continued investigation, we decided on Apache Spark. Not only is it scalable to the data volumes we anticipated, but it also has a simple, robust, and flexible way of expressing computations over immutable data: the Resilient Distributed Dataset (RDD). For more information about Apache Spark, I highly recommend reading about it on the Databricks website or in this tutorial.
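The core idea behind the RDD can be sketched without Spark: the source data is never modified, each transformation returns a new dataset that records what to do, and nothing runs until a result is requested. The `LazyDataset` class below is a hypothetical single-machine toy written for this illustration. It is not Spark's API, and it leaves out the partitioning and distribution across a cluster that make a real RDD useful.

```python
class LazyDataset:
    """Toy stand-in for the RDD idea: immutable source plus a recorded chain of steps."""

    def __init__(self, source, steps=()):
        self._source = tuple(source)  # never modified after creation
        self._steps = tuple(steps)    # lineage: the transformations to replay

    def map(self, fn):
        # Return a new dataset; the current one is left untouched.
        return LazyDataset(self._source, self._steps + (("map", fn),))

    def filter(self, predicate):
        return LazyDataset(self._source, self._steps + (("filter", predicate),))

    def collect(self):
        # Only here is the recorded chain actually executed.
        rows = iter(self._source)
        for kind, fn in self._steps:
            rows = map(fn, rows) if kind == "map" else filter(fn, rows)
        return list(rows)


base = LazyDataset([1, 2, 3, 4, 5])
doubled_evens = base.filter(lambda n: n % 2 == 0).map(lambda n: n * 2)

print(base.collect())           # the original data is unchanged
print(doubled_evens.collect())  # the chain runs only when collected
```

Because every dataset carries the steps that produced it, a result can be recomputed from the source when needed. In Spark, that recorded lineage is what allows lost partitions to be rebuilt.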

The Paxata team will be available at Strata NYC. Come by booth #301 to find out more!

Please note that Apache HBase, Apache Cassandra, and Apache Spark are trademarks of the Apache Software Foundation.

About the author
DataRobot