
Building Paxata with Apache Spark

September 22, 2016
3 min

Shachar Harussi (a senior distributed systems engineer at Paxata) and I are happy and grateful that we got to share the lessons we learned while building the Apache Spark-based architecture for the Paxata platform today at Dataversity in Chicago. Thank you so much to the wonderful folks at Enterprise Dataversity!

I am excited to talk about this more next week at Strata NYC. Come by booth #301 to learn about how Paxata uses Spark and how we innovated and optimized Spark for data preparation. Here is a peek into what we talked about today:

Apache Spark: Lightning-fast cluster computing

How can a business analyst use Spark?

This was my favorite question during our talk. For context – Shachar and I described how the Paxata development team chose Spark from among other big data technologies, how we invested in Spark, and how we adapted Spark to create the interactive and scalable Paxata platform.

On top of the platform, the Paxata self-service data preparation app is where business analysts profile, manipulate, transform, and cleanse data. In response to this question: every click, every action a business analyst takes in the Paxata app gets compiled into Spark-specific transformations that are processed in the Spark layer, and the results are then bubbled back up to the interface.
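To make that concrete, here is a minimal sketch in Spark's Scala API of how recorded UI steps could be compiled into DataFrame transformations, with a small page of results sent back to the grid. This is illustrative only and not Paxata's actual code; the step types, names, and page size are assumptions.

// Illustrative only: hypothetical UI steps recorded by the front end.
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

sealed trait UiStep
case class FilterStep(column: String, minValue: Double) extends UiStep
case class SortStep(column: String, ascending: Boolean) extends UiStep

object StepCompiler {
  // Translate each recorded UI step into the equivalent Spark transformation.
  def compile(df: DataFrame, step: UiStep): DataFrame = step match {
    case FilterStep(c, min)      => df.filter(col(c) > min)
    case SortStep(c, asc) if asc => df.orderBy(col(c).asc)
    case SortStep(c, _)          => df.orderBy(col(c).desc)
  }

  // Apply the steps in order and return a small page of rows for the UI grid.
  def preview(df: DataFrame, steps: Seq[UiStep], pageSize: Int = 50) =
    steps.foldLeft(df)(compile).limit(pageSize).collect()
}

Because each interaction only adds a transformation and pulls back a small preview, results come straight back to the browser rather than going through a separate batch job.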

When a business analyst launches her browser and logs into the Paxata self-service data preparation app, she is interfacing with Spark! When she looks at 2 million transactions, or insurance claims, or leads, she sees the data laid out like a spreadsheet.

This is what an analyst sees using Paxata data prep:

[Screenshot: the Explore view in the Paxata app]

As she scrolls through her data, sorting, filtering, and reshaping it, she is actually using Spark. She isn’t using a command line, and she isn’t creating a script that she will then submit as a job somewhere (and wait, and wait). She is actively interfacing with her data, and her actions are being processed by Spark. The Paxata platform enables business analysts to work with their data interactively and intelligently, thanks to a Spark-based computing layer and a spreadsheet-like user interface.
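As another hedged sketch (again in Spark's Scala API, with hypothetical names), this is one way a backend could serve just the currently visible window of rows as the analyst scrolls, instead of generating a job for her to submit and wait on:

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, Row}

object GridWindow {
  // zipWithIndex assigns consecutive row indices across partitions;
  // caching lets repeated scroll requests reuse the same in-memory data.
  def indexed(df: DataFrame): RDD[(Row, Long)] =
    df.rdd.zipWithIndex().cache()

  // Fetch only the rows for the window that is visible in the grid.
  def page(rows: RDD[(Row, Long)], start: Long, count: Int): Array[Row] =
    rows
      .filter { case (_, idx) => idx >= start && idx < start + count }
      .map(_._1)
      .collect()
}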

Paxata is built on Spark vs “Works with” Spark

We decided to build the Paxata platform on top of Spark — not merely compatible with Spark, but truly written for and with Spark — to provide a modern, dynamic experience for business analysts to see, explore, and fix all of their millions (and billions!) of rows of data. Spark is a crucial part of the powerful computing layer in the Paxata platform.

Sometimes companies advertise their “compatibility” with Spark. What does that mean? It could mean that instead of producing direct results, they produce a script, code, or a job that you then have to execute and process somewhere separately. The Paxata platform is a unified solution: business analysts get all of their results immediately. Being compatible with Spark instead of being designed for and built with Spark means not taking advantage of the incredible distributed processing that Spark was designed for!


Look for me at Booth #301 at Strata NYC – I look forward to talking with you about how we built our own system with Spark to accommodate the iterative and collaborative nature of true data preparation. Learn more about Paxata at this session with analysts from Citi, Standard Chartered Bank (SCB), and Polaris, who will discuss how they use Paxata to analyze transactions and behaviors for indicators of money laundering, fraud, and human trafficking.
