Journey to Apache Spark
In a previous blog post, I mentioned that Shachar Harussi and I discussed the lessons we learned building the Apache Spark-based architecture for the Paxata platform at DataVersity in Chicago this week. I can’t wait to continue the conversation at Strata NYC. Come by booth #301 and find me to ask more about these topics!
Here is a second peek into what we talked about; refer to that post for context.
Why use Apache Spark for data preparation?
Data preparation includes gathering and exploring millions of records; cleaning up missing data and invalid formats; transforming and combining data with other datasets; correcting mistakes and understanding outliers; and filtering and segmenting datasets down to the data that matters.
So, why use Spark? When Paxata was founded, we knew that there were important things to consider for how people want to prepare their data:
- Business analysts need to see their data in rows and columns, not a bunch of boxes and arrows.
- Business analysts need to prepare all of their data for their eventual analysis, not just some of it.
- As business analysts prepare their data (cleaning, combining, fixing, re-shaping) they need to see changes reflected in the data immediately.
So, we needed to build an experience that would show lots of data, react quickly, calculate changes across huge datasets, and be scalable to millions (billions!) of rows. We designed an interactive spreadsheet-like experience to address the first point. Now, to be smart, scalable, and speedy, we needed to investigate available open-source projects in distributed computing.
The image below demonstrates the interactive experience of the Paxata self-service data preparation app with semantic column profiles, numerical range filtering, and text searching for millions of records.
Spark vs. Other Distributed Systems Projects
We considered projects along the spectrum of storage to pure computing.
Distributed file systems like Alluxio (previously Tachyon) and Ceph addressed scalability concerns for “big data” problems, but did not address an analyst’s needs to interact with the data and make changes.
Databases like Apache HBase and Apache Cassandra offered scalable data storage as well as real-time querying and filtering, which was an improvement on pure storage. These projects showed potential; they could accommodate an analyst zooming around their data, scrolling across millions of records with relative ease. But what would happen if data had to be reshaped, aggregated, and pivoted?
The image below demonstrates the interactive experience of the Paxata self-service data preparation app with a number of shaping options, including deduplicate, group by, transpose, pivot, and depivot. Depivot is shown here, with new values dynamically calculated on the fly based on the chosen parameters.
After continued investigation, we decided on Apache Spark. Not only is it scalable to the data volumes we anticipated, but it also has a simple, robust, and flexible way of expressing computations as transformations over an immutable distributed collection, the Resilient Distributed Dataset (RDD). For more information about Apache Spark, I highly recommend reading about it on the Databricks website or in this tutorial.
The Paxata team will be available at Strata NYC. Come by booth #301 to find out more!
Please note that Apache HBase, Apache Cassandra, and Apache Spark are trademarks of the Apache Software Foundation.