Building Paxata with Apache Spark

Connecting with the Spark Community

April 24, 2016

On Tuesday night, the Paxata Lab was packed with people who came from all over the Bay Area to participate in the Spark Workshop on the Peninsula Meetup of the SF Big Analytics group, the first in a four-part series. As long-time advocates of Spark (we built the Paxata platform on Spark and shipped it to our customers with release 1.0.0 back in 2014), we were excited to host an event for the community! 40+ people joined the Meetup and stayed until well past 9:30pm, learning how to program with Spark, eating pizza, and writing code!

The workshop was led by the incredible Holden Karau. She is currently working on Apache Spark at IBM, speaks frequently at conferences around the world, and is a great advocate for the open-source community (not to mention a huge Hello Kitty fan!). Here is a taste of what the meetup was like in Holden’s words:

What is Spark?

“What is Spark? It’s a really great general purpose distributed system. It has a nice API, nicer than Map Reduce, and it has a good optimizer that allows me to think less.”

This meetup was part intro-to-Spark and part hands-on exercise, with cheerful, helpful, and super smart TAs Rachel Warren, Anya Bida, and Sara Asher from Alpine Data. 50 people learned about RDDs and the SparkContext, then dove into a word count example. As the instructors explained, word count examples are required for any intro-to-Big-Data-coding session (think Hive, Flink, MapReduce). Those are the rules!
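For readers who missed the session, here is roughly what a word count looks like in PySpark. This is a minimal sketch, not the code used in the workshop: the helper names (tokenize, word_count, main), the app name, and the two-line toy input are all made up for illustration. It assumes PySpark is installed and runs Spark in local mode.

```python
from operator import add


def tokenize(line):
    """Split a line of text into lowercase words."""
    return line.lower().split()


def word_count(lines):
    """Count words in an RDD of text lines.

    `lines` is assumed to be a Spark RDD of strings; the function relies
    only on the RDD methods flatMap, map, and reduceByKey.
    """
    return (
        lines.flatMap(tokenize)             # one record per word
        .map(lambda word: (word, 1))        # pair each word with a count of 1
        .reduceByKey(add)                   # sum the counts for each word
    )


def main():
    # Imported here so the helpers above can be read without Spark installed.
    from pyspark import SparkContext

    # "local[*]" runs Spark on this machine; the app name is hypothetical.
    sc = SparkContext("local[*]", "WordCountSketch")
    try:
        lines = sc.parallelize(["spark is fast", "spark is fun"])  # toy input
        for word, count in sorted(word_count(lines).collect()):
            print(word, count)
    finally:
        sc.stop()
```

Calling main() on a machine with PySpark installed prints each distinct word with its count. Nothing is computed until collect() is called: flatMap, map, and reduceByKey only describe the work, which is what gives Spark's optimizer room to plan it.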

Slides from Holden Karau's talk, "Lightning Fast Cluster Computing with Python (and just a wee bit of Scala)," are available here.

More from Holden during the meetup:

Comparing Spark to MapReduce

“Resiliency is achieved in a different way in Spark than traditional MapReduce. In MapReduce, resiliency is achieved because I’m always writing to a whole bunch of disks. It’s a good strategy, but it’s slow.
Spark’s creators said that because node failure doesn’t happen that often, I don’t have to write everything to disk. If we lose a node, Spark just recomputes the data for that node.”
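To make the "just recompute it" idea concrete, here is a toy model in plain Python. It is not Spark's implementation and does not use Spark at all: ToyDataset and its methods are invented for illustration, and unlike Spark it keeps every computed partition in memory. The point it sketches is that each dataset remembers how it was derived, so a lost partition can be rebuilt from its parent instead of being restored from a copy on disk.

```python
class ToyDataset:
    """A toy stand-in for an RDD: partitions plus the recipe that built them."""

    def __init__(self, source_partitions=None, parent=None, fn=None):
        self.source = source_partitions  # original input, set only on the root
        self.parent = parent             # dataset this one was derived from
        self.fn = fn                     # per-record function applied to the parent
        self.cache = {}                  # partition index -> in-memory records

    def map(self, fn):
        # Only record the transformation; nothing is computed or written to disk.
        return ToyDataset(parent=self, fn=fn)

    def num_partitions(self):
        if self.parent is None:
            return len(self.source)
        return self.parent.num_partitions()

    def compute(self, index):
        """Return one partition, rebuilding it from lineage if it is missing."""
        if index not in self.cache:
            if self.parent is None:
                records = list(self.source[index])
            else:
                records = [self.fn(r) for r in self.parent.compute(index)]
            self.cache[index] = records
        return self.cache[index]

    def lose_partition(self, index):
        # Simulate a node failure that wipes one in-memory partition.
        self.cache.pop(index, None)

    def collect(self):
        return [r for i in range(self.num_partitions()) for r in self.compute(i)]


numbers = ToyDataset(source_partitions=[[1, 2], [3, 4]])
squares = numbers.map(lambda x: x * x)

before = squares.collect()
squares.lose_partition(1)   # the "node" holding partition 1 fails
after = squares.collect()   # partition 1 is rebuilt from its parent
print(before == after)
```

The trade-off matches the one Holden describes: nothing is written to disk on the happy path, and the cost of a failure is the time to recompute the lost partition.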

Holden and Rachel are also working on a new book, High Performance Spark, a follow-up to Learning Spark.

Paxata will be hosting a Data Prepsters Meetup with Tableau and the TAM group at our offices in Redwood City on Wednesday, May 18th from 6pm-8:30pm. The topic is “Data Freedom – Tableau shares how to truly that your reality.” There will be sushi, data blending discussions and networking.

The next SF Big Analytics Meetup is May 3rd at the IBM Spark Technology Center. Part two in the “Big Data Toolbox” series will be at Alpine Data Labs on May 17th.

About the author
DataRobot
