Building Paxata with Apache Spark Background

Connecting with the Spark Community

April 24, 2016
· 2 min read

On Tuesday night, the Paxata Lab was packed with people who came from all over the Bay Area to participate in the Spark Workshop on the Peninsula Meetup of the SF Big Analytics group, the first in a four-part series. As long-time advocates of Spark (we built the Paxata platform on Spark and released to our customers with release 1.0.0 back in 2014), we were excited to host an event for the community! 40+ people joined the Meetup and stayed until well past 9:30pm learning how to program with Spark, eating pizza and writing code!

The workshop was led by the incredible Holden Karau. She is currently working on Apache Spark at IBM, speaks frequently at conferences around the world, and is a great advocate for the open-source community (not to mention a huge Hello Kitty fan!). Here is a taste of what the meetup was like in Holden’s words:

What is Spark?

“What is Spark? It’s a really great general purpose distributed system. It has a nice API, nicer than Map Reduce, and it has a good optimizer that allows me to think less.”

This meetup was part intro-to-Spark and part hands-on exercise with cheerful, helpful, and super smart TA’s Rachel Warren, Anya Bida, and Sara Asher from Alpine Data. 50 people learned about RDDs, the Spark Context, and dove into a word count example. As the instructors explained, word count examples are required for any intro-to-Big-Data-coding sessions (think Hive, Flink, Map-Reduce). Those are the rules!

Screen Shot 2016-04-25 at 1.42.37 PM

Slides from Holden Karau Lighting Fast Cluster Computing with Python (and just a wee bit of Scala) are available here.

More from Holden during the meetup –
Comparing Spark to Map Reduce

“Resiliency is achieved in a different way in Spark than traditional MapReduce. In MapReduce, resiliency is achieved because I’m always writing to a whole bunch of disks. It’s a good strategy, but it’s slow.
Spark’s creators said that because node failure doesn’t happen that often, I don’t have to write everything to disk. If we lose a node, Spark just recomputes the data for that node.”

Holden and Rachel are also working on a follow up to the book Learning about Spark with a new book High Performance Spark.

Paxata will be hosting a Data Prepsters Meetup with Tableau and the TAM group at our offices in Redwood City on Wednesday, May 18th from 6pm-8:30pm. The topic is “Data Freedom – Tableau shares how to truly that your reality.” There will be sushi, data blending discussions and networking.

Screen Shot 2016-04-25 at 1.52.56 PM

The next SF Big Analytics Meetup is May 3rd at the IBM Spark Technology Center. Part two in the “Big Data Toolbox” series will be at the Alpine Data Labs on May 17th.

DataRobot Data Prep

Interactively explore, combine, and shape diverse datasets into data ready for machine learning and AI applications

Try now for free
About the author

Value-Driven AI

DataRobot is the leader in Value-Driven AI – a unique and collaborative approach to AI that combines our open AI platform, deep AI expertise and broad use-case implementation to improve how customers run, grow and optimize their business. The DataRobot AI Platform is the only complete AI lifecycle platform that interoperates with your existing investments in data, applications and business processes, and can be deployed on-prem or in any cloud environment. DataRobot and our partners have a decade of world-class AI expertise collaborating with AI teams (data scientists, business and IT), removing common blockers and developing best practices to successfully navigate projects that result in faster time to value, increased revenue and reduced costs. DataRobot customers include 40% of the Fortune 50, 8 of top 10 US banks, 7 of the top 10 pharmaceutical companies, 7 of the top 10 telcos, 5 of top 10 global manufacturers.

Meet DataRobot
  • Listen to the blog
  • Share this post
    Subscribe to DataRobot Blog
    Newsletter Subscription
    Subscribe to our Blog