Connecting with the Spark Community
On Tuesday night, the Paxata Lab was packed with people from all over the Bay Area who came to participate in the Spark Workshop at the Peninsula Meetup of the SF Big Analytics group, the first in a four-part series. As long-time advocates of Spark (we built the Paxata platform on Spark and released it to our customers with release 1.0.0 back in 2014), we were excited to host an event for the community! More than 40 people joined the Meetup and stayed until well past 9:30pm learning how to program with Spark, eating pizza, and writing code!
The workshop was led by the incredible Holden Karau. She is currently working on Apache Spark at IBM, speaks frequently at conferences around the world, and is a great advocate for the open-source community (not to mention a huge Hello Kitty fan!). Here is a taste of what the meetup was like in Holden’s words:
What is Spark?
“What is Spark? It’s a really great general-purpose distributed system. It has a nice API, nicer than MapReduce, and it has a good optimizer that allows me to think less.”
This meetup was part intro-to-Spark and part hands-on exercise, with cheerful, helpful, and super smart TAs Rachel Warren, Anya Bida, and Sara Asher from Alpine Data. 50 people learned about RDDs and the Spark Context, then dove into a word count example. As the instructors explained, a word count example is required for any intro-to-Big-Data-coding session (think Hive, Flink, MapReduce). Those are the rules!
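For readers who missed the session, here is a minimal sketch of the word count idea in plain Python. It is not the workshop's code and it does not use Spark; the helpers `flat_map` and `reduce_by_key` are hypothetical stand-ins that imitate the three stages a PySpark word count typically chains on an RDD (`flatMap`, `map`, `reduceByKey`), and the two sample lines are made up.

```python
from collections import defaultdict
from functools import reduce
from itertools import chain


def flat_map(func, records):
    """Apply func to each record and flatten the resulting lists into one list."""
    return list(chain.from_iterable(func(record) for record in records))


def reduce_by_key(func, pairs):
    """Group (key, value) pairs by key, then fold each group's values with func."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return {key: reduce(func, values) for key, values in grouped.items()}


# Made-up input standing in for the lines of a text file.
lines = ["spark is fast", "spark is fun"]

words = flat_map(lambda line: line.split(), lines)   # split every line into words
pairs = [(word, 1) for word in words]                # emit a (word, 1) pair per word
counts = reduce_by_key(lambda a, b: a + b, pairs)    # sum the ones for each word

print(counts)  # {'spark': 2, 'is': 2, 'fast': 1, 'fun': 1}
```

In Spark the same three steps run in parallel across the partitions of an RDD, which is what makes this tiny example a useful first exercise.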
Holden Karau’s slides, Lightning Fast Cluster Computing with Python (and just a wee bit of Scala), are available here.
More from Holden during the meetup:
Comparing Spark to MapReduce
“Resiliency is achieved in a different way in Spark than traditional MapReduce. In MapReduce, resiliency is achieved because I’m always writing to a whole bunch of disks. It’s a good strategy, but it’s slow.
Spark’s creators said that because node failure doesn’t happen that often, I don’t have to write everything to disk. If we lose a node, Spark just recomputes the data for that node.”
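The recompute-instead-of-rewrite idea Holden describes can be illustrated with a toy model. This is only a sketch under stated assumptions: `ToyRDD` is a hypothetical class written for this post, not Spark's implementation, and it assumes the source data is still available somewhere durable. Each dataset remembers the chain of transformations that produced it, so a lost partition can be rebuilt from its source on demand.

```python
class ToyRDD:
    """Hypothetical stand-in for an RDD: source partitions plus the recipe to rebuild results."""

    def __init__(self, source_partitions, transforms=()):
        self.source_partitions = source_partitions  # assumed to be durable input data
        self.transforms = tuple(transforms)         # the "lineage": functions applied in order
        self.cache = {}                             # computed partitions held in memory

    def map(self, func):
        # Record the transformation instead of running it right away.
        return ToyRDD(self.source_partitions, self.transforms + (func,))

    def compute_partition(self, index):
        # Rebuild a partition from its source by replaying the lineage if it is not in memory.
        if index not in self.cache:
            data = self.source_partitions[index]
            for func in self.transforms:
                data = [func(x) for x in data]
            self.cache[index] = data
        return self.cache[index]

    def lose_partition(self, index):
        # Simulate a node failure: the in-memory result for this partition disappears.
        self.cache.pop(index, None)

    def collect(self):
        return [x for i in range(len(self.source_partitions))
                for x in self.compute_partition(i)]


source = [[1, 2], [3, 4], [5, 6]]
result = ToyRDD(source).map(lambda x: x * x).map(lambda x: x + 1)

before = result.collect()   # computes and caches all three partitions
result.lose_partition(1)    # pretend the node holding partition 1 died
after = result.collect()    # only partition 1 is recomputed from its lineage

print(before == after)  # True
```

Nothing was written to disk between steps, yet the lost partition comes back, and only the missing piece has to be recomputed.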
Paxata will be hosting a Data Prepsters Meetup with Tableau and the TAM group at our offices in Redwood City on Wednesday, May 18th from 6pm-8:30pm. The topic is “Data Freedom – Tableau shares how to truly that your reality.” There will be sushi, data blending discussions and networking.
The next SF Big Analytics Meetup is on May 3rd at the IBM Spark Technology Center. Part two in the “Big Data Toolbox” series will be at Alpine Data Labs on May 17th.