Successfully leveraging a data lake across multiple Hadoop distros

June 27, 2016
· 3 min read

Comparing Hadoop distribution vendors is a popular topic among Big Data writers. In many organizations, however, the comparison is happening inside of their own walls, with test clusters running multiple distributions side-by-side, serving multiple internal needs.

Every organization has multiple databases and, with the growing popularity of Hadoop and related technologies, often more than one Hadoop distribution as well. Analysts access data stored in any number of these databases and disparate Hadoop-based file systems to prepare it for their downstream business intelligence tool of choice, but they can’t bring it all together because of the technical obstacles to crossing Hadoop distributions. Connecting to more than one Hadoop system from a single Java application typically leads to class loader conflicts, because each distribution ships its own versions of the same client classes.


Most software vendors work around this problem by shipping a separate code base for each Hadoop distribution. That leaves analysts with a dilemma: they can’t access data from multiple Hadoop distributions without switching to a different application for each data source. IT teams are also frustrated with the growing tangle of disconnected data lakes, cut off from the tools that analysts are demanding.


Why does this happen? Data integration across varying distributions of Hadoop is a challenge because of Java class loader conflicts. The Paxata team recognized, however, that many of our customers needed to be able to access data across Hadoop environments. To be able to connect to more than one Hadoop distribution or version, the Paxata team developed an interface that dynamically segments import and export traffic to and from Hadoop into code that understands distribution-specific Hadoop versus non-Hadoop configurations.

So, what actually happens when Hadoop is more than Java can handle?

Paxata designed a solution that enables customers to bring in data from anywhere, regardless of the Hadoop distribution or version. Paxata’s self-service data preparation platform lets both cloud and on-premises customers access data from multiple Hadoop platforms and versions concurrently through an interface “wall”. The interface wall dynamically segments import and export traffic to and from Hadoop into code that understands distribution-specific Hadoop versus non-Hadoop configurations. Analysts can connect to their data without having to switch applications or, worse, being unable to connect to their data at all.
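The post does not describe how the interface wall is implemented. As a rough illustration only, a common way to avoid this kind of conflict in Java is to give each distribution’s client libraries their own class loader, so that two versions of the same class never meet in one namespace. The sketch below assumes that approach; the class name `DistroLoaderRegistry`, its methods, and the distribution keys are all hypothetical, and it assumes the Hadoop client jars are kept off the application’s own class path.

```java
import java.net.URL;
import java.net.URLClassLoader;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch, not Paxata's code: one class loader per Hadoop distribution.
final class DistroLoaderRegistry {
    // Distribution key (for example "distro-a") -> loader holding that distribution's client jars.
    private final Map<String, URLClassLoader> loaders = new ConcurrentHashMap<>();

    // Returns the loader for a distribution, creating it on first use from the given jar URLs.
    // Later calls with the same key return the cached loader and ignore the jar arguments.
    public URLClassLoader loaderFor(String distro, URL... clientJars) {
        return loaders.computeIfAbsent(distro, key ->
                new URLClassLoader(clientJars, DistroLoaderRegistry.class.getClassLoader()));
    }

    // Loads a class by name through the loader registered for that distribution.
    // Classes found only in that distribution's jars are defined by its loader,
    // so the same class name can resolve to a different version per distribution.
    public Class<?> loadClass(String distro, String className) throws ClassNotFoundException {
        URLClassLoader loader = loaders.get(distro);
        if (loader == null) {
            throw new IllegalStateException("No loader registered for " + distro);
        }
        return Class.forName(className, true, loader);
    }
}
```

In use, an application would call something like `registry.loaderFor("distro-a", jarUrl)` once per distribution, with each URL obtained from the jar’s path via `Path.toUri().toURL()`, and then route every import or export for that distribution through classes loaded by its loader. Code outside the wall would talk to those classes only through a shared interface visible to the parent loader.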

For example, this interface wall enabled a team of several dozen analysts at a major financial services company to work with data from more than 5,000 databases and across several Hadoop distributions.

The financial services industry has been facing the Big Data issues of volume, veracity, and variety since before they became buzzwords. Big banks were early adopters of Hadoop-based infrastructure, which they used to cope with their massive volumes of data:

“…. one cannot overlook the issue of volume; estimates contend that financial and securities organizations are juggling around 3.8 petabytes per firm. Following behind the investment institutions, the banking industry is contending with around 1.9 petabytes…” – Datanami (link)


To spread the risk of early adoption, these institutions have invested in multiple distributions of Hadoop for various, sometimes overlapping, purposes. Hadoop-based information management architectures serve teams in financial organizations across a breadth of scenarios, beyond the standard data management concerns around storage, integration, and security.

  • Risk analysis: Detecting fraud, risk assessment, estimating impact, scoring customers and potential clients
  • Customer: Customer prospect analysis, offer/product recommendation, customer service, customer segmentation and experience analysis, targeted services, portfolio analysis
  • Data processing: Logs, trade data, social media feeds
  • Compliance: Compliance, auditing, governance

With Paxata’s interface wall, IT can untangle the mess of connections between data stores and data tools. Analysts don’t need to worry about where their data is stored; they can just get to work in the tool they’re comfortable with.
