Successfully leveraging a data lake across multiple Hadoop distros

June 27, 2016
· 3 min read

Comparing Hadoop distribution vendors is a popular topic among Big Data writers. In many organizations, however, the comparison is happening inside of their own walls, with test clusters running multiple distributions side-by-side, serving multiple internal needs.

Every organization has multiple databases, and with the growing popularity of Hadoop and related technologies, often more than one Hadoop distribution as well. Analysts access data stored in any number of these databases and disparate Hadoop-based file systems to prepare it for their downstream business intelligence tool of choice, but can’t bring it all together because of technical obstacles to crossing different Hadoop distributions. Within a single JVM, connecting to more than one Hadoop system runs into Java class loader conflicts: each distribution ships its own versions of the same client classes, and only one version can win.
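To make the conflict concrete, here is a minimal sketch of the standard JVM workaround: a child-first class loader that keeps one distribution's client jars isolated from another's. This is illustrative only, not Paxata's actual code; the class name is hypothetical. With the JVM's default parent-first delegation, whichever distribution's `org.apache.hadoop.*` classes load first win for the whole JVM, so each distribution's jars must be searched in their own loader before falling back to the shared parent.

```java
import java.net.URL;
import java.net.URLClassLoader;

// Hypothetical sketch: a child-first class loader that isolates one Hadoop
// distribution's client jars. Not Paxata's implementation.
class ChildFirstClassLoader extends URLClassLoader {

    ChildFirstClassLoader(URL[] distroJars, ClassLoader parent) {
        super(distroJars, parent);
    }

    @Override
    protected Class<?> loadClass(String name, boolean resolve)
            throws ClassNotFoundException {
        synchronized (getClassLoadingLock(name)) {
            // Already loaded by this loader?
            Class<?> c = findLoadedClass(name);
            if (c == null) {
                try {
                    // Search this distribution's jars first...
                    c = findClass(name);
                } catch (ClassNotFoundException e) {
                    // ...and only then delegate to the shared parent loader.
                    c = super.loadClass(name, false);
                }
            }
            if (resolve) {
                resolveClass(c);
            }
            return c;
        }
    }
}
```

Two such loaders, each pointed at a different distribution's jars, can then hold two versions of the same Hadoop client class in one JVM without colliding.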


Most software vendors solve this problem by shipping separate code bases for each Hadoop distribution. That leaves analysts with a dilemma: they can’t access data from multiple Hadoop distributions without switching to a different application for each data source. IT teams are equally frustrated with the growing tangle of disconnected data lakes, cut off from the tools that analysts are demanding.


Why does this happen? Data integration across varying distributions of Hadoop is a challenge because of Java class loader conflicts. The Paxata team recognized, however, that many of our customers needed to access data across Hadoop environments. To connect to more than one Hadoop distribution or version, the team developed an interface that isolates each distribution’s client code from the rest of the platform.

So, what actually happens when Hadoop is more than Java can handle?

Paxata designed a solution that enables customers to bring in data from anywhere, regardless of the Hadoop distribution or version. Paxata’s self-service data preparation platform lets both our cloud and on-premises customers access data from multiple Hadoop platforms and versions through an interface “wall”. The interface wall dynamically segments import and export traffic to and from Hadoop into code that understands distribution-specific Hadoop versus non-Hadoop configurations. Analysts can connect to their data without switching applications, or worse, failing to reach their data at all.
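The segmenting idea can be sketched as a thin dispatcher that keys each import or export request on the target system and hands it to a handler built inside that distribution's own isolated class loader. All names here (`HadoopConnector`, `ConnectorRegistry`) are hypothetical illustrations of the pattern, not Paxata APIs:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the "interface wall" pattern: route each request to
// a per-distribution handler. In a real system each HadoopConnector would be
// instantiated inside its own isolated class loader.
interface HadoopConnector {
    // Pull a dataset from the given path into the prep platform.
    String importData(String path);
}

class ConnectorRegistry {
    private final Map<String, HadoopConnector> connectors = new HashMap<>();

    // Register a handler under a key such as "cdh5", "hdp2.4", or a
    // non-Hadoop source like "jdbc".
    void register(String distro, HadoopConnector connector) {
        connectors.put(distro, connector);
    }

    String importFrom(String distro, String path) {
        HadoopConnector c = connectors.get(distro);
        if (c == null) {
            throw new IllegalArgumentException("No connector for: " + distro);
        }
        return c.importData(path);
    }
}
```

The wall lives in the registry: callers above it never touch distribution-specific classes, so no two Hadoop client versions ever meet in the same class loader.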

For example, this interface wall enabled a team of several dozen analysts at a major financial services company to work with data from more than 5,000 databases and across several Hadoop distributions.

Financial services has been facing the Big Data issues of volume, veracity, and variety since before they became buzzwords. Big banks were early adopters of Hadoop-based infrastructures to handle their massive volumes of data:

“… one cannot overlook the issue of volume; estimates contend that financial and securities organizations are juggling around 3.8 petabytes per firm. Following behind the investment institutions, the banking industry is contending with around 1.9 petabytes…” – Datanami


To spread the risk of early adoption, these institutions have invested in multiple distributions of Hadoop for various, sometimes overlapping, purposes. Hadoop-based information management architectures serve teams in financial organizations across a breadth of scenarios, beyond the standard data management concerns of storage, integration, and security.

  • Risk analysis: detecting fraud, assessing risk, estimating impact, scoring customers and potential clients
  • Customer: prospect analysis, offer/product recommendation, customer service, segmentation and experience analysis, targeted services, portfolio analysis
  • Data processing: logs, trade data, social media feeds
  • Compliance: auditing and governance

With Paxata’s elegant interface wall, IT can untangle the web of connections between data stores and data tools. Analysts don’t need to worry about where their data is stored; they can just get to work in the tool they’re comfortable with.
