Successfully leveraging a data lake across multiple Hadoop distros
Comparing Hadoop distribution vendors is a popular topic among Big Data writers. In many organizations, however, the comparison is happening inside of their own walls, with test clusters running multiple distributions side-by-side, serving multiple internal needs.
Most organizations have multiple databases, and with the growing popularity of Hadoop and related technologies, many now run more than one Hadoop distribution as well. Analysts access data stored in any number of these databases and disparate Hadoop-based file systems to prepare it for their downstream business intelligence tool of choice, but can’t bring it all together because of the technical obstacles in crossing different Hadoop distributions. A single Java application that tries to connect to more than one Hadoop system typically runs into Java class loader conflicts.
Most software vendors work around this problem by shipping separate code bases for each Hadoop distribution. That leaves analysts with a dilemma: they can’t access data from multiple Hadoop distributions without switching to a different application for each data source. IT teams are also frustrated with the growing tangle of disconnected data lakes, totally separated from the tools that analysts are demanding.
Why does this happen? Data integration across varying distributions of Hadoop is a challenge because of Java class loader conflicts. Each distribution ships its own builds of the Hadoop client libraries, often with identical class names but incompatible versions, and a single Java class loader resolves each class name to only one of them. The Paxata team recognized, however, that many of our customers needed to be able to access data across Hadoop environments. To connect to more than one Hadoop distribution or version, the Paxata team developed an interface that dynamically segments import and export traffic to and from Hadoop into code that understands distribution-specific Hadoop versus non-Hadoop configurations.
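To make the class loader idea concrete, here is a minimal sketch of one common way to keep conflicting client libraries apart in a single JVM: give each distribution its own class loader. This is an illustration only, not Paxata's implementation. The class name `DistroLoaderRegistry`, the distribution labels, and the jar paths are all hypothetical.

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.net.URLClassLoader;
import java.nio.file.Path;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical sketch: one isolated class loader per Hadoop distribution. */
public class DistroLoaderRegistry {
    // Distribution label -> jars holding that distribution's client libraries (assumed layout).
    private final Map<String, List<Path>> clientJars;
    // Loaders are created lazily and reused, so each distribution has exactly one.
    private final Map<String, ClassLoader> loaders = new ConcurrentHashMap<>();

    public DistroLoaderRegistry(Map<String, List<Path>> clientJars) {
        this.clientJars = Map.copyOf(clientJars);
    }

    /** Returns the class loader dedicated to the given distribution label. */
    public ClassLoader loaderFor(String distro) {
        if (!clientJars.containsKey(distro)) {
            throw new IllegalArgumentException("Unknown distribution: " + distro);
        }
        return loaders.computeIfAbsent(distro, this::createLoader);
    }

    /** Loads a class by name through the loader that belongs to one distribution. */
    public Class<?> loadClientClass(String distro, String className)
            throws ClassNotFoundException {
        return Class.forName(className, true, loaderFor(distro));
    }

    private ClassLoader createLoader(String distro) {
        URL[] urls = clientJars.get(distro).stream()
                .map(DistroLoaderRegistry::toUrl)
                .toArray(URL[]::new);
        // The parent is the platform loader: JDK classes resolve as usual, but classes
        // on the application classpath or in another distribution's loader are not visible.
        return new URLClassLoader(urls, ClassLoader.getPlatformClassLoader());
    }

    private static URL toUrl(Path jar) {
        try {
            return jar.toUri().toURL();
        } catch (MalformedURLException e) {
            throw new IllegalStateException("Bad jar path: " + jar, e);
        }
    }
}
```

With a registry like this, import and export code for each distribution would be loaded through `loaderFor(...)`, so two versions of a class with the same name can coexist because each lives in a different loader. Code outside the loaders would talk to them only through shared interfaces or reflection.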
So, what actually happens when Hadoop is more than Java can handle?
Paxata designed a solution that enables customers to bring in data from anywhere, regardless of the Hadoop distribution or version. Paxata’s self-service data preparation platform lets both our cloud and on-premises customers access data from multiple Hadoop platforms and versions concurrently through an interface “wall”. The interface wall dynamically segments import and export traffic to and from Hadoop into code that understands distribution-specific Hadoop versus non-Hadoop configurations. Analysts can connect to their data without having to switch applications or, worse, being unable to connect to their data at all.
For example, this interface wall enabled a team of several dozen analysts at a major financial services company to work with data from more than 5,000 databases and across several Hadoop distributions.
Financial services has been facing the Big Data issues of volume, veracity, and variety since before they became buzzwords. Big banks were early adopters of Hadoop-based infrastructures as a way to cope with their massive volumes of data:
“…. one cannot overlook the issue of volume; estimates contend that financial and securities organizations are juggling around 3.8 petabytes per firm. Following behind the investment institutions, the banking industry is contending with around 1.9 petabytes…” – Datanami (link)
To dilute the risk of early adoption (of course), these institutions have invested in multiple distributions of Hadoop for various, sometimes overlapping, purposes. Hadoop-based information management architectures are serving teams in financial organizations across a breadth of scenarios, beyond the standard data management concerns around storage, integration, and security.
- Risk analysis: Detecting fraud, risk assessment, estimating impact, scoring customers and potential clients
- Customer: Customer prospect analysis, offer/product recommendation, customer service, customer segmentation and experience analysis, targeted services, portfolio analysis
- Data processing: Logs, trade data, social media feeds
- Compliance: Compliance, auditing, governance
With Paxata’s interface wall, IT can untangle the mess of connections between data stores and data tools. Analysts don’t need to worry about where their data is stored; they can just get to work in the tool they’re comfortable with.