Successfully leveraging a data lake across multiple Hadoop distros
Comparing Hadoop distribution vendors is a popular topic among Big Data writers. In many organizations, however, the comparison is happening inside of their own walls, with test clusters running multiple distributions side-by-side, serving multiple internal needs.
Every organization has multiple databases, and with the growing popularity of Hadoop and related technologies, many now run more than one Hadoop distribution as well. Analysts access data stored in any number of these databases and disparate Hadoop-based file systems to prepare it for their downstream business intelligence tool of choice, but they can’t bring it all together because of a technical obstacle: within a single Java application, connecting to more than one Hadoop system triggers Java class loader conflicts.
Most software vendors sidestep this problem by shipping a separate code base for each Hadoop distribution. Analysts are left with a dilemma: they can’t access data from multiple Hadoop distributions without switching to a different application for each data source. IT teams are equally frustrated by the growing tangle of disconnected data lakes, cut off from the tools that analysts are demanding.
Why does this happen? Data integration across different Hadoop distributions is a challenge because each distribution ships its own versions of the Hadoop client libraries, and loading more than one set in the same JVM causes class loader conflicts. The Paxata team recognized, however, that many of our customers needed to access data across Hadoop environments, so we developed an interface that lets a single application connect to more than one Hadoop distribution or version.
So, what actually happens when Hadoop is more than Java can handle?
Paxata designed a solution that enables customers to bring in data from anywhere, regardless of the Hadoop distribution or version. Paxata’s self-service data preparation platform lets our cloud and on-premises customers concurrently access data from multiple Hadoop platforms and versions through an interface “wall”. The interface wall dynamically segments import and export traffic to and from Hadoop, routing it to code that understands distribution-specific Hadoop configurations versus non-Hadoop configurations. Analysts can connect to their data without having to switch applications or, worse, not connect to their data at all.
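A minimal sketch of the class loader isolation pattern that such an interface wall relies on. The class and method names here are illustrative assumptions, not Paxata’s actual implementation: each Hadoop distribution’s client jars get their own `URLClassLoader` whose parent is the platform loader rather than the application loader, so the distributions’ conflicting Hadoop classes never share a loader, and each import/export call temporarily swaps in the matching loader as the thread context class loader.

```java
import java.net.URL;
import java.net.URLClassLoader;

public class HadoopIsolationSketch {

    // One isolated loader per distribution. Parenting to the platform
    // loader (not the application loader) keeps each distribution's
    // Hadoop classes invisible to the other, avoiding conflicts.
    static ClassLoader isolatedLoader(URL[] distroClientJars) {
        return new URLClassLoader(distroClientJars,
                ClassLoader.getPlatformClassLoader());
    }

    // Run a distribution-specific task with its loader as the thread
    // context class loader, restoring the original loader afterwards.
    static void runWithLoader(ClassLoader loader, Runnable hadoopTask) {
        Thread current = Thread.currentThread();
        ClassLoader original = current.getContextClassLoader();
        current.setContextClassLoader(loader);
        try {
            hadoopTask.run();
        } finally {
            current.setContextClassLoader(original);
        }
    }

    public static void main(String[] args) {
        // In a real system these arrays would point at each
        // distribution's client jars; empty here for illustration.
        ClassLoader distroA = isolatedLoader(new URL[0]);
        ClassLoader distroB = isolatedLoader(new URL[0]);

        runWithLoader(distroA, () -> { /* distro-A import/export code */ });
        runWithLoader(distroB, () -> { /* distro-B import/export code */ });

        // The two distributions resolve classes through distinct loaders.
        System.out.println(distroA != distroB);
    }
}
```

With this shape, only a thin dispatch layer needs to know which distribution a connection targets; everything above it stays distribution-agnostic.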
For example, this interface wall enabled a team of several dozen analysts at a major financial services company to work with data from more than 5,000 databases and across several Hadoop distributions.
Financial services has been facing the Big Data issues of volume, veracity, and variety since before they became buzzwords. Big banks were early adopters of Hadoop-based infrastructures, built to handle their massive volumes of data:
“… one cannot overlook the issue of volume; estimates contend that financial and securities organizations are juggling around 3.8 petabytes per firm. Following behind the investment institutions, the banking industry is contending with around 1.9 petabytes…” – Datanami
To mitigate the risk of early adoption, these institutions have invested in multiple distributions of Hadoop for various, sometimes overlapping, purposes. Hadoop-based information management architectures now serve teams in financial organizations across a breadth of scenarios, beyond the standard data management concerns of storage, integration, and security:
- Risk analysis: detecting fraud, assessing risk, estimating impact, scoring customers and potential clients
- Customer: prospect analysis, offer/product recommendation, customer service, segmentation and experience analysis, targeted services, portfolio analysis
- Data processing: logs, trade data, social media feeds
- Compliance: auditing and governance
With Paxata’s elegant interface wall, IT can untangle the mess of connections between data stores and data tools. Analysts don’t need to worry about where their data is stored; they can just get to work in the tool they’re comfortable with.