Self-Service Data Preparation Powered by Big Data Fabric: The Secret to Becoming Information-Inspired

November 6, 2018

Last week, industry analyst firm Forrester Research recognized Paxata as a leader in The Forrester Wave™: Data Preparation Solutions, Q4 2018. This report updated The Forrester Wave™: Data Preparation Tools, Q2 2017, where Paxata was also recognized as a leader. The honor adds to the leadership recognition Paxata received earlier this year in The Forrester Wave™: Big Data Fabric, Q2 2018. In fact, Paxata is the only vendor acknowledged as a leader in both of these Wave reports. While it is always a pleasure to celebrate our leadership positions, what matters more is that these accolades confirm and connect two critical data management concepts: a modern data management architecture, or big data fabric, must have world-class self-service data preparation capabilities.

Big Data Fabric Defined

Big data fabric is a data management framework that brings your diverse, multi-cloud, and hybrid data landscape under management. It speaks to the importance of playing the data where it lies, and the need for end-to-end metadata management and governance. And it points to using this same data architecture to power all of the data-driven use cases you might have in your organization – such as a single customer view, operational efficiencies, analytics, data science, or new products. Forrester deems data preparation – i.e., the ability to easily find, clean, shape, and prepare data from your data landscape for use in any downstream use case – a critical capability for big data fabric.

Self-Service Data Preparation Defined

Adding self-service to the general data prep description implies that we wish to empower not only traditional technical experts like developers, but also average business consumers of data – such as data analysts, citizen data scientists, citizen data engineers, and business analysts. An easy-to-use, Excel-like visual interface lets these users interact with their data in real time. Intelligent algorithms embedded in the system can dynamically profile the data and recommend how to clean or standardize it.
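
To make the idea of profiling and recommendations concrete, here is a minimal sketch in Python. It is not Paxata's algorithm; the function name `profile_column` and the rule it applies (treat values that differ only by case or surrounding whitespace as variants of one value) are assumptions chosen for illustration.

```python
from collections import Counter


def profile_column(values):
    """Profile one column and suggest standardizations.

    Hypothetical helper for illustration only: it counts missing values,
    counts distinct values after normalization, and proposes replacing
    less common spellings with the most common one.
    """
    def is_missing(v):
        return v is None or str(v).strip() == ""

    missing = sum(1 for v in values if is_missing(v))
    present = [str(v) for v in values if not is_missing(v)]

    # Group raw values that differ only by case or surrounding whitespace.
    groups = {}
    for v in present:
        groups.setdefault(v.strip().lower(), Counter())[v] += 1

    # For each group with more than one spelling, recommend the most
    # frequent spelling as the standard form.
    suggestions = {}
    for variants in groups.values():
        if len(variants) > 1:
            canonical = variants.most_common(1)[0][0]
            for variant in variants:
                if variant != canonical:
                    suggestions[variant] = canonical

    return {"missing": missing, "distinct": len(groups), "suggestions": suggestions}


cities = ["New York", "new york", "New York ", "Boston", None, "", "New York"]
report = profile_column(cities)
print(report)
```

A real data prep tool would apply many more checks (data types, outliers, fuzzy matches), but the shape is the same: profile first, then offer the user a recommendation they can accept or reject.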

Self-Service Data Prep Plus Big Data Fabric Accelerates Your Data-driven Initiatives

We have all heard the idea that data is the new oil fueling digital transformation. While true, this is only half the story. On its own, the only thing raw data can do is line the pockets of those you pay to store it. To be truly insightful, raw data needs to be turned into information: data that has context and is clean, complete, and consumable.

The challenge is that this must be done across a highly dispersed data landscape at scale, so that every person, process, app, and device in the enterprise can become smarter and more informed.

Here are five key requirements that will determine your success:

  • Ease of use to empower business consumers at scale. To empower the entire business with information, we must first enable users to get to the data, wherever it might reside. Then we must help them understand that data and prep it by themselves for their needs – whether for a new BI report, an Excel model, or a data science project. Accomplishing this means providing an easy, user-friendly experience that allows them to prep data via intuitive, point-and-click interfaces that do not require programming skills.
  • Intelligence to ensure business users don’t get it wrong. Embedded algorithms and AI (artificial intelligence) should continually profile the data, guide the user on ways to clean and standardize it, and recommend ways to join or combine datasets. Not only does this accelerate the process, it also provides “guardrails” to ensure casual users do not make errors, such as accidentally producing a Cartesian product when joining datasets.
  • Powered by an adaptive, elastic architecture that can scale out and contract as needed. Successful data projects require speed on one side and the ability to scale to large data and compute capacity on the other. Your big data fabric management and data prep architecture must be able to provide both of these dimensions. You need the ability to quickly spin up your cluster in any desired cloud environment, load your millions of data rows into your visual data prep interface, publish the results, and then tear it all down again when the project is done. This affects not only your business’s agility, but also the cost of delivering insights to your organization.
  • Enterprise governance and security. While this is obvious, it bears repeating. Siloed, point solutions do not provide end-to-end data governance, data lineage, security, or any of the building blocks for an enterprise data governance strategy.
  • Collaboration and sharing. If the entire organization is going to embrace the data-to-information journey, then it must accept the fact that data sharing is the ultimate team sport. It requires peer-to-peer collaboration on projects and sharing results. It means reusing other users’ datasets to avoid reinventing the wheel. It also calls for those with deep technical skills to educate and collaborate with casual users, so that the entire organization can learn, grow, and become more data-savvy.
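
The join “guardrail” mentioned in the list above can be sketched in a few lines of Python. This is an illustrative assumption about how such a check might work, not a description of any product's implementation: the function names and the `max_blowup` threshold are hypothetical. The idea is to count how many rows an inner join would produce before running it, and to warn when duplicated keys on both sides would multiply the row count.

```python
from collections import Counter


def estimate_join_rows(left_keys, right_keys):
    """Number of rows an inner join on these key columns would produce."""
    left_counts = Counter(left_keys)
    right_counts = Counter(right_keys)
    # Each key contributes (occurrences on the left) x (occurrences on the right).
    return sum(n * right_counts[k] for k, n in left_counts.items() if k in right_counts)


def check_join(left_keys, right_keys, max_blowup=2.0):
    """Warn when the join output is much larger than either input.

    max_blowup is a hypothetical threshold: the join is flagged if it
    would produce more than max_blowup times the rows of the larger input.
    """
    rows = estimate_join_rows(left_keys, right_keys)
    larger = max(len(left_keys), len(right_keys), 1)
    return {"rows": rows, "warn": rows > max_blowup * larger}


# Joining on a low-cardinality column such as country multiplies rows.
print(check_join(["US", "US", "US", "CA"], ["US", "US", "US", "US", "CA"]))

# Joining on a unique identifier does not.
print(check_join([1, 2, 3], [1, 2, 3, 4]))
```

Because the check only needs key counts, it can run before the join itself, which is what lets a tool stop a casual user before the mistake becomes expensive.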