
Self-Service Data Preparation: Are You Seeing The Full Picture?

June 12, 2019

80% of time and effort is spent gathering, preparing, and integrating data, but only 20% is spent on insights and visualization.

We’ve all heard this famous (and sometimes infamous) 80/20 principle around analytics and data science. Despite this striking statistic, Dan Vesset from IDC asserts that annual spending on BI and analytics software is more than twice the spending on data integration and data integration software.

So what’s wrong with this equation? It implies that companies are underspending on data preparation and integration and trying to offset that with staffing. Long story short: the wrong people are spending time fixing things with the wrong tools. Fortunately, a new breed of software aimed at empowering business analysts with self-service data preparation, data quality, and data standardization is available today.


What Is Self-Service Data Preparation?

Simply put, self-service data preparation tools enable non-technical business users or analysts to find, profile (that is, understand what is in the data), clean, and shape data with a visual and interactive user experience. No coding or programming is required; users work in a familiar Excel-like interface. The software essentially guides the user through a series of steps such as removing duplicates, standardizing names, combining data sets, and publishing a data set ready for use in a report or dashboard. If you want to see all of this in action, I invite you to watch this short video of the basic data preparation process.


Are You Watching the Movie Trailer or the Entire Feature Film?

How many times have you watched a movie trailer, then gone to the theater to actually watch the movie only to wonder how the trailer so poorly represented the actual movie that you just saw? Naturally, the film producers try to pack the key selling points of the movie into 90 seconds, but a few highlights can never tell the full story.

What do movie trailers have to do with data preparation? While data preparation is fairly simple to understand conceptually, some data preparation tools on the market take a shortcut: rather than working on the full body of data, they automatically pull out and work on a mere sample of it. For example, you might have 20 million rows of data, but the software will “sample” them and bring back a limited set, perhaps only 100,000 rows. When you inspect your data visually, the software-driven recommendations are based on the sample and not on the full body of the data. Consequently, the number of unique values you see in a column will not represent the entire data set. And every issue you fix is fixed only for what the sample showed you: if the sample contained five different spellings of a company name, your fix covers those five variations and not the two others that exist in the full data but never appeared in the sample.
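To make the company-name example concrete, here is a minimal Python sketch. Everything in it is hypothetical: the spellings, the row counts (scaled down from the 20 million and 100,000 above), and the helper names `build_fix_map` and `apply_fix`. It does not represent how any particular product samples data; it only illustrates that a fix built from a sample can cover nothing the sample did not contain.

```python
import random

def build_fix_map(values, canonical):
    """Map every spelling seen in `values` to one canonical name."""
    return {spelling: canonical for spelling in set(values)}

def apply_fix(values, fix_map):
    """Replace known spellings; leave anything unseen untouched."""
    return [fix_map.get(v, v) for v in values]

# Hypothetical data: five common spellings plus two rare ones.
common = ["Acme Corp", "ACME Corp.", "Acme Corporation", "acme corp", "Acme Co"]
rare = ["Acme Corp Inc", "A.C.M.E. Corp"]

rng = random.Random(0)
full = [rng.choice(common) for _ in range(200_000)] + rare * 2
rng.shuffle(full)

# The tool only shows us a small random sample of the rows.
sample = rng.sample(full, 1_000)

# We standardize every spelling we can see in the sample ...
fix_map = build_fix_map(sample, "Acme Corp")
# ... and then the resulting job runs against the full data set.
cleaned = apply_fix(full, fix_map)

seen = set(sample)
missed = set(full) - seen  # spellings that never showed up in the sample
print("distinct spellings in sample:", len(seen))
print("distinct spellings in full data:", len(set(full)))
print("still unfixed after the job:", sorted(set(cleaned) - {"Acme Corp"}))
```

Whatever the sample happens to contain, the spellings left unfixed are exactly the ones the sample missed.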

An even bigger challenge is that most data prep tools offer you no choice in sample size; it is simply a limit enforced by the software. These limits are often based on a fixed memory size, which means the actual number of rows in your sample varies with the number of columns in your data. The more columns you have, the fewer rows will fit into the sample.
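The arithmetic behind that last point is simple. The sketch below assumes a 1 GB sample limit and an average of 50 bytes per cell; both figures are made up for illustration, and `rows_that_fit` is a hypothetical helper, not a setting in any real tool.

```python
def rows_that_fit(budget_bytes: int, n_columns: int, avg_bytes_per_cell: int) -> int:
    """Rough number of rows that fit in a fixed memory budget."""
    bytes_per_row = n_columns * avg_bytes_per_cell
    return budget_bytes // bytes_per_row

BUDGET = 1_000_000_000  # hypothetical 1 GB sample limit
CELL = 50               # hypothetical average bytes per cell

for columns in (10, 50, 100):
    print(columns, "columns ->", rows_that_fit(BUDGET, columns, CELL), "rows")
```

Under these assumptions, a 10-column data set gets a sample of 2,000,000 rows, while a 100-column data set gets only 200,000 rows from the same budget.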


Three Major Risks of Sample Data Preparation

There are multiple problems with the sampling approach to data preparation:

  • Time wasted iterating and reworking. Working on the sample is just the beginning of your effort. Once you create your data preparation job on the available sample, you will then need to run the job on the full data set. Anything missed in your first attempt must be resampled and redone.
  • Bad insights. There are countless ways you can be impeded by your sample’s shortcomings. For example, if searching for outliers in your data, such as when conducting anomaly detection, how can you possibly know what is or is not in your data if you are only viewing a fraction of it?
  • Limited usefulness. Data preparation is no longer confined to empowering data analysts to find some data and prepare it for a new report or dashboard. On the contrary, many organizations are running major data initiatives powered by self-service data preparation jobs created by business users. For example, one Paxata customer brings together large volumes of healthcare data from disparate external sources and merges them with internal sources to get a complete picture of medical adherence statistics for a patient. Listen to Jake Woods from AdhereHealth discuss how Paxata’s combination of speed and accuracy saved a patient’s life.
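The “bad insights” risk above can be illustrated with a small Python sketch. The data is invented (100,000 ordinary readings with three injected anomalies), and the 1.5 × IQR rule is just one common way to flag outliers, chosen here for simplicity rather than because any specific tool uses it.

```python
import random
import statistics

def iqr_outliers(values):
    """Return the values lying outside the 1.5 * IQR fences."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    spread = q3 - q1
    low, high = q1 - 1.5 * spread, q3 + 1.5 * spread
    return [v for v in values if v < low or v > high]

# Hypothetical readings: 100,000 ordinary values between 90 and 110 ...
full = [100 + (i % 21) - 10 for i in range(100_000)]
# ... plus three injected anomalies.
for position, value in [(12_345, 10_000), (54_321, -5_000), (98_765, 7_500)]:
    full[position] = value

# A 1,000-row random sample, as a sampling-based tool might show.
rng = random.Random(0)
sample = rng.sample(full, 1_000)

full_outliers = iqr_outliers(full)
sample_outliers = iqr_outliers(sample)
print("anomalies in full data:", len(full_outliers))
print("anomalies visible in sample:", len(sample_outliers))
```

With three anomalies in 100,000 rows, a 1,000-row sample will usually contain none of them, so the analyst sees clean data that is not actually clean.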


Paxata Is The Only Data Preparation Software Built For Any Scale

Paxata was architected and designed to solve the data prep problem at any scale. Because it is built on a native, multi-tenant cloud architecture and powered by Apache Spark, you are no longer forced to work iteratively through multiple sample cycles to get your data in shape. Load the data you need and get the full picture! In addition, Adaptive Workload Management lets you easily move between real-time interactive and batch processing options to balance performance against cost.


Get The Full Picture

As with all great stories, there is often a surprise in the plot. But you wouldn’t want surprises in your data, and you certainly don’t want the trailer. Instead, you want the full picture your data has to give.

Some may argue that working on a sample is sufficient in some situations, because it is cheaper than processing a full data set. However, with the availability and cost reduction of cloud processing and storage, this should no longer be a consideration. Data prep powered by a cloud native platform can help you democratize access to all your data to deliver powerful business outcomes. It is now time to sit back and watch your data feature film!

About the author

Value-Driven AI

DataRobot is the leader in Value-Driven AI – a unique and collaborative approach to AI that combines our open AI platform, deep AI expertise and broad use-case implementation to improve how customers run, grow and optimize their business. The DataRobot AI Platform is the only complete AI lifecycle platform that interoperates with your existing investments in data, applications and business processes, and can be deployed on-prem or in any cloud environment. DataRobot and our partners have a decade of world-class AI expertise collaborating with AI teams (data scientists, business and IT), removing common blockers and developing best practices to successfully navigate projects that result in faster time to value, increased revenue and reduced costs. DataRobot customers include 40% of the Fortune 50, 8 of top 10 US banks, 7 of the top 10 pharmaceutical companies, 7 of the top 10 telcos, 5 of top 10 global manufacturers.
