Evolution of Data Management: From Presumption-driven to Data-driven
By now, nobody needs to be convinced that data is at the core of driving business value. After all, data is at the center of every business initiative, and therefore, ensuring that you have the most accurate, reliable and complete data possible is a critical first step. Addressing this need is essentially what Information Management tools — including legacy ETL and newer data preparation solutions — were designed to accomplish.
However, there is a vast difference in the approaches that these solutions take and they fall broadly into two categories:
- Presumption-driven approach
- Data-driven approach
Let’s take a closer look.
Traditionally, people who have the business context for the data don’t have direct access to the required data, nor the skills needed to merge, profile and transform data sets themselves, so this work is delegated to IT professionals who have the required skills and tools. IT professionals, however, lack the business context of the data and need to rely on their business counterparts to provide guidance on how it needs to be prepared. This partnership between business and IT is typically formalized in a set of requirements provided by the business, which IT converts into rules and then enforces through a variety of ETL-like tools.
In a presumption-driven approach, requirements are based on incomplete presumptions of what is needed, often without a detailed understanding or study of underlying issues with the data.
For example, when a marketing analyst needs to integrate partner-sourced leads with data in a CRM or marketing automation tool, she will request a data engineer or IT developer to “match the incoming partner-sourced leads with existing CRM data based on email addresses and generate an exception report for any leads that already exist in the system.”
This requirement is created based on a number of presumptions:
- Email is the best indicator for identifying matches,
- The incoming data is complete and does not require any further enrichment, and
- There are no checks for data quality or remediation of data quality issues needed.
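The presumed rule above can be sketched in a few lines. This is a minimal illustration, not any vendor's implementation; the field name `email` and the record shapes are hypothetical:

```python
# Sketch of the presumed matching rule: join on email alone and flag
# leads that already exist in the CRM as exceptions. Note what the rule
# silently assumes: emails are present, clean, and sufficient for matching.

def match_by_email(partner_leads, crm_records):
    """Return (new_leads, exceptions) using email as the sole match key."""
    crm_emails = {
        r["email"].strip().lower() for r in crm_records if r.get("email")
    }
    new_leads, exceptions = [], []
    for lead in partner_leads:
        email = (lead.get("email") or "").strip().lower()
        if email and email in crm_emails:
            exceptions.append(lead)  # already in CRM -> exception report
        else:
            # Treated as new -- even if it is really a duplicate hiding
            # behind a different or missing email address.
            new_leads.append(lead)
    return new_leads, exceptions

leads = [{"email": "Ana@example.com"}, {"email": "bo@example.com"}]
crm = [{"email": "ana@example.com"}]
new, dupes = match_by_email(leads, crm)
```

The logic is simple precisely because the hard questions (missing emails, aliases, data quality) were presumed away before anyone looked at the data.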
The logic formulated and implemented is often independent of any insights into the state of the actual data, but is based on an understanding of a business process.
Once the presumed requirements are handed off, business analysts wait for data engineers or IT developers to deliver the implementation, which often takes weeks, if not months. Business analysts perform tests upon completion of the implementation, at which time they may identify issues and/or gaps. These findings, which may also be uncovered during ongoing usage of the solution, provide additional requirements and the entire cycle begins anew.
With presumed logic at the heart of the process, this approach is often a waterfall, with validations occurring late in the cycle. This costs time and money, and the quality of the results is seldom desirable.
In a data-driven approach, by contrast, data is at the center of identifying and formulating the steps to solve a use case. A business analyst, the person who best understands the data’s context, is empowered to interact with the data directly and envision the steps needed to transform it. Machine learning intelligence helps them pinpoint the necessary transformations and identify issues that need remediation.
Let’s use the same example of a marketing analyst receiving partner-sourced leads and needing to ingest and integrate them with CRM or marketing automation data. In this scenario, the marketing analyst will be able to interact with the data directly, profile and identify key missing fields to generate an exception report, and decide on the right matches based on machine learning suggestions. In this case, the join may be a combination of email and/or social media handles, thereby increasing the validity of the results.
Rather than needing to run data through a slow batch process, the analyst can see the percentage of join matches interactively and instantaneously; if the suggested confidence score is not high enough, she can quickly enrich the data with additional data elements to increase the chances of a match with existing data.
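The interactive check described above can be sketched as a simple match-rate computation over multiple candidate keys. The keys (`email`, `twitter`) and record shapes are illustrative assumptions, not a specific product's API:

```python
# Sketch of a data-driven check: what fraction of incoming leads match
# at least one CRM record on ANY of the chosen keys (email and/or a
# social media handle)? The analyst inspects this rate before committing
# to a join, and enriches the data if the rate is too low.

def match_rate(partner_leads, crm_records, keys):
    """Share of leads matching at least one CRM record on any given key."""
    def key_values(record, key):
        v = (record.get(key) or "").strip().lower()
        return {(key, v)} if v else set()

    # Index every non-empty (key, value) pair present in the CRM.
    crm_index = set()
    for r in crm_records:
        for k in keys:
            crm_index |= key_values(r, k)

    matched = sum(
        1 for lead in partner_leads
        if any(key_values(lead, k) & crm_index for k in keys)
    )
    return matched / len(partner_leads) if partner_leads else 0.0

leads = [
    {"email": "", "twitter": "@ana"},           # no email, but handle matches
    {"email": "bo@example.com", "twitter": ""}, # matches nothing yet
]
crm = [{"email": "ana@example.com", "twitter": "@ana"}]
rate = match_rate(leads, crm, keys=["email", "twitter"])
# If the rate is low, enrich the leads (e.g. add handles) and re-check
# interactively, rather than waiting weeks for a batch implementation.
```

Because the check is cheap and immediate, the analyst can try key combinations the original requirements never presumed.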
At the heart of this exercise, the data guides the user to a desired state, allowing the user to create steps they otherwise would not have presumed. The result is higher-quality output from a more agile, near-instantaneous process that saves time and money.
Comparing the approaches
Presumption-driven approaches tend to center around logic that is decoupled from the nuances of the data. Data-driven approaches tend to center around the reality of the data at hand. The former is cumbersome and slow, while the latter is agile.
Most people tend to bucket ETL tools into the presumption- or logic-driven category. However, even newer data preparation solutions that expect users to first formulate the end-to-end logic by drawing boxes and lines, without being guided by insights into the data, follow the same presumption-based paradigm as ETL tools.
Also, data preparation tools that restrict users to samples claim to guide them through interactivity with the data, except that the data is just a tiny sliver — often less than 10% — of the overall data set! In this case, the user is given a false sense of confidence, as their impression of the data they are interacting with does not accurately reflect reality. Despite outward appearances, these cases are just as presumption-based as legacy ETL tools. The user must take a highly iterative approach: formulate steps based on a sample, run the inferred logic in batch against the full data set, interpret the results to identify what was missed initially, then start the cycle all over again with yet another sample.
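A toy example makes the sampling problem concrete. The numbers and record layout below are fabricated purely for illustration; the point is only that a convenient sample can hide exactly the records that break the presumed logic:

```python
# Illustrative only: a "first N rows" sample (a common default) can
# completely hide unmatched records that cluster at the end of a file,
# e.g. leads from a new partner appended last with unknown emails.

full = [{"email": f"user{i}@example.com"} for i in range(1000)]
crm_emails = {f"user{i}@example.com" for i in range(900)}  # last 100 unknown

def match_pct(rows):
    """Fraction of rows whose email already exists in the CRM."""
    return sum(r["email"] in crm_emails for r in rows) / len(rows)

head_sample = full[:100]  # the 10% sliver the tool actually shows

print(match_pct(full))         # 0.9 -- 100 leads genuinely unmatched
print(match_pct(head_sample))  # 1.0 -- the sample hides every one of them
```

An analyst working from the sample would conclude no exception handling is needed, and only discover the gap after a full batch run, restarting the cycle the paragraph above describes.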
In order to solve data problems, interacting with the entire data set, profiling it, and transforming it is by far the fastest and most desirable approach. The data itself reveals what preparation is needed; limiting yourself to anything less than 100% of your data when preparing it to derive insights is suboptimal and, worse, can lead to misleading results. Modern information management platforms such as Paxata enable users to interact with their entire data set and intuitively guide their data preparation process, leading to much faster and more reliable results.
After all, why wouldn’t your data be at the center of your data management exercise?