Your last big data investment has most probably run into a data quality wall, yet you’ve managed to declare it a success, all while knowing that it could have added more value. Don’t worry: you’re not alone in this (mis)adventure.
Most companies manage to utilize only about 3 percent of their data when investing in big data analytics that lean heavily on data integration technologies. Let’s take a moment to discuss the biggest cause of this abysmal utilization: data quality issues. What is the nature of these issues, and how many kinds are there? What challenges do they cause? And can one prevent them from arising?
We begin by looking at the fundamental nature of big data: the diversity of data sources, especially in enterprises, is phenomenal. Data comes in different types, with varying levels of complexity and structure that, more often than not, complicate processes and practices down the line. Data integration is one such process; problems there carry over into data analytics and therefore into the quality of downstream applications.
5V's of Big Data
Ensuring data quality is one of the most powerful ways to get the most out of these big data investments. For the best results, data quality has to be maintained consistently through every stage of the data journey.
Checking data accuracy starts with understanding where the data lives and combining it in a way that is consistent across different (and siloed) data sources. The following practices help keep errors out in the first place:
● Automating data entry
● Selective manual entry options
● Great user interface and design principles
● Instant data validation (for entered data)
● Verification from source to target mapping
● Rule-guided ingestion of only good data
● Monitoring bad or noisy data
● Fix data issues close to the source
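Instant validation and rule-guided ingestion from the list above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the field names (`email`, `age`, `country`), the three rules and the `ingest` function are all hypothetical, chosen only to show good records being let through while bad ones are set aside with the reasons they failed.

```python
from dataclasses import dataclass, field
from typing import Callable

# Hypothetical rules for a customer record: rule name -> check function.
RULES: dict[str, Callable[[dict], bool]] = {
    "email_has_at_sign": lambda r: "@" in str(r.get("email") or ""),
    "age_in_range": lambda r: isinstance(r.get("age"), int) and 0 <= r["age"] <= 120,
    "country_present": lambda r: bool(str(r.get("country") or "").strip()),
}


@dataclass
class IngestResult:
    accepted: list = field(default_factory=list)
    rejected: list = field(default_factory=list)  # (record, names of failed rules)


def ingest(records):
    """Let through only records that pass every rule; keep the rest for monitoring."""
    result = IngestResult()
    for record in records:
        failed = [name for name, check in RULES.items() if not check(record)]
        if failed:
            result.rejected.append((record, failed))
        else:
            result.accepted.append(record)
    return result


sample = [
    {"email": "a@example.com", "age": 34, "country": "IN"},
    {"email": "broken-address", "age": 34, "country": "IN"},
    {"email": "b@example.com", "age": 430, "country": ""},
]
outcome = ingest(sample)
for record, failed in outcome.rejected:
    print("rejected:", record, "failed:", failed)
```

Keeping the rejected records together with the names of the rules they broke is what makes it possible to monitor bad data and trace issues back to their source.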
For the errors that persist, here are a few suggestions for handling such corrupt data:
● Accept the error if it falls within an acceptable standard. For example, accept the response if the answer to the question ‘Where do you work?’ is ‘Men’s Salon’ or ‘Unisex Salon’ instead of ‘Salon.’
● Reject the error, particularly during data imports, when the information is so severely damaged or incorrect that it makes more sense to delete the entry than to try to correct it. An example could be transcripts of call centre interactions, which are usually very unstructured data.
● Correct the error in cases of misspellings (of names, for instance) or similar slips. For example, if there are variations of a name, you can set one as the ‘Master’ and use it to correct the consolidated data across all datasets.
● Create a default value if you don’t know the value. It’s better to have a value such as unknown or N/A rather than nothing at all.
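The four options above (accept, reject, correct, default) can be combined into a single cleaning step. The sketch below is one possible arrangement under stated assumptions: the `MASTER_NAMES` table, the `name` and `workplace` fields and the `handle` function are hypothetical, and the rule that a record without a usable name is rejected is an illustrative choice, not something the list prescribes.

```python
from typing import Optional

# Hypothetical table mapping known name variants to the chosen 'Master' spelling.
MASTER_NAMES = {"Jon Smith": "John Smith", "John Smyth": "John Smith"}
DEFAULT_VALUE = "N/A"


def handle(record: dict) -> Optional[dict]:
    """Return a cleaned copy of the record, or None if the record is rejected."""
    name = record.get("name")
    # Reject: without a usable name the entry is not worth repairing (assumed rule).
    if not isinstance(name, str) or not name.strip():
        return None

    cleaned = dict(record)
    # Correct: replace known variants with the master spelling.
    cleaned["name"] = MASTER_NAMES.get(name.strip(), name.strip())

    workplace = str(record.get("workplace") or "").strip()
    if workplace:
        # Accept: close variants such as "Men's Salon" are kept as entered.
        cleaned["workplace"] = workplace
    else:
        # Default: an explicit placeholder is better than nothing at all.
        cleaned["workplace"] = DEFAULT_VALUE
    return cleaned


raw = [
    {"name": "Jon Smith", "workplace": "Men's Salon"},
    {"name": "", "workplace": "Salon"},
    {"name": "Asha Rao", "workplace": None},
]
cleaned_records = []
for row in raw:
    result = handle(row)
    if result is not None:
        cleaned_records.append(result)
print(cleaned_records)
```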
In large organizations, traditional approaches are not suited to handling massive data volumes and variety. When dealing with errors in enterprise-level data, members of the global data community commonly turn to a popular checklist of six primary dimensions for data quality assessment: completeness, uniqueness, timeliness, validity, accuracy and consistency.
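Dimensions in checklists of this kind become more useful once they are measured. As a rough sketch, here is how completeness and uniqueness, two dimensions that commonly appear in such checklists, might be scored for one column of a dataset. The function names, the sample rows and the choice to treat `None` and the empty string as missing are assumptions made for illustration.

```python
MISSING = (None, "")  # assumption: these two values count as "not filled in"


def completeness(rows, column):
    """Share of rows in which the column holds a non-missing value."""
    if not rows:
        return 0.0
    filled = sum(1 for row in rows if row.get(column) not in MISSING)
    return filled / len(rows)


def uniqueness(rows, column):
    """Share of non-missing values in the column that are distinct."""
    values = [row[column] for row in rows if row.get(column) not in MISSING]
    if not values:
        return 0.0
    return len(set(values)) / len(values)


customers = [
    {"id": 1, "email": "a@example.com"},
    {"id": 2, "email": None},
    {"id": 2, "email": "b@example.com"},
    {"id": 3, "email": ""},
]
print("email completeness:", completeness(customers, "email"))
print("id uniqueness:", uniqueness(customers, "id"))
```

Scores like these can be tracked over time per data source, so that a drop in quality is noticed before it reaches downstream analysis.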
Finally, armed with data from various sources, we need to curate the data before we combine it. The ICPSR states that, “Through the curation process, data are organized, described, cleaned, enhanced, and preserved for public use, much like the work done on paintings or rare books to make the works accessible to the public now and in the future.” At the data curation stage, organizations may want to reduce their dependency on human intervention. To do so, they can use ML to better understand consumers, and AI and deep learning to recognize engagement and buying patterns, then apply what has been learned to evolve algorithmic behaviour that further strengthens effective learning.
The data curation space sports a few popular tools. One is Tamr, which takes a bottom-up, ML-based approach to unifying disparate, dirty datasets. The platform’s advanced algorithms automate as much as 90 percent of the decisions taken. Another is Drunken-data-quality, a small library that checks constraints on Spark data structures. It can assure a certain level of data quality, especially in cases of continuous data imports.
Alongside these tools, here are a few helpful hints to observe at the curation stage:
1. Focus on data usage, i.e. try to make data producers consume their own data (for dashboards or automated KPIs), because producers who use their data will be inclined to notice bad data quality. Furthermore, a best practice is to automatically delete data (with a warning) that has not been used by anyone for an extended period of time, reducing the amount of data that requires quality checks.
2. Appoint clear data owners for each data stream or data set. This person or team attests to whether the data is correct. If a data set has no owner, delete it.
3. Do not curate all data in one central place; instead, consider local data lake shores or data marts to provide use-case-specific curation.
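The first two hints lend themselves to a simple automated review of a data catalog. The following sketch flags data sets that have no owner or have gone unused for too long, and prints a warning instead of deleting anything. The `DatasetInfo` structure, the catalog entries and the 180-day threshold are all assumptions; the hints only speak of "an extended period of time".

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional

# Assumed threshold standing in for "an extended period of time".
MAX_IDLE = timedelta(days=180)


@dataclass
class DatasetInfo:
    name: str
    owner: Optional[str]
    last_used: date


def review(datasets, today):
    """Return {dataset name: reason} for every dataset flagged for deletion."""
    flagged = {}
    for ds in datasets:
        idle = today - ds.last_used
        if not ds.owner:
            flagged[ds.name] = "no owner"
        elif idle > MAX_IDLE:
            flagged[ds.name] = f"unused for {idle.days} days"
    return flagged


catalog = [
    DatasetInfo("sales_daily", "sales-team", date(2024, 5, 30)),
    DatasetInfo("legacy_export", None, date(2024, 5, 1)),
    DatasetInfo("campaign_2022", "marketing", date(2023, 1, 15)),
]
flagged = review(catalog, today=date(2024, 6, 1))
for name, reason in flagged.items():
    # Warn first; the actual deletion would be a separate, deliberate step.
    print(f"WARNING: {name} is flagged for deletion ({reason})")
```

Separating the warning from the deletion gives owners and consumers a chance to object before any data disappears.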
In conclusion, while the amount of time and energy spent on cleaning up ‘dirty and disconnected’ data is excessive, the work is of paramount importance because it affects the analysis that gives organizations their actionable insights. To make better use of data analysts’ time, data quality needs to be up to par right from the beginning. This creates an atmosphere of trust in organizational data, in its analysis and in the resulting insights.
Shweta Jain is Lead Consultant at ThoughtWorks
Disclaimer: This article is published as part of the IDG Contributor Network. The views expressed in this article are solely those of the contributing authors and not of IDG Media and its editor(s).