In recent years, data analytics technology has given businesses a window into valuable streams of information, from inventory statuses to customer purchasing habits. These data acquisition successes, along with the low cost of disk storage, have led some companies to stockpile large amounts of data just for the sake of jumping on the big data bandwagon. Has “big data” become just another buzzword to be applied without discretion?
Don’t be one of those companies that brag about the competitive advantage of their data volumes without first ensuring the data is valuable. According to Experian Data Quality, 75% of businesses are wasting 14% of revenue due to poor data quality. How can you tell if your data is good or bad?
Turning bad into good
Businesses tend to take the “big” part of big data and run blindly with it, collecting anything and everything they can from numerous sources. Instead, they should plan and structure their data collection strategy to weed out the junk so they can focus on the gems.
When data scientists let just any old data into their collection, they end up spending most of their time cleaning, processing and structuring data with laborious manual and semi-automated methods. More time spent cleaning data means less time analyzing it, making big data a big problem for many organizations.
One thing companies can do to ensure data quality is to fix the problem at the source, where the data is captured. Here are a few simple techniques that can be applied upstream during data collection, with a short sketch after the list:
- Check values for reasonableness. If a user enters a new employee’s birthdate as 07/01/1254, it’s obviously wrong. Validate each field in context and enforce a range of allowed values.
- Constrain values to a pre-defined list. Let the user select a valid value from a list rather than typing it in free-form.
- Perform checks for duplicates. Ensure that records and transactions can be uniquely identified and alert the user that the information about to be entered may have already been captured.
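To make these checks concrete, here is a minimal Python sketch of validation at the point of capture. The record layout, field names, and the set of existing IDs are hypothetical; the point is that reasonableness ranges, pre-defined lists, and duplicate checks can all run before a record is ever saved.

```python
from datetime import date

# Hypothetical pre-defined list the user must select from (no free-form text).
VALID_DEPARTMENTS = {"Engineering", "Finance", "HR", "Sales"}

def validate_new_employee(record: dict, existing_ids: set) -> list:
    """Return a list of problems found in a new employee record; an empty list means OK."""
    problems = []

    # 1. Reasonableness check: a birthdate of 07/01/1254 should never get through.
    birthdate = record.get("birthdate")
    if not isinstance(birthdate, date) or not (date(1925, 1, 1) <= birthdate <= date.today()):
        problems.append(f"birthdate {birthdate!r} is outside the allowed range")

    # 2. Constrain values to a pre-defined list.
    if record.get("department") not in VALID_DEPARTMENTS:
        problems.append(f"department {record.get('department')!r} is not a valid choice")

    # 3. Duplicate check: alert the user if this identifier was already captured.
    if record.get("employee_id") in existing_ids:
        problems.append(f"employee_id {record.get('employee_id')!r} may already exist")

    return problems

# Example: one bad record caught at capture time.
existing = {"E-1001", "E-1002"}
record = {"employee_id": "E-1001", "birthdate": date(1254, 7, 1), "department": "Slaes"}
for problem in validate_new_employee(record, existing):
    print("Rejected:", problem)
```

In a real system these rules would live in the capture application itself (form validation, database constraints, or an API layer), so bad values are rejected at the moment of entry rather than discovered later.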
Some of these techniques are easier said than done, because many companies can’t make updates to their data capture programs: the applications may be legacy systems, or vendor-hosted and outside the company’s control.
When this occurs, there are techniques that can be applied downstream to clean the data. Here are a few of those post-data-collection techniques, with a short sketch after the list:
- Standardization. Create a standard for certain fields and data types. This is essential when merging data from multiple sources. For instance, a simple field like Gender from one system might contain coded values like ‘M’ for Male, while another system might contain the full text ‘Male’. Either way, identify a common standard and transform the non-conforming data points to meet it when the data is merged.
- Backfill missing data. In some cases, alternate data sources containing the missing data may be accessible. Programs have to be written that retrieve the missing data and update the non-conforming records. Where the missing data is simply not available, sometimes records within the same source can be used as a model to determine those missing data points.
- Eliminate duplicate records. Determine the unique identifier and use additional fields, such as effective dates or transaction timestamps, to decide which record is most current. Sometimes simply deleting the duplicate record is sufficient; other times the records must be merged. Having timestamps and unique identifiers on records will simplify the merge-or-purge process.
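Here is a minimal pandas sketch of these downstream steps, assuming two hypothetical source extracts joined on a shared customer_id. The column names and mapping table are illustrative, not a prescribed schema.

```python
import pandas as pd

# Hypothetical extracts from two systems that encode Gender differently.
system_a = pd.DataFrame({
    "customer_id": [101, 102],
    "gender": ["M", None],                   # one coded value, one missing
    "updated_at": pd.to_datetime(["2023-01-15", "2023-02-01"]),
})
system_b = pd.DataFrame({
    "customer_id": [101, 102, 103],
    "gender": ["Male", "Female", "Female"],  # full-text values
    "updated_at": pd.to_datetime(["2023-03-10", "2023-03-12", "2023-01-20"]),
})

# 1. Standardization: transform non-conforming values to a common standard at merge time.
gender_standard = {"M": "Male", "F": "Female", "Male": "Male", "Female": "Female"}
merged = pd.concat([system_a, system_b], ignore_index=True)
merged["gender"] = merged["gender"].map(gender_standard)

# 2. Backfill missing data: use another record for the same identifier to fill the gap.
merged["gender"] = merged.groupby("customer_id")["gender"].transform(
    lambda s: s.ffill().bfill()
)

# 3. Eliminate duplicates: sort by the transaction timestamp and keep the most
#    current record for each unique identifier.
deduped = (
    merged.sort_values("updated_at")
          .drop_duplicates(subset="customer_id", keep="last")
          .sort_values("customer_id")
)

print(deduped)
```

Note that the timestamp column is what makes the “most current record wins” rule possible, which is exactly why timestamps and unique identifiers simplify the merge-or-purge process.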
Preventing ugly data
The processes for managing and improving data are not just technical in nature – they involve a human element that requires its own procedures for how data is used, stored, accessed and changed. This is where data governance comes into play. Data governance involves the tools, business processes, and people who handle data, and it can help to prevent data from getting ugly.
If your enterprise handles large quantities of data, you should consider forming data governance committees within the organization. Ideally, these committees will be composed of leaders from a large swath of departments (not just IT) because data governance requires leadership and investment from across the organization.
Gaining widespread organizational commitment embeds the importance of data quality into your operations and culture. Committees should be charged with the tasks of implementing business processes to measure and track data entry, setting goals for improvement, and holding employees accountable for keeping data standards high.
Data governance brings proactive action to the equation with a system for routinely checking, correcting, and augmenting your company’s data before it becomes an ugly problem. Combined with the technical solutions described above, data governance can help BI managers make predictions or support critical business decisions with confidence, knowing their insights are built on a foundation of clean, accurate data.
When it comes to data, bigger is not necessarily better. Instead of jumping on the big data bandwagon, plan out your overall data strategy so you will end up with clean, usable data with which you can make smart decisions.