Friday 17 June 2016
The outcomes of any analytics project start with the quality of data being pumped into the main system from the various sources the organisation has available. That’s why it is always a surprise when an organisation doesn’t take data quality seriously.
Sure, it’s not going to hinder you functionally. The data sources can still be brought into the analytics system, reports can be built and conclusions will be drawn. But what if these are all wrong? It matters because the conclusions you draw are used to make important decisions and guide your organisation. If the data is inaccurate, it ultimately affects the decisions you’re making.
Setting up data quality rules within your analytics system is critical to removing data quality problems and leaving you with the best quality data possible to feed into your reports. In this post, we’ll be providing real-world, best-practice tips for setting up data quality rules with a view to improving your reporting outcomes.
Common Data Quality Issues
There’s a well-known saying: ‘we’re only human, and we all make mistakes’. This rings very true in a data quality sense. Most of the common data quality issues that occur are caused by human error. This is because a lot of data across an organisation still relies on human-entered information, even more so since the introduction of big data and social media platforms. Sadly, we all make mistakes every now and then, and this shows up in three common issues: duplicate records, misspelt words and inconsistent data formats.
Duplicate Entries – Over time, organisations are left with a mass of duplicate records. This typically develops when users don’t realise a record is already present within the system and create an exact replica. Without rules governing the system, this can continue to happen and get out of hand very quickly.
Misspelt Words – We are all guilty of typos and other misspellings, especially those of us who aren’t inclined to double and triple check the information we’ve entered. I am sure consistent data entry skills aren’t a prerequisite for becoming a GP or salesperson. They’re busy doing their job and will sometimes make mistakes when entering information in a rush. It’s only natural.
Inconsistent Data Formats – With complex data structures and ever more varied data formats thanks to the influx of big data, it has become even more difficult to merge data into a consistent format. For example, the same date can be recorded in different ways: dd/mm/yy, mm/dd/yy, dd/mm/yyyy or as a text string – “Friday 10th June”.
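To make the date example concrete, here is a minimal Python sketch of a normalisation rule that converts the formats listed above into a single ISO format (yyyy-mm-dd). It is an illustration under stated assumptions rather than a definitive implementation: the function name `normalise_date` and the `default_year` parameter are hypothetical, numeric dates are assumed to be day-first (dd/mm), English month names are assumed, and text dates with no year are given a year by the caller. A value such as 03/04/16 cannot be told apart from mm/dd/yy by inspection alone, which is exactly why the format should be agreed at source.

```python
import re
from datetime import datetime

# Assumption: numeric dates are day-first (UK style). "03/04/16" is
# ambiguous on its own, so the ordering has to be agreed per data source.
NUMERIC_FORMATS = ("%d/%m/%Y", "%d/%m/%y")


def normalise_date(raw, default_year):
    """Return the date as 'yyyy-mm-dd', or None if no rule applies."""
    text = raw.strip()

    # Rule 1: numeric dates such as 10/06/2016 or 10/06/16.
    for fmt in NUMERIC_FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue

    # Rule 2: text dates such as "Friday 10th June". The weekday is ignored,
    # the ordinal suffix is dropped, and the missing year is supplied by the
    # caller (an assumption, since the text itself does not carry one).
    found = re.search(r"(\d{1,2})(?:st|nd|rd|th)?\s+([A-Za-z]+)", text)
    if found:
        day, month = found.groups()
        try:
            parsed = datetime.strptime(f"{day} {month} {default_year}", "%d %B %Y")
            return parsed.date().isoformat()
        except ValueError:
            pass

    # Nothing matched: return None so the value can be flagged for review
    # and corrected at source instead of being guessed.
    return None
```

With these rules, "10/06/16", "10/06/2016" and "Friday 10th June" (with a default year of 2016) all normalise to the same value, while something like "06/13/16" is returned as None because it is not a valid day-first date.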
How to Address Data Quality Issues
The fundamental principle behind addressing data quality issues is to ensure that they are corrected at source. Another saying comes to mind: ‘Rubbish in, rubbish out’. We have seen data quality issues swept under the carpet in the short term, where the changes were made in the analytics system but the source databases remained unchanged. This becomes a problem when changing providers or procuring another system that uses the same data sources. Here are some key points when addressing data quality issues:
Data Quality Audit – The objective is to find the data sources plagued by low-quality data and rectify the information at source. Typically these will be the sources with few rules governing data input and format. A rules-based approach to the audit highlights where inaccurate data values are likely.
Data Matching – This builds a unique identifier for each record from a handful of values that rarely coincide across different records. Records that produce the same identifier are treated as a “match”, which reduces duplicates and improves data accuracy. A typical example in healthcare is matching a patient record: take the last 3 characters of each chosen value, e.g. surname, NHS number and postcode, and join them to give an identifier such as “TON7194ET”.
Master Record – If your analytics system supports it and you have the authorisation to do so, you can define a “master record”: a database whose values you are confident are accurate. You can then use it to drive amendments to other data sources that contain duplicates and other invalid data values.
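The data matching point above can be sketched in a few lines of Python. This is a rough illustration, not how any particular product does it: the function names `match_key` and `group_candidates`, the record field names and the sample values are all hypothetical, and the key is built from the last 3 characters of surname, NHS number and postcode as in the example above. Spaces and letter case are normalised first so that the same patient entered slightly differently still produces the same key.

```python
from collections import defaultdict


def match_key(surname, nhs_number, postcode, width=3):
    """Join the last `width` characters of each value into one identifier."""
    parts = (surname, nhs_number, postcode)
    # Keep letters and digits only and upper-case them, so "xx1 4et" and
    # "XX14ET" contribute the same characters to the key.
    cleaned = ("".join(ch for ch in part if ch.isalnum()).upper() for part in parts)
    return "".join(value[-width:] for value in cleaned)


def group_candidates(records):
    """Group records by match key and keep only keys shared by 2+ records."""
    groups = defaultdict(list)
    for record in records:
        key = match_key(record["surname"], record["nhs_number"], record["postcode"])
        groups[key].append(record)
    return {key: rows for key, rows in groups.items() if len(rows) > 1}


# Made-up sample records for illustration only (not real patient data).
sample_records = [
    {"surname": "Clayton", "nhs_number": "000 000 0719", "postcode": "XX1 4ET"},
    {"surname": "clayton ", "nhs_number": "0000000719", "postcode": "xx14et"},
    {"surname": "Hughes", "nhs_number": "000 000 0123", "postcode": "YY2 9AB"},
]

duplicates = group_candidates(sample_records)
```

Here the first two records share the key “TON7194ET” and are grouped as duplicate candidates, while the third stands alone. A shared key is best treated as a prompt for review rather than proof of a duplicate, since only a few characters of each value are compared.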
Technologies that remove the manual aspect of these tasks are readily available today, including our own solution, CXAIR. And whilst challenges are constantly arising in the form of complex data structures and varying data formats resulting from the explosion of big data, we are in a much better position to meet them. The success of any data quality initiative depends on implementing the right technology, involving the right people, such as data stewards, and making sure data quality is not overlooked.
If you are looking to become a data-driven organisation, it’s time to get serious about your data!