
SSCPI Approach - A Definitive Technique for Systematic Data Cleaning

  • Writer: nyaradzoamuseza
  • Feb 18, 2022
  • 3 min read

Updated: Apr 23, 2022

On average, I have spent about 80% of my working time on data cleaning. Clean data facilitates quality insights and, most importantly, machine learning models with high accuracy scores and low losses. Another benefit of data cleaning is the removal of errors from datasets, especially when the data originates from multiple sources. Throughout my experience of working with data, I have developed a method which I named Syntax, Semantics, Convert, Plot, and Impute (SSCPI). SSCPI is a practical, methodical guideline for effective data cleaning. I thought to share it as it may help you too! Happy analyzing!





1. Syntax

One needs to appreciate the syntax of the data they are working with. Data syntax refers to the structure of individual entries in a feature set. For example, you should always check the date syntax, especially when analyzing data from different kinds of sources and/or systems. I recently worked with datasets where dates had been entered as 1/12/2022 instead of 12/1/2022. The (dd/mm/yyyy) versus (mm/dd/yyyy) ambiguity is something to look out for because date-time formats differ per configuration and system. When cleaning data, be actively cautious of the different syntaxes that data in your domain may inherit.
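As a small illustration, here is how one might enforce a single date syntax with pandas. This is a minimal sketch assuming a dd/mm/yyyy convention; the sample values are hypothetical.

```python
import pandas as pd

# Hypothetical sample: dates captured by systems with different syntaxes.
raw = pd.Series(["1/12/2022", "12/1/2022", "2022-12-01"])

# Parse with an explicit format instead of letting pandas guess;
# entries that do not match become NaT and can be reviewed by hand.
parsed = pd.to_datetime(raw, format="%d/%m/%Y", errors="coerce")

# Rows that failed the dd/mm/yyyy assumption are flagged for inspection.
print(raw[parsed.isna()])
```

Being explicit about the expected format surfaces the ambiguous entries instead of silently mis-parsing them.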


2. Semantics

Semantics refers to the derived meaning and interpretation based on the nature of the data. Datasets can contain inherent and external errors. Inherent errors occur at the data source or at the point of data capturing. It is therefore imperative to understand the semantics of each entry. Imagine working for a retail outlet where you are asked to produce insights on commonly bought products. In my country, retail shops have airtime-recharge-card distribution contracts with various internet service providers (ISPs). The dataset you get includes two variables, i.e. "Bandwidth Usage" and "Salt". Some of the values appear as 500g or 500G. ‘G’ would mean 500 gigabits of "Bandwidth Usage" while ‘g’ means 500 grams of "Salt". It is vital to ensure that data is recorded with the correct meaning and under the correct variable so that the dataset makes sense. Eventually, defining semantics is also the first step to developing data constraints.
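To make this concrete, here is a hypothetical sketch of routing the 500g/500G values to the variables their semantics imply; the column names and values are my own illustration.

```python
import pandas as pd

# Hypothetical column where only the letter case separates the meanings:
# '500G' = 500 gigabits of bandwidth, '500g' = 500 grams of salt.
df = pd.DataFrame({"value": ["500G", "500g", "250G", "750g"]})

is_bandwidth = df["value"].str.endswith("G")        # case-sensitive check
number = pd.to_numeric(df["value"].str.rstrip("Gg"))

# Route each number to the variable its unit semantics imply.
df["bandwidth_gbit"] = number.where(is_bandwidth)
df["salt_grams"] = number.where(~is_bandwidth)
print(df)
```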


3. Convert

The next step is the conversion of features to the correct data type and format. It is important to make sure that each feature or variable is in the same state. By the same state, I mean characteristics like an equal number of decimal places, the same sentence case, the same date format, the same time zone, and/or the same units for measured data. At one point I worked on financial data where one million dollars was entered differently as either $1,000.00 (000) or $1.00 (000,000). This is a syntax issue, which is why the conversion step is very important before any other transformation takes place.
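A minimal sketch of normalizing that specific million-dollar notation, assuming the two formats shown above are the only ones present (the helper function is my own):

```python
import pandas as pd

# Hypothetical entries: one million dollars recorded two different ways,
# "$1,000.00 (000)" means thousands and "$1.00 (000,000)" means millions.
raw = pd.Series(["$1,000.00 (000)", "$1.00 (000,000)"])

def to_dollars(entry: str) -> float:
    """Convert both notations to a plain dollar amount."""
    amount, _, scale = entry.partition(" (")
    base = float(amount.lstrip("$").replace(",", ""))
    multiplier = 10 ** scale.rstrip(")").count("0")
    return base * multiplier

print(raw.map(to_dollars))  # both rows become 1000000.0
```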


4. Plot the data

The fourth step of data cleaning is plotting the data on scatter plots and box plots (I recommend these two). This stage is very important before imputation. By plotting, one is looking out for outliers and anomalies in the dataset. Anomalies distort the statistical analysis of the data, especially the mean. If the end goal is to eventually create a machine learning model, outliers will result in low accuracy and poor models. One may, however, use any other visualisation method that suits them.
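As a quick sketch (matplotlib, with made-up values), this is the kind of side-by-side view I mean:

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical ages with one obvious anomaly slipped in.
ages = pd.Series([6, 7, 8, 9, 10, 11, 12, 65])

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
ax1.scatter(range(len(ages)), ages)   # scatter: the 65 stands apart
ax1.set_title("Scatter plot")
ax2.boxplot(ages)                     # box plot: 65 sits past the whiskers
ax2.set_title("Box plot")
plt.tight_layout()
plt.show()
```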


5. Impute

The common mistake that most people make is to impute before outlier detection and removal. Experience has taught me that the dataset is cleaner if imputation is done after outlier detection and removal. Imputation is the replacement of unwanted values with desired ones. A good example is when working with children’s data and one of the recorded ages is 65: one would want to impute using the mean or mode age because no child is 65 years old. It does not always follow that missing values are imputed with the mean or the mode. If need be, zero imputation is also possible, or the entire row can be removed.
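A minimal sketch of that order of operations (outliers first, imputation second), using a hypothetical children’s age column:

```python
import pandas as pd

# Hypothetical children's ages with an impossible value of 65 and a gap.
ages = pd.Series([6, 7, 8, 9, 10, 11, 12, 65, None])

# Step 1: treat out-of-range values as missing (outlier removal first).
ages = ages.mask(~ages.between(0, 17))

# Step 2: impute the remaining gaps with the mean of the valid ages.
ages = ages.fillna(round(ages.mean()))
print(ages)
```

Because the 65 is dropped before the mean is computed, the imputed value is a plausible child’s age rather than one inflated by the outlier.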


These five practical steps have, in my experience, proven to increase efficiency when cleaning data. The SSCPI approach is easy to remember and implement, and it can be applied to any dataset in any domain.


Remember to leave feedback because I would like to know if this approach worked for you.

 
 
 
