
Data scrubbing
Scrubbing data, also called data cleansing, is the process of correcting or removing data in a dataset that is incorrect, inaccurate, incomplete, improperly formatted, or duplicated.
The result of the data analysis process depends not only on the algorithms but also on the quality of the data. That's why the next step after obtaining the data is data scrubbing. To avoid dirty data, our dataset should possess the following characteristics:
- Correctness
- Completeness
- Accuracy
- Consistency
- Uniformity
Dirty data can be detected by applying simple statistical data validation, by parsing the text, or by deleting duplicate values. Missing or sparse data can lead you to highly misleading results.
Statistical methods
In this method, we need some context about the problem (knowledge domain) to find values that are unexpected and thus erroneous, even when the data type matches but the values are out of range. This can be resolved by setting such values to an average or mean value. Statistical validation can also handle missing values, which can be replaced by one or more probable values using interpolation, or the dataset can be reduced using decimation:
- Mean: This is the value calculated by summing up all values and then dividing by the number of values.
- Median: The median is the middle value of a sorted range: 50% of the values fall below it and 50% above it.
- Range constraints: These are numbers or dates which should fall within a certain range. That is, they have minimum and/or maximum possible values.
- Clustering: Usually, when we obtain data directly from the user, some values are ambiguous or refer to the same value with a typo. For example, "Buchanan Deluxe 750ml 12x01" and "Buchanan Deluxe 750ml 12x01." differ only by a ".", and Microsoft or MS instead of Microsoft Corporation refer to the same company, yet all these values are valid. In those cases, grouping can help us get accurate data and eliminate duplicates, enabling faster identification of unique values (see the sketch after this list).
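A minimal sketch of these ideas in Python follows; the sample prices, the valid range of 0 to 100, and the label-normalization rule are assumptions for illustration:
import statistics

# Hypothetical sample: one missing value (None) and one out-of-range entry.
prices = [12.5, 13.0, None, 12.8, 999.0, 13.2]

# Keep only known values inside an assumed valid range of 0 to 100.
valid = [p for p in prices if p is not None and 0 <= p <= 100]
print(statistics.mean(valid))    # sum of the values divided by their count
print(statistics.median(valid))  # middle value of the sorted range

# Replace missing or out-of-range values with the median.
median_price = statistics.median(valid)
cleaned = [p if p is not None and 0 <= p <= 100 else median_price for p in prices]
print(cleaned)

# Group near-duplicate labels under a normalized key (lowercase, no trailing dot).
labels = ["Buchanan Deluxe 750ml 12x01", "Buchanan Deluxe 750ml 12x01."]
groups = {}
for label in labels:
    groups.setdefault(label.lower().rstrip("."), []).append(label)
print(groups)  # a single group, so the two labels are duplicates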
Text parsing
We perform parsing to validate whether a string of data is well formatted and to avoid syntax errors.
Text fields such as dates, e-mail addresses, phone numbers, and IP addresses are usually validated this way with regular expression patterns. Regex is a common abbreviation for regular expression.
In Python, we will use the re module to implement regular expressions. We can perform text search and pattern validations.
First, we need to import the re module:
import re
In the following examples, we will implement three of the most common validations (e-mail, IP address, and date format).
- E-mail validation:
myString = 'From: readers@packt.com (readers email)'
result = re.search(r'([\w.-]+)@([\w.-]+)', myString)
if result:
    print(result.group(0))
    print(result.group(1))
    print(result.group(2))
- Output:
>>> readers@packt.com
>>> readers
>>> packt.com
- The search() function scans through a string, searching for any location where the Regex matches. The group() function returns the string matched by the Regex. The \w pattern matches any alphanumeric character and is equivalent to the [a-zA-Z0-9_] class.
- IP address validation:
isIP = re.compile(r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}')
myString = " Your IP is: 192.168.1.254 "
result = re.findall(isIP, myString)
print(result)
- Output:
>>> ['192.168.1.254']
- The findall() function finds all the substrings where the Regex matches and returns them as a list. The \d pattern matches any decimal digit and is equivalent to the [0-9] class.
- Date format:
myString = "01/04/2001" isDate = re.match('[0-1][0-9]\/[0-3][0-9]\/[1-2][0-9]{3}', myString) if isDate: print("valid") else: print("invalid")
- Output:
>>> valid
- The match() function checks whether the Regex matches at the beginning of the string. The pattern uses character classes such as [0-9] to constrain each digit of the date format.
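- Phone number validation: the section above also lists phone numbers; as a sketch, the pattern below assumes a simple ###-###-#### format rather than a universal rule:
isPhone = re.compile(r'\d{3}-\d{3}-\d{4}')
myString = "Call us at 555-123-4567"
result = re.findall(isPhone, myString)
print(result)
- Output:
>>> ['555-123-4567']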
Tip
For more information about regular expressions:
http://docs.python.org/3.4/howto/regex.html#regex-howto
Data transformation
Data transformation is usually related to databases and data warehouses, where values in a source format are extracted, transformed, and loaded into a destination format.
Extract, Transform, and Load (ETL) obtains data from various data sources, performs some transformation functions depending on our data model, and loads the resulting data into the destination.
- Data extraction allows us to obtain data from multiple data sources, such as relational databases, data streaming, text files (JSON, CSV, and XML), and NoSQL databases.
- Data transformation allows us to cleanse, convert, aggregate, merge, replace, validate, format, and split data.
- Data loading allows us to load data into a destination format, such as relational databases, text files (JSON, CSV, and XML), and NoSQL databases.
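As a minimal ETL sketch in Python (the file names people.csv and people.json and the name and email columns are hypothetical), the following extracts rows from a CSV source, transforms them by trimming whitespace and normalizing the e-mail field, and loads the result as JSON:
import csv
import json

# Extract: read rows from a CSV source (hypothetical file and column names).
with open('people.csv', newline='') as source:
    rows = list(csv.DictReader(source))

# Transform: cleanse each record (trim whitespace, normalize the e-mail field).
for row in rows:
    row['name'] = row['name'].strip()
    row['email'] = row['email'].strip().lower()

# Load: write the cleaned records to a JSON destination.
with open('people.json', 'w') as destination:
    json.dump(rows, destination, indent=2)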

Tip
In statistics, data transformation refers to the application of a mathematical function to the dataset or time series points.
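For example, a base-10 logarithm is often applied to compress the scale of a skewed series (the sample values below are illustrative):
import math

series = [1, 10, 100, 1000]
log_series = [math.log10(x) for x in series]
print(log_series)  # [0.0, 1.0, 2.0, 3.0]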