Practical Data Analysis(Second Edition)
上QQ阅读APP看书,第一时间看更新

Getting started with OpenRefine

OpenRefine (former Google Refine) is a formatting tool very useful in data cleansing, data exploration, and data transformation. It is an open source web application which runs directly on your computer skipping the problem of uploading delicate information to an external server.

To start working with OpenRefine just run the application and open a browser typing the URL:

http://127.0.0.1:3333/ 

Note

See the section: Installation of OpenRefine in Appendix, Setting Up the Infrastructure.

First, we need to upload our data and click on Create Project. In the following screenshot, we can observe our dataset; in this case, we will use the monthly sales of an alcoholic beverages company. The dataset format is an MS Excel (.xlsx) worksheet with 160 rows.

We can download the original MS Excel file and the OpenRefine project from the author's GitHub repository:

https://github.com/hmcuesta/PDA_Book/tree/master/Chapter2/

Text facet

Text Facet is a very useful tool, similar to the filter in a spreadsheet. Text facet groups unique text values into groups. This can help us to merge information, and we can see values which could be spelled in a lot of different ways.

Now, we will create a text facet on the name column by clicking on that column's drop-down menu and selecting Facet | Text Facet. In the following screenshot, we can see the column name is grouped by its content. This is helpful to see the distribution of elements in the dataset. We will observe the number of choices (43 in this example), and we can sort the information by name or by count:

Clustering

Clicking in our text facet (see the following screenshot), we can cluster all the similar values; in this case, we find Guinness Lata DR 440ml 24x01 and Guinness Lata DR 440ml 24x01. Obviously, the "." in the second value is a typo. The option Cluster allows finding this kind of dirty data easy. Now, we just select the option Merge? and define the new cell value; we click on Merge Selected & Close. Refer to the following screenshot:

We can play with the parameters of the Cluster option like changing the method from Key collision to nearest neighbors, selecting the rows in a cluster or the length variance of choices. Playing with the parameters, we can find duplicate items in a data column and more complex misspells such as in the following screenshot, where the values JW Black Label 750ml 12x01 and JW Bck Label 750ml 12x01 refer to the same product with a typo in the color:

Text filters

We may filter a column using a specific text string or using a regular expression (Java's regular expressions). We will click on the option Find of the column we want to filter and then type our search string in the textbox to the left. For more information about Java's regular expressions see:

http://docs.oracle.com/javase/tutorial/essential/regex/

Numeric facets

Numeric facet groups number into numeric range bins. You can customize numeric facets much like you can customize text facets. For example, if the numeric values in a column are drawn from a power law distribution (see 1 in the following screenshot), then it's better to group them by their logs (see 2 in the same screenshot) using the value.log() expression:

If our values are periodic, we could take the modules by the period to find a pattern, using this expression:

mod(value, 6) 

We can create a numeric facet from a text by taking the length of the string, using this expression:

value.length() 

Transforming data

In our example, the column date uses a special date format 01.04.2013, and we want to replace the "." with "/". Fixing this is pretty easy using Transform. We need to select the column date | Edit Cells | Transform.

We are going to write a replace() expression like this:

replace(value,".","/") 

Now, just click on the OK button to apply the transformation:

Google Refine Expression Language (GREL) allows us to create complex validations. For example, for a simple business logic when the column value reaches 10 units we make a discount of 5%, we do this with an if() statement and some algebra:

if(value>10,value*.95,value) 

Note

See the following link for a complete list of functions supported by GREL:

https://code.google.com/p/google-refine/wiki/GRELFunctions

Exporting data

We can export data from an existing OpenRefine project in several formats like:

  • Tab Separated Values (TSV)
  • CSV
  • Excel
  • HTML table

To export as a JSON, we need to select the option Export and Templating Export, where we can specify a JSON template as is shown in the following screenshot:

Operation history

We can save all the transformations applied to our dataset just by clicking on the tab Undo/Redo and then selecting Extract. This will show all the transformations applied to the current dataset (see the following screenshot). Finally, we will copy the generated JSON and it in a text file.

To apply the transformations to another dataset, we just need to open the dataset in OpenRefine and then go to the Undo/Redo tab. Click on the Apply button and copy the JSON from the first project: