
Google Refine offers data cleaning services

Search giant relaunches software from Metaweb

Google has updated and re-released open-source software for cleaning, analysing and transforming data sets, now called Google Refine.

The software, originally called Freebase Gridworks, came with Metaweb, a company Google purchased in July.

Google Refine is a collection of tools for extracting useful information from a data set, particularly one that contains inconsistencies.

This desktop application can, for instance, find all the variant spellings of a word in a data set and replace them with the appropriate term. This process, called normalisation, is nothing new. But normalising data usually requires writing code that is specific to one data set, noted Christopher Groskopf, a developer for the Chicago Tribune.

"The genius of Gridworks is that it is generic enough to work for a wide variety of data sets without the need to write any code at all. Even better the resulting operations are portable, so the process used to clean up 2009's data can be repeated for 2010," Groskopf wrote in a blog post.
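To make the idea of portable normalisation concrete, here is a minimal Python sketch. It is not Google Refine's own method: the function names (`normalise_key`, `build_mapping`, `apply_mapping`) and the keying rule (lower-case, strip punctuation, collapse whitespace) are assumptions chosen for illustration. It groups variant spellings under a shared key, picks the most common spelling in each group, and records the result as a mapping that can be reused on a later data set.

```python
from collections import Counter


def normalise_key(value):
    """Reduce a string to a comparison key (illustrative rule, not Refine's)."""
    kept = "".join(ch for ch in value.lower() if ch.isalnum() or ch.isspace())
    return " ".join(kept.split())


def build_mapping(values):
    """Map every variant spelling to the most common spelling in its group."""
    groups = {}
    for value in values:
        groups.setdefault(normalise_key(value), Counter())[value] += 1
    mapping = {}
    for counter in groups.values():
        canonical = counter.most_common(1)[0][0]
        for variant in counter:
            mapping[variant] = canonical
    return mapping


def apply_mapping(values, mapping):
    """Replace known variants; leave unknown values untouched."""
    return [mapping.get(value, value) for value in values]


# Hypothetical sample data: build the mapping once, reuse it next year.
data_2009 = ["Chicago", "chicago", "Chicago ", "CHICAGO.", "Chicago", "Boston"]
mapping = build_mapping(data_2009)
data_2010 = ["chicago", "Boston", "Denver"]
cleaned_2010 = apply_mapping(data_2010, mapping)
```

Because the mapping is plain data, it could be saved and replayed, which is the kind of repeatability Groskopf describes.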

The software contains a number of other tools as well. It includes an expression language that can be used to analyse a set of data. Filters can be used to isolate subsets of data, which then can be analysed or changed through a set of transform commands.
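The filter-then-transform pattern can be sketched in a few lines of Python. This is an illustration of the general idea only, not Refine's expression language; the helper names `select` and `transform_where` and the sample rows are hypothetical.

```python
def select(rows, predicate):
    """Return the subset of rows for which predicate(row) is true."""
    return [row for row in rows if predicate(row)]


def transform_where(rows, predicate, column, fn):
    """Return new rows with fn applied to `column` in matching rows only."""
    return [
        {**row, column: fn(row[column])} if predicate(row) else dict(row)
        for row in rows
    ]


# Hypothetical sample rows.
rows = [
    {"city": "chicago", "sales": "10"},
    {"city": "Boston", "sales": "7"},
    {"city": "chicago", "sales": "3"},
]


def is_lowercase_city(row):
    return row["city"].islower()


subset = select(rows, is_lowercase_city)  # isolate a subset
fixed = transform_where(rows, is_lowercase_city, "city", str.title)  # change it
```

Returning new rows rather than editing in place keeps the original data available for comparison.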

The software works with plain text files in which commas split the data into columns. Results can be exported in the JSON (JavaScript Object Notation) format, which can then be easily transformed into HTML tables or other formats.
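As a rough illustration of that pipeline, the following Python sketch reads comma-separated text, serialises it as JSON, and renders the JSON as an HTML table. It uses only the standard library and does not reproduce Refine's actual export format; the function names and the sample data are assumptions.

```python
import csv
import io
import json
from html import escape


def csv_to_records(text):
    """Parse comma-separated text (first line is the header) into dicts."""
    return [dict(row) for row in csv.DictReader(io.StringIO(text))]


def records_to_json(records):
    """Serialise the records as a JSON array of objects."""
    return json.dumps(records, indent=2)


def json_to_html_table(json_text):
    """Render a JSON array of flat objects as a simple HTML table."""
    records = json.loads(json_text)
    if not records:
        return "<table></table>"
    headers = list(records[0].keys())
    head = "".join(f"<th>{escape(h)}</th>" for h in headers)
    body = "".join(
        "<tr>" + "".join(f"<td>{escape(str(r[h]))}</td>" for h in headers) + "</tr>"
        for r in records
    )
    return f"<table><tr>{head}</tr>{body}</table>"


# Hypothetical sample input.
sample = "name,city\nAda,London\nBob,Paris\n"
table = json_to_html_table(records_to_json(csv_to_records(sample)))
```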

The software can work with up to a few hundred thousand rows per data set, depending on the user's computer memory. And unlike most spreadsheet software, this software can interactively transform large subsets of data, the company asserted.

Google said this week that it has added several new features to the software, officially called Google Refine 2.0, including the ability to link records to other databases, and a number of new transformation commands and expressions.

The non-profit government watchdog organisation ProPublica has used this software to aggregate data from seven different data sets to show how pharmaceutical companies pay doctors to recommend certain medications.

