OpenRefine is a browser-based, user-friendly interactive data transformation (IDT) tool used to clean ‘dirty’ data in a (semi-)automated manner and to perform data-profiling operations. OpenRefine extensions (e.g. NER and RDF Refine) also let you identify concepts in unstructured text through a process called named-entity recognition (NER) and link data to concepts and authorities that have been declared on the web (e.g. by the Library of Congress and OCLC).
We learnt how to clean data by following The Programming Historian lesson plan Cleaning Data with OpenRefine. The Programming Historian has a suite of lesson plans that teach a wide range of digital tools, techniques, and workflows to aid humanities research and teaching.
But why do I need to clean data?
Errors, inconsistencies, duplicates, and empty values are commonplace in data sets. The quality of your data matters, and tools such as OpenRefine make it easier to diagnose and fix data quality issues.
By following the lesson plan, we learnt how to diagnose and perform five common data quality tasks:
- Identify and remove duplicates and blank rows
- Separate multiple values contained in the same field through the process of atomisation
- Analyse the distribution of values throughout a data set using facets
- Group together different representations of the same reality using clustering
- Identify value errors by applying GREL functions
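To make these five tasks concrete, here is a minimal Python sketch of the same ideas. The records and field names are invented for illustration, and OpenRefine itself performs these steps interactively with GREL expressions and its own clustering engine; the `fingerprint` function below, though, mirrors the key-collision (fingerprint) method OpenRefine's clustering uses.

```python
from collections import Counter
import re
import string

# Toy records resembling collection metadata (hypothetical values).
records = [
    {"title": "Steam engine",  "categories": "Engines|Transport"},
    {"title": "Steam Engine",  "categories": "Engines"},
    {"title": "steam engine ", "categories": ""},
    {"title": "Loom",          "categories": "Textiles|Machinery"},
    {"title": "Loom",          "categories": "Textiles|Machinery"},  # exact duplicate
    {"title": "Loom #2",       "categories": "Textiles"},
    {"title": "",              "categories": ""},                    # blank row
]

# 1. Identify and remove exact duplicates and blank rows.
seen, deduped = set(), []
for rec in records:
    key = tuple(rec.values())
    if key in seen or not any(rec.values()):
        continue
    seen.add(key)
    deduped.append(rec)

# 2. Atomisation: split a multi-valued field on its separator.
for rec in deduped:
    rec["categories"] = [c for c in rec["categories"].split("|") if c]

# 3. Faceting: count how often each value occurs across the data set.
facet = Counter(c for rec in deduped for c in rec["categories"])

# 4. Clustering via a fingerprint key: trim, lowercase, strip
#    punctuation, then sort the unique tokens, so that variant
#    spellings of the same title collide on one key.
def fingerprint(value):
    value = value.strip().lower()
    value = value.translate(str.maketrans("", "", string.punctuation))
    return " ".join(sorted(set(value.split())))

clusters = {}
for rec in deduped:
    clusters.setdefault(fingerprint(rec["title"]), []).append(rec["title"])

# 5. Flag value errors, much as a GREL expression applied to each
#    cell would: here, titles containing anything but letters/spaces.
bad_titles = [r["title"] for r in deduped
              if not re.fullmatch(r"[A-Za-z ]+", r["title"].strip())]
```

After running this, `clusters` groups the three spellings of “Steam engine” under one key, `facet` shows “Engines” appearing twice, and `bad_titles` flags the title containing non-alphabetic characters.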
We found OpenRefine fairly easy to use, and learning it via online tutorials would be very worthwhile for humanities researchers. The Programming Historian lesson used a metadata export of the Powerhouse Museum collection, demonstrating how the tool can be used in the GLAM sector to diagnose and clean collection data.
The Programming Historian also offers a further lesson plan on Fetching and Parsing Data from the Web using OpenRefine.