Corpus data: cleaning and preprocessing

STATUS: under construction

In many cases, the first step in corpus building is cleaning up and normalizing texts that come in formats unsuited to the processing pipeline we have in mind. This may be because the texts are gathered from diverse sources, or because they arrive as html, pdf, rtf, docx or any other of the multitude of file types and formats found on modern computer systems. Typically, this is the starting point when planning to build bespoke corpora for a specific research undertaking. In my team, we call this process 'textputzing'[1].
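As a small illustration of the conversion step for html sources, the sketch below uses only Python's standard library to strip mark-up and return plain, whitespace-normalized text. It is a minimal example, not a full-fledged converter (the class and function names are my own); for real projects, dedicated tools for pdf, docx etc. would be needed.

```python
import unicodedata
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect the text content of an HTML document,
    skipping <script> and <style> elements entirely."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth:
            self.parts.append(data)


def html_to_text(html: str) -> str:
    """Strip tags, normalize Unicode to NFC, collapse whitespace."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    text = unicodedata.normalize("NFC", "".join(parser.parts))
    return " ".join(text.split())
```

Note the Unicode normalization: texts from different sources often mix composed and decomposed character sequences (e.g. 'é' vs. 'e' plus combining accent), which should be unified early in the pipeline.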

The process usually requires the following steps:

  1. data collection
  2. conversion to a suitable, processable digital text format
  3. definition of the specifications for the basic (unannotated) target corpus
  4. cleaning the text of unwanted elements (e.g. mark-up and layout elements etc.) according to the specifications of the target corpus
  5. encoding in the required format with attention to character encodings
  6. choice and implementation of a storage format, usually a plain text format such as txt, or a structured format such as xml or csv (e.g. verticalized formats such as the vrt (VeRticalized Text) format used by the Open Corpus Workbench (see their encoding tutorial), or the different CoNLL formats such as CoNLL-U etc.)
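To make the final step more concrete, the sketch below serializes tokenized sentences into a minimal verticalized layout: one token per line, sentences wrapped in `<s>` elements inside a `<text>` element. This is only the simplest possible variant (the function name is my own); the actual vrt format used by the Open Corpus Workbench additionally supports tab-separated positional attributes per token (e.g. lemma, part of speech), as described in their encoding tutorial.

```python
def to_vrt(doc_id, sentences):
    """Serialize tokenized sentences into a minimal vrt-style string:
    one token per line, sentences wrapped in <s>, all inside <text>."""
    lines = [f'<text id="{doc_id}">']
    for sentence in sentences:
        lines.append("<s>")
        lines.extend(sentence)  # one token per line
        lines.append("</s>")
    lines.append("</text>")
    return "\n".join(lines) + "\n"
```

For example, `to_vrt("doc1", [["This", "is", "fine", "."]])` produces a `<text>` element containing one `<s>` element with four token lines, ready to be written out as a UTF-8 encoded file.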

[1] The term 'textputzing' was coined by my dear colleague and friend Mônica Holtz in the course of a rather extensive corpus cleaning exercise.