UAM Corpus Tool: AutoCode

Tutorial by Sabine Bartsch

UAM Corpus Tool is first and foremost a tool for manual annotation. However, its AutoCode functionality allows annotators to automate annotation steps based on pattern searches. Its documentation describes an example of autocoding grammatical patterns such as passive clauses. This tutorial describes the process of identifying string patterns and autocoding them according to a specific coding scheme.

The task

The research project ePoetics: Korpuserschließung und Visualisierung deutschsprachiger Poetiken (1770 - 1960) für den „Algorithmic Criticism“ (Universität Stuttgart, TU Darmstadt) investigates texts from the emerging genre of literary theory. The project aims to combine hermeneutic with algorithmic approaches in researching the development of this text genre. A corpus of historical texts from this genre is to be annotated for particular theoretical concepts that are described in the texts. While many of these concepts can only be identified by reading the text closely, in other cases certain lexical manifestations (keywords, expressions, etc.) at least hint at a concept that is of interest to the analysis. In these cases, it proved helpful to be able to search the text to be annotated for strings that point to relevant concepts. The annotation window itself does not offer a search function. However, the AutoCode tab offers the relevant functionality.

To use AutoCode, first populate your corpus and set up your coding scheme as usual. Make sure you segment your text according to your needs, either by manually segmenting the text in the Annotation window of your project or by selecting automatic segmentation when populating your corpus. In this example, automatic segmentation into sentences was selected.

The AutoCode Tab

Once you have set everything up and segmented your data, switch to the AutoCode tab and add a new autocode rule:

The AutoCode Rule Editor

This opens the AutoCode Rule Editor, which offers three principal approaches:

  1. Search for segments and assign those segments a particular feature.
  2. Create new segments in one layer based on corresponding segments in another layer.
  3. Create segments in one layer based on string patterns in the text.

For the purposes of this example, we are opting for the first choice:

The white box shows the annotation layer whose segments are searched. In the example used in this tutorial (see the .zip file available for download at the end of this page), the layer is called konzept and its segments are sentences:

The rule is set to find every segment that contains the search pattern.

This can either be a complete word such as “Lyrik”:

Or it can be a pattern with a wildcard, in this case the first three letters of a word plus the asterisk *: the pattern “lyr*” identifies all words that begin with these three letters, followed by anything up to the next white space:
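To make the behaviour of such a wildcard pattern more concrete, here is a minimal Python sketch that works outside UAM Corpus Tool. It is not the tool's own implementation. It assumes that * stands for any run of non-white-space characters and that matching ignores case; both are assumptions made for illustration, and the function names and sample sentences are hypothetical. Note that in Python's own regular expression syntax the equivalent of lyr* would be written lyr\S*.

```python
import re


def wildcard_to_regex(pattern):
    """Translate a simple wildcard pattern such as 'lyr*' into a regex.

    Assumptions (not taken from UAM Corpus Tool): '*' matches any run of
    non-white-space characters, and matching is case-insensitive.
    """
    # Escape the literal parts and join them with "any non-white-space run".
    body = r"\S*".join(re.escape(part) for part in pattern.split("*"))
    # \b anchors the hit at the start of a word; (?!\w) stops a pattern
    # without a wildcard (e.g. 'Lyrik') from matching inside 'Lyriker'.
    return re.compile(r"\b" + body + r"(?!\w)", re.IGNORECASE)


def find_segments(pattern, segments):
    """Return (segment index, matched string) for every hit in the segments."""
    regex = wildcard_to_regex(pattern)
    hits = []
    for index, segment in enumerate(segments):
        for match in regex.finditer(segment):
            hits.append((index, match.group(0)))
    return hits


if __name__ == "__main__":
    # Hypothetical sentence segments, standing in for the konzept layer.
    sentences = [
        "Die Lyrik ist eine Gattung der Dichtung.",
        "Das Drama folgt eigenen Regeln.",
        "Lyrische Formen sind vielfältig.",
    ]
    for index, hit in find_segments("lyr*", sentences):
        print(index, hit)
```

Because the wildcard runs up to the next white space, a hit directly followed by punctuation (for example "Lyrik.") would include the punctuation mark in this sketch.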

This pattern is then saved and the window of the AutoCode Rule Editor closed:

The search for the pattern is carried out by clicking the Show button in the upper right-hand part of the AutoCode tab window:

Once you activate the search by means of the Show button, you are presented with a concordance-style list of results, with the search pattern hits in the central column and their sentence context to the left and right:
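For readers unfamiliar with concordance displays, the following Python sketch shows the general idea of such a keyword-in-context list: each hit is centred, with its left and right context on either side. It is only an illustration under stated assumptions and does not reproduce how UAM Corpus Tool builds its result list. The function names, the prefix-based matching and the sample sentence are hypothetical.

```python
import re


def concordance(segments, prefix, width=30):
    """Return (left context, hit, right context) rows for each hit.

    A hit is any word starting with `prefix`, matched case-insensitively
    (an assumption made for this sketch). `width` should be positive.
    """
    regex = re.compile(r"\b" + re.escape(prefix) + r"\w*", re.IGNORECASE)
    rows = []
    for segment in segments:
        for match in regex.finditer(segment):
            left = segment[:match.start()][-width:]   # text before the hit
            right = segment[match.end():][:width]     # text after the hit
            rows.append((left, match.group(0), right))
    return rows


def format_rows(rows, width=30):
    """Right-align the left context so that the hits line up in one column."""
    return "\n".join(
        f"{left:>{width}} {hit} {right}" for left, hit, right in rows
    )


if __name__ == "__main__":
    # Hypothetical sentence segments.
    sentences = [
        "Die Lyrik ist eine Gattung der Dichtung.",
        "Lyrische Formen sind vielfältig.",
    ]
    print(format_rows(concordance(sentences, "lyr")))
```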

Now you could simply click on the Code selected button to the left above the results concordance. To retain more control, you can instead click on the magnifying glass in front of one sentence after the other and inspect the hits in the project's annotation window. For our purposes, this is what I recommend:

Doing this opens the annotation window at the position of the identified hit. You can make your annotation choices manually as usual, save the annotation and return to the AutoCode concordance list. You will see that you actually have both windows open at the same time and can quickly switch back and forth between them:

Once you are done annotating for the first search pattern, you can select the edit button to the right of the rule and modify your rule with a new pattern:

Statistics

Once you have completed part or all of the annotation task, save your annotations and move to the Statistics tab to inspect the distribution of your annotation features in the text or the corpus.

There, switch to Aspect of interest: Feature Coding Counting: local to inspect the distribution of your features in a table:

You can also switch to a view that attaches the raw figures and percentages to your original annotation scheme:
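As a rough illustration of what raw figures and percentages per feature mean, the following Python sketch counts how often each feature was assigned and expresses each count as a share of all coded segments. This is an assumption about the simplest possible way of counting; it does not claim to reproduce the tool's "local" counting option, and the feature names are hypothetical.

```python
from collections import Counter


def feature_distribution(annotations):
    """Map each feature to (raw count, percentage of all coded segments)."""
    counts = Counter(annotations)
    total = sum(counts.values())
    return {
        feature: (n, round(100 * n / total, 2))
        for feature, n in counts.most_common()
    }


if __name__ == "__main__":
    # Hypothetical features, one per annotated sentence segment.
    coded = ["lyrik", "drama", "lyrik", "epik"]
    for feature, (n, percent) in feature_distribution(coded).items():
        print(f"{feature}\t{n}\t{percent}%")
```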

Example project

You can download the example project used in this tutorial: .zip

Hope this helps. Feedback welcome.