corpus analysis basics

tutorial status: under construction (2021-11-20)

Corpus linguistics has two principal technological mainstays:

  • digital corpora, i.e. principled collections of plain text data that form the empirical basis of the research
  • tools for searching for patterns in the data and/or its annotations, and for sorting and displaying the results in various aggregations, e.g. word lists, word frequency lists or KWIC concordances

These two form the backbone of any corpus linguistic research. Digital corpora are available from various sources. In their simplest form, corpora consist of sets of plain text (.txt) files, sometimes containing just the text, sometimes containing text plus annotation such as part-of-speech tagging. For the time being, we are going to work with unannotated plain text corpora.

The basic functionality of tools for searching for patterns in the data includes creating lists of words and their frequencies, so-called word frequency lists. These are useful for getting a sense of what kind of linguistic data a corpus consists of, how often particular words and constructions occur, and how they are distributed across the corpus.

The other main functionality is concordancing, i.e. the possibility of searching for words in the corpus and displaying them together with a set amount of context to the left and right of the search word(s). Such displays are commonly called keyword in context (KWIC) concordances.

These two principal functions, i.e. word (frequency) listing and concordancing, are often implemented in the same software because they are used hand in hand. We are going to look at AntConc as an example of a commonly used concordancing program, but be aware that there are others as well.
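To make the idea of a word frequency list concrete, here is a minimal Python sketch. It is not how AntConc works internally; the function names tokenize and word_frequency_list, the sample sentence and the tokenization rule (lowercased runs of letters, optionally joined by an apostrophe or hyphen) are assumptions made purely for illustration.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens.

    Assumption: a word is a run of ASCII letters, optionally joined by
    internal apostrophes or hyphens. Real tools let you configure this.
    """
    return re.findall(r"[a-z]+(?:['-][a-z]+)*", text.lower())


def word_frequency_list(text):
    """Return (word, count) pairs, most frequent first."""
    return Counter(tokenize(text)).most_common()


# Made-up sample text, not taken from any corpus.
sample = "The jury said the election was conducted well. The jury praised the city."

# Print the three most frequent words with their counts.
for word, count in word_frequency_list(sample)[:3]:
    print(word, count)
```

Changing the regular expression in tokenize changes what counts as a word, which is why frequency counts from different tools do not always agree.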

Resources used in this tutorial:

  • AntConc
  • Brown corpus sampler

This tutorial offers a first introduction to corpus analysis. It introduces basic techniques for exploring digital corpora by means of computational tools such as AntConc. Please download a version of AntConc that is appropriate for your operating system. There is not much to install: AntConc ships as a single executable file, and all you have to do is copy it to a directory of your choice. Personally, I prefer to keep a tools directory where I install all my linguistics tools so that I can use them from one central location. I tend to install all my tools to

C:\Users\Public\utility

E.g. install AntConc, i.e. copy the AntConcX.X.X.exe file (e.g. AntConc3.5.9.exe) to a directory such as

C:\Users\Public\utility\AntConc

Next, download our Brown corpus sampler from Moodle. Unpack it to a directory of your choice with a tool such as 7-Zip. I tend to keep all of my corpora in one directory with a subdirectory for each corpus, such as

C:\Users\Public\corpora\american_inaugural_speeches_1961_2017
C:\Users\Public\corpora\Brown-corpus-sampler

After you have done that, open AntConc by double-clicking on the .exe file; click File » Open and select a set of files from our corpus on your hard disk (e.g. all files from the news section of the Brown corpus sampler, which are named ca01, ca02 …). Please make sure to avoid white space and special characters in file and directory names, as they tend to cause errors in many processing scenarios. Should you have white space or the like in directory or file names, it usually helps to put the entire path and file name in double quotation marks, but more about that later.
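If you want to check a path for problematic names before loading files, a small Python sketch along the following lines can help. The function name risky_parts and the definition of a "safe" name (only ASCII letters, digits, dot, underscore and hyphen) are assumptions for illustration; adjust the pattern to whatever your own tools tolerate.

```python
import re
from pathlib import PureWindowsPath

# Assumption: a "safe" name uses only ASCII letters, digits, dot,
# underscore and hyphen. Anything else is reported as risky.
SAFE_NAME = re.compile(r"^[A-Za-z0-9._-]+$")


def risky_parts(path):
    """Return the directory and file names in path that contain
    white space or special characters."""
    p = PureWindowsPath(path)
    # Skip the drive/root component (e.g. "C:\\"); it is not a name.
    names = [part for part in p.parts if part != p.anchor]
    return [name for name in names if not SAFE_NAME.match(name)]


# The second path is a made-up example containing a space.
print(risky_parts(r"C:\Users\Public\corpora\Brown-corpus-sampler\ca01"))
print(risky_parts(r"C:\Users\Public\my corpora\ca01"))
```

An empty list means that every directory and file name in the path passed the check.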

The AntConc main window

The AntConc main window is organized into three principal areas plus the menu bar at the top of the window, which contains drop-down menus such as the standard File menu as well as the software-specific Global Settings, Tool Preferences and Help.

The area on the left underneath File is the Corpus Files area, where the files you load into AntConc for analysis are displayed.

After clicking on File, select Open File(s) to select individual files or Open dir … to select an entire directory of files. The default file type opened by AntConc is plain text.
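For comparison, loading a whole directory of plain text files can be sketched in a few lines of Python. This is only an illustration of the idea, not AntConc's own loading routine; the function name load_corpus, the default pattern "*.txt" and the UTF-8 encoding are assumptions, so change them if your files have no extension or use a different encoding.

```python
from pathlib import Path


def load_corpus(directory, pattern="*.txt"):
    """Read every file in directory that matches pattern.

    Returns a dict that maps each file name to the text of that file.
    """
    corpus = {}
    for path in sorted(Path(directory).glob(pattern)):
        # Assumption: files are UTF-8 encoded. Bytes that cannot be
        # decoded are replaced instead of stopping the whole run.
        corpus[path.name] = path.read_text(encoding="utf-8", errors="replace")
    return corpus
```

A call such as load_corpus(r"C:\Users\Public\corpora\Brown-corpus-sampler") would then return one entry per matching file in that directory.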

Concordancing function

To the right of this there is a large area organized into tabs ranging from the main Concordance to Word List and Keyword List; we will explore all of these in due course. Below this area you will find the field where you can enter a search term. There are many options here, and a few presets are already selected to get you started. Once you have loaded your corpus files into the Corpus Files area, you can simply type a search word into the search box and hit Start to display a concordance of the occurrences of the search word in the loaded corpus.
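What a KWIC concordance does can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not AntConc's implementation: the function name kwic, the width parameter (context measured in characters rather than words) and the case-insensitive whole-word matching are choices made for this example.

```python
import re


def kwic(text, keyword, width=30):
    """Return one formatted line per occurrence of keyword in text.

    width is the number of context characters shown on each side.
    """
    # Collapse line breaks and repeated spaces so each hit fits on one line.
    flat = " ".join(text.split())
    # Match the keyword as a whole word, ignoring case.
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    lines = []
    for match in pattern.finditer(flat):
        left = flat[max(0, match.start() - width):match.start()]
        right = flat[match.end():match.end() + width]
        # Right-align the left context so the keywords line up in a column.
        lines.append(f"{left:>{width}}{match.group(0)}{right}")
    return lines


# Made-up sample text, not taken from any corpus.
sample = "The jury said the election was fair. Later the jury praised the city."
for line in kwic(sample, "jury", width=20):
    print(line)
```

Because the left context is padded to a fixed width, the search word starts in the same column on every line, which is what makes concordances easy to scan.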

Word frequency list

Clusters / N-Grams