Programming for linguistics

There is a vast number of software programs available today that support many of the tasks researchers in the humanities need to get done. However, a piece of software, no matter how experienced its programmer or how configurable it may be, is necessarily limited to the functions its developer has envisioned. In many cases this is perfectly sufficient, and there are plenty of experienced programmers out there who have a good idea of what the typical humanities scholar might want to do. As you advance as a scholar, however, you will come up with new ideas and new data sets that require processing steps nobody has thought of before, or at least not in the configuration you need. It is at this point that you should start thinking about using highly configurable processing frameworks such as GATE or Apache UIMA and DKPro, a collection of natural language processing components built on top of it. Or, in many cases, you should start thinking about learning to write your own programs, both to free yourself from the limitations imposed by ready-made software and to implement whole workflows in one environment instead of jumping back and forth between different pieces of software.

A number of programming languages have become popular within the humanities in recent years. A good number of people would probably say that the best and most efficient way to go about this is to learn Java straight away, in order to have a full-blown, object-oriented programming language available for software development. And this is actually what some of us end up doing in the long run. However, Java has a considerable learning curve, not least because of its development overhead, which can make it quite unwieldy for the beginner and initially unrewarding for the linguist looking to quickly and efficiently develop research prototypes for handling linguistic data. Many people therefore prefer to initially or additionally learn one of the so-called scripting languages, which have less development overhead and allow you to write code that does not have to be compiled.

Perl

… used to be, and probably still is, many a linguist's first choice when looking for a programming language. This may be attributed to its strength in manipulating string data and to its extremely powerful and versatile regular expression handling. If you are interested in learning more about Perl, you may consult the following sources:

Schwartz, Randal L. et al. 2011. Learning Perl. 6th edition. O'Reilly. (Among the initiated, it is known as “the Llama book”.)

Python

Python has become the programming language of choice for many linguists in recent years, partly because it is a modern, object-oriented programming language with a very readable coding style, but - especially for linguists and all kinds of people working with linguistic data - also because of the availability of Bird et al.'s Natural Language Toolkit (NLTK), a set of Python modules that aids the development of code for linguistic applications.

From a linguist's perspective, NLTK is a great starting point for learning all kinds of linguistically relevant programming tasks from basic corpus searches and corpus pre-processing via annotation to complex tasks such as data visualization and sentiment analysis. For learners at different levels of expertise, NLTK is also accompanied by an excellent O'Reilly book entitled “Natural Language Processing with Python. Analyzing Text with the Natural Language Toolkit” (2009) by Steven Bird, Ewan Klein, and Edward Loper.
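
To give a flavour of what such tasks look like in practice, here is a minimal sketch of a first corpus exploration with NLTK. It is only a sketch: it assumes that NLTK is installed and that the Gutenberg sample corpus has been downloaded via nltk.download('gutenberg'); the text file and the search word are chosen purely for illustration.

    import nltk
    from nltk.corpus import gutenberg

    # The Gutenberg sample corpus ships with NLTK's data packages.
    words = gutenberg.words('austen-emma.txt')

    # Frequency distribution over lower-cased alphabetic tokens.
    freq = nltk.FreqDist(w.lower() for w in words if w.isalpha())
    print(freq.most_common(10))

    # Concordance lines showing a word in its contexts.
    text = nltk.Text(words)
    text.concordance('friendship', lines=5)

Even these few lines already cover corpus loading, token-level access, frequency counting and a simple concordance search - the kind of routine tasks the book walks through in much more detail.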

Python and NLTK distributions

R

The language R is a bit of a hybrid animal in many ways: while being a functional scripting language, it is at the same time an extremely powerful statistics package. This hybrid nature makes R ideally suited to the kinds of tasks linguists are often confronted with, i.e. manipulating string data by means of regular expressions and, ever since quantitative approaches have made a name for themselves in linguistics, extracting, modelling and analysing quantitative linguistic data such as frequencies, frequency distributions, co-occurrence statistics, clustering, classification and the like. R does not replace Python in all contexts, nor does Python cover all the functionality offered by R, especially not its statistics functionality, so the two complement rather than replace each other. Together, however, they offer a good combination of functionalities. Python in combination with NLTK and the like is great for building processing and annotation scenarios and for extracting string and quantitative data. R is likewise great at processing string and quantitative data, but not so good at integrating annotation processes; its great assets are its statistics capabilities, and it also has excellent graphics capabilities for data visualization.
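
As an illustration of this division of labour, the following sketch (again only a sketch, assuming NLTK and its Brown corpus data are available; the output file name is arbitrary) uses Python to extract per-genre token and type counts and writes them to a CSV file, which could then be loaded into R with read.csv() for statistical analysis and plotting.

    import csv
    from nltk.corpus import brown

    # Count tokens and types per genre in the Brown corpus.
    rows = []
    for genre in brown.categories():
        words = [w.lower() for w in brown.words(categories=genre) if w.isalpha()]
        rows.append({'genre': genre, 'tokens': len(words), 'types': len(set(words))})

    # Write the counts to a CSV file that R can read with read.csv().
    with open('brown_genre_counts.csv', 'w', newline='') as out:
        writer = csv.DictWriter(out, fieldnames=['genre', 'tokens', 'types'])
        writer.writeheader()
        writer.writerows(rows)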

There are many ways to get started with R. A good first introduction that quickly and interactively covers the basics of the R programming language is Try R at Code School. It gives a concise introduction to R syntax, data types and some statistics basics in seven well-organized chapters under the following headings: R Syntax, Vectors, Matrices, Summary Statistics, Factors, Data Frames, and Working With Real-World Data.

A great starting point for linguists is Harald Baayen's (2008) “Analyzing Linguistic Data. A Practical Introduction to Statistics Using R”, published by Cambridge University Press, which builds on linguistic examples and data.