Decision Tree Tagger

1 What is the TreeTagger?

The TreeTagger is a statistical part-of-speech tagger developed by Helmut Schmid at the Institut für Maschinelle Sprachverarbeitung at the University of Stuttgart. It provides three functions:

  • part-of-speech tagging
  • chunking
  • lemmatization

The TreeTagger is language independent, which means that it can be trained to annotate different languages, provided that suitable training data are available.

URL: http://www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger/

2 Installation

Operating systems: Linux, Sparc, PC. The tagger requires Perl, which can be obtained from Active State or Strawberry Perl (both convenient Windows distributions of Perl) or from perl.org.
Licence model: free, see the licence agreement here: http://www.ims.uni-stuttgart.de/~schmid/Tagger-Licence
Languages: various; parameter files for many different languages are provided on the website, e.g. English, German, Italian, Russian and Spanish.

2.1 Prerequisites

To make full use of the TreeTagger, an interpreter for the programming language Perl must be installed on your computer. Perl is freely available from the following sources:

  • Active State Perl
  • Strawberry Perl

2.2 Download

To install the TreeTagger, download the tagger package from the TreeTagger homepage. Please make sure you are downloading the package that is appropriate for your operating system.

There is a Windows version that works under all recent versions of Microsoft Windows. It can be downloaded as a zip-file from a link towards the end of the TreeTagger homepage.

The software for Linux, Mac OS etc. comes as a tar archive compressed with gzip (GNU zip, the Linux counterpart of zip under Windows).
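On Linux or Mac OS the archive would normally be unpacked with the tar command. For readers who prefer to script this step, the following is a minimal Python sketch based on the standard tarfile module. The function name unpack_tagger_archive is hypothetical and not part of the TreeTagger distribution, and the sketch should only be used on archives obtained from a trusted source.

```python
import tarfile
from pathlib import Path


def unpack_tagger_archive(archive_path, target_dir):
    """Unpack a gzip-compressed tar archive into target_dir.

    Returns the list of member names found in the archive.
    (Hypothetical helper, not part of TreeTagger.)
    """
    target = Path(target_dir)
    # Create the target directory if it does not exist yet.
    target.mkdir(parents=True, exist_ok=True)
    # "r:gz" opens a tar archive that was compressed with gzip.
    with tarfile.open(archive_path, "r:gz") as archive:
        names = archive.getnames()
        archive.extractall(target)
    return names
```

The function would be called with the path of the downloaded archive and the directory in which the TreeTagger should live; both depend on your system.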

2.3 Extract archive and set up

Unpack the archive to a new directory:

Windows 98 / XP / 2000 users: it's easiest to use a directory immediately under c:\, for example c:\TreeTagger; that way all default settings work and you don't have to change the settings file.

Windows Vista users: you may not have permission to create directories immediately under the root (c:\) unless you are working with an account that has administrator privileges (NOT ADVISABLE). A good solution is to create a directory directly under your user directory, e.g. C:\Users\Yourname\TreeTagger, and unpack the archive there. Make sure the path does not contain any white space.

This creates the following directory structure:

../TreeTagger

../TreeTagger/bin

../TreeTagger/cmd

../TreeTagger/lib
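The result of unpacking can be checked with a few lines of Python. The sketch below tests for the three subdirectories listed above and for white space in the installation path; the name check_install_dir and the wording of the messages are illustrative assumptions, not something the TreeTagger provides.

```python
from pathlib import Path

# Subdirectories the unpacked TreeTagger archive is expected to contain.
EXPECTED_SUBDIRS = ("bin", "cmd", "lib")


def check_install_dir(root):
    """Return a list of problems found in a TreeTagger directory.

    An empty list means that no problem was detected.
    (Hypothetical helper, not part of TreeTagger.)
    """
    problems = []
    root = Path(root)
    # The installation path should not contain any white space.
    if any(ch.isspace() for ch in str(root)):
        problems.append("path contains white space")
    # Each expected subdirectory must exist.
    for name in EXPECTED_SUBDIRS:
        if not (root / name).is_dir():
            problems.append(f"missing subdirectory: {name}")
    return problems
```

Calling check_install_dir with your own TreeTagger directory should return an empty list once the archive has been unpacked correctly.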

For all distributions, the parameter files have to be downloaded separately (again, make sure you're downloading the parameter files for your operating system). Parameter files are available for the major European languages. They come compressed as zip files and have to be unpacked to the directory ../TreeTagger/lib.

2.4 System path

As a last step you have to adjust the Path settings on your computer:

1. Right click on “My Computer” –> select “Properties” –> select the “Advanced” tab –> click on “Environment Variables” –> click on “New” and enter PATH as the variable name and C:\TreeTagger\bin\;%PATH% as its value (or wherever your TreeTagger directory is located).

2. If your TreeTagger is located in a directory other than c:\TreeTagger, please enter the correct path to the tagging directory (a variable called TAGDIR) in the third line of the following files:

../TreeTagger/bin/tag-english.bat

../TreeTagger/bin/tag-LANGUAGE.bat

The line should look like this with the correct path specified: set TAGDIR=C:\Users\Public\utility\TreeTagger
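Both steps can also be checked or scripted. The following Python sketch is one possible approach under stated assumptions: is_on_path tests whether a directory appears in a PATH value, and set_tagdir rewrites the line of a batch file that starts with set TAGDIR=. Both function names are hypothetical, and the sketch assumes that the batch files contain such a line, as described above.

```python
import os


def is_on_path(directory, path_value=None, separator=None):
    """Return True if directory is listed in the given PATH value.

    Defaults to the PATH of the current process and the separator of
    the current operating system (";" on Windows).
    """
    if path_value is None:
        path_value = os.environ.get("PATH", "")
    if separator is None:
        separator = os.pathsep

    def normalise(entry):
        # Ignore trailing slashes and letter case, as Windows does.
        return entry.strip().rstrip("\\/").lower()

    wanted = normalise(directory)
    entries = [e for e in path_value.split(separator) if e.strip()]
    return any(normalise(e) == wanted for e in entries)


def set_tagdir(batch_text, tagger_dir):
    """Return batch_text with its 'set TAGDIR=' line pointing to tagger_dir.

    Raises ValueError if the text contains no such line.
    """
    lines = batch_text.splitlines(keepends=True)
    found = False
    for i, line in enumerate(lines):
        if line.strip().lower().startswith("set tagdir="):
            # Keep the original line ending (\r\n or \n).
            ending = line[len(line.rstrip("\r\n")):]
            lines[i] = f"set TAGDIR={tagger_dir}{ending}"
            found = True
    if not found:
        raise ValueError("no 'set TAGDIR=' line found")
    return "".join(lines)
```

To apply set_tagdir, read the contents of a batch file such as tag-english.bat, pass them to the function together with your TreeTagger directory, and write the result back. Keeping a backup copy of the original file is advisable.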

3 Testing the TreeTagger

You are now ready to test the tagger.

3.1 Command line

Open a DOS-box / command line, change to the /bin directory of your TreeTagger installation and type the following:

c:\Users\Yourname\TreeTagger\bin\tag-english.bat README.txt

This command tags the README.txt file located in the same directory and writes the output to the command window. If you want to capture the output in a file, just type:

c:\Users\Yourname\TreeTagger\bin\tag-english.bat README.txt > README.out
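Once the output has been captured in a file, it can be processed further. The sketch below assumes that each output line consists of three tab-separated columns (word form, part-of-speech tag, lemma); this layout is an assumption about the output of the tagging script and should be checked against your own output file. The names TaggedToken and parse_tagger_output are hypothetical.

```python
from typing import NamedTuple


class TaggedToken(NamedTuple):
    """One line of tagger output (assumed layout: token, tag, lemma)."""
    token: str
    pos: str
    lemma: str


def parse_tagger_output(text):
    """Turn tagger output into a list of TaggedToken tuples.

    Lines that do not consist of exactly three tab-separated fields
    (blank lines, markup lines and so on) are skipped.
    """
    tokens = []
    for line in text.splitlines():
        fields = line.split("\t")
        if len(fields) != 3:
            continue
        tokens.append(TaggedToken(*fields))
    return tokens
```

To use it, read the contents of README.out into a string and pass it to parse_tagger_output; the resulting tuples give access to the word form, tag and lemma by name.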

3.2 Graphical user interface

If you prefer a graphical user interface, you might want to check out Ciarán Ó Duibhín's Windows Interface for Stuttgart TreeTagger:

http://www.smo.uhi.ac.uk/~oduibhin/oideasra/interfaces/winttinterface.htm

4 Multilingual tagging

The TreeTagger itself, like other statistical part-of-speech taggers, is language independent: it can be trained to tag different languages. To tag a specific language, language-specific parameter files are required. Many parameter files have since been created by the large community of TreeTagger users.

5 Training the tagger

IN PROGRESS
