TreeTagger: tagging Middle High German data

Models for many languages are available for Helmut Schmid's TreeTagger, and many of them were contributed by members of the linguistics community. Among them is a model for tagging Middle High German, trained by Sarah Schulz on the Middle High German Conceptual Database. To tag your data with the TreeTagger, first install it on your system following the instructions in the TreeTagger tutorial on linguisticsweb.org. Then download the model for Middle High German from the TreeTagger website:

Middle High German parameter file trained by Sarah Schulz on the Middle High German Conceptual Database (gzip compressed, UTF-8, paper (in German))

Unpack the gzip-compressed file to the \lib subdirectory of your TreeTagger installation:

..\TreeTagger
\bin
\cmd
\lib

In the \lib directory you should now have the file:

middle-high-german.par, which you can unpack from the gzip archive by right-clicking the file in your File Explorer and selecting Extract all … from the context menu.
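If the context menu route does not work on your system, the archive can also be unpacked with a few lines of Python. This is only a minimal sketch: the function name unpack_parameter_file is made up for illustration, and the archive name middle-high-german.par.gz used in the usage note is an assumption, so use whatever name your download actually has.

```python
import gzip
import shutil
from pathlib import Path


def unpack_parameter_file(gz_path, lib_dir):
    """Decompress a .gz file into lib_dir and return the path of the result."""
    gz_path = Path(gz_path)
    if gz_path.suffix != ".gz":
        raise ValueError(f"expected a .gz file, got: {gz_path.name}")
    # Dropping the final ".gz" gives the target name,
    # e.g. middle-high-german.par.gz -> middle-high-german.par
    target = Path(lib_dir) / gz_path.stem
    with gzip.open(gz_path, "rb") as src, open(target, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return target
```

A call such as unpack_parameter_file("middle-high-german.par.gz", r"C:\Users\Public\utility\TreeTagger\lib") would then place the parameter file in the \lib directory, assuming the download sits in your current directory.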

Open the TreeTagger directory in your terminal / shell and navigate to the \bin directory.

Now copy the code below into a batch file in the \bin directory: in the File Explorer, open the TreeTagger\bin directory, create a new plain text file, and save it as tag-middle-high-german.bat.

Alternatively, you can download my batch file, copy it to your TreeTagger\bin directory, and adjust the paths to match your own system: TreeTagger batch file for tagging and lemmatizing Middle High German


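@echo off
rem Adjust TAGDIR to the directory where your TreeTagger is installed.
rem Make sure there are no trailing spaces after the values below.
set TAGDIR=C:\Users\Public\utility\TreeTagger

set BIN=%TAGDIR%\bin
set CMD=%TAGDIR%\cmd
set LIB=%TAGDIR%\lib
set TAGOPT=%LIB%\middle-high-german.par -token -lemma -sgml -no-unknown

rem Two arguments: tag the first file and write the result to the second.
if "%~2"=="" goto label1
perl %CMD%\utf8-tokenize.perl -a %LIB%\german-abbreviations "%~1" | %BIN%\tree-tagger %TAGOPT% > "%~2"
goto end

:label1
rem One argument: tag the file and print the result to the console.
if "%~1"=="" goto label2
perl %CMD%\utf8-tokenize.perl -a %LIB%\german-abbreviations "%~1" | %BIN%\tree-tagger %TAGOPT%
goto end

:label2
rem No arguments: print a short usage message.
echo.
echo Usage: tag-middle-high-german file {file}
echo.

:end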

Please make sure that the directory path in the line

set TAGDIR=C:\Users\Public\utility\TreeTagger corresponds to the path where your TreeTagger is located. Save the file to the \bin directory and navigate to that directory in your shell.
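Before the first run, it can save some head-scratching to check that the directory you entered as TAGDIR really has the expected layout. The following Python sketch lists whatever is missing. The function name missing_treetagger_parts is hypothetical, and the check only covers the three subdirectories and the parameter file mentioned in this tutorial.

```python
from pathlib import Path


def missing_treetagger_parts(tagdir, par_name="middle-high-german.par"):
    """Return the expected paths under tagdir that do not exist."""
    tagdir = Path(tagdir)
    expected = [
        tagdir / "bin",
        tagdir / "cmd",
        tagdir / "lib",
        tagdir / "lib" / par_name,  # the unpacked Middle High German model
    ]
    return [str(path) for path in expected if not path.exists()]
```

An empty list from missing_treetagger_parts(r"C:\Users\Public\utility\TreeTagger") means that all four paths were found. Otherwise the list tells you which directory or file to look for.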

You are now ready to test run the TreeTagger. To do so, it is easiest to create a text file with a few sentences of Middle High German in the \bin directory. Let's assume the file is called:

MHG-example.txt

Make sure the text contains some diacritics or characters with umlauts so that you can check that these are processed correctly. You might want to use this snippet to test:

ir habet wol vernomen daz,
wi der kunich menelaus besaz
troien die rîchen
vil gewaldechlîchen,
do er sie zefûren wolde
dorch pârîses scholde,
der im sîn wîb hete genomen.
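If you would rather not depend on your editor's encoding settings, the snippet above can also be written to disk from a short script. This is a minimal Python sketch. The function names write_sample and is_valid_utf8 are made up for illustration, and the second function only checks that a file decodes as UTF-8, not that its content is correct.

```python
from pathlib import Path

SAMPLE = """ir habet wol vernomen daz,
wi der kunich menelaus besaz
troien die rîchen
vil gewaldechlîchen,
do er sie zefûren wolde
dorch pârîses scholde,
der im sîn wîb hete genomen.
"""


def write_sample(path="MHG-example.txt"):
    """Write the test snippet to path, explicitly encoded as UTF-8."""
    path = Path(path)
    path.write_text(SAMPLE, encoding="utf-8")
    return path


def is_valid_utf8(path):
    """Return True if the file's bytes decode cleanly as UTF-8."""
    try:
        Path(path).read_bytes().decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True
```

The second function can also be pointed at files from your own corpus before tagging them.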

The text file should be saved with UTF-8 character encoding. Now run the TreeTagger via the batch file by issuing the following command:

tag-middle-high-german.bat MHG-example.txt > MHG-example-output.txt and hit ENTER.

If all has gone well, your annotated example text will have been written to a new file named MHG-example-output.txt, as given in the command above. Please check whether the umlauts, diacritics etc. have come out correctly. My example text from above should come out like this:

ir PRON sie
habet AUX haben
wol ADV wol
vernomen VERB vernemen
daz PRON daz
, PUNCT ,
wi ADV wie
der DET der
kunich NOUN künic
menelaus NOUN menelaus
besaz VERB besitzen
troien NOUN troien
die DET die
rîchen NOUN rîchen
vil ADV vil
gewaldechlîchen ADJ gewalteclîch
, PUNCT ,
do ADV do
er PRON er
sie PRON sie
zefûren VERB zervüeren
wolde AUX wellen
dorch ADP durch
pârîses NOUN pârîses
scholde VERB soln
, PUNCT ,
der PRON der
im PRON er
sîn DET sîn
wîb NOUN wîp
hete AUX haben
genomen VERB nemen
. SPUNCT .

The output contains three columns separated by tab characters, a format known as TSV (tab-separated values):

TOKEN TABULATOR POS_TAG TABULATOR LEMMA
token \t pos_tag \t lemma
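To work with the three columns programmatically, the file can be read line by line and split on the tab character. The following Python sketch assumes the output was written as UTF-8. The function name read_tagged is hypothetical, and lines that do not have exactly three fields (for example SGML tags passed through because of the -sgml option) are simply skipped.

```python
def read_tagged(path):
    """Read TreeTagger output and return a list of (token, pos_tag, lemma) tuples."""
    rows = []
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            fields = line.rstrip("\r\n").split("\t")
            # Keep only regular token lines with all three columns present.
            if len(fields) == 3:
                rows.append(tuple(fields))
    return rows
```

Calling read_tagged("MHG-example-output.txt") on the example output would give one tuple per token, which makes it easy to check by eye whether characters such as î and û survived the round trip.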

If you can confirm that this has all worked on your example file, you are ready to run the tagger on your corpus. From the TreeTagger\bin directory, you can now tag a file from your Middle High German corpus located anywhere in your file system.

CAUTION: If you are running this on your corpus data, please make absolutely sure that you are writing to an output file with a DIFFERENT NAME, especially when the output file is written to your corpus directory. It is best to give every output file a name that indicates it contains annotated output, something like FILE-tagged.txt. Better still, write all output to a separate directory called \output so that you cannot accidentally overwrite your original files. The TreeTagger or, indeed, any script run from the terminal will not warn you before overwriting files!
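If you drive the batch file from a script, the naming advice above can be enforced rather than remembered. The following Python sketch derives an output name of the form FILE-tagged.txt in a separate output directory and refuses to reuse a name that already exists. The function name tagged_output_path is hypothetical, and the directory names are up to you.

```python
from pathlib import Path


def tagged_output_path(input_path, output_dir):
    """Return a fresh output path like <output_dir>/<name>-tagged.txt."""
    input_path = Path(input_path)
    output_dir = Path(output_dir)
    # Create the output directory if it does not exist yet.
    output_dir.mkdir(parents=True, exist_ok=True)
    target = output_dir / f"{input_path.stem}-tagged.txt"
    if target.exists():
        # Never silently overwrite earlier results.
        raise FileExistsError(f"refusing to overwrite {target}")
    return target
```

The returned path can then be passed to the batch file as its second argument, so the original corpus file is never the target of a write.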