TreeTagger: tagging Middle High German data

Models for many different languages are available for Helmut Schmid's TreeTagger; many of them were contributed by members of the linguistics community. Among them is a model for tagging Middle High German, trained by Sarah Schulz on the Middle High German Conceptual Database. To tag your data with the TreeTagger, first install the TreeTagger on your system following the instructions in the TreeTagger tutorial on linguisticsweb.org. Then download the model for Middle High German from the TreeTagger website:

Middle High German parameter file trained by Sarah Schulz on the Middle High German Conceptual Database (gzip compressed, UTF-8, paper (in German))

Unpack the gzip-compressed file to the \lib subdirectory of your TreeTagger installation:

..\TreeTagger
    \bin
    \cmd
    \lib

In the \lib directory you should now have the file:

middle-high-german.par

You will have unpacked this file from the gzip archive by right-clicking on the archive in your file Explorer and selecting Extract all … from the context menu.
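If the context menu does not offer an extract option for gzip files, the archive can also be unpacked with a few lines of Python. This is only a minimal sketch: the function name unpack_gzip, the archive name middle-high-german.par.gz and the installation path are assumptions, so adjust them to whatever you actually downloaded and to your own setup.

```python
import gzip
import shutil
from pathlib import Path


def unpack_gzip(archive: Path, target: Path) -> Path:
    """Decompress a single-file .gz archive to target and return target."""
    # Create the destination directory (e.g. ...\TreeTagger\lib) if needed.
    target.parent.mkdir(parents=True, exist_ok=True)
    with gzip.open(archive, "rb") as src, open(target, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return target


if __name__ == "__main__":
    # Assumed locations: change both to match your download and installation.
    archive = Path("middle-high-german.par.gz")
    lib_dir = Path(r"C:\Users\Public\utility\TreeTagger\lib")
    if archive.exists():
        print(unpack_gzip(archive, lib_dir / "middle-high-german.par"))
    else:
        print(f"Archive not found: {archive}")
```

The parameter file is binary, which is why the sketch copies bytes and never decodes the content as text.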

Open the TreeTagger directory in your terminal / shell and navigate to the \bin directory.

Now copy the code below to a batch file in the \bin directory. In the file Explorer, open the TreeTagger\bin directory, create a new plain text file and save it as tag-middle-high-german.bat, then paste the code into that file.

Alternatively, you can download my batch file, copy it to your TreeTagger\bin directory and adjust the paths to your own: TreeTagger batch file for tagging and lemmatizing Middle High German


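@echo off
rem Root of the TreeTagger installation; adjust this to your own path.
rem Do not leave trailing spaces after the value, they would become part of it.
set TAGDIR=C:\Users\Public\utility\TreeTagger

set BIN=%TAGDIR%\bin
set CMD=%TAGDIR%\cmd
set LIB=%TAGDIR%\lib
rem Parameter file followed by the options passed to tree-tagger.
set TAGOPT=%LIB%\middle-high-german.par -token -lemma -sgml -no-unknown

rem Two arguments: tokenize the first file and write the tagged output to the second.
if "%2"=="" goto label1
perl %CMD%\utf8-tokenize.perl -a %LIB%\german-abbreviations "%~1" | %BIN%\tree-tagger %TAGOPT% > "%~2"
goto end

:label1
rem One argument: tokenize the file and print the tagged output to the console.
if "%1"=="" goto label2
perl %CMD%\utf8-tokenize.perl -a %LIB%\german-abbreviations "%~1" | %BIN%\tree-tagger %TAGOPT%
goto end

:label2
rem No arguments: show a short usage message.
echo.
echo Usage: tag-middle-high-german file {file}
echo.

:end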

Please make sure that the directory path in the line

set TAGDIR=C:\Users\Public\utility\TreeTagger

corresponds to the path where your TreeTagger is located. Save the file to the \bin directory and navigate to that directory in your shell.

You are now ready to test run the TreeTagger. The easiest way to do so is to create a text file with a few sentences of Middle High German in the \bin directory. Let's assume the file is called:

MHG-example.txt

Make sure the text contains some diacritics or umlauts so that you can check whether these are processed correctly. You might want to use this snippet to test:

ir habet wol vernomen daz, wi der kunich menelaus besaz troien die rîchen vil gewaldechlîchen, do er sie zefûren wolde dorch pârîses scholde, der im sîn wîb hete genomen.
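If you would rather create the test file from a script, so that its encoding is not left to your editor's defaults, the following is a minimal Python sketch. The function names write_utf8 and is_valid_utf8 are made up for this illustration and are not part of the TreeTagger.

```python
from pathlib import Path

# The test snippet from above, with its circumflexed vowels.
SNIPPET = (
    "ir habet wol vernomen daz, wi der kunich menelaus besaz troien die rîchen "
    "vil gewaldechlîchen, do er sie zefûren wolde dorch pârîses scholde, "
    "der im sîn wîb hete genomen."
)


def write_utf8(path: Path, text: str) -> None:
    """Write text to path as UTF-8 (without a byte order mark)."""
    path.write_text(text + "\n", encoding="utf-8")


def is_valid_utf8(path: Path) -> bool:
    """Return True if the file's bytes decode cleanly as UTF-8."""
    try:
        path.read_bytes().decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


if __name__ == "__main__":
    example = Path("MHG-example.txt")
    write_utf8(example, SNIPPET)
    print(example, "is valid UTF-8:", is_valid_utf8(example))
```

The same is_valid_utf8 check can be pointed at files from your corpus before tagging them. A file saved in a legacy encoding such as Latin-1 will usually fail it as soon as it contains an accented character.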

The text should be stored with UTF-8 file character encoding. Now run the TreeTagger from this batch file by issuing the following command:

tag-middle-high-german.bat MHG-example.txt > MHG-example-output.txt

Then hit ENTER.

If all has gone well, your annotated example text will have been written to a new file named as in the command above, MHG-example-output.txt. Please check whether the umlauts, diacritics etc. have come out correctly. My example text from above should come out like this:

ir	PRON	sie
habet	AUX	haben
wol	ADV	wol
vernomen	VERB	vernemen
daz	PRON	daz
,	PUNCT	,
wi	ADV	wie
der	DET	der
kunich	NOUN	künic
menelaus	NOUN	menelaus
besaz	VERB	besitzen
troien	NOUN	troien
die	DET	die
rîchen	NOUN	rîchen
vil	ADV	vil
gewaldechlîchen	ADJ	gewalteclîch
,	PUNCT	,
do	ADV	do
er	PRON	er
sie	PRON	sie
zefûren	VERB	zervüeren
wolde	AUX	wellen
dorch	ADP	durch
pârîses	NOUN	pârîses
scholde	VERB	soln
,	PUNCT	,
der	PRON	der
im	PRON	er
sîn	DET	sîn
wîb	NOUN	wîp
hete	AUX	haben
genomen	VERB	nemen
.	SPUNCT	.

The output has one token per line and three columns separated by tab characters (tabulators). This format is called TSV, or tab-separated values:

TOKEN	TABULATOR	POS_TAG	TABULATOR	LEMMA
`token`	\t	`pos_tag`	\t	`lemma`
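Because every line has the same three columns, the output is easy to read back into a program. The following Python sketch shows one way to do that. The names TaggedToken and parse_tagged are hypothetical helpers chosen for this example, and the sketch assumes that any line without exactly three tab-separated fields (blank lines, markup lines) can simply be skipped.

```python
from typing import Iterable, Iterator, NamedTuple


class TaggedToken(NamedTuple):
    token: str
    pos: str
    lemma: str


def parse_tagged(lines: Iterable[str]) -> Iterator[TaggedToken]:
    """Yield one TaggedToken per line that has exactly three tab-separated fields."""
    for line in lines:
        fields = line.rstrip("\r\n").split("\t")
        # Skip anything that is not a token/tag/lemma triple.
        if len(fields) == 3:
            yield TaggedToken(*fields)


if __name__ == "__main__":
    # Read the tagger output as UTF-8 and list all tokens tagged as VERB.
    with open("MHG-example-output.txt", encoding="utf-8") as handle:
        for item in parse_tagged(handle):
            if item.pos == "VERB":
                print(item.token, item.lemma)
```

Opening the file with an explicit encoding="utf-8" matters here: otherwise Python falls back to the system default, which on Windows is often not UTF-8.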

If you can confirm that this has all worked out fine on your example file, you are ready to run this on a file from your corpus. From the TreeTagger\bin directory, you can now run the tagging process on a file from your Middle High German corpus anywhere in your file system.

CAUTION: If you are running this on your corpus data, please make absolutely sure that you are writing to an output file with a DIFFERENT NAME, especially when the output file is written to your corpus directory. Give every output file a name that indicates it contains annotated output, something like FILE-tagged.txt. Better still, write all output to a separate directory called \output so that you cannot accidentally overwrite your original files. The TreeTagger or, indeed, any scripts run from the terminal will not warn you before overwriting any files!
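If you start the tagging from a script of your own, the naming rule above can be enforced in code. The following Python sketch derives an output name of the form FILE-tagged.txt in a separate directory and refuses to return a path that already exists. The function name tagged_output_path and the directory names in the usage example are assumptions made for illustration.

```python
from pathlib import Path


def tagged_output_path(source: Path, output_dir: Path) -> Path:
    """Return output_dir/<source stem>-tagged.txt, refusing to reuse an existing file."""
    output_dir.mkdir(parents=True, exist_ok=True)
    target = output_dir / f"{source.stem}-tagged.txt"
    if target.exists():
        raise FileExistsError(f"Refusing to overwrite existing file: {target}")
    return target


if __name__ == "__main__":
    # Hypothetical corpus file and output directory.
    try:
        print(tagged_output_path(Path("corpus") / "MHG-example.txt", Path("output")))
    except FileExistsError as error:
        print(error)
```

The returned path can then be passed as the second argument to tag-middle-high-german.bat, so that the tagged text never lands next to, or on top of, the original file.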
