TreeTagging multiple files in a directory

Note: this batch file was developed in collaboration with Dieter Keßler.

Much of the time, we will want to tag more than a single file. In a real-life corpus-linguistic annotation scenario, we will more likely want to tag a whole directory of files, e.g. a directory of .txt files. The TreeTagger batch files are not designed to do that. After some thought, I have opted to run the TreeTagger annotation batch file from another batch file that takes care of processing a directory of data. I will demonstrate this based on the English TreeTagger batch file called tag-english.bat.
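
For orientation, this is roughly what tagging a single file by hand looks like, and it is the step the wrapper batch file repeats for every file in a directory. It is a minimal sketch: the two file names are hypothetical placeholders, and the TreeTagger path assumes an installation under C:\TreeTagger.

```bat
@echo off
rem Tag one file by hand. Both file names are hypothetical placeholders.
rem Argument 1 is the plain text input file, argument 2 is the output file.
call "C:\TreeTagger\bin\tag-english.bat" "C:\MY-PROJECT\sample.txt" "C:\MY-PROJECT\sample.tagged"
```

Using call matters when this line sits inside another batch file: without it, control would not return to the calling script once tag-english.bat has finished.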

We start by creating a new batch file in a code-aware editor such as VSCode. The batch file should look as follows. Note that you have to adapt the following paths:

INPUT_DIR –> location of the plain text files to be annotated

OUTPUT_DIR –> location to which your output files will be written

as well as

TREETAGGER_BAT, which must point to the batch file for a particular language within your TreeTagger\bin directory.

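This script assumes that your input files are plain text .txt files. If they have a different file name extension, you have to adapt “*.txt” in line 16 (the for loop) to the file name extension of your files. If you want your output files to carry an extension other than .tagged, you will have to change this value in line 19 (the call to the TreeTagger batch file). The echo in line 18 only prints the output file name, so change it there too if you want the message to match.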


@echo off
setlocal

echo starting
rem Set the input directory, the output directory and the TreeTagger batch file
set INPUT_DIR=C:\MY-PROJECT\INPUT-DATA-DIRECTORY
set OUTPUT_DIR=C:\MY-PROJECT\OUTPUT-DIRECTORY
set TREETAGGER_BAT=C:\TreeTagger\bin\tag-english.bat

rem Create the output directory if it does not exist yet
if not exist "%OUTPUT_DIR%" (
    mkdir "%OUTPUT_DIR%"
    echo created output dir
)

for /f "delims=" %%G in ('dir /b /s "%INPUT_DIR%\*.txt"') do (
    echo Found %%G
    echo "%OUTPUT_DIR%\%%~nG.tagged"
    call "%TREETAGGER_BAT%" "%%G" "%OUTPUT_DIR%\%%~nG.tagged"
)

endlocal
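
The output file name in the loop is built with the for-variable modifier %%~nG, which keeps only the base name of the file that was found. The following sketch shows what the most relevant modifiers expand to. The path is a hypothetical example and does not have to exist for the demonstration to run.

```bat
@echo off
rem Hypothetical path, used only to illustrate the for-variable modifiers.
for %%G in ("C:\MY-PROJECT\INPUT-DATA-DIRECTORY\sample text.txt") do (
    rem %%~fG = fully qualified path, without the surrounding quotes
    echo Full path: %%~fG
    rem %%~nG = file name without directory and extension
    echo Base name: %%~nG
    rem %%~xG = extension including the leading dot
    echo Extension: %%~xG
    rem This is how the script above derives the output file name
    echo Output:    %%~nG.tagged
)
```

Because dir /b /s also descends into subdirectories and only the base name is kept, two input files with the same name in different subdirectories would be written to the same output file, with the second overwriting the first. If that can happen in your data, remove /s to process only the top-level directory, or give the files unique names.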