Stanford Word Segmenter

The Stanford Word Segmenter is a piece of software that automatically segments text into words. It is designed to handle non-Indo-European languages such as Arabic and Chinese, which pose special segmentation and tokenization challenges.

Download and extraction

Download the Stanford Word Segmenter from the Stanford NLP software website and unpack the zip file to a location of your choice.

The software does not need to be installed as such, but it requires Java version 1.6 or later on your machine, either as a Java JDK (Java Development Kit) or as OpenJDK.
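The Java requirement can be checked before running anything. The following is a minimal Python sketch, not part of the segmenter distribution; the function names parse_java_version and installed_java_version are made up for this illustration. It assumes the java executable is on your PATH and relies on the fact that java -version prints its banner to the error stream.

```python
import re
import subprocess


def parse_java_version(banner):
    """Extract (major, minor) from a 'java -version' banner, or None if absent."""
    match = re.search(r'version "(\d+)(?:\.(\d+))?', banner)
    if match is None:
        return None
    # Newer releases may print only a major number, e.g. "17".
    return int(match.group(1)), int(match.group(2) or 0)


def installed_java_version():
    """Run 'java -version' and parse its banner; None if Java is not found."""
    try:
        result = subprocess.run(
            ["java", "-version"], capture_output=True, text=True
        )
    except FileNotFoundError:
        return None
    # The banner normally goes to stderr; stdout is included as a fallback.
    return parse_java_version(result.stderr + result.stdout)


def meets_requirement(version, minimum=(1, 6)):
    """True if the parsed version is at least the minimum (Java 1.6 here)."""
    return version is not None and version >= minimum
```

Calling meets_requirement(installed_java_version()) should return True on a machine with a suitable Java. Old-style versions such as 1.8 and new-style versions such as 11 both compare correctly against (1, 6) because tuples are compared element by element.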

Running the software

The directory you have just unpacked contains a file segment.bat, which is the batch file for running the software under Windows, and a file segment.sh, which is a shell script for UNIX-like environments. The following tutorial assumes you are working under Windows.

To run the software, open a terminal (preferably PowerShell) and enter the following command:

.\segment.bat ctb .\test.simple.utf8 0

Hit ENTER and see what appears on your screen. It is very likely to be gobbledygook.

Next, write the output to a file so that you can inspect it in a text editor and check a few things:

.\segment.bat ctb .\test.simple.utf8 0 > test-out.txt

Now open this file in a good UTF-8-aware text editor (Notepad++ or Sublime Text will do fine).

Shell configuration issues

Chances are the output will still not look right. There are a few potential reasons for this and thus a few things worth checking. First, check the encoding of the input file test.simple.utf8; it should be UTF-8. Next, check the encoding of your output file; it will most likely be UTF-16, because the > redirection in Windows PowerShell writes UTF-16 by default.
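If your text editor does not show the encoding, it can also be guessed from the first bytes of the file. The following is a minimal Python sketch under simple assumptions: it only recognizes a byte-order mark (BOM) or plain valid UTF-8, and the names sniff_encoding and sniff_file are made up for this illustration. It is not a general-purpose encoding detector.

```python
import codecs

# Byte-order marks, checked longest first so that the UTF-32 LE mark
# (FF FE 00 00) is not mistaken for the UTF-16 LE mark (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_encoding(data):
    """Guess the encoding of raw bytes from a BOM, else try plain UTF-8."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return "unknown"
    return "utf-8"


def sniff_file(path):
    """Read a (small) file completely and guess its encoding."""
    with open(path, "rb") as handle:
        return sniff_encoding(handle.read())
```

For example, sniff_file("test-out.txt") reports utf-16-le for a file that starts with the bytes FF FE, while a BOM-less UTF-8 file is reported as utf-8.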

So why does it still not look right? It appears (this still needs to be verified) that your shell is interfering with the output. The output does not look right in the shell window itself, probably because no Chinese fonts are installed and your fonts cannot display Chinese characters. In addition, the shell might simply not 'know' that you are processing UTF-8, so we need to tell it.

In PowerShell, enter the following:

[Console]::OutputEncoding = [System.Text.Encoding]::UTF8

Now try step X above again.
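If the redirected file is still UTF-16 and you would rather have a UTF-8 copy, the file can be re-encoded afterwards. The following is a minimal Python sketch; the function name reencode and the file name test-out-utf8.txt are made up for this illustration, and the source encoding is an assumption that you should confirm for your own file first.

```python
def reencode(src_path, dst_path, src_encoding="utf-16", dst_encoding="utf-8"):
    """Copy a text file, decoding with one encoding and encoding with another."""
    # newline="" keeps line endings exactly as they are in the source file.
    with open(src_path, "r", encoding=src_encoding, newline="") as src:
        text = src.read()
    with open(dst_path, "w", encoding=dst_encoding, newline="") as dst:
        dst.write(text)
```

A call such as reencode("test-out.txt", "test-out-utf8.txt") leaves the original file untouched and writes the UTF-8 copy next to it. Python's utf-16 codec uses the byte-order mark at the start of the file to determine the byte order.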