Topic Modelling with Mallet

Mallet

Mallet is a Java-based machine learning toolkit that supports a number of typical tasks in the exploration of large sets of unlabeled data. It is widely used in exploratory approaches in linguistics, for example for exploring topics in text corpora.

Installation

Mallet does not need an executable installer. Instead, it is set up by extracting the downloaded Mallet archive.

Download the file from the mallet website: http://mallet.cs.umass.edu/download.php

Two versions of the Mallet download are available: a .tar.gz archive and a .zip archive.

Download the file that is appropriate for your operating system and extract it. Windows users should extract mallet to the root of the main drive C:\ like this:

c:\mallet-2.0.8\

Also set the environment variable MALLET_HOME so that the path to Mallet can be found on your system. In the Control Panel, go to System -> Advanced System Settings -> Environment Variables … and, under System variables, click New … to create a variable named MALLET_HOME. Its value is the path to the Mallet directory: C:\mallet-2.0.8\
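If you want to confirm that the variable is visible to newly started programs, a small script can check it. The following is only a minimal sketch in Python (Python is not part of Mallet and is an assumption here); the function name find_mallet_launcher is made up for this illustration. It relies on the launcher scripts bin\mallet.bat (Windows) and bin/mallet (Unix) that are part of the extracted archive.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def find_mallet_launcher(env: Mapping[str, str] = os.environ) -> Optional[Path]:
    """Return the Mallet launcher script under MALLET_HOME, or None if it is not found."""
    home = env.get("MALLET_HOME")
    if not home:
        return None  # the variable is not set (or is empty)
    bin_dir = Path(home) / "bin"
    # Look for the Windows batch launcher first, then the Unix shell script.
    for name in ("mallet.bat", "mallet"):
        candidate = bin_dir / name
        if candidate.is_file():
            return candidate
    return None


launcher = find_mallet_launcher()
if launcher is None:
    print("MALLET_HOME is not set or does not point to a Mallet directory")
else:
    print(f"Found Mallet launcher: {launcher}")
```

Open a new terminal before running the check, because terminals that were already open do not pick up newly created environment variables.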

Running mallet

Mallet has no graphical user interface, so it is run from the command line in a terminal. To test that Mallet has been properly set up on your machine, run the command below from the command prompt / terminal:

# testing the installation by calling mallet help
# the full path to the launcher lets you run this from any directory
C:\mallet-2.0.8\bin\mallet import-dir --help

If all goes well, you should see a long list of help options printed to your screen and no error messages.

Then it is time to look at the first step of the topic modelling process itself, which is importing and initializing the data. In our case we use the example data that ship with the Mallet installation: UTF-8 encoded plain text files in English (en) and German (de). The English files are located in the directory sample-data\web\en. Let us go through the import command below step by step. Please look at the command and read the text beneath it before running it:

# importing and initializing the sample data
C:\mallet-2.0.8\bin\mallet import-dir --input "C:\mallet-2.0.8\sample-data\web\en" --stoplist-file "C:\mallet-2.0.8\stoplists\en.txt" --output "C:\Users\Public\tm_output\tutorial.mallet" --keep-sequence --remove-stopwords

The mallet command is followed by the command import-dir and a series of parameters. The parameter --input requires the directory where the input data is located. The optional parameter --stoplist-file is followed by the path to a file listing the stopwords to be excluded from the topic modelling (note that example stopword lists also ship with Mallet). The parameter --output names the file to which Mallet writes the imported data; the directory that contains this file has to be created by you beforehand in a location of your choice, and it is best to make it an empty directory so you can easily keep track of the files being written by Mallet. The parameter --keep-sequence keeps the sequence of the data intact. The last parameter, --remove-stopwords, tells the process to remove the stopwords declared in the stopword list. Now the data is ready for the actual training of the topic model.
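Because the import command is long, it can help to assemble it from its parts. The following is a minimal sketch in Python (an assumption, since Mallet itself only needs the command line); the helper names build_import_command and quote are made up for this illustration, while the flags are the ones described above. It only prints the command so that you can paste it into the terminal.

```python
from pathlib import PureWindowsPath
from typing import List, Optional


def build_import_command(mallet_home: str, input_dir: str, output_file: str,
                         stoplist_file: Optional[str] = None) -> List[str]:
    """Assemble the arguments of an import-dir call as a list of strings."""
    launcher = str(PureWindowsPath(mallet_home) / "bin" / "mallet")
    cmd = [launcher, "import-dir",
           "--input", input_dir,
           "--output", output_file,
           "--keep-sequence"]
    if stoplist_file is not None:
        # The stoplist is optional; when given, stopword removal is switched on too.
        cmd += ["--stoplist-file", stoplist_file, "--remove-stopwords"]
    return cmd


def quote(arg: str) -> str:
    """Wrap an argument in double quotes if it contains a space (enough for simple paths)."""
    return f'"{arg}"' if " " in arg else arg


command = build_import_command(
    r"C:\mallet-2.0.8",
    r"C:\mallet-2.0.8\sample-data\web\en",
    r"C:\Users\Public\tm_output\tutorial.mallet",
    stoplist_file=r"C:\mallet-2.0.8\stoplists\en.txt",
)
print(" ".join(quote(part) for part in command))
```

Leaving out stoplist_file produces a command without stopword removal, which mirrors the optional nature of --stoplist-file.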

# training a model on the sample data
# write this command to a batch file for easier parameter tweaking
C:\mallet-2.0.8\bin\mallet train-topics --input C:\Users\Public\tm_output\tutorial.mallet --num-topics 20 --optimize-interval 20 --output-state C:\Users\Public\tm_output\topic-state.gz --output-topic-keys C:\Users\Public\tm_output\tutorial_keys.txt --output-doc-topics C:\Users\Public\tm_output\tutorial_composition.txt
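Tweaking parameters usually means running the training command several times with different values. As one possible way to do this, the sketch below (in Python, which is an assumption and not something Mallet requires) prints one train-topics command per topic count. The function name build_train_command and the output file names with the topic count appended are hypothetical choices, made so that runs do not overwrite each other.

```python
from pathlib import PureWindowsPath
from typing import List


def build_train_command(mallet_home: str, input_file: str, output_dir: str,
                        num_topics: int, optimize_interval: int = 20) -> List[str]:
    """Assemble the arguments of a train-topics call for one topic count."""
    out = PureWindowsPath(output_dir)
    return [
        str(PureWindowsPath(mallet_home) / "bin" / "mallet"), "train-topics",
        "--input", input_file,
        "--num-topics", str(num_topics),
        "--optimize-interval", str(optimize_interval),
        # Hypothetical naming scheme: the topic count is appended to each file name.
        "--output-state", str(out / f"topic-state_{num_topics}.gz"),
        "--output-topic-keys", str(out / f"tutorial_keys_{num_topics}.txt"),
        "--output-doc-topics", str(out / f"tutorial_composition_{num_topics}.txt"),
    ]


for k in (10, 20, 40):
    command = build_train_command(
        r"C:\mallet-2.0.8",
        r"C:\Users\Public\tm_output\tutorial.mallet",
        r"C:\Users\Public\tm_output",
        num_topics=k,
    )
    # The paths used here contain no spaces, so a plain join gives a valid command line.
    print(" ".join(command))
```

The printed lines can be copied into a batch file, one command per line.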

Parameters

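The parameters driving the examples above are the following:

--help: shows help options
--input: specifies the directory where your input files are located (for train-topics, the .mallet file created by import-dir)
--stoplist-file: specifies the stopword list for the respective language to be processed
--num-topics: sets the number of topics you want to generate; in the example above, 20 topics are generated
--optimize-interval: switches on hyperparameter optimization and sets the number of iterations between optimization steps; in the example above, 20
--output-state: file to which Mallet writes the compressed state of the trained model
--output-topic-keys: file to which Mallet writes the output topic keys (the top words of each topic)
--output-doc-topics: file to which Mallet writes the output document topics (the topic composition of each document)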
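Once training has finished, the topic keys file can be inspected in a text editor or read by a script. The sketch below, again in Python as an assumption, reads such a file under the assumption that each line holds three tab-separated fields: the topic number, a weight, and the top words separated by spaces. Please check this against your own output file before relying on it. The names Topic and parse_topic_keys and the two sample lines are made up for illustration and are not real Mallet output.

```python
from typing import List, NamedTuple


class Topic(NamedTuple):
    topic_id: int
    weight: float
    words: List[str]


def parse_topic_keys(text: str) -> List[Topic]:
    """Parse the contents of a topic keys file into a list of Topic records."""
    topics = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip empty lines
        # Assumed layout: topic number <TAB> weight <TAB> space-separated top words
        topic_id, weight, words = line.split("\t", 2)
        topics.append(Topic(int(topic_id), float(weight), words.split()))
    return topics


# Made-up sample lines in the assumed layout; replace with the contents of
# C:\Users\Public\tm_output\tutorial_keys.txt read as UTF-8 text.
sample = "0\t0.25\talpha beta gamma\n1\t0.5\tdelta epsilon zeta\n"
for topic in parse_topic_keys(sample):
    print(topic.topic_id, topic.weight, " ".join(topic.words))
```

To use it on real output, read the file with open(path, encoding="utf-8").read() and pass the resulting string to parse_topic_keys.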