Stanford Named Entity Recognizer

author: Sabine Bartsch, Technische Universität Darmstadt

tutorial status: under revision

builds on an earlier version by Michael Hanl

1 What is the Stanford NER and what are named entities?

This tutorial builds on the basic Stanford NER tutorial. If you would like to learn more about the basic running and options of the Stanford NER tagger, please refer to that tutorial first. In this tutorial, we are going to explore ways of displaying the NER tagged output in a visually pleasing and informative fashion by means of a tool called WordFreak as well as by means of XML / XHTML technologies.

2 Working with NER tagged data

2.1 Basic output format

Basic output from the NER tagging process looks like this:

The/O fate/O of/O Lehman/ORGANIZATION Brothers/ORGANIZATION ,/O the/O beleaguered/O investment/O bank/O ,/O hung/O in/O the/O balance/O on/O Sunday/O as/O Federal/ORGANIZATION Reserve/ORGANIZATION officials/O and/O the/O leaders/O of/O major/O financial/O institutions/O continued/O to/O gather/O in/O emergency/O meetings/O trying/O to/O complete/O a/O plan/O to/O rescue/O the/O stricken/O bank/O ./O Several/O possible/O plans/O emerged/O from/O the/O talks/O ,/O held/O at/O the/O Federal/ORGANIZATION Reserve/ORGANIZATION Bank/ORGANIZATION of/ORGANIZATION New/ORGANIZATION York/ORGANIZATION and/O led/O by/O Timothy/PERSON R./PERSON Geithner/PERSON ,/O the/O president/O of/O the/O New/ORGANIZATION York/ORGANIZATION Fed/ORGANIZATION ,/O and/O Treasury/ORGANIZATION Secretary/O Henry/PERSON M./PERSON Paulson/PERSON Jr./PERSON ./O
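To work with this output programmatically, the token/LABEL pairs can be split apart and adjacent tokens with the same label merged into one entity. The following is a minimal Python sketch, not part of the Stanford NER distribution; the function names `parse_slash_tags` and `group_entities` are made up for illustration, and the sketch assumes that tokens are separated by whitespace and that the label follows the last slash of each item.

```python
from itertools import groupby


def parse_slash_tags(tagged_text, outside="O"):
    """Split whitespace-separated 'token/LABEL' items into (token, label) pairs."""
    pairs = []
    for item in tagged_text.split():
        # rpartition splits on the LAST slash, so a token that itself
        # contains a slash keeps its full text.
        token, sep, label = item.rpartition("/")
        if not sep:
            # No slash at all: treat the item as an unlabelled token.
            token, label = item, outside
        pairs.append((token, label))
    return pairs


def group_entities(pairs, outside="O"):
    """Merge runs of adjacent tokens that share the same non-O label."""
    entities = []
    for label, run in groupby(pairs, key=lambda pair: pair[1]):
        if label != outside:
            entities.append((" ".join(token for token, _ in run), label))
    return entities


# A fragment of the sample output shown above.
sample = (
    "led/O by/O Timothy/PERSON R./PERSON Geithner/PERSON ,/O the/O "
    "president/O of/O the/O New/ORGANIZATION York/ORGANIZATION "
    "Fed/ORGANIZATION ,/O"
)
entities = group_entities(parse_slash_tags(sample))
print(entities)
# [('Timothy R. Geithner', 'PERSON'), ('New York Fed', 'ORGANIZATION')]
```

Note that this simple grouping cannot tell two directly adjacent entities of the same type apart; it merges them into a single entity.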

You can also extract the data in CoNLL format, a tab-separated (TSV), table-like format defined for the shared tasks of the Conference on Computational Natural Language Learning (CoNLL), with one token and its label per line:

John	PERSON
Kerry	PERSON
will	O
fly	O
to	O
Paris	LOCATION
this	O
weekend	O
.	O
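Because each line holds one token and its label, this format is easy to read with standard tools. The following Python sketch uses only the standard library; the function name `read_sentences` is hypothetical, and the sketch assumes that token and label are separated by a tab character and that blank lines, if present, separate sentences, as is common in CoNLL-style files.

```python
import csv
import io


def read_sentences(handle):
    """Read tab-separated token/label lines into a list of sentences.

    Each sentence is a list of (token, label) pairs. Lines with fewer
    than two columns (e.g. blank lines) are treated as sentence breaks.
    """
    sentences, current = [], []
    # QUOTE_NONE keeps quotation-mark tokens as ordinary text.
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) < 2:
            if current:
                sentences.append(current)
                current = []
            continue
        # First column is the token, last column is the NER label.
        current.append((row[0], row[-1]))
    if current:
        sentences.append(current)
    return sentences


# The example from above, written with explicit tab characters.
sample = (
    "John\tPERSON\nKerry\tPERSON\nwill\tO\nfly\tO\nto\tO\n"
    "Paris\tLOCATION\nthis\tO\nweekend\tO\n.\tO\n"
)
sentences = read_sentences(io.StringIO(sample))
locations = [tok for sent in sentences for tok, lab in sent if lab == "LOCATION"]
print(locations)  # ['Paris']
```

For a file on disk, the same function can be called with an opened file object instead of the `io.StringIO` wrapper.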

2.2 Using WordFreak to display and manipulate NER tagged data

The entities detected by the classifier are highlighted in different colors, as shown in the following figure:

Screenshot 1: Stanford NER output example

Unfortunately, the results cannot be saved in this colored format. When you save the tagged file (as a *.txt or *.gz file) and open it again, the entities are instead marked up with inline, XML-like tags: “Then a man from <LOCATION>Belfast</LOCATION> bought the field […]”.

The application detects and visualizes named entities but provides no statistics. For a deeper analysis, the data can be loaded into a concordance program such as WConcord, which lets you search, for instance, for the different entity types. This is especially useful for keeping an overview when working with larger amounts of text. The following figure shows the results of searching for the entity type LOCATION in the NER output with WConcord.
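If no concordance program is at hand, simple statistics can also be computed directly from the saved inline-tagged file. The following Python sketch is a stand-in for that kind of search, not a reimplementation of WConcord; the function names `count_by_type` and `mentions` are hypothetical, the sketch assumes well-formed tags of the form <LOCATION>...</LOCATION>, and the second sentence of the sample text is invented for illustration.

```python
import re
from collections import Counter

# Matches <LABEL>text</LABEL> where LABEL consists of capital letters.
ENTITY = re.compile(r"<([A-Z]+)>(.*?)</\1>")


def count_by_type(text):
    """Count how many entities of each type occur in the tagged text."""
    return Counter(match.group(1) for match in ENTITY.finditer(text))


def mentions(text, label):
    """Return the surface forms of all entities with the given label."""
    return [m.group(2) for m in ENTITY.finditer(text) if m.group(1) == label]


sample = (
    "Then a man from <LOCATION>Belfast</LOCATION> bought the field. "
    # Invented continuation, for illustration only:
    "He sold it to <PERSON>Mary</PERSON> from <LOCATION>Dublin</LOCATION>."
)
counts = count_by_type(sample)
print(counts["LOCATION"], counts["PERSON"])  # 2 1
print(mentions(sample, "LOCATION"))          # ['Belfast', 'Dublin']
```

Counting surface forms instead of types only requires passing the result of `mentions` to `Counter`.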

3 Summary

Besides its compatibility with other tools provided by Stanford University, an advantage of the Stanford NER is that it generates results quickly. Furthermore, it is language independent, which opens up numerous possibilities for application. When the tool is applied to German texts, the performance ranges between 60 and 70%; with models that use the additional distributional similarity features, it can reach up to 77%. For English texts, the value ranges between 90 and 93%, with or without distributional similarity features (Spanish texts: 80%, Dutch: 70-80%). Because the application is trained on British news texts (2002-2003), the results for other text types, such as literary or older texts, may differ.

A feature for manually correcting the results is missing, which can be seen as a disadvantage. An alternative is WordFreak, because its integration of the OpenNLP tools allows both named entity recognition and manual annotation.

Furthermore, you can train the Stanford NER yourself. This complex task requires a lot of memory and additional background knowledge. Useful information can be found in the FAQ.

For a deeper analysis in combination with other text analysis tools, there is the Stanford NER UIMA annotator, developed by Florian Law, which uses a modified version of Stanford NER. You can download it here.

All in all, the Stanford NER offers a fast and easy-to-use solution for the automatic detection and visualization of named entities.

References


1 Compatible means that the output of the Stanford NER can be used as input for further analysis steps such as tagging and parsing, which is useful for building a processing pipeline.