Stanford Core NLP

author: Sabine Bartsch, Technische Universität Darmstadt

1 What are the Stanford Core NLP Tools?

The Stanford Core NLP Tools bundle the principal Stanford NLP tools, such as the Stanford POS Tagger, the Stanford Named Entity Recognizer and the Stanford Parser, in one integrated package together with models for English and a growing number of other languages. The Stanford Core NLP tools automatically generate annotations that are the foundation of many types of linguistic analysis, such as part of speech tagging or dependency parsing, as shown in the example below. These annotations are applied in so-called pipelines that integrate a number of different annotators into one workflow and generate integrated output in specific formats such as human-readable plain text, XML, JSON or the CoNLL data format. More about this below.

The Core NLP tools are the right choice in scenarios in which a number of different types of annotations are to be applied to a set of files with the aim of generating integrated output. They should be used in scenarios in which sufficient computing power, especially memory (RAM), is available. Due to their resource requirements, they are most efficiently employed on larger numbers of files, but more about this later. As most of the annotators integrated into the Core NLP tools also exist as stand-alone tools, users may wonder when to choose one over the other. For example, the Stanford PoS Tagger exists as a stand-alone tool that comes with the necessary pre-processing, such as tokenization and sentence splitting, built in, so users might wonder whether to use the stand-alone version or the Core NLP version of the software.

Stanford Core NLP 4.4.0

This tutorial is for the current version of the Stanford Core NLP tools, 4.4.0. Tutorials for older versions - which may or may not differ - are kept here for reference purposes (see below).

Tutorials for Stanford Core NLP – older versions

2 Requirements and set-up

Requirements

The Stanford Core NLP Tools require a working Java installation. Since many software programs in corpus and computational linguistics require Java, and Java is widely used for development in the field anyway, it is advisable to install a full JDK (Java Development Kit), which can be downloaded as the OpenJDK from multiple sources. The JDK also includes the widely used JRE (Java Runtime Environment), which is a prerequisite for running many different software programs.
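To check whether a suitable Java installation is present on your machine, you can query the installed version from the command line:

java -version

If this prints a version number (Java 8 is reported as 1.8), you are ready to run the Core NLP tools.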

Note that the Stanford CoreNLP Tools standardly require Java 8; later versions should work fine as well. However, if you run into any unforeseen errors, you might want to try adding the following to your command:

--add-modules java.se.ee [this might be optional, check without it first]

An example further down the page shows where to insert this call for the Java JAXB module.

Please be aware that the Core NLP tools require more memory than the stand-alone tools, so be sure your machine has sufficient RAM available. You should be able to allocate at least 3 - 4 GB of RAM to the Core NLP tools, more if you are planning to use all of the annotators simultaneously; note that depending on the size of the models and your data set, even more RAM may be required (6 - 8 GB of available RAM can come in handy).

Download

The Stanford Core NLP Tools can be downloaded from the official website at https://stanfordnlp.github.io/CoreNLP/. Unzip the downloaded archive to a directory of your choice; the commands in this tutorial are run from inside that directory unless stated otherwise.
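After unzipping, the Core NLP directory should contain, among other files, the main jar and the default (English) models jar; for version 4.4.0 the names should look roughly as follows (the exact file names depend on the release you downloaded):

stanford-corenlp-4.4.0.jar
stanford-corenlp-4.4.0-models.jar

The classpath switch -cp "*" used in the commands below picks up all jar files in the current directory, which is why the commands are run from inside the Core NLP directory.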

3 Quick start guide

Java-based software such as the Stanford Core NLP Tools can be called directly from the command line by typing in the relevant command. However, it is usually best to write the command to a batch file and execute that instead of typing directly at the command prompt. This is going to save you time and spare you frustration, because you can more easily edit and reuse commands and also document within the file what a specific command is supposed to do.

Assuming there is a file called input.txt in the directory of the Core NLP tools that you want to process, a simple command that would call three annotators might look like this:

java -cp "*" -Xmx2g edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit,pos -file input.txt -outputFormat xml

Note: If you have upgraded to Java 9 or later, you may need to add an extra parameter --add-modules java.se.ee to the above command in order to run the CoreNLP pipeline:

java --add-modules java.se.ee -cp "*" -Xmx4g edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit,parse -file input.txt -outputFormat xml

The switch -annotators takes three parameters, "tokenize,ssplit,pos", which invoke the pre-processing steps tokenization and sentence splitting followed by the part of speech tagger; the switch -file takes the name of the input file, and the switch -outputFormat takes as a parameter the format of the output to be created, in this case an xml file. You can copy this line into a text file, save it as core-nlp-pos.bat in the Core NLP directory and invoke it from the command line. It will produce an output file called input.txt.xml in the same directory.
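If you want a different output format, only the -outputFormat parameter needs to change; besides xml, the pipeline can write, for example, json, conll or plain text output. A variant of the command above that produces CoNLL output (everything else unchanged):

java -cp "*" -Xmx2g edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit,pos -file input.txt -outputFormat conll

This writes an output file called input.txt.conll instead of input.txt.xml.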

Annotators included in the CoreNLP tools

The Stanford Core NLP Tools include the following annotators, among others, which are invoked via the switch -annotators or in the .properties file. Note that the availability of annotators varies between languages, depending on whether models for a language and an annotation process have been made available either by the Stanford NLP team or by members of the community.

annotator   function
tokenize    The tokenizer subdivides a text into individual tokens, i.e. words, punctuation marks etc.
ssplit      The sentence splitter segments a text into sentences.
pos         The Stanford Part of Speech Tagger assigns word class labels to each token according to a model and annotation scheme.
lemma       The lemmatizer provides the lemma or base form for each token.
ner         The Stanford Named Entity Recognizer identifies tokens that are proper nouns as members of specific classes such as Person(al) name, Organization name etc.
parse       The Stanford Parser analyses and annotates the syntactic structure of each sentence in the text. The Stanford Parser is actually not just one parser, but offers phrase structure parses as well as dependency parses.
dcoref      The Stanford CorefAnnotator implements pronominal and nominal coreference resolution.

Note that there are dependencies between the annotators, i.e. certain annotators require pre-processing by other annotators; e.g. tokenization and sentence splitting are obligatory pre-processing steps for part of speech tagging and parsing. You can look up the dependencies on the Stanford CoreNLP Annotator dependencies page. Instead of listing annotators on the command line, a pipeline can also be configured in a .properties file, as sketched below.
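A minimal sketch of such a configuration, assuming it is saved as my-pipeline.properties (a file name chosen only for this example) in the Core NLP directory:

# contents of my-pipeline.properties
annotators = tokenize,ssplit,pos,lemma
outputFormat = xml

The file is then passed to the pipeline with the -props switch, as in the Chinese example further down the page:

java -cp "*" -Xmx3g edu.stanford.nlp.pipeline.StanfordCoreNLP -props my-pipeline.properties -file input.txt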

Some advice on directory paths

  • avoid white space in path and file names; if you do have white space in path and file names, be sure to enclose them in “double quotes”
  • have a special directory for your tools in an easily accessible location, ideally one that does not require very long paths; on my machine that location is: C:\Users\Public\utility
  • please do not locate your corpora under the directory of the Stanford CoreNLP Tools; this will NOT work! An example of calling the tools from outside their directory is shown below.
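If you do want to run the tools from outside their directory, e.g. from a corpus folder, you can point the classpath at the Core NLP directory instead of using "*" on its own. A sketch, assuming the tools were unzipped to C:\Users\Public\utility\stanford-corenlp-4.4.0 (adjust the path to your own installation):

java -cp "C:\Users\Public\utility\stanford-corenlp-4.4.0\*" -Xmx2g edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit,pos -file input.txt -outputFormat xml

The double quotes around the classpath ensure that the command also works should the path contain white space.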

4 Running the Stanford CoreNLP tools with multiple annotators

:: calls the Core NLP Tools with files from the Stanford Core NLP folder and outputs xml

java -cp "*" -Xmx2g edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit,pos,lemma,ner,parse,dcoref -file input.txt -outputFormat xml
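To process a whole set of files in one run - the scenario in which the Core NLP tools are most efficient - you can pass a list of input files instead of a single file via the -filelist switch. It takes a plain text file with one input file path per line; the name files.txt is only an example:

:: files.txt lists one input file path per line
java -cp "*" -Xmx4g edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit,pos,lemma,ner,parse,dcoref -filelist files.txt -outputFormat xml

The models are loaded once and all listed files are annotated in a single run, which is considerably faster than starting the pipeline separately for each file.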

5 Processing pipelines for different languages

The example pipelines illustrated so far all annotate English data. However, there are models and configurations available for languages other than English [en], i.e. Arabic [ar], Chinese [zh], French [fr], German [de] and Spanish [es]. In this section, some examples are shown for processing different languages.

Annotators for different languages

Annotator              switch
Tokenize / Segment     tokenize
Sentence Split         ssplit
Part of Speech         pos
Lemma                  lemma
Named Entities         ner
Mention Detection      entitymentions
Constituency Parsing   parse
Dependency Parsing     depparse
Sentiment Analysis     sentiment
Coreference            coref
Open IE                openie
Quote Extraction       quote

Which of these annotators are available for Arabic [ar], Chinese [zh], English [en], French [fr], German [de] and Spanish [es] is documented on the Stanford CoreNLP website.

Models for different languages

Model availability mirrors the annotator table above: for each annotator from Tokenize / Segment (tokenize) down to Open IE (openie), models exist for a subset of the six languages; the per-language details are again documented on the Stanford CoreNLP website.

Example: Processing pipeline for Chinese text

In order to process text written in the Chinese language, you can use the following command (also from a batch file):

java -mx3g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLP -props StanfordCoreNLP-chinese.properties -file chinese.txt -outputFormat xml

You will need the properties file StanfordCoreNLP-chinese.properties, which comes with the Chinese models package; make sure the Chinese models are in the same directory as the CoreNLP tools so that they are picked up by the classpath.
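Other languages work analogously via their respective properties files. A sketch for German, assuming the German models package and its properties file StanfordCoreNLP-german.properties are available alongside the CoreNLP tools (the input file name german.txt is only an example):

java -mx3g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLP -props StanfordCoreNLP-german.properties -file german.txt -outputFormat xml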