Stanford Core NLP

author: Sabine Bartsch, Technische Universität Darmstadt

Tutorial status: tested with Core NLP 3.9.2 and below

1 What are the Stanford Core NLP Tools?

The Stanford Core NLP Tools bundle the principal Stanford NLP tools, such as the Stanford POS Tagger, the Stanford Named Entity Recognizer and the Stanford Parser, in one integrated package together with models for English and a number of other languages. They are thus a viable choice if you know from the start that you are going to be processing English texts or texts in any of the languages for which models exist. The Stanford Core NLP tools can automatically generate annotations that are the foundation of many types of linguistic analysis, such as part of speech tagging or dependency parsing, as shown in the example here.

2 Requirements and set-up

Requirements

The Stanford Core NLP Tools require a working Java installation. Since many programs in corpus and computational linguistics require Java, and Java is widely used for development in the field anyway, it is advisable to install a full Java JDK (Java Development Kit), which can be downloaded from the Oracle Java JDK site. The JDK also includes the JRE (Java Runtime Environment), which is a prerequisite for running many different programs.

Note that the Stanford CoreNLP Tools expect Java 8 by default. On Java 9 and 10 they also run if you add the following option to your command:

--add-modules java.se.ee

This option makes the Java JAXB module available. The java.se.ee module was removed in Java 11, so the option is no longer accepted from that version on. An example further down the page shows where to insert it.
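Whether the extra option is needed depends on the Java version that is installed. The following is a minimal Python sketch, not part of CoreNLP, that reads the major version from the text printed by java -version and decides whether to add the option. The function names are hypothetical, and the sketch assumes that the option is only needed (and only accepted) on Java 9 and 10.

```python
import re
import subprocess


def parse_java_major(version_output):
    """Return the major Java version from the text printed by `java -version`."""
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if match is None:
        raise ValueError("no version string found")
    first = int(match.group(1))
    # Java 8 and earlier report themselves as 1.x, e.g. "1.8.0_201".
    if first == 1 and match.group(2) is not None:
        return int(match.group(2))
    return first


def module_flags(java_major):
    """Extra JVM flags for the given major version."""
    # Assumption: java.se.ee exists only in Java 9 and 10.
    if java_major in (9, 10):
        return ["--add-modules", "java.se.ee"]
    return []


def detect_java_major():
    """Ask the installed java for its version (it prints to stderr)."""
    result = subprocess.run(["java", "-version"], capture_output=True, text=True)
    return parse_java_major(result.stderr)
```

Calling module_flags(detect_java_major()) on a machine with Java installed returns the list of extra arguments to place directly after java in the command.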

Please be aware that the Core NLP tools require more memory than the stand-alone tools, so make sure your machine has sufficient RAM available. You should be able to allocate at least 3 to 4 GB of RAM to the Core NLP tools, and more if you plan to use all of the annotators simultaneously. Depending on the size of the models and of your data set, even more may be required (6 to 8 GB of free RAM can come in handy).

Download

3 Quick start guide

Java-based software such as the Stanford Core NLP Tools can be called directly from the command line by typing in the relevant command. However, it is usually best to write the command to a file and execute that file. This saves time and frustration, because you can more easily edit and reuse commands and also document within the file what a specific command is supposed to do.

Assuming the file you want to process is called input.txt and is located in the directory of the Core NLP tools, a simple command that calls three annotators might look like this:

java -cp "*" -Xmx2g edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit,pos -file input.txt -outputFormat xml

Note: If you are running Java 9 or 10, you need to add the extra parameter --add-modules java.se.ee to the command in order to run the CoreNLP pipeline. The following example also replaces the pos annotator with parse and raises the maximum heap size to 4 GB:

java --add-modules java.se.ee -cp "*" -Xmx4g edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit,parse -file input.txt -outputFormat xml

The switch -annotators takes a comma-separated list of annotators, here tokenize,ssplit,pos, which invokes the pre-processing steps tokenization and sentence splitting followed by the part of speech tagger. The switch -file takes the name of the input file, and the switch -outputFormat takes the format of the output to be created, in this case XML. You can copy this line into a text file, save it as core-nlp-pos.bat in the Core NLP directory and invoke it from the command line. It will produce an output file called input.txt.xml in the same directory.
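As an alternative to a batch file, the same command can be kept in a small script. The following Python sketch assembles the command from its parts so that annotators, heap size and input file are easy to change and to document. The function names and the corenlp_dir parameter are hypothetical; only the java arguments are taken from the command above.

```python
import subprocess


def build_command(annotators, input_file, output_format="xml",
                  heap="2g", jvm_flags=()):
    """Assemble the CoreNLP command line as a list of arguments."""
    return [
        "java", *jvm_flags,
        "-cp", "*",                       # java expands the wildcard itself
        f"-Xmx{heap}",                    # maximum heap size
        "edu.stanford.nlp.pipeline.StanfordCoreNLP",
        "-annotators", ",".join(annotators),
        "-file", input_file,
        "-outputFormat", output_format,
    ]


def run_pipeline(corenlp_dir, annotators, input_file):
    """Run the pipeline from inside the CoreNLP directory."""
    command = build_command(annotators, input_file)
    # cwd makes "*" and the input file resolve relative to corenlp_dir.
    subprocess.run(command, cwd=corenlp_dir, check=True)
```

A call such as run_pipeline("path/to/corenlp", ["tokenize", "ssplit", "pos"], "input.txt") would then correspond to the batch file described above; the directory path is a placeholder.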

Annotators included in the CoreNLP tools

The Stanford Core NLP Tools offer the following annotators that are invoked by the switch -annotators:

annotator | function
tokenize | The tokenizer subdivides a text into individual tokens, i.e. words, punctuation marks etc.
ssplit | The sentence splitter segments a text into sentences.
pos | The Stanford Part of Speech Tagger assigns word class labels to each token according to a model and annotation scheme.
lemma | The lemmatizer provides the lemma or base form for each token.
ner | The Stanford Named Entity Recognizer identifies tokens that are proper nouns as members of specific classes such as Person(al) name, Organization name etc.
parse | The Stanford Parser analyses and annotates the syntactic structure of each sentence in the text. The Stanford Parser is actually not just one parser, but offers phrase structure parses as well as dependency parses.
dcoref | The Stanford CorefAnnotator implements pronominal and nominal coreference resolution.

Note that there are dependencies between the annotators: certain annotators require pre-processing by other annotators. For example, tokenization and sentence splitting are obligatory pre-processing steps for part of speech tagging and parsing. You can look up the dependencies on the Stanford CoreNLP Annotator dependencies page.
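Such dependencies can be checked before a long run is started. The following Python sketch walks through an annotator list and reports prerequisites that do not appear earlier in the list. Only the requirement that tokenize and ssplit precede pos and parse is stated on this page; the remaining entries of the table are assumptions and should be verified against the dependencies page. All names in the sketch are hypothetical.

```python
# Prerequisites per annotator. Entries other than those for pos and parse
# are assumptions; check them against the CoreNLP dependencies page.
REQUIRES = {
    "tokenize": [],
    "ssplit": ["tokenize"],
    "pos": ["tokenize", "ssplit"],
    "lemma": ["tokenize", "ssplit", "pos"],
    "ner": ["tokenize", "ssplit", "pos", "lemma"],
    "parse": ["tokenize", "ssplit"],
    "dcoref": ["tokenize", "ssplit", "pos", "lemma", "ner", "parse"],
}


def missing_prerequisites(annotators):
    """Map each annotator to the prerequisites that do not precede it."""
    seen = set()
    problems = {}
    for name in annotators:
        missing = [req for req in REQUIRES.get(name, []) if req not in seen]
        if missing:
            problems[name] = missing
        seen.add(name)
    return problems


# An empty result means the order satisfies the table above.
print(missing_prerequisites(["tokenize", "ssplit", "pos"]))
# Here ssplit is reported as missing for pos.
print(missing_prerequisites(["tokenize", "pos"]))
```

Annotators that are not in the table are treated as having no prerequisites, so the check is only as complete as the table itself.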

Running the Stanford CoreNLP tools with multiple annotators

:: calls the Core NLP Tools with files from the Stanford Core NLP folder and outputs xml

java -cp "*" -Xmx2g edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit,pos,lemma,ner,parse,dcoref -file input.txt -outputFormat xml
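The XML file written by such a run can be read with any XML library. The following Python sketch collects word, lemma, part of speech and named entity label for every token. It assumes that each token is stored in a token element with the child elements word, lemma, POS and NER; check this against your own output file. The sample string is hand-written for illustration and is not actual CoreNLP output.

```python
import xml.etree.ElementTree as ET


def tokens_from_root(root):
    """Return (word, lemma, POS, NER) for every token element."""
    rows = []
    for token in root.iter("token"):
        rows.append((
            token.findtext("word"),
            token.findtext("lemma"),
            token.findtext("POS"),
            token.findtext("NER"),   # None if the ner annotator was not run
        ))
    return rows


# Hand-written sample following the assumed element names.
SAMPLE = """<root><document><sentences><sentence id="1"><tokens>
<token id="1"><word>Stanford</word><lemma>Stanford</lemma><POS>NNP</POS><NER>ORGANIZATION</NER></token>
<token id="2"><word>rocks</word><lemma>rock</lemma><POS>VBZ</POS><NER>O</NER></token>
</tokens></sentence></sentences></document></root>"""

for row in tokens_from_root(ET.fromstring(SAMPLE)):
    print("\t".join(str(field) for field in row))
```

For a real output file, the root element can be obtained with ET.parse("input.txt.xml").getroot() and passed to the same function.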

Processing pipelines for different languages

The example pipelines illustrated so far all annotate English data. However, there are models and configurations available for languages other than English, e.g. Arabic, Chinese, German and Spanish. In this section, some examples are shown for processing different languages.

Processing pipeline for Chinese text

In order to process text written in the Chinese language, you can use the following command (also from a batch file):

java -mx3g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLP -props StanfordCoreNLP-chinese.properties -file chinese.txt -outputFormat xml

The properties file StanfordCoreNLP-chinese.properties needs to be in the same directory as the CoreNLP tools. You can download the properties file here: Chinese properties file
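Because the run fails if the properties file or the input file is not in place, it can be worth checking for both first. The following Python sketch lists the files that are missing from the CoreNLP directory and assembles the Chinese command shown above. The function names and the directory argument are hypothetical.

```python
from pathlib import Path


def missing_files(corenlp_dir, names):
    """Return the names that are not regular files inside corenlp_dir."""
    base = Path(corenlp_dir)
    return [name for name in names if not (base / name).is_file()]


def chinese_command(input_file="chinese.txt",
                    props="StanfordCoreNLP-chinese.properties"):
    """Assemble the command for the Chinese pipeline as an argument list."""
    return [
        "java", "-mx3g", "-cp", "*",
        "edu.stanford.nlp.pipeline.StanfordCoreNLP",
        "-props", props,
        "-file", input_file,
        "-outputFormat", "xml",
    ]
```

If missing_files(corenlp_dir, ["StanfordCoreNLP-chinese.properties", "chinese.txt"]) returns an empty list, the command from chinese_command() can be run from that directory.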