Tutorial status: tested with Core NLP 3.9.2 and below
The Stanford Core NLP Tools subsume the principal Stanford NLP tools, such as the Stanford POS Tagger, the Stanford Named Entity Recognizer and the Stanford Parser, in one integrated package together with models for English and a number of other languages. They are thus a viable choice if you know from the start that you are going to be processing English texts or texts in any of the languages for which models exist. The Stanford Core NLP Tools can automatically generate annotations that are the foundation of many types of linguistic analysis, such as part of speech tagging or dependency parsing, as shown in the examples on this page.
The Stanford Core NLP Tools require a working Java installation. Since many software programs in corpus and computational linguistics require Java, and Java is widely used for development in the field anyway, it is advisable to install a full Java JDK (Java Development Kit), which can be downloaded as the OpenJDK from multiple sources. The JDK also includes the widely used JRE (Java Runtime Environment), which is a prerequisite for the execution of many different software programs.
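To check whether Java is installed and accessible from the command line, you can query its version; the exact version string printed (e.g. openjdk version "1.8.0_252") will depend on your installation:

java -version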
Note that the Stanford CoreNLP Tools officially require Java 8. However, Java 9 and 10 also work fine if you add the following to your java command (the java.se.ee module loaded by this option was removed again in Java 11, so this workaround no longer applies there):
--add-modules java.se.ee
An example further down the page shows where to insert this option, which makes the Java JAXB module (used by CoreNLP for its XML output) available.
Please be aware that the Core NLP tools require more memory than the stand-alone Stanford tools such as the Stanford PoS Tagger, so be sure your machine has sufficient RAM available. You should be able to allocate at least 3 - 4 GB of RAM for the Core NLP tools, more if you are planning to use all of the annotators simultaneously; note that depending on the size of the models and of your data set, even more RAM may be required (6 - 8 GB of available RAM can come in handy).
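How much heap memory Java may use is controlled with the -Xmx switch, which appears in all of the commands below; for instance, -Xmx4g allows the Java virtual machine to allocate up to 4 GB. As a sketch (the value 4g is only an example and should be adapted to your machine and your annotator selection):

java -Xmx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit,pos -file input.txt -outputFormat xml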
Java-based software such as the Stanford Core NLP Tools can be called directly from the command line by typing in the relevant command. However, it is usually best to write the command to a file and execute that file: this saves you time and spares you frustration, because you can more easily edit and reuse commands and also document within the file what a specific command is supposed to do.
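Under Windows, for example, you can save the command into a batch file (extension .bat) and run that file from the command line; a minimal sketch (the file name core-nlp-test.bat is just an example) illustrating the comment convention used in the batch files on this page:

:: core-nlp-test.bat - lines starting with two colons are comments
:: and can document what the command below is supposed to do
java -version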
Assuming there is a file called input.txt that you want to process in the directory of the Core NLP tools, a simple command calling three annotators might look like this:
java -cp "*" -Xmx2g edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit,pos -file input.txt -outputFormat xml
Note: If you have upgraded to Java 9 or 10, you may need to add the extra parameter --add-modules java.se.ee to the above command in order to run the CoreNLP pipeline:

java --add-modules java.se.ee -cp "*" -Xmx2g edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit,pos -file input.txt -outputFormat xml
The switch -annotators takes a comma-separated list of annotators as its parameter; here, "tokenize,ssplit,pos" invokes the pre-processing steps tokenization and sentence splitting followed by the part of speech tagger. The switch -file takes the name of the input file, and the switch -outputFormat takes as its parameter the format of the output to be created, in this case an xml file. You can copy this line into a text file, save it as core-nlp-pos.bat in the Core NLP directory and invoke it from the command line. It will produce an output file called input.txt.xml in the same directory.
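For illustration, the XML output wraps each sentence and each token in its own element; a heavily shortened sketch of what input.txt.xml might contain for the tokenize,ssplit,pos pipeline (the actual file also records character offsets and further metadata) looks roughly like this:

<root>
  <document>
    <sentences>
      <sentence id="1">
        <tokens>
          <token id="1">
            <word>This</word>
            <POS>DT</POS>
          </token>
          ...
        </tokens>
      </sentence>
    </sentences>
  </document>
</root>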
The Stanford Core NLP Tools offer the following annotators, which are invoked via the switch -annotators:
| annotator | function |
| --- | --- |
| tokenize | The tokenizer subdivides a text into individual tokens, i.e. words, punctuation marks etc. |
| ssplit | The sentence splitter segments a text into sentences. |
| pos | The Stanford Part of Speech Tagger assigns word class labels to each token according to a model and annotation scheme. |
| lemma | The lemmatizer provides the lemma or base form for each token. |
| ner | The Stanford Named Entity Recognizer identifies tokens that are proper nouns as members of specific classes such as person names, organization names etc. |
| parse | The Stanford Parser analyses and annotates the syntactic structure of each sentence in the text. The Stanford Parser is actually not just one parser, but offers phrase structure parses as well as dependency parses. |
| dcoref | The Stanford CorefAnnotator implements pronominal and nominal coreference resolution. |
Note that there are dependencies between the annotators, i.e. certain annotators require pre-processing by other annotators; e.g. tokenization and sentence splitting are obligatory pre-processing steps for part of speech tagging and parsing. You can look up these dependencies on the Stanford CoreNLP Annotator dependencies page.
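For example, the lemma annotator presupposes tokenization, sentence splitting and part of speech tagging, so all of them have to appear in the annotator list; a sketch (file names as in the examples above):

java -cp "*" -Xmx3g edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit,pos,lemma -file input.txt -outputFormat xml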
A batch file invoking the full annotation pipeline could then look like this (note the increased memory allocation, in line with the recommendations above):

:: calls the Core NLP Tools with files from the Stanford Core NLP folder and outputs xml
java -cp "*" -Xmx4g edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit,pos,lemma,ner,parse,dcoref -file input.txt -outputFormat xml
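If you want to process several files in one run, the pipeline also accepts a list of input files via the -filelist switch instead of -file; a sketch, where files.txt is an assumed name for a plain text file listing one input file per line:

java -cp "*" -Xmx4g edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit,pos -filelist files.txt -outputFormat xml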
The example pipelines illustrated so far all annotate English data. However, there are models and configurations available for languages other than English, e.g. Arabic, Chinese, German and Spanish. In this section, some examples are shown for processing different languages.
In order to process text written in the Chinese language, you can use the following command (also from a batch file):
java -Xmx3g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLP -props StanfordCoreNLP-chinese.properties -file chinese.txt -outputFormat xml
You will need to have the properties file StanfordCoreNLP-chinese.properties in the same directory as the CoreNLP tools, and the Chinese models need to be on the classpath (with -cp "*", it suffices to place the models jar in the same directory). You can download the properties file from here: Chinese properties file
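Processing German text works analogously with the German models and their properties file; a sketch, assuming that the German models jar and StanfordCoreNLP-german.properties are available in the CoreNLP directory and that german.txt is the name of the input file:

java -Xmx3g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLP -props StanfordCoreNLP-german.properties -file german.txt -outputFormat xml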