Differences

linguisticsweb:tutorials:linguistics_tutorials:automaticannotation:stanford_ner [2019/05/15 11:21]
sabinebartsch [2 Installation and Requirements]
linguisticsweb:tutorials:linguistics_tutorials:automaticannotation:stanford_ner [2020/07/07 22:15] (current)
sabinebartsch [author: Sabine Bartsch, Technische Universität Darmstadt]
Line 3: Line 3:
 ==== author: Sabine Bartsch, Technische Universität Darmstadt ====
  
-''tutorial status: under revision''
-
-''builds on an earlier version by Michael Hanl''
+''tutorial status: under revision'' \\
+builds on an earlier version by Michael Hanl
  
 ===== Related tutorials =====
 
 +  * [[linguisticsweb:tutorials:linguistics_tutorials:automaticannotation:stanford_ner_visualisation|Stanford NER visualisation]]
    * [[linguisticsweb:tutorials:linguistics_tutorials:automaticannotation:stanford_ner_tagger_python|Stanford NER from Python]]
  
 ===== 1 What is the Stanford NER and what are named entities? =====
 
-The Stanford Named Entity Recognizer (NER) developed by Stanford University (2002-2006) and licensed under the GNU GPL is an application which locates and classifies named entities in a text into predefined categories. Entities can be understood as elements having a distinct, separate existence. These phrases contain, for instance, the names of persons, organizations, locations, times and quantities. The detection of named entities can be useful for computational analysis of literary texts. Questions concerning the narrative structure can be answered: Which characters are introduced in which part of the text, which places are mentioned? Moreover, named entity recognition is a subtask in the field of information extraction. It is an important component if you are dealing with question answering or co-reference resolution.
+The Stanford Named Entity Recognizer (NER) developed by Stanford University (2002-2015) and licensed under the GNU GPL is an application which locates and classifies named entities in a text into predefined categories. Entities can be understood as elements having a distinct, separate existence and name. These phrases contain, for instance, the names of persons, organizations, locations, times and quantities. The detection of named entities can be useful for computational analysis of literary texts, new texts and research scenarios requiring information on named entities. In literary studies, questions concerning the narrative structure can be answered: Which characters are introduced in which part of the text? Which places are mentioned? Moreover, named entity recognition is a subtask in the field of information extraction. It is an important component if you are dealing with question answering or co-reference resolution.
 
 This manual offers a brief introduction to the Stanford NER explaining its installation and usage.
Line 21: Line 20:
 ===== 2 Installation and Requirements =====
 
-Running the application requires a [[linguisticsweb:tutorials:linguistics_tutorials:basics:environment:java|Java Development Kit (JDK) version 8 or later]]. A free download of the Stanford NER tool is available [[http://nlp.stanford.edu/software/CRF-NER.shtml|here]].
+Running the application requires a [[linguisticsweb:tutorials:linguistics_tutorials:basics:environment:java|Java Development Kit (JDK) version 8 or later]]. A free download of the Stanford NER tool is available [[https://nlp.stanford.edu/software/CRF-NER.html|here]].
 
 Stanford NER can be run under Windows as well as under Unix/Linux and MacOSX.
 
-After downloading the zip archive you unzipp the file to a directory of your choice. The package contains a simple GUI (graphical user interface) providing a simple visualisation of the named entities. You can start the tool from the command line or just by double clicking the file //ner-gui.bat// (Windows) or //ner-gui.sh// (Linux/Unix/MacOSX) for using the GUI. The GUI is suitable for testing and demo purposes, but not for classifying larger files or several files.
+After downloading the zip archive you unzip the file to a directory of your choice. The package contains a simple GUI (graphical user interface) providing a simple visualisation of the named entities. You can start the tool from the command line or just by double clicking the file //ner-gui.bat// (Windows) or //ner-gui.sh// (Linux/Unix/MacOSX) for using the GUI. The GUI is suitable for testing and demo purposes, but not for classifying larger files or several files.
 
-The tool is well documented. Besides the readme file downloaded with the application and offering a brief introduction you can use the [[http://nlp.stanford.edu/software/crf-faq.shtml|list of FAQ's]] and a [[http://nlp.stanford.edu/software/jenny-ner-2007.pdf|presentation]] about named entity recognition and the Stanford NER as resources.
+The tool is well documented. Besides the readme file downloaded with the application which offers a brief introduction you can refer to the [[http://nlp.stanford.edu/software/crf-faq.shtml|list of FAQ's]] and a [[http://nlp.stanford.edu/software/jenny-ner-2007.pdf|presentation]] about named entity recognition and the Stanford NER as resources.
  
  
Line 33: Line 32:
  
  
-==== 3.1 Add text file ====
+==== 3.1 Running the Stanford NER tagger from a batch or shell file ====
 
-Concerning the choice of a text the application offers numerous opportunities. After starting the tool and clicking on the button ''file'' in the menu you can:
+Much like the Stanford PoS Tagger, the NER tagger is best run on a command line. In the simplest case, you run the named entity recognition process with a single command from the terminal; you can save this command to a batch (.bat) file on Windows systems or use the ''ner.bat'' batch file that ships with the Stanford NER zip file.
 
-a) load a text file from your computer
+=== Running the Stanford NER tagger from the ''ner.bat'' batch file that ships with the distribution: ===
 
-b) choose a website URL or
+Open a terminal / shell / command prompt, navigate to the directory of the Stanford NER instance and enter the following command consisting of a call for the default ''ner.bat'' followed by the input file to be NER tagged:
 
-c) type in or copy text material directly
+''ner.bat sample.txt''
 
-To receive a first impression of the application you can use the given text sample by just clicking the button ''RUN NER'' at the bottom of the interface.
+If you extend this command with a redirect to an output file name, the output of the Stanford NER tagger is written to that file in the current directory instead of to the screen:
 
 +''ner.bat sample.txt > sample-ner-output.txt''
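
The redirected output file can be processed further with a short script. The following Python sketch assumes the tagger's default slash-tag output, in which every token is written as ''word/TAG'' and tokens outside any named entity carry the tag ''O''. The function name ''count_entity_tags'' and the sample sentence are illustrative and not part of the Stanford NER distribution; a different output format would need different parsing.

```python
from collections import Counter


def count_entity_tags(tagged_text):
    """Count tagged tokens per entity class in slash-tag output (word/TAG)."""
    counts = Counter()
    for token in tagged_text.split():
        # Split at the last slash so that words containing "/" stay intact.
        word, slash, tag = token.rpartition("/")
        # "O" marks tokens outside any named entity; skip tokens without a tag.
        if slash and tag != "O":
            counts[tag] += 1
    return counts


# Illustrative input; real input would be read from sample-ner-output.txt.
sample = "Then/O a/O man/O from/O Belfast/LOCATION bought/O the/O field/O ./O"
print(count_entity_tags(sample))
```

Note that this counts tagged tokens, not entities: a two-word name such as a first name plus surname contributes two ''PERSON'' tokens.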
 +
 +=== Running the full command from the command line: ===
 +
 +Open the terminal / shell / command prompt, navigate to the Stanford NER directory and enter the following command:
 +
 +''java -mx1000m -cp stanford-ner.jar;lib/* edu.stanford.nlp.ie.crf.CRFClassifier -loadClassifier classifiers\english.all.3class.distsim.crf.ser.gz -textFile sample.txt''
 +
 +This command runs an English 3-class classifier that identifies three classes of named entities: ''​PERSON,​ LOCATION, ORGANIZATION''​.
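
The full command as written uses Windows conventions: '';'' as the classpath separator and ''\'' in the classifier path. On Unix/Linux and MacOSX the classpath separator is '':'' and paths use ''/''. A minimal Python sketch of how the same call could be assembled for either platform is given below; the helper name ''build_ner_command'' and its parameters are hypothetical, and the installation directory is assumed to be supplied by the caller.

```python
import os


def build_ner_command(ner_dir, text_file,
                      classifier="english.all.3class.distsim.crf.ser.gz",
                      memory="1000m"):
    """Assemble the CRFClassifier call from section 3.1 as an argument list."""
    # os.pathsep is ";" on Windows and ":" on Unix-like systems.
    classpath = os.pathsep.join([
        os.path.join(ner_dir, "stanford-ner.jar"),
        os.path.join(ner_dir, "lib", "*"),
    ])
    return [
        "java", "-mx" + memory,
        "-cp", classpath,
        "edu.stanford.nlp.ie.crf.CRFClassifier",
        # os.path.join picks the path separator of the current platform.
        "-loadClassifier", os.path.join(ner_dir, "classifiers", classifier),
        "-textFile", text_file,
    ]


# Only builds and prints the argument list; nothing is executed here.
print(build_ner_command(".", "sample.txt"))
```

Provided Java and the Stanford NER files are present, the returned list could be passed to ''subprocess.run(cmd, capture_output=True, text=True)'' to run the tagger and collect its output.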
  
 ==== 3.2 Choose a classifier ====
 
-For detection of the named entities four different classifiers are offered. Based on statistical and lexical pieces of information they can be selected by the button ''classifiers'' according to your research question. The following bar shows the underlying entities:
+For detection of the named entities different classifiers are offered. Based on statistical and lexical information they can be selected in the GUI by the button ''classifiers'' according to your research question or entered as a parameter in your command or batch file. The following table shows the underlying entities:
  
 | **File and underlying Entities** | **Person** | **Location** | **Organization** | **MISC** | **DSF** |
-| ner-eng-ie.crf-3-all2008.ser.gz | + | + | + | - | - |
+| english.all.3class.distsim.crf.ser.gz | + | + | + | - | - |
-| ner-eng-ie.crf-4-conll.ser.gz | + | + | + | + | - |
+| english.conll.4class.distsim.crf.ser.gz | + | + | + | + | - |
 | ner-eng-ie.crf-3-all2008-distsim.ser.gz | + | + | + | - | + |
 | ner-eng-ie.crf-4-conll-distsim.ser.gz | + | + | + | + | + |
Line 58: Line 66:
 Table 1: Features of the different classifiers
  
-Using model 3 (file //ner-eng-ie.crf-3-all2008.ser.gz//) means searching for the entities: person, location and organization whereas model 4 (file //ner-eng-ie.crf-4-conll.ser.gz//), in addition to the three entities mentioned before, contains the entity MISC. This term - an abbreviation for miscellaneous - refers to entities that don't belong to the previous three groups, for instance, nationalities. Furthermore, these two models are offered containing additional statistical features, the distributional similarity features (DSF) (file //ner-eng-ie.crf-3-all2008-distsim.ser.gz// and //ner-eng-ie.crf-4-conll-distsim.ser.gz//). Although their selection means having results with a better performance their usage also requires more memory.
+Using model 3 (file //ner-eng-ie.crf-3-all2008.ser.gz//) means searching for the entities: ''PERSON'', ''LOCATION'' and ''ORGANIZATION'' whereas model 4 (file //ner-eng-ie.crf-4-conll.ser.gz//), in addition to the three entities mentioned before, contains the entity ''MISC''. This abbreviation (for miscellaneous) refers to entities that do not belong to the previous three groups, for instance, nationalities. Furthermore, these two models are offered containing additional statistical features, the distributional similarity features (DSF) (file //ner-eng-ie.crf-3-all2008-distsim.ser.gz// and //ner-eng-ie.crf-4-conll-distsim.ser.gz//). Although their selection means having results with a better performance their usage also requires more memory.
- +
-For a first attempt and a quick usage you can use the option ''​Load Default-CRF'',​ which means using model 3 with the entities person, location and organization. +
- +
- +
-==== 3.3 Generating Output ==== +
- +
-The entities detected by the classifier are highlighted in different colors shown in the following figure: +
- +
-{{ :​linguisticsweb:​tutorials:​linguistics_tutorials:​automaticannotation:​stanford-ner-output-example.png |}} +
- +
-**Screenshot 1: Stanford NER output example** +
- +
-Unfortunately,​ the results cannot be saved in this colored format. After saving the tagged file (as *.txt file or *.gz file) the results are presented in the following manner similar to xml when opening the file again: "Then a man from <​LOCATION>​Belfast<​LOCATION>​ bought the field [...]"​. +
- +
-The application offers detection and visualization of named entities but no statistics. A possibility is a deeper analysis of the data by a concordance program such as WConcord. Thereby you can search, for instance, for the different entities. This can be useful especially when using larger amount of texts to preserve clarity. The following figure shows the results searching for the entity ''​location''​ in the NER output by applying WConcord. +
- +
-<p align="​center"><​img width="​679"​ alt="​StanfordNER"​ src="​http://​linglit194.linglit.tu-darmstadt.de/​linguisticsweb/​pub/​LinguisticsWeb/​StanfordNER/​Results_of_a_search_for_entity_location_applying_WConcord.jpg"​ height="​199"​ border="​0"​ /></​p>​ +
- +
-<p align="​center">​ __Screenshot 1: Results of a search for entity location applying WConcord__ </p>+
  
 +For a first attempt and a quick usage you can use the option ''​Load Default-CRF'',​ which means using model 3 with the entities person, location and organization as shown in the example under 3.1 above.
  
 ===== 4 Summary =====
 
-Besides its compatibility with other tools provided by Stanford University an advantage of the Stanford NER is the fast generation of results. Furthermore, it is language independent which means numerous possibilities for application. The performance ranges between 60-70% applying the tool on German texts. Using the models with the additional distributional similarity features the performance can be up to 77%. Concerning English texts, the value ranges between 90-93% with or without distributional similarity features (Spanish texts 80%, Dutch 70-80%). Because of the fact that the application is trained on British news texts (2002-2003) the results for other text types such as literary texts or older texts may differ.
+Besides its compatibility with other tools provided by the Stanford NLP group, an advantage of the Stanford NER tagger is the fast generation of results. Furthermore, models exist for different languages and sets of named entities which means that it comes with numerous possibilities for application. The performance ranges between 60-70% applying the tool on German texts. Using the models with the additional distributional similarity features the performance can be up to 77%. Concerning English texts, the value ranges between 90-93% with or without distributional similarity features (Spanish texts 80%, Dutch 70-80%). Because of the fact that the application is trained on British news texts (2002-2003) the results for other text types such as literary texts or older texts may differ.
 
-A feature providing a possibility for manual correction of results is missing which can be seen as a disadvantage. An alternative is the usage of [[WordFreak-OpenNLP][Word Freak]] because the implementation of the Open NLP Tools in Word Freak allows named entity recognition and manual annotations.
+A feature providing a possibility for manual correction of results is missing which can be seen as a disadvantage. An alternative is the usage of [[WordFreak-OpenNLP][Word Freak]] because the implementation of the Open NLP Tools in Word Freak allows named entity recognition and manual annotations. In another tutorial on [[linguisticsweb:tutorials:linguistics_tutorials:automaticannotation:stanford_ner_visualisation|Stanford NER visualisation]] you can learn more about generating a visually pleasing and informative display of NER tagged data.
 
-Furthermore, you have the possibility to train the Stanford NER. This complex task requires a lot of memory and further knowledge. Useful pieces of information can be found in the [[http://nlp.stanford.edu/software/crf-faq.shtml|FAQ]].
+Furthermore, the Stanford NER can be trained on new data. This complex task requires a lot of memory and further knowledge. Useful pieces of information can be found in the [[http://nlp.stanford.edu/software/crf-faq.shtml|FAQ]].
  
 For a deeper analysis combined with other tools for text analysis, there is the Stanford NER UIMA annotator which uses a modified version of Stanford NER. Developed by Florian Laws, it can be downloaded [[http://www.florianlaws.de/software/|here]].
  
-All in all, the Stanford NER offers a fast solution for automatic detection and visualization of named entities, which you can achieve easily.
+All in all, the Stanford NER offers a fast solution for automatic detection and visualization of named entities. Note that the Stanford NER tagging process is also implemented as part of the Stanford CoreNLP Tools.
  
 ===== References =====
  
-%BIBTEX{topic="​LinguisticsReferences"​ select="​keywords : '​StanfordNER'"​ sort="​author"​ rev="​off"​ errors="​off"​}% 
  
 -----