Annotation formats

status: work in progress (2018-07-09)

Plain text, white space delimited tokens

The simplest format is plain text in which tokens are delimited by white space:

We're not laughing at you - we're laughing near you.

Plain text, white space delimited tokens including tokenized punctuation marks

We ' re not laughing at you - we ' re laughing near you .
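The difference between the two plain-text formats above can be sketched in a few lines of Python. This is only an illustration: the regular expression is an assumed, deliberately naive tokenization rule (runs of word characters, or single punctuation characters), not the rule of any particular tool.

```python
import re

text = "We're not laughing at you - we're laughing near you."

# Format 1: tokens are whatever white space separates.
whitespace_tokens = text.split()

# Format 2: punctuation marks become tokens of their own.
# Naive rule (an assumption): a run of word characters, or one
# single character that is neither a word character nor white space.
punct_tokens = re.findall(r"\w+|[^\w\s]", text)

print(whitespace_tokens)
print(" ".join(punct_tokens))
```

With this rule the second line printed reproduces the tokenized example above, including the split of "We're" into three tokens. Real tokenizers usually treat clitics such as 're differently.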

Plain text, one token per line, single column format
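As a minimal sketch, a sentence that is already tokenized can be turned into the one-token-per-line format, and read back, as follows. The variable names are illustrative only.

```python
# A tokenized sentence in the white-space delimited format.
tokens = "We ' re not laughing at you .".split()

# Write: one token per line, single column.
one_per_line = "\n".join(tokens) + "\n"
print(one_per_line, end="")

# Read back: every non-empty line is one token.
read_back = [line for line in one_per_line.splitlines() if line]
```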

Many annotation tools accept these formats as basic input. Check the documentation of the tool you use to see which of them it supports.

To these kinds of formats, annotations of various types are added, pertaining either to individual tokens or to groups of tokens. The following formats are typical output of part-of-speech taggers:

Plain text, part of speech tag attached to token with delimiter

We_PRP '_' re_VB not_RB laughing_VBG at_IN you_PRP ._.

We_PRP '_' re_VB laughing_VBG near_IN you_PRP ._.
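Reading this format back into (token, tag) pairs could look like the sketch below. The function name `parse_tagged` is hypothetical. It splits each item at the last occurrence of the delimiter, on the assumption that tags never contain the delimiter while tokens might.

```python
def parse_tagged(line, delimiter="_"):
    """Split 'token<delimiter>TAG' items into (token, tag) pairs."""
    pairs = []
    for item in line.split():
        # Split at the LAST delimiter, so a token that itself
        # contains the delimiter is kept in one piece.
        token, sep, tag = item.rpartition(delimiter)
        if not sep:
            raise ValueError(f"no delimiter in item: {item!r}")
        pairs.append((token, tag))
    return pairs

pairs = parse_tagged("'_' re_VB laughing_VBG near_IN you_PRP ._.")
for token, tag in pairs:
    print(token, tag)
```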

Plain text, part of speech tag attached to token with delimiter 'slash', the so-called slashTags

We/PRP '/' re/VB not/RB laughing/VBG at/IN you/PRP ./.

We/PRP '/' re/VB laughing/VBG near/IN you/PRP ./.
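Going the other way, a list of (token, tag) pairs can be written out with either delimiter. The helper name `to_tagged` is hypothetical, and the sketch does no escaping, so it assumes that the chosen delimiter is acceptable inside tokens for whatever tool reads the result.

```python
def to_tagged(pairs, delimiter="/"):
    """Join (token, tag) pairs into one white-space delimited line."""
    return " ".join(f"{token}{delimiter}{tag}" for token, tag in pairs)

pairs = [("laughing", "VBG"), ("near", "IN"), ("you", "PRP"), (".", ".")]

slash_line = to_tagged(pairs)            # slashTags
underscore_line = to_tagged(pairs, "_")  # underscore variant
print(slash_line)
print(underscore_line)
```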

Stanford CoreNLP Tools 'text' format

Document: ID=example.txt (2 sentences, 13 tokens)

Sentence #1 (7 tokens):
We're not laughing at you.

Tokens:
[Text=We CharacterOffsetBegin=0 CharacterOffsetEnd=2 PartOfSpeech=PRP]
[Text='re CharacterOffsetBegin=2 CharacterOffsetEnd=5 PartOfSpeech=VBP]
[Text=not CharacterOffsetBegin=6 CharacterOffsetEnd=9 PartOfSpeech=RB]
[Text=laughing CharacterOffsetBegin=10 CharacterOffsetEnd=18 PartOfSpeech=VBG]
[Text=at CharacterOffsetBegin=19 CharacterOffsetEnd=21 PartOfSpeech=IN]
[Text=you CharacterOffsetBegin=22 CharacterOffsetEnd=25 PartOfSpeech=PRP]
[Text=. CharacterOffsetBegin=25 CharacterOffsetEnd=26 PartOfSpeech=.]

Sentence #2 (6 tokens):
We're laughing near you.

Tokens:
[Text=We CharacterOffsetBegin=27 CharacterOffsetEnd=29 PartOfSpeech=PRP]
[Text='re CharacterOffsetBegin=29 CharacterOffsetEnd=32 PartOfSpeech=VBP]
[Text=laughing CharacterOffsetBegin=33 CharacterOffsetEnd=41 PartOfSpeech=VBG]
[Text=near CharacterOffsetBegin=42 CharacterOffsetEnd=46 PartOfSpeech=IN]
[Text=you CharacterOffsetBegin=47 CharacterOffsetEnd=50 PartOfSpeech=PRP]
[Text=. CharacterOffsetBegin=50 CharacterOffsetEnd=51 PartOfSpeech=.]
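The bracketed token lines are simple enough to parse by hand, and the character offsets can be checked against the original text. The sketch below assumes that values contain no spaces and that example.txt holds the two sentences separated by a single character, which is what the offsets above suggest. The function name `parse_token_line` is hypothetical.

```python
def parse_token_line(line):
    """Turn '[Key=value Key=value ...]' into a dict of strings."""
    inner = line.strip()[1:-1]  # drop the surrounding brackets
    fields = {}
    for item in inner.split(" "):
        key, _, value = item.partition("=")
        fields[key] = value
    return fields

# Assumed content of example.txt (not shown in the output itself).
raw = "We're not laughing at you. We're laughing near you."

lines = [
    "[Text='re CharacterOffsetBegin=29 CharacterOffsetEnd=32 PartOfSpeech=VBP]",
    "[Text=. CharacterOffsetBegin=50 CharacterOffsetEnd=51 PartOfSpeech=.]",
]
tokens = [parse_token_line(line) for line in lines]

for tok in tokens:
    begin = int(tok["CharacterOffsetBegin"])
    end = int(tok["CharacterOffsetEnd"])
    # The offsets are a half-open range into the raw text.
    print(tok["Text"], tok["PartOfSpeech"], raw[begin:end] == tok["Text"])
```

The end offset is exclusive, so `raw[begin:end]` yields the token text directly.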

Stanford CoreNLP Tools conll format

Read more about CoNLL-style formats, in particular the related CoNLL-U format, here: https://universaldependencies.org/format.html

annotators: tokenize,ssplit,pos

idx  word      lemma  pos  ner  headidx  deprel
1    We        _      PRP  _    _        _
2    're       _      VBP  _    _        _
3    not       _      RB   _    _        _
4    laughing  _      VBG  _    _        _
5    at        _      IN   _    _        _
6    you       _      PRP  _    _        _
7    .         _      .    _    _        _

1    We        _      PRP  _    _        _
2    're       _      VBP  _    _        _
3    laughing  _      VBG  _    _        _
4    near      _      IN   _    _        _
5    you       _      PRP  _    _        _
6    .         _      .    _    _        _
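A reader for this column format might look like the sketch below. It assumes that the columns are tab-separated in the actual file, that a blank line separates sentences, and that "_" marks a column for which no annotator was run. The names `COLUMNS` and `parse_conll` are hypothetical.

```python
COLUMNS = ["idx", "word", "lemma", "pos", "ner", "headidx", "deprel"]

def parse_conll(text):
    """Return a list of sentences, each a list of per-token dicts."""
    sentences, current = [], []
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        values = line.split("\t")
        # "_" means "no value"; note that this naive rule would
        # also blank out a token that is literally an underscore.
        row = {col: (None if val == "_" else val)
               for col, val in zip(COLUMNS, values)}
        current.append(row)
    if current:
        sentences.append(current)
    return sentences

sample = (
    "1\tWe\t_\tPRP\t_\t_\t_\n"
    "2\t're\t_\tVBP\t_\t_\t_\n"
    "\n"
    "1\tWe\t_\tPRP\t_\t_\t_\n"
)
sentences = parse_conll(sample)
print(len(sentences), sentences[0][1]["word"], sentences[0][1]["pos"])
```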