Differences

This shows you the differences between two versions of the page.

linguisticsweb:resources:corpora [2019/07/13 09:34]
linguisticsweb:resources:corpora [2023/04/06 12:36]
sabinebartsch [Tag sets]
Line 1: Line 1:
 +====== Corpora and other language resources ======
 +
 +===== Tag sets =====
 +
 +==== Penn TreeBank tag set ====
 +
 +Reference:
 +Mitchell P. Marcus, Mary Ann Marcinkiewicz, and Beatrice Santorini. 1993. Building a large annotated corpus of English: The Penn Treebank. Comput. Linguist. 19, 2 (June 1993), 313–330. [[https://dl.acm.org/doi/10.5555/972470.972475|Ref]]
 +
 +^pos tag^description^example^
 +|CC|coordinating conjunction|and, or|
 +|CD|cardinal number|3, three|
 +|DT|determiner|the, this|
 +|EX|existential there|there is|
 +|FW|foreign word|tabula|
 +|IN|preposition, subordinating conjunction|in, of, like|
 +|IN/that|that as subordinator|that|
 +|JJ|adjective|blue, happy|
 +|JJR|adjective, comparative|bluer, happier|
 +|JJS|adjective, superlative|bluest, happiest|
 +|LS|list marker|1)|
 +|MD|modal|could, will|
 +|NN|noun, singular or mass|house|
 +|NNS|noun plural|houses|
 +|NP|proper noun, singular|Carrie|
 +|NPS|proper noun, plural|Americans|
 +|PDT|predeterminer|both as in "both the girls"|
 +|POS|possessive ending|person’s|
 +|PP|personal pronoun|I, she, it|
 +|PPZ|possessive pronoun|my, his, your|
 +|RB|adverb|however, usually, naturally, here, good|
 +|RBR|adverb, comparative|better|
 +|RBS|adverb, superlative|best|
 +|RP|particle|up as in "give up"|
 +|SENT|Sentence-break punctuation|. ! ?|
 +|SYM|Symbol|/ [ = *|
 +|TO|infinitive ‘to’|to play|
 +|UH|interjection|aha|
 +|VB|verb be, base form|be|
 +|VBD|verb be, past tense|was, were|
 +|VBG|verb be, gerund/present participle|being|
 +|VBN|verb be, past participle|been|
 +|VBP|verb be, sing. present, non-3rd person|am, are|
 +|VBZ|verb be, 3rd person sing. present|is|
 +|VH|verb have, base form|have|
 +|VHD|verb have, past tense|had|
 +|VHG|verb have, gerund/present participle|having|
 +|VHN|verb have, past participle|had|
 +|VHP|verb have, sing. present, non-3rd person|have|
 +|VHZ|verb have, 3rd person sing. present|has|
 +|VV|verb, base form|take|
 +|VVD|verb, past tense|took|
 +|VVG|verb, gerund/present participle|taking|
 +|VVN|verb, past participle|taken|
 +|VVP|verb, sing. present, non-3rd person|take|
 +|VVZ|verb, 3rd person sing. present|takes|
 +|WDT|wh-determiner|which, that|
 +|WP|wh-pronoun|who, what|
 +|WP$|possessive wh-pronoun|whose|
 +|WRB|wh-adverb|where, when|
 +|#|#|#|
 +|$|$|$|
 +|“|Quotation marks|‘ “|
 +|``|Opening quotation marks|‘ “|
 +|(|Opening brackets|( {|
 +|)|Closing brackets|) }|
 +|,|Comma|,|
 +|:|Punctuation|– ; : — …|
 +
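The verb rows of the table follow a regular pattern: a two-letter prefix gives the verb class (VB for //be//, VH for //have//, VV for all other verbs) and an optional one-letter suffix gives the form. As a minimal sketch of that pattern, the following Python snippet decodes a verb tag into its two parts. The names ''VERB_CLASS'', ''VERB_FORM'' and ''describe_verb_tag'' are hypothetical and chosen only for illustration; they are not part of any tagger's API.

```python
import re

# Verb classes and forms as listed in the tag table above.
VERB_CLASS = {"VB": "be", "VH": "have", "VV": "other verb"}
VERB_FORM = {
    "": "base form",
    "D": "past tense",
    "G": "gerund/present participle",
    "N": "past participle",
    "P": "sing. present, non-3rd person",
    "Z": "3rd person sing. present",
}


def describe_verb_tag(tag):
    """Return (verb class, form) for a verb tag, or None for any other tag."""
    match = re.fullmatch(r"(VB|VH|VV)([DGNPZ]?)", tag)
    if match is None:
        return None
    prefix, suffix = match.groups()
    return VERB_CLASS[prefix], VERB_FORM[suffix]


print(describe_verb_tag("VHZ"))  # ('have', '3rd person sing. present')
print(describe_verb_tag("VV"))   # ('other verb', 'base form')
print(describe_verb_tag("NN"))   # None
```

A lookup like this can be handy when counting forms of //be// and //have// separately from lexical verbs in tagged output.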
 +===== Corpora =====
 +
 +^corpus title^size^time^source^language^
 +|British National Corpus (BNC)|100 million tokens|mid 1970s - early 1990s|Oxford|British English|
 +|The Brown Corpus|1 million tokens|1961|ICAME|American English|
 +|The Lancaster/Oslo-Bergen Corpus (LOB)|1 million tokens|1961|ICAME|British English|
 +|[[http://ice-corpora.net/ice/index.html|International Corpus of English (ICE)]]|xxxxxx|varieties of world Englishes|[[https://www.ice-corpora.uzh.ch/en.html|International Corpus of English (ICE) at Zuerich, CH]]|world English|
 +|Mark Davies' English Corpora|xxxxxx|diverse set of corpora|Mark Davies|American English, British English, international English|
 +|Text corpora in the DWDS|  various |various| https://www.dwds.de/ |German|
 +|DWDS Kernkorpus|  |1900-1999  |Berlin-Brandenburgische Akademie der Wissenschaften: https://www.dwds.de/d/korpora/kern|German|
 +|DWDS Kernkorpus 21|  |2000-2010  |Berlin-Brandenburgische Akademie der Wissenschaften: https://www.dwds.de/d/korpora/korpus21|German|
 +|Hamburg Dependency Treebank|  |German news site heise.de, articles published between 1996 and 2001|http://hdl.handle.net/11022/0000-0000-7FC7-2|German|
 +|IDS-Corpora|  |  |http://www.ids-mannheim.de/kt/corpora.html|German|
 +|LIMAS-Korpus|1 million words, 500 texts / fragments|1970s|http://www.korpora.org/Limas/|German|
 +|Arabic News Texts Corpus (AntCorpus)| | | https://antcorpus.github.io/|Arabic|
 +|Wortschatz Leipzig|various sample sizes| |https://wortschatz.uni-leipzig.de/de/download|various (Arabic, English, French, German, Russian, and others)|
 +|SpråkbankenText| | |https://spraakbanken.gu.se/en/resources|Swedish|
 +