AntConc: PoS tagged corpora

The concordancing software AntConc can be used on any plain text corpus. This includes corpora with annotations in standard structured formats such as:

  • in-line PoS tagged data with delimiters:

The_DT old_JJ house_NN ._.

  • XML annotated data:

<div level="1"><head> <s n="2"><mw c5="AV0"><w c5="PRP" hw="at" pos="PREP">At </w><w c5="ORD" hw="last" pos="ADJ">last </w></mw><w c5="PNP" hw="it" pos="PRON">it</w><w c5="VBZ" hw="be" pos="VERB">'s </w><w c5="AV0" hw="here" pos="ADV">here</w> (source: <bncDoc xml:id="C8K">)

in-line PoS tags with delimiters

The_DT old_JJ house_NN ._.

To display and query this kind of data, you need to be familiar with its structure, so let's take a look at some data tagged with the Stanford PoS Tagger. The Stanford PoS Tagger can automatically tag any plain text corpus in a variety of languages, including English, German, French and Spanish, with part-of-speech tags in a format that looks like this:

token        delimiter  PoS tag
Linguistics  _          NNP
is           _          VBZ
the          _          DT
scientific   _          JJ
study        _          NN
of           _          IN
natural      _          JJ
language     _          NN
.            _          .

Represented in the output file as a so-called 'in-line' annotation, it looks like this:

Linguistics_NNP is_VBZ the_DT scientific_JJ study_NN of_IN natural_JJ language_NN ._.
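To make the structure of this in-line format concrete, here is a minimal Python sketch that splits such a line back into token/tag pairs. The function name split_tagged and the choice to split at the last delimiter are assumptions made for illustration; they are not part of AntConc or the Stanford PoS Tagger.

```python
def split_tagged(line, delimiter="_"):
    """Split in-line annotated text into (token, tag) pairs."""
    pairs = []
    for item in line.split():
        if delimiter not in item:
            # Untagged item: keep the token and leave the tag empty.
            pairs.append((item, ""))
            continue
        # rpartition splits at the LAST delimiter, so a token that itself
        # contains an underscore keeps its full form.
        token, _, tag = item.rpartition(delimiter)
        pairs.append((token, tag))
    return pairs


sentence = ("Linguistics_NNP is_VBZ the_DT scientific_JJ study_NN "
            "of_IN natural_JJ language_NN ._.")
pairs = split_tagged(sentence)

for token, tag in pairs:
    print(f"{token}\t{tag}")
```

Each whitespace-separated item is one token with its tag attached, which is exactly the structure a concordancer has to be told about.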

This format can be opened and queried in AntConc (3.5.9), but some settings first have to be adjusted so that the software recognizes the internal structure of the data.
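To illustrate what querying this structure means, the following sketch uses a regular expression to find an adjective directly followed by a noun in the example sentence. It runs in Python's re module; the pattern is an illustrative assumption, and it is not a description of AntConc's own settings or of how its search behaves.

```python
import re

sentence = ("Linguistics_NNP is_VBZ the_DT scientific_JJ study_NN "
            "of_IN natural_JJ language_NN ._.")

# token_JJ followed by token_NN: \S* after the tag also admits longer
# tags that start the same way (e.g. JJR, NNS, NNP).
# (?<!\S) makes sure the match starts at the beginning of a token.
adj_noun = re.compile(r"(?<!\S)(\S+)_JJ\S* (\S+)_NN\S*")

matches = adj_noun.findall(sentence)
for adjective, noun in matches:
    print(adjective, noun)
```

The point of the example is that the delimiter and the tag are ordinary characters in the text, so a search pattern can refer to them just as it refers to the tokens.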

XML annotated data

[IN PROGRESS!]

XML annotated data: <wtext type="OTHERPUB"><p type="caption"> <s n="1"><w c5="NN1" hw="number" pos="SUBST">Number </w><w c5="CRD" hw="133" pos="ADJ">133</w></s></p><div level="1"><head> <s n="2"><mw c5="AV0"><w c5="PRP" hw="at" pos="PREP">At </w><w c5="ORD" hw="last" pos="ADJ">last </w></mw><w c5="PNP" hw="it" pos="PRON">it</w><w c5="VBZ" hw="be" pos="VERB">'s </w><w c5="AV0" hw="here" pos="ADV">here</w><c c5="PUN">!</c></s></head><p> (source: <bncDoc xml:id="C8K">)
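As a rough orientation to this structure, the sketch below reads the second sentence of the excerpt with Python's standard xml.etree.ElementTree module and lists each word with its c5, hw and pos attributes. It assumes a complete, well-formed s element with straight quotation marks around the attribute values; the excerpt above is only a fragment and would not parse as it stands.

```python
import xml.etree.ElementTree as ET

# One complete <s> element taken from the excerpt above.
snippet = (
    '<s n="2"><mw c5="AV0">'
    '<w c5="PRP" hw="at" pos="PREP">At </w>'
    '<w c5="ORD" hw="last" pos="ADJ">last </w></mw>'
    '<w c5="PNP" hw="it" pos="PRON">it</w>'
    "<w c5=\"VBZ\" hw=\"be\" pos=\"VERB\">'s </w>"
    '<w c5="AV0" hw="here" pos="ADV">here</w>'
    '<c c5="PUN">!</c></s>'
)

root = ET.fromstring(snippet)

rows = []
for element in root.iter():
    # <w> holds a word, <c> holds punctuation; <mw> only groups words.
    if element.tag in ("w", "c"):
        form = (element.text or "").strip()
        rows.append((form, element.get("c5"), element.get("hw"), element.get("pos")))

for form, c5, headword, pos in rows:
    print(form, c5, headword, pos, sep="\t")
```

Punctuation elements carry only a c5 attribute, so their hw and pos values come back as None.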