Differences

This shows you the differences between two versions of the page.

linguisticsweb:tutorials:linguistics_tutorials:corpus_analysis_basics [2019/06/29 11:36]
sabinebartsch [The AntConc main window]
linguisticsweb:tutorials:linguistics_tutorials:corpus_analysis_basics [2019/07/21 21:47]
sabinebartsch [Corpus analysis basics]
Line 1: Line 1:
 ====== Corpus analysis basics ======
  
-''tutorial status: under construction (2019-06-24)''
+''tutorial status: under construction (2019-07-22)''
  
-Corpus linguistics basically has two principal technological mainstays:
+Corpus linguistics has two principal technological mainstays:
  
   * the digital corpus, i.e. a principled collection of plain text data that forms the data basis of the research
   * tools for search over patterns in the data, for data sorting and display
  
-These two form the backbone of any corpus linguistic research. Digital corpora are available from various sources. They are comprised of sets of plain text (.txt) data, sometimes including just the text, sometimes including text plus annotation such as part of speech tagging. For the time being, we are going to work with unannotated plain text corpora. Tools for searching over patterns in the data include in their basic functionality ways of creating lists of words and their frequencies, so-called word-frequency lists. These are often useful in order to get a sense of what a corpus is comprised of and how many instances of particular words and constructions are represented in the corpus. The second many functionality is concordancing, i.e. search for words the corpus and displaying them with a set context to the left and right of the search word(s). We call these key-word in context (KWIC) concordances. These functions are often implemented in the same software because they work hand in hand. We are going to look at AntConc as an example of a modern concordancing software, but be aware that there are others out there as well.
+These two form the backbone of any corpus linguistic research. Digital corpora are available from various sources. In their simplest form, corpora consist of sets of plain text (.txt) data, sometimes including just the text, sometimes including text plus annotation such as part-of-speech tagging. For the time being, we are going to work with unannotated plain text corpora. Tools for searching over patterns in the data include in their basic functionality ways of creating lists of words and their frequencies, so-called word-frequency lists. These are often useful in order to get a sense of what kind of linguistic data a corpus consists of and how many instances of particular words and constructions are represented in the corpus. The other main functionality is concordancing, i.e. the possibility of searching for words in the corpus and displaying them together with a set context to the left and right of the search word(s). These are commonly called key-word in context (KWIC) concordances.
+These two principal functions, i.e. word (frequency) listing and concordancing, are often implemented in the same software because they are used hand in hand. We are going to look at AntConc as an example of a commonly used concordancing program, but be aware that there are others out there as well.
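To make the idea of a KWIC concordance concrete, here is a minimal Python sketch. It is not how AntConc works internally; the function name ''kwic'', the ''width'' parameter and the sample sentence are assumptions made up for illustration. It finds every occurrence of a search word and shows it with a fixed number of characters of context on either side.

```python
import re

def kwic(text, keyword, width=30):
    """Return one key-word-in-context line per match of `keyword`.

    `width` is the number of characters of context shown on each side
    (a hypothetical parameter, chosen here for illustration).
    """
    # Match the keyword as a whole word, ignoring case.
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    lines = []
    for match in pattern.finditer(text):
        left = text[max(0, match.start() - width):match.start()]
        right = text[match.end():match.end() + width]
        # Right-align the left context so the search word lines up in a column.
        lines.append(f"{left:>{width}} [{match.group(0)}] {right:<{width}}")
    return lines

# Invented sample text, not taken from any real corpus.
sample = "The corpus is a collection of texts. A corpus can be annotated."
for line in kwic(sample, "corpus", width=20):
    print(line)
```

Aligning the search word in a fixed column is what makes recurring patterns to its left and right easy to spot when the lines are sorted.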
  
 This tutorial offers a first introduction to corpus analysis. It introduces basic techniques of exploring digital corpora by means of computational tools such as AntConc. Please download a version of AntConc that is appropriate for your operating system. There is not much to install here: AntConc ships as a single executable file. All you have to do is copy it to a directory of your choice. Personally, I prefer to have a tools directory where I install all my linguistics tools so that I can use them independently of any particular class. I tend to install all my tools to
Line 39: Line 40:
  
 {{:linguisticsweb:tutorials:linguistics_tutorials:antconc03-default-window-corpus-files.png?nolink&600|}}
 +
 +
 +===== Concordancing function =====
  
 To the right of this there is a large area that is organized into tabs ranging from the main ''Concordance'' to ''Word List'' and ''Key-Word List''; we will explore all of these in due course.
Line 44: Line 48:
  
 {{:linguisticsweb:tutorials:linguistics_tutorials:antconc04-default-window-search-word-right-sorted.png?nolink&600|}}
 +
 +
 +===== Word frequency list =====
 +
 +{{:linguisticsweb:tutorials:linguistics_tutorials:antconc05-wordfrqlist.png?nolink&600|}}
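As a rough illustration of what a word-frequency list contains, the following Python sketch counts word forms in a short string. The tokenisation rule (lower-cased runs of letters and apostrophes) and the name ''word_frequencies'' are assumptions for this example only; AntConc's own token definition may differ.

```python
import re
from collections import Counter

def word_frequencies(text):
    """Count lower-cased word forms in `text`.

    The token pattern is a simplifying assumption for this sketch.
    """
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens)

# Invented sample text, not taken from any real corpus.
sample = "the cat sat on the mat and the dog sat too"
for word, count in word_frequencies(sample).most_common(3):
    print(f"{count}\t{word}")
```

Sorting by frequency, as ''most_common'' does here, mirrors the usual presentation of such lists with the most frequent words at the top.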
 +
 +
 +===== Clusters / N-Grams =====
 +
 +{{:linguisticsweb:tutorials:linguistics_tutorials:antconc06-bigrams-2-2.png?nolink&600|}}
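The notion of an n-gram can be sketched in a few lines of Python. This is only an illustration under simple assumptions (whitespace tokenisation, a hypothetical helper called ''ngrams''); it does not reproduce AntConc's Clusters/N-Grams tool. With ''n=2'' it yields bigrams, i.e. pairs of adjacent words.

```python
from collections import Counter

def ngrams(tokens, n=2):
    """Return all runs of `n` adjacent tokens as tuples."""
    # zip over n staggered copies of the token list; it stops at the shortest.
    return list(zip(*(tokens[i:] for i in range(n))))

# Invented sample text, split on whitespace for simplicity.
tokens = "the cat sat on the mat and the cat slept".split()
bigram_counts = Counter(ngrams(tokens, 2))

for bigram, count in bigram_counts.most_common(3):
    print(count, " ".join(bigram))
```

Counting the resulting tuples gives a frequency list of word sequences rather than of single words.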