[PLUG-TALK] [PLUG] Python and Natural Language Toolkit

John Jason Jordan johnxj at comcast.net
Thu Nov 19 11:25:07 PST 2015


On Thu, 19 Nov 2015 09:31:44 -0800
John Jason Jordan <johnxj at comcast.net> wrote:

>Thanks, that seems to have worked! I'll know for sure in a bit when I
>try to do the homework exercises.

Yes, it worked, homework all done. :)

For the ling-curious here I should add a bit regarding what this is all
about (while moving to PLUG-talk).

A sub-field of linguistics is corpus linguistics, closely related to
computational linguistics, a field that was all but impossible before
computers. A corpus is a body of text in a language, compiled from
written documents and sometimes transcriptions of spoken language as
well, and stored electronically. For an example of a publicly available
web-based corpus, try COCA:

	http://corpus.byu.edu/coca/

Each word in a tagged corpus carries one or more features of the word,
such as its part of speech, so that other linguists can use the corpus
to search for how words are used by real speakers and writers.
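
For the Python-curious, here is a rough sketch of what "tagged" can
mean in practice. It uses only the standard library rather than NLTK
(which has its own taggers and tagged corpora). The sentence, the
simplified tag names, and the function names are all invented for
illustration; they are not taken from COCA or from the homework.

```python
from collections import Counter

# A toy tagged corpus: each token is a (word, tag) pair.
# The sentence and the simplified tag names are invented for illustration.
TAGGED = [
    ("the", "DET"), ("big", "ADJ"), ("dog", "NOUN"), ("runs", "VERB"),
    ("in", "PREP"), ("the", "DET"), ("large", "ADJ"), ("park", "NOUN"),
]

def words_with_tag(tagged, tag):
    """Return every word carrying the given tag, in corpus order."""
    return [word for word, t in tagged if t == tag]

def tag_counts(tagged):
    """Count how often each tag occurs in the corpus."""
    return Counter(t for _, t in tagged)

adjectives = words_with_tag(TAGGED, "ADJ")
counts = tag_counts(TAGGED)
```

A real corpus has millions of such pairs and a much richer tag set, but
the idea of searching by tag instead of by spelling is the same.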

What is the benefit of this? Think for a moment of what you learned
about grammar when you were studying a foreign language in school.
While a large portion of that grammar is still held to be accurate, by
using corpora we have discovered that such traditional grammar models
are full of inaccuracies and misleading notions, not to mention the
fact that they are hopelessly incomplete. We can also discover how
words are used by searching for collocations (words that habitually
occur together), leading to a favorite statement by one of our PSU
professors: "there is no such thing as a synonym." Think for a moment
about 'big' and 'large'. They mean the same thing, right? Well, by
searching in a corpus we discover that they collocate differently. There
are contexts where English speakers never use one of them and other
contexts where they never use the other.
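
The collocation idea can be sketched in a few lines of Python. This is
only a toy, assuming a tiny made-up text and a hypothetical helper named
right_collocates; a real study would run over a large corpus, and NLTK
has its own collocation tools. It simply counts which word immediately
follows 'big' and which follows 'large'.

```python
from collections import Counter

# Invented toy text, not real corpus data.
TEXT = (
    "we made a big mistake . a large number of people came . "
    "it was a big deal . a large amount of money was lost . "
    "my big brother made a big mistake ."
)

def right_collocates(tokens, target):
    """Count the words that immediately follow `target`."""
    return Counter(
        nxt for word, nxt in zip(tokens, tokens[1:]) if word == target
    )

tokens = TEXT.split()
big = right_collocates(tokens, "big")
large = right_collocates(tokens, "large")

# Words that follow both adjectives in this toy text.
shared = set(big) & set(large)
```

In this made-up text the two adjectives share no right-hand neighbors at
all, which is the kind of pattern a corpus search lets you check against
real usage.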

For an example of a grammar of English that was created based only on
corpus data, look at the Longman Grammar of Spoken and Written English:

	http://tinyurl.com/p82n9e8

From the Amazon description: "The Longman Grammar of Spoken and Written
English is a revolutionary, corpus-based reference grammar of English,
based on a groundbreaking research project to analyze the ways in which
English grammar is really used. The book looks at four text types -
conversation, fiction, news reportage, and academic prose - and reports
statistical findings as well as examining the reasons that condition a
particular grammatical choice." In my opinion this is the first real
English grammar book, a grammar that renders all previous English
grammars obsolete.


