Foundations of Statistical Natural Language Processing (Ch. 1) - Manning/Schutze
One of the big questions that Manning and Schutze pose in their first chapter is: ‘how do humans use language?’ The answer is not simple, and it requires a deep, theoretical exploration of linguistics. Linguists claim that all languages have an underlying structure, which is to say that languages have rules. However, it is difficult to characterize those rules exactly, or to draw a sharp line between utterances that are well-formed and ill-formed, usual and unusual, grammatical and ungrammatical.
Two major approaches that Manning and Schutze put forth are the rationalist approach and the empiricist approach. The rationalist approach says that “a significant part of the knowledge in the human mind is not derived by the senses but is fixed in advance, presumably by genetic inheritance” (M&S, 5). Basically, as humans, we are born with some innate faculty that allows us to acquire language. The evidence offered for this is that children learn the complexities of natural language despite the limited input they hear during childhood. The empiricist approach differs only by degree: it assumes that a baby’s brain begins with general operations for association, pattern recognition, and generalization, rather than a set of principles and procedures specific to the various components of language. As an aspiring linguist, I find the empiricist approach fundamentally stronger, because the idea that the mind organizes its input through general operations falls in line nicely with the Sapir-Whorf hypothesis that our perceptions of the world and our thoughts are influenced by language and by the acquisition of that language.
Both approaches are used in Statistical NLP and provide a means for determining how language is used. Difficulty arises with the concept of grammaticality, which presupposes a dichotomy: an utterance is either structurally well-formed or it is not. This causes problems, because language users can form many sentences that are grammatical but semantically strange. Chomsky’s example is “Colorless green ideas sleep furiously.” This is not something one would expect a native speaker to say, as most native speakers normally produce meaningful sentences. Furthermore, a dichotomy between grammatical and ungrammatical gives no information about conventions. Thus, it is important to know the frequency of the different sentences and sentence types in use, and to determine conventionality in a language: the way in which people commonly express or do something even though other ways are structurally possible.
So, where does probability fit into all of this? Manning and Schutze argue that if human cognition is probabilistic, then language must be as well, since it is an integral part of cognition. The argument that cognition is probabilistic holds up if we look at how one makes decisions in this crazy, uncertain world of ours. How do you reason that the river before you is safe to cross? First, you look to see how fast the current is flowing (the slower the better). Next, you check to see how deep it is, and maybe you already know that there are no dangerous animals. All of these things you are taking in and reasoning about are events. To decide whether to cross, you have to estimate the probability that you will make it across; in that sense, our cognition is probabilistic. Now, if someone behind you were to tell you, ‘hey, that part of the river has a clay bottom; it’s easier to walk across there,’ then you are “processing the words, forming an idea of the overall meaning of the sentence, and weighing it in making your decision…” (M&S, 15). Chomsky argues against a probabilistic account of language. He says that computing the probability of sentences from a corpus of utterances would assign the same low probability to all unattested sentences, grammatical or ungrammatical.
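Chomsky's objection is easy to see in a small sketch. Assuming we estimate a sentence's probability simply as its relative frequency in a corpus (my own simplification for illustration, not a method taken from the book), every sentence that never occurs gets the same estimate, whether it is grammatical or not. The four-sentence corpus and the function name `sentence_probability` are made up for this example.

```python
from collections import Counter


def sentence_probability(sentence, corpus_counts, total):
    """Relative-frequency estimate: count of the whole sentence / corpus size."""
    # Counter returns 0 for keys it has never seen, so unattested
    # sentences all receive exactly the same estimate.
    return corpus_counts[sentence.lower()] / total


# Hypothetical toy corpus; each string is one attested utterance.
corpus = [
    "the river is safe to cross",
    "the river is deep",
    "the current is slow",
    "the river is safe to cross",
]
counts = Counter(s.lower() for s in corpus)
total = sum(counts.values())

attested = "the river is safe to cross"
grammatical_unseen = "colorless green ideas sleep furiously"
ungrammatical_unseen = "furiously sleep ideas green colorless"

p_attested = sentence_probability(attested, counts, total)
p_grammatical = sentence_probability(grammatical_unseen, counts, total)
p_ungrammatical = sentence_probability(ungrammatical_unseen, counts, total)

print(p_attested, p_grammatical, p_ungrammatical)
```

The two unseen sentences come out identical, which is exactly the complaint: a raw count over whole sentences cannot tell the strange-but-grammatical sentence from the scrambled one.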
The difficulty in using probability in NLP stems from the ambiguity of language in general. Ambiguity is a feature of every language, and parsing is one way that computer science tries to deal with it. A parse is a syntactic analysis of a sentence. Ambiguity arises when the same sentence has several different parses, which becomes more likely the longer a sentence gets. A well-developed NLP system must therefore be good at making disambiguation decisions. Perhaps if we can teach a machine language, we can better understand how humans learn language to begin with.
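To make the link between sentence length and ambiguity concrete, here is a rough sketch that counts how many parses a sentence has under a tiny grammar. Everything in it is my own assumption for illustration: the six grammar rules, the small lexicon, and the choice of a CKY-style chart to do the counting. None of it comes from the chapter.

```python
from collections import defaultdict

# Hypothetical toy grammar in binary form: (parent, left child, right child).
BINARY_RULES = [
    ("S", "NP", "VP"),
    ("VP", "V", "NP"),
    ("VP", "VP", "PP"),
    ("NP", "Det", "N"),
    ("NP", "NP", "PP"),
    ("PP", "P", "NP"),
]

# Hypothetical lexicon: each word maps to its possible categories.
LEXICON = {
    "i": ["NP"], "saw": ["V"], "the": ["Det"], "man": ["N"],
    "telescope": ["N"], "park": ["N"], "with": ["P"], "in": ["P"],
}


def count_parses(sentence, start="S"):
    """Count distinct parse trees for the sentence with a CKY-style chart."""
    words = sentence.lower().split()
    n = len(words)
    # chart[i][j][X] = number of trees rooted in X that span words[i:j]
    chart = [[defaultdict(int) for _ in range(n + 1)] for _ in range(n + 1)]
    for i, word in enumerate(words):
        for category in LEXICON.get(word, []):
            chart[i][i + 1][category] += 1
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):  # every split point inside the span
                for parent, left, right in BINARY_RULES:
                    chart[i][j][parent] += chart[i][k][left] * chart[k][j][right]
    return chart[0][n][start]


short = count_parses("I saw the man with the telescope")
longer = count_parses("I saw the man with the telescope in the park")
print(short, longer)
```

Adding one more prepositional phrase raises the number of parses from two to five under this grammar, since each new phrase can attach to the verb or to any earlier noun phrase.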
Manning and Schutze mention a few tools of the trade that they find to be extremely helpful for the journey that I am about to embark on. The first is the Brown Corpus, a corpus of about a million words put together at Brown University in the 1960s and 1970s. Other corpora include the Lancaster-Oslo-Bergen (LOB) Corpus, which is a British English counterpart of the Brown Corpus, and the Susanne Corpus, which is a subset of the Brown Corpus. Once I have obtained a significant amount of text to be analyzed, there are a few basic questions that I should answer: What are the most common words in the text? How many words are there in the text as a whole? How many different words, or word types, appear in the text? Looking at the examples given by Manning and Schutze, it is clear that it will be hard to predict the behavior of words that are not in the corpus, or barely in the corpus at all. Using a larger corpus helps only a little: most word types remain rare however much text is collected. Zipf’s law describes this phenomenon: a word’s frequency is roughly inversely proportional to its rank in the frequency list. Zipf attributed it to the Principle of Least Effort, which says that people act in such a way as to minimize their probable average rate of work.
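The basic questions above can be answered with a few lines of Python. This is only a minimal sketch: the tokenizer (lowercase, runs of letters and apostrophes) is my own simplifying assumption, the function names are made up, and the sample sentence stands in for a real corpus such as Brown. The `rank_frequency` helper lists rank times frequency for each word, which Zipf's law predicts should stay roughly constant on a large text.

```python
import re
from collections import Counter


def tokenize(text):
    # Assumption: a "word" is a run of letters or apostrophes, case-folded.
    return re.findall(r"[a-z']+", text.lower())


def corpus_stats(text, top_n=3):
    """Word tokens, word types, most common words, and words seen only once."""
    tokens = tokenize(text)
    counts = Counter(tokens)
    return {
        "tokens": len(tokens),                     # words in the text as a whole
        "types": len(counts),                      # different words
        "most_common": counts.most_common(top_n),  # the most frequent words
        "hapax": sum(1 for c in counts.values() if c == 1),  # seen exactly once
    }


def rank_frequency(text):
    """Return (rank, word, frequency, rank * frequency) in frequency order."""
    counts = Counter(tokenize(text))
    return [
        (rank, word, freq, rank * freq)
        for rank, (word, freq) in enumerate(counts.most_common(), start=1)
    ]


# Stand-in text; replace with a real corpus to see Zipf-like behavior.
sample = "The cat sat on the mat and the dog sat on the rug"
stats = corpus_stats(sample)
print(stats)
for row in rank_frequency(sample):
    print(row)
```

On a sample this small the rank-times-frequency column means very little, but the counts already show the pattern from the chapter: one or two very frequent words and a long tail of words that appear only once.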