Tuesday, June 13, 2017

Natural Language Processing

I have been studying natural language processing lately (otherwise known as computational linguistics). I began with NLTK (Natural Language Toolkit), an open source natural language processing tool kit. This is a superb guide to practical computational linguistics featuring a free comprehensive textbook (which is frequently used for a single semester course in natural language processing at advanced undergraduate or postgraduate level) and software package running in a Python environment on Windows or Linux.The field covers a wide range, but an example readily available to many people these days is the process by which your smart phone accepts vocal commands from you. This involves segmenting the phonemes (the individual pieces of spoken words, nominally involving a consonant and a vowel), putting breaks in the incoming stream of language sound you make and then attempting to match those with words from a lexicon (large list of possible words).
This is no easy task, but it is followed by the even more challenging pursuit of meaning, attempting to map what you have spoken to actions the phone can take, including the object of such an action. For example, if you commanded your phone "search for Mexican restaurants in Las Cruces" the phone would look for a command in that string of sounds, a command it recognizes. If it successfully recognized "search for" then it would branch in its processing logic to objects of such a command, i.e., what you want to search for.

This would require tagging each word in the utterance (what you just said to the phone) to identify the command and its object(s). The phone would have to recognize that "(Mexican) restaurants" is the search object.

Here is a look at the result of a natural language processor tagging the text string of the utterance we are discussing (we will ignore the details of how the sounds you made became this text stream):

>>> grammar = nltk.data.load('grammars/large_grammars/atis.cfg')
>>> parser = nltk.parse.EarleyChartParser(grammar)
>>> text = nltk.word_tokenize("Look for Mexican restaurants in Las Cruces.")
>>> nltk.pos_tag(text)
[('Look', 'NN'), ('for', 'IN'), ('Mexican', 'JJ'), ('restaurants', 'NNS'), ('in', 'IN'), ('Las', 'NNP'), ('Cruces', 'NNP'), ('.', '.')]

Notice how each word in the utterance (what you said to the phone) has now been tagged with a part of speech label (which we refer to as simply "tag").  'IN' means "preposition."  'NN' means "singular noun," 'NNS' means 'plural noun,' and 'NNP' means 'proper noun' (typically the name of a person or place). These grammatical tags are taken primarily from the Brown Corpus, a landmark publication by Brown University in 1964, featuring over a million words of running text or edited prose published in the United States in 1961.

Before I ran the parser on our target utterance I had to give it a grammar (you can see that I loaded the atis.cfg grammer in the Python IDLE session above (Python is a computer programming language frequently used in science; IDLE is an integrated development environment, i.e., a windowed application that makes it easier to write and test code). The ATIS grammar, developed by Microsoft Research, was extracted from a treebank of the DARPA ATIS3 training sentences. Sentences are typically parsed into treelike structures. Well, I will see if a picture is worth a thousand words here and show you a tree parse diagram of the sentence we are working with (from a parse done later to correct mislabelling of the verb):

It does appear somewhat like an upside down tree, where the tree's root is at the top and its branches become developed as it proceeds down the page, the inverse of an oak tree rooted at the ground and branching above. A treebank (in the context of computational linguistics) is a database of such syntactic or parse trees. Such a treebank can be analyzed to discover typical patterns of syntax, i.e., the way the different parts of speech are normally organized in sentences of a particular language. For example, English sentences typically have the subject first (going left to right) and it usually is a noun or noun phrase. The predicate, the part to the right of the subject typically, is formed around a verb. We form sentences without having to think much about it, having brains that are evolved to learn and process language (I will agree with Noam Chomsky on this and may say more about it later), but it is difficult to program a machine to do this. One of the ways to construct a computer program that will parse sentences is to analyze a treebank and produce rules of grammar that describe the frequent patterns in the treebank (like subject/NN-->predicate/VP).

A context free grammar (CFG) is often used to formally present the rules of grammar in a form a computer can process. For example, S --> NP VP, which tells us that the symbol 'S' on the left, symbolizing 'sentence' can be produced by a noun phrase (NP) followed by a verb phrase (VP). There are many different ways to form sentences (understatement). In fact, that is one of the things that distinguishes human speech from animal communications (well, it used to), i.e., that each utterance is potentially unique, never previously said, created by putting together the blocks of language by the rules of grammar to accomplish the communication of a potentially novel thought. So you really want a computer to help grind through huge treebanks of sentences labelled with their parts of speech (POS) tags and generate grammar rules as much as possible (since the rules will be lengthy, i.e., many lines of the kind of production rule I just showed you above).

DARPA, the Defense Advanced Research Projects Agency (of the United States government), has done linguistics research among other things.  For example, they created the TCP/IP protocol that we use to communicate over the Internet---that protocol was designed to assure an email got to its destination even if a city or two was destroyed by a nuclear attack, TCP/IP being able to try different routes if a particular city disappears). They have also been working on true artificial intelligence (not the chicanery promoted as AI by many software folks and companies, which I won't name since I am blogging on their platform), but they abruptly went "black" on the subject after 2012, now only presenting this effort as using mammalian brain structural hints to create advanced computer chips. Their actual intent is to create true mammalian brain intelligence, which I will prove by reproducing one of their press images from 2012 (which seems to have been removed) describing the SyNAPSE project (to alarm those of you who have watched the Terminator movie series):


DARPA was interested in machine reading and other computational linguistics subjects and produced the ATIS3 training sentences which Microsoft used to produce the ATIS grammar that I gave to the Earley parser I used to analyze the "look for Mexican restaurants in Las Cruces" sentence above. The Earley parser is a chart parser that uses a set of grammar rules (as just discussed) in a dynamic programming environment, trying to predict which of the grammar rules to use next as it moves from left to right across a sentence trying to match up rules with the words and POS tags it encounters. It is important to predict which rule to use next because to simply scan through the entire CFG grammar file for each word of each sentence might take a prohibitively long time. The ATIS grammar I used above has about 5,235 lines (rules).

Well, some of you who have persisted in the grueling task of reading my entire post may be wondering if I noticed that the Earley parser mislabelled the verb 'look' in the sentence. Yes, I did. So I had to obtain a more robust computational package (I am sure I could have gotten better results with NLTK had I spent more time teaching classifiers, but I was in a hurry), my simple sentence being a somewhat unfair target, being a command to a machine and missing a subject (that being understood by most humans to be 'you,' i.e., the person or thing being commanded).

I got hold of a language processing pipeline from the Natural Language Processing (NLP) group at Stanford University and ran their coreNLP pipeline as a local server, using http post to request parsing of the target sentence from my IDLE Python console (assisted in that http post process by use of a nice Python package called Requests, advertised as 'the only non-GMO HTTP library for Python, safe for human consumption' by its ingenious author, Kenneth Reitz, and some interface Python code written by a postgrad AI student at Berkeley, Smitha Milli). The Stanford NLP software is industrial strength but did churn for a minute or two to produce a correct parse:

c:\stanfordcnlp>java -cp "c:\stanfordcnlp\*" -Xmx1500m edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 9000 -timeout 60000 -annotators tokenize, ssplit,pos,depparse,parse
...
[main] INFO CoreNLP - Starting server...
[main] INFO CoreNLP - StanfordCoreNLPServer listening at /0:0:0:0:0:0:0:0:9000

Then I started a Python IDLE session and requested service from the Stanford server locally:
 
ActivePython 2.7.10.12 (ActiveState Software Inc.) based on
Python 2.7.10 (default, Aug 21 2015, 12:07:58) [MSC v.1500 64 bit (AMD64)] on win32
...
>>> from pycorenlp import StanfordCoreNLP
>>> nlp = StanfordCoreNLP('http://localhost:9000')
>>> text = ('Look for Mexican restaurants in Las Cruces.')
>>> properties = {'annotators': 'tokenize,ssplit,pos,depparse,parse', 'outputFormat': 'json'}
>>> output = nlp.annotate(text, properties)

...and the server response was:

>>> print(output['sentences'][0]['parse'])
(ROOT
  (S
    (VP (VB Look)
      (PP (IN for)
        (NP
          (NP (JJ Mexican) (NNS restaurants))
          (PP (IN in)
            (NP (NNP Las) (NNP Cruces))))))
    (. .)))
>>>


So, we got the proper tagging and parse of our sentence. I wanted to see a tree visual of this so I laboriously manually entered into NLTK the CoNLL2000 IOB tag lines corresponding to the parse from the Stanford NLP parse:

>>> chunkTest = """
look VP B-VP
for IN B-PP
restaurants NNS B-NP
Mexican JJ I-NP
in IN B-PP
Las NNP B-NP
Cruces NNP I-NP
. . O
"""
>>> nltk.chunk.conllstr2tree(chunkTest, chunk_types=['VP', 'NP', 'PP']).draw()


...and obtained the following visual presentation of the Stanford parse:




CoNLL2000 was the 2000 Conference on Computational Natural Language Learning (CoNLL-2000). Chunking captures larger pieces of a sentence, grouping POS tagged words into chunks like VP (verb phrases), NP (noun phrases) and PP (prepositional phrases). The data file format they used was IOB, which you can see above were each line of a sentence with a word, a POS tag and an IOB chunk tag specifying 'B' for beginning of a chunk, 'I' for 'in a chunk', and 'O' for 'out of a chunk, i.e., not a recognized chunk type.' 

I had better close this post, since I am getting some strange edit behavior and may be exceeding the size limits here. Stay tuned though---I intended to talk more about what is is machines are doing when they process language.