Hello, NLP

This post was originally part of a Tohoku-journal post but grew into its own thing.  I'm setting it up to be auto-posted a few days after it's written so it doesn't block the journal post, but in case you missed it, here's The Latest.  

This post takes heavily from the introductory text on NLP I've been reading by Jurafsky and Martin, Speech and Language Processing.

Research

My weekend thus far has been studying.  I have an appointment with Inui sensei on Monday to start narrowing down a topic based on what seemed interesting to me in the book I've been reading.  It looks like the meeting will just be me saying "this seems neat" and him pointing me to some more specific papers to check out in the field.  Things that have caught my eye so far:  (Warning: it gets nerdy from here)

Grammars for Natural Languages

Grammars are mathematical constructs made up of rules for generating a language.  I've worked with them in the past with programming languages and other basic, limited languages.  With natural language (human language), we can use grammars to check if a sentence is grammatically correct; we can also work in the opposite direction and use a grammar to generate sentences.
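
To make the generating-and-checking idea concrete, here's a minimal sketch in Python.  The grammar, its rule names, and the helper functions generate and is_grammatical are all made up for illustration (they aren't from Jurafsky and Martin), and brute-force enumeration like this only works because the toy grammar has no recursion, so it produces finitely many sentences.

```python
import itertools

# A toy context-free grammar: each nonterminal maps to its possible expansions.
# Anything that is not a key in GRAMMAR is treated as a terminal (a word).
GRAMMAR = {
    "S": [["NP", "VP"]],
    "NP": [["Det", "N"]],
    "VP": [["V", "NP"], ["V"]],
    "Det": [["the"], ["a"]],
    "N": [["student"], ["arrow"]],
    "V": [["sees"], ["sleeps"]],
}


def generate(symbol):
    """Yield every word sequence the grammar can derive from `symbol`."""
    if symbol not in GRAMMAR:  # terminal: nothing left to expand
        yield [symbol]
        return
    for expansion in GRAMMAR[symbol]:
        # Expand each right-hand-side symbol, then combine one choice from each.
        parts = [list(generate(s)) for s in expansion]
        for combo in itertools.product(*parts):
            yield [word for part in combo for word in part]


# Direction 1: use the grammar to generate sentences.
SENTENCES = {" ".join(words) for words in generate("S")}


def is_grammatical(sentence):
    """Direction 2: check a sentence against the (finite) generated language."""
    return sentence in SENTENCES
```

This tiny grammar produces 40 sentences; "the student sleeps" is among them and "student the sleeps" is not.  A real grammar is recursive and generates infinitely many sentences, which is exactly why we need a proper parser instead of a lookup.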

Parsing

Parsing is something that goes hand in hand with grammar.  Given some English text, how do we break it up from a string of text into a tree object that corresponds with the rules of the grammar?  This raises some really interesting problems.  Look at the sentence

Time flies like an arrow.

We see this and automatically know that "time", here, is a noun, and "flies" is a verb.  We see instantly the intended meaning of "temporality moves in the way of a projectile: fast".  But we can break this sentence up other ways.  What if I do this: "(time flies) (like) (an arrow)"?  Suddenly a new meaning pops into view that you (probably) hadn't seen before.  Now we're talking about something called "time flies", probably out of Doctor Who, and how they're fond of an arrow somewhere out in the world.

Another interpretation involves seeing the sentence as an imperative.  Now "time" is a verb (like with a stopwatch), "flies" are the thing we're supposed to time, and we're supposed to do it the way an arrow would do it.  Or are we only supposed to time flies that are like arrows?  In this interpretation, another ambiguity shows up.  Does "like an arrow" attach to the verb, "time", or to the noun, "flies"?
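
These readings can be counted mechanically.  Below is a rough sketch of a CKY-style chart parser that counts parse trees instead of building them.  The lexicon and rules are a hand-written toy I made up to cover just this one sentence; single-word phrases (like "time" as a whole NP) are folded into the lexicon to keep every rule binary, and I'm treating a bare VP at the root as the imperative reading.  None of this is the book's grammar.

```python
from collections import defaultdict

# Toy lexicon (an assumption for this example): word -> categories it can take.
LEXICON = {
    "time": {"N", "NP", "V"},
    "flies": {"N", "NP", "V", "VP"},
    "like": {"V", "P"},
    "an": {"Det"},
    "arrow": {"N"},
}

# Binary rules as (parent, left child, right child).
RULES = [
    ("S", "NP", "VP"),
    ("VP", "V", "NP"),
    ("VP", "VP", "PP"),
    ("NP", "Det", "N"),
    ("NP", "N", "N"),
    ("NP", "NP", "PP"),
    ("PP", "P", "NP"),
]


def count_parses(words):
    """Count parse trees over the whole sentence, split by root category."""
    n = len(words)
    # chart[(i, j)][label] = number of distinct trees labelled `label`
    # that cover words[i:j].
    chart = defaultdict(lambda: defaultdict(int))
    for i, word in enumerate(words):
        for label in LEXICON[word]:
            chart[(i, i + 1)][label] = 1
    # Fill in longer spans from shorter ones, trying every split point k.
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            j = i + length
            for k in range(i + 1, j):
                for parent, left, right in RULES:
                    chart[(i, j)][parent] += chart[(i, k)][left] * chart[(k, j)][right]
    root = chart[(0, n)]
    return {"declarative": root["S"], "imperative": root["VP"]}
```

With this toy grammar, count_parses("time flies like an arrow".split()) finds two declarative trees (the proverb, and the "time flies" that like arrows) and two imperative ones (one for each attachment of "like an arrow").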

It's also neat to ask the question of "how do humans parse language?".  NLP often models its methods on psycholinguistic research results.  Here's a neat sentence from Jurafsky that made absolutely no sense to me for quite some time after reading it: it's seemingly difficult for humans to parse.  Why?  It's a fairly simple sentence without any sort of convoluted grammar.

The complex houses married and single students and their families

Yeah, I'm into this stuff.

Feature Structure and Unification

It's always neat to see an algorithm pop its head up in other fields.  I've worked with unification quite a few times in functional programming contexts, particularly when working on lambda calculus systems.  Unification does what it sounds like: it answers the question, given these two objects, can we make one unified object?

In the context of feature structures, the objects unification gets to work with are an added feature to our earlier grammars.  These feature structures encode "features" of words into the grammar, so that when our grammar finds a verb and a noun it can be sure that they agree in number and person.  That is, it makes sure we say "he is" and not *"he are", and "flights serve" and not *"flights serves".
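
Here's a small sketch of what that agreement check might look like, with feature structures as nested Python dicts.  The function name unify, the feature names ("agreement", "number", "person"), and the word entries are my own illustrative choices, and this version skips the shared (reentrant) structure that a full unification algorithm has to handle.

```python
def unify(a, b):
    """Return the unification of two feature structures, or None on a clash.

    Feature structures are nested dicts; atomic values are plain strings or
    ints. Assumes None is never used as a feature value.
    """
    if isinstance(a, dict) and isinstance(b, dict):
        result = dict(a)
        for key, value in b.items():
            if key in result:
                merged = unify(result[key], value)
                if merged is None:  # the two structures disagree on this feature
                    return None
                result[key] = merged
            else:
                result[key] = value  # only one side constrains this feature
        return result
    # Atomic values unify only if they are identical.
    return a if a == b else None


# Hypothetical lexical entries, just enough to show agreement.
he = {"agreement": {"number": "sg", "person": 3}}
verb_is = {"agreement": {"number": "sg", "person": 3}}
verb_are = {"agreement": {"number": "pl"}}
flights = {"agreement": {"number": "pl", "person": 3}}
serve = {"agreement": {"number": "pl"}}
serves = {"agreement": {"number": "sg", "person": 3}}
```

Here unify(he, verb_is) and unify(flights, serve) both succeed and return a merged structure, while unify(he, verb_are) and unify(flights, serves) return None because the number features clash.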

Essentially, feature structures go on top of our grammar and work into our parser, using the unification algorithm to just make everything better.  Woohoo!

Semantics

Semantics is a fancy word for meanings.  When we have the sentence "The door opened", what does that actually mean?  Semantics asks the computer to take the arbitrary symbols the words have been so far (that's right, all our parsing was just math - there was no meaning to the words) and create some representation of a world from them.

My previous work with semantics was designing a programming language, where we asked "Now, what does the execution of this command mean?  What does it do?".  In that context, the meaning of a command was often something like "allocate room for a variable" or "add the value in this location to the value in this location".  Real language gets a bit fancier, if you hadn't guessed.

The thing that really grabbed me with semantics was its use of First Order Logic and lambda calculus (I'm really keen on lambda calc), and of course the fact that we were asking formulas to represent reality.  Unification also comes back in some semantics work.
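
As a taste of how lambda calculus builds a logical formula out of words, here's a minimal sketch using Python lambdas for "The door opened".  All the names are my own, formulas are just strings, and treating "the" as a plain "there exists" is a big simplification (it ignores uniqueness, tense, and events, which the book handles far more carefully).

```python
# Nouns and verbs are one-place predicates: given a variable name,
# they return a formula (as a plain string) about that variable.
door = lambda x: f"Door({x})"
opened = lambda x: f"Opened({x})"


def the(noun):
    """Simplified determiner: lambda P. exists x. noun(x) & P(x)."""
    return lambda verb: f"exists x. {noun('x')} & {verb('x')}"


def sentence_meaning(determiner, noun, verb):
    """Compose meanings by function application, mirroring the parse tree."""
    noun_phrase = determiner(noun)  # "the door"
    return noun_phrase(verb)        # "the door" applied to "opened"


meaning = sentence_meaning(the, door, opened)
print(meaning)  # exists x. Door(x) & Opened(x)
```

The point of the design is that each word carries its own little function, and the grammar's tree structure tells us which function to apply to which.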

Word Sense

Interesting from a linguistics-nerd viewpoint was the book's discussion of word senses: the distinct meanings a word can have, and the relations those meanings have to each other.  This includes things like, how many different things does the lemma "bank" represent, or what words are related to "right" and in what ways?

I enjoyed reading the chapter, but I'm not sure I'd do research directly on the subject.  It seems more like a tool for other drives, such as...

Word Sense Disambiguation

When the word "bass" appears in a sentence, is that something to do with low-range sounds, or something to do with fishing?  Word sense disambiguation is the field of NLP that works on figuring this out.  Solutions come from things like looking at the words near the word in question.  For example, if the word "band" appears in the sentence, then we're probably talking about the music sense of the word, and likewise, if someone says "fishing net", we're talking about the fish sense.
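
The nearby-words idea can be sketched in a few lines.  The clue lists below are ones I made up by hand, and the function name disambiguate_bass is hypothetical; real systems learn these associations from data or dictionary glosses rather than hard-coding them.

```python
import re

# Hand-picked context clues for each sense of "bass" (an assumption, not data).
SENSE_CLUES = {
    "music": {"band", "guitar", "play", "player", "sound"},
    "fish": {"fishing", "net", "lake", "catch", "river"},
}


def disambiguate_bass(sentence):
    """Pick the sense whose clue words overlap most with the sentence.

    Returns None when no sense has any overlap or when the top senses tie.
    """
    words = set(re.findall(r"[a-z]+", sentence.lower()))
    scores = {sense: len(words & clues) for sense, clues in SENSE_CLUES.items()}
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    (best, best_score), (_, runner_up_score) = ranked[0], ranked[1]
    if best_score == 0 or best_score == runner_up_score:
        return None
    return best
```

For example, disambiguate_bass("The band needs a new bass player.") returns "music", while disambiguate_bass("He caught a bass with a fishing net.") returns "fish".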

Relations established in the word sense section are also useful for finding bridges to which meaning of the word we're talking about.  One paper notes that if an ambiguous word is used more than once in a document, it is typically used in only one sense.  That is, if a paper repeatedly mentions "bass", it's probably either a paper on fishing or a paper on music; it's very unlikely to be both.  The limitation to this, though, is that it seems to mostly hold only for very distinct meanings, like with bass.
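
One way this observation could be put to use (my own sketch, not the method from that paper): guess a sense for each occurrence separately, then overwrite every guess in the document with the majority vote.  The function name and the use of None for "no guess" are assumptions for this example.

```python
from collections import Counter


def one_sense_per_discourse(guesses):
    """Relabel every occurrence with the document's most common guess.

    `guesses` holds one sense label per occurrence of the ambiguous word,
    with None where the per-sentence guesser had no opinion. On a tie,
    the sense that was seen first wins.
    """
    votes = Counter(guess for guess in guesses if guess is not None)
    if not votes:
        return list(guesses)  # nothing to vote with; leave the guesses alone
    majority, _ = votes.most_common(1)[0]
    return [majority] * len(guesses)
```

So a document whose occurrences were guessed as ["fish", None, "fish", "music"] comes out as four "fish" labels, which fills in the unknown and corrects the one stray "music".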

Coreference Resolution

Usually a sentence doesn't stand on its own.  It refers to things throughout the discourse, and finding out which phrases refer to which entity is the job of coreference resolution (not to be confused with named entity recognition, which only picks out the names themselves).  This starts with obvious things like pronouns (what does my "its" refer to in the first sentence?) but also goes on to more complicated relationships, such as this example stolen from the text.

{Victoria Chen, Chief Financial Officer of Megabucks Banking Corp since 1994, her, the 37-year-old, the Denver-based financial-services company's president, she}

How's a computer going to recognize that all of those are in fact the same thing?  This was one of the first problems I ran into when I looked into what exactly it was that NLP did.

The Holy Grail: Machine Translation

It shouldn't be surprising that I'm interested in this.  My interest in foreign languages, and really just in language in general, makes machine translation shine.  It's also the easiest example to give someone who asks what a researcher in natural language processing actually does.

The main thing I gathered from Jurafsky and Martin is that machine translation is very, very hard.