Next: Returning the Parse Up: Syntax Previous: Grammars in Prolog

Parsers

Having a grammar isn't enough to parse natural language - you need a parser. The parser should search for possible ways the rules of the grammar can be used to parse the sentence - so parsing can be viewed as a kind of search problem. In general there may be many different rules that can be used to ``expand'' or rewrite a given syntactic category, and the parser must check through them all, to see if the sentence can be parsed using them. For example, in our mini-grammar above there were two rules for noun_phrases: a parse of the sentence may use either one or the other. In fact we can view the grammar as defining an AND-OR tree to search: alternative ways of expanding a node give OR branches, while rules with more than one syntactic category on the right hand side give AND branches.

So, to parse a sentence we need to search through all these possibilities, effectively going through all possible syntactic structures to find one that fits the sentence. There are good ways and bad ways of doing this, just as there are good and bad ways of parsing programming languages. One way is basically to do a depth first search through the parse tree. When you reach the first terminal node in the grammar (ie, a primitive syntactic category, such as noun) you check whether the first word of the sentence belongs to this category (e.g., is a noun). If it is, then you continue the parse with the rest of the sentence. If it isn't you backtrack and try alternative grammar rules.

As an example, suppose you were trying to parse ``John loves Mary'' given the following grammar:


sentence -->  noun_phrase, verb_pharse.
verb_phrase --> verb, noun_phrase.
noun_pharse --> det, noun.
noun_phrase --> p_name.

verb --> [loves].
p_name --> [john].
p_name --> [mary].

You might start off expanding ``sentence'' to a verb phrase and a noun phrase. Then the noun phrase would be expanded to give a determiner and a noun, using the third rule. A determiner is a primitive syntactic category (a terminal node in the grammar) so we check whether the first word (John) belongs to that category. It doesn't - John is a proper noun - so we backtrack and find another way of expanding ``noun_phrase'' and try the fourth rule. Now, as John is a proper name this will work OK, so we continue the parse with the rest of the sentence (``loves Mary''). We haven't yet expanded verb phrase, so we try to parse ``loves Mary'' as a verb phrase. This will eventually succeed, so the whole thing succeeds.

It may be clear by now that in Prolog the parsing mechanism is really just Prolog's built in search procedure. Prolog will just backtrack to explore the different possible syntactic structures. The extra arguments which Prolog (internally) adds to DCG rules allows it to parse a bit of the sentence using one rule, then the rest of the sentence using another.

Note that this kind of simple top down parser can be implemented reasonably easily in other languages, which don't support backtracking, by using an agenda-based search mechanism. An item in the agenda may be the remaining syntactic categories and bits of sentences left to parse, given this branch of the search tree (e.g., [[verb, noun_phrase], [loves, mary]]).

Simple parsers are often inefficient, because they don't keep a record of all the bits of the sentence that have been parsed. Using simple depth first search with backtracking can result in useful bits of parsing being undone on backtracking: if you have a rule ``a -> b, c, d'' and a rule ``a -> b, c, e'' then you may find the first part of the sentence consists of ``b'' and ``c'' constituents, go on to check if the rest is a ``d'' constituent, but when that fails a Prolog-like system will through away its conclusions about the first half when backtracking to try the second rule. It will then duplicate the parsing of the ``b'' and ``c'' bits. Anyway, better parsers try to avoid this. Two important kinds of parsers used in natural language are transition network parsers and chart parsers. A transition network parser makes use of a more flexible network representation for grammar rules to partially avoid this problem: the above rules would be represented something like:

The transition network parser would basically traverse this network, checking that words in the sentence match the syntactic categories on each arc.

A chart parser goes further in avoiding duplicating work by explicitly recording in a chart all the possible parses of each bit of the sentence.



Next: Returning the Parse Up: Syntax Previous: Grammars in Prolog


alison@
Fri Aug 19 10:42:17 BST 1994