A natural language grammar specifies allowable sentence structures in terms of basic syntactic categories such as nouns and verbs, and allows us to determine the structure of a sentence. It is defined in a similar way to a grammar for a programming language, though it tends to be more complex, and the notations used are somewhat different. Because of the complexity of natural language, a given grammar is unlikely to cover all possible syntactically acceptable sentences.
[Note: In natural language processing we don't usually parse language in order to check that it is correct. We parse it in order to determine the structure and help work out the meaning. But most grammars are concerned only with the structure of ``correct'' English, as parsing gets much more complex if you allow bad English.]
A starting point for describing the structure of a natural language is to use a context free grammar (as often used to describe the syntax of programming languages). Suppose we want a grammar that will parse sentences like:
John ate the biscuit.
but we want to exclude incorrect sentences like:
Biscuit lion kissed.
A simple grammar that deals with this is the following:
1 sentence --> noun_phrase, verb_phrase.
2 noun_phrase --> proper_name.
3 noun_phrase --> determiner, noun.
4 verb_phrase --> verb, noun_phrase.
proper_name --> [mary].
proper_name --> [john].
noun --> [schizophrenic].
noun --> [biscuit].
verb --> [ate].
verb --> [kissed].
determiner --> [the].
The notation is similar to that sometimes used for grammars of programming languages. A sentence consists of a noun phrase and a verb phrase (rule 1). A noun phrase consists of either a proper name (rule 2) or a determiner (e.g., the, a) and a noun (rule 3). A verb phrase consists of a verb (e.g., ate) and a noun phrase (rule 4). The rules at the end are really like dictionary entries, which state the syntactic category of different words. Basic syntactic categories such as ``noun'' and ``verb'' are terminal symbols in the grammar, as they cannot be expanded into lower-level categories. (We put words in square brackets, in lower case, because that's the way it is generally done in Prolog's built-in grammar notation.)
If we consider the example sentences above, the sentence ``John ate the biscuit'' consists of a noun phrase ``John'' and a verb phrase ``ate the biscuit''. The noun phrase is just a proper noun, while the verb phrase consists of a verb ``ate'' and another noun phrase (``the biscuit''). This noun phrase consists of a determiner ``the'' and a noun ``biscuit''. The incorrect sentences will be excluded by the grammar. For example, ``biscuit lion kissed'' starts with two nouns, which is not allowed in the grammar. However, some odd sentences will be allowed, such as ``The biscuit kissed John''. This sentence is syntactically acceptable, just semantically odd, so should still be parsed.
For a given grammar we can illustrate the syntactic structure of the sentence by giving the parse tree, which shows how the sentence is broken down into different syntactic constituents. This kind of information may be useful for later semantic processing. Anyway, given the above grammar, the parse tree for ``John ate the lion'' would be:
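One way to see how such a tree is built is to sketch the grammar as a small recursive-descent parser. The Python below is our own illustration (it is not the Prolog DCG notation used in these notes); each function corresponds to one grammar rule and returns the syntactic constituent it recognises as a nested tuple, which is exactly the parse tree.

```python
# A minimal recursive-descent parser for the toy grammar above.
# Each parse_* function takes the word list and a position, and
# returns (tree, new_position), or None if the rule doesn't match.

PROPER_NAMES = {"mary", "john"}
NOUNS = {"schizophrenic", "biscuit", "lion"}
VERBS = {"ate", "kissed"}
DETERMINERS = {"the"}

def parse_sentence(words):
    # Rule 1: sentence --> noun_phrase, verb_phrase.
    result = parse_noun_phrase(words, 0)
    if result is None:
        return None
    np, pos = result
    result = parse_verb_phrase(words, pos)
    if result is None:
        return None
    vp, pos = result
    # The whole input must be consumed for a successful parse.
    if pos != len(words):
        return None
    return ("sentence", np, vp)

def parse_noun_phrase(words, pos):
    # Rule 2: noun_phrase --> proper_name.
    if pos < len(words) and words[pos] in PROPER_NAMES:
        return ("noun_phrase", ("proper_name", words[pos])), pos + 1
    # Rule 3: noun_phrase --> determiner, noun.
    if (pos + 1 < len(words) and words[pos] in DETERMINERS
            and words[pos + 1] in NOUNS):
        return ("noun_phrase", ("determiner", words[pos]),
                ("noun", words[pos + 1])), pos + 2
    return None

def parse_verb_phrase(words, pos):
    # Rule 4: verb_phrase --> verb, noun_phrase.
    if pos < len(words) and words[pos] in VERBS:
        result = parse_noun_phrase(words, pos + 1)
        if result is not None:
            np, new_pos = result
            return ("verb_phrase", ("verb", words[pos]), np), new_pos
    return None

print(parse_sentence("john ate the lion".split()))
print(parse_sentence("biscuit lion kissed".split()))
```

Running this, ``john ate the lion'' yields a nested tuple mirroring the parse tree, while ``biscuit lion kissed'' is rejected, just as the grammar predicts.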
Of course, the grammar given above is not really adequate to parse natural language properly. Consider the following two sentences:
If we have ``eat'' and ``eats'' categorised as verbs then, given the simple grammar above, the first sentence will be acceptable according to the grammar, while the second won't: we don't have any mention of adjectives in our grammar. To deal with the first problem we need some method of enforcing number/person agreement between subjects and verbs, so that things like ``I am ...'' and ``We are ...'' are accepted, but ``I are ...'' and ``We am ...'' are not. To deal with the second problem we need to add further rules to our grammar.
To enforce subject-verb agreement the simplest method is to add arguments to our grammar rules. If we're only concerned about singular vs plural nouns (and assume that we don't have any first or second person pronouns), we might get the rules and dictionary entries which include the following:
sentence --> noun_phrase(Num), verb_phrase(Num).
noun_phrase(Num) --> proper_name(Num).
noun_phrase(Num) --> determiner(Num), noun(Num).
verb_phrase(Num) --> verb(Num), noun_phrase(_).
proper_name(sing) --> [mary].
noun(sing) --> [lion].
noun(plur) --> [lions].
determiner(sing) --> [the].
determiner(plur) --> [the].
verb(sing) --> [eats].
verb(plur) --> [eat].
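The effect of the Num argument can be sketched in Python, again purely for illustration: each dictionary entry carries a number feature, and a sentence parses only if the subject's feature matches the verb's, while the object's feature is ignored (the ``_'' in the verb_phrase rule).

```python
# Sketch of number agreement via an extra feature, mirroring the
# Num argument in the DCG rules above (illustrative only).

LEXICON = {
    "mary":  ("proper_name", "sing"),
    "lion":  ("noun", "sing"),
    "lions": ("noun", "plur"),
    "eats":  ("verb", "sing"),
    "eat":   ("verb", "plur"),
}
DETERMINERS = {"the"}   # "the" works for both singular and plural

def category(word):
    return LEXICON.get(word, ("", ""))[0]

def parse_np(words, pos):
    """Return (number, new_pos) for a noun phrase, or None."""
    if pos < len(words) and category(words[pos]) == "proper_name":
        return LEXICON[words[pos]][1], pos + 1
    if (pos + 1 < len(words) and words[pos] in DETERMINERS
            and category(words[pos + 1]) == "noun"):
        return LEXICON[words[pos + 1]][1], pos + 2
    return None

def parse_sentence(words):
    np = parse_np(words, 0)
    if np is None:
        return False
    subj_num, pos = np
    if pos < len(words) and category(words[pos]) == "verb":
        verb_num = LEXICON[words[pos]][1]
        obj = parse_np(words, pos + 1)
        # Subject and verb must agree in number; the object's
        # number is unconstrained, like the "_" in the DCG rule.
        return (obj is not None and obj[1] == len(words)
                and subj_num == verb_num)
    return False

print(parse_sentence("the lion eats mary".split()))   # True
print(parse_sentence("the lions eats mary".split()))  # False
```

So ``the lions eat mary'' and ``the lion eats mary'' are accepted, but mixed combinations like ``the lions eats mary'' are rejected.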
Note that, strictly speaking, we no longer have a context-free grammar, having added these extra arguments to our rules.
In general, getting agreement right in a grammar is much more complex than this. We need both fairly complex rules and more information in dictionary entries. A good dictionary will not state everything explicitly, but will exploit general information about word structure, such as the fact that, given a verb such as ``eat'', the 3rd person singular form generally involves adding an ``s'': hit/hits, eat/eats, like/likes etc. Morphology is the area of natural language processing concerned with such things.
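As a tiny illustration of the kind of regularity a morphological component can exploit, here is a crude Python sketch of our own (the rule and the exceptions list are deliberately simplified; real morphology needs far more):

```python
# Crude sketch of regular 3rd-person-singular verb formation.
# A real morphological analyser needs many more rules and a much
# larger exceptions list.

IRREGULAR = {"be": "is", "have": "has"}

def third_singular(verb):
    if verb in IRREGULAR:
        return IRREGULAR[verb]
    # Verbs ending in a sibilant take "es": kiss -> kisses.
    if verb.endswith(("s", "x", "z", "ch", "sh")):
        return verb + "es"
    # The default case: just add "s", as in hit/hits, eat/eats.
    return verb + "s"

print(third_singular("eat"))   # eats
print(third_singular("kiss"))  # kisses
```

With a rule like this, the dictionary need only store the base form ``eat''; the ``eats'' entry in the grammar above can be derived rather than listed.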
To extend the grammar to allow adjectives we need to add an extra rule or two, e.g.,
noun_phrase(Num) --> determiner(Num), adjectives, noun(Num).
adjectives --> adjective, adjectives.
adjectives --> adjective.
adjective --> [ferocious].
adjective --> [ugly].
etc.
That is, noun phrases can consist of a determiner, some adjectives and a noun. Adjectives consist of an adjective and some more adjectives, OR just of an adjective. We can now parse the sentence ``the ferocious ugly lion eats Mary''.
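The noun phrase part of this extension can again be sketched in Python (our own illustration): after the determiner, any number of adjectives are consumed before the noun is required.

```python
# Sketch: a noun phrase may contain zero or more adjectives between
# the determiner and the noun (illustrative, not the DCG itself).

ADJECTIVES = {"ferocious", "ugly"}
NOUNS = {"lion", "biscuit"}
DETERMINERS = {"the"}

def parse_np(words, pos):
    """Return the position just after a noun phrase, or None."""
    if pos >= len(words) or words[pos] not in DETERMINERS:
        return None
    pos += 1
    # Consume any number of adjectives (the recursive
    # "adjectives" rules above).
    while pos < len(words) and words[pos] in ADJECTIVES:
        pos += 1
    if pos < len(words) and words[pos] in NOUNS:
        return pos + 1
    return None

print(parse_np("the ferocious ugly lion".split(), 0))  # 4
print(parse_np("the ugly ugly".split(), 0))            # None
```

Plain ``the lion'' still parses (the loop simply consumes no adjectives), so the original determiner-noun rule is subsumed.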
Another thing we may need to do to our grammar is extend it so we can distinguish between transitive verbs that take an object (e.g., likes) and intransitive verbs that don't (e.g., talks). (``Mary likes the lion'' is OK while ``Mary likes'' is not. ``Mary talks'' is OK while ``Mary talks the lion'' is not). This is left as an exercise for the reader.
Our grammar so far (if we put all the bits together) still only parses sentences of a very simple form. It certainly wouldn't parse the sentences I'm currently writing! We can try adding more and more rules to account for more and more of English - for example, we need rules that deal with prepositional phrases (e.g., ``Mary likes the lion with the long mane''), and relative clauses (e.g., ``The lion that ate Mary kissed John''). These are left as another exercise for the reader.
As we add more and more rules to allow more bits of English to be parsed then we may find that our basic grammar formalism becomes inadequate, and we need a more powerful one to allow us to concisely capture the rules of syntax. There are lots of different grammar formalisms that have been developed (e.g., unification grammar, categorial grammar), but we won't go into them.