In this paper a preliminary investigation of this problem is carried out, some different approaches are presented, and a brief example of the recognition of a simple geometrical shape is given.
2. The Theory of Formal Languages
In the theory of formal languages, a language is defined
as a set of strings: a string is a finite sequence of symbols
chosen from some finite vocabulary. In natural languages,
a string is a sentence, and the sentences are sequences of
words chosen from some vocabulary of possible words.
A grammar is defined as a four-tuple:
G = (N, T, P, S)
where N and T are the non-terminal and terminal vocabularies of G, P is the set of production rules, and S is the start symbol.
A formal language is thus defined by:
- A terminal vocabulary of symbols (the words of the natural language)
- A non-terminal vocabulary of symbols (the syntactic categories, e.g. noun, verb)
- A set of productions (the phrase structure rules of the language)
- The so-called start symbol
We start from the special non-terminal S, and S is replaced by the string on the right side of a chosen production rule. The process of rewriting a substring according to one of the production rules continues until the string consists only of terminal symbols, for example:
S → aS | bS | ε
where the symbol | indicates "or" and ε is the null string.
The succession of strings that results from the process is a derivation from the grammar; finding a derivation (a parse) for a given sentence (sequence) is called parsing.
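To make these notions concrete, the following minimal Python sketch represents the example grammar S → aS | bS | ε and produces one random derivation by rewriting the leftmost non-terminal at each step; the names N, T, P, START and derive are illustrative choices, not taken from the paper or any standard library.

import random

# Illustrative encoding of the four-tuple G = (N, T, P, S) for the
# example grammar S -> aS | bS | epsilon.
N = {"S"}                      # non-terminal vocabulary
T = {"a", "b"}                 # terminal vocabulary
P = {"S": ["aS", "bS", ""]}    # production rules ("" is the null string)
START = "S"

def derive(max_steps=20):
    """Produce one derivation: rewrite the leftmost non-terminal
    with a randomly chosen production until only terminals remain."""
    string = START
    history = [string]
    for _ in range(max_steps):
        # find the leftmost non-terminal, if any
        pos = next((i for i, s in enumerate(string) if s in N), None)
        if pos is None:
            break                          # only terminal symbols left
        rhs = random.choice(P[string[pos]])
        string = string[:pos] + rhs + string[pos + 1:]
        history.append(string)
    return history

if __name__ == "__main__":
    print(" => ".join(s if s else "ε" for s in derive()))
    # e.g. S => aS => abS => ab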
As to the latter approach, recall that in the theory of formal languages Chomsky divides languages into classes, thus forming a hierarchy of languages based upon different grammars:
- Unrestricted grammars (0-type grammars)
- Context-sensitive grammars (1-type grammars)
- Context-free grammars (2-type grammars)
- Regular grammars or finite-state grammars (3-type grammars)
The most general grammar is obviously the 0-type, which places no restrictions on the rewriting rules; for the other types, the restrictions increase progressively. Types 0 and 1 are able to describe natural languages, while the other two, much simpler to manage from a computational viewpoint, are better suited to limited domains and have hitherto been used for artificial languages.
In the 1-type grammar, the restriction on rewriting rules is that the right side of a rule must have at least as many symbols as the left side.
For the 2-type grammar, all rules must have just one non-terminal symbol on the left side.
For the 3-type grammar, the right side of every production rule has only one terminal symbol, or one terminal symbol followed by one non-terminal symbol.
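For instance, over the terminal vocabulary {a, b}, the productions S → aSb | ab satisfy the 2-type restriction (a single non-terminal on every left side) and generate the strings a^n b^n, while the productions S → aS | bS | a | b satisfy the stricter 3-type restriction (a terminal, or a terminal followed by a non-terminal, on every right side) and generate all non-empty strings over {a, b}; the first language cannot be generated by any 3-type grammar.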
The language classes as arranged by Chomsky require corresponding recognition devices (automata):
0-type languages: Turing machines
1-type languages: linear bounded automata
2-type languages: pushdown automata
3-type languages: finite-state automata.
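As a small illustration of the simplest of these devices, the following Python sketch implements a finite-state automaton accepting the regular (3-type) language a*b*; the state names and the transition table are illustrative choices.

# Finite-state automaton for the regular language a*b* over {a, b}.
TRANSITIONS = {
    ("q0", "a"): "q0",   # stay in q0 while reading a's
    ("q0", "b"): "q1",   # the first b moves to q1
    ("q1", "b"): "q1",   # stay in q1 while reading b's
}
ACCEPTING = {"q0", "q1"}

def accepts(string):
    """Return True if the automaton ends in an accepting state."""
    state = "q0"
    for symbol in string:
        key = (state, symbol)
        if key not in TRANSITIONS:
            return False     # no transition, e.g. an 'a' after a 'b'
        state = TRANSITIONS[key]
    return state in ACCEPTING

if __name__ == "__main__":
    print(accepts("aabbb"))  # True
    print(accepts("aba"))    # False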
Although many classes of patterns appear context-sensitive, context-sensitive grammars have rarely been used because of their complexity. Context-free programmed grammars have been used instead, probably due to their effectiveness in describing natural languages: they are able to capture much of both natural and artificial languages, but many problems have required extensions.
Context-free grammars cannot model all the characteristics of natural languages; one example is the conversion of sentences from active to passive voice. Chomsky (1957) developed the theory of transformational grammar, in which a sentence is derived as a deep structure, then modified by transformational rules, and finally converted into surface form by phonological rules. The deep structure is derived by a context-free grammar, which generates "proto" phrases: strings like "the bridge crosses the river" and "the river is crossed by the bridge" have the same deep structure, but a different surface structure.
In syntactic pattern recognition problems, it is often important to represent the two- or three-dimensional structure of the sentences in the language. Traditional context-free grammars generate only one-dimensional strings. Context-free graph grammars have been developed in order to construct a graph of terminal nodes instead of a string of terminal symbols.
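A rough sketch of the node-replacement idea behind such graph grammars is given below, assuming a single illustrative production that replaces a non-terminal node labelled S with two terminal nodes joined by an edge; real graph grammars also specify embedding rules for reconnecting the neighbours of the replaced node, which are only crudely approximated here.

import itertools

# A graph is a dict of node labels plus a list of labelled edges.
counter = itertools.count()

def new_node(label, nodes):
    n = next(counter)
    nodes[n] = label
    return n

def rewrite(nodes, edges):
    """Apply the production S -> corner --edge-- corner to every
    node still carrying the non-terminal label S."""
    for n, label in list(nodes.items()):
        if label == "S":
            a = new_node("corner", nodes)
            b = new_node("corner", nodes)
            edges.append((a, b, "edge"))
            # crude embedding: reattach old neighbours to node a
            for i, (u, v, lab) in enumerate(edges):
                if u == n:
                    edges[i] = (a, v, lab)
                elif v == n:
                    edges[i] = (u, a, lab)
            del nodes[n]

if __name__ == "__main__":
    nodes, edges = {}, []
    start = new_node("S", nodes)   # start graph: one non-terminal node
    rewrite(nodes, edges)
    print(nodes)   # {1: 'corner', 2: 'corner'}
    print(edges)   # [(1, 2, 'edge')]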
A statistical approach to language started as scholars realized the value of statistical methods in the recognition of speech. The hidden Markov model was first used by Markov himself for modelling the language of "Evgenij Onegin" (Markov, 1913). A grammar normally divides strings into just two classes: the grammatically correct ones and the others. In practice, however, ambiguity is very frequent, and it is even deliberately pursued, as sometimes happens in poetry. The simplest way of accounting for ambiguity is the use of a stochastic grammar.
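A minimal sketch of this idea, assuming an illustrative stochastic version of the earlier grammar S → aS | bS | ε, is the following: each production carries a probability, and the probability of a derivation is the product of the probabilities of the rules it uses.

# Illustrative stochastic grammar: probabilities attached to productions.
RULES = {
    "S": [("aS", 0.4), ("bS", 0.4), ("", 0.2)],   # S -> aS | bS | epsilon
}

def derivation_probability(rule_choices):
    """rule_choices is a list of (non-terminal, right-hand side) pairs."""
    prob = 1.0
    for lhs, rhs in rule_choices:
        prob *= dict(RULES[lhs])[rhs]
    return prob

if __name__ == "__main__":
    # Derivation of "ab": S => aS => abS => ab
    print(derivation_probability([("S", "aS"), ("S", "bS"), ("S", "")]))
    # 0.4 * 0.4 * 0.2 = 0.032 (up to floating-point rounding)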
3. Parsing Algorithms
In accordance with the needs of different grammars, a number of parsing algorithms are in use. Parsing procedures may be of the top-down or bottom-up type, depending on whether one starts from the initial symbol S and applies substitutions until only terminal symbols are present, so as to match the given sentence, or starts from the string itself and works backward to the start symbol S.
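As a small illustration of the top-down strategy, the following Python sketch is a recursive-descent recogniser for the context-free grammar S → aSb | ε, which generates a^n b^n; the grammar and the function names are illustrative choices, not the grammar used later in the paper.

# Top-down (recursive-descent) recogniser for S -> a S b | epsilon.
def parse_S(s, i=0):
    """Try to expand S starting at position i; return the position
    reached after the expansion."""
    if i < len(s) and s[i] == "a":          # try S -> a S b
        j = parse_S(s, i + 1)
        if j < len(s) and s[j] == "b":
            return j + 1
    return i                                 # fall back to S -> epsilon

def accepts(s):
    return parse_S(s) == len(s)

if __name__ == "__main__":
    print(accepts("aabb"))   # True
    print(accepts("aab"))    # False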
Tabular procedures are time-saving; however, they are not suitable for all grammar types.
As an example, the tabular Kasami method is used in the following; it is remarkable for its simplicity and works with context-free grammars previously reduced to Chomsky Normal Form (CNF).
Actually, for a grammar in Chomsky Normal Form (CNF) every production rule must have the form A → BC or A → a, where A, B and C are non-terminal symbols and a is a terminal symbol; rules not in this canonical form are first replaced by equivalent sets of rules through suitable substitutions. Once the grammar has been reduced to its CNF, the parse table can be assembled and membership of a given string in the language can be checked.
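A minimal sketch of the Kasami (Cocke-Younger-Kasami) tabular method, assuming an illustrative CNF grammar for a^n b^n rather than the grammar of the shape-recognition example, is the following: each cell of the table collects the non-terminals that derive a given substring, and the sentence belongs to the language if the start symbol S appears in the cell covering the whole string.

# Kasami (CYK) tabular parsing for an illustrative CNF grammar of a^n b^n:
#   S -> A T | A B,  T -> S B,  A -> a,  B -> b
TERMINAL_RULES = {"A": {"a"}, "B": {"b"}}
BINARY_RULES = {
    "S": {("A", "T"), ("A", "B")},
    "T": {("S", "B")},
}

def cyk(sentence):
    n = len(sentence)
    # table[i][j] = non-terminals deriving the substring of length j+1 at i
    table = [[set() for _ in range(n)] for _ in range(n)]
    for i, symbol in enumerate(sentence):
        table[i][0] = {lhs for lhs, terms in TERMINAL_RULES.items()
                       if symbol in terms}
    for length in range(2, n + 1):                 # span length
        for i in range(n - length + 1):            # span start
            for split in range(1, length):         # split point
                left = table[i][split - 1]
                right = table[i + split][length - split - 1]
                for lhs, rhss in BINARY_RULES.items():
                    if any((b, c) in rhss for b in left for c in right):
                        table[i][length - 1].add(lhs)
    return "S" in table[0][n - 1]

if __name__ == "__main__":
    print(cyk("aabb"))   # True
    print(cyk("aab"))    # False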