ISPRS, Vol.34, Part 2W2, “Dynamic and Multi-Dimensional GIS”, Bangkok, May 23-25, 2001
great deal of effort has been spent in defining IR-suited
semantic networks based on manually constructed thesauri
[20], [21], [22], [23]. A metric over a semantic network is
determined by measuring the similarity between two terms,
mainly according to the topology of the thesaurus (the
number and length of links). Two problems may occur in these
systems. First, an estimate of the strength of term
connections that is based heavily (if not only) on the
thesaurus topology may fail to reflect the real strength of the
connections. This strength also depends on the nature of the
relations between the terms, which affects their relevance to a
given application area. Second, the metrics used to measure term
connection are often symmetric: for a metric m, we have
m(a,b) = m(b,a) for any pair of terms a and b. This property is
counterintuitive. For example, a document about
object-oriented languages should be more relevant to a query
on programming languages than a document about programming
languages is to a query on object-oriented languages.
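The symmetry problem can be illustrated with a minimal sketch. The toy thesaurus, the function names, and the similarity formula 1/(1 + path length) below are assumptions made for illustration only; they are not taken from the systems cited above.

```python
from collections import deque

# Hypothetical toy thesaurus: undirected links between terms.
LINKS = [
    ("programming languages", "object-oriented languages"),
    ("programming languages", "functional languages"),
    ("object-oriented languages", "Java"),
]

def build_graph(links):
    """Build an undirected adjacency map from a list of term pairs."""
    graph = {}
    for a, b in links:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph

def path_length(graph, start, goal):
    """Breadth-first search: number of links on the shortest path, or None."""
    if start == goal:
        return 0
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        term, dist = queue.popleft()
        for nxt in graph.get(term, ()):
            if nxt == goal:
                return dist + 1
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None

def similarity(graph, a, b):
    """Assumed topology-only metric: 1 / (1 + shortest path length)."""
    d = path_length(graph, a, b)
    return 0.0 if d is None else 1.0 / (1.0 + d)

GRAPH = build_graph(LINKS)
```

Because the metric depends only on path length, similarity(GRAPH, a, b) equals similarity(GRAPH, b, a) for every pair of terms, so it cannot express that a document about a narrower term is more relevant to a query on the broader term than the reverse.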
A new method was recently proposed to revise the strength of
term connections according to users' feedback using fuzzy
logic [2], [24]. The core idea is as follows.
One term is assumed to be relevant to another if a document
represented by the first term is relevant to a query
represented by the second term alone. The term relevance
can be represented with a fuzzy implication relation such as
$a \Rightarrow_{\beta} b$, where $\beta \in [0,1]$. In this way, the entire
thesaurus may be represented as a set of fuzzy term
relevance relations:
$\{\, a \Rightarrow_{\beta} b \,\}$
The key problem lies in estimating the term relevance
strength $\beta$, given a thesaurus relation between two terms,
from the users' judgment of whether the documents represented
by a are related to a query represented by b only.
The principle is as follows. The system gives a tentative
query evaluation and provides an answer (an ordered set of
documents). The user is then asked to give his or her own
relevance evaluation of the retrieved documents. The system
uses this evaluation to revise the strengths of the term
relevance relations so that they better fit the user's evaluation.
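The feedback loop described above can be sketched as follows. The source does not give the revision rule, so the update used here (moving the current strength a fixed fraction of the way toward the user's judgment) is purely an assumption, as are the function names and the rate parameter.

```python
def revise_strength(beta, judged_relevance, rate=0.2):
    """Move beta toward the user's judgment (assumed rule, not from the source).

    beta and judged_relevance are both expected to lie in [0, 1].
    """
    if not 0.0 <= judged_relevance <= 1.0:
        raise ValueError("judged_relevance must lie in [0, 1]")
    updated = beta + rate * (judged_relevance - beta)
    # Clamp to keep the strength a valid fuzzy degree.
    return min(1.0, max(0.0, updated))

def apply_feedback(relations, feedback, rate=0.2):
    """Return a revised copy of the relation strengths.

    relations: dict mapping (document_term, query_term) -> beta
    feedback:  list of ((document_term, query_term), judged_relevance)
    Pairs without a prior strength start from 0.0 (an assumption).
    """
    revised = dict(relations)
    for pair, judged in feedback:
        prior = revised.get(pair, 0.0)
        revised[pair] = revise_strength(prior, judged, rate)
    return revised
```

For instance, with a prior strength of 0.5 for the pair ("object-oriented languages", "programming languages") and a single fully positive judgment, this assumed rule raises the strength to about 0.6; the original input mapping is left unchanged.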
This new method is sound in concept but hard to realize
on a large thesaurus like WordNet, which contains about
words and phrases. Determining the strength of each term
connection (in other words, of each link in a semantic network
with a very large number of nodes) would require an impractically
large amount of user feedback, which rapidly increases cost and time.
Moreover, biased user feedback (for example, feedback that
concentrates heavily on some nodes but is scarce on others)
will degrade the final results. In practice, the full
thesaurus is divided into a few groups, and groups of thesaurus
relations are adopted instead of individual relations among
terms [2]. However, the results inevitably suffer,
since the group classification is very coarse and cannot
represent the true relations among terms.
Although this method of revising the strength of term
connections according to users' feedback using fuzzy logic is
hard or impossible to realize on a large thesaurus, it is
quite feasible on a small one. The
following section describes how this method is improved and
realized on a GIS thesaurus, which contains about 2000 terms.
INFORMATION RETRIEVAL BASED ON GIS THESAURUS
thesaurus have to be updated frequently; even the structure of
the GIS thesaurus might need to be renewed within a few years.
To enhance the reliability and provide an adequate service,
any update on the GIS thesaurus should be approved by the
GIS expert board.
The GIS thesaurus prototype adopted in No-Name, the GIS
intelligent search engine, is composed of about 2000 GIS
terms divided into 16 categories. Each category
contains from a few tens to a few hundred terms. Figure 1
shows the structure of the GIS thesaurus prototype.
Figure 1: GIS thesaurus structure
The number of possible ordered pairs of 2000 terms is as large as
2000 * 2000 = 4,000,000. Note that $\beta(a,b)$ is different from
$\beta(b,a)$, where a and b represent any two keywords; $\beta(a,b)$
stands for the term connection value from a to b. Two assumptions
are made to simplify the computation of term connection
strength.
Assumption 1: $\beta(A_0, A_i) = 0$; $\beta(A_i, A_{i,j}) = 0$
where $A_0$ stands for “GIS technology”, $A_i$ (i = 1, ..., 16) stands for
the categories like “Data Collection” and “Data organization”
(note that the categories are also terms), and $A_{i,j}$ (j = 1, ...) stands
for the terms under category i.
Assumption 2: $\beta(A_i, A_j) = 0$ ($i \neq j$); $\beta(A_i, A_j) = 1$ ($i = j$);
$\beta(A_{i,j}, A_{p,q}) = 0$ ($i \neq p$ or $j \neq q$);
$\beta(A_{i,j}, A_{p,q}) = 1$ ($i = p$ and $j = q$).
The first assumption means that a document represented only
by the term “Data Collection” has no relation to a query
represented only by “GPS”, whereas a document represented only
by the term “GPS” relates to some degree to a query represented
only by “Data Collection”. The second assumption means
that a document represented only by the term “GPS” has no
relation to a query represented only by “Digitizing” or “Vector”,
and vice versa.
The number of term connections with unknown values is
reduced from 4,000,000 to around 2,000 under these two
assumptions. It is now a manageable task for the expert board
to set a prior value for each term connection, and it becomes
feasible to modify these prior values according to
users' feedback.
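The reduction can be checked on a toy tree with a short sketch. The sketch reads the two assumptions as leaving only the child-to-parent connections unknown; pairs the assumptions do not mention explicitly (for example, root to leaf term) are treated as 0, which is this sketch's interpretation rather than something stated in the text. The placement of “Digitizing” and “Vector” in the toy tree and all function names are likewise hypothetical.

```python
ROOT = "GIS technology"

# Hypothetical miniature thesaurus: category -> terms under it.
CATEGORIES = {
    "Data Collection": ["GPS", "Digitizing"],
    "Data organization": ["Vector"],
}

def make_parent_map(root, categories):
    """Map every non-root term to its parent in the tree."""
    parent = {}
    for category, terms in categories.items():
        parent[category] = root
        for term in terms:
            parent[term] = category
    return parent

def beta(parent, a, b):
    """Connection value from a to b under the sketch's reading.

    Returns 1.0 for identical terms, None for a child-to-parent
    connection (unknown, to be set by experts), and 0.0 otherwise.
    """
    if a == b:
        return 1.0
    if parent.get(a) == b:
        return None
    return 0.0

def count_unknown(parent, root):
    """Count ordered pairs whose connection value is still unknown."""
    terms = [root] + list(parent)
    return sum(
        1 for a in terms for b in terms if beta(parent, a, b) is None
    )

PARENT = make_parent_map(ROOT, CATEGORIES)
```

In this reading the toy tree has 6 terms and 36 ordered pairs, of which only 5 remain unknown, one per non-root term. The same count on a tree of about 2000 terms gives about 2000 unknown values, in line with the figure quoted above.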
The tree-like GIS thesaurus is constructed by adding a few
hundred GIS terms into an online GIS dictionary [25]. With the
fast advancement of GIS technology, the GIS terms of the GIS