in a more sophisticated and reliable way. By integrating text analysis technology and fuzzy logic, the accuracy and completeness of the search engine are enhanced, so that it can supply high-quality information retrieval.
This paper is organized as follows. First, the information collection methods of No-Name are introduced. Second, the basics of information retrieval, covering both traditional and recent approaches, are reviewed briefly. Then previous work on thesaurus-based information retrieval, a new method that uses human-defined knowledge, is described. An implementation of this method on GIS thesauri is presented to show the information retrieval strategies of No-Name, the GIS intelligent search engine. After that, the unique structure of No-Name that supports multiple independent layers is shown and analyzed. Finally, concluding remarks are given.
SPIDERS AND ROBOTS: COLLECTING INFORMATION ON THE INTERNET
The Internet is an extremely vast source of information, and its contents are growing explosively. Today it has become extremely difficult for professionals to select data of the required quality and relevance in a reasonable amount of time. To save GIS users from this unorganized mass of information, No-Name employs spider technology and robot technology to search, analyze and collect GIS-related information on the Internet.
A spider is a program operated by a search engine that surfs the web automatically. As it visits each web site, it records the words on each page and notes each link to other sites. It then "clicks" on each link and goes off to read and record another web site. No-Name uses multiple threads to create spiders, so different spiders can search the web in parallel. The web sites of well-known enterprises, academic institutions, organizations and technology forums active in the GIS field are chosen as the starting points for spiders, since these sites contain plenty of links. When a spider finds more than one link at a web site, additional spiders are generated to follow each link. If a spider finds that a web site contains no GIS information, the spider terminates itself and the search stops along that path. Since spiders consume computing resources, the search engine has to control the number of spiders, or threads, to avoid resource exhaustion.
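Although the paper gives no implementation details, the spider behaviour described above can be sketched roughly in Python as follows. The queue-based thread pool, the simple is_gis_related test and the fixed thread limit are assumptions made only for this illustration, not details of No-Name.

    import queue
    import threading
    import urllib.request
    from html.parser import HTMLParser

    MAX_SPIDERS = 8                                   # assumed cap on concurrent spiders
    GIS_TERMS = {"gis", "geographic", "spatial", "cartography"}

    class LinkParser(HTMLParser):
        """Collects the href targets (links to other sites) found in a page."""
        def __init__(self):
            super().__init__()
            self.links = []
        def handle_starttag(self, tag, attrs):
            if tag == "a":
                self.links += [v for k, v in attrs if k == "href" and v]

    def is_gis_related(text):
        """Crude relevance test: the page must mention at least one GIS term."""
        lower = text.lower()
        return any(term in lower for term in GIS_TERMS)

    def spider(url_queue, visited, lock):
        while True:
            try:
                url = url_queue.get(timeout=5)
            except queue.Empty:
                return                                # no more work: the spider "kills itself"
            try:
                page = urllib.request.urlopen(url, timeout=10).read().decode("utf-8", "ignore")
            except OSError:
                continue
            if not is_gis_related(page):
                continue                              # no GIS content: stop following this path
            parser = LinkParser()                     # record the page, then follow its links
            parser.feed(page)
            with lock:
                for link in parser.links:
                    if link.startswith("http") and link not in visited:
                        visited.add(link)
                        url_queue.put(link)

    if __name__ == "__main__":
        seeds = ["http://www.example-gis-forum.org"]  # hypothetical start-off point
        url_queue, visited, lock = queue.Queue(), set(seeds), threading.Lock()
        for s in seeds:
            url_queue.put(s)
        threads = [threading.Thread(target=spider, args=(url_queue, visited, lock))
                   for _ in range(MAX_SPIDERS)]
        for t in threads:
            t.start()
        for t in threads:
            t.join()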
A robot is a program operated by a search engine that can address other search engines as if it were a user. The search engines currently available, both general-purpose and GIS-specific, can respond to a robot's query expressed in GIS terms. However, the results returned by current search engines are usually too rough to be used directly by No-Name. These rough results are refined by means of text analysis technology.
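A minimal sketch of the robot idea follows, assuming a hypothetical query endpoint and parameter name; the refinement step is only a placeholder for the text analysis described below.

    import urllib.parse
    import urllib.request

    def query_external_engine(term, endpoint="http://search.example.com/q"):
        """Address another search engine as a user would; the endpoint and
        parameter name are hypothetical stand-ins for a real engine."""
        url = endpoint + "?" + urllib.parse.urlencode({"query": term})
        with urllib.request.urlopen(url, timeout=10) as resp:
            return resp.read().decode("utf-8", "ignore")

    def refine(raw_results, required_terms=("gis", "spatial")):
        """Placeholder refinement: keep only result lines that mention GIS terms."""
        return [line.strip() for line in raw_results.splitlines()
                if any(t in line.lower() for t in required_terms)]

    # rough results from another engine are refined before being stored
    # results = refine(query_external_engine("digital elevation model"))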
Both spiders and robots use text analysis technology to analyze web pages and extract useful keywords from the texts and documents found at web sites. The concept of text analysis technology was first introduced by IBM in 1999 and was applied in the IBM Text Search Engine. Text analysis technology can be used to analyze the whole text of a web page. By recording the frequency with which keywords occur in the text and assigning extra weight when keywords occur in headings and titles, a search engine applying text analysis technology can extract the feature of a web site and save it to the database. This feature information is then used, together with term connection values, to rank the relevance of web sites to the query keywords. Two terms are considered related if the documents represented by one term are usually considered relevant to a query represented by the other term. The term connection value is a fuzzy-logic measure of the connection strength between terms. The concepts and methods used to determine term connection values are described in the following sections.
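As a rough sketch of the feature-extraction step just described (the boost factor for title and heading occurrences is an assumed value, not one given in the paper):

    import re
    from collections import Counter

    TITLE_WEIGHT = 3.0                     # assumed boost for keywords in titles/headings

    def extract_features(title, headings, body, vocabulary):
        """Weighted keyword frequencies forming the 'feature' of a web page.
        Body occurrences count once; title/heading occurrences are boosted."""
        def tokens(text):
            return [t for t in re.findall(r"[a-z]+", text.lower()) if t in vocabulary]

        body_counts = Counter(tokens(body))
        head_counts = Counter(tokens(title + " " + " ".join(headings)))
        return {term: body_counts[term] + TITLE_WEIGHT * head_counts[term]
                for term in vocabulary}

    # the resulting feature vector would be saved to the database and later
    # combined with term connection values when ranking pages against a query
    features = extract_features(
        title="GIS data formats",
        headings=["Raster and vector models"],
        body="A GIS stores spatial data as raster or vector layers ...",
        vocabulary={"gis", "raster", "vector", "projection"},
    )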
BASICS OF INFORMATION RETRIEVAL
The goal of an Information Retrieval (IR) system is to select, from a document database, the documents relevant to a given information need. The present information explosion increases the importance of this area: it is difficult to find relevant information in a huge information mass such as the Internet [2].
Traditional approaches to IR, which are popular in current search engines, use direct keyword matching between document and query representations in order to select relevant documents. The most critical shortcoming is the following: if a document is described by a keyword different from those given in a query, the document cannot be selected even though it may be highly relevant. This situation occurs often in practice because documents are written and sought by different persons [2].
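A toy illustration of this shortcoming (documents and query invented for the example):

    def keyword_match(query_terms, doc_terms):
        """Traditional retrieval: a document is selected only if it shares
        at least one keyword with the query."""
        return bool(set(query_terms) & set(doc_terms))

    docs = {
        "d1": {"text", "retrieval", "thesaurus"},
        "d2": {"document", "indexing", "thesaurus"},
    }
    query = {"information", "retrieval"}

    hits = [name for name, terms in docs.items() if keyword_match(query, terms)]
    # hits == ["d1"]; d2 is missed even though "document indexing" is closely
    # related to the query, because no keyword matches literally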
In recent work, there is common agreement that a more adequate relevance estimation should be based on inference rather than on direct keyword matching [4], [5], [6], [7]. That is, the relevance relationship between a document and a query should be inferred using available knowledge. This inference, however, cannot be performed with complete certainty as in classical logic, because of the uncertainty inherent in the concept of relevance: one often cannot determine with complete certainty whether a document is relevant or not. In IR, uncertainty is always associated with the inference process [2].
In order to deal with this uncertainty, probability theory has been a commonly used tool in IR [8], [9], [10], [11], [12]. Probabilistic models usually attempt to determine the relationship between a document and a query through a set of terms that are treated as features. Within the strict probabilistic framework, inferential approaches are often confined to using only statistical relations among terms. The main method used in probabilistic approaches to determine the degree of relevance among terms is to consider term co-occurrences in the document collection [13]: two terms that often co-occur are considered strongly related. The problem with this method is that relations obtained from statistics may be very different from the genuine relations: truly connected terms may be overlooked [14], whereas truly independent terms may be put in relation [15].
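A minimal sketch of co-occurrence-based term relatedness follows; the Dice coefficient used here is one conventional choice, not a formula taken from the cited work.

    from collections import Counter
    from itertools import combinations

    def cooccurrence_strength(documents):
        """Relate terms by how often they co-occur: Dice coefficient
        2*n(a,b) / (n(a) + n(b)) over per-document term sets."""
        term_counts, pair_counts = Counter(), Counter()
        for terms in documents:
            unique = sorted(set(terms))
            term_counts.update(unique)
            pair_counts.update(combinations(unique, 2))
        return {(a, b): 2 * n_ab / (term_counts[a] + term_counts[b])
                for (a, b), n_ab in pair_counts.items()}

    docs = [
        ["gis", "map", "projection"],
        ["gis", "map", "layer"],
        ["text", "retrieval"],         # "text retrieval" and "information retrieval"
        ["information", "retrieval"],  # rarely co-occur, so their true link is missed
    ]
    print(cooccurrence_strength(docs))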
A newer method, which uses human-defined knowledge such as a thesaurus to establish the relationships among terms, is now becoming popular in IR. With the recent development of large thesauri (for example, WordNet [3]), these relations have quite good coverage of application areas. A manual thesaurus is therefore a valuable source of knowledge for IR. However, because such thesauri lack strict quantitative values for the relations they encode, the quantitative values have to be determined through user relevance feedback or expert training.
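For illustration, a weighted thesaurus and a simple query-expansion step might look as follows; the relation strengths are hypothetical placeholders for values that would be obtained from relevance feedback or expert training.

    # Hypothetical weighted GIS thesaurus: strengths in [0, 1] stand in for the
    # quantitative values that would have to be learned from feedback or training.
    THESAURUS = {
        "gis": {"geographic information system": 0.9, "spatial analysis": 0.6},
        "dem": {"digital elevation model": 0.9, "terrain": 0.5},
    }

    def expand_query(terms, threshold=0.5):
        """Expand a query with related terms whose strength passes a threshold."""
        expanded = dict.fromkeys(terms, 1.0)          # original terms keep full weight
        for term in terms:
            for related, strength in THESAURUS.get(term, {}).items():
                if strength >= threshold:
                    expanded[related] = max(expanded.get(related, 0.0), strength)
        return expanded

    print(expand_query(["gis", "dem"]))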
PREVIOUS WORK ON THESAURUS-BASED
INFORMATION RETRIEVAL
Thesauri used in IR may be divided into two categories according to how they are constructed: automatically or manually. The former are usually based on statistics of word (co-)occurrences. While this kind of thesaurus may help users to some extent, its use in early systems shows that its impact on global effectiveness is limited [16]. The main reason is that real relations (e.g. synonymy) can hardly be identified statistically. In fact, words very similar in meaning tend to repel each other within continuous portions of text [14]. For example, "document retrieval", "text retrieval" and "information retrieval" are rarely used simultaneously.
Recent work pays more and more attention to manually constructed thesauri [17], [18]. Initiated by Rada [19], a