in a more sophisticated and reliable way. By integrating text analysis technology and fuzzy logic, the accuracy and completeness of the search engine are enhanced, so that it can supply high-quality information retrieval.
This paper is organized as follows. First, the information collection methods of No-Name are introduced. Second, the basics of information retrieval, covering both traditional and recent approaches, are reviewed briefly. Then previous work on thesaurus-based information retrieval, a new method that uses human-defined knowledge, is described. An implementation of this method on GIS thesauri is presented to show the information retrieval strategies of No-Name, the GIS intelligent search engine. After that, the unique structure of No-Name that supports multiple independent layers is shown and analyzed. Finally, concluding remarks are given.
SPIDERS AND ROBOTS: COLLECTING INFORMATION ON THE INTERNET
The Internet is an extremely vast source of information, and its contents are growing explosively. Today it has become extremely difficult for professionals to select data of the required quality and relevance in a reasonable amount of time. To save GIS users from this unorganized mass of information, No-Name employs spider technology and robot technology to search, analyze and collect GIS-related information on the Internet.
A spider is a program operated by a search engine that surfs the web automatically. As it visits each web site, it records the words on each page and notes each link to other sites. It then "clicks" on each link and goes off to read and record another web site. No-Name uses multiple threads to create spiders, so different spiders can search the web in parallel. The web sites of well-known enterprises, academic institutions, organizations and technology forums active in the GIS field are chosen as the starting points for spiders, since these sites contain plenty of links. When a spider finds more than one link at a web site, additional spiders are generated to follow each link. If a spider finds that a web site contains no GIS information, the spider terminates itself and the search stops along that path. Since spiders consume computing resources, the search engine has to control the number of spiders, or threads, to avoid resource exhaustion.
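Although the paper gives no implementation details, the spider behaviour described above can be sketched roughly in Python as follows. The queue-based thread pool, the simple is_gis_related test and the fixed thread limit are assumptions made only for this illustration, not details of No-Name.

    import queue
    import threading
    import urllib.request
    from html.parser import HTMLParser

    MAX_SPIDERS = 8                                   # assumed cap on concurrent spiders
    GIS_TERMS = {"gis", "geographic", "spatial", "cartography"}

    class LinkParser(HTMLParser):
        """Collects the href targets (links to other sites) found in a page."""
        def __init__(self):
            super().__init__()
            self.links = []
        def handle_starttag(self, tag, attrs):
            if tag == "a":
                self.links += [v for k, v in attrs if k == "href" and v]

    def is_gis_related(text):
        """Crude relevance test: the page must mention at least one GIS term."""
        lower = text.lower()
        return any(term in lower for term in GIS_TERMS)

    def spider(url_queue, visited, lock):
        while True:
            try:
                url = url_queue.get(timeout=5)
            except queue.Empty:
                return                                # no more work: the spider "kills itself"
            try:
                page = urllib.request.urlopen(url, timeout=10).read().decode("utf-8", "ignore")
            except OSError:
                continue
            if not is_gis_related(page):
                continue                              # no GIS content: stop following this path
            parser = LinkParser()                     # record the page, then follow its links
            parser.feed(page)
            with lock:
                for link in parser.links:
                    if link.startswith("http") and link not in visited:
                        visited.add(link)
                        url_queue.put(link)

    if __name__ == "__main__":
        seeds = ["http://www.example-gis-forum.org"]  # hypothetical start-off point
        url_queue, visited, lock = queue.Queue(), set(seeds), threading.Lock()
        for s in seeds:
            url_queue.put(s)
        threads = [threading.Thread(target=spider, args=(url_queue, visited, lock))
                   for _ in range(MAX_SPIDERS)]
        for t in threads:
            t.start()
        for t in threads:
            t.join()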
A robot is a program operated by a search engine that can address other search engines as if it were a user. The search engines currently available, both general-purpose and GIS-specific, can respond to a robot's query expressed in GIS terms. However, the results returned by current search engines are usually too rough to be used directly by No-Name. These rough results are refined by means of text analysis technology.
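A minimal sketch of the robot idea follows, assuming a hypothetical query endpoint and parameter name; the refinement step is only a placeholder for the text analysis described below.

    import urllib.parse
    import urllib.request

    def query_external_engine(term, endpoint="http://search.example.com/q"):
        """Address another search engine as a user would; the endpoint and
        parameter name are hypothetical stand-ins for a real engine."""
        url = endpoint + "?" + urllib.parse.urlencode({"query": term})
        with urllib.request.urlopen(url, timeout=10) as resp:
            return resp.read().decode("utf-8", "ignore")

    def refine(raw_results, required_terms=("gis", "spatial")):
        """Placeholder refinement: keep only result lines that mention GIS terms."""
        return [line.strip() for line in raw_results.splitlines()
                if any(t in line.lower() for t in required_terms)]

    # rough results from another engine are refined before being stored
    # results = refine(query_external_engine("digital elevation model"))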
Both spiders and robots use text analysis technology to analyze web pages and extract useful keywords from the texts and documents found at web sites. The concept of text analysis technology was first introduced by IBM in 1999 and was applied in the IBM Text Search Engine. Text analysis technology can be used to analyze the whole text of a web page. By recording the frequency with which keywords occur in the text and assigning extra weight when keywords occur in headings and titles, a search engine applying text analysis technology can extract the feature of a web site and save it to the database. This feature information is then used, together with term connection values, to rank the relevance of web sites to the query keywords. Two terms are considered related if the documents represented by one term are usually considered relevant to a query represented by the other term. The term connection value is a fuzzy-logic measure of the connection strength between terms. The concepts and methods used to determine term connection values are described in the following sections.
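As a rough sketch of the feature-extraction step just described (the boost factor for title and heading occurrences is an assumed value, not one given in the paper):

    import re
    from collections import Counter

    TITLE_WEIGHT = 3.0                     # assumed boost for keywords in titles/headings

    def extract_features(title, headings, body, vocabulary):
        """Weighted keyword frequencies forming the 'feature' of a web page.
        Body occurrences count once; title/heading occurrences are boosted."""
        def tokens(text):
            return [t for t in re.findall(r"[a-z]+", text.lower()) if t in vocabulary]

        body_counts = Counter(tokens(body))
        head_counts = Counter(tokens(title + " " + " ".join(headings)))
        return {term: body_counts[term] + TITLE_WEIGHT * head_counts[term]
                for term in vocabulary}

    # the resulting feature vector would be saved to the database and later
    # combined with term connection values when ranking pages against a query
    features = extract_features(
        title="GIS data formats",
        headings=["Raster and vector models"],
        body="A GIS stores spatial data as raster or vector layers ...",
        vocabulary={"gis", "raster", "vector", "projection"},
    )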
BASICS OF INFORMATION RETRIEVAL
The goal of an Information Retrieval (IR) system is to select, from a document database, the documents relevant to a given information need. The present information explosion increases the importance of this area: it is difficult to find relevant information in a huge information mass such as the Internet [2].
Traditional approaches to IR, which are popular in current search engines, use direct keyword matching between document and query representations in order to select relevant documents. The most critical shortcoming is the following: if a document is described by a keyword different from those given in a query, the document cannot be selected even though it may be highly relevant. This situation occurs often in practice because documents are written and sought by different persons [2].
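A toy illustration of this shortcoming (documents and query invented for the example):

    def keyword_match(query_terms, doc_terms):
        """Traditional retrieval: a document is selected only if it shares
        at least one keyword with the query."""
        return bool(set(query_terms) & set(doc_terms))

    docs = {
        "d1": {"text", "retrieval", "thesaurus"},
        "d2": {"document", "indexing", "thesaurus"},
    }
    query = {"information", "retrieval"}

    hits = [name for name, terms in docs.items() if keyword_match(query, terms)]
    # hits == ["d1"]; d2 is missed even though "document indexing" is closely
    # related to the query, because no keyword matches literally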
In recent work, there is common agreement that a more adequate relevance estimation should be based on inference rather than on direct keyword matching [4], [5], [6], [7]. That is, the relevance relationship between a document and a query should be inferred using available knowledge. This inference, however, cannot be performed with complete certainty as in classical logic, because of the uncertainty inherent in the concept of relevance: one often cannot determine with complete certainty whether a document is relevant or not. In IR, uncertainty is always associated with the inference process [2].
In order to deal with this uncertainty, probability theory has been a commonly used tool in IR [8], [9], [10], [11], [12]. Probabilistic models usually attempt to determine the relationship between a document and a query through a set of terms that are treated as features. Within the strict probabilistic framework, inferential approaches are often confined to using only statistical relations among terms. The main method used in probabilistic approaches to determine the degree of relevance among terms is to consider term co-occurrences in the document collection [13]: two terms that often co-occur are considered strongly related. The problem with this method is that relations obtained from statistics may be very different from the genuine relations: truly connected terms may be overlooked [14], whereas truly independent terms may be put in relation [15].
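A minimal sketch of co-occurrence-based term relatedness follows; the Dice coefficient used here is one conventional choice, not a formula taken from the cited work.

    from collections import Counter
    from itertools import combinations

    def cooccurrence_strength(documents):
        """Relate terms by how often they co-occur: Dice coefficient
        2*n(a,b) / (n(a) + n(b)) over per-document term sets."""
        term_counts, pair_counts = Counter(), Counter()
        for terms in documents:
            unique = sorted(set(terms))
            term_counts.update(unique)
            pair_counts.update(combinations(unique, 2))
        return {(a, b): 2 * n_ab / (term_counts[a] + term_counts[b])
                for (a, b), n_ab in pair_counts.items()}

    docs = [
        ["gis", "map", "projection"],
        ["gis", "map", "layer"],
        ["text", "retrieval"],         # "text retrieval" and "information retrieval"
        ["information", "retrieval"],  # rarely co-occur, so their true link is missed
    ]
    print(cooccurrence_strength(docs))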
A newer method, which uses human-defined knowledge such as a thesaurus to establish the relationships among terms, is now becoming popular in IR. With the recent development of large thesauri (for example, WordNet [3]), these relations have quite good coverage of application areas. A manual thesaurus is therefore a valuable source of knowledge for IR. However, because such thesauri lack strict quantitative values for the relations they encode, the quantitative values have to be determined through user relevance feedback or expert training.
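For illustration, a weighted thesaurus and a simple query-expansion step might look as follows; the relation strengths are hypothetical placeholders for values that would be obtained from relevance feedback or expert training.

    # Hypothetical weighted GIS thesaurus: strengths in [0, 1] stand in for the
    # quantitative values that would have to be learned from feedback or training.
    THESAURUS = {
        "gis": {"geographic information system": 0.9, "spatial analysis": 0.6},
        "dem": {"digital elevation model": 0.9, "terrain": 0.5},
    }

    def expand_query(terms, threshold=0.5):
        """Expand a query with related terms whose strength passes a threshold."""
        expanded = dict.fromkeys(terms, 1.0)          # original terms keep full weight
        for term in terms:
            for related, strength in THESAURUS.get(term, {}).items():
                if strength >= threshold:
                    expanded[related] = max(expanded.get(related, 0.0), strength)
        return expanded

    print(expand_query(["gis", "dem"]))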
PREVIOUS WORK ON THESAURUS-BASED
INFORMATION RETRIEVAL
Thesauri used in IR may be divided into two categories according to how they are constructed: automatically or manually. The former are usually based on statistics of word (co-)occurrences. While this kind of thesaurus may help users to some extent, its use in early systems shows that its impact on global effectiveness is limited [16]. The main reason is that real relations (e.g. synonymy) can hardly be identified statistically. In fact, words very similar in meaning tend to repel each other within continuous portions of text [14]. For example, "document retrieval", "text retrieval" and "information retrieval" are rarely used simultaneously.
Recent work pays more and more attention to manually constructed thesauri [17], [18]. Initiated by Rada [19], a