Full text: Proceedings, XXth congress (Part 6)

  
International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences, Vol XXXV, Part B6. Istanbul 2004 
  
Net. Programs with names like "gopher" and "Archie" kept 
indexes of files stored on servers connected to the Internet, and 
dramatically reduced the amount of time required to find 
programs and documents. Early search engines held an index of 
a few hundred thousand pages and documents and received 
maybe one or two thousand inquiries each day. Today, a top 
search engine indexes hundreds of millions of pages and 
responds to tens of millions of queries per day. There are 
differences in the way various search engines work, but they 
all perform three basic tasks (Figure 9): 
1. they search the Internet, or selected pieces of it, based 
on keywords; 
2. they keep an index of the words they find and where they found them; 
3. they allow users to look for words or combinations of words 
found in that index. 
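The three tasks above can be sketched end to end in a few lines. This is a minimal illustration, not any engine's actual implementation; the URLs and page texts are invented stand-ins for crawled Web pages.

```python
# Hypothetical pages, standing in for documents a spider has fetched (task 1).
pages = {
    "http://example.org/a": "remote sensing archives",
    "http://example.org/b": "spatial information and remote data",
}

def build_index(pages):
    """Task 2: map each word to the set of URLs where it occurs."""
    index = {}
    for url, text in pages.items():
        for word in text.split():
            index.setdefault(word, set()).add(url)
    return index

def search(index, word):
    """Task 3: look a word up in the index."""
    return sorted(index.get(word, set()))

index = build_index(pages)
print(search(index, "remote"))  # both example pages contain "remote"
```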
To find information on the hundreds of millions of Web pages 
that exist, a search engine employs special software robots, 
called spiders or crawlers, to build lists of the words found on 
Web sites. When a spider is building its lists, the process is 
called Web crawling. In order to build and maintain a useful list 
of words, a search engine's spiders have to look at a lot of pages. 
When a typical search engine spider looks at an HTML page, it 
takes note of the words within the page and also of where the 
words were found. Words occurring in the title, subtitles and 
other positions of relative importance are noted for special 
consideration during a subsequent user search. Each spider 
takes a different approach, but all of them are always crawling, 
because of the constantly changing nature of the Web. 
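What a spider notes on a single HTML page can be sketched with the standard-library HTML parser: each word is recorded together with a flag for whether it appeared in a position of special importance (here, only the title is treated specially, as a simplifying assumption; the sample page is invented).

```python
from html.parser import HTMLParser

class WordSpider(HTMLParser):
    """Record every word on a page and whether it occurred in the title."""
    def __init__(self):
        super().__init__()
        self.in_title = False
        self.words = []  # list of (word, found_in_title) pairs

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        for word in data.split():
            self.words.append((word.lower(), self.in_title))

spider = WordSpider()
spider.feed("<html><head><title>Remote Sensing</title></head>"
            "<body>archives of remote sensing data</body></html>")
print(spider.words)  # title words are flagged True, body words False
```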
[Figure 9 diagram: create a list of all the words with their location -> create an index of all the words -> store (and encode) the data for the users' inquiries] 
Figure 9: The tasks performed by a good search engine. 
These different approaches usually attempt to make the spider 
operate faster and to allow users to search more efficiently. Once the 
spiders have completed the task of finding information on Web 
pages, the search engine must store the information in a way 
that makes it useful. Most search engines store more than just 
the word and URL. An engine might store the number of times 
that the word appears on a page and might assign a "weight" to 
each entry. Each commercial search engine has a different 
formula for assigning weight to the words in its index. This is 
one of the reasons that a search for the same word on different 
search engines will produce different lists, with the pages 
presented in different orders. 
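A per-word weighting of this kind can be sketched as follows. The weight formula here (occurrence count plus a fixed bonus for title words) is an illustrative assumption only; as noted above, every commercial engine uses its own formula, which is why the same query ranks pages differently on different engines.

```python
def add_page(index, url, body_words, title_words=(), title_bonus=2.0):
    """Add one page to a word -> {url: weight} index.
    Weight = occurrence count, plus an (assumed) bonus per title occurrence."""
    for word in body_words:
        index.setdefault(word, {}).setdefault(url, 0.0)
        index[word][url] += 1.0
    for word in title_words:
        index.setdefault(word, {}).setdefault(url, 0.0)
        index[word][url] += title_bonus
    return index

index = {}
add_page(index, "http://example.org/a",
         ["remote", "sensing", "remote"], title_words=["remote"])
add_page(index, "http://example.org/b", ["remote"])
print(index["remote"])  # same word, different weight on each page
```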
Searching through an index involves a user building a query 
and submitting it through the search engine. The query can be 
quite simple, a single word at minimum. Building a more 
complex query requires the use of Boolean operators that allow 
you to refine and extend the terms of the search. Each query is 
analysed and searched for in the system's database. The result 
of the search is a collection of URLs, each with an associated score 
(determined by the number of times the search criteria are found 
in each page), displayed in order from the highest score 
to the lowest. Some of the most popular search engines are 
Google (http://www.google.com), AltaVista 
(http://www.altavista.com), Yahoo (http://www.yahoo.com), 
HotBot (http://www.hotbot.com), Lycos 
(http://www.lycos.com), Excite (http://www.excite.com) and MSN 
(http://search.msn.com/). Some of these search engines 
also present a main menu organised into directories that can 
help a user in his search. Elsevier Science has developed a 
powerful Internet search tool called Scirus 
(http://www.scirus.com). Scirus distinguishes itself from 
existing search engines by concentrating on scientific content 
only and by searching both Web and membership sources 
(articles, presentations, reports). It enables scientists, students 
and anyone searching for scientific information to locate 
university sites and to find reports and articles in a clutter-free, 
user-friendly and efficient manner. Furthermore, there are sites, 
called metacrawlers, that submit a query to several search 
engines at the same time, such as Mamma 
(http://www.mamma.com), Metacrawler 
(http://www.metacrawler.com) and Search Engine Guide 
(http://www.searchengineguide.com/). 
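The query evaluation described above, with Boolean operators over a scored index and results ordered from the highest score to the lowest, can be sketched as follows. The index contents are the same invented example data as before, not output from any real engine.

```python
def query(index, words, mode="AND"):
    """Evaluate a simple Boolean query against a word -> {url: score}
    index; return URLs ranked from highest combined score to lowest."""
    postings = [index.get(w, {}) for w in words]
    if mode == "AND":
        urls = set.intersection(*(set(p) for p in postings)) if postings else set()
    else:  # "OR": any word may match
        urls = set().union(*(set(p) for p in postings))
    scored = {u: sum(p.get(u, 0.0) for p in postings) for u in urls}
    return sorted(scored, key=lambda u: -scored[u])

index = {
    "remote":  {"http://example.org/a": 4.0, "http://example.org/b": 1.0},
    "sensing": {"http://example.org/a": 1.0},
}
print(query(index, ["remote", "sensing"], mode="AND"))  # only page a has both
print(query(index, ["remote", "sensing"], mode="OR"))
```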
A good search engine site should answer queries quickly, be 
user-friendly, be updated often, offer advanced search options 
and display the results clearly. A few years ago, the search 
engines were not able to spider the whole Net, and part of it 
was hidden from Internet users. Today this "Invisible Net" is 
very small, as the search engines are more powerful, their 
databases are updated often and they can also recognise 
non-text files such as PDF, PostScript and other formats. 
2.9.2 Online Internet Directories 
These are Web pages where the information is stored and 
displayed to the users in thematic channels or categories. Each 
link is listed with a short description and its URL, and these lists 
can be updated automatically or manually. It is also possible to 
search inside the site as with a normal search engine. Useful 
URLs are Galaxy (http://www.galaxy.com/), Yahoo 
(http://www.yahoo.com), the WWW Virtual Library 
(http://www.vlib.org/), the Educational Virtual Library 
(http://www.csu.edu.au/education/library.html) and the Earth 
Science Portal (http://webserv.gsfc.nasa.gov/ESD/). The Earth 
Science Portal is a searchable links directory together with a 
web crawler search engine that spans all the web-based 
information of the NASA Earth Sciences Directorate. 
AllConferencesNet (http://www.allconferences.net/) instead 
provides links for all kinds of events: it is a directory 
focusing on conferences, conventions, trade shows, exhibits, 
workshops, events and business meetings, serving users 
looking for specific conference or event information. 
2.10 Educational resources on the Web 
Scientific articles, reports, journals and even entire books can 
readily be found on the Web. These electronic documents 
contain nothing different from the text and pictures of the 
paper version, except the occasional hyperlink. They are 
quickly disseminated on the Net and everybody can read them. 
True e-zines and e-journals have no paper equivalent and are 
not always free. A big problem with electronic documents is 
that they are not permanent and can be lost from the permanent 
record, as they are subject to changes of location and 
unpredictable removal. Documents on paper or in a fixed 
electronic format (CD-ROM), by contrast, are not transient and 
can remain available and legible for many years, particularly 
paper documents, which do not require special equipment or 
knowledge to be read. Therefore, to preserve Internet 
publications over a longer period as well, timely and active 
intervention at all stages is required. 
Thank you.