Net. Programs with names like "gopher" and "Archie" kept
indexes of files stored on servers connected to the Internet, and
dramatically reduced the amount of time required to find
programs and documents. Early search engines held an index of
a few hundred thousand pages and documents and received
maybe one or two thousand inquiries each day. Today, a top
search engine will index hundreds of millions of pages and
respond to tens of millions of queries per day. There are
differences in the ways various search engines work, but they
all perform three basic tasks (Figure 9):
1. they search the Internet, or select pieces of the Internet, based
on keywords;
2. they keep an index of the words they find and where they are;
3. they allow users to look for words or combinations of words
found in that index.
To find information on the hundreds of millions of Web pages
that exist, a search engine employs special software robots,
called spiders or crawlers, to build lists of the words found on
Web sites. When a spider is building its lists, the process is
called Web crawling. In order to build and maintain a useful list
of words, a search engine's spiders have to look at a lot of pages.
When a typical search engine spider looks at an HTML page, it
takes note of the words within the page and also of where the
words are found. Words occurring in the title, subtitles and
other positions of relative importance are noted for special
consideration during a subsequent user search. Each spider
takes a different approach, but all of them are constantly
crawling because of the ever-changing nature of the Web.
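To make the mechanism concrete, the following sketch shows the
word-gathering step in Python. It is a minimal illustration, not the
method of any particular engine: the seed URL is a placeholder, and
the extra weight given to title words is an assumed value.

# A minimal sketch of Web crawling as described above: the "spider"
# fetches one page, records every word it finds, and gives extra
# weight to words in the <title>. No robots.txt handling or link
# following is shown; the seed URL and weights are illustrative.
from html.parser import HTMLParser
from urllib.request import urlopen

class WordSpider(HTMLParser):
    def __init__(self):
        super().__init__()
        self.in_title = False
        self.words = []          # (word, weight) pairs found on the page

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        weight = 3 if self.in_title else 1   # title words get special consideration
        for word in data.split():
            self.words.append((word.lower(), weight))

url = "http://example.com/"                  # hypothetical seed page
spider = WordSpider()
spider.feed(urlopen(url).read().decode("utf-8", errors="replace"))
print(spider.words[:10])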
Figure 9: The tasks performed by a good search engine: create a
list of all the words with their location; create an index of all the
words; store (and encode) the data for users' inquiries.
These different approaches usually attempt to make the spider
operate faster and to allow users to search more efficiently. Once the
spiders have completed the task of finding information on Web
pages, the search engine must store the information in a way
that makes it useful. Most search engines store more than just
the word and URL. An engine might store the number of times
that the word appears on a page and might assign a "weight" to
each entry. Each commercial search engine has a different
formula for assigning weight to the words in its index. This is
one of the reasons that a search for the same word on different
search engines will produce different lists, with the pages
presented in different orders.
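The following sketch illustrates how such an index might store each
word together with the pages on which it occurs and a simple weight
(here just the occurrence count; as noted above, each commercial
engine uses its own formula). The page contents are invented for
illustration.

# A sketch of a weighted inverted index: word -> {url: weight}.
# The weight here is simply how many times the word occurs on the
# page; real engines use more elaborate, proprietary formulas.
from collections import defaultdict

pages = {
    "http://example.com/a": "remote sensing and photogrammetry archives",
    "http://example.com/b": "remote sensing tutorials and remote sensing data",
}

index = defaultdict(dict)
for url, text in pages.items():
    for word in text.split():
        index[word][url] = index[word].get(url, 0) + 1

print(index["remote"])   # page b gets weight 2, page a gets weight 1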
Searching through an index involves a user building a query
and submitting it through the search engine. The query can be
quite simple, a single word at minimum. Building a more
complex query requires the use of Boolean operators that allow
you to refine and extend the terms of the search. Each query is
analysed and searched for in the database of the system. The
result of the search is a collection of URLs, each with an
associated score (determined by the number of times the search
criteria are found in each page), displayed in order from the
highest score to the lowest.
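A minimal sketch of this query step, assuming the toy index format
of the previous example: the terms are combined with an implicit
Boolean AND, each matching page is scored by the number of times the
terms occur, and the results are returned from the highest score to
the lowest.

# Evaluate a multi-word query against a small hand-built index.
index = {
    "remote":  {"http://example.com/a": 1, "http://example.com/b": 2},
    "sensing": {"http://example.com/a": 1, "http://example.com/b": 2},
}

def search(query):
    terms = query.lower().split()
    # implicit Boolean AND: keep only the pages containing every term
    matching = set(index.get(terms[0], {}))
    for term in terms[1:]:
        matching &= set(index.get(term, {}))
    # score each page by the total number of term occurrences
    scored = {url: sum(index[t][url] for t in terms) for url in matching}
    return sorted(scored.items(), key=lambda item: item[1], reverse=True)

print(search("remote sensing"))
# [('http://example.com/b', 4), ('http://example.com/a', 2)]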
Some of the most popular search engines are Google
(http://www.google.com), AltaVista (http://www.altavista.com),
Yahoo (http://www.yahoo.com), HotBot
(http://www.hotbot.com), Lycos
(http://www.lycos.com), Excite (http://www.excite.com) and
MSN (http://search.msn.com/). Some of these search engines
also present a main menu organised in directories that can help
users in their research. Elsevier Science has developed a
powerful Internet search tool called Scirus
(http://www.scirus.com). Scirus distinguishes itself from
existing search engines by concentrating on scientific content
only and by searching both Web and membership sources
(articles, presentations, reports). It enables scientists, students
and anyone searching for scientific information to locate
university sites and to find reports and articles in a clutter-free,
user-friendly and efficient manner. Furthermore, there are sites,
called meta-crawlers, that use several search engines at the
same time to answer a query, such as Mamma
(http://www.mamma.com), Metacrawler
(http://www.metacrawler.com) and Search Engine Guide
(http://www.searchengineguide.com/).
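The following sketch illustrates the merging idea behind such
meta-crawlers: ranked result lists coming from several engines are
fused into a single ordering. The per-engine results and the simple
rank-based scoring are invented for illustration; a real meta-crawler
would obtain the lists by querying the engines themselves.

# Merge ranked result lists from several engines into one ranking.
from collections import defaultdict

def merge_results(result_lists):
    # give each URL a score derived from its rank in every list
    scores = defaultdict(float)
    for results in result_lists:
        for rank, url in enumerate(results):
            scores[url] += 1.0 / (rank + 1)
    # return URLs ordered by combined score, best first
    return sorted(scores, key=scores.get, reverse=True)

engine_a = ["http://example.com/x", "http://example.com/y"]
engine_b = ["http://example.com/y", "http://example.com/z"]
print(merge_results([engine_a, engine_b]))   # 'y' ranks first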
A good search engine web site should answer queries quickly,
be user-friendly and frequently updated, offer advanced search
options and display the results clearly. A few years ago, the
search engines were not able to spider the whole net, and part
of it was hidden from Internet users. Today this "Invisible Net"
is very small, as the search engines are more powerful, their
databases are updated often and they can also recognise
non-text files such as PDF, PostScript and other formats.
2.9.2 Online Internet Directories
These are web pages where information is stored and
displayed to the users in thematic channels or categories. Each
link is listed with a short description and its URL, and these
lists can be updated automatically or manually. It is also
possible to search inside the web site as in a normal search
engine. Useful URLs are Galaxy (http://www.galaxy.com/), Yahoo
(http://www.yahoo.com), the WWW Virtual Library
(http://www.vlib.org/), the Educational Virtual Library
(http://www.csu.edu.au/education/library.html), the Earth
Science Portal (http://webserv.gsfc.nasa.gov/ESD/). The Earth
Science Portal is a searchable links directory together with a
web crawler search engine that spans all the web-based
information of the NASA Earth Sciences Directorate.
AllConferencesNet (http://www.allconferences.net/) instead
provides interesting links for all kinds of events. It is a directory
focusing on conferences, conventions, trade shows, exhibits,
workshops, events and business meetings, and it serves users
looking for specific conference or event information.
2.10 Educational resources on the Web
The chances of finding scientific articles, reports, journals or
entire books on the Web are very high. These electronic
documents contain nothing different from the text and pictures
of the paper version, except the occasional hyperlink. They are
quickly disseminated on the net and everybody can read them.
True e-zines or e-journals have no paper equivalent and are not
always free. A big problem with electronic documents is that
they are not permanent: they can be lost from the record, since
they are subject to changes of location and unpredictable
removal. Documents on paper or in electronic format
(CD-ROM), by contrast, are not transient and can remain
available and legible for many years, in particular the paper
ones, which do not require special equipment or knowledge to
be read. Therefore, to preserve Internet publications for a
longer period as well, timely and active intervention at all
stages is required.
stages is required. Educational resources on the web are without