Full text: Proceedings of the Symposium on Global and Environmental Monitoring (Pt. 1)

was the hardware to be used. Much of the 
work on the EPA GCRP will occur at the EPA 
research laboratories, each of which has a Digital 
Equipment Corporation (DEC) VAX computer 
connected to a private EPA network. Since this 
hardware and network are already in place and 
operational, it was decided that the DSI would 
have to operate in this environment. 
The design objectives are summarized as 
follows: 
1. The DSI for the GCRP would consist 
primarily of textual information that would be 
broadly categorized to assist in both the input 
and retrieval of information. 
2. The DSI would require only minimal 
information for data entry while allowing the 
entry of extensive amounts of information if 
necessary. 
3. The person entering or retrieving information 
would not be required to look through lists of 
keywords to "fit" information into the system or to 
retrieve information from the system. 
4. The hardware platform for the application 
would be a DEC VAX model computer running 
VMS as the operating system. 
DEVELOPMENT 
Development started by trying to categorize all 
the information into a fixed field system with 
many fields coded for quick and consistent 
retrieval. This worked well for items such as 
data set name, location, contact, and phone 
numbers, but did not work at all for text 
information such as data set uses, quality, spatial 
extent, and descriptions of the data set and its 
parameters. Fixed fields did not suit the textual 
information because the length of these fields 
varies and the ability to search the text within 
them was slow or missing. With the exception of 
the temporal data, 
all the DSI information we wished to store was 
textual. This indicated the need for a text data 
base management system (DBMS). 
By definition, a text DBMS is one with the ability 
to store and retrieve unstructured textual data. 
A text DBMS depends on the user’s knowledge 
of the English language and on the structure of 
the retrieval request to formulate a useful query 
of the database. In practice this often results in 
retrieving more information than was required or 
expected, which requires the user to rephrase the 
query or sift through the data. The alternative is 
to construct a "keyword" vocabulary that would 
have to be learned by users for both data entry 
and data retrieval. Although knowledge of a 
"keyword" vocabulary produces more accurate 
and complete retrievals from a textual database, 
it puts limitations on both the data entry and 
data retrieval interfaces to the database. Since 
the majority of the DSI data was textual, and to 
maximize user flexibility for data retrieval and 
data storage, we decided to use a text DBMS. 
The selection of the text DBMS depended on 
our hardware environment, the features of 
the text DBMSs on the market (we had already 
ruled out developing our own), and the speed of 
data retrieval. Since the hardware to be used 
for the DSI would be a VAX network, this limited 
our search to text DBMSs that run on the 
VAX. Features such as proximity searches, 
single word and phrase searches, Boolean 
searches, and the ability to identify structured 
fields are present in most text DBMSs. 
Therefore, the speed and method of searching 
the text was our primary selection factor. 
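The search features listed above can be illustrated with a minimal sketch. It is written in Python purely for illustration and is not taken from any of the text DBMS products considered; the sample records and the function names (has_word, has_phrase, within) are hypothetical. Every query scans the full text of every record, which is the simplest possible method of searching.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def has_word(text, word):
    """Single word search: True if the word occurs anywhere in the text."""
    return word.lower() in tokenize(text)

def has_phrase(text, phrase):
    """Phrase search: True if the words of the phrase occur consecutively."""
    words = tokenize(text)
    target = tokenize(phrase)
    n = len(target)
    return n > 0 and any(words[i:i + n] == target
                         for i in range(len(words) - n + 1))

def within(text, first, second, max_gap):
    """Proximity search: True if the two words occur at most max_gap tokens apart."""
    words = tokenize(text)
    pos_a = [i for i, w in enumerate(words) if w == first.lower()]
    pos_b = [i for i, w in enumerate(words) if w == second.lower()]
    return any(abs(a - b) <= max_gap for a in pos_a for b in pos_b)

# Hypothetical data set descriptions, keyed by record number.
records = {
    1: "Daily rainfall totals for coastal monitoring stations",
    2: "Soil moisture and rainfall measured at forest sites",
    3: "Sea surface temperature from ship observations",
}

# Boolean search: rainfall AND NOT forest.
boolean_hits = [k for k, t in records.items()
                if has_word(t, "rainfall") and not has_word(t, "forest")]

# Phrase search: the exact phrase "surface temperature".
phrase_hits = [k for k, t in records.items()
               if has_phrase(t, "surface temperature")]

# Proximity search: "soil" within three words of "rainfall".
proximity_hits = [k for k, t in records.items()
                  if within(t, "soil", "rainfall", 3)]
```

Because each query in this sketch reads all of the stored text, its cost grows with the size of the database, so the features alone say little about how a product will perform.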
The ability to perform "string searches" of textual 
information has existed for a long time, but the 
slow speed of such searches has prevented their 
interactive use in all but the smallest applications. 
Within the last five years several approaches 
have been used to deal with this problem. The 
most popular approach to searching textual data 
has been to make an index of the important 
words in the database. The index is then 
searched and the textual data is retrieved via 
links between the index and the data. This 
method increases retrieval speed because 
the index is smaller than the actual text files and 
is organized to allow the use of rapid searching 
techniques. There are several disadvantages to 
this indexed approach: 
1. The indexes require significant amounts of 
disk space that can be as much as 250% of that 
required by the actual text data (Hamilton, 1989). 
2. The organization of the index must be 
maintained. This requires the system to be 
unavailable while updates to the index are being 
run.
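
The indexed approach described above can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the method of any particular product: the stop word list standing in for "unimportant" words, the sample records, and the function names (build_index, search_all) are all hypothetical.

```python
import re
from collections import defaultdict

# Assumed list of "unimportant" words that are left out of the index.
STOP_WORDS = {"a", "and", "at", "for", "from", "of", "the"}

def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(records):
    """Map each important word to the set of record numbers containing it."""
    index = defaultdict(set)
    for record_id, text in records.items():
        for word in tokenize(text):
            if word not in STOP_WORDS:
                index[word].add(record_id)
    return index

def search_all(index, records, words):
    """Search the index only, then follow the links back to the text."""
    id_sets = [index.get(w.lower(), set()) for w in words]
    matching = set.intersection(*id_sets) if id_sets else set()
    return {rid: records[rid] for rid in sorted(matching)}

# Hypothetical data set descriptions, keyed by record number.
records = {
    1: "Daily rainfall totals for coastal monitoring stations",
    2: "Soil moisture and rainfall measured at forest sites",
    3: "Sea surface temperature from ship observations",
}

index = build_index(records)
hits = search_all(index, records, ["rainfall", "sites"])
```

The query consults only the index and never scans the stored text, which is the source of the speed gain. The sketch also shows both costs noted above: the index is a second structure that must be stored alongside the text, and it must be rebuilt or updated whenever a record is added or changed.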
	        