was the hardware to be used. Much of the
work on the EPA GCRP will occur at the EPA
research laboratories, each of which has a Digital
Equipment Corporation (DEC) VAX computer
connected in a private EPA network. Since this
hardware and network are already in place and
operational, it was decided that the DSI would
have to operate in this environment.
The design objectives are summarized as
follows:
1. The DSI for the GCRP would consist
primarily of textual information that would be
broadly categorized to assist in both the input
and retrieval of information.
2. The DSI would require only minimal
information for data entry while allowing the
entry of extensive amounts of information if
necessary.
3. The person entering or retrieving information
would not be required to look through lists of
keywords to "fit" information into the system or to
retrieve information from the system.
4. The hardware platform for the application
would be a DEC VAX model computer running
VMS as the operating system.
DEVELOPMENT
Development started by trying to categorize all
the information into a fixed field system with
many fields coded for quick and consistent
retrieval. This worked fine for items such as
data set name, location, contact, and phone
numbers, but didn’t work at all for text
information such as data set uses, quality, spatial
extent, and descriptions of the data set and its
parameters. Fixed fields for the textual
information didn’t work well because of the
variable length of these fields and the slow or
missing ability to search the text within these
fields. With the exception of the temporal data,
all the DSI information we wished to store was
textual. This indicated the need for a text data
base management system (DBMS).
By definition, a text DBMS is one with the ability
to store and retrieve unstructured textual data.
In a text DBMS, the software
depends on the user’s knowledge of the English
language and the structure of the retrieval
request to formulate a useful query of the
database. In practice, this often results in
retrieving more information than was required or
expected, requiring the user to rephrase the
query or sift through data. The alternative is to
construct some "keyword" vocabulary that would
have to be learned by users for both data entry
and data retrieval. Although knowledge of a
"keyword" vocabulary produces more accurate
and complete retrievals from a textual database,
it puts limitations on both the data entry and
data retrieval interfaces to the database. Since
the majority of the DSI data was textual, and to
maximize user flexibility for data retrieval and
data storage, we decided to use a text DBMS.
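The trade-off described above can be illustrated with a minimal Python sketch. The three records, the keyword vocabulary, and the function names are hypothetical and invented purely for illustration; they are not taken from the DSI. A free-text search needs no vocabulary but may return unwanted records and miss relevant ones, while a keyword search is more accurate and complete but works only if the user knows the exact vocabulary term.

```python
# Hypothetical records: a free-text description plus keywords assigned
# from a fixed, controlled vocabulary (all invented for illustration).
RECORDS = [
    {"id": 1,
     "text": "Lake water quality samples, including pH and dissolved oxygen",
     "keywords": {"WATER QUALITY"}},
    {"id": 2,
     "text": "Air quality data; quality assurance reviewed by the laboratory",
     "keywords": {"AIR QUALITY"}},
    {"id": 3,
     "text": "Stream chemistry measurements for acid rain studies",
     "keywords": {"WATER QUALITY"}},
]

def free_text_search(term):
    """Return ids of records whose description contains the term (case-insensitive)."""
    t = term.lower()
    return [r["id"] for r in RECORDS if t in r["text"].lower()]

def keyword_search(keyword):
    """Return ids of records tagged with exactly this vocabulary keyword."""
    return [r["id"] for r in RECORDS if keyword in r["keywords"]]

# A user interested in water quality types the word "quality":
broad = free_text_search("quality")         # includes the air quality record
# The same user, knowing the vocabulary, asks for the exact keyword:
precise = keyword_search("WATER QUALITY")   # both water quality records
# Without knowledge of the vocabulary, the keyword search finds nothing:
missed = keyword_search("water quality")
```

In this sketch the free-text query returns record 2, which the user did not want, and misses record 3, whose description never uses the word "quality". The keyword query returns records 1 and 3, but only when the term is entered exactly as it appears in the vocabulary.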
The selection of the text DBMS depended on
our hardware environment, the features of
the text DBMSs on the market (we had already
ruled out developing our own), and the speed of
data retrieval. Since the hardware to be used
for the DSI would be a VAX network, our search
was limited to text DBMSs that run on the
VAX. Features such as proximity searches,
single word and phrase searches, Boolean
searches, and the ability to identify structured
fields are present in most text DBMSs.
Therefore, the speed and method of searching
the text were our primary selection factors.
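To make the search features listed above concrete, the following is a minimal Python sketch of single word, phrase, Boolean, and proximity searches over a few records. The sample records, the function names, and the simple tokenizer are assumptions made for illustration; they do not describe the behavior of any particular commercial text DBMS.

```python
import re

# Hypothetical data set descriptions, invented for illustration only.
RECORDS = {
    1: "Monthly sea surface temperature data for the North Atlantic",
    2: "Surface ozone measurements collected at rural monitoring sites",
    3: "Temperature and precipitation records from surface stations",
}

def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())

def word_search(word):
    """Return ids of records that contain the single word."""
    w = word.lower()
    return {rid for rid, text in RECORDS.items() if w in tokenize(text)}

def phrase_search(phrase):
    """Return ids of records that contain the phrase's words consecutively."""
    target = tokenize(phrase)
    n = len(target)
    hits = set()
    for rid, text in RECORDS.items():
        words = tokenize(text)
        if any(words[i:i + n] == target for i in range(len(words) - n + 1)):
            hits.add(rid)
    return hits

def proximity_search(word_a, word_b, max_distance):
    """Return ids of records where the two words lie within max_distance words."""
    a, b = word_a.lower(), word_b.lower()
    hits = set()
    for rid, text in RECORDS.items():
        words = tokenize(text)
        pos_a = [i for i, w in enumerate(words) if w == a]
        pos_b = [i for i, w in enumerate(words) if w == b]
        if any(abs(i - j) <= max_distance for i in pos_a for j in pos_b):
            hits.add(rid)
    return hits

# Boolean searches combine result sets with set operations.
both = word_search("surface") & word_search("temperature")     # AND
either = word_search("ozone") | word_search("precipitation")   # OR
excluded = word_search("surface") - word_search("ozone")       # NOT
```

Every function in this sketch scans the full text of every record, so it says nothing about retrieval speed, which is the factor that distinguished the candidate products.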
The ability to perform "string searches" of textual
information has existed for a long time, but the
low speed of such searches has prevented their
interactive use in all but the smallest applications.
Within the last five years, several approaches
have been used to deal with this problem. The
most popular approach to searching textual data
has been to make an index of the important
words in the database. The index is then
searched and the textual data is retrieved via
links between the index and the data. This
method increases the retrieval speeds because
the index is smaller than the actual text files and
is organized to allow the use of rapid searching
techniques. There are several disadvantages to
this indexed approach:
1. The indexes require a significant amount of
disk space, which can be as much as 250% of that
required by the actual text data (Hamilton, 1989).
2. The organization of the index must be
maintained. This requires the system to be
unavailable while updates to the index are being
run.