facts about a data set. They required the
storage of detailed information that would
provide insight into the usefulness of a data set
relative to their projects. This kind of
information was only available from other
scientists who had knowledge of the data sets.
Realizing that getting a scientist to enter detailed
information about a data set into the DSI would
be difficult, we identified ease of data entry as a
top priority in the system design.
To make data entry convenient, we decided to
require only a minimum amount of information,
based on a data set's usefulness to the GCRP.
Primarily, we wanted to give the scientists the
freedom to add information and categories of
information whenever they determined it would
be useful in describing a data set. When
attempting to identify the minimum amount of
required information, we recognized that in many
cases a scientist’s knowledge of a data set
would be directly related to the data set’s
usefulness in the GCRP. With this in mind, we
developed a classification scheme for data sets
based on their level of use within the GCRP,
resulting in four levels of usage: investigated,
accessed, acquired, and generated (Figure 1).
The scientist’s knowledge of a data set would
most likely grow as the data set proceeded
through these levels of usage. Therefore, the
DSI was designed to require more information as
the usage of the data set increased.
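The ordering of the usage levels can be sketched in code. The following is a minimal illustration, not the DSI implementation: the names UsageLevel and has_advanced are hypothetical, and the numeric ranks simply assume that the levels are ordered as listed above.

```python
from enum import IntEnum


class UsageLevel(IntEnum):
    """Usage levels in the order given in the text (ranks are an assumption)."""
    INVESTIGATED = 1
    ACCESSED = 2
    ACQUIRED = 3
    GENERATED = 4


def has_advanced(previous, current):
    """Return True when a data set has moved to a higher usage level."""
    return current > previous


# A data set that was only investigated and has now been acquired
# has advanced, so more information would be requested for it.
moved_up = has_advanced(UsageLevel.INVESTIGATED, UsageLevel.ACQUIRED)
```

Using an ordered enumeration makes "more usage implies more required information" a simple comparison rather than a lookup of level names.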
In addition to identifying the quantity of
information required in the DSI based on data
set usage, we identified the categories of
information that would be required to describe
a data set. To do this, we identified the users
of the DSI from a data retrieval perspective.
Primarily, we wanted the DSI to serve scientists
as a type of library system for data sets. If
scientists had a data need, we wanted them to
first search the DSI to see if the data they
required had already been located by other
scientists in the GCRP. This would
facilitate the acquisition of the data for the
scientist while also reducing the cost of data
duplication for the program. Secondarily, we
wanted the DSI to serve as a data management
tool for the GCRP data manager. Using
information from the DSI, the data manager
would be able to report more holistically on the
status of GCRP data to program managers while
also keeping track of data redundancy, data
quality, data quantity, data usage, and data
distribution.
To satisfy these user needs, we identified 5
categories of information to store in the DSI:
1. Data Set Information
Information about the data set such as the data
set name, a description of the data set, the
source of the data set, the primary scientific
purpose for the data set’s existence, whether a
data dictionary for the data set exists, the quality
of the data dictionary, and the major parameters
stored in the data set.
2. Data Set Retrieval Information
Information to assist in retrieving this metadata
such as acronyms for the data set name and
parameters, keywords used to reference the data
set, spatial features recorded in the data set for
locating sampling locations, and temporal
descriptors used to define the frequency and
duration of sampling events.
3. Data Set Recorder Information
Information about the person or group making
the DSI entry such as their name, level of
understanding of the data set, location, phone
number, and their affiliation within the GCRP.
4. Data Set Usage Information
Information about the uses and/or usefulness of
the data to the GCRP such as the potential
usefulness of investigated data, uses of acquired
data, and the name of the person or research
group using the data set.
5. Data Set Acquisition Information
Information about the acquisition and storage of
the data such as the location of the data, name
of the data analyst in charge of the data, and
the formats in which the data is stored.
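One way to picture the five categories is as a record with one open-ended group of items per category. The sketch below is illustrative only: the class name DSIEntry, the field names, and the sample values are hypothetical and are not taken from the DSI itself. Each category is a free-form mapping so that a scientist can add new items at any time.

```python
from dataclasses import dataclass, field


@dataclass
class DSIEntry:
    """One DSI record with a free-form mapping per category (hypothetical layout)."""
    data_set: dict = field(default_factory=dict)     # 1. name, description, source, ...
    retrieval: dict = field(default_factory=dict)    # 2. acronyms, keywords, ...
    recorder: dict = field(default_factory=dict)     # 3. person or group making the entry
    usage: dict = field(default_factory=dict)        # 4. uses and usefulness to the GCRP
    acquisition: dict = field(default_factory=dict)  # 5. location, analyst, formats

    def filled_categories(self):
        """Return the names of the categories that hold at least one item."""
        return [name for name, items in vars(self).items() if items]


entry = DSIEntry()
entry.data_set["name"] = "Example data set"       # placeholder value
entry.recorder["name"] = "A. Scientist"           # placeholder value
entry.data_set["sampling_notes"] = "added later"  # a scientist-defined item
```

Here entry.filled_categories() reports which categories have been started, which is the kind of check a data manager might use to track completeness.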
The metadata within the DSI will be stored in
these categories, and the data retrieval software
will use the categories to provide more options
for selective data retrieval. The relationship
between the DSI required information and the
usage level of data sets is shown in Figure 2.
For investigated data sets, only minimal
information for categories 1, 2, and 3 will be
required. As the level of usage increases,
information from all 5 categories will be required
and the amount of detail will increase.
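The rule relating usage level to required categories can be sketched as a small lookup. This is a hedged illustration only: the text states the requirement for investigated data sets (categories 1, 2, and 3), while the rows for the other levels are an assumption that all 5 categories apply, since Figure 2 is not reproduced here. The names REQUIRED_CATEGORIES and missing_categories are hypothetical.

```python
# Category numbers follow the numbered list of 5 categories.
ALL_CATEGORIES = {1, 2, 3, 4, 5}

# Only the "investigated" row comes from the text; the remaining rows
# are an assumption made for illustration.
REQUIRED_CATEGORIES = {
    "investigated": {1, 2, 3},
    "accessed": ALL_CATEGORIES,
    "acquired": ALL_CATEGORIES,
    "generated": ALL_CATEGORIES,
}


def missing_categories(usage_level, filled):
    """Return the sorted category numbers still required at this usage level."""
    return sorted(REQUIRED_CATEGORIES[usage_level] - set(filled))


# An acquired data set described only in categories 1 to 3
# would still need usage and acquisition information.
still_needed = missing_categories("acquired", [1, 2, 3])
```

The sketch covers only which categories are required; the increasing amount of detail within each category would need a finer-grained rule that the text does not specify.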
Another consideration concerning the DSI design