DEVELOPMENT OF A DATA SET INDEX FOR THE GLOBAL CLIMATE RESEARCH PROGRAM
Donald R. Block and Edward H. Barrows
NSI Technology Services Corporation, Research Triangle Park, N.C. 27709
ABSTRACT
The U.S. Environmental Protection Agency Global Climate Research Program (GCRP) is being set up
to conduct multiple research efforts at many different laboratories distributed across the United States.
Each of these research efforts will require very large quantities of data for analysis. The distributed
nature of the research effort and the dynamic nature of the data itself argue against a centralized
depository for all data. Distributed data systems, however, will cause redundancies when the same data
set must be used at more than a single location. The establishment of independent data centers at
each research location can be prohibitively expensive. At the same time, program management needs
to be able to identify and act on a variety of data problems.
A system design combining the elements of a centralized data set index and distributed processing
proves to be both cost-effective and responsive to the research requirements. Based on a text-retrieval
software platform, a centralized index not only provides information on the data sets being utilized in the
research, but also facilitates the movement of data sets to other locations and tracks their uses and users.
The evaluation process and products that were examined in the analysis of alternatives indicate a clear
choice in an innovative combined hardware/software platform.
KEY WORDS: Data Set Index, Indexing, Data Management, Text Retrieval, Text DBMS
INTRODUCTION
The data set index (DSI) for the GCRP is a data
base containing information about all the GCRP
data sets. It is the mechanism that ties the
multiple, geographically distributed physical data
bases together into a single logical data system.
The DSI does not function as a master data
dictionary like IBM’s Repository or DEC’s
Common Data Dictionary. The DSI stores
general descriptive metadata about GCRP data
sets that can be used by scientists for reviewing
and locating these data sets. The DSI for the
GCRP contains descriptive information about a
data set’s spatial and temporal density and
extent, parameters, usefulness, usage, and
quality. It identifies the sources of data sets and
monitors their movement throughout the GCRP
information system. This paper outlines the
design and development approach used to
identify and implement a practical and effective
method of storing and retrieving metadata about
data sets for the United States Environmental
Protection Agency (EPA) GCRP.
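The kind of entry described above can be pictured with a minimal sketch. It is only an illustration under stated assumptions: the class names, field names, and sample values are hypothetical and do not come from the GCRP design, and a plain in-memory dictionary stands in for the text-retrieval platform named in the abstract.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class DataSetRecord:
    """Hypothetical DSI entry; every field name here is illustrative only."""
    name: str
    source: str                       # organization that supplied the data set
    parameters: List[str]             # variables the data set contains
    spatial_extent: str               # free-text description of coverage
    temporal_extent: Tuple[int, int]  # (first year, last year)
    quality_notes: str = ""
    locations: List[str] = field(default_factory=list)  # sites holding a copy
    users: List[str] = field(default_factory=list)      # who has used it


class DataSetIndex:
    """Central catalog of metadata; the data sets themselves stay at the labs."""

    def __init__(self) -> None:
        self._records: Dict[str, DataSetRecord] = {}

    def register(self, record: DataSetRecord) -> None:
        self._records[record.name] = record

    def find_by_parameter(self, parameter: str) -> List[DataSetRecord]:
        # Case-insensitive match against each record's parameter list.
        wanted = parameter.lower()
        return [r for r in self._records.values()
                if wanted in (p.lower() for p in r.parameters)]

    def record_transfer(self, name: str, destination: str, user: str) -> None:
        # Note that a copy now exists at the destination and who requested it.
        rec = self._records[name]
        if destination not in rec.locations:
            rec.locations.append(destination)
        if user not in rec.users:
            rec.users.append(user)


index = DataSetIndex()
index.register(DataSetRecord(
    name="example_precip",            # made-up data set for illustration
    source="Example University",
    parameters=["precipitation"],
    spatial_extent="continental United States",
    temporal_extent=(1950, 1988),
    locations=["Lab A"],
))
index.record_transfer("example_precip", destination="Lab B", user="scientist_1")
matches = index.find_by_parameter("Precipitation")
```

In this sketch the index holds only descriptions and copy locations, so a scientist can locate a data set and see where copies reside without any of the data passing through the central system.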
BACKGROUND
The organization of the EPA GCRP consists of
multiple research efforts located at EPA
laboratories throughout the United States. Each
research effort often requires information from
many separate sources, such as universities,
foreign governments, local governments, other
federal agencies, and EPA itself. Early in
the development of the data management plan
for the GCRP it was determined that, since the
research effort would be widely distributed, the
pertinent data should be located and managed
at locations in close proximity to the research.
However, it was evident that the management of
this distributed data would require some degree
of centralization. The need for a centralized
method of monitoring and reporting the available
GCRP data sets, one that would facilitate project
research, reduce redundancy, and promote data
sharing, led to the concept of a data set index (DSI).
DESIGN
The initial question that arose during the design
phase was "Who would use the DSI and for what
purposes?" Answering this question would help
identify the kind of data to be stored in the DSI.
Since the concept for the DSI arose to promote
data sharing and to facilitate cross-project
research, it was evident that the primary users of
the DSI would be scientists looking for data sets
that supported their research objectives. To
make this kind of determination, scientists
required that the DSI store more than just simple