ul 2004
DERIVATION OF IMPLICIT INFORMATION FROM SPATIAL DATA SETS WITH
DATA MINING
F. Heinzle, M. Sester
Institute of Cartography and Geoinformatics, University of Hannover, Appelstr. 9a, 30167 Hannover, Germany -
(frauke.heinzle, monika.sester)@ikg.uni-hannover.de
Working Group: TS WG IV/5
KEY WORDS: Spatial, Information, Retrieval, Metadata, Data Mining, GIS, Databases, Internet/Web
ABSTRACT:
Geographical data sets contain a huge amount of information about spatial phenomena. Exploiting this knowledge and making it usable
in an Internet search engine is one of the goals of the EU-funded project SPIRIT. This project deals with spatially related
information retrieval on the Internet and the development of a search engine that takes the spatial aspect of queries into
account.
Existing metadata as provided by the standard ISO/DIS 19115 only give fractional information about the substantial content of a
data set. Most of the time, the enrichment with metadata has to be done manually, with the result that this information is rarely
present. Furthermore, the given metadata do not contain implicit information. This implicit information does not exist on the level of
pure geographical features, but on the level of the relationships between the features, their extent, density, frequency,
neighbourhood, uniqueness and more. This knowledge is often well known to humans through their background knowledge; however,
it has to be made explicit for the computer.
The first part of the paper describes the automatic extraction of classical metadata from data sets. The second part describes concepts
of information retrieval from geographical data sets. This part deals with the setup of rules to derive useful implicit information. We
describe possible implementations of data mining algorithms.
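
As a rough illustration of what "implicit information" at the level of a whole data set could look like, the following minimal Python sketch derives extent, density, frequency and uniqueness from a handful of point features. It is only a sketch under stated assumptions: the `Feature` class, the `summarize` function and the sample coordinates are hypothetical and are not taken from the SPIRIT project or from ISO/DIS 19115.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Feature:
    """Hypothetical point feature: a class label plus planar x/y coordinates."""
    kind: str
    x: float
    y: float


def summarize(features):
    """Derive simple data-set-level descriptors from a list of point features."""
    if not features:
        raise ValueError("empty data set")
    xs = [f.x for f in features]
    ys = [f.y for f in features]
    # Extent: axis-aligned bounding box (min_x, min_y, max_x, max_y).
    extent = (min(xs), min(ys), max(xs), max(ys))
    area = (extent[2] - extent[0]) * (extent[3] - extent[1])
    # Frequency: number of features per class label.
    frequency = Counter(f.kind for f in features)
    # Density: features per unit of bounding-box area (None if the box is degenerate).
    density = len(features) / area if area > 0 else None
    # Uniqueness: class labels that occur exactly once in the data set.
    unique_kinds = sorted(k for k, n in frequency.items() if n == 1)
    return {
        "extent": extent,
        "frequency": dict(frequency),
        "density": density,
        "unique_kinds": unique_kinds,
    }


# Invented sample data, for illustration only.
features = [
    Feature("church", 0.0, 0.0),
    Feature("house", 2.0, 1.0),
    Feature("house", 4.0, 2.0),
    Feature("house", 1.0, 2.0),
]
summary = summarize(features)
print(summary)
```

None of these values is stored on an individual feature; each one only emerges from looking at the features together, which is the sense in which such information is implicit.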
1. INTRODUCTION

There is a vision, a dream, that some day our computers will communicate with us in a meaningful way. Tim Berners-Lee (2001)
made this dream concrete for the Internet by formulating the Semantic Web. The idea is to let the computer understand not only
the words used by humans, but also the context of the expressions and their use in different situations.
Especially when using an Internet search engine, we are often confronted with the stupidity of the computer. Today most search
engines conduct a query by looking up keywords and comparing them to a precompiled catalogue of all existing web sites. However,
there is no analysis of the sense of a query or an interpretation of the combination of words used in web sites. The aim of
building a Semantic Web is to address those questions.
Linked to the idea of the Semantic Web is the EU-funded project SPIRIT (Jones et al., 2002). SPIRIT (Spatially-aware Information
Retrieval on the Internet) aims to improve the concept of search engines by evaluating the spatial context of queries and web
sites. Including the context and considering the semantic background improves the quality of the results. Often we use spatial
concepts to describe something, or we keep a spatial situation in mind when we search for something. In SPIRIT we want to
include those structures to define a spatial ontology.
A huge amount of information is stored in spatial data sets. However, these data sets are usually not accessible on the Internet.
Most of the time there are neither metadata describing the data sets nor specifications of the intrinsic geometries and
attributes. Furthermore, these data sets contain a lot of implicit information.
The aim is to make spatial data sets visible on the Internet, especially to enable search engines to get knowledge about the data
and publish it or use it in search queries. This requires the definition of metadata that are sufficient to describe the
significant aspects of the data; moreover, it requires the development of algorithms that determine these metadata automatically.
The second and more ambitious aim is to make even the contents usable for a search engine. This means identifying spatial
phenomena in the data sets and building a semantic network from implicit information in the data.
Both attempts are described in the following chapters. In section 3 we discuss the first issue, namely the automatic annotation
of spatial data sets with a set of important metadata tags. Subsequently we present ideas for the extraction of implicit
information to use it for spatial concepts and concentrate on data mining algorithms to derive this information.

2. RELATED WORK

The extraction of information from spatial data sets has been investigated in the domain of interpreting digital images. There,
the need for interpretation is obvious, as the task is to automatically determine individual pixels or collections of pixels
representing an object. Basic techniques for image interpretation are either pixel-based classification methods (e.g. Lillesand
and Kiefer, 1994) or structure-based matching techniques (e.g. Schenk, 1999). The major applications in photogrammetry lie in the
automatic extraction of topographic features like roads (Gerke et al., 2003), buildings (Brenner, 2000) or trees (Straub, 2003).
The main challenge is to provide appropriate models for the objects to be found in the images.
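
To make the notion of a pixel-based classification method mentioned above more tangible, here is a minimal Python sketch of one simple variant, a minimum-distance-to-means classifier. It is an illustration only and is not claimed to be the method used in any of the cited works; the class labels, the two-band pixel values and the function names `class_means` and `classify` are invented for this example.

```python
import math


def class_means(training):
    """Compute the mean spectral vector per class.

    training: dict mapping a class label to a list of pixel vectors
    (tuples of band values), all of the same length.
    """
    means = {}
    for label, pixels in training.items():
        n = len(pixels)
        bands = len(pixels[0])
        means[label] = tuple(sum(p[b] for p in pixels) / n for b in range(bands))
    return means


def classify(pixel, means):
    """Assign the pixel to the class whose mean vector is nearest (Euclidean distance)."""
    return min(means, key=lambda label: math.dist(pixel, means[label]))


# Invented two-band training samples, for illustration only.
training = {
    "water": [(10, 20), (12, 18)],
    "vegetation": [(40, 90), (44, 86)],
}
means = class_means(training)

# A tiny 2x2 "image" of two-band pixels, classified pixel by pixel.
image = [
    [(11, 21), (43, 85)],
    [(15, 25), (38, 92)],
]
labels = [[classify(p, means) for p in row] for row in image]
print(labels)
```

Each pixel is labelled in isolation from its neighbours, which is what distinguishes this family of methods from structure-based matching, where spatial arrangement enters the decision.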