109
Beijing 2008
THE CLASSIFICATION OF HIGH DIMENSIONAL INDICES FOR SPATIAL DATA
SIMILARITY SEARCH
Yu XIA a ’ *, Xinyan ZHU b , Chang LI a
a School of Remote Sensing and Information Engineering, Wuhan University - geoxy@126.com
b State Key Laboratory of Information Engineering in Surveying, Mapping and Remote Sensing, Wuhan University
WG IV/1
KEY WORDS: Spatial Database; Data Management; Digital Photogrammetry; Query Processing; GIS; Similarity Search; Metric
Space
ABSTRACT:
The applications of spatial data similarity search are increasingly needed nowadays, and accordingly high dimensional index
becomes one key technology to solve the problem of spatial data similarity search. Firstly, the distribution of high dimensional data
is in-depth analyzed, and then high dimensional data retrieval for spatial data similarity search is also discussed. Secondly, based on
the research, the classification of high dimensional indices for spatial data similarity search is presented, which initially makes a
clear distinction of the relationship between the high dimensional index and the application of spatial data similarity search. Finally,
the principle of high dimensional indices and the state of the applications in spatial data similarity search are analyzed with an
example of typical index structure respectively, which lays a foundation for the research on index technology in spatial data
similarity search.
1. INTRODUCTION
How to search similar objects of a given object from spatial
database efficiently, called spatial data similarity search, has a
wide application needs nowadays, and becomes an increasingly
important problem. One of the key technologies to solve the
similarity search problem is the high dimensional index.
Accordingly, high dimensional index technology has been a
very active research area over the past decades, and attracts
many scholars’ attentions. For example, (Antomn Guttman,
1984) introduced a dynamic index structure called R-tree,
which was well suited to data objects of non-zero size located
in multidimensional spaces; In the later years, the classic
improved index structure R+-tree (Timos Sellis, 1987) and R* -
tree (Norbert Beckmann, 1990) were presented respectively;
The TV-tree (King-lp Lin, 1994) was designed for indexing
high dimensional objects which used a variable number of
dimensions to get a larger fanouts and fewer disk accesses; In
(Stefan Berchtold, 1996), the X-tree was proposed which used a
split algorithm minimizing overlap and additionally utilized the
concept of supemodes to keep the directory as hierarchy as
possible and avoid splits in the directory; (David A. White,
1996) proposed the similarity search tree, called SS-tree, which
used spheres as page regions; The SR-tree (Norio Katayama,
1997) integrated bounding spheres and bounding rectangles,
and used the intersection as the region, and enhanced the
performance especially for high dimensional and non-uniform
data; In (Roger Weber, 1998), a vector approximation schema,
called VA-file, was presented to give the approximation of
vectors and make the unavoidable sequence scan as fast as
possible; (Christians Bohm, 2001) gave a general overview of
index structures and algorithms in high dimensional space;
(Christians M. Garcia-Arellano, 2002) compared the index
structures among VA-File, IQ and A-tree, and then proposed a
new static similarity search strategy called Quantized
Clustering tree or QC-tree, which integrated the best
characteristics observed in the IQ-tree and A-tree; (Ye Hangjun,
2003) presented the vector quantization (VQ) index schema,
which assumed a Gaussian mixture distribution of real-world
data, and trained optimized vector quantizer to partition data
space; (Cui Jiangtao, 2005) proposed PCR-tree index based on
R-tree and VA-File, which employed R-tree to manage the
approximate vectors at principal components; In addition, many
scholars paid much attention to high dimensional index in
metric space, and proposed many distance based index
structures where object proximity was only defined by a
distance function satisfying the positivity, symmetry and
triangle inequality; (Paolo Ciaccia,1997) introduced the M-tree
index to organize and search large data sets, which was a
dynamic balanced distance based index structure; (Tolga
Bozkaya, 1999) proposed a distance based index structure
called multi-vantage point (mvp) tree for similarity search,
which used more than one vantage point and utilized pre
computed distances between the data points and the vantage
points; (Gonzalo Navarro, 2002) presented SA-tree index which
was based on approaching the searched objects spatially.
(Benjamin Bustos, 2006) presented the Multi-Metric M-tree
(M3-tree), which was the extension of the M-tree, stored partial
distances to better estimate the weighted distance between
routing entries and each search, where a single distance
function was used to build the whole index. Though the
researches have attributes to the high dimensional index
technology, there isn’t systematic research on high dimensional
index classification. Furthermore, the relationship between high
dimensional index structure and the application of spatial data
similarity search is still not clear. Note that the deficiency limits
further research on high dimensional index technology in
spatial data similarity search. So the research on the
classification of high dimensional index technology for feature-
geoxy@126.com