Distance based indices can be divided into different types from different points of view (Dong Daoguo, 2005). (1) From the perspective of data organization, distance based indices can be divided into static indices and dynamic indices. Static indices, including the VP-tree (Gisli R. Hjaltason, 2003) and the SA-tree (Gonzalo Navarro, 2002), do not support dynamic operations such as insertion and deletion; dynamic indices are index structures built by inserting data points one by one, usually from the bottom up, and include the M-tree (Paolo Ciaccia, 1997), the Slim-tree (C. Traina Jr., 2000) and so on. (2) From the angle of the distance measure, distance based indices can be divided into discrete distance indices and continuous distance indices. The former means the distance function is discrete, as in the BK-tree (Edgar Chavez, 2001) and the FQ-tree (Edgar Chavez, 2001); the latter refers to indices whose distance function is continuous, such as the VP-tree (Gisli R. Hjaltason, 2003) and the M-tree (Paolo Ciaccia, 1997). (3) From the point of view of index structure, distance based indices can be divided into tree type indices and non-tree type indices. As the name implies, tree type indices are organized as tree structures, including the BK-tree, VP-tree and BS-tree (Edgar Chavez, 2001); non-tree type indices include FQA, AESA (Edgar Chavez, 2001) and so on.
Taking the classic VP-tree (Gisli R. Hjaltason, 2003) as an example, the principle of distance based index structures is analyzed. The main idea of the VP-tree is that the data space is partitioned into two subspaces according to the distance from a vantage point, as shown in figure 5(a). The algorithm repeats the partitioning process recursively and forms a hierarchical binary tree. As figure 5(e) shows, the tree is generated from the data space SS, where the left branch is built from SS1 and the right branch is generated from SS2. Each internal node of the binary VP-tree is of the form (Sv, M, Rptr, Lptr), where Sv is the vantage point, M is the median of the distances from Sv to all the points indexed below that node, and Rptr and Lptr are pointers to the right and left branches (Tolga Bozkaya, 2000).
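To make the construction step concrete, the following is a minimal sketch of VP-tree building in Python. The Euclidean metric and the naive choice of the first point as vantage point are assumptions made for illustration, not prescriptions from the paper.

```python
import math

def euclidean(a, b):
    # Assumed metric; any function obeying the triangle inequality would do.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

class VPNode:
    def __init__(self, vantage, median, left, right):
        self.vantage = vantage   # Sv, the vantage point
        self.median = median     # M, median distance from Sv to the points below
        self.left = left         # Lptr: subtree over points with d(p, Sv) <= M (SS1)
        self.right = right       # Rptr: subtree over points with d(p, Sv) >  M (SS2)

def build_vp_tree(points, dist=euclidean):
    """Recursively partition the data space by distance from a vantage point."""
    if not points:
        return None
    if len(points) == 1:
        return VPNode(points[0], 0.0, None, None)
    vantage = points[0]                      # naive vantage-point choice (assumption)
    rest = points[1:]
    dists = [dist(vantage, p) for p in rest]
    median = sorted(dists)[len(dists) // 2]  # the median distance M
    inner = [p for p, d in zip(rest, dists) if d <= median]
    outer = [p for p, d in zip(rest, dists) if d > median]
    return VPNode(vantage, median,
                  build_vp_tree(inner, dist),
                  build_vp_tree(outer, dist))
```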
For a given search point q with search radius r, if d(q, Sv1) - r > R, as figure 5(b) shows, where d(q, Sv1) is calculated at query time, r comes from the search condition, and R is the median distance M of the vantage point Sv1, then the left branch can be pruned, which is marked with dashes in figure 5(e); if d(q, Sv1) + r < R, as figure 5(c) shows, the right branch can be pruned, so the distances from the search point to the points in the pruned branch are exempted from calculation; if both d(q, Sv1) - r < R and d(q, Sv1) + r > R hold at the same time, as figure 5(d) shows, then neither branch can be pruned and both branches must be traversed further.
Figure 5. The idea of distance-based indices
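The three pruning cases can be expressed as a short range-search routine. This is a minimal sketch that reuses the hypothetical VPNode and euclidean definitions from the construction sketch above; it is not the paper's own implementation.

```python
def range_search(node, query, radius, dist=euclidean, results=None):
    """Return all indexed points within `radius` of `query`.

    Branches are pruned with the triangle inequality, matching the three
    cases of figure 5(b)-(d).
    """
    if results is None:
        results = []
    if node is None:
        return results
    d = dist(query, node.vantage)        # d(q, Sv), computed once per visited node
    if d <= radius:
        results.append(node.vantage)
    # Skip the inner (left) branch only in case 5(b): d(q, Sv) - r > M.
    if d - radius <= node.median:
        range_search(node.left, query, radius, dist, results)
    # Skip the outer (right) branch only in case 5(c): d(q, Sv) + r < M.
    if d + radius >= node.median:
        range_search(node.right, query, radius, dist, results)
    return results

# Example usage on hypothetical 2-D points:
# tree = build_vp_tree([(0, 0), (1, 1), (2, 2), (5, 5)])
# print(range_search(tree, (1, 0), 1.5))   # -> [(0, 0), (1, 1)]
```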
From the analysis, we can see that data filtering in the VP-tree is accomplished by the triangle inequality in metric space. In fact, the triangle inequality property is applied to prune the search space in all distance based indices. Note that the cost of distance calculation is related not only to the definition of the similarity measure, but also to the number of dimensions. Usually the distance calculation cost grows with the number of dimensions. For example, consider calculating the distance between two feature vectors with n dimensions. If we take the Euclidean distance as the similarity measure and ignore the square root operation, n-1 additions and n multiplications are needed. If the number of feature vector pairs is m, then n·m multiplications are needed. For high dimensional data retrieval, n reaches hundreds or even thousands, and m depends on the dataset and the specific index structure. Therefore, as the dimension increases, the distance calculation becomes significantly more expensive. Distance based indices are suitable for spatial data similarity search applications with expensive distance calculations, which usually include the high dimensional case. They have indeed been used in many similarity search applications and remain an active research area.
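The operation count above can be made concrete with a small sketch; the values of n and m below are hypothetical and chosen only to illustrate the n·m growth.

```python
def squared_euclidean_cost(n):
    """Operation count for one squared Euclidean distance in n dimensions
    (the square root is ignored, as in the text)."""
    multiplications = n      # one squaring per dimension
    additions = n - 1        # summing the n squared differences
    return multiplications, additions

# For m vector pairs the multiplication count is n * m, e.g.:
n, m = 128, 10_000           # hypothetical: 128-dim features, 10,000 pairs
mults, adds = squared_euclidean_cost(n)
print(mults * m)             # 1,280,000 multiplications in total
```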
6. CONCLUSION
High dimensional data indexing is one of the key technologies for solving the problem of spatial data similarity search. In this paper, a quantified analysis of the distribution of high dimensional data and curves reflecting the relationship among the dimension, the radius and the probability are given, which show the sparsity and the distribution tendency of high dimensional data. Because of these properties of high dimensional space, traditional partition based indices cannot reflect the data distribution well, nor can they avoid the "dimension crisis" (the curse of dimensionality), which leads to poor performance. Since the high dimensional case often appears in spatial data similarity search, and aiming at this application, we present a classification of high dimensional indices for spatial data similarity search: partition based indices, approximation based indices and distance based indices. Partition based indices locate specific data according to coordinates and are suitable for spatial data similarity search in the low dimensional case, such as fewer than 10 dimensions. Approximation based indices implement retrieval by reducing the cost of a sequential scan and are suitable for high dimensional similarity search, from over ten to more than one hundred dimensions. Distance based indices employ the triangle inequality to complete the retrieval and are suitable for similarity search with expensive distance calculations, usually including the high dimensional case, such as dozens to hundreds of dimensions.