curve reflecting the relationship among the dimension, the
radius and the probability.
2.2 High dimensional data retrieval in similarity search
Spatial data similarity search retrieves either the data points that lie within a specified distance of a given query point, or the K points that are the nearest neighbours of the query point. These two kinds of query are called range search and K nearest neighbour search respectively. Because many users are not concerned with choosing a distance threshold, similarity search often refers to K nearest neighbour search. In fact, K nearest neighbour search can be implemented by adjusting the threshold of a range search, as sketched below.
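The distinction between the two query types, and the way K nearest neighbour search can be built on top of range search, can be sketched with a simple linear-scan example. The following Python fragment is only illustrative; the function names, the brute-force scan and the step size for enlarging the threshold are assumptions made for clarity, not part of any particular index structure.

```python
# Minimal linear-scan sketch of the two query types described above
# (illustrative only; the dataset and helper names are hypothetical).
import math

def distance(p, q):
    """Euclidean distance between two points given as coordinate tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def range_search(data, query, radius):
    """Range search: all points within the distance threshold of the query."""
    return [p for p in data if distance(p, query) <= radius]

def knn_search(data, query, k):
    """K nearest neighbour search: the k points closest to the query."""
    return sorted(data, key=lambda p: distance(p, query))[:k]

def knn_via_range(data, query, k, step=1.0):
    """k-NN implemented by repeatedly enlarging the range-search threshold,
    as the text notes is possible (assumes the dataset has at least k points)."""
    radius = step
    while True:
        candidates = range_search(data, query, radius)
        if len(candidates) >= k:
            return sorted(candidates, key=lambda p: distance(p, query))[:k]
        radius += step
```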
From the analysis above, we can see that as the dimension increases, the distribution of high dimensional spatial data becomes very sparse and the data tend to concentrate near the surface of the hypersphere. These characteristics have a significant impact on the retrieval of high dimensional data in spatial data similarity search. (1) The sparse distribution means poor clustering properties, so traditional partition-based indices cannot reflect the data distribution in high dimensional space well, and their performance degrades rapidly as the dimension grows. (2) Because the data in high dimensional space tend to lie near the surface of the hypersphere, the ratio of the maximum to the minimum pairwise distance of a set of random points decreases and approaches 1 as the dimension increases (Jinhua Li, 2001), so similarity search or nearest neighbour search becomes more difficult in high dimensions.
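The distance-concentration effect cited above can be illustrated with a small numerical experiment. The sketch below assumes NumPy and uniformly distributed random points; the sample size and the dimensions tested are arbitrary choices for illustration, not values taken from the cited work.

```python
# Illustration of the cited property: for uniformly random points, the ratio of
# the maximum to the minimum pairwise distance shrinks towards 1 as the
# dimension grows (assumes NumPy is available).
import numpy as np

rng = np.random.default_rng(0)

for dim in (2, 10, 100, 1000):
    points = rng.random((100, dim))               # 100 random points in [0, 1]^dim
    diff = points[:, None, :] - points[None, :, :]
    dists = np.sqrt((diff ** 2).sum(axis=-1))     # full pairwise distance matrix
    dists = dists[np.triu_indices(len(points), k=1)]   # keep each pair once
    print(f"dim={dim:5d}  max/min pairwise distance ratio: {dists.max() / dists.min():.2f}")
```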
In view of this, high dimensional data retrieval in spatial data similarity search is more difficult than the retrieval of traditional data. Consequently, there is no high dimensional index structure suitable for all applications of spatial data similarity search. Aiming at different applications, this paper therefore classifies high dimensional indices into three categories: partition-based indices, approximation-based indices and distance-based indices.
3. PARTITION BASED INDICES
Partition based indices are usually tree-structured indices formed by partitioning the vector space recursively, with the data nodes organized in a hierarchically structured directory. In general, these index structures can be classified into two categories: data-partitioning based indices and space-partitioning based indices (Beomseok Nam, 2004; Gang Qian, 2006), which are also known under other names, such as data-dependent and data-independent indices (Xiaodong Fu, 1997), or data-organizing and space-organizing indices (Christian Bohm, 2001).
3.1 Data-partitioning based indices
Data-partitioning based indices partition the data space according to the distribution of the data; they include the R-tree, R+-tree, R*-tree, SS-tree and X-tree. The R-tree is a height-balanced tree similar to a B-tree and is built from Minimum Bounding Rectangles (MBRs). Because the MBRs may overlap, more than one path may have to be traversed, so efficiency decreases when the overlap becomes serious, and efficient R-tree searching demands that both overlap and coverage be minimized. The R+-tree restricts the overlap of the MBRs to improve search performance, but the cost of building the tree unavoidably increases. The R*-tree uses forced reinsertion to improve performance, although building the tree is more expensive than for the R-tree. The SS-tree divides the space with bounding hyperspheres and increases the fan-out of the nodes to enhance performance; its weak point is that the volume of a hypersphere is usually larger than that of the corresponding hypercube, so the overlap is more serious. The X-tree introduces supernodes and decreases the number of node splits to improve performance.
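The role of MBR overlap in R-tree searching can be made concrete with a minimal sketch. The node layout and function names below are hypothetical simplifications (an in-memory dictionary instead of disk pages); the point is only that every child whose MBR intersects the query region must be visited, which is why minimizing overlap and coverage matters.

```python
# Minimal sketch of MBR-based range searching (not any particular R-tree library).
# An MBR is a (lower_corner, upper_corner) pair of coordinate tuples.

def mbr_intersects(a, b):
    """True if the two axis-aligned MBRs overlap in every dimension."""
    return all(al <= bh and bl <= ah
               for al, ah, bl, bh in zip(a[0], a[1], b[0], b[1]))

def range_query(node, query_mbr, results):
    """node = {'mbr': ..., 'children': [...]} or {'mbr': ..., 'points': [...]} (leaf).
    Every child whose MBR intersects the query must be descended into,
    so overlapping MBRs force the search to follow more than one path."""
    if not mbr_intersects(node['mbr'], query_mbr):
        return
    if node.get('children'):
        for child in node['children']:
            range_query(child, query_mbr, results)
    else:
        results.extend(p for p in node['points']
                       if all(l <= x <= h
                              for x, l, h in zip(p, query_mbr[0], query_mbr[1])))
```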
3.2 Space-partitioning based indices
Space-partitioning based indices partition the space according to a specific algorithm; they include the kdb-tree (Bernhard Seeger, 1990), the hB-tree and so on. The kdb-tree is a dynamic index structure combining the kd-tree and the B-tree, and performs well for indexing high dimensional point data. The hB-tree is an index structure based on the kdb-tree in which the nodes are organized as k-d trees.
The common characteristics of partition based index structures are as follows: (1) the data space must be partitioned, whether or not the partitioning depends on the data distribution; (2) the index is usually a tree structure, because the data space is partitioned recursively. The difference between the two kinds of partition based index structures is that (1) data-partitioning based indices are usually built bottom up, whereas (2) space-partitioning based index structures are usually built top down; a minimal kd-tree construction sketch is given below to illustrate this top-down style.
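The following fragment is a deliberately simplified, in-memory kd-tree construction; real kdb-trees and hB-trees are disk-based and considerably more elaborate. It only illustrates the top-down, recursive space-partitioning style mentioned above, in which the space is split along one coordinate axis at a time regardless of any bounding boxes of the data.

```python
# Simplified kd-tree construction: recursive, top-down partitioning of the space
# by cycling through the coordinate axes and splitting at the median point.

def build_kdtree(points, depth=0):
    if not points:
        return None
    axis = depth % len(points[0])              # cycle through the dimensions
    points = sorted(points, key=lambda p: p[axis])
    median = len(points) // 2
    return {
        'point': points[median],               # splitting point stored at the node
        'axis': axis,
        'left': build_kdtree(points[:median], depth + 1),
        'right': build_kdtree(points[median + 1:], depth + 1),
    }
```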
Partition based index structures can quickly locate specific data according to coordinates, and they are efficient when the dimension is not high. However, when the dimension becomes high, their performance decreases sharply. This can be illustrated with the R-tree. As is well known, the R-tree performs well when indexing two dimensional data. But as the dimension increases, on the one hand the overlap between the MBRs of the directory becomes more and more serious, and on the other hand the radius of the nearest neighbour search becomes larger and larger. From the characteristics of high dimensional data, the ratio of the maximum to the minimum distance between any pair of points decreases as the dimension increases and approaches 1 when the dimension is large enough. As a result, when the dimension becomes large enough, the overlap between the MBRs becomes very serious, and the hypersphere centred at the search point intersects almost all the MBRs. All nodes then have to be searched, so the performance deteriorates rapidly and becomes worse than a sequential scan. This phenomenon is called the "dimension crisis" (also known as the curse of dimensionality). The main cause is that the partitioning of the data space can no longer reflect the distribution of the data when the dimension becomes very large. When the dimension is not very large, however, this kind of index structure performs well: the R-tree performs well when the dimension is 3, and the X-tree performs well even when the dimension reaches 8. So this kind of index is suitable for similarity search applications when the dimension is not high, usually from a few dimensions up to about 10. In fact, this kind of index plays a very important role in spatial data similarity search.
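A rough numerical experiment can suggest why this degeneration happens. The sketch below assumes NumPy, uniformly distributed data, and a deliberately naive way of forming pages (sorting on the first coordinate and cutting into fixed-size groups); the dataset size, page size and dimensions are arbitrary. It is meant only to show the trend that the search ball needed to reach the K nearest neighbours intersects an ever larger fraction of the page MBRs as the dimension grows.

```python
# Fraction of page MBRs intersected by the k-NN search ball as dimension grows
# (assumes NumPy; page formation is intentionally simplistic).
import numpy as np

rng = np.random.default_rng(1)
n_points, page_size, k = 2000, 50, 10

for dim in (2, 8, 32, 128):
    data = rng.random((n_points, dim))
    query = rng.random(dim)
    # Radius needed to reach the k-th nearest neighbour of the query point
    radius = np.sort(np.linalg.norm(data - query, axis=1))[k - 1]
    # Naive pages: sort on the first coordinate, cut into groups, take each
    # group's bounding box as its MBR
    data = data[np.argsort(data[:, 0])]
    pages = data.reshape(n_points // page_size, page_size, dim)
    lows, highs = pages.min(axis=1), pages.max(axis=1)
    # Minimum distance from the query to each MBR; the search ball intersects
    # an MBR exactly when this distance does not exceed the radius
    mindist = np.linalg.norm(np.maximum(lows - query, 0)
                             + np.maximum(query - highs, 0), axis=1)
    print(f"dim={dim:4d}  fraction of page MBRs intersected: {(mindist <= radius).mean():.2f}")
```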
4. APPROXIMATION BASED INDICES
From the above analysis, we can see that when the dimension is very high, all the nodes in this kind of index structure will be traversed, and the performance is unavoidably poorer than the