The International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences. Vol. XXXVII. Part B4. Beijing 2008
110
based spatial data similarity search has the theoretical value and
practical significance.
2. HIGH DIMENSIONAL DATA RETRIEVAL
Spatial data similarity search is generally based on spatial data
features, such as image features, including color features,
texture features, shape features, etc. , and vector features,
including topological features, direction features, measure
features, etc., and so on. The kind of similarity search are called
feature-based spatial data similarity search. These features are
usually represented as high dimensional vectors. For example,
the homogenous texture descriptors recommended by MPEG-7
are represented as a vector of 62 dimensions, and the edge
histogram descriptors are denoted as a vector of 150 dimensions.
Therefore, one key problem of feature-based spatial data
similarity search is high dimensional data retrieval efficiently.
2.1 High dimensional data distribution characteristics
To explore the nature of high dimensional data retrieval, high
dimensional data distribution is firstly discussed. In 2
dimension space, the difference between the area of the
minimum boundary rectangle and that of the minimum
boundary circle is relatively small, and the ratio is
(2 X y[2 /2 X r) 2 /( 71 r 2 ) = 0.64; However, in 64 dimension
space, the ratio becomes (2X >/64 /64Xr) 64 / {71 32 r 64 /32!) =
9.54 X 10 20 ; In 256 dimension space, the ratio evenly decrease
to (2 X >/256 /256 X r) 256 / ( 71 128 r 256 /128!) = 5.76 X 10' 80 ,
where r denotes the radius of the hypersphere.
From the two examples, we can see that high dimension data
are very sparse in high dimensional space, and the minimum
boundary hypercube and minimum boundary hypersphere don’t
reflect the distribution of high dimensional data well. Provided
that the volume of over 75 percent of radius of the hypersphere
is defined interior surface, the probability of a data point
located in the interior surface in 2 dimension space is {71 r 2 -
71 (3r/4) 2 )/( 7t r 2 ) = 0.438, and in 6 dimension space the
probability is ( Jt 3 r 6 /6!- 71 3 (3r/4) 6 /6!)/( 7t 3 r 6 /6!) = 0.822.
The rest may be deduced by analogy, and the change of the
probability with the dimension is shown as figures 2.
Figure 2. Probability change of data points located in interior
surface with dimension in the case of the radius threshold is
75% of that of the hypersphere.
It is shown that from the figure 2, as the dimension grows, the
probability of distribution in interior surface increases, and
when the dimension is 8, the probability reaches 90 percent.
As figure 1 shows, with the increase of the dimension, the ratio Note that if interior surface is defined as the volume of over
decreases sharply, and the difference between the volume of the different radius threshold, the probability of data distribution in
minimum boundary hypercube and that of the minimum interior surface is different in the case of the same dimension,
boundary hypersphere becomes significant. Therefore, in the In other words, the dimension is different in the case of same
case of high dimension, the minimum boundary hypersphere probability of data distribution in interior surface when the
doesn’t reflect fittingly the distribution of high dimensional data. interior surface involves different radius.
Figure 1. The change of the ratio of the volume of the
hypercube and that of the hypersphere with dimension
Further, supposing 1,000,000 points distributes in a 256-
dimension hypercube and each dimension is divided into 2 units,
2 256 = 1.16 X 10 77 units cubes are generated which exceed
extremely the number of data points. For uniform datasets, there
aren’t points in the most of the units, and the minimum
boundary hyperscube doesn’t reflect the distribution of high
dimensional data as well as the minimum boundary hypersphere.
Figure 3. The dimension changes with the ratio of the radius
threshold and the radius of the hypersphere when the
probability of data distribution in interior surface reaches 80%
From the analysis, we conclude (1) high dimensional data point
is very sparse in high dimension space, and traditional partition
based indices don’t reflect the distribution of high dimensional
data well; (2) high dimensional data points incline to locate the
interior surface, and with the increase of the dimension, the
tendency is more notable. The related work includes (Yue Li,
2004), (Jinhua Li, 2001), (Tolga Bozkaya, 1997) and so on, but
the researches don’t involve the quantified analysis and give the