112
The International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences. Vol. XXXVII. Part B4. Beijing 2008
primary sequence scan, so the crisis of dimension is not avoided
in the kind of index structures. Then, when the dimension is
very high, can we give up the idea of partition based index, and
implement the quick retrieval by reducing the cost of sequence
scan?
4.1 VA-File index
Approximation based indices are implemented through this idea,
and can receive good performance in the case of high
dimension. The representative is the VA-File (Roger Weber,
1998). The main idea is to divide every dimension of data space
into some segments, form some subspaces or cells, and code
these cells through bit strings. These cells are represented
approximately by corresponding bit strings or vector
approximation.
vector data
11
10
01
00
Figure 4. VA-File ( Roger Weber, 1998)
From the point of view of approximation, the other category try
to achieve a more accurate approximate vector and to improve
the secondary retrieval performance. The kind of indices
include VA+-File which uses the KL transformation to
eliminate the correlation of every components, and allocate
different number of bits to every dimension according to the
different energy of every dimension after KL transformation
(Hakan Ferhatosmanoglu, 2000); and IQ index which
introduces minimum bounding rectangles to VA-File, and
obtains the data point density in MBRs, and then allocates
different number of bits to the dimensions of different point
density (Stefan Berchtold , 2000); and the local polar
coordinate file (LPC-File) which approximates vector by polar
coordinates on the partitioned local cells and enhances
significantly discriminatory power of the approximation
(Guang-Ho Cha, 2000). From the point of view of the strategy
of bits distribution and organization of approximate vector
respectively, these methods improve the primary filtering and
secondary retrieval efficiency and can achieve better
performance that VA-File.
From the analysis, we can find the kind of indices complete the
quick retrieval by making the cost of sequence scan decrease.
For the storage space of approximation vector is smaller than
the primary vector, and then the I/O times needed to access
decreases sharply. The kind of indices avoid successfully the
“dimension crisis” phenomenon, and is suitable for the
application of similarity search where the dimension is very
large, usually from over dozens dimension to more than one
hundred dimension.
The creation of the VA-File is considered as the process of
building the representation of the approximations. If the number
of bits to allocate to each dimension is bj, and the total number
of bits. in vector approximation is
bj = b , the per-
dimension partition position can be denoted as ( Pj[0],
Pj[ 1 ], , Pj[2 bj ]), and the data space is divided into 2 b hyper-
rectangular cells, each of which can be represented as a unique
bit-string of length b. Each data point is approximated by the
bit-string of the cell into which it falls. Figure 4 illustrates this
for five sample points.
Similarity search can be performed by scanning the entire
approximation file, and by excluding the vast majority of
vectors from the search based only on these approximations
(Roger Weber, 1998).
4.2 Improved VA-File index
In addition, the kind of indices includes still the improved VA-
File indices. Note that VA-File index all the same uses the
strategy of two levels “filtering-refine” to complete the high
dimensional retrieval. Consequently, the improvement of VA-
File can be done from two aspects.
In view of this, we classify the improved VA-File into two
categories according to the strategy. (1) From the perspective of
the organization method of approximate vector, the first
category reorganizes approximate vectors according to some
specific rule and makes the cost to scan the approximate vector
decrease. The primary filtering performance of the index
improves for it is easier to find the approximate vector. The
kind of indices involves sorting approximate vectors based on
principle component (Cui Jiangtao, 2005) and using R-tree to
index approximate vector (Cui Jiangtao, 2005) and so on; (2)
5. DISTANCE BASED INDICES
Spatial object in spatial data similarity search is usually
represented as high dimension feature vector, and one of
important trait is that distance calculation cost or CPU cost is
expensive. Partition based indices divide the data space
according to coordinate information without consideration of
distance cost, and the performance cannot be guaranteed in the
case of high dimension. And then, is it available to decrease the
cost of distance calculation by reducing the number of feature
vectors which are necessary to similarity match calculation, and
accordingly complete the efficient retrieval of high dimension
data?
Distance-based indices utilize the distance properties or
measure properties in metric space to index high dimension data.
The formal description of metric space is can be given as
follows: The set X denotes the universe of valid objects, and U
is a subset of it, a given set of data objects where we search.
There is a function d : X X X —> R, which satisfies following
three axioms.
(1) Positiveness,
Vx, y G X , d(x, y) > 0 , when and only when
X = y , d(x,y) = 0 ; when X * y , d(x,y) > 0
(2) Symmetry,
Vx,j> e X ,d(x,y) = d(y,x),
(3) Triangular inequality,
\/x,y,z e X, d(x,y) < d(x,z) + d{y,z).
Then the pair (X, d) is called metric space.