International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences, Vol XXXV, Part B2. Istanbul 2004
distribution: the type of distribution that the attribute values in this cell follow (such as normal, uniform, exponential, etc.; NONE is assigned if the distribution type is unknown)
cellLoc: the location of this cell; it records the coordinate information associated with the objects in the data set
relevant: coefficient indicating how relevant this cell is to the given query
In our algorithm, the parameters of higher-level cells can easily be calculated from the parameters of lower-level cells. Let $n_i$, $m_i$, $s_i$, $d_i$, $min_i$, $max_i$, $cellLoc_i$, $layerNo_i$ and $dist_i$ be the parameters of the corresponding lower-level cells. The parameters of the current cell, $n_{i-1}$, $m_{i-1}$, $s_{i-1}$, $d_{i-1}$, $min_{i-1}$, $max_{i-1}$, $cellLoc_{i-1}$, $layerNo_{i-1}$ and $dist_{i-1}$, can be calculated as follows.
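\[ n_{i-1} = \sum_i n_i \qquad (1) \]
\[ d_{i-1} = \frac{\sum_i d_i \, n_i}{n_{i-1}} \qquad (2) \]
\[ m_{i-1} = \frac{\sum_i m_i \, n_i}{n_{i-1}} \qquad (3) \]
\[ s_{i-1} = \sqrt{\frac{\sum_i (s_i^2 + m_i^2) \, n_i}{n_{i-1}} - m_{i-1}^2} \qquad (4) \]
\[ min_{i-1} = \min_i (min_i) \qquad (5) \]
\[ max_{i-1} = \max_i (max_i) \qquad (6) \]
\[ layerNo_{i-1} = layerNo_i - 1 \qquad (7) \]
\[ cellLoc_{i-1} = \min_{cellNo_i} (cellLoc_i) \qquad (8) \]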
The calculation of $dist_{i-1}$ is much more complicated; here we adopt the method designed by Wang et al. (1997) in their STING algorithm.
i            1         2         3         4
n_i          80        10        90        60
m_i          18.3      17.8      18.1      18.5
d_i          0.8       0.5       0.7       0.2
s_i          1.8       1.6       2.1       1.9
min_i        2.3       5.8       1.2       4.2
max_i        27.4      54.3      62.8      48.5
layerNo_i    5         5         5         5
cellNo_i     85        86        87        88
cellLoc_i    (80,60)   (90,60)   (90,60)   (90,60)
dist_i       NORMAL    NONE      NORMAL    NORMAL
Table 1: Parameters of lower cells
According to the formulas presented above, we can easily calculate the parameters of the current cell:
$n_{i-1} = 240$
$m_{i-1} = 18.254$
$s_{i-1} = 1.943$
$d_{i-1} = 0.6$
$min_{i-1} = 1.2$
$max_{i-1} = 62.8$
$cellLoc_{i-1} = (80,60)$
$layerNo_{i-1} = 4$
$dist_{i-1} = NORMAL$
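The aggregation in Eqs. (1)-(8) can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the `Cell` type, its field names and the `aggregate` function are hypothetical, and the distribution type is left out because its calculation follows Wang et al. (1997). The child values are those of Table 1, with the second mean read as 17.8 and the first minimum as 2.3.

```python
from math import sqrt
from typing import NamedTuple, Tuple


class Cell(NamedTuple):
    """Parameters of one lower-level cell (hypothetical field names)."""
    n: int                     # number of objects in the cell
    m: float                   # mean of the attribute values
    s: float                   # standard deviation of the attribute values
    d: float                   # parameter d as listed in Table 1
    lo: float                  # minimum attribute value
    hi: float                  # maximum attribute value
    layer_no: int
    cell_no: int
    cell_loc: Tuple[int, int]


def aggregate(children):
    """Combine the parameters of non-empty child cells, following Eqs. (1)-(8)."""
    n = sum(c.n for c in children)                                    # Eq. (1)
    d = sum(c.d * c.n for c in children) / n                          # Eq. (2)
    m = sum(c.m * c.n for c in children) / n                          # Eq. (3)
    second_moment = sum((c.s ** 2 + c.m ** 2) * c.n for c in children) / n
    s = sqrt(max(second_moment - m ** 2, 0.0))                        # Eq. (4)
    first = min(children, key=lambda c: c.cell_no)  # child with smallest cellNo
    return {
        "n": n,
        "d": d,
        "m": m,
        "s": s,
        "min": min(c.lo for c in children),                           # Eq. (5)
        "max": max(c.hi for c in children),                           # Eq. (6)
        "layerNo": children[0].layer_no - 1,                          # Eq. (7)
        "cellLoc": first.cell_loc,                                    # Eq. (8)
    }


CHILDREN = [
    Cell(80, 18.3, 1.8, 0.8, 2.3, 27.4, 5, 85, (80, 60)),
    Cell(10, 17.8, 1.6, 0.5, 5.8, 54.3, 5, 86, (90, 60)),
    Cell(90, 18.1, 2.1, 0.7, 1.2, 62.8, 5, 87, (90, 60)),
    Cell(60, 18.5, 1.9, 0.2, 4.2, 48.5, 5, 88, (90, 60)),
]

parent = aggregate(CHILDREN)
print(parent)
```

With these inputs the sketch gives a mean of about 18.254 and a standard deviation of about 1.943, in line with the values reported for the current cell.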
The advantages of this approach are:
1. It is a query-independent approach since the statistical
information stored in each cell represents the summary
information of the data in the cell, which is independent
of the query.
2. The computational complexity is O(K), where K is the
number of cells at the lowest level. Usually, K ≪ N,
where N is the number of objects.
3. The cell structure facilitates parallel processing and
incremental updating.
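As an illustration of the incremental-updating point, the sketch below shows one way a cell's count, mean, standard deviation, minimum and maximum could be refreshed in constant time when a single object is added. It treats the new object as a one-element cell merged via Eqs. (1), (3) and (4). The function name `insert_value` and its signature are assumptions made for this example; the paper does not specify an update procedure.

```python
from math import sqrt


def insert_value(n, m, s, lo, hi, x):
    """Return updated (n, m, s, min, max) after adding one value x to a cell.

    Uses the same population-style moments as Eq. (4); no rescan of the
    objects already summarised by the cell is needed.
    """
    n_new = n + 1
    m_new = (n * m + x) / n_new
    # Mean of squares before the insert is s^2 + m^2; fold x^2 into it.
    second_moment = (n * (s * s + m * m) + x * x) / n_new
    variance = max(second_moment - m_new * m_new, 0.0)  # guard rounding error
    return n_new, m_new, sqrt(variance), min(lo, x), max(hi, x)


# A cell summarising the values {2, 4}: n = 2, m = 3, s = 1.
updated = insert_value(2, 3.0, 1.0, 2.0, 4.0, 6.0)
print(updated)
```

After updating a lowest-level cell in this way, the parameters of its ancestors could be recomputed from their children with the aggregation formulas, so only one path of the hierarchy is touched.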
3.2 Focus-changing Clustering
Focus-changing clustering means that CAESA can provide the user with the corresponding information when the dataset of interest changes. Take exploratory data analysis as an example: the user may first want to see a global view of the processed dataset; then her/his interest may turn to a smaller part of the dataset to see some details, and so on. A simple way to support focus-changing clustering is to cluster the focused data from scratch each time, in the fashion of current clustering algorithms. Obviously, such an approach is time-consuming and inefficient. A better solution is to design a clustering algorithm that carries out focus-changing clustering in an integrated framework, so that clustering time and I/O cost are reduced and clustering flexibility is enhanced.
For example, suppose the user wants to select the maximal regions that have at least 100 houses per unit area, in which at least 70% of the house prices are above $200,000, and whose total area is at least 100 units, with 90% confidence. We can describe this query in an SQL-like language as follows:
SELECT REGION
FROM house-map
WHERE DENSITY IN (100, ∞)
AND price RANGE (200000, ∞)
WITH PERCENT (0.7, 1)
AND AREA (100, ∞)
AND WITH CONFIDENCE 0.9;
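To make concrete how such a query might be answered from per-cell summaries rather than from the raw objects, here is a simplified sketch. It is not CAESA's procedure: the `CellStats` type, the functions `fraction_above` and `relevant_cells`, and the sample values are all hypothetical. Prices in each cell are assumed to be normally distributed, and the AREA and CONFIDENCE clauses are ignored. The point is only that changing the thresholds re-reads the stored statistics, not the dataset.

```python
from math import erfc, sqrt
from typing import NamedTuple


class CellStats(NamedTuple):
    """Summary of one cell (hypothetical fields and values)."""
    cell_no: int
    density: float      # houses per unit area
    mean_price: float
    std_price: float


def fraction_above(cell, threshold):
    """Estimated share of prices above threshold, assuming a normal distribution."""
    if cell.std_price == 0:
        return 1.0 if cell.mean_price > threshold else 0.0
    z = (threshold - cell.mean_price) / (cell.std_price * sqrt(2))
    return 0.5 * erfc(z)  # survival function of the normal distribution


def relevant_cells(cells, min_density, min_price, min_fraction):
    """Cell numbers meeting the density and price-percentage conditions."""
    return [
        c.cell_no
        for c in cells
        if c.density >= min_density
        and fraction_above(c, min_price) >= min_fraction
    ]


CELLS = [
    CellStats(1, 120, 260_000, 50_000),
    CellStats(2, 150, 180_000, 40_000),
    CellStats(3, 60, 450_000, 60_000),
    CellStats(4, 90, 210_000, 30_000),
]

# Thresholds of the query above, then a changed focus with other thresholds.
first_focus = relevant_cells(CELLS, 100, 200_000, 0.70)
second_focus = relevant_cells(CELLS, 50, 350_000, 0.85)
print(first_focus, second_focus)
```

Both calls work on the same list of cell summaries, which is the property that lets a change of focus avoid a rescan of the underlying objects.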
After getting this information, the user may want to see more detailed information, e.g., the maximal sub-regions that have at least 50 houses per unit area, in which at least 85% of the house prices are above $350,000, and whose total area is at least 80 units, with 80% confidence. The following query retrieves this information from the original dataset without scanning the whole dataset again.
SELECT SUB-REGION
FROM house-map
WHERE DENSITY IN (50, ∞)
AND price RANGE (350000, ∞)
WITH PERCENT (0.85, 1)