Proceedings, XXth congress

altan, m. orhan
International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences, Vol XXXV, Part B2. Istanbul 2004 
distribution: the type of distribution that the attribute values
in this cell follow (such as normal, uniform, exponential, etc.;
NONE is assigned if the distribution type is unknown)
cellLoc: the location of this cell; it records the coordinate
information associated with the objects in the data set
relevant: the coefficient of relevance of this cell to the given query
In our algorithm, parameters of higher-level cells can be easily
calculated from parameters of lower-level cells. Let n_i, m_i, s_i, d_i,
min_i, max_i, cellLoc_i, layerNo_i and dist_i be the parameters of the
corresponding lower-level cells. The parameters of the current cell,
n_{i-1}, m_{i-1}, s_{i-1}, d_{i-1}, min_{i-1}, max_{i-1}, cellLoc_{i-1},
layerNo_{i-1} and dist_{i-1}, can be calculated as follows.
n_{i-1} = \sum_i n_i    (1)
d_{i-1} = \frac{\sum_i d_i n_i}{n_{i-1}}    (2)
m_{i-1} = \frac{\sum_i m_i n_i}{n_{i-1}}    (3)
s_{i-1} = \sqrt{\frac{\sum_i (s_i^2 + m_i^2) n_i}{n_{i-1}} - m_{i-1}^2}    (4)
min_{i-1} = \min_i (min_i)    (5)
max_{i-1} = \max_i (max_i)    (6)
layerNo_{i-1} = layerNo_i - 1    (7)
cellLoc_{i-1} = \min_{cellNo_i} (cellLoc_i)    (8)
The calculation of dist_{i-1} is much more complicated; here we
adopt the method designed by Wang et al. (1997) in their STING
algorithm.
  
  
  
  
  
  
  
  
  
  
  
  
i           1         2         3         4
n_i         80        10        90        60
m_i         18.3      17.8      18.1      18.5
d_i         0.8       0.5       0.7       0.2
s_i         1.8       1.6       2.1       1.9
min_i       2.3       5.8       1.2       4.2
max_i       27.4      54.3      62.8      48.5
layerNo_i   5         5         5         5
cellNo_i    85        86        87        88
cellLoc_i   (80,60)   (90,60)   (90,60)   (90,60)
dist_i      NORMAL    NONE      NORMAL    NORMAL

Table 1: Parameters of lower-level cells
According to the formulas presented above, we can easily
calculate the parameters of the current cell:
n_{i-1} = 240
m_{i-1} = 18.254
s_{i-1} = 1.943
d_{i-1} = 0.6
min_{i-1} = 1.2
max_{i-1} = 62.8
cellLoc_{i-1} = (80,60)
layerNo_{i-1} = 4
dist_{i-1} = NORMAL
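The worked example can be checked with a short script. The following is a minimal sketch, not the authors' implementation: the dictionary keys and the function name `aggregate` are hypothetical, and the child values are read from Table 1 under the assumption that m_2 = 17.8, min_1 = 2.3 and min_3 = 1.2 (the readings consistent with the results reported in the text). dist and cellLoc are left out, since dist follows the separate method of Wang et al. (1997).

```python
import math

# Child-cell statistics as read from Table 1.
# Assumed readings: m_2 = 17.8, min_1 = 2.3, min_3 = 1.2.
children = [
    {"n": 80, "m": 18.3, "d": 0.8, "s": 1.8, "min": 2.3, "max": 27.4, "layerNo": 5},
    {"n": 10, "m": 17.8, "d": 0.5, "s": 1.6, "min": 5.8, "max": 54.3, "layerNo": 5},
    {"n": 90, "m": 18.1, "d": 0.7, "s": 2.1, "min": 1.2, "max": 62.8, "layerNo": 5},
    {"n": 60, "m": 18.5, "d": 0.2, "s": 1.9, "min": 4.2, "max": 48.5, "layerNo": 5},
]


def aggregate(cells):
    """Compute parent-cell parameters from child cells (equations 1 to 7)."""
    n = sum(c["n"] for c in cells)                                  # (1)
    d = sum(c["d"] * c["n"] for c in cells) / n                     # (2)
    m = sum(c["m"] * c["n"] for c in cells) / n                     # (3)
    # Weighted mean of the children's second moments, then subtract m^2.
    second_moment = sum((c["s"] ** 2 + c["m"] ** 2) * c["n"] for c in cells) / n
    s = math.sqrt(second_moment - m ** 2)                           # (4)
    return {
        "n": n,
        "d": d,
        "m": m,
        "s": s,
        "min": min(c["min"] for c in cells),                        # (5)
        "max": max(c["max"] for c in cells),                        # (6)
        "layerNo": cells[0]["layerNo"] - 1,                         # (7)
    }


parent = aggregate(children)
for key, value in parent.items():
    print(key, round(value, 3))
```

Rounded to three decimals, the script gives the same n, m, s, d, min, max and layerNo as listed above.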
The advantages of this approach are:
1. It is a query-independent approach since the statistical
information stored in each cell represents the summary
information of the data in the cell, which is independent
of the query.
2. The computational complexity is O(K), where K is the
number of cells at the lowest level. Usually, K << N,
where N is the number of objects.
3. The cell structure facilitates parallel processing and
incremental updating.
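To illustrate the incremental updating mentioned in point 3, the sketch below shows one way a cell's n, m and s could be updated when a single new object arrives, without rescanning the cell's data. It assumes that s is the population standard deviation, as equation (4) implies; the function name `insert_value` is hypothetical and the source does not describe the authors' actual update procedure.

```python
import math


def insert_value(n, m, s, x):
    """Return the updated (n, m, s) of a cell after adding one value x.

    Uses the same second-moment identity as equation (4):
    s^2 = E[x^2] - m^2, so E[x^2] can be recovered from (m, s).
    """
    n_new = n + 1
    m_new = (n * m + x) / n_new
    second_moment = (n * (s ** 2 + m ** 2) + x ** 2) / n_new
    # Guard against tiny negative values caused by floating-point rounding.
    s_new = math.sqrt(max(second_moment - m_new ** 2, 0.0))
    return n_new, m_new, s_new


# Usage: a cell summarizing the values 1, 2, 3 receives the value 4.
n, m, s = insert_value(3, 2.0, math.sqrt(2.0 / 3.0), 4.0)
print(n, m, s)
```

Because only the summary statistics are touched, the same update can be propagated to the parent cells along one path of the hierarchy.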
3.2 Focus-changing Clustering 
Focus-changing clustering means that CAESA can provide the user
with the corresponding information when the dataset of interest
changes. In exploratory data analysis, for example, the user
may first want to see a global view of the processed dataset,
then turn to a smaller part of the dataset to see some details,
and so on. A simple way to achieve focus-changing clustering is
to cluster the focused data from scratch each time, in the
fashion of current clustering algorithms. Obviously, such an
approach is time-consuming and inefficient. A better solution
is to design a clustering algorithm that carries out
focus-changing clustering in an integrated framework, so that
clustering time and I/O cost are reduced and clustering
flexibility is enhanced.
For example, suppose the user wants to select the maximal regions
that have at least 100 houses per unit area, in which at least
70% of the house prices are above $200,000, and whose total area
is at least 100 units, with 90% confidence. We can describe this
query in an SQL-like form:
SELECT REGION
FROM house-map
WHERE DENSITY IN (100, ∞)
AND price RANGE (200000, ∞)
WITH PERCENT (0.7, 1)
AND AREA (100, ∞)
AND WITH CONFIDENCE 0.9;
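How the stored cell statistics might be used to answer such a query can be sketched as follows. This is a deliberately simplified, hypothetical per-cell test covering only the DENSITY and price conditions: it assumes the price attribute of a cell is NORMAL with the stored mean m and standard deviation s, and it ignores the confidence level and the assembly of qualifying cells into regions for the AREA condition. The `Cell` class, its `area` field, and both function names are assumptions for illustration; the example cells are made up.

```python
import math
from dataclasses import dataclass


@dataclass
class Cell:
    n: int       # number of objects (houses) in the cell
    area: float  # cell area in map units (hypothetical field)
    m: float     # mean of the price attribute
    s: float     # standard deviation of the price attribute
    dist: str    # distribution type, e.g. "NORMAL" or "NONE"


def fraction_above(cell, threshold):
    """Estimate the share of values above threshold, assuming NORMAL.

    Returns None when the distribution type gives no basis for an estimate.
    """
    if cell.dist != "NORMAL" or cell.s <= 0:
        return None
    z = (threshold - cell.m) / cell.s
    # Survival function of the standard normal distribution.
    return 0.5 * math.erfc(z / math.sqrt(2))


def is_candidate(cell, min_density, min_price, min_percent):
    """Check the DENSITY and price/PERCENT conditions for one cell."""
    if cell.n / cell.area < min_density:
        return False
    frac = fraction_above(cell, min_price)
    return frac is not None and frac >= min_percent


# Made-up cells, tested against the first query's thresholds.
dense_expensive = Cell(n=150, area=1.0, m=250000.0, s=50000.0, dist="NORMAL")
dense_cheap = Cell(n=150, area=1.0, m=200000.0, s=50000.0, dist="NORMAL")
print(is_candidate(dense_expensive, 100, 200000, 0.7))
print(is_candidate(dense_cheap, 100, 200000, 0.7))
```

Since the test reads only the per-cell summaries, a later query with different thresholds can reuse the same cells without returning to the raw data.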
After getting this information, the user may want to see more
detailed information, e.g., the maximal sub-regions that have at
least 50 houses per unit area, in which at least 85% of the house
prices are above $350,000, and whose total area is at least 80
units, with 80% confidence. With the following query, we can get
the information we need without scanning the whole original
dataset again.
SELECT SUB-REGION
FROM house-map
WHERE DENSITY IN (50, ∞)
AND price RANGE (350000, ∞)
WITH PERCENT (0.85, 1)