Two parameters are needed to build the quad-tree: one is the dataset to be placed in the hierarchical structure; the other is the number of desired cells at the lowest level. Building the quad-tree takes two steps: first, calculate the statistical information of the bottom-layer cells directly from the given dataset; then, calculate the statistical information of each higher-layer cell from the cells of the layer below. The dataset only needs to be scanned once to calculate the parameters associated with the grid cells at the bottom level, so the overall computation time is nearly proportional to the number of cells. That is, query processing complexity is O(k) rather than O(n), where k denotes the number of cells. We analyze performance in more detail in later sections.
The process of creating the tree structure is as follows:
Input:
D //Data to be placed in the hierarchical structure
k //Number of desired cells at the lowest level
Output:
T //Tree
CAESA BUILD Algorithm
// Create the empty tree from the top down
T = root node with data values initialized; // initially only the root node
i = 1;
repeat
    for each node in level i do
        create 4 children nodes with initial values;
    end for
    i = i + 1;
until 4^i = k;
// Populate the tree from the bottom up
for each item in D do
    determine the leaf node j associated with the position of the item;
    update values of j based on attribute values in the item;
end for
i = log4(k);
repeat
    i = i - 1;
    for each node j in level i do
        update values of j based on attribute values in its 4 children;
    end for
until i = 1;
End of CAESA BUILD Algorithm
Figure 5: CAESA BUILD algorithm
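For concreteness, the following is a minimal Python sketch of the build step under the same assumptions as the pseudocode: a complete quad-tree over the unit square, k a power of 4, and a single point count kept as the per-cell statistic. The names QuadNode, build_tree and _aggregate are chosen here for illustration; they are not the paper's code.

import math
import random

class QuadNode:
    """One cell of the quad-tree, holding simple per-cell statistics."""
    def __init__(self, x0, y0, size):
        self.x0, self.y0, self.size = x0, y0, size  # cell extent
        self.count = 0        # number of points in this cell
        self.children = []    # 4 children; empty for bottom-level cells

def _aggregate(node):
    """Compute each inner cell's statistics from its 4 children."""
    if node.children:
        node.count = sum(_aggregate(c) for c in node.children)
    return node.count

def build_tree(data, k):
    """Build a quad-tree with k bottom-level cells over [0,1) x [0,1)."""
    levels = round(math.log(k, 4))   # depth below the root
    root = QuadNode(0.0, 0.0, 1.0)
    # Top-down: create the empty tree, 4 children per node.
    frontier = [root]
    for _ in range(levels):
        nxt = []
        for node in frontier:
            half = node.size / 2
            node.children = [QuadNode(node.x0 + dx * half,
                                      node.y0 + dy * half, half)
                             for dy in (0, 1) for dx in (0, 1)]
            nxt.extend(node.children)
        frontier = nxt
    # One pass over the data updates the bottom-level cells only.
    for x, y in data:
        node = root
        while node.children:
            half = node.size / 2
            ix = 1 if x >= node.x0 + half else 0
            iy = 1 if y >= node.y0 + half else 0
            node = node.children[iy * 2 + ix]
        node.count += 1
    _aggregate(root)  # then roll the statistics up, bottom to top
    return root

# Example: 1000 random points, 64 = 4^3 bottom-level cells.
points = [(random.random(), random.random()) for _ in range(1000)]
tree = build_tree(points, 64)
print(tree.count)  # 1000: the root aggregates all leaf counts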
Once the user changes his or her focus scope, the clustering granularity is adjusted according to the criterion that the relative clustering granularity should be close to, but not lower than, the relative visualization resolution of the visualization window. With the new clustering granularity, the focused data space is re-divided, and the grid-based clustering algorithm is then applied to cluster the focused data. Note that the new clustering granularity must be n times the minimum clustering granularity, where n is a positive natural number. Because sufficient information (such as coordinate locations, relevant coefficients, etc.) has already been stored, the former result can easily be reused to answer the new query.
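One simple way to satisfy that criterion is to round the required granularity up to the next multiple of the minimum. The helper below is a sketch of that rule; the function name and parameters are illustrative assumptions, not the paper's code.

import math

def new_granularity(min_granularity, window_resolution):
    """Smallest multiple n * min_granularity that is not lower than the
    window's relative visualization resolution."""
    n = max(1, math.ceil(window_resolution / min_granularity))
    return n * min_granularity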
The process of query answering is as follows:
Input:
T //Tree
Q //Query
Output:
R //Regions of relevant cells
CAESA QUERY Algorithm
QUEUE q;
i = 1;
repeat
    for each node j in level i do
        if cell j is relevant to Q
            mark this cell as such;
            q.add( j ); // add cell j to the queue so as to examine its neighbors
        end if
    end for
    i = i + 1; // go on with the next level
until all layers in the tree have been visited;
for each neighbor cell i of the cells in q such that i is not in q do
    if density( i ) >= M_DENSITY
        identify neighboring cells of relevant cells to create regions of cells;
    end if
end for
D1 = R.data();
calculate the new clustering granularity k1;
call the CAESA BUILD Algorithm with parameters D1 and k1;
End of CAESA QUERY Algorithm
Figure 6: CAESA QUERY algorithm
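In the same spirit, here is a hedged sketch of the neighbor-expansion part of the query step, working on a flat grid of per-cell densities rather than the full tree. The breadth-first expansion, the 4-neighborhood, and the names query_regions and m_density are assumptions of this sketch, not the paper's exact procedure.

from collections import deque

def query_regions(density, relevant, m_density):
    """Grow regions of cells around the cells marked relevant to the query.

    density:   2D list of per-cell densities
    relevant:  set of (row, col) cells relevant to the query
    m_density: density threshold; neighbors below it are ignored
    """
    rows, cols = len(density), len(density[0])
    region = set(relevant)
    q = deque(relevant)
    while q:
        r, c = q.popleft()
        # 4-neighborhood; diagonal neighbors could be added the same way.
        for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            if (0 <= nr < rows and 0 <= nc < cols
                    and (nr, nc) not in region
                    and density[nr][nc] >= m_density):
                region.add((nr, nc))
                q.append((nr, nc))
    return region

# Example: 3x3 grid, one relevant cell in the centre.
grid = [[0, 2, 0],
        [2, 5, 2],
        [0, 2, 0]]
print(sorted(query_regions(grid, {(1, 1)}, m_density=2)))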
To handle very large databases, the algorithm operates on units instead of the original data objects. To obtain these units, the data space is first partitioned to allocate data points into cells, and statistical information is collected for each cell. Each cell is then tested to see whether it qualifies as a unit. A unit is defined as a cell whose density is greater than a certain predefined threshold; we therefore determine whether a cell is a unit by its density. If the density of a cell is M (M >= 1) times the average density of all data points, we regard such a cell as a unit. Some low-density cells may in fact be part of a final cluster yet fail to qualify as units; we treat the data points in such cells as separate sub-clusters. This method greatly reduces the time complexity, as will be shown in the experiments.
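As a minimal sketch of that test (the name is_unit and the cell statistics passed in are illustrative assumptions), a cell qualifies as a unit when its density reaches M times the average point density:

def is_unit(cell_count, cell_area, total_count, total_area, m=1.0):
    """A cell is a unit if its density is at least M (M >= 1) times the
    average density of all data points over the whole space."""
    cell_density = cell_count / cell_area
    average_density = total_count / total_area
    return cell_density >= m * average_density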
The clusters identified by this algorithm are denoted by their representative points. Usually, a user may want to know detailed information about the clusters and the data points they include. For each data point, we do not need to test which cluster it belongs to; instead, we only need to find the cluster its cell belongs to, and all data points in that cell then belong to that cluster. This also speeds up the whole clustering process.
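The resulting point labelling is then one lookup per point rather than a per-point cluster test; a sketch, where cell_of and the cell-to-cluster map are illustrative assumptions:

def label_points(points, cell_of, cluster_of_cell):
    """Assign each point the cluster of its cell.

    points:          iterable of (hashable) data points
    cell_of:         function mapping a point to its cell id
    cluster_of_cell: dict mapping cell id -> cluster id
    """
    return {p: cluster_of_cell.get(cell_of(p)) for p in points}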
5. PERFORMANCE EVALUATION
We conduct several experiments to evaluate the performance of the proposed algorithm. The following experiments are run on a PC with the Windows 2000 operating system and 256 MB of memory.