Data clustering is a difficult problem in unsupervised pattern recognition, as the clusters in a data set may have different shapes and sizes. In the context of clustering techniques, the following terms are used in this paper (Jain et al., 1999):
• A pattern (or feature vector), z, is a single object or data
point used by the clustering algorithm.
• A feature (attribute) is an individual component of a
pattern.
• A cluster is a set of similar patterns, and patterns from
different clusters are not similar.
• A distance measure is a metric used to evaluate the
similarity of patterns.
The clustering problem can be formally defined as follows (Jain et al., 1999): given a data set $Z = \{z_1, z_2, \ldots, z_p, \ldots, z_{N_p}\}$, where $z_p$ is a pattern in the $N_d$-dimensional feature space and $N_p$ is the number of patterns in $Z$, the clustering of $Z$ is the partitioning of $Z$ into $K$ clusters $\{C_1, C_2, \ldots, C_K\}$ satisfying the following conditions:
• Each pattern should be assigned to a cluster, i.e.
$\bigcup_{k=1}^{K} C_k = Z$
• Each cluster has at least one pattern assigned to it, i.e.
$C_k \neq \emptyset, \quad k = 1, \ldots, K$
• Each pattern is assigned to one and only one cluster, i.e.
$C_k \cap C_j = \emptyset$, where $k \neq j$
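When a clustering is represented by a per-pattern label array (a common convention assumed here for illustration, not prescribed by the paper), these conditions can be checked with a few lines of Python:

```python
import numpy as np

def is_valid_partition(labels, K):
    """Check the partition conditions for a labelling of N_p patterns."""
    labels = np.asarray(labels)
    assigned = np.all((labels >= 0) & (labels < K))   # every pattern is assigned
    non_empty = len(np.unique(labels)) == K           # no cluster is empty
    # Exclusivity holds by construction: `labels` stores exactly one cluster
    # index per pattern, so no pattern can belong to two clusters.
    return assigned and non_empty

print(is_valid_partition([0, 1, 1, 2, 0], K=3))   # True
print(is_valid_partition([0, 1, 1, 1, 0], K=3))   # False: cluster 2 is empty
```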
As previously mentioned, clustering is the process of identifying natural groupings, or clusters, in multidimensional data based on a similarity measure in the feature space. Hence, similarity measures are fundamental components of most clustering algorithms (Jain et al., 1999). The most popular way to evaluate similarity is to use a distance measure. The most widely used distance measure is the Euclidean distance, defined as:
$d(z_i, z_j) = \sqrt{\sum_{k=1}^{N_d} (z_{i,k} - z_{j,k})^2} = \| z_i - z_j \|$  (1)
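As a quick illustration of Eq. (1), the following sketch computes the Euclidean distance between two patterns with NumPy (the example coordinates are made up for illustration):

```python
import numpy as np

z_i = np.array([1.0, 2.0, 3.0])   # pattern z_i in a 3-dimensional feature space
z_j = np.array([4.0, 6.0, 3.0])   # pattern z_j

# Eq. (1): square root of the summed squared component differences
d = np.sqrt(np.sum((z_i - z_j) ** 2))
assert np.isclose(d, np.linalg.norm(z_i - z_j))   # identical to the Euclidean norm
print(d)   # 5.0
```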
Generally, clustering algorithms can be categorized into
partitioning methods, hierarchical methods, density-based
methods, grid-based methods, and model-based methods. An
excellent survey of clustering techniques can be found in
(Kotsiantis and Pintelas, 2004). In this paper, the focus will be
on the partitional clustering algorithms. Partitional clustering
algorithms divide the data set into a specified number of
clusters and then evaluate them by some criteria. These
algorithms try to minimize certain criteria (e.g. a square error
function) and can therefore be treated as optimization problems
(Harvey et al., 2002; Omran et al., 2005; Wilson et al., 2002). The most widely used partitional algorithm in clustering techniques is the iterative k-means approach (Kotsiantis and Pintelas, 2004). The objective function J that k-means optimizes is:
$J_{k\text{-means}} = \sum_{k=1}^{K} \sum_{\forall z_p \in C_k} d^2(z_p, m_k)$  (2)
where $m_k$ is the centroid of the $k$-th cluster. The membership and weight functions $u$ for k-means are defined as:
$u(m_k \mid z_p) = \begin{cases} 1 & \text{if } d^2(z_p, m_k) = \min_j \{ d^2(z_p, m_j) \} \\ 0 & \text{otherwise} \end{cases}$  (3)
Consequently, the k-means method minimizes the intra-cluster distance. The k-means algorithm starts with K centroids (initial
values are randomly selected or derived from a priori
information). Then, each pattern z p in the data set is assigned to
the closest cluster (i.e. closest centroid). Finally, the centroids
are recalculated according to the associated patterns. This
procedure is repeated until convergence is achieved.
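A minimal sketch of this iterative procedure in Python/NumPy is given below; the function name, tolerance and random initialisation are illustrative assumptions rather than details taken from the paper. The assignment step corresponds to Eq. (3) and the objective being minimised to Eq. (2).

```python
import numpy as np

def kmeans(Z, K, max_iter=100, tol=1e-6, seed=0):
    """Minimal k-means sketch; Z is an (N_p, N_d) array of patterns."""
    Z = np.asarray(Z, dtype=float)
    rng = np.random.default_rng(seed)
    # Start with K centroids chosen randomly from the patterns (m_k in Eqs. 2-3)
    m = Z[rng.choice(len(Z), size=K, replace=False)].copy()
    prev_J = np.inf
    for _ in range(max_iter):
        # Assignment step (Eq. 3): each pattern joins its closest centroid
        d2 = ((Z[:, None, :] - m[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Objective (Eq. 2): sum of squared intra-cluster distances
        J = d2[np.arange(len(Z)), labels].sum()
        # Update step: recompute each non-empty cluster's centroid
        for k in range(K):
            if np.any(labels == k):
                m[k] = Z[labels == k].mean(axis=0)
        if abs(prev_J - J) < tol:   # stop when the objective no longer improves
            break
        prev_J = J
    return labels, m, J
```

For example, `labels, centroids, J = kmeans(points, K=3)` would partition an (N_p, N_d) array of points into three clusters; the result depends on the random initial centroids, which is the local optimum problem discussed next.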
It is known that the k-means algorithm may converge to locally optimal solutions, depending on the choice of the initial cluster centres. Genetic algorithms potentially have a greater ability to avoid the local optima in which the localised search employed by most clustering techniques can become trapped. Maulik and Bandyopadhyay (2004) proposed a genetic algorithm-based clustering technique, called GA-clustering, which has proven effective in finding optimal clusters. With this algorithm, solutions
(typically, cluster centroids) are represented by bit strings. The
search for an appropriate solution begins with a population, or
collection, of initial solutions. Members of the current
population are used to create the next generation population
by applying operations such as random mutation and
crossover. At each step, the solutions in the current population are evaluated relative to some measure of fitness (which is typically inversely proportional to d), and the fittest solutions are selected probabilistically as seeds for producing the next generation. The process performs a generate-and-test beam search of the solution space, in which variants of the best current solutions are most likely to be considered next.
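As a concrete illustration of this generate-and-test search, the sketch below implements a GA-clustering loop under simplifying assumptions: the parameter names (pop_size, generations, mutation_rate) are illustrative, individuals are encoded as real-valued centroid sets rather than the bit strings used by Maulik and Bandyopadhyay (2004), and fitness is taken as the inverse of the k-means objective of Eq. (2).

```python
import numpy as np

def ga_clustering(Z, K, pop_size=20, generations=50, mutation_rate=0.1, seed=0):
    """Sketch of GA-based clustering: each individual is a set of K centroids."""
    Z = np.asarray(Z, dtype=float)
    rng = np.random.default_rng(seed)
    n = len(Z)
    # Initial population: each individual is K patterns drawn at random as centroids
    pop = np.stack([Z[rng.choice(n, K, replace=False)] for _ in range(pop_size)])

    def objective(m):
        # k-means objective (Eq. 2): squared distance of each pattern to its closest centroid
        d2 = ((Z[:, None, :] - m[None, :, :]) ** 2).sum(axis=2)
        return d2.min(axis=1).sum()

    for _ in range(generations):
        J = np.array([objective(m) for m in pop])
        fitness = 1.0 / (J + 1e-12)          # fitness inversely proportional to the objective
        probs = fitness / fitness.sum()
        # Probabilistic (roulette-wheel) selection of parents, fittest favoured
        parents = pop[rng.choice(pop_size, size=pop_size, p=probs)]
        children = parents.copy()
        # Single-point crossover between paired parents over their K centroids
        for i in range(0, pop_size - 1, 2):
            cut = rng.integers(1, K)
            children[i, cut:] = parents[i + 1, cut:]
            children[i + 1, cut:] = parents[i, cut:]
        # Random mutation: perturb a fraction of centroid coordinates
        mask = rng.random(children.shape) < mutation_rate
        children += mask * rng.normal(scale=Z.std(axis=0), size=children.shape)
        pop = children
    J = np.array([objective(m) for m in pop])
    return pop[J.argmin()]                   # best set of centroids found
```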
In the next section, an alternative clustering method to solve the local optimum problem of the k-means algorithm is described.
The applied method adopts the artificial swarm bees algorithm
as it has proved to give a more robust performance than other
intelligent optimisation methods for a range of complex
problems (Pham, 2006).
3. CLUSTERING OF LIDAR DATA USING SWARM
ARTIFICIAL BEE COLONY ALGORITHM
Swarm Intelligence (SI) is an innovative distributed intelligent
paradigm for solving optimization problems that originally
took its inspiration from biological examples of swarming,
flocking and herding phenomena. These techniques
incorporate swarming behaviours observed in flocks of birds,
schools of fish, or swarms of bees, and even human social
behaviour, from which the idea emerged (Omran et al., 2002, 2005; Paterlini and Krink, 2005; Pham et al., 2006; Wu and Shi, 2001). Data clustering and swarm intelligence may appear to have little in common. However, recent studies suggest that they can be used together for several real-world data clustering and mining problems, especially when other methods would be too expensive or difficult to implement.
Clustering approaches inspired by the collective behaviours of
ants have been proposed by Wu and Shi (2001) and Labroche et al. (2001). The main idea of these approaches is that artificial ants pick up items and drop them near similar items, resulting in the formation of clusters. Omran et al. (2002) proposed a particle swarm optimization (PSO) clustering
algorithm. The results of Omran et al. (2002, 2005) show that
PSO outperformed k-means, fuzzy c-means (FCM) and other state-of-the-art clustering algorithms. More recently, Paterlini and Krink (2005) compared the performance of k-means,
genetic algorithm (GA), PSO and Differential Evolution (DE)
for a representative point evaluation approach to partitional
clustering. The results show that GAs, PSO and DE
outperformed the k-means algorithm. Pham et al. (2006) used the artificial bee colony algorithm for clustering different data sets. The results of their work show that the proposed artificial bee colony algorithm performs better than both the standard k-means and the GA-based method. In general, the literature review of recent