Traditionally, the ML classification is used as the starting point of the contextual classification. We show that using non-parametric spectral classifications significantly improves the contextual classification performed by the ICM algorithm, which takes a spectral classification as starting point. Our proposal consists in using non-parametric spectral classifications as starting point, improving the accuracy of the conventional approach by eventually correcting the errors of the initial classification. Non-parametric spectral classifiers improve on the ML classifier and are therefore good candidates to initialize the contextual classifiers. Finally, the computational effort needed to perform a non-parametric spectral classification followed by a particular contextual classifier remains acceptable: the contextual classification effort is reduced, and thus so is the global computational cost with respect to a conventional spectral classification.

In section 2 we describe the classifiers considered in this work together with the data we have used. In section 3 we describe the experiments performed in this paper and in section 4 we discuss the results. The main conclusions are summarized in section 5.
In this work we study some combinations of spectral and contextual classifications. In order to select the best combinations of classifiers for the contextual classification, we have tested a wide range of spectral and contextual classifiers.

Spectral classifiers fall into two main categories: a) parametric classifiers, which assume the existence of an underlying model for the data, and b) non-parametric classifiers, which do not assume anything about it. Parametric classifiers are determined, basically, by the class-conditional probability density functions (pdf's) p(X|w_i). The supervised parametric classifiers characterize the spectral pattern of each class in
terms of its pdf. The form of that function is assumed to be known, so only a few parameters (estimated from the training set) are needed to characterize each class. The pdf's are usually multivariate Gaussian functions, so only two sets of parameters have to be estimated for each class: the mean vector, μ_i, and the covariance matrix, Σ_i.
The maximum likelihood classifier (ML classifier) imposes quadratic decision boundaries between the clusters of samples in the representation space. If a common covariance matrix is assumed, Σ_i = Σ for i = 1, 2, ..., J, the quadratic classifier reduces to a linear classifier, in which the decision boundaries are linear. The quadratic classifier is more sensitive to violations of the Gaussian assumption on the pdf's than the linear classifier [Lachenbruch, 1979], and it also requires a larger training set. It is well known that the Hughes effect [Hughes, 1968] arises in high-dimensional data when the training set is not large enough to estimate the covariance matrices properly.
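As a concrete illustration, the following sketch (Python with NumPy; the function names, the equal-prior assumption and the toy interface are ours, not from the paper) evaluates the Gaussian log-likelihood of a sample for each class and assigns the most likely one, which is the quadratic decision rule just described; forcing a single pooled covariance matrix for all classes would give the linear variant.

    # Sketch of the Gaussian ML (quadratic) classifier; equal priors assumed.
    import numpy as np

    def fit_gaussian_ml(X, y):
        """Estimate a mean vector and a covariance matrix per class."""
        return {c: (X[y == c].mean(axis=0), np.cov(X[y == c], rowvar=False))
                for c in np.unique(y)}

    def classify_ml(x, params):
        """Assign x to the class with the largest Gaussian log-likelihood
        (quadratic decision boundaries between classes)."""
        best, best_score = None, -np.inf
        for c, (mu, cov) in params.items():
            diff = x - mu
            # log N(x; mu, cov) up to a constant common to all classes
            score = (-0.5 * np.log(np.linalg.det(cov))
                     - 0.5 * diff @ np.linalg.solve(cov, diff))
            if score > best_score:
                best, best_score = c, score
        return best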
Adopting a quadratic or a linear classifier imposes an extreme degree of adjustment of the decision boundaries to the training samples. When the training samples overlap heavily in the representation space, neither is a good choice, and it is reasonable to allow a wider degree of adjustment, that is, a wider family of possible decision boundaries. This can be achieved with the regularized discriminant analysis classifier (RDA classifier) proposed by Friedman [Friedman, 1989].
RDA defines a wide family of parametric classifiers that includes the quadratic ML classifier and the linear classifier as particular cases. The covariance matrices are estimated through a regularization process governed by two parameters, whose values are adapted to each problem by jointly minimizing a sample-based estimate of future misclassification risk. Since the joint optimization of these parameters minimizes the cross-validation error, the technique is particularly convenient for problems with small training sets.
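To make the regularization step tangible, here is a minimal sketch, in the spirit of Friedman's proposal, of how a class covariance estimate is shrunk by two parameters; the function name and interface are illustrative, and in RDA the two parameters are chosen jointly by minimizing a cross-validated estimate of the misclassification risk.

    import numpy as np

    def rda_covariance(S_k, n_k, S_pool, n_pool, lam, gamma):
        """Regularized class covariance in the spirit of Friedman's RDA.
        S_k, S_pool are scatter matrices (sums of outer products of centered
        samples) for the class and for the whole training set; n_k, n_pool are
        the corresponding sample counts; lam, gamma lie in [0, 1]."""
        # Step 1: shrink the class covariance towards the pooled covariance.
        cov_lam = ((1 - lam) * S_k + lam * S_pool) / ((1 - lam) * n_k + lam * n_pool)
        # Step 2: shrink towards a multiple of the identity (average eigenvalue).
        p = cov_lam.shape[0]
        return (1 - gamma) * cov_lam + gamma * (np.trace(cov_lam) / p) * np.eye(p)

With lam = gamma = 0 this reduces to the class covariance of the quadratic classifier, and with lam = 1, gamma = 0 to the pooled covariance of the linear classifier, so the two parameters indeed span the family of classifiers mentioned above.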
In most classification applications, the assumption that the forms of the underlying density functions are known is unrealistic. The common parametric forms rarely fit the densities actually encountered in practice; for instance, the parametric models handle unimodal densities, whereas many practical problems exhibit multimodal densities [Duda & Hart, 1973]. The only available information is the training set, and the classification rules must be built from it alone, with no additional assumptions.
There is a wide variety of non-parametric spectral classifiers, which can be grouped into three main categories. The first approach uses non-parametric statistical techniques to estimate p(X|w_i) (nearest neighbor and kernel estimation techniques [Duda & Hart, 1973], [Parzen, 1962], [Devijver & Kittler, 1982]). The second estimates the a posteriori probability p(w_i|X) directly (nearest neighbor classification rules [Devijver & Kittler, 1982]), and the third recursively splits the representation space by means of binary questions on the values of the variables involved (classification trees). In this work we have adopted two different approaches because of their wide use, accuracy and well-known behaviour: we have used CART [Breiman et al., 1984]
as the classification-tree-based technique, and the nearest neighbor classification rules [Devijver & Kittler, 1982] applied to several reference sets built by different learning algorithms, as described below.
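As an illustration of the first of these categories (non-parametric density estimation), a Gaussian Parzen window estimate of p(X|w_i) can be sketched as follows; this is a generic example with an assumed window width h, not the particular estimator adopted in this work.

    import numpy as np

    def parzen_class_density(x, X_class, h):
        """Gaussian-kernel (Parzen) estimate of p(x | w_i) from the training
        samples of class w_i; h is an assumed window-width parameter."""
        d = X_class.shape[1]
        u = (X_class - x) / h
        kernels = np.exp(-0.5 * np.sum(u ** 2, axis=1)) / ((2 * np.pi) ** (d / 2) * h ** d)
        return kernels.mean()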
One of the most popular and widely used non-parametric classification rules is the k nearest neighbor rule (k-NN). In its simplest formulation, given a training set T, a metric δ on the representation space and a sample X to classify, the k-NN rule searches for the k nearest neighbors of X in T and assigns X to the most populated class among the selected neighbors. If k = 1 the k-NN rule is known as the nearest neighbor rule or 1-NN rule. In this work, 1-NN denotes the 1-NN classifier that searches for the nearest neighbor in the complete training set (as given by the experts).
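A minimal sketch of the rule, assuming the Euclidean distance as the metric δ (the rule itself works with any metric):

    import numpy as np
    from collections import Counter

    def knn_classify(x, X_ref, y_ref, k=1):
        """k-NN rule: find the k closest reference samples to x (Euclidean
        metric assumed) and return the most populated class among them.
        With k = 1 this is the 1-NN rule used in the text."""
        nearest = np.argsort(np.linalg.norm(X_ref - x, axis=1))[:k]
        return Counter(y_ref[nearest]).most_common(1)[0][0]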
The main drawback of the nearest neighbor rules in practical problems is that a large training set is required to assure the convergence of the k-NN rule [Devijver & Kittler, 1982]. There are two additional drawbacks: first, the rules are strongly influenced by incorrectly labeled training samples ("noisy" samples or outliers) and, second, the computational complexity of searching for the nearest neighbor(s) in T can be O(n²) or higher. To circumvent these problems it is possible to derive a reduced and representative reference set R from T and search for the nearest neighbor(s) in R, with an acceptable trade-off between classification accuracy and computational effort ([Devijver & Kittler, 1982], [Kohonen, 1990], [Geva & Sitte, 1991], among others). This can be done in two ways: a) by editing-condensing techniques or b) by adaptive learning techniques. In the first case R ⊂ T, whereas in the second case there is no explicit relation between both sets.
The aim of editing-condensing techniques is twofold: to improve the accuracy of the classification by removing samples located in overlapping acceptance regions (editing techniques) and to decrease the computational effort required to find the nearest neighbor(s) (condensing techniques). Editing techniques take the original training set as input and produce a subset of it as output. Condensing techniques usually take the edited training set as input and produce a subset of the previously edited set as output. R is thus a reduced (sometimes drastically) and representative version of T. The joint application of these techniques improves the trade-off between the accuracy of the 1-NN classification and the computational effort it requires. A different approach is given by adaptive learning techniques. Adaptive learning is a powerful alternative to classical editing-condensing techniques, as it allows the reference set size to be fixed in advance. In this case the reference set is not usually a subset of the training set. Adaptive learning algorithms can be tuned by means of a set of parameters, so that the learning process can be supervised directly. The training samples are used to tune a fixed number of codebooks or prototypes, and the reference set is called the codebook set or the prototype set. Adaptive learning is performed in two sequential phases: initialization and learning. The prototype set is initially a subset of the training set, and the values of the prototypes are updated in an iterative learning process.
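Since [Kohonen, 1990] is cited among the reference-set techniques, an LVQ1-style update is a natural illustration of such an iterative learning phase; the sketch below is generic and not necessarily the exact algorithm used in this work (the learning rate alpha and the single-epoch interface are assumptions).

    import numpy as np

    def lvq1_epoch(prototypes, proto_labels, X_train, y_train, alpha=0.05):
        """One pass of an LVQ1-style update: each training sample attracts its
        nearest prototype if their labels agree and repels it otherwise."""
        for x, y in zip(X_train, y_train):
            j = int(np.argmin(np.linalg.norm(prototypes - x, axis=1)))
            sign = 1.0 if proto_labels[j] == y else -1.0
            prototypes[j] += sign * alpha * (x - prototypes[j])
        return prototypes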
As editing algorithm we have chosen the multiedit algorithm [Devijver & Kittler, 1982], and as condensing algorithm we have adopted Hart's condensing algorithm.
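For illustration, a Hart-style condensing pass can be sketched as follows; seeding the reduced set with the first training sample and the simple termination criterion are simplifications of ours.

    import numpy as np

    def hart_condense(X, y):
        """Hart-style condensing: grow a subset R of the (edited) training set
        until every training sample is correctly classified by the 1-NN rule
        over R. The result is the reduced reference set."""
        keep = [0]                      # seed R with the first sample
        changed = True
        while changed:
            changed = False
            for i in range(len(X)):
                if i in keep:
                    continue
                nearest = keep[int(np.argmin(np.linalg.norm(X[keep] - X[i], axis=1)))]
                if y[nearest] != y[i]:  # misclassified by current R: add it
                    keep.append(i)
                    changed = True
        return X[keep], y[keep]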