[The title, abstract and opening of the introduction are only partially legible in the scanned source. The recoverable fragments indicate that the paper deals with contextual classification: non-parametric spectral classifications are proposed as starting points for contextual (ICM) classifiers, improving the accuracy of the conventional approach based on the ML classifier while keeping the global computational effort of the spectral classification followed by the contextual correction acceptable. Section 2 describes the classifiers and data used in this work, sections 3 and 4 present the experimental work and its results, and the main conclusions are given in section 5.]
In this work we have studied some combinations of spectral and contextual classifications. In order to find good combinations of classifiers for this purpose we have tested a wide variety of spectral and contextual classifiers.
Spectral classifiers fall into two main categories: a) parametric classifiers, which assume the existence of an underlying statistical model for the data, and b) non-parametric classifiers, which do not assume anything about it. A parametric classifier is determined, basically, by the probability density functions (pdf's) p(X|wi). The supervised parametric classifier characterizes the pattern of each class in
terms of its pdf. It is assumed that the form of that function is known, so only some parameters (estimated from the training set) need to be known in order to characterize each class. The pdf's are usually multivariate Gaussian functions, so only two sets of parameters have to be estimated for each class: the mean vector, μi, and the covariance matrix, Σi.
The maximum likelihood classifier (ML classifier) imposes quadratic decision boundaries between the clusters of samples in the representation space. If a common covariance matrix is assumed, Σi = Σ for i = 1, 2, ..., J, the quadratic classifier reduces to a linear classifier in which the decision boundaries have a linear form. The quadratic classifier is more sensitive to violations of the Gaussian assumption on the pdf's than the linear classifier [Lachenbruch, 1979], and the training set size required by a quadratic classifier is larger than that required by a linear classifier. It is well known that the Hughes effect [Hughes, 1968] arises in high-dimensional data when the training set size is not large enough to estimate the covariance matrices properly.
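As an illustration of this parametric approach, the sketch below estimates the per-class Gaussian parameters and applies the quadratic ML discriminant; it is only a rough sketch under the Gaussian assumption (the function names and the NumPy-based implementation are ours, not the authors'). Replacing each class covariance with a pooled estimate gives the linear variant discussed above.

```python
import numpy as np

def fit_gaussian_ml(X, y):
    """Estimate the mean vector, covariance matrix and prior of each class
    from the training set (X: n x d array of samples, y: n class labels)."""
    params = {}
    for c in np.unique(y):
        Xc = X[y == c]
        params[c] = (Xc.mean(axis=0), np.cov(Xc, rowvar=False), len(Xc) / len(X))
    return params

def classify_ml(x, params):
    """Quadratic ML discriminant:
    g_i(x) = -0.5*ln|Si| - 0.5*(x - mi)' Si^-1 (x - mi) + ln P(wi)."""
    best_class, best_g = None, -np.inf
    for c, (mean, cov, prior) in params.items():
        diff = x - mean
        g = (-0.5 * np.log(np.linalg.det(cov))
             - 0.5 * diff @ np.linalg.solve(cov, diff)
             + np.log(prior))
        if g > best_g:
            best_class, best_g = c, g
    return best_class
```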
When adopting a quadratic or a linear classifier we impose an extreme degree of adjustment of the decision boundaries to the training samples. When the training samples are highly overlapped in the representation space these are not good choices, and it is plausible to allow a wider degree of adjustment, that is, a wider set of possible decision boundaries. This can be achieved using the regularized discriminant analysis classifier (RDA classifier) proposed by Friedman [Friedman, 1989].
RDA defines a wide family of parametric classifiers that includes the quadratic ML classifier and the linear classifier as particular cases. The covariance matrices are estimated through a regularization process controlled by two parameters, whose values are adapted to the individual situation by jointly minimizing a sample-based estimate of the future misclassification risk. Since the joint optimization of these parameters minimizes the cross-validation error, this technique is especially convenient for problems in which the training set is small.
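The following sketch shows one common way of writing Friedman-style regularized covariance estimates with the two parameters mentioned above (here called lam and gamma); it is an illustration under assumed notation, not the implementation used in the paper.

```python
import numpy as np

def rda_covariances(X, y, lam, gamma):
    """Regularized covariance estimates in the spirit of Friedman's RDA (sketch):
    lam blends each class scatter matrix with the pooled scatter matrix, and
    gamma shrinks the result towards a multiple of the identity.
    lam = gamma = 0 recovers the quadratic (ML) case; lam = 1, gamma = 0 the linear one."""
    classes = np.unique(y)
    n, p = X.shape
    scatters, counts = {}, {}
    for c in classes:
        Xc = X[y == c]
        centered = Xc - Xc.mean(axis=0)
        scatters[c] = centered.T @ centered          # class scatter matrix S_i
        counts[c] = len(Xc)
    S_pooled = sum(scatters.values())                # pooled scatter matrix
    covariances = {}
    for c in classes:
        S_lam = ((1 - lam) * scatters[c] + lam * S_pooled) / \
                ((1 - lam) * counts[c] + lam * n)
        covariances[c] = (1 - gamma) * S_lam + gamma * (np.trace(S_lam) / p) * np.eye(p)
    return covariances
```

In practice the two parameters would be chosen jointly by minimizing a cross-validated estimate of the misclassification risk, and the resulting matrices plugged into the quadratic discriminant sketched earlier.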
In most classification applications the assumption that the forms of the underlying density functions are known is unrealistic. The common parametric forms rarely fit the densities actually encountered in practice. For instance, the parametric models handle unimodal densities, whereas many practical problems present multimodal densities [Duda & Hart, 1973]. The only available information is the training set, and the classification rules must be built from it alone, with no additional assumptions.
A wide variety of spectral non-parametric classifiers exists. They can be grouped into three main categories. The first approach consists of non-parametric statistical techniques to estimate p(X|wi) (nearest neighbor estimation techniques and kernel estimation techniques [Duda & Hart, 1973], [Parzen, 1962], [Devijver & Kittler, 1982]). The second consists of directly estimating the a posteriori probability p(wi|X) (nearest neighbor classification rules [Devijver & Kittler, 1982]), and the third consists of recursively splitting the representation space by means of binary questions on the values of the variables involved (classification trees). A minimal sketch of a kernel estimate of p(X|wi) is given after this paragraph. In this work we have adopted two different approaches because they are widely used, accurate and well understood: we have used CART [Breiman et al., 1984]
as the classification-tree based technique, and the nearest neighbor classification rules [Devijver & Kittler, 1982] applied to several different reference sets, built by different learning algorithms, as we will see below.
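To make the first of these approaches concrete, the sketch below shows a minimal kernel (Parzen) estimate of p(X|wi); the Gaussian kernel, the single bandwidth parameter h and the NumPy-based form are assumptions made for illustration, not choices taken from the paper.

```python
import numpy as np

def parzen_density(x, Xc, h):
    """Kernel (Parzen) estimate of p(x | class): an average of Gaussian kernels of
    bandwidth h centred on the training samples Xc (m x d array) of that class."""
    d = Xc.shape[1]
    sq_dist = np.sum((Xc - x) ** 2, axis=1)
    kernels = np.exp(-sq_dist / (2.0 * h ** 2)) / ((2.0 * np.pi) ** (d / 2) * h ** d)
    return kernels.mean()
```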
One of the most popular and widely used non-parametric classification rules is the k nearest neighbor rule (k-NN). In its simplest formulation, given a training set T, a metric δ on the representation space and a sample X to classify, the k-NN rule searches for the k nearest neighbors of X in T and assigns X to the most populated class among those neighbors. If k = 1 the k-NN rule is known as the nearest neighbor rule or 1-NN rule. In this work we denote by 1-NN the 1-NN classifier that uses the complete training set (as given by the experts) to search for the nearest neighbor.
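A minimal sketch of the k-NN rule just described might look as follows; the Euclidean metric and the NumPy arrays are assumptions of the sketch, not of the paper.

```python
import numpy as np
from collections import Counter

def knn_classify(x, T_samples, T_labels, k=1):
    """Find the k nearest training samples to x (Euclidean distance assumed) and
    assign x to the most populated class among them; k = 1 gives the 1-NN rule."""
    distances = np.linalg.norm(T_samples - x, axis=1)
    nearest = np.argsort(distances)[:k]
    votes = Counter(T_labels[i] for i in nearest)
    return votes.most_common(1)[0][0]
```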
The requirement of a large training set to assure the convergence of the k-NN rule [Devijver & Kittler, 1982] is the main drawback of the nearest neighbor rules in practical problems. Moreover, there are two additional drawbacks in the application of these rules: firstly, they are strongly influenced by incorrectly labeled training samples ("noisy" samples or outliers) and secondly, the computational complexity associated with the search for the nearest neighbor(s) in T can be O(n²) or higher. To circumvent these problems it is possible to obtain a reduced and representative reference set, R, from T, with the objective of searching for the nearest neighbor(s) in R with an acceptable trade-off between the accuracy of the classification and the required computational effort ([Devijver & Kittler, 1982], [Kohonen, 1990], [Geva & Sitte, 1991], among others). This can be done in two ways: a) by editing-condensing techniques or b) by adaptive learning techniques. In the first case R ⊂ T, whereas in the second case there is no explicit relation between both sets.
The aim of editing-condensing techniques is two-fold: improving the accuracy of the classification by removing samples located in the regions where the class acceptance surfaces overlap (editing techniques), and decreasing the computational effort required to find the nearest neighbor(s) (condensing techniques). Editing techniques take the original training set as input and give a subset of it as output. Condensing techniques usually take the edited training set as input and give a subset of the previously edited set as output. R is thus a reduced (sometimes drastically) and representative version of T. The joint application of these techniques improves the trade-off between the accuracy of the 1-NN classification and the computational effort required for that classification.
A different approach is based on adaptive learning techniques. Adaptive learning is a powerful alternative to the classical editing-condensing techniques, as it allows the reference set size to be fixed in advance. In this case the reference set is not usually a subset of the training set. Adaptive learning algorithms can be tuned by means of a set of parameters, in such a way that it is possible to supervise the learning process directly. The training samples are used to tune a fixed number of codebook vectors or prototypes, and the reference set is called the codebook set or the prototype set. Adaptive learning is performed in two sequential phases: initialization and learning. The prototype set is initially a subset of the training set, and the values of the prototypes are updated in an iterative learning process.
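As an illustration of such an adaptive scheme, the sketch below follows the general pattern of Kohonen's LVQ1 [Kohonen, 1990]; the learning rate, its decay and the number of epochs are assumptions made for the example, not parameters reported in the paper.

```python
import numpy as np

def lvq1_learn(X, y, prototypes, proto_labels, alpha=0.05, epochs=10):
    """LVQ1-style adaptive learning sketch: prototypes initialized from the training
    set are attracted towards training samples of their own class and pushed away
    from samples of other classes, with a decreasing learning rate."""
    P = np.array(prototypes, dtype=float)
    for _ in range(epochs):
        for x, label in zip(X, y):
            j = np.argmin(np.linalg.norm(P - x, axis=1))   # nearest prototype to x
            if proto_labels[j] == label:
                P[j] += alpha * (x - P[j])                  # move towards the sample
            else:
                P[j] -= alpha * (x - P[j])                  # move away from the sample
        alpha *= 0.9                                        # decrease the learning rate
    return P
```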
As the editing algorithm we have chosen the multiedit algorithm [Devijver & Kittler, 1982], and as the condensing algorithm we have adopted Hart's condensing algorithm.
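For concreteness, a Hart-style condensing step could be sketched as follows (assuming the edited training samples and labels are held in NumPy arrays); this illustrates the general idea and is not the authors' implementation.

```python
import numpy as np

def hart_condense(X, y):
    """Hart-style condensing sketch: starting from a single sample, add to the
    reference set R every training sample that the current R misclassifies with
    the 1-NN rule, and iterate until no more samples are added."""
    R_idx = [0]                                     # start R with an arbitrary sample
    changed = True
    while changed:
        changed = False
        for i in range(len(X)):
            if i in R_idx:
                continue
            d = np.linalg.norm(X[R_idx] - X[i], axis=1)
            nearest = R_idx[int(np.argmin(d))]
            if y[nearest] != y[i]:                  # misclassified by the current R
                R_idx.append(i)
                changed = True
    return X[R_idx], y[R_idx]
```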