FEATURE SELECTION BY USING
CLASSIFICATION AND REGRESSION TREES (CART)
H. R. Bittencourt a, *, R. T. Clarke b

a Faculty of Mathematics, Pontifícia Universidade Católica do RS, Porto Alegre, Brazil - heliorb@pucrs.br
b CEPSRM, Universidade Federal do Rio Grande do Sul, Porto Alegre, Brazil - clarke@iph.ufrgs.br
KEY WORDS: Hyper-spectral, Classification, Feature Extraction, Remote Sensing, Pattern Recognition, Agriculture
ABSTRACT:
Hyper-spectral remote sensing increases the volume of information available for research and practice, but brings with it the need for
efficient statistical methods in sample spaces of many dimensions. Due to the complexity of problems in high dimensionality,
several methods for dimension reduction are suggested in the literature, such as Principal Components Analysis (PCA). Although
PCA can be applied to data reduction, its use for classifying images has not produced good results. In the present study, the
Classification and Regression Trees technique, more widely known by the acronym CART, is used for feature selection. CART
involves the identification and construction of a binary decision tree using a sample of training data for which the correct
classification is known. Binary decision trees consist of repeated divisions of a feature space into two sub-spaces, with the terminal
nodes associated with the classes. A desirable decision tree is one having a relatively small number of branches, a relatively small
number of intermediate nodes from which these branches diverge, and high predictive power, in which entities are correctly
classified at the terminal nodes. In the present study, AVIRIS digital images from agricultural fields in the USA are used. The
images were automatically classified by a binary decision tree. Based on the results from the digital classification, a table showing
highly discriminatory spectral bands for each kind of agricultural field was generated. Moreover, the spectral signatures of the
crops are discussed. The results show that the decision trees employ a strategy in which a complex problem is divided into
simpler sub-problems, with the advantage that it becomes possible to follow the classification process through each node of the
decision tree. It is emphasized that it is the computer algorithm itself which selects the bands with maximum discriminatory power,
thus providing useful information to the researcher.
1. INTRODUCTION

1.1 Hyper-spectral Sensors and Dimensionality Reduction

In recent years, advances in sensor technology have made possible the acquisition of images in several hundred spectral bands. The AVIRIS and HYDICE sensors are well-known examples of this technology, having 224 and 210 bands, respectively. Since hyper-spectral remote sensing places a great volume of data at the researcher's disposal, some problems can occur during the image classification process. When a parametric classifier is used, parameter estimation becomes problematic in high dimensionality. In the traditional Gaussian Maximum Likelihood classifier, for example, the underlying probability distributions are assumed to be multivariate Normal and the number of parameters to be estimated can be very large, since with k classes, k mean vectors (of dimension p×1) and k covariance matrices (of dimension p×p, symmetric) must be estimated (Bittencourt and Clarke, 2003a). As stated by Haertel and Landgrebe (1999), one of the most difficult problems in dealing with high-dimensional data resides in the estimation of the classes' covariance matrices. Methods to solve this problem have received considerable attention from the scientific community, and one way to solve it is to reduce the dimensionality.
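As a rough illustration of how quickly these parameters accumulate, the sketch below counts the free parameters of such a classifier; the choice of ten classes over the 224 AVIRIS bands is a hypothetical example, not a figure from this study.

```python
def gaussian_ml_parameter_count(k: int, p: int) -> int:
    """Free parameters of a Gaussian ML classifier:
    k mean vectors (p values each) plus k symmetric
    p x p covariance matrices (p*(p+1)/2 values each)."""
    means = k * p
    covariances = k * p * (p + 1) // 2
    return means + covariances

# Hypothetical example: 10 classes on the 224 AVIRIS bands
print(gaussian_ml_parameter_count(10, 224))  # -> 254240
```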
There are two main reasons for keeping the dimensionality as small as possible: measurement cost and classification accuracy (Jain et al., 2000). Dimensionality reduction is highly recommended when the number of training samples is limited but, on the other hand, it may lead to a loss of discriminatory power between the classes.

1.2 Statistical Pattern Recognition and Feature Extraction

In the statistical approach to pattern recognition, each pattern is regarded as a p-dimensional random vector, where p is the number of characteristics used in classification; these characteristics compose the feature space. Normally, when only spectral attributes are used, the pixels are the patterns and the p spectral bands make up the feature space. Several selection methods for determining an appropriate subspace of dimensionality m (m < p) within the original feature space are found in the literature. Probably the most widely known technique is Principal Component Analysis (PCA), also called the Karhunen-Loève expansion, although the literature warns of problems when PCA is used (Cheriyadat and Bruce, 2003). Although PCA is an excellent tool for data reduction, it is not necessarily an appropriate method for feature extraction when the main goal is classification, because PCA analyses a covariance matrix constructed from the entire data distribution, which does not represent the underlying class information present in the data.
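To make this limitation concrete, the following sketch (using scikit-learn; the array X and its dimensions are hypothetical) reduces a pixel-by-band matrix to ten components. The components are fitted from X alone: the class labels of the pixels never enter the computation, which is why the leading components need not be the most discriminatory ones.

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical data: 1000 pixels by p = 224 spectral bands
X = np.random.rand(1000, 224)

# PCA is fitted from the covariance structure of X alone;
# no class labels are involved in choosing the components.
pca = PCA(n_components=10)
X_reduced = pca.fit_transform(X)   # shape (1000, 10)
print(pca.explained_variance_ratio_.sum())
```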
This paper presents an alternative approach to data reduction, demonstrating the ability of Classification and Regression Trees (CART) to determine the bands with the highest discriminatory power between classes. This procedure is also known as feature selection.
* Corresponding author.
2. CLASSIFICATION AND REGRESSION TREES (CART)

As discussed in the pattern recognition literature, binary decision trees follow a hierarchical, non-parametric approach in which patterns are classified through a sequence of decisions, the result being a tree of nodes connected by decision rules. Breiman et al. (1984) summarized this methodology under the name Classification and Regression Trees (CART), which covers not only classification problems, where the response is a class label, but also regression problems, where the quantity to be estimated is a continuous function of explanatory variables.

Binary decision trees consist of repeated divisions of a feature space into two sub-spaces, with the terminal nodes associated with the classes. A desirable decision tree is one having a relatively small number of branches, a relatively small number of intermediate nodes from which these branches diverge, and high predictive power, in which entities are correctly classified at the terminal nodes.
2.1 How the Decision Tree is Constructed

CART involves the identification and construction of a binary decision tree using a sample of training data for which the correct classification is known. The feature space is divided into two sub-spaces, and each of the two resulting sub-spaces becomes a node that may itself be divided in turn, so that the training sample is progressively partitioned among the nodes of the tree (McLachlan, 1992).
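Assuming a CART-style implementation such as the one in scikit-learn, the sketch below grows a small binary tree from a hypothetical labeled training sample and lists the bands the tree actually used as split variables. This is one plausible way to mimic the band selection described here, not the exact procedure of this paper.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Hypothetical training sample: 500 pixels, 224 bands, 4 crop classes
X_train = np.random.rand(500, 224)
y_train = np.random.randint(0, 4, size=500)

tree = DecisionTreeClassifier(criterion="gini", max_depth=5)
tree.fit(X_train, y_train)

# Bands appearing as split variables are the selected features;
# leaf nodes carry a negative sentinel and are filtered out.
used_bands = np.unique(tree.tree_.feature[tree.tree_.feature >= 0])
print("Discriminatory bands:", used_bands)
```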
The division at each node is made on whichever feature best separates the classes, in the sense of maximizing the reduction in the impurity of the descendant nodes. The impurity of a node t can be measured by an impurity function such as the Gini index:

i(t) = 1 − Σj p²(wj | t),

where p(wj | t) is the proportion of entities of class wj at node t. A node whose entities belong predominantly to a single class need not be divided further; otherwise, candidate divisions are compared through the decrease in impurity they produce. For a division s of node t into a left descendant tL and a right descendant tR, this decrease is

Δi(s, t) = i(t) − pL i(tL) − pR i(tR),

where pL and pR are the proportions of the entities of node t that pass to tL and tR, respectively. The decision tree is grown by selecting, at each node, the division s that produces the greatest decrease in impurity.
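A minimal sketch of these two quantities, assuming the Gini form of i(t); the class proportions and node counts below are invented purely for illustration:

```python
def gini_impurity(class_proportions):
    """i(t) = 1 - sum_j p^2(w_j | t)."""
    return 1.0 - sum(p * p for p in class_proportions)

def impurity_decrease(parent, left, right, n_left, n_right):
    """Delta i(s, t) = i(t) - pL * i(tL) - pR * i(tR)."""
    n = n_left + n_right
    p_left, p_right = n_left / n, n_right / n
    return (gini_impurity(parent)
            - p_left * gini_impurity(left)
            - p_right * gini_impurity(right))

# Example: an evenly mixed node split into two purer children
print(impurity_decrease([0.5, 0.5], [0.9, 0.1], [0.1, 0.9], 100, 100))
# -> 0.5 - 0.5*0.18 - 0.5*0.18 = 0.32
```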