2. CLASSIFICATION AND REGRESSION TREES (CART)
As discussed in Bittencourt and Clarke (2003b), binary decision
trees for classification can be viewed as a non-parametric
approach to pattern recognition. A decision tree provides a
hierarchical representation of the feature space in which
patterns x_i are allocated to classes w_j (j = 1, 2, ..., k) according to
the result obtained by following decisions made at a sequence
of nodes at which branches of the tree diverge. The type of
decision tree used in this paper is discussed in detail by
Breiman et al. (1984), whose contributions have been
summarized by the letters CART (Classification And
Regression Trees). These letters indicate that trees may be used
not only to classify entities into a discrete number of groups,
but also as an alternative approach to regression analysis in
which the value of a response (dependent) variable is to be
estimated, given the value of each variable in a set of
explanatory (independent) variables.
Binary decision trees consist of repeated divisions of a feature
space into two sub-spaces, with the terminal nodes associated
with the classes w_j. A desirable decision tree is one having a
relatively small number of branches, a relatively small number
of intermediate nodes from which these branches diverge, and
high predictive power, in which entities are correctly classified
at the terminal nodes. A structural sketch of such a tree node is given below.
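To make this structure concrete, the following minimal sketch (our own illustration in Python; the class and field names are not taken from the paper) represents one node of such a binary tree: internal nodes hold the band tested and the split threshold, while terminal nodes hold the class assigned to that sub-space.

from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    feature: Optional[int] = None      # index of the band tested at an internal node
    threshold: Optional[float] = None  # split value: go left if x[feature] < threshold
    left: Optional["Node"] = None      # sub-space where the test is satisfied
    right: Optional["Node"] = None     # sub-space where the test fails
    label: Optional[int] = None        # class w_j, set only at terminal nodes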
2.1 How Does CART Operate?
CART involves the identification and construction of a binary
decision tree using a sample of training data for which the
correct classification is known. The numbers of entities in the
two sub-groups defined at each binary split, corresponding to
the two branches emerging from each intermediate node,
become successively smaller, so that a reasonably large training
sample is required if good results are to be obtained
(McLachlan, 1992).
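As a hedged illustration of this training step, the sketch below uses scikit-learn's DecisionTreeClassifier as a stand-in for the CART procedure described here; the matrix X (pixels by spectral bands) and the label vector y are synthetic placeholders, not the data used in this paper.

# Illustrative only: a CART-style tree grown from a labelled training sample.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 6))            # 300 training pixels, 6 spectral bands
y = rng.integers(0, 3, size=300)         # known classes w1, w2, w3 coded as 0, 1, 2

cart = DecisionTreeClassifier(criterion="entropy", min_samples_leaf=5)
cart.fit(X, y)                           # grows the binary tree from the training sample
print(cart.score(X, y))                  # proportion correctly classified at the terminal nodes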
The decision tree begins with a root node t, derived from whichever variable in the feature space minimizes a measure of the impurity of the two sibling nodes. The measure of the impurity or entropy at node t, denoted by i(t), is given by the following equation:

i(t) = -\sum_{j=1}^{k} p(w_j \mid t) \log p(w_j \mid t) \qquad (1)
where p(w_j | t) is the proportion of patterns x_i allocated to class w_j at node t. Each non-terminal node is then divided into two further nodes, t_L and t_R, such that p_L and p_R are the proportions of entities passed to the new nodes t_L and t_R respectively. The best division is that which maximizes the difference given in:

\Delta i(s, t) = i(t) - p_L \, i(t_L) - p_R \, i(t_R)
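A minimal sketch of these two quantities, assuming integer-coded class labels (the function names are ours, not the paper's):

import numpy as np

def node_impurity(labels):
    # Entropy i(t) = -sum_j p(w_j | t) log p(w_j | t) over the patterns at node t.
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log(p))

def impurity_decrease(parent, left, right):
    # Delta i(s, t) = i(t) - p_L i(t_L) - p_R i(t_R) for a candidate split s
    # that sends the subsets "left" and "right" to the two new nodes.
    p_l = len(left) / len(parent)
    p_r = len(right) / len(parent)
    return node_impurity(parent) - p_l * node_impurity(left) - p_r * node_impurity(right)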
The decision tree grows by means of successive sub-divisions until a stage is reached at which there is no significant decrease in the measure of impurity when a further
division s is implemented. When this stage is reached, the node t is not sub-divided further, and automatically becomes a terminal node. The class w_j associated with the terminal node t is that which maximizes the conditional probability p(w_j | t).
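Under the same assumptions, the labelling rule for a terminal node can be sketched as picking the class with the largest conditional proportion:

import numpy as np

def terminal_class(labels):
    # The terminal node t receives the class w_j maximizing p(w_j | t),
    # i.e. the most frequent class among the training patterns reaching it.
    classes, counts = np.unique(labels, return_counts=True)
    return classes[np.argmax(counts)]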
2.2 CART applied to Feature Selection
The decision tree generated by CART uses only the bands that
help to separate the classes, while the others are not considered.
In this paper we use the tree as a feature selection method to
reduce dimensionality. As an illustration, an image with only
six spectral bands was classified using a decision tree; only two
of the six bands were identified by the CART procedure as
necessary to separate three classes, as shown in Figure 1.
[Figure 1 shows a decision-tree diagram built on synthetic data: the root node splits on Band 1 < 50, a second internal node splits on Band 5 < 80, and the terminal nodes carry the three classes.]
Figure 1. Example of decision tree using synthetic data
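A hedged sketch of this band-selection step, continuing the scikit-learn stand-in above (the fitted tree cart is assumed): the bands appearing in the tree's internal split nodes are the ones retained, and all other bands can be discarded before further processing.

import numpy as np

split_features = cart.tree_.feature                        # scikit-learn marks leaf nodes with -2
used_bands = np.unique(split_features[split_features >= 0])
print("Bands retained by the tree:", used_bands)           # a subset of the available bands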
Figure 2 shows the subdivision of the feature space determined by the decision tree, where the points represent the patterns. As the tree used only two spectral bands, it is possible to present the subdivision of the feature space in the plane.
[Figure 2 shows a scatter-plot of the synthetic patterns in the plane of the two bands selected by the tree (Band 1 on the horizontal axis), with the subdivisions found by the tree separating classes w1, w2 and w3.]
Figure 2. Subdivisions of the feature space determined by the decision tree (synthetic data)
This example is over-simplified, because the solution can be presented as a two-dimensional scatter-plot. However, hyperspectral image classification generally requires solutions in a space of many dimensions. Another way to present the decision tree's results is the following: