tween random or greedily deterministic approaches to find an appropriate subset of features to perform satisfying classifications.
Alternatively, Guyon and Elisseeff (2003) propose a feature selection framework which is based on the ranking of features; the classification step itself is then done by a single classifier. In our experiments, the features have low correlation coefficients with the class target, and they are highly correlated with each other. Furthermore, the training samples do not form compact clusters in feature space. Hence, it is hard to find a single classifier that is able to separate the classes. Thus, we prefer a feature selection scheme which combines several classifiers.
One such scheme is the framework of adaptive boosting (Adaboost), cf. (Schapire, 1990). The first weak classifier (or best hypothesis) is learnt on equally weighted training samples $(x_n, y_n)$. Then, the influence of all misclassified samples is increased by adjusting the weights of the feature vectors before training the second weak classifier, so that the second weak classifier focuses especially on the previously misclassified samples. The weights are adjusted once more depending on the classification result of the second weak classifier before training the third weak classifier, and so on. Finally, we obtain the resulting classifier by a majority vote of all weak classifiers. The discriminative power of this resulting classifier is much higher than the discriminative power of each weak classifier, cf. (Rätsch et al., 2001).
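For concreteness, a common instantiation of this reweighting (standard in the boosting literature, cf. Schapire and Singer, 1999, but not the only possible choice) determines the vote $\alpha_t$ of the weak classifier $h_t$ from its weighted training error $\varepsilon_t$ and then rescales the sample weights:

$$\varepsilon_t = \sum_{n=1}^{N} W_t(n)\,[h_t(x_n) \neq y_n], \qquad \alpha_t = \frac{1}{2}\ln\frac{1-\varepsilon_t}{\varepsilon_t}, \qquad W_{t+1}(n) = \frac{W_t(n)\,\exp(-\alpha_t\, y_n\, h_t(x_n))}{Z_t},$$

where $Z_t$ normalises the weights to sum to one. For $\varepsilon_t < 1/2$, misclassified samples ($y_n h_t(x_n) = -1$) thus have their weights multiplied by $e^{\alpha_t} > 1$, correctly classified ones by $e^{-\alpha_t} < 1$.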
This majority voting is also done when using random forests, which are based on random decisions when constructing decision trees. In (Ho, 1998), a random selection of a predefined small number of features is used for constructing each decision tree. This procedure is equivalent to projecting the feature vectors into a space of much lower dimensionality, where the projection differs from one decision tree to the next. The decision trees in (Breiman, 2001) randomly choose a feature from the whole feature set and take it for determining the best domain split. Both methods only work well if the number of random decision trees is large, especially if all features are nearly uncorrelated with the classes. In the experiments by Breiman (2001), five times as many decision trees were used as weak learners in Adaboost. Therefore, we focus our work on Adaboost and its variants.
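The random-subspace idea of (Ho, 1998) can be sketched as follows; the subset size m, the number of trees, the use of scikit-learn decision trees and all names are our illustrative assumptions, not the cited implementations.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def random_subspace_forest(X, y, n_trees=200, m=5, seed=0):
    # Train n_trees decision trees, each on m randomly chosen features.
    rng = np.random.default_rng(seed)
    forest = []
    for _ in range(n_trees):
        subset = rng.choice(X.shape[1], size=m, replace=False)
        tree = DecisionTreeClassifier().fit(X[:, subset], y)
        forest.append((subset, tree))
    return forest

def forest_predict(forest, X):
    # Majority vote over all trees; class labels are assumed to be -1 / +1.
    votes = sum(tree.predict(X[:, subset]) for subset, tree in forest)
    return np.sign(votes)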
3 FEATURE SELECTION WITH ADABOOST
The basic algorithm of Adaboost as taken from (Schapire and Singer, 1999) is shown in alg. 1. Input are the number $T$ of iterations and $N$ samples $(x_n, y_n)$ with binary targets, i.e. $y_n \in \{+1, -1\}$. In each iteration, the best weak classifier is determined with respect to the sample weights. Each weak classifier $h_t$ is a function $h_t: x_n \mapsto \{+1, -1\}$. After Adaboost has terminated after $T$ iterations, the resulting strong classifier $H$ can be expressed as a weighted majority vote over the responses of all weak classifiers:

$$H(x_n) = \operatorname{sign}\!\left( \sum_{t=1}^{T} \alpha_t\, h_t(x_n) \right). \qquad (1)$$

The $\alpha_t$ are predictive values and depend on the weak classifiers' success rates.
Algorithm 1 Adaboost algorithm
1: function ADABOOST$(T, (x_1, y_1), \dots, (x_N, y_N))$
2:     $W_1(n) = 1/N$ for all $n$
3:     for $t = 1, \dots, T$ do
4:         Determine best weak hypothesis $h_t$ using $W_t$
5:         Determine $\alpha_t$
6:         Determine distribution $W_{t+1}$
7:     end for
8:     return $H$ with $H(x) = \operatorname{sign}\left(\sum_{t=1}^{T} \alpha_t h_t(x)\right)$
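To make alg. 1 concrete, the following self-contained Python sketch runs the loop with single-feature threshold stumps as weak hypotheses and the exponential-loss choices for $\alpha_t$ and $W_{t+1}$ sketched above; the stump form, the threshold search and all names are our own illustrative assumptions, not the exact implementation used in our experiments.

import numpy as np

def best_stump(X, y, w):
    # Line 4: weighted search for the best single-feature threshold stump.
    best = (0, 0.0, 1, np.inf)                      # (feature, threshold, polarity, error)
    for d in range(X.shape[1]):
        for thr in np.unique(X[:, d]):
            for pol in (+1, -1):
                pred = np.where(pol * (X[:, d] - thr) >= 0, 1, -1)
                err = w[pred != y].sum()
                if err < best[3]:
                    best = (d, thr, pol, err)
    return best

def adaboost(X, y, T):
    # Alg. 1 with single-feature stumps; X, y are NumPy arrays, y in {-1, +1}.
    N = len(y)
    w = np.full(N, 1.0 / N)                         # line 2: uniform initial weights
    classifiers = []
    for _ in range(T):                              # line 3
        d, thr, pol, err = best_stump(X, y, w)      # line 4
        err = np.clip(err, 1e-10, 1.0 - 1e-10)
        alpha = 0.5 * np.log((1.0 - err) / err)     # line 5: vote of the weak classifier
        pred = np.where(pol * (X[:, d] - thr) >= 0, 1, -1)
        w = w * np.exp(-alpha * y * pred)           # line 6: emphasise misclassified samples
        w = w / w.sum()
        classifiers.append((alpha, d, thr, pol))
    return classifiers

def strong_classify(classifiers, X):
    # Eq. (1): weighted majority vote of all weak classifiers.
    score = sum(a * np.where(p * (X[:, d] - t) >= 0, 1, -1)
                for a, d, t, p in classifiers)
    return np.sign(score)

The feature indices $d$ stored with the weak classifiers then directly yield the list of at most $T$ features discussed next.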
If the weak classifiers are designed in a way that they use only a limited set of features, e.g. a single feature each, then we are able to derive the relevance of features from the relevance of the weak classifiers. In the case where only classifiers on single features are used, the strong classifier $H$ consists of $T$ weak classifiers $h_1, \dots, h_T$, and we obtain a list of at most $T$ features that are involved in the classification. If $T < D$, then these at most $T$ features are the most appropriate features for classification.
Another strategy for obtaining the best feature subset is presented in (Drauschke and Förstner, 2008). In case $T \approx D$ or $T > D$, we cannot assume to find the best subset by only taking those features that have been used by the first weak classifiers. The influence of a feature depends on the absolute value of the predictive values $\alpha_t$. Since these $\alpha_t$ do not decrease monotonically with increasing $t$, we have to evaluate the features after the iterative process has terminated.
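One plausible way to carry out this evaluation (our own sketch, not necessarily the procedure of Drauschke and Förstner, 2008) is to accumulate the absolute votes $|\alpha_t|$ of all weak classifiers that use a given feature and to rank the features by this accumulated influence, continuing the Python sketch from above:

import numpy as np

def feature_relevance(classifiers, D):
    # Accumulate |alpha_t| per feature; 'classifiers' is the list
    # returned by adaboost() above, D is the number of features.
    relevance = np.zeros(D)
    for alpha, d, thr, pol in classifiers:
        relevance[d] += abs(alpha)        # a feature may be chosen in several rounds
    return np.argsort(-relevance)         # features ordered by decreasing influence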
4 DATA
We use terrestrial images of the eTRIMS database for testing our feature selection and classification. This database contains several hundred buildings from several major European cities or their suburbs. We selected 82 images from Berlin, Bonn and Munich, Germany, and from Prague, Czech Republic.
Typically, buildings have dominant vanishing lines in horizontal and vertical directions, respectively. If a pair of lines in the image is selected for each of the two directions, we are able to determine the vanishing points and the homography for rectifying the image, cf. (McGlone, 2004), p. 775. So far, we work on rectified images where the rectification has been computed after manual selection of these two pairs of lines. We show such a rectified image in fig. 1.
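For illustration only, the geometry of this step can be sketched as follows (a generic projective rectification in the spirit of (McGlone, 2004), not necessarily the procedure used for our images; all function and variable names are assumed by us): a line is the cross product of two homogeneous image points, a vanishing point is the cross product of two such lines, and a homography built from the resulting vanishing line removes the projective distortion of the facade plane.

import numpy as np

def line_through(p, q):
    # Homogeneous line through two image points given as (x, y).
    return np.cross([p[0], p[1], 1.0], [q[0], q[1], 1.0])

def vanishing_point(l1, l2):
    # The intersection of two homogeneous lines is their cross product.
    return np.cross(l1, l2)

def affine_rectifying_homography(v_horizontal, v_vertical):
    # Map the vanishing line of the facade plane to the line at infinity.
    # This removes the projective distortion only; a further affine correction
    # is needed for a metrically correct, fronto-parallel view.
    l_inf = np.cross(v_horizontal, v_vertical)
    l_inf = l_inf / l_inf[2]              # assumes the vanishing line is not already at infinity
    return np.array([[1.0, 0.0, 0.0],
                     [0.0, 1.0, 0.0],
                     [l_inf[0], l_inf[1], l_inf[2]]])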
Our class ontology contains several object classes, and our goal is to find instances of these classes in images. At this stage, we limit our experiments to a subset of classes. On the one hand, we are interested in building detection, and therefore we want to classify image regions as facade, roof, sky or vegetation. On the other hand, we are interested in the detection of building parts, and therefore we want to classify image regions as windows or window panes. Objects of both classes appear in all images, and regarding their cardinality, these are the most dominant structures in all images of the eTRIMS database.