$\varphi_i(x_i, y) \propto \log p(f_i(y) \mid x_i = c)$. Thus, $\varphi_i(x_i, y)$ is the log likelihood
of the generative model with a smoothness prior on the labels
expressed in Eq. 1. In this context, the image data are
represented by site-wise feature vectors $f_i(y)$ that may depend
on the data observed at site i and its local neighbourhood. Both
the definition of the features and the dimension of the feature
vectors $f_i(y)$ may vary from dataset to dataset, because the
definition of appropriate and expressive features depends on the
image resolution and also on the spectral information contained
in the images.
We use a simple model for the likelihood $p(f_i(y) \mid x_i)$ and,
consequently, for the association potential $\varphi_i(x_i, y)$. In the
training phase, for each class we generate histograms of all
features. These histograms are smoothed and normalised, and
the smoothed and normalised histograms are used as probability
density functions (pdf) $p(f_j \mid x_i = c) = p_c(f_j \mid x_i)$, where $f_j$ is the
j-th component of $f_i(y)$, for the class c. Neglecting the statistical
dependencies between the individual features $f_j$, the likelihood
$p(f_i(y) \mid x_i = c)$ becomes the product of the probability density
functions of the individual features, so that the association
potential becomes the sum of the logarithms of these functions:
$$\varphi_i(x_i = c, y) = \sum_{j=1}^{M} \log\left[ p_c(f_j \mid x_i) \right] \qquad (2)$$
In Eq. 2, M is the dimension of the site-wise feature vectors
$f_i(y)$. This is a very simplistic model, which is to be replaced by
more appropriate ones in the future. Its advantage is that it is
very fast to determine in training.
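The histogram-based association potential of Eq. 2 can be sketched as follows. This is a minimal illustration only, not the authors' implementation: the function names are hypothetical, features are assumed to be already quantized to integer bin indices, and additive smoothing stands in for the smoothing step, whose exact form is not specified in the text.

```python
import math

def train_histograms(features, labels, n_classes, n_bins=256, smooth=1.0):
    """Return pdf[c][j][b]: normalised histogram of feature j for class c.

    Assumption: additive smoothing (every bin starts at `smooth`) replaces
    the unspecified histogram smoothing used in the paper.
    """
    n_feat = len(features[0])
    pdf = [[[smooth] * n_bins for _ in range(n_feat)] for _ in range(n_classes)]
    for f, c in zip(features, labels):
        for j, value in enumerate(f):
            pdf[c][j][value] += 1.0
    # Normalise each histogram so that its bins sum to one.
    for c in range(n_classes):
        for j in range(n_feat):
            total = sum(pdf[c][j])
            pdf[c][j] = [h / total for h in pdf[c][j]]
    return pdf

def association_potential(pdf, f, c):
    """Sum of log pdf values over all M features (Eq. 2) for class c."""
    return sum(math.log(pdf[c][j][value]) for j, value in enumerate(f))

# Toy training set: two features per site, four bins, two classes.
features = [(0, 1), (0, 1), (3, 2)]
labels = [0, 0, 1]
pdf = train_histograms(features, labels, n_classes=2, n_bins=4)
phi_class0 = association_potential(pdf, (0, 1), 0)
phi_class1 = association_potential(pdf, (0, 1), 1)
```

In this toy example the feature vector (0, 1) receives a higher potential for class 0, the class whose training samples it matches.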
3.3 Interaction Potential
The pairwise interaction potential $\psi_{ij}(x_i, x_j)$ in Eq. 1 is a
measure for the influence of neighbouring labels $x_j$ on the class
$x_i$ of image site i. The function $\psi_{ij}(x_i, x_j)$ characterizes how
likely it is that the variable $x_i$ takes the value c, given that the variable
$x_j$ from the neighbouring data site $j \in N_i$ takes the value c', that
is, $\psi_{ij}(x_i, x_j) \propto \log p(x_i = c \mid x_j = c')$.
In order to determine this probability, a 2D histogram of the co-
occurrence of labels at neighbouring interaction features is
generated from the training data. The histogram corresponds to
a (symmetric) matrix of dimension $n_c \times n_c$, where $n_c$ is the
number of classes to be discerned, and the matrix entry at (c, c')
is the number of occurrences of the classes (c, c') at
neighbouring pixels i and j. After generating the histogram
matrix, its rows are scaled so that the largest value in a row
(usually the diagonal element) will be one. This is done in order
to compensate for different numbers of pixels per class in the
training data, i.e. to guarantee that $p(x_i = c \mid x_j = c)$ will be the
same for all classes $c \in C$. The interaction potentials $\psi_{ij}(x_i, x_j)$
are then defined as the logarithm of the scaled histogram matrix
entries. It is a drawback of this type of scaling that $\psi_{ij}(x_i, x_j)$ will
no longer be symmetric.
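The construction of the interaction potentials can be illustrated with a short sketch. The names are hypothetical, a 4-connected neighbourhood is assumed (the text does not fix the neighbourhood here), and a small floor `eps` is an added assumption to avoid taking the logarithm of zero for label pairs that never co-occur.

```python
import math

def cooccurrence_matrix(label_image, n_classes):
    """Symmetric count of label pairs (c, c') at 4-connected neighbours."""
    counts = [[0.0] * n_classes for _ in range(n_classes)]
    rows, cols = len(label_image), len(label_image[0])
    for r in range(rows):
        for k in range(cols):
            c = label_image[r][k]
            for dr, dk in ((0, 1), (1, 0)):  # right and lower neighbour
                rr, kk = r + dr, k + dk
                if rr < rows and kk < cols:
                    c2 = label_image[rr][kk]
                    counts[c][c2] += 1.0
                    counts[c2][c] += 1.0  # count both directions
    return counts

def interaction_potentials(counts, eps=1e-6):
    """Scale each row so its maximum is one, then take the logarithm."""
    psi = []
    for row in counts:
        row_max = max(row)
        scaled = [v / row_max if row_max > 0 else 0.0 for v in row]
        psi.append([math.log(max(v, eps)) for v in scaled])
    return psi

# Toy label image with two classes.
label_image = [[0, 0, 1],
               [0, 0, 1]]
counts = cooccurrence_matrix(label_image, n_classes=2)
psi = interaction_potentials(counts)
```

For this toy image the count matrix is symmetric, but after row scaling `psi[0][1]` and `psi[1][0]` differ, which reproduces the loss of symmetry noted above.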
3.4 Definition of the Features
As stated in Section 3.1, we derive a feature vector $f_i(y)$ for each
image site i that consists of $M_{img}$ features derived from the
orthophoto (image features) collected in a vector $f_{img}$, a feature
derived from the DSM ($f_{DSM}$) and, optionally, a car confidence
feature ($f_{car}$). Consequently, the total number M of features is
either $M = M_{img} + 1$ or $M = M_{img} + 2$, corresponding to
$f_i(y)^T = (f_{img}^T, f_{DSM})$ or $f_i(y)^T = (f_{img}^T, f_{DSM}, f_{car})$, depending on
whether the car confidence feature is omitted or used. In any case,
for numerical reasons all features are scaled linearly into the
range between 0 and 255 and then quantized to 8 bits.
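A minimal sketch of the linear scaling and 8-bit quantization is given below. The function name is hypothetical, and using the observed minimum and maximum of a feature as the bounds of the scaling is an assumption; the text does not state how the bounds are chosen.

```python
def quantize_8bit(values):
    """Scale values linearly to [0, 255] and round to integers.

    Assumption: the observed minimum and maximum define the input range.
    """
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0 for _ in values]  # constant feature carries no information
    scale = 255.0 / (hi - lo)
    return [int((v - lo) * scale + 0.5) for v in values]

quantized = quantize_8bit([0.0, 0.5, 2.0])
```

The resulting integers can be used directly as bin indices of 256-bin feature histograms.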
3.4.1 Image features: We do not use the colour vectors of the
images directly to define the site-wise image feature vectors $f_{img}$.
In total, we determine 7 image features. The first three features
are the normalized difference vegetation index (NDVI), derived
from the near infrared and the red band of the CIR orthophoto,
the saturation (sat) component after transforming the image to
the LHS colour space, and image intensity (int), calculated as
the average of the two non-infrared channels. We also make use
of the variance of intensity ($var_{int}$) and the variance of
saturation ($var_{sat}$), determined from a local neighbourhood of
each pixel (7 x 7 pixels for $var_{int}$, 13 x 13 pixels for $var_{sat}$). The
sixth image feature (dist) represents the relation between an
image site and its nearest edge pixel; this feature should model
the fact that road pixels are usually found at a certain distance
either from road edges or road markings. We generate an edge
image by thresholding the intensity gradient of the input image.
Then, we determine a distance map from this edge image. The
feature used in classification is the distance of an image site to
its nearest edge pixel, taken from the distance map. The last
image feature is the local gradient orientation, calculated with
respect to the main gradient orientation ($ori_{grad}$). In order to
compute the gradient orientation, we calculate two histograms
of oriented gradients (HOG) (Dalal & Triggs, 2005), one
considering a local neighbourhood (13 x 13 pixels in our
experiments), and one from a larger neighbourhood (101 x 101
pixels). Each histogram consists of 9 orientation bins. The
feature is the difference between the angles corresponding to the
histogram bins having the maximum entries in the two
histograms. Thus, the image feature vector for each pixel is
$f_{img} = (NDVI, sat, int, var_{sat}, var_{int}, dist, ori_{grad})$.
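Three of the image features can be illustrated per pixel as follows. The function names are hypothetical. For the orientation feature it is assumed that the 9 bins cover unsigned orientations from 0 to 180 degrees, that each bin is represented by its centre angle, and that the difference is wrapped to the range 0 to 90 degrees; none of these details is given in the text.

```python
def ndvi(nir, red):
    """Normalized difference vegetation index from NIR and red band values."""
    total = nir + red
    return (nir - red) / total if total != 0 else 0.0

def intensity(red, green):
    """Average of the two non-infrared channels of a CIR image."""
    return 0.5 * (red + green)

def orientation_difference(hist_local, hist_global, n_bins=9):
    """Angle between the dominant bins of two orientation histograms.

    Assumptions: bins span 0..180 degrees, bin centres represent the
    angles, and the difference is wrapped to 0..90 degrees.
    """
    bin_width = 180.0 / n_bins
    angle_local = (hist_local.index(max(hist_local)) + 0.5) * bin_width
    angle_global = (hist_global.index(max(hist_global)) + 0.5) * bin_width
    diff = abs(angle_local - angle_global)
    return min(diff, 180.0 - diff)

ndvi_value = ndvi(200.0, 50.0)
int_value = intensity(100.0, 60.0)
ori_value = orientation_difference([9, 1, 0, 0, 0, 0, 0, 0, 2],
                                   [1, 0, 0, 0, 0, 0, 0, 0, 7])
```

In the example, the dominant local orientation (bin centre 10 degrees) and the dominant orientation of the larger neighbourhood (bin centre 170 degrees) differ by 20 degrees after wrapping.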
3.4.2 DSM feature: A coarse Digital Terrain Model (DTM) is
generated from the DSM by applying a morphological opening
filter with a structural element whose size corresponds to the
size of the largest off-terrain structure in the scene, followed by
a median filter with the same kernel size. The DSM feature is
the difference between the DSM and the DTM, i.e.
$f_{DSM} = DSM - DTM$. This feature describes the relative elevation
of objects above ground such as buildings, trees, or bridges.
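The derivation of the DSM feature can be sketched on a small height grid. This is an illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, a square structural element of side `2 * half + 1` is assumed, and windows are simply truncated at the grid border.

```python
from statistics import median

def window_filter(grid, half, func):
    """Apply func to the (2*half+1) x (2*half+1) window around each cell."""
    rows, cols = len(grid), len(grid[0])
    out = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            window = [grid[rr][cc]
                      for rr in range(max(0, r - half), min(rows, r + half + 1))
                      for cc in range(max(0, c - half), min(cols, c + half + 1))]
            out[r][c] = func(window)
    return out

def dsm_feature(dsm, half):
    """Return DSM - DTM, with the DTM from opening plus median filtering."""
    eroded = window_filter(dsm, half, min)     # erosion
    opened = window_filter(eroded, half, max)  # dilation completes the opening
    dtm = window_filter(opened, half, median)  # median filter, same kernel size
    return [[z - g for z, g in zip(z_row, g_row)]
            for z_row, g_row in zip(dsm, dtm)]

# Flat terrain at 10 m with one 15 m off-terrain pixel in the centre.
dsm = [[10.0] * 5 for _ in range(5)]
dsm[2][2] = 15.0
f_dsm = dsm_feature(dsm, half=1)
```

Because the off-terrain pixel is smaller than the 3 x 3 structural element, the opening removes it from the DTM, and the feature recovers its relative height of 5 m while remaining zero on the terrain.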
3.4.3 Car confidence feature: This is a feature that is supposed
to be particularly useful for classifying cars. We use the output
of the car detection algorithm described in (Leitloff et al.,
2010). However, we do not use the binary image of detected
cars, but the confidence image derived by that method. The
calculation of these confidence values uses an extended set of
Haar-like features (Lienhart et al., 2003) as input for pixel-wise
classification. The number of possible features depends on the
size of image samples used for training the classifier. Even for a
reduced GSD of 20 cm and the resulting image patch size of 30
by 30 pixels, more than 800,000 features exist. It is not possible
to calculate all those features during classification. Thus, the
number of features has to be reduced significantly during
training. The idea of using Adaptive Boosting (Friedman et al.,
2000) for feature reduction was introduced by Tieu &
Viola (2004). Boosting is an ensemble learning method, which
combines a set of simple classifiers to generate a strong
classifier. The output of each base (weak) learner is a
confidence value. The final classification is obtained from the
sum of
or clas
regress
thresh
becom
found
selecte
dataset
be fou
feature
classif
3.5
Trainit
probat
an est
compu
to be
param:
separa;
images
histogi
that pu
in Sec
scaled
classes
Sectio!
for MI
namel:
for pro
to gi
(Vishw
4.1
Under
(Germ
camer:
vertica
system
test ou
infrare
(flying
experi
conver
The nc
60%, 1
block i
For ou
their :
examp
rural a
orthop
Section
1000 >
trainin
pixels)
of the
200 x
classif
2011).