XXII ISPRS Congress 2012: Technical Commission III

$\varphi_i(x_i, y) \propto \log p(f_i(y) \mid x_i = c)$. Thus, $\varphi_i(x_i, y)$ is the log likelihood 
of the generative model with a smoothness prior on the labels 
expressed in Eq. 1. In this context, the image data are 
represented by site-wise feature vectors $f_i(y)$ that may depend 
on the data observed at site $i$ and its local neighbourhood. Both 
the definition of the features and the dimension of the feature 
vectors $f_i(y)$ may vary from dataset to dataset, because the 
definition of appropriate and expressive features depends on the 
image resolution and also on the spectral information contained 
in the images. 
We use a simple model for the likelihood $p(f_i(y) \mid x_i)$ and, 
consequently, for the association potential $\varphi_i(x_i, y)$. In the 
training phase, we generate a histogram of each feature for each 
class. These histograms are smoothed and normalised and then 
used as probability density functions (pdf) 
$p(f_{ij} \mid x_i = c) = p_c(f_{ij} \mid x_i)$, where $f_{ij}$ is the $j$-th 
component of $f_i$, for the class $c$. Neglecting the statistical 
dependencies between the individual features $f_{ij}$, the likelihood 
$p(f_i(y) \mid x_i = c)$ becomes the product of the probability density 
functions of the individual features, so that the association 
potential becomes the sum of the logarithms of these functions: 
$$\varphi_i(x_i = c, y) = \sum_{j=1}^{M} \log\left[ p_c(f_{ij} \mid x_i) \right] \qquad (2)$$
In Eq. 2, $M$ is the dimension of the site-wise feature vectors 
$f_i(y)$. This is a very simplistic model, which is to be replaced by 
more appropriate ones in the future. Its advantage is that it is 
very fast to train. 
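
The histogram-based association potential of Eq. 2 can be sketched as follows. This is only an illustration under stated assumptions, not the authors' implementation: the box smoothing, the `smooth_radius` parameter, the `eps` guard against log(0) and the function names `train_histograms` and `association_potential` are all hypothetical. Features are assumed to be already quantized to integers in 0..255.

```python
import math


def train_histograms(features, labels, n_classes, n_bins=256, smooth_radius=2):
    """Return pdf[c][j][v]: smoothed, normalised histogram of feature j for class c.

    features: list of integer feature vectors (values in 0..n_bins-1)
    labels:   list of class indices, one per feature vector
    """
    n_feat = len(features[0])
    hist = [[[0.0] * n_bins for _ in range(n_feat)] for _ in range(n_classes)]
    for f, c in zip(features, labels):
        for j, v in enumerate(f):
            hist[c][j][v] += 1.0

    pdf = []
    for c in range(n_classes):
        pdf_c = []
        for j in range(n_feat):
            h = hist[c][j]
            # Assumption: simple box smoothing; the paper does not name the filter.
            smoothed = []
            for v in range(n_bins):
                lo = max(0, v - smooth_radius)
                hi = min(n_bins, v + smooth_radius + 1)
                smoothed.append(sum(h[lo:hi]) / (hi - lo))
            total = sum(smoothed)
            if total > 0:
                pdf_c.append([s / total for s in smoothed])
            else:
                # No training samples for this class: fall back to a uniform pdf.
                pdf_c.append([1.0 / n_bins] * n_bins)
        pdf.append(pdf_c)
    return pdf


def association_potential(pdf, f, c, eps=1e-12):
    """Sum of log pdf values over all M features (Eq. 2); eps avoids log(0)."""
    return sum(math.log(pdf[c][j][v] + eps) for j, v in enumerate(f))
```

Training reduces to one counting pass over the labelled pixels plus a smoothing pass per histogram, which is consistent with the remark that the model is fast to train.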
3.3 Interaction Potential 
The pairwise interaction potential $\psi_{ij}(x_i, x_j)$ in Eq. 1 is a 
measure for the influence of the neighbouring label $x_j$ on the class 
$x_i$ of image site $i$. The function $\psi_{ij}(x_i, x_j)$ characterizes how 
likely the variable $x_i$ is to take the value $c$, given that the variable 
$x_j$ from the neighbouring data site $j \in N_i$ takes the value $c'$, that 
is, $\psi_{ij}(x_i, x_j) \propto \log p(x_i = c \mid x_j = c')$. 
In order to determine this probability, a 2D histogram of the co- 
occurrence of labels at neighbouring image sites is 
generated from the training data. The histogram corresponds to 
a (symmetric) matrix of dimension $n_c \times n_c$, where $n_c$ is the 
number of classes to be discerned, and the matrix entry at $(c, c')$ 
is the number of occurrences of the classes $(c, c')$ at 
neighbouring pixels $i$ and $j$. After generating the histogram 
matrix, its rows are scaled so that the largest value in a row 
(usually the diagonal element) will be one. This is done in order 
to compensate for different numbers of pixels per class in the 
training data, i.e. to guarantee that $p(x_i = c \mid x_j = c)$ will be the 
same for all classes $c \in C$. The interaction potentials $\psi_{ij}(x_i, x_j)$ 
are then defined as the logarithm of the scaled histogram matrix 
entries. It is a drawback of this type of scaling that $\psi_{ij}(x_i, x_j)$ will 
no longer be symmetric. 
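
A minimal sketch of the co-occurrence histogram and its row scaling is given below. It assumes a 4-neighbourhood and adds a small `eps` before taking the logarithm. Both are assumptions, as is the function name `interaction_potentials`; the text does not specify the neighbourhood or how empty matrix entries are handled.

```python
import math


def interaction_potentials(label_img, n_classes, eps=1e-12):
    """Return (counts, potentials) from a 2D list of training labels.

    counts[c][c2]     : symmetric co-occurrence histogram of neighbouring labels
    potentials[c][c2] : log of the row-scaled histogram (largest row entry -> 1)
    """
    rows, cols = len(label_img), len(label_img[0])
    counts = [[0.0] * n_classes for _ in range(n_classes)]
    for r in range(rows):
        for c in range(cols):
            a = label_img[r][c]
            # Assumption: 4-neighbourhood; visit right and lower neighbour once each.
            for dr, dc in ((0, 1), (1, 0)):
                rr, cc = r + dr, c + dc
                if rr < rows and cc < cols:
                    b = label_img[rr][cc]
                    counts[a][b] += 1.0
                    counts[b][a] += 1.0  # count both directions: symmetric matrix

    potentials = []
    for row in counts:
        row_max = max(row)
        scaled = [v / row_max if row_max > 0 else 0.0 for v in row]
        potentials.append([math.log(s + eps) for s in scaled])
    return counts, potentials
```

Because each row is divided by its own maximum, two classes with different pixel counts get different scale factors, which is why the resulting potentials lose the symmetry of the raw histogram.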
3.4 Definition of the Features 
As stated in Section 3.1, we derive a feature vector $f_i(y)$ for each 
image site $i$ that consists of $M_{img}$ features derived from the 
orthophoto (image features) collected in a vector $f_{img}$, a feature 
derived from the DSM ($f_{DSM}$) and, optionally, a car confidence 
feature ($f_{car}$). Consequently, the total number $M$ of features is 
either $M = M_{img} + 1$ or $M = M_{img} + 2$, corresponding to 
$f_i(y)^T = (f_{img}^T, f_{DSM})$ or $f_i(y)^T = (f_{img}^T, f_{DSM}, f_{car})$, depending on 
whether the car confidence feature is used or not. In any case, 
for numerical reasons all features are scaled linearly into the 
range between 0 and 255 and then quantized to 8 bits. 
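
The linear scaling and 8-bit quantization can be illustrated as follows. The sketch assumes that the scaling range is given by the minimum and maximum of the feature values themselves, which the text does not state; the function name `quantize_8bit` is hypothetical.

```python
def quantize_8bit(values):
    """Scale values linearly to [0, 255] and round to integers (8 bits)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Constant feature: no spread to scale, map everything to 0.
        return [0 for _ in values]
    scale = 255.0 / (hi - lo)
    return [int(round((v - lo) * scale)) for v in values]
```

After this step every feature indexes directly into a 256-bin histogram, so the class-wise pdfs can be stored as fixed-size lookup tables.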
3.4.1 Image features: We do not use the colour vectors of the 
images directly to define the site-wise image feature vectors $f_{img}$. 
In total, we determine 7 image features. The first three features 
are the normalized difference vegetation index (NDVI), derived 
from the near infrared and the red band of the CIR orthophoto, 
the saturation (sat) component after transforming the image to 
the LHS colour space, and image intensity (int), calculated as 
the average of the two non-infrared channels. We also make use 
of the variance of intensity ($var_{int}$) and the variance of 
saturation ($var_{sat}$), determined from a local neighbourhood of 
each pixel (7 x 7 pixels for $var_{int}$, 13 x 13 pixels for $var_{sat}$). The 
sixth image feature (dist) represents the relation between an 
image site and its nearest edge pixel; this feature should model 
the fact that road pixels are usually found at a certain distance 
either from road edges or road markings. We generate an edge 
image by thresholding the intensity gradient of the input image. 
Then, we determine a distance map from this edge image. The 
feature used in classification is the distance of an image site to 
its nearest edge pixel, taken from the distance map. The last 
image feature is the local gradient orientation, calculated with 
respect to the main gradient orientation ($ori_{grad}$). In order to 
compute the gradient orientation, we calculate two histograms 
of oriented gradients (HOG) (Dalal & Triggs, 2005), one 
considering a local neighbourhood (13 x 13 pixels in our 
experiments), and one from a larger neighbourhood (101 x 101 
pixels). Each histogram consists of 9 orientation bins. The 
feature is the difference between the angles corresponding to the 
histogram bins having the maximum entries in the two 
histograms. Thus, the image feature vector for each pixel is 
$f_{img} = (NDVI, sat, int, var_{sat}, var_{int}, dist, ori_{grad})^T$. 
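
The distance feature described above (threshold the intensity gradient, then build a distance map) can be sketched as follows. The gradient operator (central differences), the city-block metric obtained by a breadth-first search, and the names `edge_image` and `distance_map` are assumptions made for illustration; the text does not specify them.

```python
from collections import deque


def edge_image(intensity, threshold):
    """Mark pixels whose gradient magnitude exceeds the threshold."""
    rows, cols = len(intensity), len(intensity[0])
    edges = [[False] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            # Central differences, clamped at the image border (assumption).
            gx = intensity[r][min(c + 1, cols - 1)] - intensity[r][max(c - 1, 0)]
            gy = intensity[min(r + 1, rows - 1)][c] - intensity[max(r - 1, 0)][c]
            edges[r][c] = (gx * gx + gy * gy) ** 0.5 > threshold
    return edges


def distance_map(edges):
    """City-block distance of every pixel to its nearest edge pixel.

    Uses a multi-source breadth-first search over the 4-neighbourhood.
    Pixels stay None if the edge image contains no edge pixel at all.
    """
    rows, cols = len(edges), len(edges[0])
    dist = [[None] * cols for _ in range(rows)]
    queue = deque()
    for r in range(rows):
        for c in range(cols):
            if edges[r][c]:
                dist[r][c] = 0
                queue.append((r, c))
    while queue:
        r, c = queue.popleft()
        for dr, dc in ((0, 1), (1, 0), (0, -1), (-1, 0)):
            rr, cc = r + dr, c + dc
            if 0 <= rr < rows and 0 <= cc < cols and dist[rr][cc] is None:
                dist[rr][cc] = dist[r][c] + 1
                queue.append((rr, cc))
    return dist
```

A Euclidean distance transform would serve the same purpose; the breadth-first variant is used here only because it needs no external library.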
3.4.2 DSM feature: A coarse Digital Terrain Model (DTM) is 
generated from the DSM by applying a morphological opening 
filter with a structural element whose size corresponds to the 
size of the largest off-terrain structure in the scene, followed by 
a median filter with the same kernel size. The DSM feature is 
the difference between the DSM and the DTM, i.e. 
$f_{DSM} = DSM - DTM$. This feature describes the relative elevation 
of objects above ground such as buildings, trees, or bridges. 
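
The DTM generation and the resulting DSM feature can be sketched with simple sliding-window filters. The square structural element, the clipping of windows at the image border and the function name `normalised_dsm` are assumptions for illustration only.

```python
import statistics


def _window(img, r, c, half):
    """Values in the (2*half+1)^2 window around (r, c), clipped at the border."""
    rows, cols = len(img), len(img[0])
    return [img[rr][cc]
            for rr in range(max(0, r - half), min(rows, r + half + 1))
            for cc in range(max(0, c - half), min(cols, c + half + 1))]


def _filter(img, half, func):
    """Apply func (min, max, median, ...) to every sliding window."""
    return [[func(_window(img, r, c, half)) for c in range(len(img[0]))]
            for r in range(len(img))]


def normalised_dsm(dsm, half):
    """Return DSM - DTM, with the DTM from an opening followed by a median filter."""
    eroded = _filter(dsm, half, min)           # erosion
    opened = _filter(eroded, half, max)        # dilation of the erosion = opening
    dtm = _filter(opened, half, statistics.median)
    return [[d - t for d, t in zip(dsm_row, dtm_row)]
            for dsm_row, dtm_row in zip(dsm, dtm)]
```

The window must be larger than the largest off-terrain object; otherwise the erosion cannot reach the ground inside that object and its height leaks into the DTM.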
3.4.3 Car confidence feature: This feature is intended 
to be particularly useful for classifying cars. We use the output 
of the car detection algorithm described in (Leitloff et al., 
2010). However, we do not use the binary image of detected 
cars, but the confidence image derived by that method. The 
calculation of these confidence values uses an extended set of 
Haar-like features (Lienhart et al., 2003) as input for pixel-wise 
classification. The number of possible features depends on the 
size of image samples used for training the classifier. Even for a 
reduced GSD of 20 cm and the resulting image patch size of 30 
by 30 pixels, more than 800,000 features exist. It is not possible 
to calculate all those features during classification. Thus, the 
number of features has to be reduced significantly during 
training. The idea of using Adaptive Boosting (Friedman et al., 
2000) for feature reduction has been introduced by Tieu & 
Viola (2004). Boosting is an ensemble learning method, which 
combines a set of simple classifiers to generate a strong 
classifier. The output of each base (weak) learner is a 
confidence value. The final classification is obtained from the 
  
   
sum of 
or clas 
regress 
thresh 
becom 
found 
selecte 
dataset 
be fou 
feature 
classif 
3.5 
Trainit 
probat 
an est 
compu 
to be 
param: 
separa; 
images 
histogi 
that pu 
in Sec 
scaled 
classes 
Sectio! 
for MI 
namel: 
for pro 
to gi 
(Vishw 
4.1 
Under 
(Germ 
camer: 
vertica 
system 
test ou 
infrare 
(flying 
experi 
conver 
The nc 
60%, 1 
block i 
For ou 
their : 
examp 
rural a 
orthop 
Section 
1000 > 
trainin 
pixels) 
of the 
200 x 
classif 
2011).