The association and the interaction potentials are both functions of features corresponding to the image data; which features are used depends on whether a potential describes a single image site or a pair of neighbouring sites. In any case, all features are scaled linearly into the range [0; 255], i.e. they are quantized by 8 bit.
The node feature vectors of the graphical model combine feature vectors from several sources. The first group of features is radiometric and derived from the CIR orthophoto; it includes the normalised difference vegetation index (NDVI), computed from the near infrared and the red band, as well as features obtained by transforming the image to another colour representation. We also make use of a simple texture feature, namely the variance of the image intensity, computed in a local neighbourhood of each pixel (var). The next feature should model the relation between an image site and image edges within a certain distance. For that purpose, we generate an edge image from the input image and derive a distance transform of that edge image. The feature is the distance of an image site to the nearest edge according to the resulting distance map. The last radiometric feature is based on the gradient orientation, calculated in local neighbourhoods (grad). In order to capture gradient information at different scales, we compute two histograms of gradient orientations (Dalal & Triggs, 2005), one in a small neighbourhood and one in a large neighbourhood (101 x 101 pixels), each consisting of a fixed number of orientation bins. The feature value corresponding to the gradient orientation of each pixel is derived from the entries in the two histograms.
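As an illustration, the following is a minimal sketch of how the NDVI and the variance feature could be derived from a CIR orthophoto; the band order, the neighbourhood size and the scaling to 8 bit are assumptions made for this example only.

import numpy as np
from scipy.ndimage import uniform_filter

def ndvi_and_variance(cir, win=5):
    """Compute NDVI and a local intensity variance from a CIR orthophoto.

    cir : float array of shape (H, W, 3); band order (NIR, R, G) is assumed.
    win : side length of the local neighbourhood used for the variance feature.
    """
    nir, red = cir[..., 0], cir[..., 1]
    ndvi = (nir - red) / (nir + red + 1e-6)            # in [-1, 1]

    intensity = cir.mean(axis=2)
    mean = uniform_filter(intensity, size=win)          # local mean
    mean_sq = uniform_filter(intensity ** 2, size=win)
    var = np.maximum(mean_sq - mean ** 2, 0.0)          # local variance

    def to_uint8(x):
        # scale a feature linearly into the 8 bit range used for all features
        x = (x - x.min()) / (x.max() - x.min() + 1e-6)
        return (255 * x).astype(np.uint8)

    return to_uint8(ndvi), to_uint8(var)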
A Digital Terrain Model (DTM) is derived from the DSM by morphological opening, using a structural element whose size corresponds to the extent of the largest building in the scene, followed by a smoothing of the result. The nDSM feature is the difference between the DSM and the DTM, i.e., the relative elevation above the terrain, which is large for elevated objects such as buildings, trees, or bridges.
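A minimal sketch of this DTM/nDSM computation is given below; the size of the structuring element and the choice of the smoothing filter are illustrative parameters, not values taken from our implementation.

import numpy as np
from scipy.ndimage import grey_opening, uniform_filter

def normalised_dsm(dsm, opening_size_px, smooth_size_px=15):
    """Derive a DTM from the DSM by morphological opening and return the nDSM.

    dsm             : 2D float array of heights [m].
    opening_size_px : side length of the square structuring element; it should
                      exceed the footprint of the largest building in the scene.
    smooth_size_px  : size of the averaging filter applied to the opened DSM
                      (an assumed smoothing step).
    """
    dtm = grey_opening(dsm, size=(opening_size_px, opening_size_px))
    dtm = uniform_filter(dtm, size=smooth_size_px)
    ndsm = dsm - dtm                     # relative elevation above the terrain
    return np.clip(ndsm, 0.0, None)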
The last feature is one that is supposed to highlight cars. We use the output of the car detection approach of (Leitloff et al., 2010) and the confidence of detected cars delivered by that method. That detector uses an extended set of Haar-like features as input for pixel-wise classification. The number of possible features depends on the patch size of the classifier. Even for a small patch size of 30 pixels it is not possible to evaluate all features during classification. Thus, the number of features is reduced significantly during training. This is achieved by Boosting (Friedman et al., 2000) in the variant introduced by Tieu & Viola, which combines many weak learners to generate a strong classifier. Each weak (base) learner is a simple classifier, and the confidence of the strong classifier is obtained from the
sum of confidence values of all weak learners. Generally, stumps
or classification trees are used as base classifiers. Each node of a
regression tree applies a threshold to only one feature. The
thresholds and features are chosen so that the training error
becomes minimal. Thus, the most distinguishing features are
found during training. In our case only 350 features have been
selected, which makes the final classification suitable for large
datasets. More details about training the Boosting classifier can
be found in our previous work (Leitloff et al., 2010). The
feature f_car is defined as the combined confidence value of the
classifier.
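For illustration, a minimal sketch of how such a combined confidence value can be formed from decision stumps is shown below; it is not the implementation of (Leitloff et al., 2010), and all names and parameters are illustrative only.

import numpy as np

class Stump:
    """A decision stump: thresholds a single feature and returns a signed confidence."""
    def __init__(self, feature_idx, threshold, confidence):
        self.feature_idx = feature_idx
        self.threshold = threshold
        self.confidence = confidence          # weight learned during boosting

    def predict(self, features):
        # features: (N, D) array of Haar-like feature responses
        sign = np.where(features[:, self.feature_idx] > self.threshold, 1.0, -1.0)
        return sign * self.confidence

def strong_confidence(features, stumps):
    """Combined confidence of the boosted classifier, i.e. the sum of the
    confidence values of all weak learners (used here as the feature f_car)."""
    return np.sum([s.predict(features) for s in stumps], axis=0)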
3.6 Training and Inference
Training of the MRF is complex if it is to be carried out in a
probabilistic framework, mainly due to the fact that it requires
an estimate for the partition function Z in Eq. 1, which is
computationally intractable. Thus, approximate solutions have
to be used for training. In our application, we determine the
parameters of the association and interaction potentials
separately. That is, given the training data (fully labelled
images), the probabilities p(f_i | x) are determined from
histograms of the features f_i (which are quantized by 8 bit for
that purpose) for each class, followed by smoothing, in the way described
in Section 3.3. In a similar way, the interaction potentials are
scaled versions of the 2D histograms of the co-occurrence of
classes at neighbouring image sites in the way described in
Section 3.4. Exact inference is also computationally intractable
for MRFs. For inference, we use a message passing algorithm,
namely Loopy Belief Propagation (LBP), a standard technique
for probability propagation in graphs with cycles that has been shown
to give good results in the comparison reported in
(Vishwanathan et al., 2006).
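The following sketch illustrates how such class-conditional probabilities could be estimated from quantized feature histograms; the Gaussian smoothing is an assumed choice, since the text above only states that the histograms are smoothed.

import numpy as np
from scipy.ndimage import gaussian_filter1d

def class_feature_histograms(features, labels, n_classes, sigma=2.0):
    """Estimate p(f | x) for one 8 bit quantized feature from labelled training data.

    features : 1D uint8 array, one feature value per training site.
    labels   : 1D int array of class indices (the training labels x).
    sigma    : std. dev. of the Gaussian used to smooth the histograms
               (an assumed smoothing scheme).
    Returns an (n_classes, 256) array whose rows sum to one.
    """
    p = np.zeros((n_classes, 256))
    for c in range(n_classes):
        hist = np.bincount(features[labels == c], minlength=256).astype(float)
        hist = gaussian_filter1d(hist, sigma)
        p[c] = (hist + 1e-9) / (hist.sum() + 256 * 1e-9)   # normalise per class
    return p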
4. EXPERIMENTS
4.1 Test Data and Test Setup
Under the auspices of the DGPF a test data set over Vaihingen
(Germany) was acquired in order to evaluate digital aerial
camera systems (Cramer, 2010). It consists of several blocks of
vertical images captured by various digital aerial camera
systems at two resolutions. We used one of the DMC blocks to
test our approach. The images are 16 bit pan-sharpened colour
infrared images with a ground sampling distance (GSD) of 8 cm
(flying height: 800 m, focal length: 120 mm). For our
experiments, the radiometric resolution of the images had to be
converted to 8 bit. The georeferencing accuracy is about 1 pixel.
The nominal forward and side laps of the images are 65% and
60%, respectively. As a consequence, each crossroads in the
block is visible in at least four images.
For our experiments, we selected 55 crossroads by digitizing
their approximate centres. The set of crossroads contained
examples from densely built-up urban and suburban as well as
rural areas. For each crossroads, we generated a DSM and a true
orthophoto, both with a GSD of 8 cm in the way described in
Section 3.2; the size of the orthophotos used in our process was
1000 x 1000 pixels, thus corresponding to 80 x 80 m². In the
training phase we used the original orthophotos (1000 x 1000
pixels); for inference, squares of 5 x 5 pixels were used as nodes
of the graphical model; thus, each graphical model consisted of
200 x 200 nodes. For the car confidence feature we used a
classifier trained on data of DLR's 3K-system (Kurz et al.,
2011). The sample images have a resolution of 20 cm. Thus, the
Vaihingen dataset is resampled to this resolution for
classification. Due to different radiometric properties, the Haar-
like features are only calculated from intensity values. Both the
resampling and the exclusive use of intensity values limit the
classification performance in this context.
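A minimal sketch of how per-pixel feature images could be aggregated to the 200 x 200 node grid is given below; averaging over the 5 x 5 pixel squares is an assumption made for illustration, as the aggregation rule is not stated above.

import numpy as np

def pixels_to_nodes(feature_image, block=5):
    """Aggregate per-pixel features over non-overlapping block x block squares,
    e.g. turning a 1000 x 1000 feature image into a 200 x 200 grid of node
    features. Mean aggregation is an assumption."""
    h, w, d = feature_image.shape
    assert h % block == 0 and w % block == 0
    nodes = feature_image.reshape(h // block, block, w // block, block, d)
    return nodes.mean(axis=(1, 3))       # shape: (h / block, w / block, d)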
The ground truth was generated by manually labelling the image
areas using altogether 14 classes (cf. Figure 2). We use the
ground truth for the algorithm's training phase and for
evaluating the classification accuracy. In order to have a
sufficient amount of training data, we had to use cross
validation in our evaluation procedure: in each experiment, all
images except one were used for training, and the remaining
image served as a test image; this procedure was repeated 55
times, each time using a different test image, so that in the end
each image was used as a test image once. In all experiments,
confusion matrices were determined from a comparison of the
test images with the ground truth, as well as the completeness
and the correctness of the results for each class and the overall
classification accuracy (Rutzinger et al., 2009).
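Completeness and correctness correspond to the per-class recall and precision derived from the confusion matrix; the sketch below shows how all three quality measures follow from it (the row/column orientation of the matrix is assumed here).

import numpy as np

def quality_measures(confusion):
    """Per-class completeness (recall), correctness (precision) and overall
    accuracy from a confusion matrix whose rows correspond to the reference
    (ground truth) classes and whose columns to the predicted classes."""
    confusion = confusion.astype(float)
    tp = np.diag(confusion)
    completeness = tp / confusion.sum(axis=1)    # TP / (TP + FN)
    correctness = tp / confusion.sum(axis=0)     # TP / (TP + FP)
    overall_accuracy = tp.sum() / confusion.sum()
    return completeness, correctness, overall_accuracy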
We carried out four different experiments. In the first two
experiments, we tried to separate all 14 classes; the only
difference is the number of features we used. In the first
experiment we used all features described in Section 3.5,
including the car feature, whereas the second experiment was
carried out without the car feature. In the third and the fourth
experiments we reduced the set of classes to eight by merging
classes having a similar appearance in the data. Again, the two
experiments differ by the use of the car feature.
4.2 Evaluation
The confusion matrices as well as the completeness and the
correctness of the results achieved in the first two experiments
(the ones using 14 classes) are shown in Tables 1 and 2; an
example for the classification result is shown in Fig. 2. The
overall accuracy of the classification was 63.5% if the car
feature was used and 63.3% if it was not used. Thus, the overall
accuracy, while being relatively poor in both cases, was hardly
improved by that feature. The relatively poor overall accuracy is
caused by the fact that some of the classes have a very similar
appearance in the data, e.g. sealed, road, sidewalk, and traffic
islands. Reasonable values of completeness and correctness
could be achieved for buildings (> 80%). For trees, the
completeness is also larger than 80%, but the correctness is
much lower (62%). For both buildings and trees, the main error
source was errors in the DSM caused by areas with hardly any
texture (buildings) or abrupt height changes (trees). One of the
problems was the information reduction caused by the
conversion of the images to 8 bit, but apparently the OpenCV
matcher also had problems with non-fronto-parallel surfaces
and with different illumination. The main impact of the car
confidence feature was a considerable reduction of the false
positive car detections, though the correctness of 28% achieved
with this feature is still not satisfactory.
The evaluation of the experiments carried out with the reduced
set of classes is presented in Tables 3 and 4. The overall
accuracy increased to about 75%, which indicates that our
classification scheme is reasonable, though there is room for
improvement. The main error source is the confusion between
trees, grass, and agriculture, again partly caused by DSM
errors. In this setting, the impact of the car confidence feature is
similar to its impact in the first group of experiments.