                                  overall   building   non-building
Graz, pixel level                  88.5      90.5       87.3
Graz, with super-pixels            90.6      92.1       90.0
Graz, with CRF                     93.7      92.1       93.4
San Francisco, pixel level         85.7      86.3       85.3
San Francisco, with super-pixels   89.2      89.0       91.8
San Francisco, with CRF            92.1      91.8       93.4
Table 1: Building classification accuracy (in %) in terms of correctly classified pixels on hand-labeled orthographic test data. The use of super-pixels as spatial support clearly improves accuracy. The CRF stage further improves the classification rates by enforcing a consistent final labeling of the super-pixels.
et al., 2006) and the derived DTM (Champion and Boldo, 2006). A combination of DTM and DSM yields absolute per-pixel elevation measurements above ground, which are used for building classification and modeling.
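In code, this combination amounts to a normalized DSM, i.e. the per-pixel difference between surface and terrain heights. A minimal sketch; the function name and the clamping of small negative residuals are our choices, not the authors':

```python
import numpy as np

def height_above_ground(dsm, dtm):
    """Per-pixel absolute elevation above ground (normalized DSM).

    dsm, dtm: 2-D arrays of surface and terrain heights in meters.
    Clamping negative residuals to zero is our assumption.
    """
    return np.maximum(dsm - dtm, 0.0)
```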
Building Classification. For all datasets, we train individual RF classifiers with 8 trees and a maximum depth of 14. The Sigma Points feature vectors are collected within small image patches (11 x 11 pixels). In this work the Sigma Points describe the statistics of feature cues such as color, texture, and elevation measurements within these patches. Texture information is obtained directly by computing first-order derivatives on the L channel of the CIELab color images. A combination of the color channels, the two gradients, and the elevation measurements yields a feature vector with 78 attributes, which can be trained and evaluated directly with the RF classifiers.
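A minimal training sketch using scikit-learn's RandomForestClassifier with the forest size and depth stated above; the file names, and the assumption that the 78-dimensional feature vectors have already been extracted per pixel, are ours:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Hypothetical inputs: one 78-dimensional Sigma Points feature vector per
# labeled pixel (color, gradient, and elevation statistics over 11x11 patches).
X_train = np.load("features_train.npy")   # shape (n_pixels, 78), assumed file
y_train = np.load("labels_train.npy")     # 1 = building, 0 = non-building

# Forest configuration as reported in the text: 8 trees, maximum depth 14.
rf = RandomForestClassifier(n_estimators=8, max_depth=14, n_jobs=-1)
rf.fit(X_train, y_train)

# Per-pixel class probabilities for a new image, later fused across viewpoints.
X_test = np.load("features_test.npy")
p_building = rf.predict_proba(X_test)[:, 1]
```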
In our approach we exploit hand-labeled ground truth maps for training the classifiers. Note that labeling the training data involves some human interaction, but since our approach works at the pixel level there is no need to accurately delineate complete building areas. Hence the labeling of the training data is straightforward and can be done efficiently by applying brush strokes representing either the building or the non-building class. For evaluation we additionally label randomly selected orthographic images (9 tiles per dataset). The obtained classification rates are summarized in Table 1. We report both the overall per-pixel classification rate (i.e., the fraction of all pixels correctly classified) and the average of the class-specific per-pixel rates; the latter is the more meaningful measure because the number of labeled pixels varies between the classes. On both datasets we obtain overall classification rates of more than 90%. Classifying a single aerial image at full resolution takes approximately 3 minutes on a dual-core machine.
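The two reported rates can be computed as follows; this is a straightforward sketch of the metrics described above, not the authors' evaluation code:

```python
import numpy as np

def classification_rates(y_true, y_pred):
    """Overall per-pixel accuracy and the mean of class-specific rates.

    The class-balanced average compensates for the varying number of
    labeled building vs. non-building pixels in the test tiles.
    """
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    overall = float(np.mean(y_true == y_pred))
    per_class = [float(np.mean(y_pred[y_true == c] == c))
                 for c in np.unique(y_true)]
    return overall, float(np.mean(per_class))
```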
The fusion step for color, height, and classification, including the super-pixel generation, takes less than 5 minutes for 6 different viewpoints covering an area of 150 x 150 meters. Quickshift is applied to a vector consisting of pixel location and CIELab color. The Quickshift parameters are set to σ = 2 and τ = 8. It turned out that these parameters capture nearly all object boundaries in the observed test images, while generating sufficiently small regions to preserve curved boundary shapes. The overall results, adding a CRF stage for classification refinement, are given for ω = 3.0.
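A comparable segmentation can be obtained with scikit-image's quickshift, which likewise clusters on pixel location plus CIELab color; mapping σ and τ onto its kernel_size and max_dist parameters is our assumption, not the authors' implementation:

```python
from skimage import io
from skimage.segmentation import quickshift

# Hypothetical fused orthographic RGB tile.
rgb = io.imread("ortho_tile.png")

# convert2lab=True makes quickshift operate on (x, y, L, a, b), matching the
# feature vector described in the text. kernel_size ~ sigma, max_dist ~ tau
# is our reading of the parameters.
labels = quickshift(rgb, kernel_size=2, max_dist=8, convert2lab=True)
print(labels.max() + 1, "super-pixels")
```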
Figure 3 shows a result for a fused image tile of Graz. While
the raw pixel-wise fusion of the class probabilities shows higher
granularity and blurred object boundaries due to inaccurate 3D
information (compare to Figure 2), an integration of super-pixels
and CRF improves the final building classification significantly.
Building Modeling. We use the proposed method to model complex building rooftops in 3D. Figure 3 shows a modeling result for a part of Graz.
Figure 3: Results for a small part of Graz. The first row depicts the input sources: color, elevation measurements, and the refined classification aggregated within super-pixels. The second row shows the computed super-pixels overlaid with the building mask, and the result of the refinement step, which groups super-pixels by taking into account the geometric primitives. The bottom row shows the corresponding constructed 3D building model.
In order to obtain a quantitative evaluation, the root mean squared error (RMSE) over all building pixels is computed between the fused DSM values and the heights obtained by 3D modeling. For Graz we obtain an RMSE of 1.9 meters, taking into account all 170e6 building pixels. For San Francisco the RMSE is 1.7 meters, evaluated on 210e6 pixels. In the prototype refinement, the parameter ω controls the trade-off between the level of detail and the degree of geometric simplification. For both datasets a smoothing factor of ω = 5.0 has given reliable results.
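The reported error measure reduces to an RMSE restricted to the building mask; a minimal sketch with hypothetical array names:

```python
import numpy as np

def building_rmse(dsm, model_heights, building_mask):
    """RMSE between fused DSM heights and modeled rooftop heights,
    evaluated only on pixels classified as building."""
    diff = dsm[building_mask] - model_heights[building_mask]
    return float(np.sqrt(np.mean(diff ** 2)))
```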
Figure 4 shows computed 3D models for San Francisco and Graz. For efficiency and large-scale capability we compute these models in tiles of 1600 x 1600 pixels. Given the fused color image (including the super-pixel segmentation), height, and classification images, the 3D model of Graz can be computed within an hour by processing the tiles sequentially.
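A simple way to realize the tiling is a generator over 1600 x 1600 windows; the tile size follows the text, while the sequential per-tile loop is our reading of the processing scheme:

```python
def iter_tiles(height, width, tile=1600):
    """Yield (row, col) slice pairs covering the ortho image in
    tile x tile blocks, so each tile can be modeled independently."""
    for r in range(0, height, tile):
        for c in range(0, width, tile):
            yield (slice(r, min(r + tile, height)),
                   slice(c, min(c + tile, width)))
```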
7 CONCLUSION
We have proposed an efficient, purely image-driven approach for constructing synthetic 3D models of buildings by exploiting redundant color and height information. First, an efficient classification at the pixel level has been introduced to separate buildings from the background. A pixel-wise fusion step integrates different modalities from multiple viewpoints into a common orthographic view. In particular, involving a super-pixel segmentation enables a generic modeling of any building rooftop shape and reduces the problem of outliers and computational complexity. We