Figure 2: Fusion result: The first row shows six redundant orthographic views of a scene taken from Graz. Fused image results are given in the second row for color, height and building classification. Undefined areas are compensated to a large extent by exploiting the high redundancy.
the set of Sigma Points¹ as follows:

$$p_0 = \mu, \qquad p_i = \mu + \alpha \left(\sqrt{\Sigma}\right)_i, \qquad p_{i+d} = \mu - \alpha \left(\sqrt{\Sigma}\right)_i, \qquad (1)$$

where $i = 1 \dots d$ and $(\sqrt{\Sigma})_i$ denotes the $i$-th column of the required matrix square root. Due to the symmetry of the covariance matrix, we apply the Cholesky factorization to efficiently compute the matrix square root of $\Sigma$. The term $\alpha$ defines a weighting for the elements in the covariance matrix and is set to $\alpha = \sqrt{2d}$ as suggested in (Kluckner et al., 2009). A resulting region descriptor $P = \{p_0, \dots, p_{2d}\}$ then consists of $2d + 1$ concatenated Sigma Points $p_i \in \mathbb{R}^d$ and has a dimension of $P \in \mathbb{R}^{(2d+1)d}$.

¹ Code available at http://www.icg.tugraz.at/Members/kluckner
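A minimal sketch of this descriptor computation, assuming per-pixel feature vectors (e.g., color and gradient channels) collected for one region; the function name and the small ridge added before the Cholesky factorization are our assumptions for illustration and numerical stability:

```python
import numpy as np

def sigma_points_descriptor(features):
    """Sigma Points region descriptor of Eq. (1).

    features: (n, d) array, one d-dimensional feature vector per pixel.
    Returns the concatenated descriptor P of dimension (2d + 1) * d.
    """
    n, d = features.shape
    alpha = np.sqrt(2 * d)                    # weighting as given above
    mu = features.mean(axis=0)
    Sigma = np.cov(features, rowvar=False)
    # Cholesky factor L with L @ L.T == Sigma; the ridge keeps the
    # factorization stable for near-singular covariances (assumption).
    L = np.linalg.cholesky(Sigma + 1e-9 * np.eye(d))
    plus = [mu + alpha * L[:, i] for i in range(d)]    # p_1 .. p_d
    minus = [mu - alpha * L[:, i] for i in range(d)]   # p_{d+1} .. p_{2d}
    return np.concatenate([mu] + plus + minus)         # P in R^{(2d+1) d}
```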
For details we refer to (Kluckner et al., 2009). The next section
describes the fusion of redundant information into a common 3D
coordinate system.
4 FUSION OF MULTIPLE IMAGES
Because of the high overlap in the aerial imagery, each point on the ground is mapped multiple times from different viewpoints. Since we are interested in large-scale modeling, we generate an orthographic image from many overlapping perspective images by a pixel-wise transformation into a common 3D coordinate system. Taking into account camera data and depth information, provided by a dense matching procedure, corresponding pixels in the perspective images yield multiple observations for color, height and building classification in the orthographic view. Several rectified observations of a scene taken from the Graz imagery are shown in Figure 2.
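The pixel-wise transformation can be sketched as follows; the camera convention (intrinsics K, world-to-camera rotation R, camera center C), the depth definition, and all names are our assumptions for illustration, not the paper's exact parameterization:

```python
import numpy as np

def perspective_to_world(depth, K, R, C):
    """Back-project every perspective pixel to a 3D world point using the
    camera data and the depth map from dense matching."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    pix = np.stack([u.ravel(), v.ravel(), np.ones(h * w)])
    rays = np.linalg.inv(K) @ pix              # viewing rays, camera frame
    X_cam = rays * depth.ravel()               # scale rays by depth
    return (R.T @ X_cam).T + C                 # (h*w, 3) world coordinates

def collect_observations(points, values, origin, gsd, shape):
    """Bin world points into an orthographic grid so that each ortho pixel
    accumulates its redundant observations (color, height, confidence)."""
    cols = ((points[:, 0] - origin[0]) / gsd).astype(int)
    rows = ((origin[1] - points[:, 1]) / gsd).astype(int)
    obs = [[[] for _ in range(shape[1])] for _ in range(shape[0])]
    ok = (rows >= 0) & (rows < shape[0]) & (cols >= 0) & (cols < shape[1])
    for r, c, val in zip(rows[ok], cols[ok], values[ok]):
        obs[r][c].append(val)
    return obs
```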
The fusion of redundant information into a common view has the benefit that, e.g., reconstruction errors caused by non-stationary objects such as moving cars can be compensated. In addition, the projection of many different views produces an orthographic image without undefined image regions caused by perspective occlusions. First, color and height information are fused by computing median values for each pixel from the multiple observations. For robustly fusing color information per pixel, we use random projections of the color vector onto 1D lines to detect the median of vector-valued data (Tukey, 1974). Although a simple mean has lower computational complexity, the median does not introduce new color values, as averaging possibly would. In addition, an accurate fused color image is essential for the super-pixel segmentation performed in the next step. In order to estimate a final building likelihood for each pixel in the orthographic view, confidences from the different views are accumulated
and normalized. Figure 2 depicts the final pixel-wise fusion result for color, height and building classification. In the next step we briefly discuss super-pixels and introduce an optimization stage to refine the classification and the prototype labeling on a super-pixel neighborhood.
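One way to realize the projection-based vector median is sketched below; the number of random directions is an assumed choice, and the returned value is always one of the observed samples, so no new colors are introduced:

```python
import numpy as np

def vector_median(samples, n_dirs=16, rng=None):
    """Approximate median of vector-valued data (e.g., RGB observations
    of one ortho pixel) via random projections onto 1D lines."""
    rng = np.random.default_rng(0) if rng is None else rng
    dirs = rng.normal(size=(n_dirs, samples.shape[1]))
    dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
    proj = samples @ dirs.T                    # (n, n_dirs) 1D projections
    med = np.median(proj, axis=0)              # 1D median per direction
    # return the sample whose projections are closest to all 1D medians
    return samples[np.argmin(np.abs(proj - med).sum(axis=1))]
```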
4.1 Super-Pixel Segmentation
A variety of recently proposed methods obtaining state-of-the-art performance on benchmark datasets integrate unsupervised image segmentation into classification or object detection. Several approaches utilize multiple segmentations (Malisiewicz and Efros, 2007, Pantofaru et al., 2008); however, the generation of many partitions induces enormous computational complexity and is impractical for aerial image segmentation. Recently, Fulkerson et al. (Fulkerson et al., 2009) proposed to use super-pixels, rapidly generated by Quickshift (Vedaldi and Soatto, 2008). These super-pixels accurately preserve the object boundaries of natural and man-made objects. Applying Quickshift super-pixel segmentation in our approach offers several benefits: First, the computed super-pixels can be seen as the smallest units in the image space, so all subsequent processing steps can be performed on a reduced adjacency graph instead of the full pixel grid. Furthermore, we consider super-pixels as homogeneous regions that provide important spatial support: Due to their edge-preserving capability, each super-pixel describes a part of only one class, namely building or non-building. Aggregating data, such as classification and height information, over the pixels defining a super-pixel compensates for outliers and erroneous pixels. For instance, an accumulation of building likelihoods results in an improved building classification for each segment. Averaging color within the small regions synthesizes the final modeling results and significantly reduces the amount of data. More importantly, we exploit super-pixels, which define parts of the building footprints, for the 3D modeling procedure. Taking into account a derived polygon approximation of the boundary pixels and the corresponding height information, classified building footprints can be extruded to form any type of geometric 3D primitive. Therefore, introducing super-pixels for the footprint description allows us to model any kind of ground plan and, subsequently, the rooftop.
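As a sketch, the segmentation and the per-segment aggregation of building likelihoods might look as follows, using the Quickshift implementation from scikit-image; the parameter values are illustrative assumptions, not the paper's settings:

```python
import numpy as np
from skimage.segmentation import quickshift

def superpixel_building_likelihood(ortho_rgb, building_prob):
    """Segment the fused ortho image into super-pixels and average the
    per-pixel building likelihood over each segment."""
    labels = quickshift(ortho_rgb, ratio=0.5, kernel_size=5, max_dist=10)
    n = labels.max() + 1
    sums = np.bincount(labels.ravel(), weights=building_prob.ravel(),
                       minlength=n)
    counts = np.bincount(labels.ravel(), minlength=n)
    return labels, sums / counts    # mean likelihood per super-pixel
```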
4.2 Refined Labeling using Super-Pixels
Although aggregating the fused building classification or extracting geometric prototypes using super-pixels captures some local information, the regions in the image space are handled independently. In order to incorporate spatial dependencies between nodes defined on the image grid, Markov random field formulations (Boykov et al., 2001), e.g., are widely used to enforce a consistent final class labeling. In contrast to minimizing the energy on a full image grid (Pantofaru et al., 2008, Kluckner et al., 2009), we apply a conditional random field (CRF) stage defined on the super-pixel neighborhoods, similar to that proposed in (Fulkerson et al., 2009). In our approach we apply the refinement on super-pixels twice: First, we apply the CRF to provide a smooth labeling of the building class, taking into account the spatial dependencies on an adjacency graph. Second, in a separate processing step, the CRF is used for a consistent labeling of the geometric prototypes to enforce a piecewise planar rooftop.
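A minimal sketch of the super-pixel adjacency graph and a Potts-style energy of the kind minimized by graph cuts (Boykov et al., 2001); the unary terms would come from the accumulated class likelihoods, and the smoothness weight lam is an assumed parameter, not the paper's pairwise model:

```python
import numpy as np

def adjacency_edges(labels):
    """Edges of the super-pixel adjacency graph: two segments are
    connected if their labels touch horizontally or vertically."""
    h = np.stack([labels[:, :-1].ravel(), labels[:, 1:].ravel()], axis=1)
    v = np.stack([labels[:-1, :].ravel(), labels[1:, :].ravel()], axis=1)
    pairs = np.concatenate([h, v])
    pairs = pairs[pairs[:, 0] != pairs[:, 1]]  # keep boundary pairs only
    return np.unique(np.sort(pairs, axis=1), axis=0)

def potts_energy(unary, edges, labeling, lam=1.0):
    """E(c) = sum_i unary[i, c_i] + lam * |{(i, j) in E : c_i != c_j}|"""
    data = unary[np.arange(len(labeling)), labeling].sum()
    smooth = lam * np.count_nonzero(
        labeling[edges[:, 0]] != labeling[edges[:, 1]])
    return data + smooth
```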
Let $G(S, E)$ be an adjacency graph with super-pixel nodes $s_i \in S$ and edges $(s_i, s_j) \in E$ between the segments $s_i$ and $s_j$; an energy can then be defined with respect to the class labels $c$. In this work a label can be the building/non-building class or a possible assignment to a specific geometric primitive. Generally,