problems is the reliability of feature extraction, the other the
exponential complexity of the matching task. For the latter
approach, in order to have a two-dimensional representation of a
three-dimensional shape, one has to decompose the shape into
several views and store a two-dimensional representation for
each view. This is referred to as an aspect graph. For our
system, since we have an approximated exterior orientation of
the imaging device, we do not have to build the whole aspect
graph, rather we create on-the-fly a single view of the shape
corresponding to the orientation data, as we receive the image.
2.1 View Class Representation
With the exception of a few simple shapes (a sphere, for
example), every three-dimensional object has a different
appearance when seen from different viewpoints. The more
complex the shape, the stronger the differences. This is the
case for buildings. In the view class representation scheme
(Koenderink & van Doorn 1979) the space of possible
viewpoints is partitioned into view classes. The view classes are
arranged in a graph known as the aspect graph (see Figure 1).
Each node represents a single view class; each arc represents
the transition from one viewpoint to another. The methods used
to compute the view classes of an object can be very complex
and the aspect graph of a non-trivial shape is quite large. In the
subsequent matching process the input data has to be compared
to a multitude of nodes.
In our framework we make use of the knowledge of an
approximated exterior orientation. This translates to a single
point in the space of possible viewpoints. Therefore we are not
required to compute the full aspect graph. If the viewpoint is
known only to a very low accuracy it is sufficient to compute a
small part of the aspect graph in the neighborhood of the
approximated viewpoint. Our experience shows that for our
application it is usually sufficient to reduce the graph to a single
node. This single node does not need to be stored beforehand
but can be computed on-the-fly. The matching process is
restricted to a single instance, further reducing computational
costs.
Figure 1: Part of the aspect graph of a simple building.
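The reduction to a single view-class node can be illustrated with a minimal sketch (not the system's actual implementation): for a polyhedral model, back-face culling against the approximated camera position yields the set of faces visible from that viewpoint, which is what characterizes the view class. The function `visible_faces` and the cube data below are hypothetical illustrations.

```python
import numpy as np

def visible_faces(face_normals, face_points, camera_pos):
    """Back-face culling: a face is potentially visible when its
    outward normal points toward the camera. The resulting set of
    visible faces identifies the single view class needed for
    matching."""
    visible = []
    for i, (n, p) in enumerate(zip(face_normals, face_points)):
        view_dir = camera_pos - p          # vector from face to camera
        if np.dot(n, view_dir) > 0.0:      # front-facing
            visible.append(i)
    return visible

# Unit cube centred at the origin: one point and outward normal per face.
normals = [np.array(v, float) for v in
           [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]]
points = [0.5 * n for n in normals]
print(visible_faces(normals, points, np.array([3.0, 2.0, 1.5])))
# three front faces are visible from this corner viewpoint
```

Neighbouring view classes, as needed for a less accurate orientation, would correspond to camera positions for which this set of visible faces changes.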
2.2 Feature Extraction
When designing an object recognition system one has to choose
the type of features used for recognition and the algorithm used
to perform the matching. The decision on the feature type is
often guided by the available model data. In our case the
buildings are modeled as polyhedra; no in-plane facade detail
or texture information is available. The strong discrepancy in
feature detail between model and sensor data prevented us
from using edge or corner detection. Since there is no texture
information available, image correlation was also not an option.
To achieve a robust detection we chose to detect the overall
shape of the building in the image rather than extracting single
features. The intent was that the overall shape is more robust
against scene clutter, partial occlusion by trees, cars,
pedestrians, and other negative influences on the scene. The
silhouette of a building is a good representation for its overall
shape. From an existing CAD database the CAD model of the
building is rendered for a given view according to the
calibration data of the camera. The details on how to select the
specific building are given in (Klinec & Fritsch 2001). The
‘virtual view’ of the building is used to extract the silhouette of
the building (see Figure 2). This representation is then detected
in the scene.
Figure 2: CAD model of a building rendered for a given exterior
orientation and the extracted silhouette below.
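Once the virtual view has been rasterized to a binary object mask, extracting the silhouette amounts to finding the object pixels that touch the background. The following is a minimal sketch of that step; the function name and the NumPy-based neighbour test are illustrative assumptions, not the system's actual implementation.

```python
import numpy as np

def silhouette(mask):
    """Return a boolean image marking the silhouette of a binary
    object mask: object pixels with at least one 4-connected
    background neighbour."""
    m = mask.astype(bool)
    padded = np.pad(m, 1, constant_values=False)
    # A pixel is interior when all four 4-neighbours belong to the object.
    interior = (padded[:-2, 1:-1] & padded[2:, 1:-1] &
                padded[1:-1, :-2] & padded[1:-1, 2:])
    return m & ~interior

# A 5x5 filled square inside a 7x7 image: the silhouette is its 1-pixel ring.
img = np.zeros((7, 7), bool)
img[1:6, 1:6] = True
print(silhouette(img).sum())   # 16 boundary pixels
```

The resulting one-pixel-wide contour is the shape representation that is subsequently searched for in the scene.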
2.3 Generalized Hough Transform
We decided to use the Generalized Hough Transform (GHT) to
implement the detection. The GHT is a framework for both the
representation and detection of two-dimensional shapes in
images. Using the GHT we are able to detect the shape no
matter whether it is shifted, rotated or optionally even scaled in
the image. We need these degrees of freedom, since the
orientation is only known approximately. Additionally the GHT
allows for a certain tolerance in shape deviation, which is
necessary, since the CAD model of the building is only a coarse
generalization of its actual shape as it appears in the image.
The well-known conventional Hough Transform (Hough, 1962)
is used to detect analytical curves such as lines or circles in an
image. The requirement for the application of the Hough
Transform is that the model can be formulated as a function of a
set of parameters. The GHT is the generalization of that concept
for the detection of arbitrary shapes for which a simple analytic
description is not possible (Ballard, 1981).
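As a reminder of the conventional case, a minimal line-detecting Hough transform can be sketched as follows. Each edge point votes for all lines rho = x cos(theta) + y sin(theta) passing through it; the parameter ranges and rounding below are simplified for illustration.

```python
import numpy as np

def hough_lines(edge_points, n_theta=180, max_rho=100):
    """Classical Hough transform for lines: every edge point casts a
    vote for each parameter pair (rho, theta) consistent with it."""
    thetas = np.deg2rad(np.arange(n_theta))
    acc = np.zeros((2 * max_rho, n_theta), int)   # rho offset by max_rho
    for x, y in edge_points:
        rhos = np.round(x * np.cos(thetas) + y * np.sin(thetas)).astype(int)
        acc[rhos + max_rho, np.arange(n_theta)] += 1
    return acc, thetas

# Ten collinear points on the horizontal line y = 20.
pts = [(x, 20) for x in range(0, 50, 5)]
acc, thetas = hough_lines(pts)
rho_idx, theta_idx = np.unravel_index(acc.argmax(), acc.shape)
print(rho_idx - 100, np.rad2deg(thetas[theta_idx]))   # rho=20, theta=90
```

The peak in the accumulator recovers the line parameters; the GHT replaces the analytic parameterization with the table-driven scheme described next.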
The GHT uses a gradient operator to compute edge magnitude
and gradient direction. During the offline phase, which has to
be computed once for a certain view class, the so-called R-table
is built. First, a reference point O of the shape is selected;
usually the centroid of the shape is used. Then the displacement
vector r from every edge pixel P to the reference point O is
stored in the R-table, indexed by the gradient direction φ at P
(see Figure 3). When iterating over all edge pixels, there can of
course be several r vectors stored for a single entry of φ.
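The offline phase described above can be sketched as follows. This is a translation-only illustration; the function name `build_r_table` and the 36-bin quantisation of φ are assumptions for the example, not the paper's exact parameters.

```python
import numpy as np
from collections import defaultdict

def build_r_table(edge_pixels, gradient_dirs, n_bins=36):
    """Offline phase of the GHT: for each edge pixel P, store the
    displacement vector r to the reference point O (here the
    centroid), indexed by the quantised gradient direction phi."""
    pts = np.asarray(edge_pixels, float)
    ref = pts.mean(axis=0)                       # reference point O
    r_table = defaultdict(list)
    for p, phi in zip(pts, gradient_dirs):
        bin_idx = int(phi / (2 * np.pi) * n_bins) % n_bins
        r_table[bin_idx].append(tuple(ref - p))  # r vector for this phi bin
    return r_table, ref

# Toy example: four edge pixels with two distinct gradient directions.
rt, ref = build_r_table([(0, 0), (2, 0), (0, 2), (2, 2)],
                        [0.0, 0.0, np.pi, np.pi])
print(ref)   # centroid of the edge pixels
```

Note that the two pixels sharing gradient direction 0 produce two r vectors under the same table entry, exactly the multiplicity mentioned above.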
In the online phase, when the actual detection is performed, the
gradients of the search image are computed. For each edge
pixel the gradient direction points to an entry in the R-table;
for every r vector stored there, a vote is cast for the implied
position of the reference point in an accumulator array. Maxima
of the accumulator indicate likely positions of the shape in the
image.
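The online voting step can be sketched in the same simplified, translation-only setting; the toy R-table and the function name `ght_vote` below are illustrative assumptions.

```python
import numpy as np

def ght_vote(edge_pixels, gradient_dirs, r_table, acc_shape, n_bins=36):
    """Online phase of the GHT (translation only): each edge pixel
    looks up the r vectors for its quantised gradient direction and
    votes for the implied reference-point location."""
    acc = np.zeros(acc_shape, int)
    for (x, y), phi in zip(edge_pixels, gradient_dirs):
        bin_idx = int(phi / (2 * np.pi) * n_bins) % n_bins
        for rx, ry in r_table.get(bin_idx, []):
            cx, cy = int(round(x + rx)), int(round(y + ry))
            if 0 <= cx < acc_shape[0] and 0 <= cy < acc_shape[1]:
                acc[cx, cy] += 1   # vote for a candidate reference point
    return acc                     # peaks mark likely shape positions

# Toy R-table: edges with gradient bin 0 imply the reference point
# lies one pixel down-right of the edge pixel.
r_table = {0: [(1.0, 1.0)]}
acc = ght_vote([(2, 2), (4, 4)], [0.0, 0.0], r_table, (10, 10))
print(acc[3, 3], acc[5, 5])   # both cells received one vote
```

Rotation and scale, which the approximate orientation makes necessary as additional degrees of freedom, would be handled by extending the accumulator with further dimensions and rotating/scaling the r vectors before voting.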