In recent years, the matching of multiple views of an object has enabled the reconstruction of 3D object points with high accuracy and high density. Earlier approaches such as (Kanade and Okutomi, 1994) are based on a low-level preprocessing of the image to extract points of interest. The correspondences of such points are then used to estimate the 3D positions of the object points. In many applications, Förstner features (Förstner and Gülch, 1987) or SIFT features (Lowe, 2004) are used, but the derived point clouds are either sparse or have been extracted from many images or video, e. g. (Mayer and Reznik, 2005) and (Gallup et al., 2007). In (Tuytelaars and Van Gool, 2000), the correspondences are determined over local affinely invariant regions, which are extracted from local extrema in intensity images. This procedure is prone to matching errors when the image noise is relatively high.
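To make the feature-based pipeline concrete, the following minimal Python/OpenCV sketch (not the implementation of any of the cited papers) matches SIFT key-points between two views and triangulates a sparse point cloud. The projection matrices P1 and P2 are assumed to be known from a prior orientation step, and the ratio-test threshold of 0.75 is only illustrative.

```python
import cv2
import numpy as np

def sparse_point_cloud(img1, img2, P1, P2):
    """Triangulate a sparse point cloud from SIFT matches between two views.

    img1, img2: grayscale images; P1, P2: 3x4 projection matrices
    (assumed known, e.g. from a prior relative orientation step).
    """
    sift = cv2.SIFT_create()
    kp1, des1 = sift.detectAndCompute(img1, None)
    kp2, des2 = sift.detectAndCompute(img2, None)

    # Lowe's ratio test on nearest-neighbour matches
    matcher = cv2.BFMatcher(cv2.NORM_L2)
    matches = matcher.knnMatch(des1, des2, k=2)
    good = [m for m, n in matches if m.distance < 0.75 * n.distance]

    pts1 = np.float32([kp1[m.queryIdx].pt for m in good]).T  # 2xN array
    pts2 = np.float32([kp2[m.trainIdx].pt for m in good]).T

    # Linear triangulation of the corresponding points
    X_h = cv2.triangulatePoints(P1, P2, pts1, pts2)           # 4xN homogeneous
    return (X_h[:3] / X_h[3]).T                               # Nx3 Euclidean
```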
Dense point clouds from only a few images are obtained by determining pixel-wise correspondences via correlation with (semi-)global methods, e. g. (Hirschmüller, 2005). Assuming the observed objects have a smooth surface, the accuracy of the obtained point clouds can be increased by including information on the relations between the pixels, either through a Markov random field, e. g. (Yang et al., 2009), or through image segmentation, e. g. (Tao and Sawhney, 2000).
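As a concrete illustration of such dense matching, the sketch below uses OpenCV's StereoSGBM, a semi-global matcher in the spirit of (Hirschmüller, 2005). It assumes a rectified image pair; the block size and penalty values are illustrative defaults, not the parameters used in our experiments.

```python
import cv2
import numpy as np

def dense_disparity(rect_left, rect_right, min_disp=0, num_disp=128):
    """Semi-global matching on a rectified pair; returns disparities in pixels."""
    block = 5
    sgbm = cv2.StereoSGBM_create(
        minDisparity=min_disp,
        numDisparities=num_disp,            # must be a multiple of 16
        blockSize=block,
        P1=8 * block * block,               # smoothness penalty for small jumps
        P2=32 * block * block,              # smoothness penalty for large jumps
        uniquenessRatio=10,
        speckleWindowSize=100,
        speckleRange=2,
    )
    disp = sgbm.compute(rect_left, rect_right).astype(np.float32) / 16.0
    disp[disp <= min_disp] = np.nan         # mark invalid / occluded pixels
    return disp

# 3D points then follow from cv2.reprojectImageTo3D(disp, Q), given the
# 4x4 reprojection matrix Q of the rectification.
```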
In our approach, we take up the idea of (Khoshelham, 2005) to improve an initial image segmentation using additional 3D information. From multi-view analysis, we derive a point cloud, which is used to derive additional features for the segmented image regions. We focus on building scenes, whose objects consist mostly of planar surfaces. It is therefore reasonable to look for dominant planes in the point cloud, with the search guided by the image segmentation.
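The following sketch indicates how such a segmentation-guided plane search could look. It is not our exact algorithm, but a generic RANSAC plane fit applied separately to the 3D points belonging to each segmented region; the per-point region labels and the distance threshold are hypothetical inputs.

```python
import numpy as np

def dominant_plane(points, n_iter=500, tol=0.05, rng=None):
    """RANSAC fit of a single dominant plane to an Nx3 point set.

    Returns ((n, d), inliers) with the plane n.x + d = 0 and an inlier mask;
    tol is the inlier distance threshold in scene units.
    """
    if rng is None:
        rng = np.random.default_rng(0)
    best_inliers = np.zeros(len(points), dtype=bool)
    best_plane = None
    for _ in range(n_iter):
        sample = points[rng.choice(len(points), 3, replace=False)]
        n = np.cross(sample[1] - sample[0], sample[2] - sample[0])
        norm = np.linalg.norm(n)
        if norm < 1e-9:                      # degenerate (collinear) sample
            continue
        n = n / norm
        d = -n @ sample[0]
        inliers = np.abs(points @ n + d) < tol
        if inliers.sum() > best_inliers.sum():
            best_inliers, best_plane = inliers, (n, d)
    return best_plane, best_inliers

# Guiding the search by the segmentation: fit one dominant plane per region,
# using only the 3D points whose image projections fall inside that region.
# `labels` is a hypothetical per-point region id derived from the segmentation.
def planes_per_region(points, labels):
    return {r: dominant_plane(points[labels == r]) for r in np.unique(labels)}
```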
For us, it is important to realize an approach that has the potential to be automated, since there are many applications with thousands of images. A completely automatic procedure is needed if additional features are to be derived from a reconstructed point cloud to improve the segmentation or interpretation of the images. Our only input is two or more images of the object, taken with a calibrated camera. An example is shown in fig. 1.
3 RECONSTRUCTION OF THE 3D SCENE
In this section, we describe the generation of the point cloud C from the given images. For this generation, two conditions should be fulfilled: (a) the observed objects should be sufficiently textured, and (b) the views must overlap, otherwise the relative orientation between the images cannot be determined reliably. So far, the implemented algorithms need some human interaction for setting the point cloud scale and the disparity range parameters, but under certain conditions the whole approach could be designed to run completely automatically.
Figure 1: Three aerial views of a building scene consisting of a flat-roofed part and a gable-roofed part. The initial segmentation of the upper view is shown on its right side. The ground consists of several oddly shaped regions, and the flat roof is also not well segmented.

Figure 2: Reconstructed 3D points projected back into the 2D image (white). Left: all pairs of matches are shown. The point cloud is very dense, with approximately 75% of the pixels having a 3D point, but these points are very imprecise. Right: only matches found in all three images are shown. The point cloud is still dense, with approximately 30% of the pixels having a 3D point, with higher precision.

We describe the procedure with two or three given images I1, I2 and I3.
Two views are necessary to reconstruct the observed 3D data, but if the matching is performed over three images, the point cloud is still dense, see fig. 2, and it contains more reliable points and thus fewer outliers. The reconstruction process can be improved further if even more images are considered. If all images used were taken with a calibrated camera, we are able to reconstruct the 3D scene by performing the following steps.
In the first step, we determine the relative orientations between the given images. Of course, this step can be skipped if the projection matrices have already been estimated during image acquisition. Otherwise, using the calibration of the camera, we automatically eliminate the non-linear distortions with the approach of (Abraham and Hau, 1997). The matching of extracted key-points using the approach of (Lowe, 2004) then yields the relative orientations of all images, i. e. their projection matrices P_n, cf. (Läbe and Förstner, 2006). The success of the relative orientation can
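The following Python/OpenCV sketch illustrates such a two-view orientation step. It is not the implementation of (Abraham and Hau, 1997) or (Läbe and Förstner, 2006); it is an analogous pipeline assuming the calibration matrix K and standard radial/tangential distortion coefficients are given.

```python
import cv2
import numpy as np

def relative_orientation(img1, img2, K, dist):
    """Estimate the relative orientation of two views from a calibrated camera.

    K: 3x3 calibration matrix; dist: distortion coefficients (OpenCV's standard
    radial/tangential model, used here in place of the Abraham and Hau (1997)
    correction). Returns P1 = K[I|0] and P2 = K[R|t], with t fixed only up to scale.
    """
    # Remove non-linear lens distortion
    u1 = cv2.undistort(img1, K, dist)
    u2 = cv2.undistort(img2, K, dist)

    # SIFT key-point extraction and matching with Lowe's ratio test
    sift = cv2.SIFT_create()
    kp1, des1 = sift.detectAndCompute(u1, None)
    kp2, des2 = sift.detectAndCompute(u2, None)
    matches = cv2.BFMatcher(cv2.NORM_L2).knnMatch(des1, des2, k=2)
    good = [m for m, n in matches if m.distance < 0.75 * n.distance]
    pts1 = np.float32([kp1[m.queryIdx].pt for m in good])
    pts2 = np.float32([kp2[m.trainIdx].pt for m in good])

    # Essential matrix with RANSAC, then decompose into rotation and translation
    E, inl = cv2.findEssentialMat(pts1, pts2, K, method=cv2.RANSAC, threshold=1.0)
    _, R, t, _ = cv2.recoverPose(E, pts1, pts2, K, mask=inl)

    P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
    P2 = K @ np.hstack([R, t])
    return P1, P2
```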