ISPRS Commission III, Vol.34, Part 3A „Photogrammetric Computer Vision“, Graz, 2002
VIDEO-TO-3D
Marc Pollefeys*‡, Luc Van Gool†, Maarten Vergauwen†, Kurt Cornelis†, Frank Verbiest†, Jan Tops†
† Center for Processing of Speech and Images, K.U.Leuven
‡ Dept. of Computer Science, University of North Carolina – Chapel Hill
Marc.Pollefeys@cs.unc.edu
* corresponding author
Working Group III/V
KEY WORDS: 3D modeling, video sequences, structure from motion, self-calibration, stereo matching, image-based
rendering.
ABSTRACT:
In this contribution we present a complete system that takes a video sequence of a static scene as input and
generates a visual 3D model. The system can deal with images acquired by an uncalibrated hand-held camera, with intrinsic
camera parameters possibly varying during the acquisition. In a first stage features are extracted and tracked throughout
the sequence. Using robust statistics and multiple view relations the 3D structure of the observed features and the camera
motion and calibration are computed. In a second stage stereo matching is used to obtain a detailed estimate of the
geometry of the observed scene. The presented approach integrates state-of-the-art algorithms developed in computer
vision, computer graphics and photogrammetry. The resulting models are suited for both measurement and visualization
purposes.
1 INTRODUCTION
In recent years the emphasis for applications of 3D model-
ing has shifted from measurements to visualization. New
communication and visualization technologies have created
an important demand for photo-realistic 3D content. In
most cases virtual models of existing scenes are desired.
This has created a lot of interest in image-based approaches.
Applications can be found in e-commerce, real estate, games,
post-production and special effects, simulation, etc. For
most of these applications there is a need for simple and
flexible acquisition procedures; calibration should therefore
be unnecessary or kept to a minimum. Many new applica-
tions also require robust low cost acquisition systems. This
stimulates the use of consumer photo- or video cameras.
The approach presented in this paper makes it possible to capture
photo-realistic virtual models from images. The user ac-
quires the images by freely moving a camera around an
object or scene. Neither the camera motion nor the cam-
era settings have to be known a priori. There is also no
need for preliminary models. The approach can also be
used to combine virtual objects with real video, yielding
augmented video sequences.
The approach proposed in this paper builds further on ear-
lier work, e.g. (Pollefeys et al., 2000). Several important
improvements were made to the system. To deal more effi-
ciently with video, we have developed an approach that can
automatically select key-frames suited for structure and mo-
tion recovery. The projective structure and motion recov-
ery stage has been made completely independent of the ini-
tialization, which avoids some instability problems that oc-
curred with the quasi-Euclidean initialization proposed in
(Beardsley et al., 1997). Several optimizations have been
implemented to obtain more efficient robust algorithms (Matas
and Chum, 2001). To guarantee a maximum likelihood re-
construction at the different levels a state-of-the-art bundle
adjustment algorithm was implemented that can be used
both at the projective and the Euclidean level. A much
more robust linear self-calibration algorithm was obtained
by incorporating general a priori knowledge on meaning-
ful values for the camera intrinsics. This makes it possible to avoid
most problems related to critical motion sequences (Sturm,
1997) (i.e. some motions do not yield a unique solution
for the calibration of the intrinsics) that caused the ini-
tial linear algorithm proposed in (Pollefeys et al., 1998)
to yield poor results under some circumstances. Another
problem was also addressed: previously, observing a purely
planar scene at some point during the acquisition would
cause uncalibrated approaches to fail. A solution that
detects this case and deals with it accordingly has been
proposed (Pollefeys et al., 2002). Both
correction for radial distortion and stereo rectification have
been integrated in a single image resampling pass, which
minimizes image degradation. Our processing
pipeline uses a non-linear rectification scheme (Pollefeys
et al., 1999b) that can deal with all types of camera motion
(including forward motion). For the integration of multiple
depth maps into a single surface representation a volumet-
ric approach has been implemented (Curless and Levoy,
1996). The texture is obtained by blending the original
images based on the surface geometry so that the texture
quality is optimized. The resulting system is much more
robust and accurate, and can therefore be used efficiently
for many different applications.
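To make the role of robust estimation in the structure-and-motion stage more concrete, the following minimal Python/NumPy sketch illustrates the general technique of combining RANSAC with the normalized eight-point algorithm to estimate the fundamental matrix between two views from noisy feature correspondences. It is only an illustration under simplifying assumptions, not the implementation used in the system described here (which relies on the more efficient randomized sampling of Matas and Chum, 2001); the iteration count and inlier threshold are arbitrary example values.

import numpy as np

def normalize(pts):
    # Translate/scale points so their centroid is the origin and the
    # mean distance to the origin is sqrt(2); returns (points, transform).
    centroid = pts.mean(axis=0)
    dist = np.sqrt(((pts - centroid) ** 2).sum(axis=1)).mean()
    s = np.sqrt(2) / dist
    T = np.array([[s, 0, -s * centroid[0]],
                  [0, s, -s * centroid[1]],
                  [0, 0, 1]])
    pts_h = np.column_stack([pts, np.ones(len(pts))])
    return (T @ pts_h.T).T, T

def eight_point(x1, x2):
    # Linear estimate of F from >= 8 correspondences (x2' F x1 = 0).
    n1, T1 = normalize(x1)
    n2, T2 = normalize(x2)
    A = np.column_stack([n2[:, 0] * n1[:, 0], n2[:, 0] * n1[:, 1], n2[:, 0],
                         n2[:, 1] * n1[:, 0], n2[:, 1] * n1[:, 1], n2[:, 1],
                         n1[:, 0], n1[:, 1], np.ones(len(x1))])
    _, _, Vt = np.linalg.svd(A)
    F = Vt[-1].reshape(3, 3)
    U, S, Vt = np.linalg.svd(F)            # enforce the rank-2 constraint
    F = U @ np.diag([S[0], S[1], 0]) @ Vt
    return T2.T @ F @ T1                   # undo the normalization

def sampson_error(F, x1, x2):
    # First-order geometric error of each correspondence w.r.t. F.
    x1h = np.column_stack([x1, np.ones(len(x1))])
    x2h = np.column_stack([x2, np.ones(len(x2))])
    Fx1 = (F @ x1h.T).T
    Ftx2 = (F.T @ x2h.T).T
    num = np.sum(x2h * Fx1, axis=1) ** 2
    den = Fx1[:, 0]**2 + Fx1[:, 1]**2 + Ftx2[:, 0]**2 + Ftx2[:, 1]**2
    return num / den

def ransac_fundamental(x1, x2, iters=500, thresh=1.0):
    # Fit F to random 8-point samples, keep the solution with the
    # largest consensus set, then refit on all inliers.
    best_F, best_inliers = None, np.zeros(len(x1), dtype=bool)
    for _ in range(iters):
        idx = np.random.choice(len(x1), 8, replace=False)
        F = eight_point(x1[idx], x2[idx])
        inliers = sampson_error(F, x1, x2) < thresh
        if inliers.sum() > best_inliers.sum():
            best_F, best_inliers = F, inliers
    if best_inliers.sum() >= 8:
        best_F = eight_point(x1[best_inliers], x2[best_inliers])
    return best_F, best_inliers

The consensus-based selection makes the estimate insensitive to mismatched features, which is what allows the rest of the pipeline to rely on automatically tracked points without manual verification.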
2 FROM VIDEO TO 3D MODELS
Starting from a sequence of images, the first step consists
of recovering the relative motion between consecutive im-
ages. This process goes hand in hand with finding corre-