The first step consists of finding corresponding image
features between these images (i.e. image points that
originate from the same 3D feature). In the case of video
data, features are tracked until the disparities become
sufficiently large for an accurate estimation of the
epipolar geometry.
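This tracking strategy can be sketched with OpenCV's KLT tracker; the median-disparity threshold below is an illustrative assumption, a simple stand-in for the model-selection test described in Section 2.1.

```python
import cv2
import numpy as np

def track_until_baseline(frames, min_median_disparity=30.0):
    """Track feature points through a video until the median image-space
    displacement is large enough for a stable epipolar-geometry estimate.
    The 30-pixel threshold is an illustrative choice, not a tuned value."""
    prev = cv2.cvtColor(frames[0], cv2.COLOR_BGR2GRAY)
    p0 = cv2.goodFeaturesToTrack(prev, maxCorners=500,
                                 qualityLevel=0.01, minDistance=7)
    pts = p0
    for i, frame in enumerate(frames[1:], start=1):
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        pts, status, _ = cv2.calcOpticalFlowPyrLK(prev, gray, pts, None)
        ok = status.ravel() == 1
        p0, pts = p0[ok], pts[ok]              # keep surviving tracks only
        disparity = np.median(np.linalg.norm(
            (pts - p0).reshape(-1, 2), axis=1))
        if disparity > min_median_disparity:   # baseline large enough
            return i, p0.reshape(-1, 2), pts.reshape(-1, 2)
        prev = gray
    return None                                # never reached a wide baseline
```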
The next step consists of recovering the motion and cali-
bration of the camera and the 3D structure of the tracked
or matched features. This process is done in two phases. At
first the reconstruction contains a projective skew (i.e. par-
allel lines are not parallel, angles are not correct, distances
are too long or too short, etc.). This is due to the absence
of a priori calibration. Using a self-calibration algorithm
(Pollefeys et al., 1999a) this distortion can be removed,
yielding a reconstruction equivalent to the original up to
a global scale factor. This uncalibrated approach to 3D
reconstruction allows much more flexibility in the acqui-
sition process, since the focal length and other intrinsic
camera parameters do not have to be measured (calibrated)
beforehand and are even allowed to change during the
acquisition.
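The projective skew can be made explicit with a short numpy sketch: transforming the structure by an arbitrary invertible 4x4 matrix H while compensating the cameras with its inverse reproduces exactly the same image points, so image measurements alone cannot resolve the ambiguity and a self-calibration step is required.

```python
import numpy as np

rng = np.random.default_rng(0)

# An arbitrary camera matrix and homogeneous 3D points (illustrative values).
P = rng.standard_normal((3, 4))
X = np.vstack([rng.standard_normal((3, 10)), np.ones((1, 10))])

# Any invertible 4x4 matrix H yields an equally valid reconstruction:
# cameras P H^-1 with points H X project to the same image points as P, X.
H = rng.standard_normal((4, 4))
assert np.allclose(P @ X, (P @ np.linalg.inv(H)) @ (H @ X))
```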
The reconstruction obtained as described in the previous
paragraph only contains a sparse set of 3D points. Al-
though interpolation might be a solution, this yields mod-
els with poor visual quality. Therefore, the next step
attempts to match every pixel of an image with pixels in
neighboring images, so that these points too can be
reconstructed. This task is greatly facilitated by the
camera parameters recovered in the previous stage. Since a pixel in the im-
age corresponds to a ray in space and the projection of this
ray in other images can be predicted from the recovered
pose and calibration, the search for a corresponding pixel in
other images can be restricted to a single line. Additional
constraints such as the assumption of a piecewise continu-
ous 3D surface are also employed to further constrain the
search. It is possible to warp the images so that the search
range coincides with the horizontal scan-lines. An algo-
rithm that can achieve this for arbitrary camera motion is
described in (Pollefeys et al., 1999b). This allows us to
use an efficient stereo algorithm that computes an optimal
match for the whole scan-line at once (Van Meerbergen et
al., 2002). Thus, we can obtain a depth estimate (i.e. the
distance from the camera to the object surface) for almost
every pixel of an image. By fusing the results of all the
images together a complete dense 3D surface model is ob-
tained. The images used for the reconstruction can also
be used for texture mapping so that a final photo-realistic
result is achieved. The different steps of the process are
illustrated in Figure 1. In the following paragraphs the
different steps are described in more detail.
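The following sketch illustrates this scan-line-restricted search with off-the-shelf OpenCV components; cv2.stereoRectifyUncalibrated performs a planar rectification which, unlike the polar method of (Pollefeys et al., 1999b), breaks down when the epipole lies inside the image, and semi-global matching stands in for the scan-line optimizer of (Van Meerbergen et al., 2002).

```python
import cv2
import numpy as np

def dense_depth_sketch(img1, img2, pts1, pts2, F):
    """Warp two views so that corresponding points share a horizontal
    scan-line, then search for matches along those lines with a stereo
    matcher. Planar rectification and SGBM are stand-ins for the cited
    polar rectification and scan-line optimization algorithms."""
    h, w = img1.shape[:2]
    # Rectifying homographies from the point matches and fundamental matrix.
    _, H1, H2 = cv2.stereoRectifyUncalibrated(pts1, pts2, F, (w, h))
    r1 = cv2.warpPerspective(img1, H1, (w, h))
    r2 = cv2.warpPerspective(img2, H2, (w, h))
    # Correspondence search is now restricted to horizontal scan-lines.
    sgbm = cv2.StereoSGBM_create(minDisparity=0, numDisparities=128,
                                 blockSize=5)
    disp = sgbm.compute(cv2.cvtColor(r1, cv2.COLOR_BGR2GRAY),
                        cv2.cvtColor(r2, cv2.COLOR_BGR2GRAY))
    return disp.astype(np.float32) / 16.0  # SGBM returns fixed-point values
```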
2.1 Relating images
Starting from a collection of images or a video sequence
the first step consists of relating the different images to
each other. This is not an easy problem. A restricted num-
ber of corresponding points is sufficient to determine the
geometric relationship or multi-view constraints between
the images.

[Figure 1: Overview of our image-based 3D recording approach. Pipeline stages: Relating images -> Structure & Motion recovery -> Dense Matching -> 3D Model Building -> 3D surface model.]

Since not all points are equally suited for matching
or tracking (e.g. a pixel in a homogeneous region),
feature points need to be selected (Harris and Stephens,
1988, Shi and Tomasi, 1994). Depending on the type of
image data (i.e. video or still pictures) the feature points
are tracked or matched and a number of potential corre-
spondences are obtained. From these the multi-view con-
straints can be computed. However, since the correspon-
dence problem is an ill-posed problem, the set of corre-
sponding points can (and almost certainly will) be con-
taminated with a significant number of wrong matches or
outliers. A traditional least-squares approach would fail, and
therefore a robust method is used (Torr, 1995, Fischler and
Bolles, 1981). Once the multi-view constraints have been
obtained they can be used to guide the search for additional
correspondences. These can then be employed to further
refine the results for the multi-view constraints.
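A minimal sketch of this robust estimation step, with OpenCV's RANSAC-based estimator standing in for the methods of (Torr, 1995, Fischler and Bolles, 1981); the pixel threshold and confidence level are illustrative assumptions.

```python
import cv2
import numpy as np

def robust_epipolar_geometry(pts1, pts2, thresh=1.0):
    """Estimate the fundamental matrix from putative matches that are
    contaminated with outliers. RANSAC fits the model to minimal samples
    and keeps the hypothesis with the most inliers, where a single
    least-squares fit over all matches would be corrupted."""
    F, mask = cv2.findFundamentalMat(pts1, pts2, cv2.FM_RANSAC,
                                     ransacReprojThreshold=thresh,
                                     confidence=0.999)
    inliers = mask.ravel().astype(bool)
    # The inlier set can now guide the search for further correspondences
    # and a subsequent refinement of the multi-view constraints.
    return F, pts1[inliers], pts2[inliers]
```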
In the case of video, the epipolar geometry between two
consecutive views is not well determined. In fact, as long
as the camera has not moved sufficiently, the motion of
the features can just as well be explained by a homogra-
phy. The Geometric Robust Information Criterion (GRIC)
proposed by Torr (Torr et al., 1999) makes it possible to
evaluate which of the two models, epipolar geometry (F)
or homography (H), is best suited to explain the data. Typ-
ically, for very small baselines the homography model is
always selected; as the baseline gets larger, both models
become equivalent, and eventually the epipolar geometry
model outperforms the homography-based one. One can
reliably compute the epipolar geometry from the moment
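The GRIC comparison can be sketched as follows, using the formulation commonly quoted for the two-view case; the constants and parameter counts follow (Torr et al., 1999), and the noise level sigma is an assumed input.

```python
import numpy as np

def gric(residuals, sigma, n, model):
    """Geometric Robust Information Criterion for two-view model
    selection. residuals: geometric errors of the n matches under the
    model; model: 'F' (epipolar geometry) or 'H' (homography).
    r = 4 (a correspondence lives in R^4), d = dimension of the model
    manifold, k = number of model parameters; lower GRIC is better."""
    r = 4.0
    d, k = (3, 7) if model == 'F' else (2, 8)
    lam1, lam2, lam3 = np.log(r), np.log(r * n), 2.0
    rho = np.minimum(residuals ** 2 / sigma ** 2, lam3 * (r - d))
    return rho.sum() + lam1 * d * n + lam2 * k

# Select the epipolar-geometry model once its score drops below that of
# the homography, e.g.:
#   use_F = gric(res_F, 1.0, n, 'F') < gric(res_H, 1.0, n, 'H')
```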