first the reconstruction contains a projective skew (i.e. par-
allel lines are not parallel, angles are not correct, distances
are too long or too short, etc.). This is due to the absence
of a priori calibration. Using a self-calibration algorithm
(Pollefeys et al., 1999a) this distortion can be removed,
yielding a reconstruction equivalent to the original up to
a global scale factor. This uncalibrated approach to 3D
reconstruction allows much more flexibility in the acquisi-
tion process since the focal length and other intrinsic cam-
era parameters do not have to be measured —calibrated—
beforehand and are allowed to change during the acquisition.
The reconstruction obtained as described in the previous
paragraph only contains a sparse set of 3D points. Al-
though interpolation might be a solution, this yields mod-
els with poor visual quality. Therefore, the next step attempts
to match every pixel of an image with pixels in neighboring
images, so that these points too
can be reconstructed. This task is greatly facilitated by the
knowledge of all the camera parameters which we have
obtained in the previous stage. Since a pixel in the im-
age corresponds to a ray in space and the projection of this
ray in other images can be predicted from the recovered
pose and calibration, the search for a corresponding pixel in
other images can be restricted to a single line. Additional
constraints such as the assumption of a piecewise continu-
ous 3D surface are also employed to further constrain the
search. It is possible to warp the images so that the search
range coincides with the horizontal scan-lines. An algo-
rithm that can achieve this for arbitrary camera motion is
described in (Pollefeys et al., 1999b). This allows us to
use an efficient stereo algorithm that computes an optimal
match for the whole scan-line at once (Van Meerbergen et
al., 2002). Thus, we can obtain a depth estimate (i.e. the
distance from the camera to the object surface) for almost
every pixel of an image. By fusing the results of all the
images together a complete dense 3D surface model is ob-
tained. The images used for the reconstruction can also
be used for texture mapping so that a final photo-realistic
result is achieved. The different steps of the process are
illustrated in Figure 1. In the following paragraphs the dif-
ferent steps are described in some more detail.
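Since the recovered pose and calibration reduce the correspondence search to a single line, this prediction can be made concrete with the fundamental matrix: a pixel x in one image must have its correspondence on the epipolar line l' = F x in the other. The following minimal sketch illustrates this; the matrix values and function names are illustrative, not part of the described system.

```python
import numpy as np

def epipolar_line(F, x):
    """Epipolar line l' = F @ x in the second image for pixel x in the first.

    F is the 3x3 fundamental matrix; x is in homogeneous coordinates.
    The returned line (a, b, c) satisfies a*u + b*v + c = 0, normalized so
    that |a*u + b*v + c| is the point-to-line distance in pixels.
    """
    l = F @ x
    return l / np.linalg.norm(l[:2])

# Illustrative F for a pure horizontal translation (a rectified pair)
F = np.array([[0.0, 0.0,  0.0],
              [0.0, 0.0, -1.0],
              [0.0, 1.0,  0.0]])
x = np.array([100.0, 50.0, 1.0])     # pixel (100, 50) in the first image
line = epipolar_line(F, x)
# Here the line is (0, -1, 50), i.e. the horizontal scan-line v = 50,
# matching the warped-image case where the search range coincides
# with scan-lines.
```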
2.1 Relating images
Starting from a collection of images or a video sequence
the first step consists of relating the different images to
each other. This is not an easy problem. A restricted num-
ber of corresponding points is sufficient to determine the
geometric relationship or multi-view constraints between
the images. Since not all points are equally suited for
matching or tracking (e.g. a pixel in a homogeneous re-
gion), feature points need to be selected (Harris and Stephens,
1988, Shi and Tomasi, 1994). Depending on the type of
image data (i.e. video or still pictures) the feature points
are tracked or matched and a number of potential corre-
spondences are obtained. From these the multi-view con-
straints can be computed. However, since the correspon-
dence problem is an ill-posed problem, the set of corre-
sponding points can (and almost certainly will) be con-
taminated with a significant number of wrong matches or
Figure 2: Comparison of epipolar geometry (F-GRIC/solid
line) and homography model (H-GRIC/dashed line) for
tracked features. The lowest GRIC value yields the best
model.
outliers. A traditional least-squares approach will fail and
therefore a robust method is used (Torr, 1995, Fischler and
Bolles, 1981). Once the multi-view constraints have been
obtained they can be used to guide the search for additional
correspondences. These can then be employed to further
refine the results for the multi-view constraints.
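The robust estimation step can be illustrated with a minimal RANSAC sketch in the spirit of Fischler and Bolles (1981). For brevity a 2D line stands in for the multi-view constraint; the function names, thresholds, and data are illustrative only.

```python
import numpy as np

def ransac_line(points, n_iter=200, threshold=0.5, rng=None):
    """Find the consensus set of a line fit to outlier-contaminated 2D points.

    Repeatedly fits a line to a minimal sample (2 points), counts the points
    within `threshold` of it, and keeps the hypothesis with the largest
    support. Returns the inlier mask of the best hypothesis.
    """
    if rng is None:
        rng = np.random.default_rng(0)
    best_inliers = None
    for _ in range(n_iter):
        i, j = rng.choice(len(points), size=2, replace=False)
        p, q = points[i], points[j]
        d = q - p
        norm = np.linalg.norm(d)
        if norm < 1e-9:
            continue                          # degenerate sample
        n = np.array([-d[1], d[0]]) / norm    # unit normal of the line
        dist = np.abs((points - p) @ n)       # point-to-line distances
        inliers = dist < threshold
        if best_inliers is None or inliers.sum() > best_inliers.sum():
            best_inliers = inliers
    return best_inliers

# 80 points on y = 2x + 1 plus 20 gross outliers; least squares would be
# pulled off the line, while the consensus set recovers the 80 good points.
rng = np.random.default_rng(1)
xs = rng.uniform(0, 10, 80)
good = np.column_stack([xs, 2 * xs + 1])
bad = rng.uniform(-50, 50, (20, 2))
pts = np.vstack([good, bad])
mask = ransac_line(pts)
```

In the actual system the minimal sample is a set of point correspondences and the hypothesis is a fundamental matrix or homography, but the hypothesize-and-verify structure is the same.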
In the case of video, the epipolar geometry between two
consecutive views is often not well determined. In fact, as
long as the camera has not sufficiently moved, the mo-
tion of the features can just as well be explained by a ho-
mography. The Geometric Robust Information Criterion
(GRIC) proposed by Torr (Torr et al., 1999) makes it possible
to evaluate which of the two models, epipolar geometry (F)
or homography (H), is best suited to explain the data. Typically,
for very small baselines the homography model is
always selected; as the baseline gets larger, both models
become equivalent, and eventually the epipolar geometry
model outperforms the homography-based one. This can
be seen in Figure 2. One can reliably compute the epipolar
geometry from the moment that the F-GRIC value drops
below the H-GRIC value. We then select as a new key-
frame the last frame for which the number of tracked fea-
tures is above 90% of the number of features tracked at the
F-GRIC/H-GRIC intersection.
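The model comparison can be sketched as follows. This uses one common form of Torr's GRIC; the constants and dimensions below follow typical usage and may differ from the exact settings in the cited work, so treat them as illustrative.

```python
import numpy as np

def gric(residuals, sigma, model):
    """Geometric Robust Information Criterion (one common form).

    model = 'F': epipolar geometry (manifold dimension d=3, k=7 parameters);
    model = 'H': homography (d=2, k=8). r=4 is the dimension of a point
    correspondence (two image points). Lower GRIC indicates the better
    model. Constants follow common usage of Torr's criterion and are
    assumptions, not the paper's exact settings.
    """
    n = len(residuals)
    r = 4
    d, k = (3, 7) if model == 'F' else (2, 8)
    lam3 = 2.0
    # Robust residual: capped so a single outlier cannot dominate the score
    rho = np.minimum(residuals**2 / sigma**2, lam3 * (r - d))
    return rho.sum() + np.log(r) * d * n + np.log(r * n) * k

# Hypothetical tracking residuals (pixels): when both models fit the data
# equally well (small baseline), the penalty terms favor the simpler H model.
res = np.full(200, 0.3)
better = 'F' if gric(res, 1.0, 'F') < gric(res, 1.0, 'H') else 'H'
```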
2.2 Structure and motion recovery
The relation between the views and the correspondences
between the features, retrieved as explained in the previ-
ous section, will be used to retrieve the structure of the
scene and the motion of the camera. Our approach is fully
projective so that it does not depend on the initialization.
This is achieved by strictly carrying out all measurements
in the images, i.e. using reprojection errors instead of 3D
errors.
At first two images are selected and an initial projective re-
construction frame is set up (Faugeras, 1992, Hartley et al.,
1992). Matching feature points are reconstructed through
triangulation. Feature points that are also observed in a