The International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences. Vol. XXXVII. Part B5. Beijing 2008
The scene is then reconstructed with the technique proposed in
Section 3, where an approach for reconstructing wide-area scenes
from high-resolution images is proposed and the associated
computational issues are addressed. In our technique, the
conventional space-sweeping approach (e.g. Zabulis et al., 2003)
is slightly modified to employ a sweeping spherical, instead of
planar, back-projection surface. The result is a more accurate and
memory-conserving technique. Moreover, this extension
facilitates the acceleration of the method through a coarse-to-fine
depth map computation.
The proposed approach offers the user the ability to
reconstruct a scene from a few snapshots acquired with an off-
the-shelf camera, preferably of high resolution. This way, a few
snapshots suffice for the reconstruction and the image
acquisition process becomes much simpler than capturing the
scene with a video camera or with a multicamera apparatus
(Mordohai et al., 2007).
The final result is a textured mesh in either the Keyhole Markup
Language (KML) or Virtual Reality Modeling Language
(VRML) format. The KML output allows integration into the
Google Earth™ platform, so that the reconstructed 3D models and
their virtual walkthrough applications can easily become part
of a large geographical information system (GIS) in the near
future. Section 4 describes the Web-based virtual tour
application that was developed.
2. ROBUST CAMERA MOTION ESTIMATION
BASED ON SIFT DETECTION AND MATCHING
Robust estimation of the camera motion is essential, since the
accuracy of the produced 3D reconstruction is based on this
information. Our work is based on the approach proposed
initially by Beardsley et al. (1997), and, subsequently, extended
by Pollefeys et al. (1999, 2004) and Tola (2005). The approach
establishes correspondences across consecutive images of a
sequence to estimate camera motion.
Previous approaches used the Harris corner detector (Harris and
Stephens, 1988) to extract point features in images. The
matching procedure utilized similarity as well as proximity
criteria (Tola, 2005) to avoid spurious matches. In this paper, an
alternative procedure was tested, utilizing SIFT feature
detection and matching (Lowe, 2004). In both cases
(Harris/SIFT), a RANSAC framework is then utilized to
remove spurious correspondences, followed by a Levenberg-
Marquardt post-processing step to further improve the
estimation. Intrinsic camera parameters are estimated a priori
through a simple calibration procedure (Bouguet, 2007).
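To make the matching step concrete, the ratio-based rejection of ambiguous SIFT matches (Lowe, 2004) can be sketched as follows. This is an illustrative NumPy sketch operating on precomputed descriptor arrays, not the system's implementation:

```python
import numpy as np

def ratio_test_matches(des1, des2, ratio=0.75):
    """Lowe's ratio test: keep a candidate match only when the nearest
    descriptor in des2 is markedly closer than the second nearest,
    which suppresses ambiguous (likely spurious) correspondences."""
    # Pairwise Euclidean distances between the two descriptor sets.
    d = np.linalg.norm(des1[:, None, :] - des2[None, :, :], axis=2)
    order = np.argsort(d, axis=1)
    nearest, second = order[:, 0], order[:, 1]
    rows = np.arange(len(des1))
    keep = d[rows, nearest] < ratio * d[rows, second]
    return [(int(i), int(nearest[i])) for i in np.flatnonzero(keep)]
```

The surviving pairs would then be passed to the RANSAC stage mentioned above.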
Besides reducing the unknowns in the following external
calibration and bundle adjustment procedures, intrinsic
calibration is used to compensate for radial distortion. As a
result, the perspective camera model is better approximated and
the system produces more accurate results. The output is an
estimation of the essential matrix E, which is thereafter
decomposed into rotation matrix (R) and translation vector (t)
of the new view. Finally, triangulation is used to estimate the
3D coordinates of the corresponding features.
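The final steps above, decomposing E into (R, t) and triangulating the matched features, can be sketched as follows. This is a minimal NumPy illustration of the standard textbook procedure, not the system's code; the cheirality test that selects among the four (R, t) candidates is only indicated:

```python
import numpy as np

def decompose_essential(E):
    """Return the four (R, t) candidates of an essential matrix; the
    valid one places the triangulated points in front of both cameras
    (cheirality test, omitted here)."""
    U, _, Vt = np.linalg.svd(E)
    if np.linalg.det(U) < 0:
        U = -U
    if np.linalg.det(Vt) < 0:
        Vt = -Vt
    W = np.array([[0., -1., 0.], [1., 0., 0.], [0., 0., 1.]])
    t = U[:, 2]
    return [(U @ W @ Vt, t), (U @ W @ Vt, -t),
            (U @ W.T @ Vt, t), (U @ W.T @ Vt, -t)]

def triangulate(P1, P2, x1, x2):
    """Linear (DLT) triangulation of one feature seen at pixel x1 in
    the view with projection matrix P1 and at x2 in the view with P2."""
    A = np.vstack([x1[0] * P1[2] - P1[0],
                   x1[1] * P1[2] - P1[1],
                   x2[0] * P2[2] - P2[0],
                   x2[1] * P2[2] - P2[1]])
    _, _, Vt = np.linalg.svd(A)
    X = Vt[-1]
    return X[:3] / X[3]
```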
When a sequence of views is available, the above technique is
applied to the first two views. For each new view i, the feature
detection and matching approaches are applied to establish 2-D
correspondences with the previous view i-1. These are then
matched with the already established 3-D points, using a
RANSAC-based technique that yields a robust estimate of the
projection matrix Pi of the new view. We have used an efficient
bundle adjustment procedure (Lourakis and Argyros, 2004) as a
final step at each addition of a new view. The procedure is
illustrated in Figure 1.
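The RANSAC-based estimation of the new view's projection matrix from 2-D/3-D matches can be sketched as below. This is a generic linear-resectioning sketch under simplifications of our own (minimal 6-point samples, a fixed iteration count, no Levenberg-Marquardt refinement), not the authors' implementation:

```python
import numpy as np

def dlt_projection(X, x):
    """Estimate a 3x4 projection matrix from >= 6 correspondences
    between 3D points X (N, 3) and image points x (N, 2) via the
    linear DLT: each point contributes two homogeneous equations."""
    rows = []
    for Xw, (u, v) in zip(X, x):
        Xh = np.append(Xw, 1.0)
        rows.append(np.concatenate([Xh, np.zeros(4), -u * Xh]))
        rows.append(np.concatenate([np.zeros(4), Xh, -v * Xh]))
    _, _, Vt = np.linalg.svd(np.array(rows))
    return Vt[-1].reshape(3, 4)

def ransac_projection(X, x, iters=200, thresh=2.0,
                      rng=np.random.default_rng(0)):
    """RANSAC wrapper: fit P on random minimal subsets and keep the P
    with the most points whose reprojection error is below thresh."""
    best_P, best_inliers = None, 0
    n = len(X)
    Xh = np.c_[X, np.ones(n)]
    for _ in range(iters):
        idx = rng.choice(n, 6, replace=False)
        P = dlt_projection(X[idx], x[idx])
        proj = (P @ Xh.T).T
        uv = proj[:, :2] / proj[:, 2:3]
        err = np.linalg.norm(uv - x, axis=1)
        inliers = np.count_nonzero(err < thresh)
        if inliers > best_inliers:
            best_P, best_inliers = P, inliers
    return best_P, best_inliers
```

In the actual pipeline the winning estimate would be refined by the Levenberg-Marquardt and bundle adjustment steps described above.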
Although several error suppression and outlier removal steps
are included, results show that the accuracy of the whole chain
greatly relies on the success of the feature detection and
matching. Despite the efficiency of the Harris corner detector
and the neighborhood-based constraints utilized in
correspondence establishment, we observed that SIFT yields
better correspondences in terms of number and accuracy. This
is especially important for camera positions with wider
baselines. For our problem, robustness to large disparities or
severe view angle changes is important because the scene is to
be reconstructed from a few snapshots instead of a high-frame
rate video.
A technical issue encountered when high resolution images are
utilized is that the computation of the SIFT features may require
more memory than available. The proposed treatment is to
tessellate the image into blocks, compute the features
independently in each block, and merge the results. To avoid
blocking artifacts, the blocks of this tessellation overlap
adequately. Duplicate features are often encountered, either due
to block overlap or due to the collocation of different SIFT
features detected at different scales; all duplicates are removed
at the merging stage.
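The tessellation-and-merge treatment can be sketched as follows. The block size, overlap, and duplicate-removal tolerance are illustrative values of our own choosing, and the plain (x, y) positions stand in for full SIFT keypoints:

```python
def tile_image(width, height, block, overlap):
    """Yield (x0, y0, x1, y1) tiles so that neighbouring tiles share
    an overlap-pixel band, avoiding features lost at block borders."""
    step = block - overlap
    for y0 in range(0, height, step):
        for x0 in range(0, width, step):
            yield (x0, y0, min(x0 + block, width), min(y0 + block, height))

def merge_features(per_block, tol=0.5):
    """per_block: list of (x0, y0, feats) with feats as (x, y) in tile
    coordinates. Shift features to global coordinates and drop any
    feature collocated (within tol pixels) with one already kept,
    whether it came from an overlapping block or a different scale."""
    kept = []
    for x0, y0, feats in per_block:
        for x, y in feats:
            gx, gy = x + x0, y + y0
            if all((gx - kx) ** 2 + (gy - ky) ** 2 > tol * tol
                   for kx, ky in kept):
                kept.append((gx, gy))
    return kept
```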
Figure 1. Illustration of the camera motion estimation procedure.
3. 3D RECONSTRUCTION
In this section, an approach for 3D scene reconstruction from
high-resolution images is proposed and the associated
computational issues are discussed. In the proposed method, the
space-sweeping approach is slightly modified to employ a
sweeping spherical, instead of planar, backprojection surface
(see Zabulis, Kordelas et al. (2006) for an analytical
formulation).
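The geometric core of the modification can be illustrated as follows: instead of intersecting each pixel's viewing ray with a fronto-parallel plane at depth d, the ray is scaled to a fixed distance r from the optical centre, so the sweeping surface is a sphere. The sketch below is our own single-pixel, two-view illustration with a placeholder similarity score; for the analytical formulation see Zabulis, Kordelas et al. (2006):

```python
import numpy as np

def backproject_sphere(u, v, K, r):
    """Back-project pixel (u, v) of a camera with intrinsics K onto
    the sphere of radius r centred at the optical centre: the viewing
    ray is scaled to length r (plane sweeping would instead fix the
    z-coordinate to a depth d)."""
    ray = np.linalg.solve(K, np.array([u, v, 1.0]))
    return r * ray / np.linalg.norm(ray)

def sweep_pixel(u, v, K1, K2, R, t, ref_val, sample2, radii):
    """Sweep candidate radii for one reference pixel: back-project to
    each sphere, project into a second view (extrinsics R, t), and
    keep the radius whose sampled intensity best matches the reference
    value. A real implementation would aggregate a window-based
    similarity measure over all available views."""
    best_r, best_score = None, float("inf")
    for r in radii:
        X = backproject_sphere(u, v, K1, r)
        p = K2 @ (R @ X + t)
        u2, v2 = p[0] / p[2], p[1] / p[2]
        score = (ref_val - sample2(u2, v2)) ** 2
        if score < best_score:
            best_r, best_score = r, score
    return best_r
```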
The conventional space-sweeping approach is frequently used
for multiview stereo reconstruction, due to its computational
efficiency and its straightforward acceleration by graphics
hardware (Yang et al., 2002; Li et al., 2004). However, it is less