Virtual navigation in remote environments can be achieved by building an image-based model made of multiple panoramas gathered
from cameras moving around the scene. In such models, knowledge of the 3D structure of the scene can be useful. In
this paper, we propose a method that constructs a sparse but rich 3D representation of a scene from a set of calibrated panoramic
images. The proposed method is a heuristic search algorithm that finds 3D points corresponding
to the surfaces of objects in the scene. The algorithm constructs this set of 3D points by searching for matching edge pixels in pairs of
images using the epipolar constraint. Empirical results show that the proposed method performs well at locating 3D points of interest
in different scenes.
A goal of tele-presence applications is to allow someone to
visually experience a remote environment such that they can freely
navigate through the environment with the impression of “being
there”. One way to reach this goal is to create an image-based
model of the scene composed of a multitude of panoramas
captured in the scene of interest. Starting from a user-selected
geo-referenced panorama, virtual navigation is then achieved by
allowing the user to move from one panorama to a neighboring
one, thus simulating motion along some path in the scene. Under
such a framework, knowledge of the 3D structure of the scene is
not a necessary requirement; however, extracting 3D information
from the scene can be beneficial in many ways: i) it allows the
panoramic images to be registered more accurately with respect
to one another and with respect to maps or other representations
of the scene; ii) the image-based model can then be augmented
with virtual objects or virtual annotations that can be coherently
displayed on the different panoramic images; iii) 3D measurements
in the scene can be made and infeasible motions can be invalidated
(e.g. going through an obstacle); iv) it facilitates the generation of
photo-realistic virtual views from a finite set of images, in order
to simulate smooth motion while navigating through the scene.
The purpose of this work is, given a sparse set of calibrated panoramic
images, to obtain a rich set of 3D points that correspond to the
surfaces of the objects in a scene. Toward this goal, we have
developed a search method that looks for matches using features
that appear more frequently in each image than the features used
during the calibration procedure. Our method uses a multi-start
search methodology, which is a variation of the method proposed
in (Louchet, 1999).
The rest of this paper is organized as follows: Section 2 gives
a brief description of methods that have been developed to
estimate the 3D structure of a scene; our proposed heuristic search
algorithm is presented in Section 3; the results of testing our
proposed algorithm on sets of real calibrated images can be found in
Section 4; and finally, our conclusions are given in Section 5.
STRUCTURE FROM MOTION
The purpose of structure from motion algorithms is to estimate
the position and orientation of each image in a set of images, and
to estimate the 3D structure of the scene.
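To illustrate the structure half of this problem: once the position and orientation of two images are known, a 3D point can be recovered from a pair of matching viewing rays. The following Python sketch uses midpoint triangulation, a standard textbook technique and not necessarily the one used by the methods discussed in this section. The function name triangulate_midpoint and the example coordinates are hypothetical, and both rays are assumed to be expressed in a common world frame.

```python
import math

def normalize(v):
    """Scale a 3-vector to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    return tuple(x / n for x in v)

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def triangulate_midpoint(c1, d1, c2, d2, eps=1e-12):
    """Midpoint of the shortest segment joining two viewing rays.

    c1, c2: camera centres; d1, d2: ray directions (world frame).
    Returns None when the rays are (nearly) parallel.
    """
    w = tuple(p - q for p, q in zip(c1, c2))
    a, b, c = dot(d1, d1), dot(d1, d2), dot(d2, d2)
    d, e = dot(d1, w), dot(d2, w)
    denom = a * c - b * b
    if abs(denom) < eps:
        return None
    s = (b * e - c * d) / denom  # parameter along ray 1
    u = (a * e - b * d) / denom  # parameter along ray 2
    p1 = tuple(ci + s * di for ci, di in zip(c1, d1))
    p2 = tuple(ci + u * di for ci, di in zip(c2, d2))
    return tuple((x + y) / 2.0 for x, y in zip(p1, p2))

# Hypothetical example: two camera centres observing the same point.
point = (0.5, 0.2, 4.0)
c1, c2 = (0.0, 0.0, 0.0), (1.0, 0.0, 0.0)
d1 = normalize(tuple(p - c for p, c in zip(point, c1)))
d2 = normalize(tuple(p - c for p, c in zip(point, c2)))
estimate = triangulate_midpoint(c1, d1, c2, d2)
```

With noise-free rays the estimate coincides with the original point up to rounding. The sketch does not check that the recovered point lies in front of both cameras, which a practical implementation would need to do.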
Recent work has been done by Snavely et al. (Snavely et al.,
2008) on calibrating images of a scene taken from different
viewpoints, and in turn estimating the 3D structure of the scene.
Camera calibration is carried out by (1) finding
correspondences between pixels among subsets of the images using
the scale invariant feature transform (SIFT) (Lowe, 2004), (2)
estimating the camera parameters (internal and external) using the
epipolar constraint and the RANSAC algorithm, and then (3)
using bundle adjustment to optimize these parameters, minimizing
the reprojection error over all correspondences. The
correspondences constitute a sparse description of the 3D structure of the
scene. Goesele et al. (Goesele et al., 2007) then proceed to
estimate the complete 3D structure of the scene from these sparse
3D points. Both of these methods were tested using large, densely
spaced sets of non-panoramic images.
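The epipolar constraint used in step (2) can be made concrete for calibrated images. The Python sketch below assumes that each pixel has already been converted to a unit viewing ray and that the two camera frames are related by X2 = R X1 + t. Under that convention the essential matrix is E = [t]x R, and a correct correspondence satisfies ray2^T E ray1 = 0. The helper names and the example pose are hypothetical; a robust estimator such as RANSAC could use a residual of this kind to separate inliers from outliers.

```python
import math

def normalize(v):
    """Scale a 3-vector to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    return tuple(x / n for x in v)

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def mat_vec(m, v):
    return tuple(dot(row, v) for row in m)

def mat_mul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def skew(t):
    """Cross-product matrix [t]x, so that skew(t) applied to v equals t x v."""
    tx, ty, tz = t
    return [[0.0, -tz, ty], [tz, 0.0, -tx], [-ty, tx, 0.0]]

def essential_matrix(R, t):
    """E = [t]x R for the convention X2 = R X1 + t (an assumption here)."""
    return mat_mul(skew(t), R)

def epipolar_residual(E, ray1, ray2):
    """|ray2^T E ray1|; zero for a perfect correspondence."""
    return abs(dot(ray2, mat_vec(E, ray1)))

# Hypothetical pose: no rotation, unit translation along x.
R = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
t = (1.0, 0.0, 0.0)
E = essential_matrix(R, t)

X1 = (0.5, 0.2, 4.0)                                  # point in frame 1
X2 = tuple(a + b for a, b in zip(mat_vec(R, X1), t))  # same point in frame 2

good = epipolar_residual(E, normalize(X1), normalize(X2))
bad = epipolar_residual(E, normalize(X1), normalize((0.5, 1.0, 4.0)))
```

Here good is zero up to rounding, while bad, computed for a ray that does not point at the same 3D point, is clearly non-zero. Working with unit rays rather than pixel coordinates keeps the same test applicable to panoramic images, where a single image plane does not cover the full field of view.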
An alternative to SIFT, called speeded up robust features (SURF),
is proposed by Bay et al. (Bay et al., 2008). SURF is reported to be
faster to compute and more accurate than SIFT.
Although this calibration method is very effective at estimating
the camera parameters, it may result in a set of 3D points that are
too sparse to adequately describe the 3D structure of the scene.
Figure 1 shows an example of how SURF-detected
correspondences may not adequately cover the scene.
Pollefeys et al. (Pollefeys et al., 2008) and Cornelis et al.
(Cornelis et al., 2008) designed systems that perform 3D
reconstruction of urban environments from video sequences. Camera pose
estimation is carried out using camera calibration techniques that
are similar to the technique summarized above. In order to
perform faster and more accurately in urban environments, both
systems use simplifying geometric assumptions about the scene to model
the objects, such as roads and buildings. The system designed by