Fusion of visual data through dynamic stereo-motion cooperation
Nassir Navab and Zhengyou Zhang
INRIA Sophia Antipolis
2004 Route des Lucioles
06565 Valbonne Cedex
FRANCE
Abstract
Integrating information from sequences of stereo images can lead to robust visual data fusion. Instead of considering stereo and temporal matching as two independent processes, we propose a unified scheme in which each dynamically borrows information from the other. Using an iterative approach and statistical error analysis, different observations are appropriately combined to estimate the motion of the stereo rig and build a dynamic 3D model of the environment. We also show how motion estimation and temporal matching can be used to add new stereo matches. The algorithm is demonstrated on real images. Implemented on a mobile robot, it shows how fusion of visual data can be useful for an autonomous vehicle working in an unknown environment.
1 Introduction
In stereo and motion analysis, most previous work has been conducted using either two or three static cameras [27] or a sequence of monocular images obtained by a moving camera [4]. Several researchers have tried to combine these two processes to obtain faster and more robust algorithms [7, 23, 18, 16, 21, 17, 22, 19, 2, 11].
We believe in the efficiency of stereo-motion cooperation; this paper is a further attempt to develop this idea.
To extract 3D information from real images, "meaningful" extracted features, such as corner points, edges, and regions, are often used to reduce the computational cost and matching ambiguities. In this paper, we use the line segments obtained by an edge detector. Line segments are present in most real-world scenes, such as highways, car traffic tunnels, long indoor hallways, or industrial assembly.
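As a concrete illustration (not part of the paper's original text), a 2D segment token might be represented as follows. This is a minimal sketch in Python; the class and attribute names are our own assumptions, not the paper's implementation:

```python
import numpy as np

# A minimal sketch of a 2D line segment token as produced by an edge
# detector plus polygonal approximation (hypothetical representation).
class Segment2D:
    def __init__(self, p1, p2):
        self.p1 = np.asarray(p1, dtype=float)  # first endpoint (x, y)
        self.p2 = np.asarray(p2, dtype=float)  # second endpoint (x, y)

    @property
    def midpoint(self):
        return 0.5 * (self.p1 + self.p2)

    @property
    def length(self):
        return float(np.linalg.norm(self.p2 - self.p1))

    @property
    def direction(self):
        d = self.p2 - self.p1
        return d / np.linalg.norm(d)  # unit direction of the segment
```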
In [7], [19] and [9], we tried to make two existing algorithms cooperate: a hypothesis-verification-based stereo matching algorithm [3] and a monocular line tracking algorithm [5]. We soon realized that each of these processes could work faster and more robustly if it could dynamically borrow some information from the other, and that motion estimation could play an important intermediary role between the two. If we want a tighter cooperation between stereo and motion, we must not consider them as two separate processes with only occasional interactions.
We present a unified iterative algorithm for both the temporal tracking of stereo pairs of segments and the estimation of the camera system's ego-motion, which consequently allows us to keep track of our 3D reconstructions. The algorithm is based on a dynamic interaction between different sources of information.
Figure 3 shows the general scheme of the algorithm. This scheme is adapted from that of Droid [14, 10]. The basic difference is that we use straight line features as tokens, whereas Droid makes use of point features, and once the camera system's ego-motion is estimated, we use that information for tracking 2D lines in each camera.
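To make the scheme concrete, here is a minimal sketch of one iteration, with the stage functions supplied by the caller; all names (extract, predict, match, estimate, update, add_stereo) are hypothetical placeholders, not the paper's implementation:

```python
# One iteration of the Figure 3 scheme, written as a skeleton whose
# stages are injected as callables (hypothetical names and interfaces).
def process_stereo_frame(left_img, right_img, model, motion,
                         extract, predict, match, estimate, update,
                         add_stereo):
    # 1. Extract 2D line segments in both images.
    left_segs, right_segs = extract(left_img), extract(right_img)
    # 2. Predict the tracked segments' 2D positions from the current
    #    ego-motion estimate and match them against new observations.
    matches = match(predict(model, motion), left_segs, right_segs)
    # 3. Re-estimate the rig's ego-motion and its covariance from the
    #    matched line tokens.
    motion, motion_cov = estimate(matches)
    # 4. Update the dynamic 3D model, then hypothesize new stereo
    #    matches among the still-unmatched segments.
    model = update(model, matches, motion, motion_cov)
    model = add_stereo(model, left_segs, right_segs, motion)
    return model, motion
```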
In Section 4, we describe how to use straight line tokens to estimate the camera system's ego-motion and its associated covariance matrix. The algorithm is decomposed into three different steps, as shown in Figure 3. In Section 6, we describe these three steps. Finally, Section 7 briefly presents the results of the different steps of our algorithm on real images obtained by the INRIA mobile robot.
2 Preliminaries
Vectors are represented in bold face, e.g. $\mathbf{x}$. Transposition of vectors and matrices is indicated by $^T$, e.g. $\mathbf{x}^T$. $\dot{\mathbf{x}}$ denotes the time derivative of $\mathbf{x}$, i.e. $\dot{\mathbf{x}} = d\mathbf{x}/dt$. 3D points are represented by vectors $\mathbf{P} = (X, Y, Z)^T$. For a given three-dimensional vector $\mathbf{x}$, we also use $\tilde{X}$ to represent the $3 \times 3$ antisymmetric matrix such that $\tilde{X}\mathbf{y} = \mathbf{x} \wedge \mathbf{y}$ for all vectors $\mathbf{y}$. $I_n$ represents the $n \times n$ identity matrix.
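The antisymmetric matrix is the usual cross-product matrix; as an illustration only, a short runnable check in Python (numpy):

```python
import numpy as np

# Skew-symmetric matrix associated with a 3-vector x, such that
# skew(x) @ y == x ^ y (the cross product) for every vector y.
def skew(x):
    x1, x2, x3 = x
    return np.array([[0.0, -x3,  x2],
                     [ x3, 0.0, -x1],
                     [-x2,  x1, 0.0]])

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 5.0, 6.0])
assert np.allclose(skew(x) @ y, np.cross(x, y))
```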
Fig. 1. Pinhole model of a camera
We model our camera with the standard pinhole model of Figure 1 and assume that everything is referred to the camera's standard coordinate frame $(o, x, y, z)$. We know from work on calibration [24, 6] that it is always possible, to a very good approximation, to go from the real pixel values to the standardized values $x$ and $y$. When using a pair of calibrated stereo cameras, everything is written in one of the cameras' coordinate systems.
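For instance, with the conventional intrinsic parameters (focal lengths in pixels $\alpha_u, \alpha_v$ and principal point $(u_0, v_0)$; these names are the usual convention, not taken from [24, 6]), the conversion from pixel to standardized coordinates is a simple affine change of units:

```python
# Pixel coordinates (u, v) -> standardized pinhole coordinates (x, y),
# assuming conventional intrinsic parameters from calibration.
def pixel_to_standard(u, v, alpha_u, alpha_v, u0, v0):
    x = (u - u0) / alpha_u
    y = (v - v0) / alpha_v
    return x, y

# Example: principal point at the center of a 512 x 512 image,
# focal length of about 1000 pixels.
print(pixel_to_standard(300.0, 200.0, 1000.0, 1000.0, 256.0, 256.0))
# -> (0.044, -0.056)
```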
3 The Pluckerian line representation
Different line representations in $\mathbb{R}^2$ and $\mathbb{R}^3$ have been used in computer vision work. Though the theoretical results may be equivalent, a given representation can be more or less suitable for implementation. Here, we use the Pluckerian