2. Computer Vision background
Our scheme for recognizing and tracking mobile 3D scenes from
a sequence of views comprises several levels of analysis:
• Signal treatment: design and computer implementation
of filters to detect features and basic primitives
in binarized images associated with sampled images
extracted from a video sequence.
• Extraction of basic primitives: grouping of
minisegments located at boundaries, linearization of
auxiliary straight lines and auxiliary vanishing points.
• Generation of local structural elements: identification
and grouping of regions.
• Global matching of objects and scenes: identification
of adjacency relations, generation and tracking of
apparent
This scheme follows an increasing information complexity, from
the pixel or "infinitesimal" local level, to the local level, and
finally to the global level. We have focused on some relations
between the local and global analysis levels for each view.
There, we have implemented algorithms for compatibility criteria
in the local analysis, and algorithms for coherence evaluation
criteria in the global analysis. Next, we store and compare local
and global information for triplets of images. Compatibility
criteria proceed in an ascending (bottom-up) sense, opposite to
the descending (top-down) coherence criteria, which evaluate and
validate the acquired information in order to decide whether to
integrate it in a knowledge scheme. Robustness is reinforced by
modifying the weight of data that have already been labelled as
locally compatible and globally coherent in previous
evaluations.
At each level, we have local and global aspects. The feedback
between them is again carried out by verifying local and
global constraints associated with spatio-temporal propagation
models. Spatial propagation models are used to analyze
the local compatibility between data. They can be represented as
coordinate changes induced by the projection of rigid
transformations. Temporal propagation models involve
locally symmetric dynamical systems defined on such a
support. If we select piecewise linear models, then the geometry
of line configurations, with the corresponding induced action,
provides a good candidate for temporal propagation models.
Since we have a robust geometric support for the scene given by a
perspective model, we have chosen a kinematic propagation
model. So, in our work, propagation is performed along
comparable elements contained in the same image for the static
case, and along putative homologous elements in triplets of
images of a video sequence for the mobile case. The validation
of propagation models also includes a classical model of error
propagation and unbiased estimators for a grouping based on
trapezoids, extending previous constructions ([Coe92]). The
search for correspondences is performed between trapezoidal
regions, instead of common features, which are more difficult
to identify and track. From a robust pair of
corresponding trapezoids in successive images of a video
sequence, we apply a propagation algorithm for the prediction
and tracking of egomotion.
Extraction and grouping of basic primitives is performed in a
standard way. We have applied Canny's filter ([Can86]) to
obtain minisegments for the frames extracted from a digitized
video sequence (two per second). Canny's filter is easy to
implement, allows us to discriminate curved contours, and
gives a single response per edge. Minisegments are grouped
along lines by applying Shin's algorithm ([SGB98]). In this
way we attenuate the striped effect which appears in the
boundaries and minisegments obtained by applying Canny's
detector. Furthermore, we eliminate short segments and apply
an active discrimination for vertical lines, owing to the
characteristics of the scene and bad illumination (reflectances,
irradiance of the ground and floor, etc.).
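The last step of this paragraph (removal of short segments and a separate treatment of vertical lines) can be illustrated by the following minimal sketch. It is not the implementation used in this work: the function names, the length threshold `min_length`, and the angular tolerance `vertical_tol_deg` are hypothetical, and "active discrimination" is read here simply as separating nearly vertical segments from the rest.

```python
import math

def segment_length(seg):
    """Euclidean length of a segment given as ((x1, y1), (x2, y2))."""
    (x1, y1), (x2, y2) = seg
    return math.hypot(x2 - x1, y2 - y1)

def angle_from_vertical(seg):
    """Angle in degrees between the segment and the image vertical (0 = vertical)."""
    (x1, y1), (x2, y2) = seg
    return math.degrees(math.atan2(abs(x2 - x1), abs(y2 - y1)))

def filter_segments(segments, min_length=20.0, vertical_tol_deg=5.0):
    """Drop short segments, then split the rest into vertical and other.

    Both thresholds are illustrative assumptions, not values from the paper.
    """
    vertical, other = [], []
    for seg in segments:
        if segment_length(seg) < min_length:
            continue  # short segment: discarded
        if angle_from_vertical(seg) <= vertical_tol_deg:
            vertical.append(seg)
        else:
            other.append(seg)
    return vertical, other

segments = [
    ((10, 10), (11, 90)),   # long, nearly vertical
    ((0, 0), (3, 4)),       # short (length 5), discarded
    ((0, 50), (100, 60)),   # long, nearly horizontal
]
vertical, other = filter_segments(segments)
```

In practice the thresholds would have to be tuned to the image resolution and to the scene.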
Perspective models are robust, provide an initialization for the
system, and allow image-to-image matching from easily
identifiable perspective elements (vanishing points, projection
lines, etc.). We adopt a paraperspective model for a coarse
planar representation of the scene, because it is easier to
maintain and update owing to its larger tolerance of small errors
arising from noisy data. In indoor scenes, due to illumination
problems (irradiance, shining, etc.), the true corners obtained
from Canny's detector are not well defined. Thus, we generate
perspective elements (projection lines and vanishing points) by
using regression methods around the apparent end points of
large central vertical segments. A weighted average of the
intersections of pairs of projection lines determines a
vanishing point. Next, the projection lines are retraced to
recover a simplified version of the scene which is easier to work
with. In this way, we simplify the matching procedures by
avoiding a profusion of partial matches which could give us a
wrong representation of the paraperspective model for the 3D
scene.
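The estimation of a vanishing point as a weighted average of pairwise intersections of projection lines can be sketched as follows. This is only an illustration under stated assumptions: lines are represented in homogeneous coordinates, the weight of each intersection is taken to be the product of the weights of the two lines (the text does not specify the weighting), and all function names are hypothetical.

```python
def cross(a, b):
    """Cross product of two 3-vectors."""
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def line_through(p, q):
    """Homogeneous line through two image points p and q."""
    return cross((p[0], p[1], 1.0), (q[0], q[1], 1.0))

def intersection(l1, l2, eps=1e-9):
    """Intersection of two homogeneous lines, or None if they are (near) parallel."""
    x, y, w = cross(l1, l2)
    if abs(w) < eps:
        return None
    return (x / w, y / w)

def vanishing_point(lines, weights):
    """Weighted average of all pairwise intersections.

    Assumption: the weight of a pair is the product of the line weights
    (for instance, the lengths of the supporting segments).
    """
    sx = sy = sw = 0.0
    for i in range(len(lines)):
        for j in range(i + 1, len(lines)):
            p = intersection(lines[i], lines[j])
            if p is None:
                continue
            w = weights[i] * weights[j]
            sx += w * p[0]
            sy += w * p[1]
            sw += w
    if sw == 0.0:
        return None
    return (sx / sw, sy / sw)

# Three projection lines that all pass through the image point (100, 50).
lines = [
    line_through((0, 0), (100, 50)),
    line_through((0, 100), (100, 50)),
    line_through((0, 50), (100, 50)),
]
weights = [1.0, 2.0, 1.0]
vp = vanishing_point(lines, weights)
```

With noisy lines the pairwise intersections no longer coincide, and the weights control how much each pair contributes to the estimate.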
The general framework for putting homologous points into
correspondence along a monocular video sequence is based
on epipolar geometry ([Fau93]). Reconstruction based on
3D points is feasible after identifying homologous points in
pairs of images, in terms of lifted triangulations linked to the
images. Large segments are reconstructed by using collinearity
constraints and the Hough transform for partially occluded
segments. The epipolar constraints allow us to decouple the
estimation of 3D motion from the estimation of the structure of
the 3D scene. The global coherence of motion is based on the
identification and patching of homologous triangles along the
sequence of views. Next, homologous triangles are lifted to
prisms, and their intersection determines 3D triangles which
provide the basic pieces for the 3D reconstruction.
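As an illustration of the epipolar constraint mentioned above, the following sketch checks it for a synthetic pair of views. It assumes calibrated cameras and normalized image coordinates, so the constraint is written with the essential matrix E = [t]x R; the rotation, translation, and 3D point are arbitrary example values, and the helper names are hypothetical.

```python
import math

def skew(t):
    """Skew-symmetric matrix [t]x such that [t]x v = t x v."""
    tx, ty, tz = t
    return [[0.0, -tz, ty],
            [tz, 0.0, -tx],
            [-ty, tx, 0.0]]

def matmul(a, b):
    """Product of two 3x3 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def matvec(m, v):
    """Product of a 3x3 matrix and a 3-vector."""
    return [sum(m[i][k] * v[k] for k in range(3)) for i in range(3)]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def essential_matrix(R, t):
    """E = [t]x R, for a second camera related to the first by X2 = R X1 + t."""
    return matmul(skew(t), R)

def epipolar_residual(E, x1, x2):
    """Value of x2^T E x1; zero for an exact pair of homologous points."""
    return dot(x2, matvec(E, x1))

# Example motion: rotation of 0.1 rad about the y axis plus a translation.
theta = 0.1
c, s = math.cos(theta), math.sin(theta)
R = [[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]]
t = [1.0, 0.0, 0.2]
E = essential_matrix(R, t)

# A 3D point seen in both views, projected to normalized coordinates.
X1 = [0.5, -0.3, 4.0]
X2 = [a + b for a, b in zip(matvec(R, X1), t)]
x1 = [X1[0] / X1[2], X1[1] / X1[2], 1.0]
x2 = [X2[0] / X2[2], X2[1] / X2[2], 1.0]

good = epipolar_residual(E, x1, x2)                         # true match
bad = epipolar_residual(E, x1, [x2[0], x2[1] + 0.1, 1.0])   # perturbed match
```

The residual vanishes (up to rounding) for the true pair and not for the perturbed one, which is what makes the constraint usable as a test on putative correspondences independently of the scene structure.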
Triangulations are a standard tool in software
packages, with applications to grouping ([Ber97]).
Triangulations are useful for 3D reconstruction because
homologous triangles determine the local homographies
associated with a projective transformation linking two views.
Updating a triangulation is easy in terms of insertion/deletion
algorithms for points ([Fau93], [Har00]). Hence, elementary
events in triangular data structures are linked to the
appearance or disappearance of points. Any triangulation T has
a simple combinatorial structure which can be translated to a
graph G_T. Nodes of G_T represent simple triangular regions,
and its edges represent adjacency relations between triangles.
Furthermore, each point inserted inside a triangle splits the
original triangle into three new triangles with the
corresponding adjacency relations. Often, the localization of
such points is corrupted by noise; in other words, they are not
easy to identify, or they are even eliminated during image
preprocessing.
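The combinatorial structure described above (triangles as nodes, shared edges as adjacency relations, and the splitting of a triangle into three when a point is inserted) can be sketched as follows. This is a purely combinatorial illustration with hypothetical names: triangles are triples of vertex indices, and the sketch neither checks that the new point lies inside the chosen triangle nor restores any optimality property of the triangulation.

```python
def shared_edge(t1, t2):
    """True if two triangles (triples of vertex indices) share an edge."""
    return len(set(t1) & set(t2)) == 2

def adjacency_graph(triangles):
    """Graph whose nodes are triangle indices and whose edges are adjacencies."""
    graph = {i: set() for i in range(len(triangles))}
    for i in range(len(triangles)):
        for j in range(i + 1, len(triangles)):
            if shared_edge(triangles[i], triangles[j]):
                graph[i].add(j)
                graph[j].add(i)
    return graph

def insert_point(triangles, tri_index, new_vertex):
    """Replace one triangle by the three triangles obtained from an inner point.

    Assumption: the caller guarantees that new_vertex lies inside the triangle.
    """
    a, b, c = triangles[tri_index]
    children = [(a, b, new_vertex), (b, c, new_vertex), (c, a, new_vertex)]
    return triangles[:tri_index] + triangles[tri_index + 1:] + children

# Two triangles sharing the edge (1, 2).
triangles = [(0, 1, 2), (1, 3, 2)]
before = adjacency_graph(triangles)

# Insert vertex 4 inside the first triangle.
refined = insert_point(triangles, 0, 4)
after = adjacency_graph(refined)
```

Rebuilding the whole graph after each insertion is quadratic in the number of triangles; it is used here only for clarity, since an insertion changes the adjacency relations only locally.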
In real indoor scenes, due to illumination and partial occlusion
problems, large 2D segments are better determined in the views
than true end points. Instead of using points as in the
approaches described above, we identify and track segments along
some vertical directions and projection lines. The segments are