MARKERLESS FULL BODY SHAPE AND MOTION CAPTURE FROM VIDEO SEQUENCES
P. Fua†, A. Gruen‡, N. D'Apuzzo‡, R. Plänkers†*
† VrLab, EPFL, 1015 Lausanne, ‡ IGP, ETH-Hönggerberg, 8093 Zürich
KEY WORDS: Body Modeling, Motion Capture, Stereo, Silhouettes, Least-Squares Matching.
ABSTRACT
We develop a framework for 3-D shape and motion recovery of articulated deformable objects. We propose a formalism
that incorporates the use of implicit surfaces into earlier robotics approaches that were designed to handle articulated
structures. We demonstrate its effectiveness for human body modeling from video sequences. Our method is both robust
and generic. It could easily be applied to other shape and motion recovery problems.
1 INTRODUCTION
Recently, many approaches to tracking and modeling ar-
ticulated 3-D objects have been proposed. They have been
used to capture people's motion in video sequences with
potential applications to animation, surveillance, medicine,
and man-machine interaction. See [Aggarwal and Cai, 1999,
Gavrila, 1999, Moeslund and Granum, 2001] for recent re-
views.
Such systems are promising. However, they typically use
oversimplified models, such as cylinders or ellipsoids at-
tached to articulated skeletons. Such models are too crude
for precise recovery of both shape and motion. In our
work, we have proposed a framework that retains the ar-
ticulated skeleton but replaces the simple geometric primi-
tives by soft objects. Each primitive defines a field function
and the skin is taken to be a level set of the sum of these
fields. This implicit surface formulation has the following
advantages:
• Effective use of stereo and silhouette data: Defining
surfaces implicitly allows us to define a distance
function of data points to models that is both differ-
entiable and computable without search.
• Accurate shape description by a small number of
parameters: Varying a few dimensions yields mod-
els that can match different body shapes and allow
both shape and motion recovery.
• Explicit modeling of 3-D geometry: Geometry can
be taken into account to predict the expected location
of image features and occluded areas, thereby making
the extraction algorithm more robust.
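The implicit formulation above can be illustrated with a minimal sketch. It assumes spherical primitives and an exponential falloff, which are illustrative choices and not necessarily the field function used in this work; the names field, field_gradient, algebraic_distance and LEVEL are hypothetical. The point of the sketch is that the residual of a data point and its gradient are obtained in closed form, without searching for the closest surface point.

```python
import math

# Each hypothetical primitive is (center, radius). Spherical primitives and
# the exponential falloff are assumptions made for illustration only.

def field(point, primitives):
    """Sum of the per-primitive field values at a 3-D point."""
    total = 0.0
    for center, radius in primitives:
        d2 = sum((p - c) ** 2 for p, c in zip(point, center))
        total += math.exp(-2.0 * d2 / radius ** 2)
    return total

def field_gradient(point, primitives):
    """Analytic gradient of the summed field with respect to the point."""
    grad = [0.0, 0.0, 0.0]
    for center, radius in primitives:
        diff = [p - c for p, c in zip(point, center)]
        d2 = sum(x * x for x in diff)
        f = math.exp(-2.0 * d2 / radius ** 2)
        for i in range(3):
            grad[i] += f * (-4.0 * diff[i] / radius ** 2)
    return grad

# Iso-value chosen so that the skin of a lone primitive lies at its radius.
LEVEL = math.exp(-2.0)

def algebraic_distance(point, primitives, level=LEVEL):
    """Residual of a data point: zero on the skin, positive outside, negative inside."""
    return level - field(point, primitives)

if __name__ == "__main__":
    body = [((0.0, 0.0, 0.0), 1.0), ((1.5, 0.0, 0.0), 1.0)]
    sample = (0.7, 0.2, -0.1)
    print(algebraic_distance(sample, body), field_gradient(sample, body))
```

Because the residual is a smooth function of both the data point and the primitive parameters, it can be fed directly to a gradient-based optimizer.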
Our approach, like many others, relies on optimization to
deform the generic model so that it conforms to the data.
This involves computing first- and second-order derivatives
of the distance function of the model to the data points.
This turns out to be prohibitively complex and slow if done
in a brute-force fashion. The main contribution of this
approach is a mathematical formalism that greatly simplifies
these computations and allows a fast and robust
implementation of articulated soft objects. It extends the
traditional robotics approach that was designed to handle
articulated bodies [Craig, 1989] and allows the use of implicit
surfaces. For additional details, we refer the interested
reader to our earlier publications [Plänkers and Fua, 2001,
Plänkers and Fua, 2002].
*This work was supported in part by the Swiss National Science
Foundation.
We have integrated our formalism into a complete frame-
work for tracking and modeling and demonstrate its robust-
ness using video sequences of complex 3-D motions. We
have set up a comprehensive scheme for fitting animation
models to a variety of data types [Fua et al., 1998], including
image silhouettes, key body points, and surface data
generated by stereo or multi-image matching. All these ob-
servations are brought together under a joint least squares
estimation system, from which the body model parameters
are derived.
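One way to write such a joint estimation problem is sketched below. The notation is an assumption introduced here for illustration: \Theta denotes the body model parameters, d_t the residual of an observation x_{t,i} of type t, n_t the number of observations of that type, and w_t a per-type weight.

```latex
\min_{\Theta} \; F(\Theta) \;=\;
  \sum_{t \,\in\, \{\text{stereo},\ \text{silhouette},\ \text{key points}\}}
  w_t \sum_{i=1}^{n_t} d_t\!\left(x_{t,i}, \Theta\right)^{2}
```

Under this reading, each data source contributes its own residual terms, and the weights balance sources that differ in number of observations and in noise level.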
To validate the framework, we focus on using stereo and silhouette data
because they are complementary sources of information,
as illustrated by Figure 1. Stereo works well on both tex-
tured clothes and bare skin for surfaces facing the camera
but fails where the view direction and the surface normal are
close to being orthogonal, which is exactly where silhou-
ettes provide information. To increase the performance of
our system, we have also developed an improved approach
to extracting stereo data using least-squares matching and
tracking methods.
In the remainder of this paper we first introduce our mod-
els. We then discuss our approach to extracting 3-D infor-
mation from the video sequences and, finally, to fitting the
3-D body models to it.
2 ARTICULATED MODEL AND SURFACES
The human body model we use in this work [Thalmann
et al., 1996] is depicted in Figures 1(a,b). It incorporates
a highly effective multi-layered approach for constructing
and animating realistic human bodies. The first layer is a
skeleton that is a connected set of segments, correspond-
ing to limbs and joints. A joint is the intersection of two
segments, which means it is a skeleton point around which
the limb linked to that point may move.