Fua, Pascal
since the two reconstruction methods are independent, places of agreement are very likely to be correct for both. As
discussed in Section 4.1.1, we model the deformation induced by our arbitrary choice of internal camera parameters as an
affine transform. We have therefore computed the affine transform that brings the bundle-adjustment triangulation closest
to the laser-scanner model. In both cases, the median distance between the affine-transformed face models and the laser
output is approximately 1 millimeter which, given the camera geometry, corresponds to a shift in disparity of less than
1/5 of a pixel. The precision of the correlation-based algorithm we use is on the order of half a pixel, outliers excluded (Fua,
1993). We therefore conclude that our motion recovery algorithm performs an effective and robust averaging of the input
data.
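The link between a depth error and a disparity shift can be made explicit with the textbook relation for a rectified stereo pair. This is only an illustrative sketch: the symbols f (focal length in pixels), b (baseline) and Z (distance to the surface) are assumptions introduced here, their values are not given in the text, and the sequences used above are not necessarily rectified.

```latex
d = \frac{f\,b}{Z}
\quad\Longrightarrow\quad
|\Delta d| \approx \frac{f\,b}{Z^{2}}\,|\Delta Z|
```

Under this approximation, a depth discrepancy of 1 millimeter maps to a disparity shift that grows with the baseline and focal length and falls off with the square of the distance.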
4.2 Body Modeling
While the face can be assumed to be relatively rigid as long as the subject does not change expression, the body is
articulated, and any model of it must take this into account. Here, we use two or three video sequences acquired using
synchronized cameras. The body model and the image data are used throughout the fitting process.
Recently, a number of techniques have been proposed (Kakadiaris and Metaxas, 1996, Gavrila and Davis, 1996, Lerasle
et al., 1996, Bregler and Malik, 1998) to track human motions from video sequences. They are fairly effective but use
very simplified models of the human body, such as ellipsoids or cylinders, that do not precisely model the human shape
and would not be sufficient for a truly realistic simulation. By contrast, we use the full body model of Figure 1.
The algorithm goes through four steps that we summarize below. For a more complete description, we refer the interested
reader to our earlier publications (D'Apuzzo et al., 1999, Plänkers et al., 1999).
Data Acquisition: Clouds of 3-D points are derived from the input images using correlation-based stereo (Fua, 1993).
Alternatively, we can use least-squares matching to derive these clouds (D'Apuzzo et al., 2000). Silhouette edges may be
delineated in several key-frames or automatically generated for the whole sequence.
Initialization: We first initialize the model interactively in one frame of the sequence. The user has to enter the
approximate position of some key joints, like shoulders, elbows, hands, hips, knees and feet. Here, it was done by clicking on
these features in two images and triangulating the corresponding points. This initialization gives us a rough shape, i.e. a
scaling of the skeleton, and an approximate posture of the model.
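The triangulation of a joint clicked in two images can be sketched as follows. The text does not say which triangulation method was used; this example uses the midpoint method, and it assumes that each click has already been converted into a 3-D viewing ray (camera center plus direction) using known camera calibration. The function name and the ray representation are hypothetical.

```python
def triangulate_midpoint(c1, d1, c2, d2):
    """Return the point halfway between the closest points of two 3-D rays.

    c1, c2: camera centers; d1, d2: viewing-ray directions (assumed inputs,
    obtained from the clicked pixels and the camera calibration).
    """
    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))

    w = [p - q for p, q in zip(c1, c2)]
    a, b, c = dot(d1, d1), dot(d1, d2), dot(d2, d2)
    d, e = dot(d1, w), dot(d2, w)
    denom = a * c - b * b
    if abs(denom) < 1e-12:
        raise ValueError("viewing rays are (nearly) parallel")
    # Ray parameters of the two mutually closest points.
    s = (b * e - c * d) / denom
    t = (a * e - b * d) / denom
    p1 = [ci + s * di for ci, di in zip(c1, d1)]
    p2 = [ci + t * di for ci, di in zip(c2, d2)]
    # Clicked points rarely yield rays that intersect exactly, so take the midpoint.
    return [(u + v) / 2.0 for u, v in zip(p1, p2)]
```

Repeating this for each clicked joint gives the 3-D joint positions from which a skeleton scaling and a rough posture can be derived.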
Tracking: At a given time step, the tracking process adjusts the model’s joint angles by minimizing an objective function.
This modified posture is saved for the current frame and serves as initialization for the next one. The computing power
of today’s PCs allows for interactivity. If, for some reason, the algorithm loses track, the user simply pauses the program,
adjusts the posture interactively and hands the control back to the algorithm for further processing.
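The frame-to-frame scheme described above, in which each frame's result initializes the next, can be illustrated with a deliberately small toy problem. This is not the authors' implementation: the "model" is a single limb of fixed length pivoting at the origin, the observation in each frame is its 2-D end point, and the objective, the gradient-descent minimizer, the step size and all function names are assumptions made for the sketch.

```python
import math

def objective(theta, target, length=1.0):
    """Squared distance between the predicted limb end point and the observation."""
    px, py = length * math.cos(theta), length * math.sin(theta)
    return (px - target[0]) ** 2 + (py - target[1]) ** 2

def minimize_angle(theta0, target, steps=200, rate=0.2, h=1e-6):
    """Gradient descent on the joint angle, using a central-difference derivative."""
    theta = theta0
    for _ in range(steps):
        grad = (objective(theta + h, target) - objective(theta - h, target)) / (2 * h)
        theta -= rate * grad
    return theta

def track(initial_theta, targets):
    """Process frames in order; each result is the starting point for the next frame."""
    postures = []
    theta = initial_theta
    for target in targets:
        theta = minimize_angle(theta, target)
        postures.append(theta)
    return postures
```

Because the minimizer is local, it only succeeds when the posture changes little between frames, which is why a lost track has to be corrected by hand before the loop can continue.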
Fitting: The results from the tracking step serve as initialization for a fitting step. Its goal is to refine the postures in all
frames and to adjust the skeleton and/or metaball parameters to make the model correspond more closely to the person.
The fitting optimizes over all frames simultaneously, by minimizing the same objective function as before. This allows
us to find a single set of parameters that describe a model that is consistent with the images of the whole sequence. The
results are further improved by introducing inter-frame constraints such as smoothness or limits on velocity/acceleration.
In practice, the model and the constraints it imposes are used to overcome the inherent noisiness of the data. We recover
both motion and body shape from stereo video sequences. The corresponding parameters can be used to recreate realistic
3-D animations.
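The fitting step described above optimizes over all frames at once, with shape parameters shared across frames and an inter-frame smoothness constraint. The sketch below shows that structure on a toy problem and should not be read as the authors' method: a single limb pivots at the origin, its unknown length plays the role of the shared skeleton parameter, there is one angle per frame, and the objective, the plain gradient descent, the parameter names and the default values are all assumptions.

```python
import math

def fit_sequence(targets, init_angles, init_length=1.0,
                 smooth_weight=0.0, rate=0.05, iterations=2000):
    """Jointly refine one shared limb length and one angle per frame."""
    n = len(targets)
    angles = list(init_angles)   # e.g. the postures produced by a tracking pass
    length = init_length
    for _ in range(iterations):
        grad_length = 0.0
        grad_angles = [0.0] * n
        for t, (x, y) in enumerate(targets):
            c, s = math.cos(angles[t]), math.sin(angles[t])
            # Data term: squared distance between predicted and observed end point.
            grad_length += 2.0 * (length - (x * c + y * s))
            grad_angles[t] += 2.0 * length * (x * s - y * c)
        for t in range(n - 1):
            # Smoothness term: smooth_weight * (angles[t+1] - angles[t]) ** 2.
            diff = angles[t + 1] - angles[t]
            grad_angles[t] -= 2.0 * smooth_weight * diff
            grad_angles[t + 1] += 2.0 * smooth_weight * diff
        # Gradient descent; the shared-length gradient is averaged over frames
        # so that the step size does not depend on the sequence length.
        length -= rate * grad_length / n
        angles = [a - rate * g for a, g in zip(angles, grad_angles)]
    return length, angles
```

With smooth_weight set to zero the frames are coupled only through the shared length; a positive value trades fidelity to the data for smaller frame-to-frame changes in the angles.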
4.2.1 Least Squares Framework Our system must deal with heterogeneous sources of information—3-D data and
2-D outlines—whose contributions may not be commensurate. To this end, we have developed the following framework.
In standard least-squares fashion, we use the image data to write $n_{obs}$ observation equations of the form
$F_i(S) = obs_i - \epsilon_i , \quad 1 \le i \le n_{obs} ,$ (3)
where S is the state vector that defines the shape and position of the body model and $\epsilon_i$ is the deviation from the model.
We will then minimize
$v^T P v \Rightarrow \mathrm{Min} ,$ (4)
where v is the vector of residuals and P is a weight matrix associated with the observations. P is usually introduced as
diagonal.
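As a minimal illustration of Equations 3 and 4 with a diagonal P, the sketch below solves the weighted normal equations for the simplest possible case, a linear model with two parameters. The model F_i(S) = S[0] + S[1] * x_i, the function names and the data layout are assumptions chosen for illustration; the body model of the text is nonlinear and would require an iterative solver instead of this closed form.

```python
def weighted_least_squares_line(xs, obs, weights):
    """Minimize sum_i w_i * (obs_i - (a + b * x_i)) ** 2, i.e. v^T P v with P = diag(w)."""
    # Entries of the normal equations (A^T P A) S = A^T P obs for A = [1, x].
    s_w = sum(weights)
    s_wx = sum(w * x for w, x in zip(weights, xs))
    s_wxx = sum(w * x * x for w, x in zip(weights, xs))
    s_wy = sum(w * y for w, y in zip(weights, obs))
    s_wxy = sum(w * x * y for w, x, y in zip(weights, xs, obs))
    det = s_w * s_wxx - s_wx * s_wx
    if det == 0:
        raise ValueError("degenerate configuration: cannot solve for two parameters")
    a = (s_wxx * s_wy - s_wx * s_wxy) / det
    b = (s_w * s_wxy - s_wx * s_wy) / det
    return a, b

def weighted_cost(xs, obs, weights, state):
    """Value of v^T P v for a given state (a, b)."""
    a, b = state
    return sum(w * (y - (a + b * x)) ** 2 for w, x, y in zip(weights, xs, obs))
```

Setting a weight to zero removes the corresponding observation from the fit, which is how a diagonal P can balance or discount individual observations.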
Our system must be able to deal with observations coming from different sources that may not be commensurate with
each other. Formally, we can rewrite the observation equations of Equation 3 as
$F_i^{type}(S) = obs_i^{type} - \epsilon_i , \quad 1 \le i \le n_{obs} ,$ (5)
264 International Archives of Photogrammetry and Remote Sensing. Vol. XXXIII, Part B5. Amsterdam 2000.