5)
Fua, Pascal
with weight $p_i^{type}$, where type is one of the possible types of observations we use. In this paper, type can be object space
coordinates or silhouette rays. However, other information cues can easily be integrated. The individual weights of the
different types of observations have to be homogenized before estimation according to:
$$\frac{p_i^{k}}{p_j^{l}} = \frac{\sigma_j^2}{\sigma_i^2}, \qquad (6)$$
where $\sigma_i$, $\sigma_j$ are the a priori standard deviations of the observations $obs_i$, $obs_j$ of type $k$, $l$. Applying least-squares
estimation implies the joint minimum
$$\sum_{type=1}^{nt} \left(v^{type}\right)^{T} P^{type}\, v^{type} \Rightarrow \mathrm{Min}, \qquad (7)$$
with nt the number of observation types, which then leads to the well-known normal equations which need to be solved
using standard techniques.
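As an illustration of how the weighted contributions of several observation types add up in the normal equations, the following minimal Python sketch accumulates the normal matrix and right-hand side for a linear model with two parameters. The function names, the toy design rows, and the sample standard deviations are assumptions made for this example and are not part of our system. Each weight is set to the inverse a priori variance of its type, so that the weight ratio between two types equals the inverse ratio of their variances.

```python
def accumulate_normal_equations(groups):
    """Build N = sum A^T P A and b = sum A^T P l over all observation types.

    groups: list of (rows, values, sigma) tuples, one per observation type,
    where rows[i] is the design-matrix row of observation i, values[i] its
    observed value, and sigma the a priori standard deviation of that type.
    """
    n_params = len(groups[0][0][0])
    N = [[0.0] * n_params for _ in range(n_params)]
    b = [0.0] * n_params
    for rows, values, sigma in groups:
        p = 1.0 / sigma ** 2  # weight: inverse a priori variance
        for row, value in zip(rows, values):
            for r in range(n_params):
                b[r] += p * row[r] * value
                for c in range(n_params):
                    N[r][c] += p * row[r] * row[c]
    return N, b


def solve_2x2(N, b):
    """Solve a 2x2 linear system by Cramer's rule (enough for this toy case)."""
    det = N[0][0] * N[1][1] - N[0][1] * N[1][0]
    x0 = (b[0] * N[1][1] - N[0][1] * b[1]) / det
    x1 = (N[0][0] * b[1] - N[1][0] * b[0]) / det
    return [x0, x1]


# Hypothetical data: two observation types constraining y = x0 + x1 * t.
precise_type = ([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]], [1.0, 3.0, 5.0], 0.1)
coarse_type = ([[1.0, 3.0], [1.0, 4.0]], [7.0, 9.0], 1.0)
N, b = accumulate_normal_equations([precise_type, coarse_type])
estimate = solve_2x2(N, b)
```

In a non-linear setting such as ours, the design rows would be the derivatives of the observation equations at the current state, and the system would be rebuilt at every iteration.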
In practice, however, it is very difficult to estimate the standard deviations of Eq. 6. We therefore use the following
heuristic, which has proved to be very effective. To ensure that the minimization proceeds smoothly, we multiply the
weight $p_i^{type}$ of the $n_{type}$ individual observations of a given type by a global coefficient $w_{type}$ computed as follows:
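$$G_{type} = \frac{\sqrt{\sum_{1 \le i \le n_{type}} \left\| \nabla f_i^{type}(S) \right\|^2}}{n_{type}}, \qquad w_{type} = \frac{\lambda_{type}}{G_{type}}, \qquad (8)$$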
where $\lambda_{type}$ is a user-supplied coefficient between 0 and 1 that dictates the relative importance of the various kinds of
observations. This guarantees that, initially at least, the magnitudes of the gradient terms for the various types have the
appropriate relative values.
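The effect of this normalization can be sketched in a few lines of Python. The sketch assumes that the global coefficient is the user-supplied importance divided by a gradient magnitude for the type, taken here as the root of the summed squared gradient norms divided by the number of observations. That exact normalization, the function name, and the sample gradients are assumptions made for illustration.

```python
import math


def global_coefficient(gradients, lam):
    """Return the global coefficient lam / G for one observation type.

    gradients: per-observation gradient vectors with respect to the state
    vector, evaluated at the initial state (hypothetical input).
    lam: user-supplied relative importance between 0 and 1.
    G is assumed to be sqrt(sum of squared gradient norms) / n.
    """
    n = len(gradients)
    squared_norm_sum = sum(g_k * g_k for g in gradients for g_k in g)
    g_type = math.sqrt(squared_norm_sum) / n
    return lam / g_type


# Two hypothetical types whose raw gradients differ by a factor of 100.
grads_points = [[3.0, 4.0], [3.0, 4.0]]
grads_silhouettes = [[300.0, 400.0], [300.0, 400.0]]
w_points = global_coefficient(grads_points, lam=0.5)
w_silhouettes = global_coefficient(grads_silhouettes, lam=0.5)
```

With equal importance values, the type with gradients 100 times larger receives a coefficient 100 times smaller, so that after scaling both types contribute gradient terms of the same magnitude at the start of the minimization.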
Since our overall problem is non-linear, the results are obtained through an iteration process. We use a sparse-matrix
implementation of the Levenberg-Marquardt algorithm (Press et al., 1986) that can handle the large number of parameters
and observations we must deal with.
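To make the iteration concrete, the following is a minimal dense Levenberg-Marquardt sketch in Python for a problem with only two parameters. It is not the sparse-matrix implementation used in our system: the function name, the damping schedule (factor of 10 up or down), and the toy exponential model are assumptions chosen for illustration.

```python
import math


def levenberg_marquardt(residuals, jacobian, x, iterations=100, damping=1e-3):
    """Minimize the sum of squared residuals over a 2-parameter vector x."""
    def cost(params):
        return sum(r * r for r in residuals(params))

    current = cost(x)
    for _ in range(iterations):
        r = residuals(x)
        J = jacobian(x)
        # Gauss-Newton approximation: J^T J and gradient J^T r.
        a00 = sum(row[0] * row[0] for row in J)
        a01 = sum(row[0] * row[1] for row in J)
        a11 = sum(row[1] * row[1] for row in J)
        g0 = sum(row[0] * ri for row, ri in zip(J, r))
        g1 = sum(row[1] * ri for row, ri in zip(J, r))
        # Marquardt damping inflates the diagonal of J^T J.
        d00 = a00 * (1.0 + damping)
        d11 = a11 * (1.0 + damping)
        det = d00 * d11 - a01 * a01
        if det == 0.0:
            break
        # Solve the damped 2x2 system for the step (Cramer's rule).
        step0 = -(g0 * d11 - a01 * g1) / det
        step1 = -(d00 * g1 - a01 * g0) / det
        candidate = [x[0] + step0, x[1] + step1]
        trial = cost(candidate)
        if trial < current:
            x, current = candidate, trial
            damping *= 0.1   # step accepted: move toward Gauss-Newton
        else:
            damping *= 10.0  # step rejected: move toward gradient descent
    return x


# Hypothetical toy problem: recover a = 2, b = 0.5 in y = a * exp(b * t).
times = [0.0, 1.0, 2.0, 3.0]
observed = [2.0 * math.exp(0.5 * t) for t in times]


def residuals(x):
    return [x[0] * math.exp(x[1] * t) - y for t, y in zip(times, observed)]


def jacobian(x):
    return [[math.exp(x[1] * t), x[0] * t * math.exp(x[1] * t)] for t in times]


fitted = levenberg_marquardt(residuals, jacobian, [1.0, 0.1])
```

For problems with many parameters and observations, the damped system would be stored and solved in sparse form instead of by the explicit 2x2 formulas used here.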
4.2.2 Skeleton Fitting and Motion Modeling The images of Figure 11 show somebody walking in front of a horizontally
aligned stereo camera pair. The background and lighting were uncontrolled (standard office head lights) and the
camera pair was about 5 m from the person. The distance between the two cameras was 75 cm. The images are interlaced
and the processed half-frame has an effective resolution of 768 x 288. The disparities result in about 2000 3-D points,
including reconstructed parts of the background. The top row of Figure 11 shows three frames out of 50 from this sequence.
The result from the initial tracking process is depicted in the middle row of the figure. The bottom row shows
the output of the subsequent fitting step. Here, the dimensions of the skeleton and the size of the metaballs have been
adjusted, resulting in slightly more realistic postures.
The sequence of Figure 12 exhibits complex motions of a naked upper body, taken with a camera set up in front of
the subject. Three cameras in an L configuration took interlaced images at 20 frames/sec with an effective resolution of
432 x 288 per half-frame. Our stereo algorithm (Fua, 1993) produced very dense point clouds with about 4000 3-D points
on the surface of the subject, even without textured clothes. To increase the frame rate and, thus, reduce the difference in
posture between frames we used both halves of the interlaced images and adjusted the camera calibration accordingly.
Although this motion involves severe occlusions, the system faithfully tracks the arms and yields both body postures and
an adapted skeleton. The technique of Section 4.1 was used to derive the model's head from one video-sequence.
5 CONCLUSION
We have presented a set of techniques that allow us to fit complex facial and body animation models to potentially noisy
data with minimal manual intervention. Consequently, using either optical motion capture data or video-sequences, these
models can be instantiated robustly and quickly. Although the models were primarily designed for animation rather than
fitting purposes, we have designed a framework that allows us to exploit them to resolve ambiguities in the image data.
International Archives of Photogrammetry and Remote Sensing. Vol. XXXIII, Part B5. Amsterdam 2000. 265