RANGE IMAGE SEQUENCE ANALYSIS BY 2.5-D LEAST SQUARES TRACKING WITH
VARIANCE COMPONENT ESTIMATION AND ROBUST VARIANCE COVARIANCE
MATRIX ESTIMATION
Patrick Westfeld a,* and René Hempel b
Technische Universität Dresden, D-01062 Dresden, Germany
a Institute of Photogrammetry and Remote Sensing (IPF), patrick.westfeld@tu-dresden.de, http://www.tu-dresden.de/ipf/photo
b Faculty of Education, rene.hempel@tu-dresden.de, http://tu-dresden.de/die_tu_dresden/fakultaeten/erzw/erzwiae/ewwm
KEY WORDS: Range Imaging, Least Squares Tracking, Variance Component Estimation, Robust Variance Covariance Matrix
ABSTRACT:
In this article, a range image sequence tracking approach is proposed, which combines 3-D camera intensity and range observations
in an integrated geometric transformation model. Based on 2-D least squares matching, a closed solution for intensity and range
observations has been developed. By combining complementary information, an increase in accuracy and reliability can be achieved.
The weighting of the two different types of observations with a-priori unknown quality is performed by variance component estimation.
To fulfill the requirements of robust variance covariance matrix estimation in statistical context, alternative approaches for variance
covariance matrix calculation are proposed and evaluated. To verify its applicability, reliability and accuracy potential, the introduced
2.5-D least squares tracking technique has been evaluated by several series of experiments in the field of human motion and interaction
measurement.
1 INTRODUCTION AND MOTIVATION
Conventional stereo-photogrammetric procedures generate, depending on the sensors used, object space maps with high spatio-temporal resolution. The main drawbacks are the recording configuration of at least two cameras, synchronized and oriented to each other, and the data processing, which is highly complex due to spatial and temporal feature matching.
Range imaging (RIM) cameras (3-D cameras) based on photonic mixer devices (PMD; Schwarte, 1997) or comparable principles offer an interesting monocular alternative for photogrammetric 3-D data acquisition. The use of modulation techniques and combined CCD/CMOS technology provides simultaneous gray value and distance measurements of the scene in each pixel of the sensor. With frame rates up to 50 Hz, 3-D cameras are well suited for motion capture in fields such as human or robot (inter-)action analysis.
Several approaches to (semi-)automatic RIM sequence analysis have been shown: Göktürk and Tomasi (2004) introduced a RIM head-tracking algorithm. In a training stage, a depth signature (representative signature for head location) is calculated by interactively identifying the subjects' heads in each frame. In a tracking stage, the depth signature of each frame is compared against the training signatures. The best match can be identified by a correlation metric and represents the location of the object of interest. Kahlmann and Ingensand (2006) described the usability of the RIM camera SwissRanger SR-3000 for surveillance systems. Moving persons within an indoor scene could be detected by RIM thresholding and pixel clustering. Gesture recognition based on motion detection by double difference range images and 3-D shape matching with 3-D shape contexts has been presented by Holte and Moeslund (2007). Breuer et al. (2007) recognized hand movements (location and orientation) by principal component analysis (PCA) applied to RIM data. In the further course of analysis, they fitted an articulated model to reconstruct the hands. The centroid of a cluster represents the person's position for the corresponding frame. The implementation of the CONDENSATION algorithm (conditional density propagation), first introduced by Isard and Blake (1998) and extended for tracking multiple objects in RIM sequences by Koller-Meier (2000), into a RIM tracking process is described in Kahlmann et al. (2007).
The RIM tracking approaches reviewed above are based on basic image analysis functions (e.g. thresholding, segmentation, computation of point cloud centroid) or extended matching procedures using motion and measurement models (e.g. CONDENSATION algorithm, Kalman filtering) applied to the range data.
In this article, a RIM sequence tracking approach (2.5-D least squares tracking; LST) is proposed, which combines RIM intensity and range observations in an integrated geometric transformation model. Based on 2-D least squares matching (LSM), a closed solution for intensity and range observations has been developed. In contrast to motion model techniques, intensity observations are also included in the least squares (LS) adjustment. By adding complementary information, an increase in accuracy and reliability can be expected.
2 SENSOR AND DATA
RIM sensors (Figure 1) allow the simultaneous acquisition of intensity and range images of, in principle, any scene (Figure 2). In the field of RIM sensor technology, 3-D cameras are currently available with a sensor size of up to 25,000 pixels and a frame rate of up to 50 Hz. Based on a phase-measuring time-of-flight (TOF) principle, the camera is able to measure distances for each pixel in addition to the gray value information (Oggier et al., 2004). As a result, a spatiotemporally resolved representation of the object space is given in the form of intensity images and range maps. The calculation of 3-D coordinates is performed on-chip. Image coordinates as well as range information are transformed into Cartesian coordinates using the relationship between image and object space as described in Kahlmann and Ingensand (2006). Several assumptions are implied, which have to be verified by suitable photogrammetric calibration techniques (Kahlmann et al., 2006; Westfeld, 2007a).
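The two steps named in this paragraph, deriving a range from a measured phase shift and turning an image point plus its range into Cartesian coordinates, can be illustrated with a minimal Python sketch. It is not the on-chip computation and not the exact relationship given in Kahlmann and Ingensand (2006). It assumes an ideal continuous-wave TOF signal, an ideal pinhole camera without lens distortion, and a range measured along the image ray from the projection center. The function names, the 20 MHz modulation frequency and the 8 mm principal distance are illustrative assumptions, not values taken from the source.

```python
import math

C = 299_792_458.0  # speed of light in m/s


def phase_to_range(phase_rad, f_mod_hz):
    """Range in metres from the phase shift of a continuous-wave TOF signal.

    The light travels to the object and back, hence the factor 4*pi
    (2*pi per modulation period, times 2 for the round trip).
    """
    return C * phase_rad / (4.0 * math.pi * f_mod_hz)


def unambiguous_range(f_mod_hz):
    """Largest range measurable before the phase wraps around (phase = 2*pi)."""
    return C / (2.0 * f_mod_hz)


def image_to_object(x_mm, y_mm, range_m, c_mm):
    """Cartesian coordinates (X, Y, Z) in metres in the camera frame.

    x_mm, y_mm: image coordinates relative to the principal point (assumed
                already corrected for distortion)
    range_m:    measured range along the image ray
    c_mm:       principal distance of the assumed pinhole model
    """
    ray_length = math.sqrt(x_mm ** 2 + y_mm ** 2 + c_mm ** 2)
    scale = range_m / ray_length  # stretches the image ray to the measured range
    return (x_mm * scale, y_mm * scale, c_mm * scale)


if __name__ == "__main__":
    f_mod = 20e6  # assumed modulation frequency in Hz
    rng = phase_to_range(math.pi / 2.0, f_mod)
    print("range:", rng, "m of", unambiguous_range(f_mod), "m unambiguous")
    print("point:", image_to_object(1.2, -0.8, rng, 8.0))
```

The sketch makes the implied assumptions visible: any error in the principal distance, the principal point or the distortion model propagates directly into the 3-D coordinates, which is why a photogrammetric calibration is needed.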
* Corresponding author.
Advantages of this new 3-D mapping technology are the generation of 3-D data on a discrete raster without stereo compilation, the recording of motion sequences and the compact dimensions of the device.