In: Paparoditis N., Pierrot-Deseilligny M., Mallet C., Tournaire O. (Eds), IAPRS, Vol. XXXVIII, Part 3A - Saint-Mandé, France, September 1-3, 2010
(a) Input video (b) Depth image (c) GMM on color (d) GMM on color and depth
Figure 1: Results for (a) a challenging input video, (b) a depth image where invalid measurements are black, (c) the foreground mask
using a GMM on color only and (d) the foreground mask using a GMM on color and depth.
ble refinements of the depth for a color pixel. Again a bilateral
filter is applied to this volume, and after sub-pixel refinement a
proposed depth is obtained. The optimization is performed iteratively
to achieve the final depth map. The incorporation of a
second view is also discussed. In (Bartczak and Koch, 2009) a
similar method using multiple views was presented.
An approach working with one color image and multiple depth
images is described in (Rajagopalan et al., 2008). Here the data
fusion is formulated in a statistical manner and modeled using
Markov Random Fields on which an energy minimization method
is applied.
Another advanced method to combine depth and color information
was introduced in (Lindner et al., 2008). It is based on edge-preserving
biquadratic upscaling and performs a special treatment
of invalid depth measurements.
3 GAUSSIAN MIXTURE MODELS
All observations at each pixel position $\mathbf{x} = [x, y]^T$ are modeled
with a mixture of Gaussians to estimate the background. The
assumption is that an object in the line of sight associated with
a certain pixel produces a Gaussian-shaped observation, or several
in the case of a periodically changing appearance (e.g., moving
leaves, monitor flickering). Each observation is then modeled
with one Gaussian whose mean and variance are adapted over
time. An observation at time $t$ for a pixel $\mathbf{x}$ is given by $s_t(\mathbf{x}) =
[s_t^1(\mathbf{x}), s_t^2(\mathbf{x}), \ldots, s_t^n(\mathbf{x})]^T$. The probability density
of $s_t(\mathbf{x})$ can now be described by
$$f_{s_t(\mathbf{x})}(s) = \sum_i \omega_i \cdot \mathcal{N}_s(\mu_i, \Sigma_i) \qquad (1)$$
determine the parameters of the mixture, the usual online clustering
approach is used in this work: When a new observation $s(\mathbf{x})$
arrives, it is checked whether it is similar to already modeled observations
or whether it originates from a new object. It may also just be
noise. This is done by evaluating the Mahalanobis distance $\delta(\cdot, \cdot)$
towards the associated Gaussian $\mathcal{N}_s(\mu_i, \Sigma_i)$,
$$\delta(s, \mu_i) = \sqrt{(\mu_i - s)^T \Sigma_i^{-1} (\mu_i - s)} < T_{near} \qquad (3)$$
with $T_{near}$ being a given constant. If similar observations have
been recorded, their Gaussian is adapted using the observed data.
Otherwise, a new Gaussian is created and added to the mixture.
An exact description of a possible implementation can be found
in (Stauffer and Grimson, 1999) for normal videos and in (Harville
et al., 2001) with additional depth values.
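The online clustering step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation from the cited papers: the covariance is taken to be diagonal, and the learning rate `alpha`, the initial variance `init_var`, the threshold value `t_near=2.5` and the dictionary layout of a Gaussian are hypothetical choices made only for this example. A cap on the number of Gaussians per pixel is omitted for brevity.

```python
import math

def mahalanobis(s, mean, var):
    """Distance (3) for a diagonal covariance given as a list of variances."""
    return math.sqrt(sum((m - v) ** 2 / c for v, m, c in zip(s, mean, var)))

def update_mixture(mixture, s, t_near=2.5, alpha=0.05, init_var=30.0):
    """Match s against the mixture; adapt the closest match or add a Gaussian.

    mixture: list of dicts with keys 'mean', 'var', 'weight' (assumed layout).
    Returns the index of the Gaussian that now represents s.
    """
    best, best_d = None, t_near
    for k, g in enumerate(mixture):
        d = mahalanobis(s, g["mean"], g["var"])
        if d < best_d:  # closest Gaussian within T_near wins
            best, best_d = k, d

    if best is None:
        # No similar observation recorded so far: create a new Gaussian.
        mixture.append({"mean": list(s), "var": [init_var] * len(s), "weight": alpha})
        best = len(mixture) - 1
    else:
        # Adapt mean and variance of the matched Gaussian (assumed running average).
        g = mixture[best]
        for j, v in enumerate(s):
            diff = v - g["mean"][j]
            g["mean"][j] += alpha * diff
            g["var"][j] += alpha * (diff * diff - g["var"][j])

    # Decay all weights, reward the matched Gaussian, renormalize to sum 1.
    for k, g in enumerate(mixture):
        g["weight"] = (1 - alpha) * g["weight"] + (alpha if k == best else 0.0)
    total = sum(g["weight"] for g in mixture)
    for g in mixture:
        g["weight"] /= total
    return best
```

Calling `update_mixture` once per frame and pixel keeps one small list of Gaussians per pixel. Observations close to an existing mean refine that Gaussian, while outliers start a new one with a small weight.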
An observation for a pixel is given by $s(\mathbf{x}) = (y, c_b, c_r, z, a)^T$
in this work and contains the color value in YCbCr format, a
depth value $z$ and an amplitude modulation value $a$. The 2D/3D
camera produces a full-size color image and low-resolution depth
and amplitude modulation images, which are resized to match the
color image by nearest-neighbor interpolation. The covariance matrices of
all Gaussians are restricted to be diagonal to simplify computations.
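As an illustration of how the per-pixel observation vectors could be assembled, the sketch below upsamples the low-resolution depth and amplitude images to the color resolution with nearest-neighbor sampling and stacks the five channels. The function names, the nested-list image layout and the index mapping (integer floor of the scaled coordinate) are assumptions made for this example; the camera pipeline may use a different convention.

```python
def resize_nearest(img, out_h, out_w):
    """Upscale a 2D list by nearest-neighbor sampling (floor index mapping)."""
    in_h, in_w = len(img), len(img[0])
    return [[img[r * in_h // out_h][c * in_w // out_w] for c in range(out_w)]
            for r in range(out_h)]

def build_observations(ycbcr, depth, amplitude):
    """Return per-pixel tuples (y, cb, cr, z, a) at color resolution.

    ycbcr: 2D list of (y, cb, cr) tuples; depth, amplitude: smaller 2D lists.
    """
    h, w = len(ycbcr), len(ycbcr[0])
    z = resize_nearest(depth, h, w)
    a = resize_nearest(amplitude, h, w)
    return [[(*ycbcr[r][c], z[r][c], a[r][c]) for c in range(w)]
            for r in range(h)]
```

Nearest-neighbor sampling copies existing depth values instead of blending them, so no intermediate depths are invented at object boundaries.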
When working with ToF data, invalid depth measurements due to
low reflectance have to be handled cautiously. A depth measurement
is considered invalid if the corresponding amplitude is lower
than a given threshold. In (Harville et al., 2001) an elaborate logical
condition is used to classify a new observation. Experiments
show that this can be simplified by using the measure
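$$\delta(s, \mu_i)^2 = (\mu_i - s)^T \, \Sigma_i^{-1} \, \mathrm{diag}(1, \lambda_c, \lambda_c, \lambda_z, \lambda_a) \, (\mu_i - s) \qquad (4)$$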
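and checking the condition
where
$$\mathcal{N}_s(\mu_i, \Sigma_i) = \frac{1}{\sqrt{(2\pi)^{\dim(s)} \det(\Sigma_i)}} \exp\left\{ -\tfrac{1}{2} (s - \mu_i)^T \Sigma_i^{-1} (s - \mu_i) \right\} \qquad (2)$$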
is the multivariate Gaussian with mean $\mu_i$ and covariance matrix
$\Sigma_i$. Clearly, for the mixing coefficients $\omega_i$ we must have
$\sum_i \omega_i = 1$. How many Gaussians should be used to model the
observations, how to adapt the Gaussians efficiently over time, and
which Gaussians should be considered background are questions
that arise immediately. Most GMM-based methods rely on
very simple assumptions for efficiency. They use a fixed number
of Gaussians per pixel, and the minimum number of Gaussians
whose weights sum up to a given threshold are treated
as background.
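The background selection rule just described can be sketched as below. Taking the Gaussians in order of decreasing weight is an assumption that follows from asking for the minimum number of components; the function name and the threshold value `0.7` are placeholders for this example.

```python
def background_indices(weights, threshold=0.7):
    """Indices of the fewest Gaussians whose weights reach the threshold."""
    # Largest weights first, so the cumulative sum grows as fast as possible.
    order = sorted(range(len(weights)), key=lambda k: weights[k], reverse=True)
    chosen, total = [], 0.0
    for k in order:
        chosen.append(k)
        total += weights[k]
        if total >= threshold:
            break
    return chosen
```

A pixel whose current observation matches one of the returned Gaussians would then be labeled background, and foreground otherwise.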
The adaptation of the Gaussians over time is a bit more complicated.
Instead of using the EM algorithm or similar methods to
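$$\delta(s, \mu_i)^2 < T_{near} \cdot \mathrm{Tr}\big(\mathrm{diag}(1, \lambda_c, \lambda_c, \lambda_z, \lambda_a)\big) = T_{near} (1 + 2\lambda_c + \lambda_z + \lambda_a) \qquad (5)$$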
where $\lambda_z \in \{0, 1\}$ indicates whether the current and previous
depth measurements are both valid. The mechanism from
(Harville et al., 2001) works well to that end. Similarly, $\lambda_c \in
\{0, 1\}$ indicates whether the chromaticity channels of the current
observation as well as the recorded information provide trustworthy
values. This can be estimated simply by checking whether both
luminance values, or their means respectively, are above a certain
threshold. Finally, $\lambda_a \in \{0, 1\}$ determines whether the amplitude
modulation should be used for the classification; it is specified a
priori.
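A compact sketch of this matching test is given below. It assumes that the measure (4) weights each squared, variance-normalized difference by its indicator (1 for luminance, $\lambda_c$ for both chroma channels, $\lambda_z$ for depth, $\lambda_a$ for amplitude) and that the threshold scales with the number of active channels as in (5). The helper `indicators` is a deliberately simple stand-in: it derives $\lambda_z$ from amplitude thresholds rather than from the mechanism of (Harville et al., 2001), and the values `y_min`, `a_min` and `t_near` are hypothetical.

```python
def indicators(s, mean, y_min=30.0, a_min=50.0, use_amplitude=True):
    """Return (lam_c, lam_z, lam_a) for s = (y, cb, cr, z, a) and a Gaussian mean.

    Assumed simplification: chroma is trusted if both luminance values exceed
    y_min, and depth is trusted if both amplitudes exceed a_min.
    """
    lam_c = int(s[0] > y_min and mean[0] > y_min)
    lam_z = int(s[4] > a_min and mean[4] > a_min)
    lam_a = int(use_amplitude)
    return lam_c, lam_z, lam_a

def matches(s, mean, var, lam_c, lam_z, lam_a, t_near=2.5):
    """Evaluate condition (5) for a Gaussian with diagonal covariance var."""
    lam = (1, lam_c, lam_c, lam_z, lam_a)
    # Weighted squared distance: channels with indicator 0 drop out.
    d2 = sum(l * (m - v) ** 2 / c for l, v, m, c in zip(lam, s, mean, var))
    # The threshold grows with the number of active channels, Tr(diag(lam)).
    return d2 < t_near * sum(lam)
```

With `lam_z = 0`, a large depth difference no longer prevents a match, which is the intended behavior when either depth value is unreliable.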
This matching function accounts for the fact that observations
in the color, depth and amplitude modulation dimensions are in
practice not independent. A foreground object has most likely not