International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences, Vol XXXV, Part B5. Istanbul 2004
the difficulties of the task are the changes that can appear in the
background image, such as changes in illumination, shadows and
other changes in the scenery. The topic has received great atten-
tion in the past as well as in the present. Toyama et al. (1999)
list ten different algorithms to solve most of the common prob-
lems of background estimation. Among those are simple adjacent
frame differencing, pixel-wise mean values and fixed threshold-
ing, pixel-wise mean and covariance values and thresholding and
several advanced and combined methods.
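The simplest of these schemes, adjacent frame differencing, can be sketched in a few lines of NumPy. This is an illustrative sketch, not code from the cited work; the threshold value is chosen arbitrarily for demonstration:

```python
import numpy as np

def frame_difference_mask(prev_frame, frame, threshold=25):
    """Adjacent frame differencing with a fixed threshold.

    A pixel is flagged as foreground when any of its color channels
    changes by more than `threshold` between consecutive frames.
    The threshold value is illustrative, not taken from the literature.
    """
    diff = np.abs(frame.astype(np.int32) - prev_frame.astype(np.int32))
    return diff.max(axis=-1) > threshold  # boolean foreground mask
```

As the text notes, such pixel-wise schemes handle only the simplest cases; illumination changes and shadows defeat a fixed threshold.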
At the beginning of our experiments we used a simple test se-
quence, which was used by Prati et al. (2003) for their shadow de-
tection algorithms and which they have provided to the research
community. We tested two algorithms using a varying number
of images from the test sequence. The first algorithm simply
computes the average of the frames. The second algorithm is a
state-of-the-art background estimator from a commercial image-
processing library. Both algorithms were tested on a set of 4 and
a set of 32 images from the sequence. Selected frames and the
results are shown in figure 3.
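The first of the two baselines, per-pixel frame averaging, amounts to a single reduction over the frame axis. A minimal sketch, assuming the registered frames are stacked into one array:

```python
import numpy as np

def mean_background(frames):
    """Estimate the background as the per-pixel mean of all frames.

    frames: array of shape (N, H, W, 3) holding N registered images.
    With few frames, moving objects bias the mean and leave the
    ghosting artifacts described in the text.
    """
    frames = np.asarray(frames, dtype=np.float64)
    return frames.mean(axis=0)
```

Because every frame contributes equally to the mean, a foreground object present in even one of four frames shifts the result by a quarter of its color difference, which explains the visible ghosting for short sequences.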
The results reveal a problem of classic background estimation:
the algorithms adapt slowly when no proper initialization is avail-
able. When the sequence contains only four frames clear ghosting
effects are visible for both the simple averaging and the commer-
cial implementation. When the length of the sequence is extended
the results are better, but still both algorithms produce artifacts in
the background image. Usually the number of frames needed for
proper initialization is above 100.
Since we wish to keep the number of images and thereby the extra
effort of manual image acquisition to a minimum, this behavior
is not acceptable. Therefore we propose a different mechanism
of computing the background image from a small set of images.
Most background estimation processes consider only one image
at a time, a strategy suited to processing a continuous stream of
images. Our method, in contrast, processes the full set of images
at once, iterating over each pixel location. We can think of the
image sequence as a stack of per-pixel registered images, in which
each layer holds the pixel of a different image at the same
location. Thus we can create a pixel stack for each pixel
location.
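This rearrangement into per-pixel stacks can be expressed as a single axis transposition. A sketch, assuming the input frames are registered and of equal size:

```python
import numpy as np

def pixel_stacks(frames):
    """Rearrange N registered frames of shape (H, W, 3) into an array
    of shape (H, W, N, 3), so that stacks[y, x] holds the N color
    samples observed at pixel location (y, x) across the sequence.
    """
    return np.stack(frames, axis=2)
```

Each `stacks[y, x]` is then one pixel stack: a small set of N color vectors on which the clustering described below operates independently of all other locations.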
Using the basic assumption of background estimation that the
background dominates over the changing foreground objects, we
have to identify the subset of pixels within each pixel stack that
resembles the background. In other words, we have to identify the
set of pixels that forms the best consensus within each pixel
stack. Projecting the pixels into a certain color space, we can
think of the problem as a clustering task, where we have to find
the largest cluster of pixels and disregard outliers. Figure 4 shows
the pixels taken from the same location in four images of the test
sequence of figure 3. The pixels are displayed in the two dimen-
sions (red/green) of the three-dimensional RGB color space. The
diagram shows three pixels in the upper right corner, which form
a cluster, and an outlier in the lower left corner.
Any unsupervised clustering technique could be employed in or-
der to solve this task. Our approach is inspired by the RANSAC
method introduced by Fischler and Bolles (1981). However, since
the number of samples is small we do not have to randomly se-
lect candidates, but we can rather test all available samples. For
every sample we test how many of the other samples are compat-
ible. Compatibility is tested using a distance function within the
selected color space; either the Mahalanobis or the Euclidean
distance can be used. If the distance is below a fixed threshold,
the samples are deemed compatible. The tested sample receives
a vote for each compatible sample.
Figure 4: A sequence of images is treated as a stack of images.
The graphic on the left shows a stack of four images. A single
pixel is observed along a ray passing through all images at exactly
the same position.
The diagram to the right shows the pixel values in red/green color
space. Three pixels agree while one is clearly an outlier.
Figure 5: The result of the proposed algorithm on the test se-
quence. The left image shows the result obtained from computa-
tion in RGB color space, while the right image shows the result
from HSV color space. In both cases the persons walking before
the background were completely suppressed from only four im-
ages. The computation in HSV color space gives slightly better
results.
The sample with the largest number of votes is selected as the
background pixel.
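The exhaustive voting scheme for a single pixel stack can be sketched as follows. This is an illustrative implementation using the Euclidean variant; the threshold value is an assumption chosen for demonstration, not a value from our experiments:

```python
import numpy as np

def consensus_pixel(samples, threshold=30.0):
    """Select the background color for one pixel stack by exhaustive
    voting, in the spirit of RANSAC but testing every sample.

    samples: (N, 3) array of color vectors for one pixel location.
    Each sample receives a vote for every other sample within
    `threshold` (Euclidean distance in the chosen color space).
    The threshold is illustrative, not a value from the paper.
    """
    samples = np.asarray(samples, dtype=np.float64)
    # Pairwise Euclidean distances between all N samples.
    d = np.linalg.norm(samples[:, None, :] - samples[None, :, :], axis=-1)
    # Count compatible samples per candidate, excluding the self-vote.
    votes = (d < threshold).sum(axis=1) - 1
    # The sample with the most votes is taken as the background pixel.
    return samples[np.argmax(votes)]
```

Note that `np.argmax` returns the first maximum, so when clusters tie the sample from the earliest frame wins, which is consistent with the tie-breaking fallback described below for figure 7 (c).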
For the implementation we had to choose an appropriate distance
function and select a proper color space. While some researchers
favor RGB color space, others have reported good results with
CIE color space (Coorg and Teller, 1999). A comparison of the
results of the proposed algorithm from computation in RGB and
HSV color space is given in figure 5. Simple Euclidean distance
performed sufficiently in our test. For this test four input images
were used. In both cases the persons walking before the back-
ground were completely suppressed.
Figure 6 shows a sequence of four images from a fixed viewpoint.
The object depicted is the facade of a house imaged from across
a busy street. Traffic and pedestrians partially occlude the lower
portion of the facade. In figure 7 the results of the background
estimation process are shown. The algorithm is able to remove
all occlusions of moving objects from the images. Figure 7 (b) is
an image that was computed by combining those samples which
received the fewest votes during clustering. The image
thereby combines all outlier pixels and thus contains a combina-
tion of all occlusions. It serves as a comparison to image 7 (a)
to assess the quality of the removal. In image 7 (c) all pixels are
marked in white for which no unique cluster could be determined,
i.e. all samples received only one vote or two clusters received the
same number of votes. In these cases the pixel of the first frame
was chosen to represent the background.
4 MULTI-VIEW FUSION
In section 3 we have shown how moving objects occluding a
facade can be removed using several images taken from a fixed
viewpoint. A static object occluding a facade will be imaged at