International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences, Volume XXXIX-B3, 2012
XXII ISPRS Congress, 25 August — 01 September 2012, Melbourne, Australia
USING STEREO VISION TO SUPPORT THE AUTOMATED ANALYSIS OF
SURVEILLANCE VIDEOS
Moritz Menze and Daniel Muhle
Institute of Photogrammetry and GeoInformation
Leibniz Universität Hannover
Nienburger Straße 1, 30167 Hannover, Germany
menze@ipi.uni-hannover.de, muhle@ipi.uni-hannover.de
http://www.ipi.uni-hannover.de
Commission III/1
KEY WORDS: Image Sequences, Stereoscopic Vision, Image Matching
ABSTRACT:
Video surveillance systems are no longer a collection of independent cameras, manually controlled by human operators. Instead,
smart sensor networks are developed, able to fulfil certain tasks on their own and thus supporting security personnel by automated
analyses. One well-known task is the derivation of people's positions on a given ground plane from monocular video footage. An
improved accuracy for the ground position as well as a more detailed representation of single salient people can be expected from a
stereoscopic processing of overlapping views. Related work mostly relies on dedicated stereo devices or camera pairs with a small
baseline. While this set-up is helpful for the essential step of image matching, the high accuracy potential of a wide baseline and the
correspondingly good intersection geometry is not utilised. In this paper we present a stereoscopic approach, working on overlapping views
of standard pan-tilt-zoom cameras which can easily be generated for arbitrary points of interest by an appropriate reconfiguration of
parts of a sensor network. Experiments are conducted on realistic surveillance footage to show the potential of the suggested approach
and to investigate the influence of different baselines on the quality of the derived surface model. Promising estimates of people's
position and height are obtained. Although standard matching approaches show helpful results, future work will incorporate temporal
dependencies available from image sequences in order to reduce computational effort and improve the derived level of detail.
1 INTRODUCTION
The increasing number of surveillance cameras and the corre-
sponding amount of videos to be checked establish a need for
automated analysis of such footage. Due to the amount of video
material on the one hand and limitations in financial and techni-
cal resources on the other, human operators are not able to de-
tect salient behaviour on-line for wider areas nor can they handle
a manual reconfiguration of large sensor networks to achieve a
good coverage. Completing these tasks with reasonable resources
w.r.t. the number of cameras and security personnel
is one major focus of research on video surveillance systems.
Impressive progress has been made in terms of people detection
and tracking, mostly relying on monocular views of the scene.
Stereoscopic analysis is shown to be very helpful for detection
and tracking, especially in more complex scenes or in presence of
occlusions. Furthermore, dense image matching can be used to
reconstruct the visible surface of people. This leads to a more ro-
bust estimation of the person's position in the scene as well as to
a hint regarding their posture, enabling additional reasoning about
people's actions and interactions in three-dimensional space. Up
to now, stereoscopic analysis has mostly been carried out using special
and more expensive stereo devices or network set-ups.
Recent work on reconfigurable smart camera networks aims at
self-organising structures capable of automated reconfiguration
with respect to maximum coverage or focus on individual sub-
jects. This paper investigates the employment of stereo vision
algorithms using pairs of standard pan-tilt-zoom (PTZ) cameras
of a reconfigurable sensor network. The presented approach
does not require special stereo devices or a dedicated and possibly
redundant configuration of the camera network.
The remainder of this paper is organised in four sections. It gives
a brief insight into related work in section 2, presents the pro-
posed approach in section 3, describes its experimental valida-
tion in section 4 and ends with section 5 drawing conclusions and
giving an outlook to future work.
2 RELATED WORK
Using stereo vision approaches in video surveillance is a notion
that regularly appears in the literature. Many publications make
use of stereo vision devices like Konolige’s Small Vision Sys-
tem (Konolige, 1997) which consists of mechanically aligned
cameras and dedicated stereo algorithms. It is integrated into a
combined detection and tracking algorithm for one static stereo
camera in (Haritaoglu et al., 1998). In that publication, the dispar-
ity map is combined with an intensity based model to implement
background subtraction. Removal of background is an important
step in this paper as well, but here it relies on colour information
to extract foreground regions which are then processed by dense
image matching.
(Zhao et al., 2005) employ a network of stereo cameras with
overlapping fields of view which yields reliable tracking results
even in complex environments but also makes excessive use of
resources. Less resource-intensive is the approach presented by
(Zhou et al., 2010). A pair of PTZ cameras is installed with a
short baseline and utilised to generate high-resolution images of
detected people as well as three-dimensional information from
stereo vision. In contrast to these approaches, the presented work
investigates the application of stereo vision in almost arbitrary
PTZ camera networks allowing its use in existing video surveil-
lance systems.
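
The influence of the baseline length on the achievable depth accuracy can be illustrated with the textbook stereo normal case, where depth follows from disparity as Z = f * B / d and error propagation gives sigma_Z = Z^2 / (f * B) * sigma_d. The following is only a minimal sketch under that assumption; it is not the method of this paper. The function name and all numerical values (distance, focal length, matching precision) are hypothetical and chosen purely for illustration.

```python
def depth_sigma(distance_m, baseline_m, focal_px, sigma_disparity_px=0.5):
    """Depth standard deviation for the stereo normal case.

    From Z = f * B / d, error propagation yields
    sigma_Z = Z**2 / (f * B) * sigma_d.
    """
    if baseline_m <= 0 or focal_px <= 0:
        raise ValueError("baseline and focal length must be positive")
    return distance_m ** 2 / (focal_px * baseline_m) * sigma_disparity_px


if __name__ == "__main__":
    # Hypothetical set-up: person 20 m from the cameras, focal length
    # of 1000 px, matching precision of 0.5 px.
    for baseline in (0.2, 2.0, 10.0):
        sigma = depth_sigma(20.0, baseline, 1000.0)
        print(f"B = {baseline:5.1f} m -> sigma_Z = {sigma:.3f} m")
```

Under these assumptions the depth uncertainty shrinks in proportion to the baseline. Wide-baseline PTZ pairs are generally convergent rather than parallel, so the normal case is only an approximation of their geometry, and the gain in accuracy has to be weighed against the more difficult image matching.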
In this paper, we focus on the derivation of more robust and
more accurate geometrical information about people classified as
salient. Thus, the approach has to be embedded into a frame-
work of components for people detection and tracking (Monari et