International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences, Volume XXXIX-B3, 2012 
XXII ISPRS Congress, 25 August — 01 September 2012, Melbourne, Australia 
USING STEREO VISION TO SUPPORT THE AUTOMATED ANALYSIS OF 
SURVEILLANCE VIDEOS 
Moritz Menze and Daniel Muhle 
Institute of Photogrammetry and GeoInformation 
Leibniz Universität Hannover 
Nienburger Straße 1, 30167 Hannover, Germany 
menze@ipi.uni-hannover.de, muhle@ipi.uni-hannover.de 
http://www.ipi.uni-hannover.de 
Commission III/1 
KEY WORDS: Image Sequences, Stereoscopic Vision, Image Matching 
ABSTRACT: 
Video surveillance systems are no longer a collection of independent cameras, manually controlled by human operators. Instead, 
smart sensor networks are developed, able to fulfil certain tasks on their own and thus supporting security personnel by automated 
analyses. One well-known task is the derivation of people's positions on a given ground plane from monocular video footage. An 
improved accuracy for the ground position as well as a more detailed representation of single salient people can be expected from a 
stereoscopic processing of overlapping views. Related work mostly relies on dedicated stereo devices or camera pairs with a small 
baseline. While this set-up is helpful for the essential step of image matching, the high accuracy potential of a wide baseline and the 
correspondingly good intersection geometry are not utilised. In this paper we present a stereoscopic approach, working on overlapping views 
of standard pan-tilt-zoom cameras which can easily be generated for arbitrary points of interest by an appropriate reconfiguration of 
parts of a sensor network. Experiments are conducted on realistic surveillance footage to show the potential of the suggested approach 
and to investigate the influence of different baselines on the quality of the derived surface model. Promising estimates of people's 
positions and heights are obtained. Although standard matching approaches show helpful results, future work will incorporate temporal 
dependencies available from image sequences in order to reduce computational effort and improve the derived level of detail. 
1 INTRODUCTION 
The increasing number of surveillance cameras and the corre- 
sponding amount of videos to be checked establish a need for 
automated analysis of such footage. Due to the amount of video 
material on the one hand and limitations in financial and techni- 
cal resources on the other, human operators are not able to de- 
tect salient behaviour on-line for wider areas nor can they handle 
a manual reconfiguration of large sensor networks to achieve a 
good coverage. Completing these tasks with reasonable resources 
w.r.t. the number of cameras and security personnel 
is one major focus of research on video surveillance systems. 
Impressive progress has been made in terms of people detection 
and tracking, mostly relying on monocular views of the scene. 
Stereoscopic analysis is shown to be very helpful for detection 
and tracking, especially in more complex scenes or in presence of 
occlusions. Furthermore, dense image matching can be used to 
reconstruct the visible surface of people. This leads to a more ro- 
bust estimation of the person's position in the scene as well as to 
a hint regarding its posture, enabling additional reasoning about 
people's actions and interactions in three-dimensional space. Up 
to now, stereoscopic analysis has mostly been carried out using special 
and more expensive stereo devices or network set-ups. 
Recent work on reconfigurable smart camera networks aims at 
self-organising structures capable of automated reconfiguration 
with respect to maximum coverage or focus on individual sub- 
jects. This paper investigates the employment of stereo vision 
algorithms using pairs of standard pan-tilt-zoom (PTZ) cameras 
of a reconfigurable sensor network. The presented approach 
does not require special stereo devices or a dedicated and possibly 
redundant configuration of the camera network. 
The remainder of this paper is organised in four sections. Section 2 
gives a brief insight into related work, section 3 presents the 
proposed approach, section 4 describes its experimental validation, 
and section 5 draws conclusions and gives an outlook on future work. 
2 RELATED WORK 
Using stereo vision approaches in video surveillance is a notion 
that regularly appears in the literature. Many publications make 
use of stereo vision devices like Konolige’s Small Vision Sys- 
tem (Konolige, 1997) which consists of mechanically aligned 
cameras and dedicated stereo algorithms. It is integrated into a 
combined detection and tracking algorithm for one static stereo 
camera in (Haritaoglu et al., 1998). In that publication, the dispar- 
ity map is combined with an intensity based model to implement 
background subtraction. Removal of background is an important 
step in this paper as well, but it relies on colour information 
to extract foreground regions which are then processed by dense 
image matching. 
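To illustrate the general idea of colour-based foreground extraction, a minimal sketch is given here. It is not the method used in this paper: the Euclidean RGB distance, the threshold value and the function names foreground_mask and bounding_box are assumptions made for illustration only. The sketch marks pixels that deviate from a reference background image and returns the enclosing region to which dense matching could then be restricted.

```python
from math import sqrt


def foreground_mask(frame, background, threshold=30.0):
    """Mark pixels whose colour differs from the background image.

    frame, background: lists of rows, each row a list of (r, g, b) tuples.
    threshold: hypothetical Euclidean colour distance in 8-bit RGB units.
    """
    mask = []
    for frame_row, background_row in zip(frame, background):
        mask_row = []
        for (r, g, b), (bg_r, bg_g, bg_b) in zip(frame_row, background_row):
            distance = sqrt((r - bg_r) ** 2 + (g - bg_g) ** 2 + (b - bg_b) ** 2)
            mask_row.append(distance > threshold)
        mask.append(mask_row)
    return mask


def bounding_box(mask):
    """Return (row_min, col_min, row_max, col_max) of all foreground pixels,
    or None if the mask contains no foreground."""
    rows = [i for i, row in enumerate(mask) if any(row)]
    cols = [j for row in mask for j, value in enumerate(row) if value]
    if not rows:
        return None
    return (min(rows), min(cols), max(rows), max(cols))


# Illustrative 2 x 2 image: one pixel deviates clearly from a grey background.
background = [[(100, 100, 100), (100, 100, 100)],
              [(100, 100, 100), (100, 100, 100)]]
frame = [[(100, 100, 100), (200, 50, 50)],
         [(100, 100, 100), (100, 100, 100)]]
mask = foreground_mask(frame, background)
region = bounding_box(mask)
```

Restricting the matching to such a region reduces the search effort, at the cost of depending on the quality of the background model.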
(Zhao et al., 2005) employ a network of stereo cameras with 
overlapping fields of view which yields reliable tracking results 
even in complex environments but also makes excessive use of 
resources. Less resource-intensive is the approach presented by 
(Zhou et al., 2010). A pair of PTZ cameras is installed with a 
short baseline and utilised to generate high-resolution images of 
detected people as well as three-dimensional information from 
stereo vision. In contrast to these approaches, the presented work 
investigates the application of stereo vision in almost arbitrary 
PTZ camera networks allowing its use in existing video surveil- 
lance systems. 
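The effect of the baseline on the attainable accuracy can be made concrete with the well-known error propagation for the normal case of stereo, where the standard deviation of the depth is Z^2 / (f * B) times the standard deviation of the disparity. The following sketch assumes this idealised parallel-axis geometry, which only approximates the convergent views of a PTZ camera pair. The function name depth_precision and all numerical values are illustrative assumptions, not results from this paper.

```python
def depth_precision(distance_m, baseline_m, focal_px, disparity_sigma_px=0.5):
    """Depth standard deviation in metres for the normal case of stereo.

    distance_m: distance Z between the cameras and the object, in metres.
    baseline_m: distance B between the two projection centres, in metres.
    focal_px: focal length f expressed in pixels.
    disparity_sigma_px: assumed matching precision in pixels.
    """
    return distance_m ** 2 / (focal_px * baseline_m) * disparity_sigma_px


# Illustrative values: a person at 20 m, focal length of 1000 pixels.
short_baseline = depth_precision(20.0, 0.5, 1000.0)  # baseline of 0.5 m
wide_baseline = depth_precision(20.0, 5.0, 1000.0)   # baseline of 5 m
```

Under these assumptions the depth uncertainty scales inversely with the baseline, so a ten times wider baseline reduces it by a factor of ten, provided that matching across the wider baseline remains equally precise.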
In this paper, we focus on the derivation of more robust and 
more accurate geometrical information about people classified as 
salient. Thus, the approach has to be embedded into a frame- 
work of components for people detection and tracking (Monari et 
 
	        