tion images. They use local directional histograms to segment
regions showing similar grid orientation. In (Price, 2000) mul-
tiple high resolution images and Digital Surface Models (DSM)
are combined to extract the urban road grid in complex, though
stereotypical, residential areas. After manual initialization of two
intersecting road segments defining the first mesh, the grid is it-
eratively expanded by hypothesizing new meshes and matching
them to image edges. During final verification, the system exploits the contextual knowledge that streets are elongated structures whose sides may be defined by high objects like buildings or trees. Thus, so-called extended streets (a few consecutive road segments) are
simultaneously adjusted by moving them to local minima of the
DSM while isolated and badly rated segments are removed. The
internal evaluation of a road segment mainly depends on the edge
support found during hypothesis matching. However, ratings of
single segments may be altered during verification of extended
streets, which seems justified since this verification is carried out
from a more global perspective on the object "road".
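To make this DSM-based adjustment concrete, the following sketch illustrates the underlying idea in a strongly simplified form (it is not Price's implementation; the function, the sampling density, and the search range are our own assumptions): a single road segment is shifted laterally to the position where the mean DSM height along its profile is lowest.

```python
import numpy as np

def adjust_segment_to_dsm_valley(dsm, p0, p1, max_shift=5):
    """Shift a road segment laterally so that it comes to lie in a local
    "valley" of the DSM (simplified illustration, not Price's implementation).
    dsm       -- 2D array of surface heights indexed by (row, col)
    p0, p1    -- segment end points as (row, col) tuples
    max_shift -- lateral search range in DSM cells"""
    p0, p1 = np.asarray(p0, float), np.asarray(p1, float)
    direction = p1 - p0
    normal = np.array([-direction[1], direction[0]])
    normal /= np.linalg.norm(normal)
    samples = np.linspace(0.0, 1.0, 20)                  # positions along the segment
    best_shift, best_height = 0, np.inf
    for s in range(-max_shift, max_shift + 1):
        pts = p0 + np.outer(samples, direction) + s * normal
        rows = np.clip(np.round(pts[:, 0]).astype(int), 0, dsm.shape[0] - 1)
        cols = np.clip(np.round(pts[:, 1]).astype(int), 0, dsm.shape[1] - 1)
        height = dsm[rows, cols].mean()                  # mean height along the shifted profile
        if height < best_height:
            best_shift, best_height = s, height
    return p0 + best_shift * normal, p1 + best_shift * normal, best_height
```

In the actual approach, several consecutive segments of an extended street are adjusted simultaneously, and isolated or badly rated segments are removed.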
An interesting approach regarding the role of internal evalua-
tion is employed in the system of (Tupin et al., 1999) for find-
ing consistent interpretations of SAR scenes (Synthetic Aperture
RADAR). In a first step, different low-level operators with specific strengths are applied to extract image primitives, i.e., cues for roads, rivers, urban/industrial areas, relief characteristics, etc. Since a particular operator may vote for more than one object class (e.g., road and river), a so-called focal and a non-focal element are defined for each operator (usually a union of real-world object classes). The operator response is transformed into
a confidence value characterizing the match with its focal ele-
ment. Then, all confidence values are combined in an evidence-
theoretical framework to assign to each primitive a unique semantics with an associated probability. Finally, a feature adjacency graph is constructed in which global knowledge about objects (road segments form a network, industrial areas are close to cities, ...) is introduced in the form of object adjacency probabilities.
Based on the probabilities of objects and their relations the final
scene interpretation is formulated as a graph labelling problem
that is solved by energy minimization. In (Tönjes et al., 1999),
scene interpretation is based on a priori knowledge stored in a se-
mantic net and rules for controlling the extraction. Each instance
of an object, e.g., a road axis, is hypothesized top-down and inter-
nally evaluated by comparing the expected attribute values of the
object with the actual values measured in the image. Competing
alternative hypotheses are stored in a search tree as long as no
further hypotheses can be formed. Finally, the best interpretation
is selected from the tree by an optimum path search.
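The flavour of the final graph-labelling step in (Tupin et al., 1999) can be conveyed by the following toy sketch; it replaces their actual energy minimization with simple iterated conditional modes, and the variable names and probability arrays are purely illustrative.

```python
import numpy as np

def label_adjacency_graph(unary, edges, pairwise, n_iter=10):
    """Toy relaxation of the graph-labelling idea: each primitive receives the
    label minimizing -log P(label) plus -log P(label_i, label_j) summed over
    its neighbours in the feature adjacency graph.
    unary    -- (n_primitives, n_labels) array of per-primitive label probabilities
    edges    -- list of (i, j) index pairs of adjacent primitives
    pairwise -- (n_labels, n_labels) array of object adjacency probabilities"""
    eps = 1e-9
    u_cost = -np.log(unary + eps)
    p_cost = -np.log(pairwise + eps)
    labels = np.argmin(u_cost, axis=1)            # start from the unary optimum
    neighbours = {i: [] for i in range(len(unary))}
    for i, j in edges:
        neighbours[i].append(j)
        neighbours[j].append(i)
    for _ in range(n_iter):                       # iterated conditional modes
        for i in range(len(unary)):
            cost = u_cost[i].copy()
            for j in neighbours[i]:
                cost += p_cost[:, labels[j]]      # interaction with the neighbour's label
            labels[i] = np.argmin(cost)
    return labels
```

Each primitive starts with the label favoured by its own confidence values and is then repeatedly re-labelled so that it also agrees with the labels of its neighbours in the adjacency graph.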
In summary, many approaches derive confidence values from low-level features such as lines or edges. In subsequent steps these values are propagated and aggregated, eventually providing the basis for the final decision about the presence of the desired object. This procedure may cause problems since the evaluation
is purely based on local features while global object properties
are neglected. Therefore, some approaches introduce additional
knowledge (e.g., roads forming a network or fitting to "valleys"
of a DSM) at a later stage when more evidence for an object has
been acquired. All the approaches mentioned have in common that they use a single predefined model for simultaneously extracting and
evaluating roads. Due to the complexity of urban areas, however,
it is appropriate to use a flexible model for extraction and evalu-
ation, which can easily adapt to specific situations occurring dur-
ing the extraction, e.g., lower intensities and weaker contrast in
shadow areas. Before describing our evaluation methodology in
more detail we give a brief summary of the extraction system.
3 SYSTEM OVERVIEW
Our system tries to accommodate aspects that have proved to be of great importance for road extraction: By integrating a flexible,
detailed road and context model one can capture the varying ap-
pearance of roads and the influence of background objects such
as trees, buildings, and cars in complex scenes. The fusion of dif-
ferent scales helps to eliminate isolated disturbances on the road
while the fundamental structures are emphasized (Mayer and Ste-
ger, 1998). This can be supported by considering the function of roads, namely to connect different sites, thereby forming a fairly dense and sometimes even regular network. Exploiting these network characteristics adds global information and thus makes the selection of the correct hypotheses easier. As basic data, our system expects high resolution aerial images (resolution ≈ 15 cm) and a reasonably accurate DSM with a ground resolu-
tion of about 1 m. In the following, we sketch our road model and
extraction strategy. For a comprehensive description we refer the
reader to (Hinz et al., 2001a, Hinz et al., 2001b).
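The effect of fusing different scales, as mentioned above, can be sketched as follows; the binary candidate masks and the tolerance parameter are hypothetical and merely illustrate the principle of confirming fine-scale detections at the coarse scale.

```python
import numpy as np
from scipy import ndimage

def fuse_scales(fine_candidates, coarse_candidates, tolerance=3):
    """Keep only those fine-scale road candidates that are confirmed by a
    detection at the coarse scale (hypothetical masks, illustrative only).
    fine_candidates, coarse_candidates -- boolean masks of road hypotheses
    tolerance -- allowed positional discrepancy in pixels between the scales"""
    # Tolerate small positional differences by dilating the coarse-scale mask.
    support = ndimage.binary_dilation(coarse_candidates, iterations=tolerance)
    labelled, n = ndimage.label(fine_candidates)
    keep = np.zeros(fine_candidates.shape, dtype=bool)
    for region in range(1, n + 1):
        mask = labelled == region
        if np.any(mask & support):      # candidate has coarse-scale support
            keep |= mask
    # Isolated fine-scale detections (cars, shadows, ...) are discarded,
    # while position and width are still taken from the fine scale.
    return keep
```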
3.1 Road and Context Model:
The road model illustrated in Fig. 1 a) compiles knowledge about
radiometric, geometric, and topological characteristics of urban
roads in form of a hierarchical semantic net. The model rep-
resents the standard case, i.e., the appearance of roads is not
affected by relations to other objects. It describes objects by
means of “concepts”, and is split into three levels defining dif-
ferent points of view. The real world level comprises the objects
to be extracted: The road network, its junctions and road links,
as well as their parts and specializations (road segments, lanes,
markings,...). These concepts are connected to the concepts of
the geometry and material level via concrete relations (Tönjes et
al., 1999). The geometry and material level is an intermediate
level which represents the 3D-shape of an object as well as its material, thereby describing objects independently of sensor characteristics and viewpoint (Clément et al., 1993). In contrast, the image level, which is subdivided into a coarse and a fine scale, comprises the
features to detect in the image: Lines, edges, homogeneous re-
gions, etc. Whereas the fine scale gives detailed information, the
coarse scale adds global information. Because of the abstraction at the coarse scale, additional correct road hypotheses can be found and sometimes false ones can be eliminated based on topological criteria, while details, like the exact width and position of the lanes and markings, are integrated from the fine scale. In this
way the extraction benefits from both scales.
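To give an impression of how such a hierarchical semantic net can be represented, the following fragment encodes a few concepts and their part-of and concrete relations; the concept names and the class layout are our own illustration, not the actual model.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Concept:
    """One node of the semantic net (illustrative fragment only)."""
    name: str
    level: str                                   # 'real world', 'geometry/material' or 'image'
    parts: List['Concept'] = field(default_factory=list)     # part-of relations within a level
    concrete: List['Concept'] = field(default_factory=list)  # concrete relations to the lower level

# Image level, split into coarse and fine scale features.
bright_line    = Concept('bright line (coarse scale)', 'image')
parallel_edges = Concept('parallel edges (fine scale)', 'image')

# Geometry and material level: sensor- and viewpoint-independent description.
asphalt_stripe = Concept('elongated flat asphalt region', 'geometry/material',
                         concrete=[bright_line, parallel_edges])

# Real world level: the objects to be extracted.
road_segment = Concept('road segment', 'real world', concrete=[asphalt_stripe])
road_link    = Concept('road link', 'real world', parts=[road_segment])
road_network = Concept('road network', 'real world', parts=[road_link])
```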
The road model is extended by knowledge about context: So-
called context objects, i.e., background objects like buildings or
vehicles, may hinder road extraction if they are not modelled appropriately, but they substantially support the extraction if they
are part of the road model. We define global and local context:
Global context: The motivation for employing global context
stems from the observation that it is possible to find semantically
meaningful image regions — so-called context regions — where
roads show typical prominent features and where certain rela-
tions between roads and background objects have a similar im-
portance. Consequently, the relevance of different components
of the road model and the importance of different context rela-
tions (described below) must be adapted to the respective context
region. In urban areas, for instance, relations between vehicles
and roads are more important since traffic is usually much denser
inside settlements than in rural areas. Following (Baumgartner et al., 1999), we distinguish urban, forest, and rural context regions.
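A minimal sketch of such an adaptation is given below; the parameter names and values are purely hypothetical and only illustrate that each context region selects its own extraction parameters (the shadow adjustment anticipates the local context relation described next).

```python
# Hypothetical parameter profiles per global context region (values are placeholders).
CONTEXT_PROFILES = {
    'urban':  {'vehicle_relation_weight': 0.8, 'min_line_contrast': 15, 'expect_markings': True},
    'rural':  {'vehicle_relation_weight': 0.2, 'min_line_contrast': 25, 'expect_markings': False},
    'forest': {'vehicle_relation_weight': 0.1, 'min_line_contrast': 10, 'expect_markings': False},
}

def parameters_for(context_region: str, in_shadow: bool = False) -> dict:
    """Return extraction parameters adapted to the global context region;
    a local 'shadow' relation further relaxes the contrast requirement."""
    params = dict(CONTEXT_PROFILES[context_region])
    if in_shadow:
        params['min_line_contrast'] *= 0.5   # weaker contrast expected in shadow areas
    return params
```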
Local context: We model the local context with so-called con-
text relations, i.e., certain relations between a small number of
road and context objects. In dense settlements, for instance, the
footprints of buildings are almost parallel to roads and therefore give strong hints for road sides. Vice versa, buildings or
other high objects potentially occlude larger parts of a road or cast
shadows on it. A context relation "occlusion" triggers the selection of another image providing a better view of this particular part of the scene, whereas a context relation "shadow" can tell an extraction algorithm to choose modified parameter settings. Vehicles also occlude the pavement of a lane segment. Hence, vehicle
outlines as, e.g., detected by the algorithm of (Hinz and Baum-
gartner, 2001) can be directly treated as parts of a lane. In a very