ISPRS Commission III, Vol.34, Part 3A „Photogrammetric Computer Vision“, Graz, 2002
OCCLUSION, CLUTTER, AND ILLUMINATION INVARIANT OBJECT RECOGNITION
Carsten Steger
MVTec Software GmbH
Neherstraße 1, 81675 München, Germany
steger@mvtec.com
Commission III, Working Group III/5
KEY WORDS: Computer Vision, Real-Time Object Recognition
ABSTRACT
An object recognition system for industrial inspection that recognizes objects under similarity transformations in real time is proposed.
It uses novel similarity measures that are inherently robust against occlusion, clutter, and nonlinear illumination changes. They can be
extended to be robust to global as well as local contrast reversals. The matching is performed based on the maxima of the similarity
measure in the transformation space. For normal applications, subpixel-accurate poses are obtained by extrapolating the maxima of
the similarity measure from discrete samples in the transformation space. For applications with very high accuracy requirements,
least-squares adjustment is used to further refine the extracted pose.
1 INTRODUCTION
Object recognition is used in many computer vision applications.
It is particularly useful for industrial inspection tasks, where of-
ten an image of an object must be aligned with a model of the
object. The transformation (pose) obtained by the object recog-
nition process can be used for various tasks, e.g., pick and place
operations or quality control. In most cases, the model of the ob-
ject is generated from an image of the object. This 2D approach
is taken because it usually is too costly or time consuming to cre-
ate a more complicated model, e.g., a 3D CAD model. Therefore,
in industrial inspection tasks one is usually interested in match-
ing a 2D model of an object to the image. The object may be
transformed by a certain class of transformations, depending on
the particular setup, e.g., translations, Euclidean transformations,
similarity transformations, or general 2D affine transformations
(which are usually taken as an approximation to the true perspec-
tive transformations an object may undergo).
A large number of object recognition strategies exist. The ap-
proach to object recognition proposed in this paper uses pixels as
its geometric features, i.e., not higher level features like lines or
elliptic arcs. Therefore, only similar pixel-based strategies will
be reviewed.
Several methods have been proposed to recognize objects in im-
ages by matching 2D models to images. A survey of matching
approaches is given in (Brown, 1992). In most 2D matching
approaches the model is systematically compared to the image
using all allowable degrees of freedom of the chosen class of
transformations. The comparison is based on a suitable similar-
ity measure (also called match metric). The maxima or minima
of the similarity measure are used to decide whether an object is
present in the image and to determine its pose. To speed up the
recognition process, the search is usually done in a coarse-to-fine
manner, e.g., by using image pyramids (Tanimoto, 1981).
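Such a coarse-to-fine search rests on an image pyramid, i.e., a stack of successively downsampled images. The following is a minimal sketch of pyramid construction, using a simple 2x2 mean filter in place of the Gaussian smoothing a production system would typically apply; the function name `build_pyramid` is illustrative and not taken from the paper.

```python
import numpy as np

def build_pyramid(image, levels):
    """Build an image pyramid by repeated 2x2 mean downsampling.

    Each level halves the resolution, so a match found at a coarse
    level restricts the search range at the next finer level.
    (Sketch: a mean filter stands in for proper Gaussian smoothing.)
    """
    pyramid = [np.asarray(image, dtype=float)]
    for _ in range(levels - 1):
        img = pyramid[-1]
        # Crop to even dimensions so 2x2 blocks tile the image exactly.
        h, w = img.shape[0] // 2 * 2, img.shape[1] // 2 * 2
        img = img[:h, :w]
        # Average each non-overlapping 2x2 block.
        down = (img[0::2, 0::2] + img[1::2, 0::2]
                + img[0::2, 1::2] + img[1::2, 1::2]) / 4.0
        pyramid.append(down)
    return pyramid
```

The recognition then proceeds top-down: candidate poses found on the coarsest level are tracked through the pyramid and refined only in small neighborhoods on the finer levels.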
The simplest class of object recognition methods is based on the
gray values of the model and image itself and uses normalized
cross correlation or the sum of squared or absolute differences as
a similarity measure (Brown, 1992). Normalized cross correla-
tion is invariant to linear brightness changes but is very sensitive
to clutter and occlusion as well as nonlinear contrast changes.
The sum of gray value differences is not robust to any of these
changes, but can be made robust to linear brightness changes by
explicitly incorporating them into the similarity measure, and to
a moderate amount of occlusion and clutter by computing the
similarity measure in a statistically robust manner (Lai and Fang,
1999).
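The invariance of normalized cross correlation to linear brightness changes, and its lack of robustness to anything beyond that, can be seen directly from its definition. A minimal sketch (the function name `ncc` is illustrative):

```python
import numpy as np

def ncc(model, window):
    """Normalized cross correlation of a model patch with an image
    window of the same size.

    Subtracting the means and dividing by the standard deviations
    makes the score invariant to linear brightness changes
    (window -> a*window + b, a > 0), but every pixel still enters
    the sum, so occlusion and clutter corrupt the score.
    """
    m = model - model.mean()
    w = window - window.mean()
    denom = np.sqrt((m * m).sum() * (w * w).sum())
    if denom == 0.0:          # constant patch: correlation undefined
        return 0.0
    return float((m * w).sum() / denom)
```

For example, `ncc(patch, 2 * patch + 5)` is 1, while a global contrast reversal (`-patch`) yields -1, which is why plain NCC is also not robust to contrast reversals.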
A more complex class of object recognition methods does not use
the gray values of the model or object itself, but uses the object's
edges for matching (Borgefors, 1988, Rucklidge, 1997). In all ex-
isting approaches, the edges are segmented, i.e., a binary image
is computed for both the model and the search image. Usually,
the edge pixels are defined as the pixels in the image where the
magnitude of the gradient is maximum in the direction of the gra-
dient. Various similarity measures can then be used to compare
the model to the image. The similarity measure in (Borgefors,
1988) computes the average distance of the model edges and the
image edges. The disadvantage of this similarity measure is that
it is not robust to occlusions because the distance to the nearest
edge increases significantly if some of the edges of the model are
missing in the image.
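The sensitivity of this average-distance measure to occlusion can be illustrated with a small sketch. A real implementation would use a distance transform of the image edges for efficiency; here the nearest distances are computed by brute force, and the function name `average_edge_distance` is illustrative.

```python
import numpy as np

def average_edge_distance(model_pts, image_pts):
    """Average distance from each model edge point to its nearest
    image edge point, in the spirit of (Borgefors, 1988).

    If some model edges are occluded in the image, their nearest
    distances become large, and the average degrades quickly --
    which is the lack of occlusion robustness noted in the text.
    """
    model = np.asarray(model_pts, dtype=float)
    image = np.asarray(image_pts, dtype=float)
    # Pairwise distance matrix of shape (n_model, n_image).
    d = np.linalg.norm(model[:, None, :] - image[None, :, :], axis=2)
    # Nearest image edge for every model point, then the mean.
    return float(d.min(axis=1).mean())
```

With a perfectly matching edge set the measure is 0; removing a single model edge from the image pulls the average up by that point's full distance to the remaining edges.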
The Hausdorff distance similarity measure used in (Rucklidge,
1997) tries to remedy this shortcoming by calculating the maxi-
mum of the k-th largest distance of the model edges to the image
edges and the l-th largest distance of the image edges to the
model edges. If the model contains n points and the image con-
tains m edge points, the similarity measure is robust to 100k/n%
occlusion and 100l/m% clutter. Unfortunately, an estimate for m
is needed to determine l, which is usually not available.
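This partial (rank-based) Hausdorff distance can be sketched as follows, again by brute force rather than the distance transforms used in practice; the function name `partial_hausdorff` is illustrative.

```python
import numpy as np

def partial_hausdorff(model_pts, image_pts, k, l):
    """Partial Hausdorff distance in the spirit of (Rucklidge, 1997):
    the maximum of the k-th largest model-to-image distance and the
    l-th largest image-to-model distance.

    Discarding the largest model-to-image distances tolerates
    occluded model edges; discarding the largest image-to-model
    distances tolerates clutter edges.  (Illustrative sketch.)
    """
    model = np.asarray(model_pts, dtype=float)
    image = np.asarray(image_pts, dtype=float)
    d = np.linalg.norm(model[:, None, :] - image[None, :, :], axis=2)
    d_mi = np.sort(d.min(axis=1))[::-1]  # model-to-image, descending
    d_im = np.sort(d.min(axis=0))[::-1]  # image-to-model, descending
    # The k-th largest value sits at index k-1 in descending order.
    return float(max(d_mi[k - 1], d_im[l - 1]))
```

With k = l = 1 this reduces to the classical (non-robust) Hausdorff distance; increasing k masks the distances of occluded model points, which is exactly why a good choice of l requires the estimate of m mentioned above.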
All of these similarity measures have the disadvantage that they
do not take into account the direction of the edges. In (Olson and
Huttenlocher, 1997) it is shown that disregarding the edge direc-
tion information leads to false positive instances of the model in
the image. The similarity measure proposed in (Olson and Hut-
tenlocher, 1997) tries to improve this by modifying the Hausdorff
distance to also measure the angle difference between the model
and image edges. Unfortunately, the implementation is based on
multiple distance transformations, which makes the algorithm too
computationally expensive for industrial inspection.
Finally, another class of edge based object recognition algorithms
is based on the generalized Hough transform (Ballard, 1981). Ap-
proaches of this kind have the advantage that they are robust to
occlusion as well as clutter. Unfortunately, the GHT requires ex-
tremely accurate estimates for the edge directions or a complex
and expensive processing scheme, e.g., smoothing the accumula-
tor space, to determine whether an object is present and to deter-