ISPRS Commission III, Vol.34, Part 3A ,,Photogrammetric Computer Vision", Graz, 2002
corresponds to the portion of the model that is visible in the im-
age. For example, if the object is 50% occluded, the score (on
average) cannot exceed 0.5. This is a highly desirable property
because it gives the user the means to select an intuitive threshold
for when an object should be considered as recognized.
A desirable feature of the above similarity measures (2)-(4) is
that they do not need to be evaluated completely when object
recognition is based on a threshold smin for the similarity mea-
sure that a potential match must achieve. Let s; denote the partial
sum of the dot products up to the j-th element of the model. For
the match metric that uses the sum of the normalized dot prod-
ucts, this is:
| deve) |
(5)
YE dill + legs ll
Obviously, all the remaining terms of the sum are all < 1. There-
fore, the partial score can never achieve the required score Smin
if s; < Smin — 1 + j/n, and hence the evaluation of the sum
can be discontinued after the j-th element whenever this condi-
tion is fulfilled. This criterion speeds up the recognition process
considerably.
Nevertheless, further speed-ups are highly desirable. Another cri-
terion is to require that all partial sums have a score better than
Smin, Le, 8j > Smin. When this criterion is used, the search will
be very fast, but it can no longer be ensured that the object recog-
nition finds the correct instances of the model because if missing
parts of the model are checked first, the partial score will be be-
low the required score. To speed up the recognition process with a
very low probability of not finding the object although it is visible
in the image, the following heuristic can be used: the first part of
the model points is examined with a relatively safe stopping cri-
terion, while the remaining part of the model points are examined
with the hard threshold spin. The user can specify what fraction
of the model points is examined with the hard threshold with a
parameter g. If g = 1, all points are examined with the hard
threshold, while for g = 0, all points are examined with the safe
stopping criterion. With this, the evaluation of the partial sums is
stopped whenever s; « min(smin — 1 -- fj /n, $minj /n), where
f = (1 — gSmin)/(1 — Smin). Typically, the parameter g can
be set to values as high as 0.9 without missing an instance of the
model in the image.
3 OBJECT RECOGNITION
The above similarity measures are applied in an object recogni-
tion system for industrial inspection that recognizes objects under
similarity transformations, i.e., translation, rotation, and uniform
scaling, in real time. Although only similarity transformations are
implemented at the moment, extensions to general affine trans-
formations are not difficult to implement. The system consists of
two modules: an offline generation of the model and an online
recognition.
The model is generated from an image of the object to be recog-
nized. An arbitrary region of interest specifies the object's loca-
tion in the image. Usually, the ROI is specified by the user. Alter-
natively, it can be generated by suitable segmentation techniques.
To speed up the recognition process, the model is generated in
multiple resolution levels, which are constructed by building an
image pyramid from the original image. The number of pyramid
levels lax is chosen by the user.
Each resolution level consists of all possible rotations and scal-
ings of the model, where thresholds $5, and $max for the angle
and omin and omax for the scale are selected by the user. The
step length for the discretization of the possible angles and scales
can either be done automatically by a method similar to the one
described in (Borgefors, 1988) or be set by the user. In higher
pyramid levels, the step length for the angle is computed by dou-
bling the step length of the next lower pyramid level.
The rotated and scaled models are generated by rotating and scal-
ing the original image of the current pyramid level and perform-
ing the feature extraction in the rotated image. This is done be-
cause the feature extractors may be anisotropic, i.e., the extracted
direction vectors may depend on the orientation of the feature in
the image in a biased manner. If it is known that the feature ex-
tractor is isotropic, the rotated models may be generated by per-
forming the feature extraction only once per pyramid level and
transforming the resulting points and direction vectors.
The feature extraction can be done by a number of different im-
age processing algorithms that return a direction vector for each
image point. One such class of algorithms are edge detectors, e.g,
the Sobel or Canny (Canny, 1986) operators. Another useful class
of algorithms are line detectors (Steger, 1998). Finally, corner de-
tectors that return a direction vector, e.g., (Fórstner, 1994), could
also be used. Because of runtime considerations the Sobel filter is
used in the current implementation of the object recognition sys-
tem. Since in industrial inspection the lighting can be controlled,
noise does not pose a significant problem in these applications.
To recognize the model, an image pyramid is constructed for the
image in which the model should be found. For each level of
the pyramid, the same filtering operation that was used to gen-
erate the model, e.g., Sobel filtering, is applied to the image.
This returns a direction vector for each image point. Note that
the image is not segmented, i.e., thresholding or other operations
are not performed. This results in true robustness to illumination
changes.
To identify potential matches, an exhaustive search is performed
for the top level of the pyramid, i.e., all precomputed models of
the top level of the model resolution hierarchy are used to com-
pute the similarity measure via (2), (3), or (4) for all possible
poses of the model. A potential match must have a score larger
than a user-specified threshold smin and the corresponding score
must be a local maximum with respect to neighboring scores. As
described in Section 2, the threshold smin is used to speed up the
search by terminating the evaluation of the similarity measure as
early as possible. With the termination criteria, this seemingly
brute-force strategy actually becomes extremely efficient. On av-
erage, about 9 pixels of the model are tested for every pose on the
top level of the pyramid.
After the potential matches have been identified, they are tracked
through the resolution hierarchy until they are found at the lowest
level of the image pyramid. Various search strategies like depth-
first, best-first, etc., have been examined. It turned out that a
breadth-first strategy is preferable for various reasons, most no-
tably because a heuristic for a best-first strategy is hard to define,
and because depth-first search results in slower execution if all
matches should be found.
Once the object has been recognized on the lowest level of the
image pyramid, its position and rotation are extracted to a reso-
lution better than the discretization of the search space, i.e., the
translation is extracted with subpixel precision and the angle and
scale with a resolution better than their respective step lengths.
This is done by fitting a second order polynomial (in the four
pose variables) to the similarity measure values in a 3 x 3 x 3 x 3
neighborhood around the maximum score. The coefficients of
the polynomial are obtained by convolution with 4D facet model
masks. The corresponding 2D masks are given in (Steger, 1998).
They generalize to arbitrary dimensions in a straightforward man-
ner.
4 LEAST-SQUARES POSE REFINEMENT
While the pose obtained by the extrapolation algorithm is accu-
rate enough for most applications, in some applications an even
A - 347