A training set of positive examples of such objects is used to determine the best combination of low-level features for detecting them.
A histogram of an image is a plot of the intensity values occurring in the image against the number of pixels taking each value. The shape of the histogram describes the image (or a region of it): a narrow histogram indicates low contrast, a histogram skewed toward one end indicates a predominantly dark or bright image, and a bimodal histogram can imply the presence of two distinct populations of pixels. Histogram features are statistically derived from the intensity distribution of the image, and these features encode simple global properties for the image. A low-contrast image will have a narrow histogram, low variance, and low entropy. The skew is positive when the tail of the distribution extends to the right (the positive side), and negative when the tail is out to the left. Energy is high when the distribution is concentrated in a small number of different intensity levels, while entropy encodes how evenly the intensity values in the image are spread across the available levels. Complex images have high entropy, which tends to vary inversely with energy.
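As a concrete illustration of these statistics, the following sketch (not taken from the paper; NumPy is an assumed tool, and the region is a random placeholder) computes them from the intensity histogram of a greyscale region.

import numpy as np

region = np.random.default_rng(0).integers(0, 256, size=(64, 64))  # placeholder greyscale region
hist, _ = np.histogram(region, bins=256, range=(0, 256))
p = hist / hist.sum()                    # normalised intensity distribution
levels = np.arange(256)

mean = (p * levels).sum()
variance = (p * (levels - mean) ** 2).sum()
std = np.sqrt(variance)
skew = (p * (levels - mean) ** 3).sum() / (std ** 3 + 1e-12)  # sign follows the tail of the distribution
energy = (p ** 2).sum()                  # large when a few intensity levels dominate
entropy = -(p[p > 0] * np.log2(p[p > 0])).sum()               # large for complex, busy regions
print(mean, variance, skew, energy, entropy)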
2.2 Edge and keypoint features
Edge density can be used as a measure of the busyness of an image region, since complex regions contain many edges. Edge features are extracted with simple differential operators such as Sobel (Duda et al.) and Roberts Cross (Roberts), or with edge detectors such as Canny (Canny, 1986) and Marr-Hildreth (Marr and Hildreth, 1980), whose responses are thresholded and thinned to produce an edge map (see figure 1(a)). Corner detectors such as Harris (see figure 1(b)) are popular because they offer a degree of viewpoint invariance, but the features they detect are numerous and so cannot on their own identify objects of interest. Keypoint detectors that detect fewer, more distinctive points, such as SIFT (Lowe, 2004), address this.
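The following is a minimal sketch of such an edge-density measure, assuming OpenCV (cv2) as the implementation, which the paper does not specify; the input image and the hysteresis thresholds are placeholders.

import cv2
import numpy as np

image = np.random.default_rng(0).integers(0, 256, (240, 320), dtype=np.uint8)  # placeholder greyscale image
edges = cv2.Canny(image, threshold1=100, threshold2=200)  # Canny edge detection with hysteresis thresholds
edge_density = np.count_nonzero(edges) / edges.size       # fraction of pixels marked as edges
print(edge_density)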
(a) Canny (b) Harris
Figure 1: Edge and corner keypoint detectors.
Saliency detection models aspects of the human visual system, a pre-attentive process that analyses an image to find the regions containing more “interesting” pixels. The images in figure 2 show high responses for regions with many edges, representing busyness in the image, or for changes in the intensity or frequency content of the image.
Figure 2: Examples of saliency detectors. (a) Frequency-tuned (Achanta et al., 2009); (b) Maximal Symmetric Surround (Achanta and Süsstrunk, 2010).
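As an illustration of the kind of map such detectors produce, the sketch below follows the spirit of the frequency-tuned approach, measuring each pixel's Lab-colour distance from the image mean; OpenCV is an assumed tool here and the input image and parameters are placeholders, not taken from the cited work.

import cv2
import numpy as np

bgr = np.random.default_rng(0).integers(0, 256, (240, 320, 3), dtype=np.uint8)  # placeholder colour image
lab = cv2.cvtColor(bgr, cv2.COLOR_BGR2LAB).astype(np.float64)
blurred = cv2.GaussianBlur(lab, (5, 5), 0)                # suppress high-frequency detail
mean_colour = lab.reshape(-1, 3).mean(axis=0)             # mean Lab colour of the whole image
saliency = np.linalg.norm(blurred - mean_colour, axis=2)  # per-pixel distance to the mean colour
saliency /= saliency.max() + 1e-12                        # normalise to [0, 1]
print(saliency.shape, saliency.max())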
Segmentation is the process of partitioning an image into multiple segments. Edge based saliency maps are used to segment the image into interesting and non-interesting regions by simple
thresholding. Figure 3 demonstrates how this procedure drasti-
cally reduces the area of the image expected to contain meaning-
ful information about the objects of interest.
Figure 3: Saliency based segmentation using simple thresholding on the saliency map. (a) Edge based saliency map (Rosin, 2009); (b) Mask applied to the image showing the interesting region.
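A minimal sketch of this thresholding step, assuming OpenCV and a precomputed saliency map (both the image and the map are random placeholders here, and the threshold value is illustrative), is:

import cv2
import numpy as np

rng = np.random.default_rng(0)
image = rng.integers(0, 256, (240, 320, 3), dtype=np.uint8)  # placeholder image
saliency = rng.random((240, 320))                             # placeholder saliency map in [0, 1]

mask = (saliency > 0.5).astype(np.uint8)                      # illustrative threshold
mask = cv2.morphologyEx(mask, cv2.MORPH_CLOSE, np.ones((5, 5), np.uint8))  # fill small holes
interesting = cv2.bitwise_and(image, image, mask=mask)        # zero out non-interesting pixels
print("fraction of image retained:", mask.mean())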
The methods described so far do not require any offline pre-processing; however, they are weak at detecting salient image regions while maintaining a low false alarm rate for higher-level features of interest. Learning a model of saliency offline is a more promising method for detecting salient image regions.
2.4 Learning based features
In order to detect salient regions of an image, a model of saliency can be learned from training data and compared against new images. The Support Vector Machine (SVM) is a method of
supervised machine learning based on the theory of statistical
learning (Cortes and Vapnik, 1995). The theory behind the SVM guarantees that any finite set of training data in an N-dimensional feature space can be made linearly separable by mapping it into an (N + M)-dimensional space (where M may be arbitrarily large, possibly infinite). The SVM finds a separating hyperplane
(in N + M dimensional space) between two classes of training
data (the positive and the negative examples). The placement of
this hyperplane is such that the distances between the hyperplane
and the closest training instances (the support vectors) on either
side of the plane are maximised. Since noise in the training data
cannot be avoided, the SVM is extended to incorporate a “soft-
margin” around the hyperplane to allow training instance outliers
to sit on the wrong side of the hyperplane. The complete learning
algorithm seeks to maximise the distance of the support vectors to
the hyperplane, while minimising the distances to the separating hyperplane of any training instances that fall on its wrong side. Finally, since training data is not always linearly separable in the provided N-dimensional feature space, a kernel function can be used to implicitly map the data into a higher-dimensional feature space, increasing the likelihood that a separating hyperplane with a good fit to the data can be found. The implied mapping may correspond to a high-degree polynomial (or an even richer transformation) of the training data, but this incurs no extra processing overhead because the training data only ever appears inside dot products, which the kernel function evaluates directly. SVMs have
demonstrated excellent performance in a number of similar stud-
ies (Felzenszwalb et al., 2010), (Dalal and Triggs, 2005), (Lin et
al., 2011) concerning object detection.
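A minimal sketch of these ideas, using scikit-learn (an assumed tool, not named in the paper), shows the roles of the soft-margin parameter C and of an RBF kernel on toy data that is not linearly separable in its original space.

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))                       # toy 2-D data
y = (np.linalg.norm(X, axis=1) > 1.0).astype(int)   # not linearly separable: inside vs. outside a circle

clf = SVC(kernel="rbf", C=1.0, gamma="scale")       # soft margin (C) plus the kernel trick
clf.fit(X, y)
print("support vectors:", clf.support_vectors_.shape[0])
print("training accuracy:", clf.score(X, y))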
Bounding boxes are positioned around examples of the objects to
be identified in a set of training images (see figure 4). Features
for these bounded regions are calculated and then concatenated
as N-dimensional feature vectors (where N is the number of fea-
tures used) to generate a set of positive training examples. The
same number of negative feature vectors is generated from randomly sampled image regions that do not contain the objects of interest.
The positive and negative examples are passed to an SVM for
training, and five-fold cross-validation is performed, varying the parameters of the kernel functions of the SVM to identify an optimal model without overfitting to the training data (linear and
radial basis functions are evaluated for their performance dur-
ing cross-validation). The generated model represents a weight vector whose scalar product with a feature vector calculated from a new (previously unseen) image region determines whether that region is salient or not. The feature
measurements explored in our approach consist of: Histogram of
Orientations (over whole image sub-regions), edge density, Har-
ris keypoint density, FAST keypoint density, mean depth of the
range image (in the image sub-region), standard deviation of the
intensity histogram, skew of the intensity histogram, energy of
the intensity histogram, and entropy of the intensity histogram.
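The following sketch outlines the training and classification procedure described above, with randomly generated stand-ins for the real feature vectors (the feature extraction itself is not shown); scikit-learn and all parameter values are assumptions, not the authors' implementation.

import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(1)
n, d = 100, 9                                       # examples per class, feature dimension
X_pos = rng.normal(loc=1.0, size=(n, d))            # stand-ins for positive feature vectors
X_neg = rng.normal(loc=-1.0, size=(n, d))           # stand-ins for negative feature vectors
X = np.vstack([X_pos, X_neg])
y = np.array([1] * n + [0] * n)

search = GridSearchCV(
    SVC(),
    param_grid={"kernel": ["linear", "rbf"], "C": [0.1, 1.0, 10.0]},
    cv=5,                                           # five-fold cross-validation
)
search.fit(X, y)

new_region = rng.normal(loc=1.0, size=(1, d))       # feature vector from an unseen image region
score = search.best_estimator_.decision_function(new_region)[0]
print(search.best_params_, "salient" if score > 0 else "not salient")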
Figure 4: Training images displaying positive training instances
(yellow) and negative instances (red).
Initial results are promising, with an 85% correct identification rate