and image operators such as the Histogram of Oriented Gradients (HOG) (Dalal and Triggs, 2005) that encapsulate changes in the magnitude and orientation of contrast over a grid of small image patches. HOG features have shown excellent performance in recognising a range of different object types, including natural objects as well as more artificial objects (Dalal and Triggs, 2005; Felzenszwalb et al., 2010; Schroff et al., 2008).
The success of such appearance based feature detection meth-
ods for intensity images led to the development of similar ap-
pearance based features for range images. Spin images (Johnson
and Hebert, 1999) use 2-D histograms rotated around a reference
point in space. Splash features (Stein and Medioni, 1992) are
similar to HOG features in that they collect a distribution of sur-
face normal orientations around a reference point. NARF (Nor-
mal Aligned Radial Feature) features (Steder et al., 2010) detect
stable surface regions combined with large depth changes in ob-
ject borders. The feature is designed to be stable across differ-
ent viewpoints. Tripod operators (Pipitone and Adams, 1993)
compactly encode surface shape information of objects by taking
surface range measurements at the three corners of an equilat-
eral triangle. Other range based descriptors include surface patch
representations (Chen and Bhanu, 2007), surface normal based
signatures (Li and Guskov, 2007), and tensor-based descriptors
(Mian et al., 2006). However, for the most part, there is still little
evidence that any of these range image based features are signif-
icantly better than any others for specific object detection tasks.
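The spin-image idea can be conveyed with a short sketch: each 3-D point near an oriented reference point is mapped to cylindrical coordinates (radial distance from the normal axis, signed height along the normal) and accumulated in a 2-D histogram. This is a minimal illustration of the concept, not the full descriptor of Johnson and Hebert; the bin count and support size below are arbitrary assumptions.

```python
import numpy as np

def spin_image(points, p, n, bins=10, size=1.0):
    """Sketch of a spin image: 2-D histogram of cylindrical coordinates.

    alpha: radial distance from the axis through p along normal n
    beta:  signed height along the normal
    """
    n = n / np.linalg.norm(n)
    d = points - p
    beta = d @ n                                          # height along normal
    alpha = np.sqrt(np.maximum((d * d).sum(axis=1) - beta ** 2, 0.0))
    hist, _, _ = np.histogram2d(alpha, beta, bins=bins,
                                range=[[0.0, size], [-size, size]])
    return hist
```

Because the histogram is accumulated by rotation about the surface normal, the representation is invariant to rotations around that axis, which is what makes spin images attractive for viewpoint-independent matching.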
Recent work has combined intensity based features with range to
first segment images into planar range regions before using this
information to guide the object detection process with intensity
based features (Rapus et al., 2008; Wei et al., 2011).
This paper evaluates a number of low-level feature extraction
methods for their usefulness in describing salient image regions
containing higher-level features/objects. The features of inter-
est include those based on the “bag of words” concept in which
many low-level features are used together to model the character-
istics of an image region in order to measure the saliency relative
to the whole image (section 2). These generate response maps
indicating regions of interest in the image. Feature extraction
methods that encode a greater amount of spatial and geometric
information from range and intensity image regions are discussed
in the context of their use in parts-based models for higher-level
object detection (section 3). Extraction of rudimentary line seg-
ment information from the 3-D images for use in detecting and
modelling/matching object geometry is also discussed. The pa-
per concludes with a summary of future work and known issues
to be addressed (section 4).
2 OBJECT SALIENCY
Given the large amount of data to be processed, it is necessary to
first extract candidate regions with greater likelihood of contain-
ing higher-level features of interest. For a given object detection
task, such as finding all bus shelters, a saliency detection method
is required that returns the approximate locations of all bus shelters
in the data; the price of this completeness is a high false alarm rate.
Subsequent processing is then more efficient because the amount of
remaining data is substantially reduced.
Some low-level features are more suited than others for discrim-
ination between high-level features of interest. It is recognised
that it is not possible to find a combination of one or more fea-
tures that will detect all high-level features of interest. The se-
lection of low-level features must be task driven; the objects to
be detected must first be specified so that the combination of fea-
tures that is most appropriate for the matching of such objects
can be used. Machine learning approaches using a training set
of manually chosen instances of the high-level features (positive
examples) as well as instances of other high-level features not
of the required class (negative examples) will determine the best
low-level features to use and how they can be combined to satisfy
the task.
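As a sketch of how such a learning stage might look, the following trains a minimal logistic-regression classifier on hypothetical low-level feature vectors (e.g. per-region statistics), using labelled positive and negative examples. The feature choice and learner here are illustrative assumptions only; a real system would use richer features and a stronger learner.

```python
import numpy as np

def train_logistic(X, y, lr=0.1, epochs=500):
    """Tiny logistic-regression trainer over low-level feature vectors.

    X: (n, d) feature matrix (e.g. per-region mean, variance, entropy);
    y: 0/1 labels (negative/positive examples of the target object).
    """
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])   # append bias column
    w = np.zeros(Xb.shape[1])
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-Xb @ w))           # sigmoid probabilities
        w -= lr * Xb.T @ (p - y) / len(y)           # mean-gradient step
    return w

def predict(w, X):
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    return (1.0 / (1.0 + np.exp(-Xb @ w)) > 0.5).astype(int)
```

On linearly separable toy data the trained weights recover the labels exactly; in practice the same loop would run over feature vectors extracted from the manually labelled training regions.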
2.1 Statistical based features
The statistical based features capture the scale invariant covari-
ance of object structure. The histogram of an image is a plot of
the number of pixels for each grey level value (or intensity values
of a colour channel for colour images). The shape of the his-
togram provides information about the nature of the image (or a
sub-region of the image). For example, a very narrow histogram
implies a low contrast image, while a histogram skewed toward
the high end implies a bright image, and a bi-modal histogram (or
a histogram with multiple strong peaks) can imply the presence
of one or more objects.
The histogram features considered in this paper are statistically
based in that the histogram models the probability distribution of
intensity levels in the image. These statistical features encode
characteristics of the intensity level distribution for the image. A
bright image will have a high mean and a dark image will have
a low mean. High contrast regions have high variance, and low
contrast images have low variance. The skew is positive when the
tail of the histogram spreads out to the right (positive side), and
is negative when the tail of the histogram spreads out to the left
(negative side). High energy means that the number of different
intensity levels in the region is low, i.e. the distribution is
concentrated over only a small number of intensity levels.
Entropy is a measure of the number of bits required to encode the
region data. Entropy increases as the pixel values in the image are
distributed among a larger number of intensity levels. Complex
regions have higher entropy and entropy tends to vary inversely
with energy.
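The statistics above can be computed directly from the normalised histogram, treating it as an estimate of the grey-level probability distribution. A minimal sketch for 8-bit images, assuming numpy:

```python
import numpy as np

def histogram_stats(image, levels=256):
    """Statistical features of the grey-level histogram described above."""
    hist = np.bincount(image.ravel(), minlength=levels).astype(float)
    p = hist / hist.sum()                     # probability of each grey level
    g = np.arange(levels)
    mean = (g * p).sum()
    var = ((g - mean) ** 2 * p).sum()
    sd = np.sqrt(var)
    skew = ((g - mean) ** 3 * p).sum() / sd ** 3 if sd > 0 else 0.0
    energy = (p ** 2).sum()                   # high when few levels dominate
    nz = p[p > 0]
    entropy = -(nz * np.log2(nz)).sum()       # bits needed per pixel
    return {"mean": mean, "variance": var, "skew": skew,
            "energy": energy, "entropy": entropy}
```

The inverse relationship between energy and entropy is easy to check: a constant region gives energy 1 and entropy 0, while a region split evenly between two grey levels gives energy 0.5 and entropy 1 bit.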
2.2 Localised keypoint, edge and corner features
Rosin (2009) argues for the density of edges as a measure of
salience because interesting objects have more edges. Edge fea-
tures have been very popular and range from simple differential
measures of adjacent pixel contrasts such as the Sobel (Duda et
al., 1973), Prewitt (Prewitt, 1970), and Roberts Cross (Roberts,
1963) operators to complex operators such as the Canny (Canny,
1986) and the Marr-Hildreth (Marr and Hildreth, 1980) edge de-
tectors. Canny produces single pixel wide edges allowing edge
linking, and exhibits good robustness to noise (see figure 1(a)).
The simpler operators such as Sobel require a threshold and thin-
ning to obtain single pixel wide edges. Corner detectors such as
Harris (Harris and Stephens, 1988) have been popular because
they produce features expressing a high degree of viewpoint in-
variance (see figure 1(b)). However, many of the features detected
by the Harris operator are false corners and so cannot be
semantically interpreted. More recently, keypoint detectors with
stronger viewpoint invariance that detect fewer false features have
been proposed, such as SIFT (Lowe, 2004), SURF (Bay et al., 2006)
and FAST (Rosten, 2006).
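The point about the simpler operators can be illustrated with a short sketch that computes the Sobel gradient magnitude and applies a global threshold (the thinning step is omitted). The kernels are the standard Sobel masks; the threshold value is left to the caller, as the text notes it must be.

```python
import numpy as np

def sobel_edges(image, threshold):
    """Sobel gradient magnitude with a simple global threshold.

    Returns a binary edge map; border pixels are left at zero and no
    thinning is applied, so edges may be more than one pixel wide.
    """
    kx = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], float)
    ky = kx.T
    img = image.astype(float)
    h, w = img.shape
    mag = np.zeros((h, w))
    for i in range(1, h - 1):
        for j in range(1, w - 1):
            patch = img[i - 1:i + 2, j - 1:j + 2]
            mag[i, j] = np.hypot((patch * kx).sum(), (patch * ky).sum())
    return mag > threshold
```

On a vertical step edge this marks the two columns adjacent to the step, showing why thinning (e.g. non-maximum suppression, as in Canny) is needed to reach single-pixel-wide edges.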
2.3 Saliency based features
Saliency based features take inspiration from aspects of the hu-
man visual system. This is a task driven process that analyses
global image features to identify image regions containing more
…