tested in remote sensing studies. Shao and Foerstner (1994) eval-
uate the potential of Gabor filters for aerial imagery, whereas
Galun et al. (2003) perform landscape image segmentation with
edge-based texture filters. Martin et al. (2004) have used a subset
of the root filter sets (RFS) for the segmentation and Lazaridis and
Petrou (2006) derive the texture features from the Walsh trans-
form. That smooth filter banks can be approximated by differ-
ences of box filters has been exploited for example by Bay et
al. (2008), who use box filters to approximate the responses of
Hessian filters. They also utilize integral images for rapid com-
putation of the filter responses. Drauschke and Mayer (2010)
compare and assess the performance of several of the previously
mentioned texture filters using images of building facades. They
outline the need for a more universal approach encompassing
desired properties of all tested filter banks, because each single
filter bank gives optimal results only for a specific dataset.
Other works propose to extract vast amounts of features and ap-
ply either genetic algorithms (van Coillie et al., 2007; Rezaei et
al., 2012) or partial least squares for feature dimensionality re-
duction (Schwartz et al., 2009; Hussain and Triggs, 2010) prior
to classifier training. In computer vision, the closest related work
to ours is probably the integral channel features method (Dollár
et al., 2009). For object detection, they randomly sample a large
number of box filter responses over the detection window and
use AdaBoost to select the most discriminative features, also us-
ing integral images. They show improvement over methods that
stick to hand-crafted feature layouts.
Deep belief networks (DBN) follow a similar line of thought in
that they try to learn (non-linear) feature extractors as part of un-
supervised pre-training (Hinton and Salakhutdinov, 2006; Ran-
zato et al., 2007). First steps have also been made to adapt them to
feature learning and patch-based classification of high-resolution
remote sensing data by Mnih and Hinton (2010, 2012). However,
in a recent evaluation by Tokarczyk et al. (2012), DBNs as feature
extractors did not improve the classification results compared to
standard linear filter banks for VHR remote sensing data.
An alternative strategy is to model local object patterns via prior
distributions with generative Markov Random Fields (MRF) or
discriminative Conditional Random Fields (CRF). Usually, such
methods are used to encode smoothness priors over local neigh-
borhoods (Schindler, 2012), but some works exist that instead of
applying smoothness constraints directly encode texture through
the prior term (Gimel'farb, 1996; Zhu et al., 1997). Helmholz
et al. (2012) apply the approach of Gimel'farb (1996) to aerial
and satellite images as one of several feature extractors inside a
semi-automatic quality control tool for topographic datasets.
2 METHODS
In line with recent machine learning literature, we pose feature
extraction and classification as a joint problem. Instead of pre-
selecting certain features that seem appropriate for describing a
particular scene and then training a classifier on them in a subsequent step, feature selection is left entirely to the classifier, so that those features are selected which best solve the classification problem for a
given dataset. In the following we first describe hand-crafted fea-
tures commonly used in remote sensing, which serve as our base-
line, before turning to details about the proposed quasi-exhaustive
features. Thereafter, the boosting algorithm for training (i.e., fea-
ture selection and weighting) and testing is explained.
2.1 Baseline features
We start by describing the baselines for our comparison, namely
hand-crafted features commonly used for classifying aerial and
satellite images. By hand-crafted we mean that a human ex-
pert makes an "educated guess" on what kind of features seem
appropriate for classification of the data at hand. Note that this
procedure runs the risk of losing important information, if an in-
formative feature is not anticipated. We propose to circumvent
manual pre-selection by computing a large number of intensity
values and intensity differences over a range of scales, both per
channel and between channels. The relevant subset is then auto-
matically selected during classifier training.
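As a rough illustration of this idea, the sketch below enumerates box-smoothed intensities per channel and differences between channels at a few window sizes; the window sizes and the helper function are illustrative assumptions, not the actual feature set described in Section 2.2.

import numpy as np
from scipy.ndimage import uniform_filter

def candidate_features(img, scales=(1, 3, 5, 9)):
    """img: (H, W, C) float array -> (H, W, F) stack of candidate features."""
    maps = []
    for s in scales:
        # box-smoothed intensity of every channel at this window size
        smooth = [uniform_filter(img[..., c], size=s) for c in range(img.shape[-1])]
        maps.extend(smooth)
        # differences between all pairs of channels at this window size
        for i in range(len(smooth)):
            for j in range(i + 1, len(smooth)):
                maps.append(smooth[i] - smooth[j])
    return np.stack(maps, axis=-1)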
2.1.1 15×15 pixel neighborhood State-of-the-art airborne
data is captured with a GSD < 0.2 m, while spaceborne imagery
can be acquired with a resolution of ≈ 0.5 m. In such very high
spatial resolution imagery even small objects like single trees
consist of several pixels, and on larger objects like building roofs,
sub-structures such as chimneys, dormers and tile patterns emerge.
To cope with the resulting high intra-class variability of the ra-
diometric signature, and to exploit class-specific texture patterns,
one should thus also consider the intensities in a pixel's neighborhood. Hence, for each pixel the intensities of its neighbors within a square window are also added to the feature vector. Typical window sizes range from 3×3 to 21×21, depending on image
resolution and object size. We have tested various window sizes and found 15×15 patches to be sufficient, leading to a
feature vector with 225 dimensions per channel. Thus we obtain
a 675 dimensional feature space for our test images with three
channels. Note that, since the classifier is free to base its decision on the intensities of the central pixel alone, the method includes the case of not using any neighborhood. Note also that, due to the strongly overlapping content of adjacent windows, using the neighborhood can
be expected to smooth the classification results.
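A minimal sketch of how such per-pixel neighborhood vectors could be assembled; the reflect-padding at the image borders is an assumption, since border handling is not specified here.

import numpy as np

def neighborhood_features(img, win=15):
    """img: (H, W, C) array -> (H, W, C*win*win) per-pixel feature vectors."""
    r = win // 2
    # border handling is an assumption; reflect-padding keeps the output size
    padded = np.pad(img, ((r, r), (r, r), (0, 0)), mode="reflect")
    H, W, C = img.shape
    feats = np.empty((H, W, C * win * win), dtype=img.dtype)
    for dy in range(win):
        for dx in range(win):
            shifted = padded[dy:dy + H, dx:dx + W, :]   # neighbor at offset (dy-r, dx-r)
            feats[..., (dy * win + dx) * C:(dy * win + dx + 1) * C] = shifted
    return feats

# e.g. a 3-channel image yields 3 * 15 * 15 = 675 values per pixel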
2.1.2 Augmented 15×15 pixel neighborhood This feature
set represents a standard “educated guess” in the optical remote
sensing domain. In addition to the local neighborhood of each
single pixel, we add the NDVI channel, as well as linear combi-
nations of the raw data found by PCA. Given N input channels,
all N principal components are added to the feature vectors, because a priori dimensionality reduction is not the goal here and the classification stage performs feature selection anyway. PCA and NDVI are treated like additional channels; thus, for each pixel we again put all values inside the 15×15 neighborhood into the
feature vector, thereby adding another 225 dimensions per chan-
nel. For input images with three channels plus three PCA chan-
nels and one NDVI channel we obtain a 1575-dimensional feature
space.
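A possible way to build the augmented channel stack is sketched below; the band order used for the NDVI and the eigen-decomposition used for the PCA are illustrative assumptions, not details given here.

import numpy as np

def augmented_channels(img, nir=2, red=0):
    """Stack raw channels with NDVI and all PCA components (band order assumed)."""
    H, W, C = img.shape
    nir_b, red_b = img[..., nir].astype(float), img[..., red].astype(float)
    ndvi = (nir_b - red_b) / (nir_b + red_b + 1e-8)       # standard NDVI definition
    X = img.reshape(-1, C).astype(float)
    Xc = X - X.mean(axis=0)
    # PCA via eigen-decomposition of the channel covariance; keep all C components
    _, vecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
    pca = (Xc @ vecs[:, ::-1]).reshape(H, W, C)
    return np.concatenate([img.astype(float), ndvi[..., None], pca], axis=-1)

# the augmented stack is then fed to the neighborhood extraction, giving
# (3 + 1 + 3) * 225 = 1575 dimensions for a 3-channel input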
2.1.3 Texture filters Here we use a filter bank widespread
in semantic segmentation in the computer vision domain. Im-
ages are first converted to an opponent Gaussian color model
(OGCM) in order to account for intensity changes due to lighting fluctuations or viewpoint changes (we have also tested several other color spaces, but OGCM yielded the most stable responses). Nor-
malized color channels are convolved with a set of filters adopted
from Winn et al. (2005), which is a subset of the RFS filters. It
contains three Gaussians, four first-order Gaussian derivatives,
and four Laplacian-of-Gaussians (LoG). Three Gaussian kernels
at scales {σ, 2σ, 4σ} are applied. The first-order Gaussian derivatives are computed separately in x- and y-direction at scales {2σ, 4σ}, thus yielding four responses, and the four LoGs have scales {σ, 2σ, 4σ, 8σ}. In total, 11 features are computed per channel, leading to a 33-dimensional feature space for our test images with three channels. We tested multiple choices of σ and found σ = 0.7 to deliver the best results. Note that by convolution of the channels with such filters each pixel's neighborhood is im-
plicitly considered.
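For illustration, the filter bank responses could be approximated with generic Gaussian-derivative filters as sketched below; these scipy kernels are stand-ins for the exact kernels of Winn et al. (2005), and the OGCM conversion is omitted.

import numpy as np
from scipy.ndimage import gaussian_filter, gaussian_laplace

def filter_bank_responses(channel, sigma=0.7):
    """11 filter responses for one (already color-normalized) channel."""
    responses = []
    for s in (sigma, 2 * sigma, 4 * sigma):                 # 3 Gaussians
        responses.append(gaussian_filter(channel, s))
    for s in (2 * sigma, 4 * sigma):                        # 4 first derivatives (x and y)
        responses.append(gaussian_filter(channel, s, order=(0, 1)))
        responses.append(gaussian_filter(channel, s, order=(1, 0)))
    for s in (sigma, 2 * sigma, 4 * sigma, 8 * sigma):      # 4 Laplacian-of-Gaussians
        responses.append(gaussian_laplace(channel, s))
    return np.stack(responses, axis=-1)                     # (H, W, 11)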
Figure 1: [caption truncated; mentions 4×4 patches]
2.2 Quasi-exhaustive features
The main idea of the proposed feature set is to avoid data-specific, manual feature design. We do not try to learn a compact basis of the highly redundant feature space by unsupervised pre-training (Hinton and Salakhutdinov, 2006) or generative modeling. Rather, we propose computing a large, redundant pool of simple features and leaving the data-specific selection of a small subset to the classifier, so that the selected features adapt to sensor characteristics, lighting conditions and scene content. Such a simple set of features can be computed very efficiently, whereas the responses of more elaborate filter banks vary a lot depending on the dataset (Drauschke and Mayer, 2010). Our quasi-exhaustive feature set consists of:
• raw pixel intensities and pixel-wise intensity differences, […] averaged over 5×5 […];
• pixel-wise […] over the full 15×15 window […] the information […] significantly […];
• mean intensities over 3×3, 4×4 […] windows within the neighborhood, (i) non-overlapping windows […] the central pixel […].
This feature set captures intensity patterns over a wide range of spatial frequencies […]. Computation over an entire image is made efficient by the use of integral images. Note that the dimensionality of this feature space, […], is far higher than that of the previous methods for our test images with three channels.
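As noted in the related work, box-type responses can be computed rapidly from integral images (Bay et al., 2008; Dollár et al., 2009); a minimal, generic sketch of this trick follows, not necessarily the implementation used here.

import numpy as np

def integral_image(channel):
    """Cumulative sum table with a zero first row/column for easy indexing."""
    ii = np.zeros((channel.shape[0] + 1, channel.shape[1] + 1))
    ii[1:, 1:] = channel.cumsum(axis=0).cumsum(axis=1)
    return ii

def box_sum(ii, r0, c0, r1, c1):
    """Sum of channel[r0:r1, c0:c1] in O(1) from the integral image."""
    return ii[r1, c1] - ii[r0, c1] - ii[r1, c0] + ii[r0, c0]

# the mean intensity over any window then costs four lookups and one
# division, independent of the window size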
2.3 Boosting
As a classifier we use AdaBoost (Freund and Schapire, 1997), which performs feature selection and weighting during training, so that the very large candidate pool can be reduced to the small subset of features that best improve the classification accuracy […].
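To illustrate the selection mechanism, the sketch below trains AdaBoost with depth-one decision trees (stumps), so that every boosting round effectively picks a single feature dimension; scikit-learn (≥ 1.2) is assumed and the number of rounds is a placeholder, not the setting used in the experiments.

import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

def train_boosted_selector(X, y, n_rounds=200):
    """X: (n_samples, n_features) candidate features, y: class labels."""
    stump = DecisionTreeClassifier(max_depth=1)          # each round thresholds one feature
    clf = AdaBoostClassifier(estimator=stump, n_estimators=n_rounds)
    clf.fit(X, y)
    # feature dimensions actually picked by the boosted stumps
    used = np.unique([t.tree_.feature[0] for t in clf.estimators_
                      if t.tree_.feature[0] >= 0])
    return clf, used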