
tested in remote sensing studies. Shao and Foerstner (1994) evaluate the potential of Gabor filters for aerial imagery, whereas Galun et al. (2003) perform landscape image segmentation with edge-based texture filters. Martin et al. (2004) use a subset of the root filter set (RFS) for segmentation, and Lazaridis and Petrou (2006) derive texture features from the Walsh transform. The fact that smooth filter banks can be approximated by differences of box filters has been exploited, for example, by Bay et al. (2008), who use box filters to approximate the responses of Hessian filters; they also utilize integral images for rapid computation of the filter responses. Drauschke and Mayer (2010) compare and assess the performance of several of the previously mentioned texture filters on images of building facades. They outline the need for a more universal approach encompassing the desired properties of all tested filter banks, because each single filter bank gives optimal results only for a specific dataset.
Other works propose to extract vast amounts of features and apply either genetic algorithms (van Coillie et al., 2007; Rezaei et al., 2012) or partial least squares (Schwartz et al., 2009; Hussain and Triggs, 2010) for feature dimensionality reduction prior to classifier training. In computer vision, the work most closely related to ours is probably the integral channel features method (Dollár et al., 2009). For object detection, they randomly sample a large number of box filter responses over the detection window and use AdaBoost to select the most discriminative features, also using integral images. They show improvements over methods that stick to hand-crafted feature layouts.
Deep belief networks (DBN) follow a similar line of thought in that they try to learn (non-linear) feature extractors as part of unsupervised pre-training (Hinton and Salakhutdinov, 2006; Ranzato et al., 2007). First steps have also been made by Mnih and Hinton (2010, 2012) to adapt them to feature learning and patch-based classification of high-resolution remote sensing data. However, in a recent evaluation by Tokarczyk et al. (2012), DBNs as feature extractors did not improve the classification results over standard linear filter banks for VHR remote sensing data.
An alternative strategy is to model local object patterns via prior distributions with generative Markov Random Fields (MRF) or discriminative Conditional Random Fields (CRF). Usually, such methods are used to encode smoothness priors over local neighborhoods (Schindler, 2012), but some works, instead of applying smoothness constraints, directly encode texture through the prior term (Gimel'farb, 1996; Zhu et al., 1997). Helmholz et al. (2012) apply the approach of Gimel'farb (1996) to aerial and satellite images as one of several feature extractors inside a semi-automatic quality control tool for topographic datasets.
2 METHODS 
In line with the recent machine learning literature, we pose feature extraction and classification as a joint problem. Instead of pre-selecting certain features that seem appropriate for describing a particular scene and then classifying with them in a subsequent step, feature selection is left entirely to the classifier, such that those features are selected which best solve the classification problem for a given dataset. In the following, we first describe hand-crafted features commonly used in remote sensing, which serve as our baseline, before turning to details of the proposed quasi-exhaustive features. Thereafter, the boosting algorithm used for training (i.e., feature selection and weighting) and testing is explained.
2.1 Baseline features 
We start by describing the baselines for our comparison, namely 
hand-crafted features commonly used for classifying aerial and 
satellite images. With hand-crafted we mean that a human expert makes an "educated guess" about what kind of features seem appropriate for classifying the data at hand. Note that this procedure runs the risk of losing important information if an informative feature is not anticipated. We propose to circumvent manual pre-selection by computing a large number of intensity values and intensity differences over a range of scales, both per channel and between channels. The relevant subset is then selected automatically during classifier training.
2.1.1 15×15 pixel neighborhood  State-of-the-art airborne data is captured with a GSD < 0.2 m, while spaceborne imagery can be acquired with a resolution of ≈ 0.5 m. In such very high spatial resolution imagery, even small objects like single trees consist of several pixels, and on larger objects like building roofs, sub-structures such as chimneys, dormers and tile patterns emerge. To cope with the resulting high intra-class variability of the radiometric signature, and to exploit class-specific texture patterns, one should thus also consider the intensities in a pixel's neighborhood. Hence, for each pixel the intensities of its neighbors within a square window are also added to the feature vector. Typical window sizes range from 3×3 to 21×21, depending on image resolution and object size. We have tested various window sizes and found 15×15 patches to be sufficient, leading to a feature vector with 225 dimensions per channel. Thus we obtain a 675-dimensional feature space for our test images with three channels. Note that since the classifier is free to base its decision on the intensities of the central pixel alone, the method includes the case of not using any neighborhood. Note also that, due to the strongly overlapping content of adjacent windows, using the neighborhood can be expected to smooth the classification results.
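To make the construction concrete, a minimal NumPy sketch of assembling such patch features is given below; the reflection padding at the image borders and the function name are assumptions for illustration, since the text does not specify how borders are handled.

```python
import numpy as np

def neighborhood_features(image, win=15):
    """Stack the raw intensities of a win x win window around every pixel.

    image: (H, W, C) array; returns an (H, W, C * win * win) feature array,
    i.e. 675 dimensions per pixel for a 3-channel image and win = 15.
    Reflection padding at the borders is an assumption, not from the paper.
    """
    r = win // 2
    H, W, C = image.shape
    padded = np.pad(image, ((r, r), (r, r), (0, 0)), mode="reflect")
    feats = np.empty((H, W, C * win * win), dtype=image.dtype)
    k = 0
    for dy in range(win):
        for dx in range(win):
            # shifted copy of the image: neighbor at offset (dy - r, dx - r)
            feats[:, :, k:k + C] = padded[dy:dy + H, dx:dx + W, :]
            k += C
    return feats
```

Applied to a three-channel image, this yields the 675-dimensional feature vectors described above.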
2.1.2 Augmented 15×15 pixel neighborhood  This feature set represents a standard "educated guess" in the optical remote sensing domain. In addition to the local neighborhood of each single pixel, we add the NDVI channel as well as linear combinations of the raw data found by PCA. Given N input channels, all N principal components are added to the feature vectors, because a-priori dimensionality reduction is not the goal here and the classification stage performs feature selection anyway. PCA and NDVI are treated like additional channels; thus for each pixel we again put all values inside the 15×15 neighborhood into the feature vector, thereby adding another 225 dimensions per channel. For input images with three channels plus three PCA channels and one NDVI channel, we obtain a 1575-dimensional feature space.
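A minimal sketch of how the augmented channel stack could be built is shown below; the band indices for red and near-infrared and the eigen-decomposition-based PCA are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def augment_channels(image, red_idx=0, nir_idx=2):
    """Append NDVI and all PCA channels to an (H, W, C) image.

    red_idx / nir_idx are placeholders; which bands hold red and
    near-infrared depends on the sensor.
    """
    H, W, C = image.shape
    red = image[:, :, red_idx].astype(np.float64)
    nir = image[:, :, nir_idx].astype(np.float64)
    ndvi = (nir - red) / (nir + red + 1e-8)          # NDVI = (NIR - R) / (NIR + R)

    # PCA over all pixels, keeping all C components
    # (no dimensionality reduction, as described in the text).
    X = image.reshape(-1, C).astype(np.float64)
    Xc = X - X.mean(axis=0)
    cov = Xc.T @ Xc / (Xc.shape[0] - 1)
    _, vecs = np.linalg.eigh(cov)                    # eigenvectors, ascending eigenvalues
    pca = (Xc @ vecs[:, ::-1]).reshape(H, W, C)      # largest-variance component first

    return np.concatenate([image.astype(np.float64), pca, ndvi[:, :, None]], axis=2)
```

The resulting seven-channel stack would then be passed through the same 15×15 neighborhood extraction as above, giving 7 × 225 = 1575 dimensions.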
2.1.3 Texture filters  Here we use a filter bank widespread in semantic segmentation in the computer vision domain. Images are first converted to an opponent Gaussian color model (OGCM) in order to account for intensity changes due to lighting fluctuations or viewpoint (we have also tested several other color spaces, but OGCM yielded the most stable responses). The normalized color channels are convolved with a set of filters adopted from (Winn et al., 2005), a subset of the RFS filters. It contains three Gaussians, four first-order Gaussian derivatives, and four Laplacian-of-Gaussian (LoG) filters. The three Gaussian kernels are applied at scales {σ, 2σ, 4σ}. The first-order Gaussian derivatives are computed separately in x- and y-direction at scales {2σ, 4σ}, thus yielding four responses, and the four LoGs have scales {σ, 2σ, 4σ, 8σ}. In total, 11 features are computed per channel, leading to a 33-dimensional feature space for our test images with three channels. We tested multiple choices of σ and found σ = 0.7 to deliver the best results. Note that by convolving the channels with such filters, each pixel's neighborhood is implicitly considered.
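A minimal SciPy sketch of this filter bank applied to a single (already color-converted) channel is given below; the conversion to the OGCM itself is omitted, and the use of scipy.ndimage is an implementation choice for illustration, not the authors' code.

```python
import numpy as np
from scipy import ndimage

def texture_filter_bank(channel, sigma=0.7):
    """Apply the 11-filter subset adopted from (Winn et al., 2005) to one channel:
    3 Gaussians at {s, 2s, 4s}, first-order Gaussian derivatives in x and y
    at {2s, 4s}, and 4 Laplacians-of-Gaussian at {s, 2s, 4s, 8s}.
    Returns an (H, W, 11) response stack.
    """
    responses = []
    for s in (sigma, 2 * sigma, 4 * sigma):                 # 3 Gaussians
        responses.append(ndimage.gaussian_filter(channel, s))
    for s in (2 * sigma, 4 * sigma):                        # 4 first-order derivatives
        responses.append(ndimage.gaussian_filter(channel, s, order=(0, 1)))  # d/dx
        responses.append(ndimage.gaussian_filter(channel, s, order=(1, 0)))  # d/dy
    for s in (sigma, 2 * sigma, 4 * sigma, 8 * sigma):      # 4 LoG filters
        responses.append(ndimage.gaussian_laplace(channel, s))
    return np.stack(responses, axis=-1)
```

With three OGCM channels, concatenating the per-channel responses gives the 33-dimensional feature space described above.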
Figure 1: 4×4 patches (remainder of the caption is not recoverable from the source).
2.2 Quasi-exhaustive features
The main ic 
avoid data-s 
redundant, c 
a compact bi 
ing (Hinton 
generative ti 
Rather we p 
puting a lar; 
small subse! 
specific sele 
teristics, ligl 
simple set of 
efficiently. 1 
tations of sm 
lot dependin 
2010). Our ¢ 
• raw pix
differer 
ties ave 
over 5» 
• pixel-w
are onl 
full 15: 
the infc 
cally in 
signific 
• mean i
3x3,4: 
within 
(i) non- 
window 
tral pix 
This feature 
frequencies 1 
proper (e.g., 
mation basec 
an image is 
efficient by 1 
that the dime 
methods is 1 
three channe 
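As a rough illustration of such a pool of simple features computed efficiently with integral images, the following sketch evaluates mean intensities over centred square windows of several sizes and all pairwise differences between them, per channel; the concrete window sizes and the pairing scheme are assumptions for illustration only.

```python
import numpy as np

def integral_image(channel):
    """Summed-area table with a zero first row/column for easy box sums."""
    ii = np.zeros((channel.shape[0] + 1, channel.shape[1] + 1))
    ii[1:, 1:] = channel.cumsum(axis=0).cumsum(axis=1)
    return ii

def box_means(channel, sizes=(1, 3, 5, 9, 15)):
    """Mean intensity of a size x size window centred at every pixel, for each
    window size, computed in O(1) per pixel from an integral image.
    The list of (odd) window sizes is an assumption for illustration.
    Returns an (H, W, len(sizes)) array."""
    H, W = channel.shape
    out = np.empty((H, W, len(sizes)))
    for k, s in enumerate(sizes):
        r = s // 2
        padded = np.pad(channel, r, mode="reflect")
        ii = integral_image(padded)
        # box sum over [y, y+s) x [x, x+s) in padded coords = centred window
        out[:, :, k] = (ii[s:s + H, s:s + W] - ii[:H, s:s + W]
                        - ii[s:s + H, :W] + ii[:H, :W]) / (s * s)
    return out

def quasi_exhaustive_pool(channel):
    """Box means at several scales plus all pairwise differences between them,
    as one possible instance of the multi-scale intensity / intensity-difference
    pool from which the classifier later selects the relevant subset."""
    means = box_means(channel)
    diffs = [means[:, :, i] - means[:, :, j]
             for i in range(means.shape[2]) for j in range(i + 1, means.shape[2])]
    return np.concatenate([means, np.stack(diffs, axis=-1)], axis=2)
```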
2.3 Boosting
As a classific 
aBoost (Freu 
selection anc 
can be reduc 
prove the acc
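A minimal sketch of discrete AdaBoost with decision stumps illustrates how boosting performs feature selection and weighting at the same time; the binary labels, the quantile-based threshold grid and the number of rounds are assumptions for illustration and need not match the exact boosting variant used here.

```python
import numpy as np

def train_adaboost_stumps(X, y, n_rounds=50, n_thresholds=10):
    """Discrete AdaBoost with decision stumps: each round picks the single
    feature (and threshold) with the lowest weighted error, so the final
    model uses only a small subset of the feature pool.

    X: (n_samples, n_features), y: labels in {-1, +1} (binary for simplicity).
    Returns a list of (feature_index, threshold, polarity, alpha).
    """
    n, d = X.shape
    w = np.full(n, 1.0 / n)
    model = []
    for _ in range(n_rounds):
        best = None  # (error, feature, threshold, polarity)
        for j in range(d):
            col = X[:, j]
            for t in np.quantile(col, np.linspace(0.05, 0.95, n_thresholds)):
                for polarity in (+1, -1):
                    pred = polarity * np.where(col > t, 1, -1)
                    err = np.sum(w[pred != y])
                    if best is None or err < best[0]:
                        best = (err, j, t, polarity)
        err, j, t, polarity = best
        err = np.clip(err, 1e-10, 1 - 1e-10)
        alpha = 0.5 * np.log((1 - err) / err)           # weight of this weak learner
        pred = polarity * np.where(X[:, j] > t, 1, -1)
        w *= np.exp(-alpha * y * pred)                  # re-weight the samples
        w /= w.sum()
        model.append((j, t, polarity, alpha))
    return model

def predict_adaboost(model, X):
    """Sign of the weighted vote over the selected stumps."""
    score = np.zeros(X.shape[0])
    for j, t, polarity, alpha in model:
        score += alpha * polarity * np.where(X[:, j] > t, 1, -1)
    return np.sign(score)
```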