2. METHOD
2.1 Basic principle of Random Forests
The Random Forests classifier developed by Breiman (2001) is a combination of decision trees {DT(x, Θ_k), k = 1, ...}, where x is an input vector and Θ_k denotes a random vector that is sampled independently of, but with the same distribution as, the past random vectors Θ_1, ..., Θ_{k-1}. T bootstrap samples are first drawn from the training data, and then an unpruned classification and regression tree (CART) is grown from each bootstrap sample β, where only one of M randomly selected features is chosen for the split at each node of the CART. The chosen feature is the one that minimizes the Gini impurity, which can be written as (Breiman et al., 1984):

$$\mathrm{Gini}(\beta) = \sum_{i \neq j} \big( f(C_i, \beta)/|\beta| \big)\big( f(C_j, \beta)/|\beta| \big) \qquad (1)$$

where f(C_i, β)/|β| is the probability that a randomly selected pixel belongs to class C_i. Finally, the output of the classifier is determined by a majority vote of all individually trained trees.
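As an illustration of Equation (1), the following minimal sketch computes the Gini impurity of a node from its class labels; the function name and the toy label array are ours, not the paper's.

```python
import numpy as np

def gini_impurity(labels):
    """Gini(beta) = sum_{i != j} p_i * p_j, with p_i = f(C_i, beta) / |beta|."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()        # class probabilities f(C_i, beta)/|beta|
    return 1.0 - np.sum(p ** 2)      # sum_{i != j} p_i p_j = 1 - sum_i p_i^2

# Example: a candidate split is preferred if it minimizes the
# impurity-weighted average over the resulting child nodes.
node = np.array([0, 0, 1, 1, 1, 2])  # pixel class labels reaching a node
print(gini_impurity(node))           # ~0.611
```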
There are two parameters: the number of variables (M) in the random subset at each node and the number of trees (T) in the forest. The selection of the parameter M influences the final error rate: if M is increased, both the correlation between the trees and the strength (classification accuracy) of the individual trees in the forest increase. The error rate is proportional to the correlation but inversely proportional to the strength (Joelsson et al., 2008). Usually, M is set to the square root of the number of features (Gislason et al., 2006). Because Random Forests is fast and does not overfit, the number of trees T can be made as large as desired; in practice, however, due to the memory limits of the machine, T is usually several hundred (Horning, 2010) and is set to 100 here. Random Forests also provides two additional measures: variable importance and internal structure. Variable importance measures the importance of the predictor variables (features). To estimate a feature's importance, the out-of-bag (OOB) samples are first run through the trees and the votes for the correct classification are counted. Then, the prediction accuracy is obtained again after randomly permuting all the values of this feature while all the other features stay the same. The importance score is the decrease in correct class votes after the variable permutation, averaged over all the trees. The intuition is that randomly permuting a variable simulates the absence of that variable from the forest (Guo et al., 2011). Thus, the higher the average accuracy decrease, the more important the feature.
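A hedged sketch of this setup using scikit-learn (not the authors' implementation) is shown below: T = 100 trees, M = sqrt(number of features) candidate variables per split, and a permutation-based importance score. The synthetic data generated by make_classification stands in for the real per-pixel feature matrix.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_classification

# Stand-in for the real (n_pixels x n_features) data and class labels.
X, y = make_classification(n_samples=2000, n_features=12, n_informative=6,
                           n_classes=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

rf = RandomForestClassifier(n_estimators=100,     # T = 100 trees
                            max_features="sqrt",  # M = sqrt(#features) per split
                            oob_score=True,       # out-of-bag accuracy estimate
                            random_state=0).fit(X_train, y_train)
print("OOB accuracy:", rf.oob_score_)

# Permutation importance: accuracy drop after shuffling one feature at a time.
# (Note: scikit-learn permutes on a held-out set rather than the OOB samples.)
imp = permutation_importance(rf, X_test, y_test, n_repeats=10, random_state=0)
print("importance scores:", imp.importances_mean)
```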
Figure 1. Study area of Mannheim, Germany
2.2 Study Area and Datasets
Laser scanning data covering Mannheim, Germany, were acquired in 2004 by a Falcon II sensor, a fibre-based system concept by TopoSys® GmbH. The airplane flew at an average height of 1,200 m above mean sea level, with a camera on board acquiring 0.5 m resolution aerial photographs with RGB bands. The average point density and point spacing within the test site are about 4 points/m² and 0.5 m, respectively. The lidar dataset records both range (first and last returns) and intensity information of the laser pulse. In this research, the lidar data are considered in 2D geometry together with the optical image data. The experimental area is a typical urban region that contains variously sized buildings with different orientations, as well as trees and grass interspersed among the buildings. The study area and its vicinity are relatively flat, with elevations ranging from approximately 89.83 m to 159.71 m.
2.3 Training sample and reference data
The training samples were chosen by photo-interpretation in the commercial software ENVI®. Table 1 lists the number of training samples; as a proportion of the full image to be analysed, they represent less than 1% to 5%. For accuracy assessment, an adequate number of test samples is required per class of interest. Congalton and Green (2009) pointed out that sufficient test data are necessary to build a statistically valid error matrix that represents the classification accuracy. Thus, the sample size N was determined by Equation (2), based on binomial probability theory:
$$N = \frac{Z^2\, p\,(100 - p)}{E^2} \qquad (2)$$
where p is the expected percent accuracy, E is the allowable error, and Z = 1.96 is the standard normal deviate for the 95% two-sided confidence level. An expected accuracy of 95% was selected because the land-use classification system specifies that each class category should be mapped to at least 85% accuracy, and the allowable error was chosen as 5%. For this study area, the sample size (N) of 996 meets Congalton and Green's (2009) rule of thumb of a minimum of 50 samples per class.
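As a worked instance of Equation (2), the following sketch evaluates N for illustrative parameter values; the inputs below are examples of ours and are not the exact combination that yields the study's figure of 996.

```python
def sample_size(p, E, Z=1.96):
    """Binomial sample size: N = Z^2 * p * (100 - p) / E^2, p and E in percent."""
    return Z ** 2 * p * (100 - p) / E ** 2

print(sample_size(p=85, E=5))  # ~196 for an 85% target accuracy
print(sample_size(p=50, E=5))  # ~384, the conservative worst case (p = 50%)
```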
Categories      | Training samples | Test data
                | ROI | Pixels     | ROI | Pixels
Buildings       | 103 | 927        |  50 | 569
High vegetation |  36 | 524        |  26 | 421
Ground          |  60 | 934        |  44 | 685
Grass           |  12 | 172        |  10 |  98

Table 1. The training samples and test data.
2.4 Features
There are several groups of features, including lidar height-based features, lidar intensity-based features, and RGB aerial image-based features. They are listed as follows; relevant features are shown in Figures 2(a), (b) and (c).
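As a generic sketch (not the authors' code) of how such feature groups feed the classifier, the following assembles co-registered rasters of lidar height, lidar intensity, and the RGB aerial image into a per-pixel feature matrix; the array names and shapes are assumptions for illustration.

```python
import numpy as np

height = np.zeros((1024, 1024))     # e.g. a height raster derived from the lidar returns
intensity = np.zeros((1024, 1024))  # lidar intensity of the laser pulse
rgb = np.zeros((1024, 1024, 3))     # 0.5 m resolution RGB aerial photograph

# Stack into an (n_pixels x n_features) matrix for the Random Forests classifier.
features = np.dstack([height, intensity, rgb]).reshape(-1, 5)
print(features.shape)               # (1048576, 5)
```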