In: Wagner W., Székely, B. (eds.): ISPRS TC VII Symposium - 100 Years ISPRS, Vienna, Austria, July 5-7, 2010, IAPRS, Vol. XXXVIII, Part 7B
2.5 Selection of features for the estimation of forest
attributes
The k-nearest neighbor (k-nn) method was used to estimate
forest variables (e.g. Kilkki & Päivinen, 1987; Tokola et al.,
1996). The value of k was set to 5, Euclidean distances were
used to measure closeness in the feature space, and the nearest
neighbors were weighted with the squared inverse distances.
The accuracy of the estimates produced by the k-nn estimator
was tested via leave-one-out cross-validation on the field plots
by comparing the estimates of each field plot to the measured
value (ground truth) of the plot. The accuracy of the estimates
was measured by the relative root mean square error, RMSE%
(Equation 1).
\[ \mathrm{RMSE}\% = 100 \cdot \frac{\mathrm{RMSE}}{\bar{y}} \qquad (1) \]
number of features to a reasonable minimum. Only features
belonging to the best genome in each step were included in the
next step. Feature selection was run separately for both areas
and each feature extraction unit (field plot, small segments,
large segments).
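The objective minimized during feature selection, a weighted combination of relative RMSEs with total volume at 50% and the five remaining variables at 10% each, can be sketched as a small function. This is an illustrative sketch only: the dictionary keys and the function name `weighted_objective` are hypothetical, and the input values in the usage note are simply the GRID column of Table 2, used as example numbers rather than as a result reported by the study.

```python
# Weights as stated in the text: 50% for total volume, 10% for each other variable.
# The key names are hypothetical labels chosen for this sketch.
WEIGHTS = {
    "total_volume": 0.5,
    "pine_volume": 0.1,
    "spruce_volume": 0.1,
    "deciduous_volume": 0.1,
    "diameter": 0.1,
    "height": 0.1,
}


def weighted_objective(relative_rmse_by_variable, weights=WEIGHTS):
    """Return the weighted sum of relative RMSEs (in %) over all variables."""
    return sum(weights[name] * relative_rmse_by_variable[name] for name in weights)


# Example input: relative RMSEs (%) taken from the GRID column of Table 2.
example = {
    "total_volume": 27.8,
    "pine_volume": 74.2,
    "spruce_volume": 83.9,
    "deciduous_volume": 85.3,
    "diameter": 25.5,
    "height": 18.5,
}
objective_value = weighted_objective(example)
```

With these example inputs the weighted sum is 0.5 x 27.8 + 0.1 x (74.2 + 83.9 + 85.3 + 25.5 + 18.5), which is about 42.6. A feature subset with a lower value would be preferred by the genetic algorithm.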
There were 14 (study area 1) or 19 (study area 2) features
selected into the final Grid sets, 17 or 19 into the Seg350 sets
and 12 or 17 into the Seg1000 sets. Of the selected features,
the majority (63-79%) were based on the ALS data.
3. RESULTS AND DISCUSSION
In both study areas, the features extracted from square grid
elements worked better in estimating the forest attributes than
the features extracted from image segments. Furthermore,
features from image segments derived using a minimum size of
350 m² performed better in the estimation than features
extracted from larger segments (minimum size 0.1 ha).
where:
\[ \mathrm{RMSE} = \sqrt{\frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{n}} \]
y_i = measured value of variable y on plot i
\hat{y}_i = estimated value of variable y on plot i
\bar{y} = mean of the observed values
n = number of plots
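The k-nn estimation and its leave-one-out evaluation can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in the study: the function names `knn_loo_estimates` and `relative_rmse` are hypothetical, the small constant `eps` that guards against division by zero for identical feature vectors is an addition of this sketch, and RMSE is computed with n in the denominator.

```python
import math


def knn_loo_estimates(features, values, k=5, eps=1e-12):
    """Leave-one-out k-nn estimates with inverse squared distance weights.

    features: list of feature vectors, one per field plot
    values:   list of measured values, one per field plot
    eps:      guard against zero distances (an assumption of this sketch)
    """
    n = len(values)
    estimates = []
    for i in range(n):
        # Euclidean distances from plot i to every other plot; plot i is left out.
        neighbors = sorted(
            (math.dist(features[i], features[j]), values[j])
            for j in range(n)
            if j != i
        )[:k]
        # Weight each of the k nearest neighbors by 1 / distance^2.
        weights = [1.0 / (d ** 2 + eps) for d, _ in neighbors]
        weighted_sum = sum(w * v for w, (_, v) in zip(weights, neighbors))
        estimates.append(weighted_sum / sum(weights))
    return estimates


def relative_rmse(measured, estimated):
    """Relative RMSE in percent: 100 * RMSE / mean of the measured values."""
    n = len(measured)
    rmse = math.sqrt(sum((y - e) ** 2 for y, e in zip(measured, estimated)) / n)
    return 100.0 * rmse / (sum(measured) / n)
```

A call such as `relative_rmse(values, knn_loo_estimates(features, values, k=5))` would then give the relative RMSE of one forest variable for one feature set.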
Automatic feature selection was carried out using a simple
genetic algorithm presented by Goldberg (1989), and
implemented in the GAlib C++ library (Wall 1996). The GA
process starts by generating an initial population of strings
(chromosomes or genomes), which consist of separate features
(genes). The strings evolve during a user-defined number of
iterations (generations). The evolution includes the following
operations: selecting strings for mating using a user-defined
objective criterion (the better the string, the more copies it
gets in the mating pool), letting the strings in the mating pool
swap parts (crossing over), causing random noise (mutations) in the
offspring (children), and passing the resulting strings into the
next generation.
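The operations listed above can be sketched in plain Python for binary feature masks. This is a hedged illustration and not the GAlib implementation used in the study: the function name `run_ga`, roulette-wheel selection, and one-point crossover are assumptions of this sketch. The default parameter values (population of 300, 30 generations, crossover probability 0.8, mutation probability 0.01) and the retention of the best genome follow the settings reported later in this section.

```python
import random


def run_ga(objective, n_features, pop_size=300, generations=30,
           p_crossover=0.8, p_mutation=0.01, seed=0):
    """Minimize objective(genome) over 0/1 genomes of length n_features."""
    rng = random.Random(seed)
    # Initial population of random feature combinations.
    population = [[rng.randint(0, 1) for _ in range(n_features)]
                  for _ in range(pop_size)]
    best = min(population, key=objective)
    for _ in range(generations):
        scores = [objective(g) for g in population]
        worst = max(scores)
        # Roulette-wheel weights (assumption): lower objective, more copies.
        weights = [worst - s + 1e-9 for s in scores]
        # The best genome so far is always passed on unchanged.
        children = [best[:]]
        while len(children) < pop_size:
            a, b = rng.choices(population, weights=weights, k=2)
            a, b = a[:], b[:]
            # One-point crossover (assumption): swap the tails of both parents.
            if n_features > 1 and rng.random() < p_crossover:
                cut = rng.randrange(1, n_features)
                a[cut:], b[cut:] = b[cut:], a[cut:]
            for child in (a, b):
                # Mutation: flip 0 to 1 or vice versa with a small probability.
                for i in range(n_features):
                    if rng.random() < p_mutation:
                        child[i] = 1 - child[i]
                children.append(child)
        population = children[:pop_size]
        best = min(population, key=objective)
    return best
```

In a feature selection setting, `objective` would evaluate the k-nn estimation error obtained with the features whose genome position holds a 1. Because the best genome is always copied into the next generation, the best objective value never gets worse from one generation to the next.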
In the present study, the starting population consisted of 300
random feature combinations (genomes). The length of the
genomes corresponded to the total number of features in each
step, and the genomes contained a 0 or 1 at position i, denoting
the absence or presence of image feature i. The number of
generations was 30. The objective variable to be minimized
during the process was a weighted combination of relative
RMSEs of k-nn estimates for mean total volume, mean volumes
of Scots pine, Norway spruce and deciduous species, mean
diameter and mean height, with total volume having a weight of
50%, and the remaining variables 10% each. Genomes that were
selected for mating swapped parts with each other with a
probability of 80%, producing children. Occasional mutations
(flipping 0 to 1 or vice versa) were added to the children
(probability 1%). The strings were then passed to the next
generation. The overall best genome of the current iteration was
always passed to the next generation, as well. Four successive
steps (all including 30 generations) were taken to reduce the
Estimation accuracy was generally better in study area 2 than
in study area 1. The main reason for this is probably
the higher number of sample plots in study area 2, which gives
a higher number of potential nearest neighbors for each sample
plot in the k-nn estimation. The estimation accuracy results for
the forest attributes used in this study are presented in Tables 2
and 3.
                           GRID    SEG350   SEG1000
Height                     18.5     22.4      25.5
Diameter                   25.5     27.7      32.0
Total volume               27.8     34.0      36.6
Volume of pine             74.2     77.1      99.9
Volume of spruce           83.9     87.5     103.3
Volume of deciduous sp.    85.3     88.7      93.9
Table 2. Estimation results for the feature sets (relative RMSE,
%) of study area 1
                           GRID    SEG350   SEG1000
Height                     12.5     13.9      16.5
Diameter                   19.8     23.1      25.2
Total volume               29.6     32.9      36.6
Volume of pine            125.2    138.5     137.0
Volume of spruce           59.0     61.5      63.8
Volume of deciduous sp.    99.2    113.4     111.3
Table 3. Estimation results for the feature sets (relative RMSE,
%) of study area 2
There were large differences between the study areas in the
estimation accuracy of the volumes per tree species groups.
Apparently, the differences were caused by the different tree
species structure of the two study areas. Typically, the dominant
tree species had the highest estimation accuracy, and the least
dominant the lowest. On the other hand, the volume of deciduous
trees had better estimation accuracy compared to the minority