2) Tree Pruning Phase
This phase removes the dependence on statistical noise or variation that may be particular to the training set.
A decision tree algorithm is a data mining induction technique that recursively partitions a data set of records using a depth-first greedy approach (Hunt et al., 1966) or a breadth-first approach (Shafer et al., 1996) until all the data items belong to a particular class. Decision tree modeling comes in two main branches: Breiman's Classification and Regression Trees (CART) and Quinlan's See5/C5.0 (and its predecessors, C4.5 and ID3). In this study we mainly introduce the C5.0 model.
2.1.2 C5.0 algorithm
C5.0 is one of the classic decision tree algorithms; it adds boosting technology to the basis provided by C4.5. In the C5.0 algorithm, the original training sample set is taken as the root node of the decision tree, and then the gain ratio of every feature attribute is calculated. The following definitions are used:
Information entropy: suppose S is a set of n data samples. The category attribute C has m different values, dividing the sample set into m different categories C_i (i = 1, 2, ..., m). Let n_i be the number of samples in S that belong to C_i; then the information entropy E(S) of S is defined as

E(S) = -\sum_{i=1}^{m} p_i \log_2 (p_i)    (1)

where p_i = n_i / |S| is the proportion of samples in S that belong to C_i (|S| is the total number of samples in S, here |S| = n).
The conditional entropy of attribute A: suppose A has v different values {a_1, a_2, ..., a_v}; attribute A then divides the set S into v subsets {S_1, S_2, ..., S_v}. Let n_{ij} be the number of samples of C_i in subset S_j. The conditional entropy E(S|A) of attribute A is

E(S|A) = -\sum_{j=1}^{v} p_j \sum_{i=1}^{m} p_{ij} \log_2 (p_{ij})    (2)

where p_j = |S_j| / |S| is the proportion of subset S_j, and p_{ij} = n_{ij} / |S_j| is the conditional probability that a sample in S_j belongs to C_i (|S_j| = \sum_{i=1}^{m} n_{ij}).
The gain of attribute A:

Gain(A) = E(S) - E(S|A)    (3)
The gain ratio of attribute A:

GainRatio(A) = \frac{Gain(A)}{SplitI(A)}    (4)

where SplitI(A) = -\sum_{j=1}^{v} p_j \log_2 (p_j).
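As an illustration, the following Python sketch implements equations (1) to (4) for a single discrete attribute. The function names and the toy data are ours, chosen for the example; they are not part of the C5.0 package.

import math
from collections import Counter

def entropy(labels):
    # Information entropy E(S) of a list of class labels, equation (1).
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain_ratio(values, labels):
    # GainRatio(A) of one attribute against class labels, equations (2)-(4).
    n = len(labels)
    subsets = {}  # partition S into subsets S_j by attribute value a_j
    for v, c in zip(values, labels):
        subsets.setdefault(v, []).append(c)
    cond = sum((len(s) / n) * entropy(s) for s in subsets.values())  # E(S|A), eq. (2)
    split_info = -sum((len(s) / n) * math.log2(len(s) / n) for s in subsets.values())
    gain = entropy(labels) - cond                                    # eq. (3)
    return gain / split_info if split_info > 0 else 0.0              # eq. (4)

# Toy example: a discretised NDVI attribute against land cover labels.
values = ['high', 'high', 'low', 'low', 'mid']
labels = ['forest', 'forest', 'water', 'water', 'grass']
print(gain_ratio(values, labels))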
C5.0 splits the training samples according to the attribute with the largest gain ratio. The first split defines the first sample subsets; the second split is then made on another field, and this procedure repeats until the sample subsets cannot be split any further. Finally, the lowest-level splits are examined, and the sample subsets that contribute nothing significant are eliminated or cut. The key to constructing a decision tree with C5.0 is the training samples, so choosing an appropriate number of samples is very important. More samples are not necessarily better: after many experiments we found that the uniformity and representativeness of the samples matter more. The other important procedure is feature extraction. The features mainly include spectral and texture features, and their selection should accord with the classification system and the land cover types. Common features include the TC values, the NDVI, texture measures, and so on.
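C5.0 itself is distributed as a commercial package, so as a freely available stand-in the sketch below uses scikit-learn's DecisionTreeClassifier with the entropy criterion; note that this is a CART-style learner, not C5.0's gain-ratio splitting, and its cost-complexity pruning (ccp_alpha) substitutes for the tree pruning phase described above. The feature values and class labels are hypothetical.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Hypothetical training samples: one row per sample, columns are the
# spectral/texture features named in the text (e.g. NDVI, a TC component,
# a texture measure).
X_train = np.array([[0.71, 0.32, 0.15],    # forest-like sample
                    [0.05, 0.60, 0.40],    # bare-land-like sample
                    [0.45, 0.40, 0.22],    # cropland-like sample
                    [-0.10, 0.20, 0.55]])  # water-like sample
y_train = np.array(['forest', 'bareland', 'cropland', 'water'])

# criterion='entropy' gives information-theoretic splitting as in 2.1.2;
# ccp_alpha > 0 prunes subtrees that contribute nothing significant.
clf = DecisionTreeClassifier(criterion='entropy', ccp_alpha=0.01)
clf.fit(X_train, y_train)

# Classify a new pixel from its feature vector.
print(clf.predict([[0.65, 0.30, 0.18]]))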
2.2 Land cover classification workflow based on
C5.0
Remote sensing classification relies on the theory of statistical pattern recognition: a set of statistical feature values is extracted from the patterns to be recognized, and the classification decision is then made according to a certain rule. The land cover classification workflow based on C5.0 has the following five procedures:
1) Establish a classification system
A suitable classification system is a prerequisite for a successful classification. Cingolani et al. (2004) identified three major problems when medium spatial resolution data are used for vegetation classifications: defining adequate hierarchical levels for mapping, defining discrete land-cover units discernible by the selected remote-sensing data, and selecting representative training sites. In this case, a hierarchical classification system is adopted to take different conditions into account, based mainly on the users' needs. This classification system includes ten 'level 1' classes, and each 'level 1' class has some 'level 2' subclasses. Details can be found in (Higher resolution Global Land Cover Mapping Project, 2011). The ten 'level 1' classes are: 1. artificial, 2. bare land, 3. cropland, 4. forest, 5. grass, 6. shrub, 7. tundra, 8. water, 9. wetland, 10. perennial snow or ice.
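For reference, the ten 'level 1' classes can be held in a simple lookup table; the sketch below lists only the level-1 codes from the text, leaving out the 'level 2' subclasses because their definition is deferred to the project documentation.

# 'Level 1' classes of the hierarchical classification system, keyed by code.
# 'Level 2' subclasses are omitted; see the Higher resolution Global Land
# Cover Mapping Project (2011) documentation for their definition.
LEVEL1_CLASSES = {
    1: 'artificial', 2: 'bareland', 3: 'cropland', 4: 'forest',
    5: 'grass',      6: 'shrub',    7: 'tundra',   8: 'water',
    9: 'wetland',    10: 'perennial snow or ice',
}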
2) Establishment of multi-band files
After the remote sensing images have been preprocessed, band math is applied to derive feature images, for example an NDVI image or a TC image. These feature images and the preprocessed images are input into the spatial database together and, with other spatial data, can be composed into one or more multi-band files, as sketched below.
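A minimal sketch of this composition step, assuming the preprocessed bands and feature images are already co-registered arrays of equal size (the arrays and file name here are placeholders):

import numpy as np

rows, cols = 512, 512
band1 = np.random.rand(rows, cols)   # placeholder preprocessed spectral band
band2 = np.random.rand(rows, cols)   # placeholder preprocessed spectral band
ndvi  = np.random.rand(rows, cols)   # placeholder feature image from band math
tc    = np.random.rand(rows, cols)   # placeholder feature image from band math

# Compose one multi-band array of shape (bands, rows, cols). In practice a
# GIS-aware writer such as GDAL would be used so georeferencing is preserved.
multiband = np.stack([band1, band2, ndvi, tc])
np.save('multiband_stack.npy', multiband)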
The precision of the result depends on which features are selected, so the selection of feature images is very important. The features present on the image normally fall into three types:
a. Spectral feature
Colour, grey level, or the ratio between bands constitutes the spectral feature of a target. For example, the Normalized Difference Vegetation Index (NDVI) is a simple graphical indicator that can be used to analyze remote sensing measurements, typically but not necessarily from a space platform, and to assess whether the target being observed contains live green vegetation.
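NDVI is computed per pixel as (NIR - Red) / (NIR + Red). The sketch below applies this band math to toy reflectance arrays; the 0.3 vegetation threshold is an illustrative choice, not a value from the text.

import numpy as np

def ndvi(nir, red):
    # Normalized Difference Vegetation Index: (NIR - Red) / (NIR + Red).
    nir = nir.astype(np.float64)
    red = red.astype(np.float64)
    return (nir - red) / (nir + red + 1e-9)  # epsilon guards against /0

# Toy 2x2 scene: pixels with high NDVI suggest live green vegetation.
nir = np.array([[0.60, 0.55], [0.20, 0.10]])
red = np.array([[0.10, 0.12], [0.18, 0.09]])
print(ndvi(nir, red) > 0.3)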