Back propagation has the advantage that the derivatives on the right-hand sides of (14) and (15) can be calculated directly from the inputs and outputs of the neurons, because the sigmoid function has a simple derivative:

f'(u) = f(u) (1 - f(u)) .   (16)
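As a minimal illustration of (16), the following Python sketch (the function names and sample values are illustrative assumptions) evaluates the sigmoid derivative directly from the neuron output y = f(u), which is what makes the derivatives in (14) and (15) cheap to compute during back propagation.

import numpy as np

def sigmoid(u):
    # f(u) = 1 / (1 + exp(-u))
    return 1.0 / (1.0 + np.exp(-u))

def sigmoid_derivative(y):
    # Equation (16): f'(u) = f(u) * (1 - f(u)), written in terms of
    # the neuron output y = f(u), so f need not be evaluated again.
    return y * (1.0 - y)

u = np.linspace(-3.0, 3.0, 7)
y = sigmoid(u)
print(sigmoid_derivative(y))  # derivative of f at each u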
3 GENERALIZATION OF THE LAYERED NEURAL NETWORK
When using an LNN classifier, however, users have often been faced with a generalization problem. In this chapter we discuss LNN classifier generalization, which has been a controversial and often vague term in the neural network literature (Wan (1990)), first from the viewpoint of architecture design and learning paradigm. We then give a clear description of the generalization of an LNN classifier based on information statistics by introducing an information criterion.
3.1 Generalization and network architecture design

The architecture of an LNN is basically defined by the number of layers, the number of nodes in each layer and the form of the activation function, chosen so that the selected model identifies the relationship between input and output and predicts correctly for new input data.

The problem of choosing the optimal number of nodes and layers is analogous to choosing an optimal subset of regressor variables in a regression model (Fogel (1991)). We know that if a model y = φ(x) with too many free parameters is used to fit a given set of training data (x, y), it may “over-fit” the training data, while with too few parameters it may not be powerful enough to describe the relationship between input x and output y. In other words, if the number of parameters is too large, the calibrated function may pass through all of the specified training data (x_i, y_i) without error. However, such a function can be highly oscillatory, leading to large errors at unknown data that are not included in the training data set. This phenomenon is explained in detail in section 5.
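A minimal sketch of this over-fitting effect, using the polynomial regression analogy above (the data, noise level and polynomial degrees are illustrative assumptions, not taken from the paper):

import numpy as np

rng = np.random.default_rng(0)

# Noisy samples of an assumed underlying relationship y = sin(x).
x_train = np.linspace(0.0, np.pi, 10)
y_train = np.sin(x_train) + rng.normal(scale=0.1, size=x_train.size)
x_test = np.linspace(0.0, np.pi, 100)
y_test = np.sin(x_test)

for degree in (2, 9):
    # Fit y = phi(x) with (degree + 1) free parameters.
    coeffs = np.polyfit(x_train, y_train, degree)
    train_err = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_err = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(degree, train_err, test_err)

# The degree-9 fit passes (almost) through every training point but
# oscillates between them, giving a larger error on unseen data.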
3.1 Generalization and network architecture design
The architecture of an LNN is basically defined as the number of
layers, the number of nodes in each layer and the form of the
activation function, so that the selected model will identify the
relationship between input and output, and will predict correctly
with new input data.
The problem of choosing the optimal number of nodes and layers
is analogous to choosing an optimal subset of regressor variables
in a regression model (Fogel (1991)). We know that if a model
y = qp(jt) with too many free parameters is used to fit a given set
of training data (jc,y) then it may “over-fit” the training data,
while with too few parameters it may not be powerful enough to
describe the relationship between input x and output y. In other
words, if the number of parameters is too large, the calibrated
function may pass through all the specified training data
(x i ,y i ) without error. However, the function could be highly
oscillatory, leading to large errors at unknown data that are not
included in the training data set. This phenomenon is to be ex
plained in detail in section 5.
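The over-training behavior described above can be made concrete by tracking the error on the training data and on data held out from training at each epoch; beyond some epoch the former keeps shrinking while the latter grows. A minimal Python sketch of this bookkeeping (the error histories are placeholders, not values from an actual training run):

def best_epoch(train_errors, holdout_errors):
    # Return the epoch with the lowest error on the held-out data.
    # Beyond this epoch the training error may still decrease slightly,
    # while the error on new data grows (over-training).
    return min(range(len(holdout_errors)), key=lambda i: holdout_errors[i])

# Illustrative error histories (placeholders, not measured values).
train = [0.90, 0.40, 0.20, 0.12, 0.10, 0.09, 0.085]
holdout = [0.95, 0.50, 0.30, 0.25, 0.27, 0.33, 0.40]
print(best_epoch(train, holdout))  # -> 3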
3.3 Akaike’s Information Criterion

In order to discuss the generalization of an LNN classifier based on information statistics, we introduce an information criterion. Let us assume that p(x) is a probability density function and p(x, w) is a model distribution. The discrepancy between the model and the real distribution can be measured in terms of the Kullback-Leibler information distance:

D[p(x), p(x, w)] = \int p(x) \ln \frac{p(x)}{p(x, w)} \, dx .   (17)
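For a concrete sense of (17), the distance can be approximated numerically by summing p(x) ln(p(x)/p(x, w)) over a fine grid. The densities in the following sketch are our own illustrative choice (two unit-variance Gaussians), not an example from the paper:

import numpy as np

def kl_distance(p, q, dx):
    # Discrete approximation of equation (17):
    # D[p, q] = integral of p(x) * ln(p(x) / q(x)) dx
    return float(np.sum(p * np.log(p / q)) * dx)

x = np.linspace(-6.0, 6.0, 2001)
dx = x[1] - x[0]

def gaussian(x, mean, sd):
    return np.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * np.sqrt(2.0 * np.pi))

p = gaussian(x, 0.0, 1.0)   # "real" density p(x)
q = gaussian(x, 0.5, 1.0)   # model p(x, w) with a shifted mean
print(kl_distance(p, q))    # close to the exact value 0.5**2 / 2 = 0.125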
If we assume that the real distribution p(x) belongs to the model set and the number of observations is sufficiently large, the information distance in (17) can be expressed as a function of the distance between the true parameters and their Maximum Likelihood Estimate (MLE):

D[p(x, w), p(x, w_0)] = D(w, w_0) \approx \frac{1}{2} (w - w_0)^t M (w - w_0) ,   (18)

where M is Fisher’s information matrix, t denotes the transpose of a matrix, w is the true parameter vector and w_0 is the MLE parameter vector. Under the assumption that the competing models form a hierarchical sequence in which the lower-dimensional models are included in the higher-dimensional ones as sub-models, Akaike extended Maximum Likelihood Estimation in such a way that An Information Criterion (Akaike’s Information Criterion: AIC) can be used to address parameter estimation and optimal model selection simultaneously (Akaike (1974)). Akaike defined AIC as an estimator of the expected information distance by using approximation (18):

E[2K \cdot D\{p(x, w), p(x, w_0)\}] \approx AIC = 2 \left( -\sum_{i=1}^{K} \ln p(x_i, w_0) + l \right) ,   (19)

where E[\cdot] denotes the expectation operator, K is the number of data and l is the number of independent parameters.
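A minimal sketch of how (19) is evaluated in practice, for the illustrative case of two nested Gaussian models fitted by maximum likelihood (the synthetic data and helper names are assumptions of ours): the model with the smaller AIC is preferred.

import numpy as np

def aic(log_likelihood, n_params):
    # Equation (19): AIC = 2 * ( -sum_i ln p(x_i, w_0) + l ),
    # where the sum is the log-likelihood at the MLE and l is the
    # number of independent parameters.
    return 2.0 * (-log_likelihood + n_params)

# Synthetic data for illustration: samples from N(0.3, 1).
rng = np.random.default_rng(1)
x = rng.normal(loc=0.3, scale=1.0, size=200)

def gaussian_loglik(x, mean, var):
    return float(np.sum(-0.5 * (np.log(2.0 * np.pi * var) + (x - mean) ** 2 / var)))

# Model 1: N(0, 1), no free parameters.
# Model 2: N(mu, sigma^2), two parameters estimated by MLE.
aic1 = aic(gaussian_loglik(x, 0.0, 1.0), 0)
aic2 = aic(gaussian_loglik(x, x.mean(), x.var()), 2)
print(aic1, aic2)  # the smaller AIC indicates the better generalizing model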
3.4 Application of AIC to LNN
We should note that neural networks and traditional statistical classifiers are not, in general, directly related. However, when an LNN has been trained with a sufficient number of training data, its output can be regarded as an approximate estimate of a Bayesian posterior probability (Wan (1990) and Ruck et al. (1990)). This matters greatly for the classification of remotely sensed images, where a sufficient number of training data is available; it provides a theoretical interpretation of the output of an LNN classifier as an estimate of a posterior probability. Thus, AIC is applicable to LNNs as a criterion for generalization and is determined by