Back propagation has the advantage that the derivatives on the right-hand sides of (14) and (15) can be calculated from the inputs and the outputs of the neurons, since the sigmoid function has a simple derivative:

f'(u) = f(u) (1 - f(u)) .     (16)
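As a small illustration of why (16) is convenient in back propagation (this code is our own sketch, not taken from the paper; the function names are ours), the derivative at a neuron can be evaluated from its output alone, without re-computing the activation:

```python
import numpy as np

def sigmoid(u):
    """Logistic activation f(u) = 1 / (1 + exp(-u))."""
    return 1.0 / (1.0 + np.exp(-u))

def sigmoid_derivative_from_output(o):
    """Derivative f'(u) expressed through the output o = f(u), as in (16):
    f'(u) = f(u) * (1 - f(u)) = o * (1 - o)."""
    return o * (1.0 - o)

# Example: the derivative follows from the stored neuron outputs alone.
u = np.array([-2.0, 0.0, 1.5])
o = sigmoid(u)
print(sigmoid_derivative_from_output(o))  # equals sigmoid(u) * (1 - sigmoid(u))
```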
3 GENERALIZATION OF THE LAYERED NEURAL NETWORK
When using an LNN classifier, however, users are often faced with a generalization problem. In this chapter we discuss the generalization of LNN classifiers, which has been a controversial and often vague term in the neural network literature (Wan (1990)), and describe it from the viewpoint of architecture design and of the learning paradigm. We then give a description of the generalization of an LNN classifier based on information statistics by introducing an information criterion.
3.1 Generalization and network architecture design

The architecture of an LNN is basically defined by the number of layers, the number of nodes in each layer and the form of the activation function, chosen so that the selected model identifies the relationship between input and output and predicts correctly for new input data.

The problem of choosing the optimal number of nodes and layers is analogous to choosing an optimal subset of regressor variables in a regression model (Fogel (1991)). If a model y = φ(x) with too many free parameters is used to fit a given set of training data (x, y), it may "over-fit" the training data, while with too few parameters it may not be powerful enough to describe the relationship between input x and output y. In other words, if the number of parameters is too large, the calibrated function may pass through all of the training data (x_i, y_i) without error; the function can, however, be highly oscillatory, leading to large errors at new data that are not included in the training data set. This phenomenon is explained in detail in section 5.
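As an illustration of this trade-off (our own sketch, not from the paper; the data and the polynomial model are hypothetical stand-ins for y = φ(x)), fitting polynomials of different degrees to a few noisy samples shows how a model with as many free parameters as training points reproduces the training data almost exactly while erring badly on new data:

```python
import numpy as np

rng = np.random.default_rng(0)

# A few noisy training samples of an underlying smooth relationship.
x_train = np.linspace(0.0, 1.0, 8)
y_train = np.sin(2 * np.pi * x_train) + rng.normal(scale=0.1, size=x_train.size)

# Unseen data from the same relationship.
x_test = np.linspace(0.0, 1.0, 200)
y_test = np.sin(2 * np.pi * x_test)

for degree in (2, 7):  # too few parameters vs. as many as training points
    coeffs = np.polyfit(x_train, y_train, degree)
    train_err = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_err = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree {degree}: training MSE {train_err:.4f}, test MSE {test_err:.4f}")

# The degree-7 fit passes through all training points (near-zero training error)
# but typically oscillates and shows a much larger error on the unseen data.
```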
3.2 Generalization and learning paradigm

Once the architecture is fixed, the behavior of the trained model depends on the values of the connection weights obtained from the training paradigm and on the limited number of training data. Given the architecture of an LNN, repeated training iterations can successively improve the performance of the LNN on the training data by "memorizing" the training patterns. However, due to the limited number of training data and the presence of noise, over-training usually presents problems. Over-training is the phenomenon whereby, after a certain number of training epochs, additional epochs further reduce the learning error on the training data set (often only slightly) but produce greater errors on new data that are not included in the training data set.
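A common practical response to over-training, consistent with the description above although not prescribed by the paper, is to monitor the error on a held-out data set during training and to stop once that error no longer improves. The sketch below assumes hypothetical train_epoch, error and weight-snapshot functions:

```python
def train_with_early_stopping(model, train_data, holdout_data,
                              train_epoch, error, max_epochs=1000, patience=10):
    """Stop training once the hold-out error has not improved for `patience` epochs.

    `train_epoch(model, data)` runs one pass of the learning rule (e.g. back
    propagation) and `error(model, data)` returns the current error; both are
    placeholders for whatever training code is in use.
    """
    best_err = float("inf")
    best_weights = None
    epochs_since_best = 0

    for epoch in range(max_epochs):
        train_epoch(model, train_data)            # training error keeps shrinking
        holdout_err = error(model, holdout_data)  # error on unseen data eventually rises

        if holdout_err < best_err:
            best_err = holdout_err
            best_weights = model.get_weights()    # assumes the model can snapshot weights
            epochs_since_best = 0
        else:
            epochs_since_best += 1
            if epochs_since_best >= patience:
                break                             # over-training detected

    model.set_weights(best_weights)               # roll back to the best epoch
    return model
```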
3.3 Akaike's Information Criterion

In order to discuss the generalization of an LNN classifier based on information statistics, we introduce an information criterion. Let us assume that p(x) is a probability density function and p(x, w) is a model distribution. The discrepancy between the model and the real distribution can be measured in terms of the Kullback-Leibler information distance:

D[p(x), p(x, w)] = \int p(x) \ln \frac{p(x)}{p(x, w)} \, dx .     (17)
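For discrete (or discretized) distributions the integral in (17) becomes a sum, which the following short sketch (our own illustration) evaluates for a true distribution p and a model q:

```python
import numpy as np

def kl_distance(p, q, eps=1e-12):
    """Discrete Kullback-Leibler distance D[p, q] = sum_i p_i * ln(p_i / q_i).

    `p` and `q` are arrays of probabilities over the same bins; `eps` guards
    against division by zero for empty model bins.
    """
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    mask = p > 0                      # terms with p_i = 0 contribute nothing
    return float(np.sum(p[mask] * np.log(p[mask] / (q[mask] + eps))))

# Example: the distance vanishes only when the model matches the true distribution.
p_true = np.array([0.5, 0.3, 0.2])
print(kl_distance(p_true, p_true))           # 0.0 (up to eps)
print(kl_distance(p_true, [0.4, 0.4, 0.2]))  # > 0
```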
If we assume that the real distribution p(x) belongs to the model set and the number of observations is sufficiently high, the information distance in (17) can be expressed as a function of the distance between the real parameters and their Maximum Likelihood Estimate (MLE):

D(p(x, w), p(x, w_0)) = D(w, w_0) \approx \frac{1}{2} (w - w_0)^t M (w - w_0) ,     (18)

where M is Fisher's information matrix, t denotes the transpose of a matrix, w is the true parameter vector and w_0 is the MLE of the parameter vector. Under the assumption that the competing models belong to a hierarchical sequence in which the lower-dimensional models are included in the higher-dimensional ones as sub-models, Akaike extended Maximum Likelihood Estimation in such a way that An Information Criterion (Akaike's Information Criterion: AIC) can be used to address parameter estimation and optimal model selection simultaneously (Akaike (1974)). Akaike defined AIC as an estimator of the expected information distance by using approximation (18):

E[2K \cdot D(p(x, w), p(x, w_0))] \approx AIC = 2 \left( -\sum_{i=1}^{K} \ln p(x_i, w_0) + l \right) ,     (19)

where E[\cdot] denotes the expectation operator, K is the number of data and l is the number of independent parameters.
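As a concrete reading of (19) (our own sketch, with hypothetical candidate models), AIC can be computed for each competing model from its per-observation log-likelihoods at the MLE and its number of independent parameters, and the model with the smallest AIC is selected:

```python
import numpy as np

def aic(log_likelihoods, n_params):
    """AIC = 2 * (-sum_i ln p(x_i, w0) + l), as in (19).

    `log_likelihoods` holds ln p(x_i, w0) for each of the K observations at the
    maximum likelihood estimate w0; `n_params` is the number l of independent
    parameters of the model.
    """
    return 2.0 * (-float(np.sum(log_likelihoods)) + n_params)

# Hypothetical candidates: (name, per-sample log-likelihoods at the MLE, l).
candidates = [
    ("small model", np.full(100, -1.20), 5),
    ("large model", np.full(100, -1.15), 40),
]
scores = {name: aic(ll, l) for name, ll, l in candidates}
best = min(scores, key=scores.get)
print(scores, "->", best)
# The larger model fits slightly better but is penalized for its extra parameters.
```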
3.4 Application of AIC to LNN 
It should be noted that neural networks and traditional statistical classifiers are, in general, not directly related. However, when an LNN has been trained with a sufficient number of training data, its output can be regarded as an approximate estimate of a Bayesian posterior probability (Wan (1990) and Ruck et al. (1990)). In the classification of remotely sensed images a sufficient number of training data is normally available, which provides a theoretical interpretation of the output of LNN classifiers as an estimate of a posterior probability. Thus, AIC is applicable to LNNs as a criterion for generalization and is determined by
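Purely as an illustrative sketch under our own assumptions (it is not the expression referred to above), one way to evaluate an AIC-style score for an LNN classifier is to treat the network outputs as posterior-probability estimates, sum the log of the output assigned to each training pattern's true class, and penalize by the number of independent connection weights:

```python
import numpy as np

def lnn_aic(outputs, targets, n_weights, eps=1e-12):
    """Illustrative AIC-style score for an LNN classifier (our assumption, not
    the paper's formula): the outputs are read as posterior-probability
    estimates, the log-likelihood term is the log of the output assigned to
    each pattern's true class, and the penalty counts the independent weights.

    `outputs`: (K, C) array of network outputs for K training patterns.
    `targets`: (K,) array of true class indices.
    """
    outputs = np.asarray(outputs, dtype=float)
    targets = np.asarray(targets, dtype=int)
    log_lik = np.log(outputs[np.arange(len(targets)), targets] + eps)
    return 2.0 * (-float(np.sum(log_lik)) + n_weights)

# Example with made-up outputs from two architectures of different size.
outputs_small = np.array([[0.7, 0.3], [0.2, 0.8], [0.6, 0.4]])
outputs_large = np.array([[0.9, 0.1], [0.1, 0.9], [0.8, 0.2]])
targets = np.array([0, 1, 0])
print(lnn_aic(outputs_small, targets, n_weights=10))
print(lnn_aic(outputs_large, targets, n_weights=50))
```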
	        