While the simplest procedure for network pruning is to heuristically find and prune those connection weights whose removal does not change the output values, we suggest carrying out these procedures based on AIC. That is, we prune weights whose values are relatively small and re-train the network. Then, we calculate the AIC of each network and choose the appropriate combination of connection weights based on the minimization of AIC.
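The following sketch illustrates one way such an AIC-guided pruning loop could be organized. It is a minimal sketch, not the authors' implementation: the retrain and log_likelihood callables are hypothetical placeholders for the network-specific training and likelihood routines, and AIC is computed as -2 x (maximum log-likelihood) + 2 x (number of remaining free weights).

```python
import numpy as np

def aic(max_log_likelihood, n_free_params):
    # AIC = -2 * (maximum log-likelihood) + 2 * (number of free parameters)
    return -2.0 * max_log_likelihood + 2.0 * n_free_params

def prune_by_aic(w0, retrain, log_likelihood, n_rounds=10, frac=0.05):
    """Greedy AIC-guided pruning sketch for a trained LNN.

    w0             : 1-D array of all connection weights of the trained network
    retrain        : callable(mask) -> weights re-trained with mask==False entries fixed at 0
    log_likelihood : callable(w) -> maximum log-likelihood of the network on the training set
    """
    mask = np.ones(w0.size, dtype=bool)                 # True = connection kept
    w = w0
    best = (aic(log_likelihood(w), int(mask.sum())), w.copy(), mask.copy())

    for _ in range(n_rounds):
        # prune the relatively small weights among those still active
        magnitude = np.where(mask, np.abs(w), np.inf)
        n_prune = max(1, int(frac * mask.sum()))
        mask[np.argsort(magnitude)[:n_prune]] = False

        w = retrain(mask)                               # re-train the pruned network
        candidate = aic(log_likelihood(w), int(mask.sum()))
        if candidate < best[0]:
            best = (candidate, w.copy(), mask.copy())

    return best                                         # AIC-minimizing (AIC, weights, mask)
```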
4.2 Choosing the appropriate activation function

The architecture of an LNN classifier with one hidden layer is composed of the number of nodes in the hidden layer and the type of activation function. Therefore, the optimal architecture of the LNN should be chosen from the viewpoint of not only the size of the network but also the activation function form suitable for the training data set at hand.
From the theoretical interpretation of the LNN by Shimizu (1996), we can get various types of probability distribution by changing the value of the parameter a (>= -1) in the generalized activation function

    f(u_j) = \frac{1}{-a + \exp(-\beta u_j)},        (26)

to get the best fit to the training data. We estimate \beta jointly with w_{ih} and w_{hj}. Hence, without essential loss of generality, let \beta be 1. For the use of the generalized activation function, the weight update rule of the ordinary back-propagation algorithm should be modified with its derivative

    f'(u) = f(u)\,\beta\,\{1 + a f(u)\},        (27)

in place of (16).
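As a concrete check on the reconstructed forms (26) and (27), the sketch below implements the generalized activation function and its derivative in Python; it is an illustration, not code from the paper. With a = -1 and beta = 1 it reduces to the ordinary logistic sigmoid and its familiar derivative f(u)(1 - f(u)).

```python
import numpy as np

def generalized_activation(u, a=-1.0, beta=1.0):
    """Generalized activation function of Eq. (26): f(u) = 1 / (-a + exp(-beta * u)).

    With a = -1 and beta = 1 this reduces to the ordinary logistic sigmoid
    1 / (1 + exp(-u)); other values of a (>= -1) reshape the output distribution.
    """
    return 1.0 / (-a + np.exp(-beta * u))

def generalized_activation_derivative(u, a=-1.0, beta=1.0):
    """Derivative of Eq. (27): f'(u) = beta * f(u) * (1 + a * f(u)).

    This replaces the usual sigmoid derivative f(u) * (1 - f(u)) in the
    delta terms of the back-propagation weight-update rule.
    """
    f = generalized_activation(u, a, beta)
    return beta * f * (1.0 + a * f)

# Sanity check: a = -1, beta = 1 gives the logistic sigmoid and its derivative.
u = np.linspace(-4.0, 4.0, 9)
s = 1.0 / (1.0 + np.exp(-u))
assert np.allclose(generalized_activation(u), s)
assert np.allclose(generalized_activation_derivative(u), s * (1.0 - s))
```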
We apply the modified back-propagation algorithm and change the parameter a of the generalized activation function in the training process to get the best fit to the training data based on AIC. After the appropriate size of the LNN has been fixed, the value of AIC is determined by the maximum log-likelihood alone,

    \mathrm{MLL} = \sum_{k=1}^{K} \sum_{j=1}^{J} t_{kj} \ln\{p_j(x_k, w)\}.        (28)

The advantage of this approach is that it provides information on the fit of the activation function to the training data at hand when comparing competing activation function forms.
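A minimal sketch of how the maximum log-likelihood (28) and the corresponding AIC could be evaluated for a trained classifier is given below. The function names are illustrative, and the teacher signals t_kj are assumed to be one-hot encoded targets.

```python
import numpy as np

def classifier_mll(probs, targets):
    """Maximum log-likelihood of Eq. (28) for a J-class LNN classifier.

    probs   : (K, J) array, p_j(x_k, w) -- network output probabilities
    targets : (K, J) array, one-hot teacher signals t_kj
    """
    eps = 1e-12                                   # guard against log(0)
    return float(np.sum(targets * np.log(probs + eps)))

def classifier_aic(probs, targets, n_free_params):
    # AIC = -2 * MLL + 2 * (number of free connection weights)
    return -2.0 * classifier_mll(probs, targets) + 2.0 * n_free_params

# Toy usage: each competing activation setting would be trained, its output
# probabilities evaluated on the training set, and the setting with the
# smaller AIC retained.
probs = np.array([[0.8, 0.2], [0.3, 0.7], [0.9, 0.1]])
targets = np.array([[1, 0], [0, 1], [1, 0]])
print(classifier_aic(probs, targets, n_free_params=20))
```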
4.3 Practical procedure

The procedure described in this chapter is only an example and there may be many alternatives. In addition, the procedure described here should be carried out with feedback, or simultaneously with the search for an optimal architecture. However, it is almost impossible, or computationally too costly, to search all possible subsets of the models. Thus, from the viewpoint of practical application, it is to a certain degree reasonable to start by choosing the number of nodes in the hidden layer and to prune the connection weights before tuning the activation function. Then, we choose the model of LNNs which minimizes AIC.
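One way to organize this sequential search is sketched below. All of the callables are hypothetical stand-ins for a concrete LNN implementation; the sketch only mirrors the ordering described above (choose hidden nodes and prune, then tune the activation parameter, then keep the AIC-minimizing model).

```python
def select_lnn_by_aic(hidden_sizes, a_values, train, prune, retrain_with_a, model_aic):
    """Sequential architecture-selection sketch.

    Hypothetical callables:
      train(h)               -> network trained with h hidden nodes (a fixed at -1)
      prune(net)             -> pruned and re-trained network
      retrain_with_a(net, a) -> network re-trained with activation parameter a
      model_aic(net)         -> AIC of the network on the training data
    """
    # Step 1 and 2: choose the number of hidden nodes and prune, judging by AIC.
    pruned = [prune(train(h)) for h in hidden_sizes]
    net = min(pruned, key=model_aic)

    # Step 3: tune the activation parameter a of the chosen network.
    candidates = [retrain_with_a(net, a) for a in a_values]
    return min(candidates + [net], key=model_aic)      # AIC-minimizing model
```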
5 LEARNING PARADIGM OF LAYERED NEURAL NETWORK

It is known that AIC has a large variance, so the proposed procedures are problematic. Given the architecture of an LNN, it is possible that repeated training iterations will successively improve the performance of the LNN on the training data, but due to the limited number of training data and the presence of noise, over-training often occurs. In order to alleviate the over-training problem, we introduce Tikhonov's regularization.

5.1 Tikhonov regularization

Tikhonov's regularization is a method that optimizes a function E + \gamma z(w), where E is the original error function, z(\cdot) is a smoothing function, and \gamma is a regularization parameter (Tikhonov et al. (1990)). Although there are obviously many possibilities for choosing the form of the smoothing function, the Euclidean norm of the parameters is most commonly used; thus, in this paper, we adopt it as the smoothing function. In the case of regression model (21), we can obtain the modified maximum log-likelihood as follows:

    \max_{w,\sigma^2} \log L(w, \sigma^2) = -\frac{K}{2}\log 2\pi - \frac{K}{2}\log\sigma^2 - \frac{(y - Xw)'(y - Xw) + \gamma\, w'w}{2\sigma^2},        (29)

yielding

    w_\gamma = (X'X + \gamma I)^{-1} X'y.        (30)

Solution (30) is called a Ridge estimator and was proposed for the Least Squares Method by Hoerl and Kennard (1970).
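A direct implementation of the Ridge estimator (30) is straightforward. The sketch below (illustrative only) computes it with a linear solve rather than an explicit inverse and contrasts it with the ordinary least-squares estimator w_0 on a nearly collinear design matrix.

```python
import numpy as np

def ridge_estimator(X, y, gamma):
    """Ridge (Tikhonov-regularized) estimator of Eq. (30): w = (X'X + gamma*I)^(-1) X'y."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + gamma * np.eye(p), X.T @ y)

def ols_estimator(X, y):
    """Ordinary least-squares estimator w_0 = (X'X)^(-1) X'y (the gamma = 0 case)."""
    return np.linalg.solve(X.T @ X, X.T @ y)

# Small usage example with an ill-conditioned design matrix
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 5))
X[:, 4] = X[:, 3] + 1e-3 * rng.normal(size=30)     # nearly collinear columns
w_true = np.array([1.0, -2.0, 0.5, 3.0, -3.0])
y = X @ w_true + 0.1 * rng.normal(size=30)

print(ols_estimator(X, y))                         # erratic under collinearity
print(ridge_estimator(X, y, gamma=0.1))            # shrunken, better conditioned
```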
	        
The Mean Squared Error (MSE) of the parameters is defined as

    \mathrm{MSE} = E[(w_0 - w)'(w_0 - w)].        (31)

In the case of regression model (20), the MSE is

    \mathrm{MSE}(w_0) = \sigma^2 \,\mathrm{tr}[(X'X)^{-1}].        (32)

By introducing Tikhonov's regularization, the MSE is modified as

    \mathrm{MSE}(w_\gamma) = E[(w_\gamma - w)'(w_\gamma - w)] = \sigma^2 \,\mathrm{tr}\big[(X'X + \gamma I)^{-1} X'X (X'X + \gamma I)^{-1}\big].        (33)

Hoerl and Kennard (1970) show that there always exists a constant \gamma > 0 such that \mathrm{MSE}(w_\gamma) < \mathrm{MSE}(w_0).
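The trace formulas (32) and (33) can be compared numerically. In the sketch below (again only illustrative), the regularized trace falls below the ordinary least-squares value for every gamma > 0 tried, consistent with the existence result of Hoerl and Kennard (1970).

```python
import numpy as np

def mse_ols(X, sigma2):
    """Eq. (32): MSE(w_0) = sigma^2 * tr[(X'X)^(-1)]."""
    return sigma2 * np.trace(np.linalg.inv(X.T @ X))

def mse_ridge(X, sigma2, gamma):
    """Eq. (33): sigma^2 * tr[(X'X + gamma*I)^(-1) X'X (X'X + gamma*I)^(-1)]."""
    p = X.shape[1]
    A = np.linalg.inv(X.T @ X + gamma * np.eye(p))
    return sigma2 * np.trace(A @ (X.T @ X) @ A)

# Illustration of the Hoerl and Kennard (1970) existence claim:
# the regularized trace is smaller than the OLS trace for each gamma > 0 below.
rng = np.random.default_rng(1)
X = rng.normal(size=(30, 5))
sigma2 = 1.0
print(mse_ols(X, sigma2))
for gamma in (0.01, 0.1, 1.0):
    print(gamma, mse_ridge(X, sigma2, gamma))
```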
