While the simplest procedure for network pruning is to heuristically find and prune those connection weights whose removal does not change the output values, we suggest carrying out these procedures based on AIC. That is, we prune weights whose values are relatively small and re-train the network. Then, we calculate the AIC of each network and choose the appropriate combination of connection weights based on the minimization of AIC.
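As a concrete illustration of this prune, re-train, and compare loop, the following minimal Python sketch selects a connection mask by AIC; the flat weight vector, the candidate pruning fractions, and the user-supplied helper retrain_and_loglik (which must re-train the masked network and return its maximum log-likelihood) are our own illustrative assumptions, not part of the proposed method itself.

    import numpy as np

    def aic(log_likelihood, n_free_params):
        # AIC = -2 * (maximum log-likelihood) + 2 * (number of free parameters)
        return -2.0 * log_likelihood + 2.0 * n_free_params

    def prune_by_aic(weights, retrain_and_loglik, fractions=(0.0, 0.1, 0.2, 0.3)):
        # Magnitude-based pruning guided by AIC: for each candidate pruning
        # fraction, remove the smallest weights, re-train the remaining ones,
        # and keep the connection mask that attains the minimum AIC.
        order = np.argsort(np.abs(weights))          # smallest magnitudes first
        best_aic, best_mask = np.inf, None
        for frac in fractions:
            n_pruned = int(frac * weights.size)
            mask = np.ones(weights.size)
            mask[order[:n_pruned]] = 0.0             # prune the smallest weights
            mll = retrain_and_loglik(mask)           # re-train the masked network
            score = aic(mll, int(mask.sum()))
            if score < best_aic:
                best_aic, best_mask = score, mask
        return best_mask, best_aic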
4.2 Choosing the appropriate activation function
The architecture of an LNN classifier with one hidden layer is composed of the number of nodes in the hidden layer and the type of activation function. Therefore, the optimal architecture of the LNN should be chosen from the viewpoint of not only the size of the network but also the activation function form suitable for the training data set at hand.
From the theoretical interpretation of the LNN by Shimizu (1996), we can obtain various types of probability distributions by changing the value of the parameter a (≥ -1) in the generalized activation function

f(u_j) = \frac{1}{-a + \exp(-\beta u_j)}, \qquad (26)

so as to get the best fit to the training data. We estimate β jointly with w_{ih} and w_{hj}; hence, without essential loss of generality, we let β be 1. For the generalized activation function, the weight update rule of the ordinary back-propagation algorithm should be modified with its derivative

f'(u) = \beta\, f(u)\,\{1 + a\, f(u)\} \qquad (27)

in place of (16).
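For concreteness, the following short Python sketch implements (26) and (27) and checks the derivative numerically; the parameter value a = -0.5 and the finite-difference test are our own illustrative choices (a = -1 recovers the ordinary logistic function).

    import numpy as np

    def f(u, a=-1.0, beta=1.0):
        # Generalized activation function (26); a = -1 gives the ordinary
        # logistic function 1 / (1 + exp(-beta * u)).
        return 1.0 / (-a + np.exp(-beta * u))

    def f_prime(u, a=-1.0, beta=1.0):
        # Derivative (27), used inside back-propagation in place of the
        # ordinary logistic derivative.
        fu = f(u, a, beta)
        return beta * fu * (1.0 + a * fu)

    # Check (27) against a central finite difference at a = -0.5.
    u = np.linspace(-3.0, 3.0, 7)
    eps = 1e-6
    approx = (f(u + eps, a=-0.5) - f(u - eps, a=-0.5)) / (2.0 * eps)
    assert np.allclose(f_prime(u, a=-0.5), approx, atol=1e-6)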
We apply the modified back-propagation algorithm and change the parameter a of the generalized activation function in the training process to get the best fit to the training data based on AIC. After the appropriate size of the LNN has been fixed, the value of AIC is determined by the maximum log-likelihood alone:

\mathrm{MLL} = \sum_{k=1}^{K}\sum_{j=1}^{J} t_{kj}\,\ln\{ p_j(x_k, w) \}. \qquad (28)
The advantage of this approach is that it provides information on the best fit of the activation function to the training data at hand by comparing competing activation function forms.
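The comparison itself only needs the maximum log-likelihood (28) and the parameter count; the toy Python sketch below shows the bookkeeping, where the two probability tables stand in for the outputs of two trained networks that differ only in the activation parameter a, and all numbers are made-up placeholders.

    import numpy as np

    def mll(targets_onehot, probs):
        # Maximum log-likelihood (28): sum over cases k and classes j of
        # t_kj * ln p_j(x_k, w), evaluated at the fitted weights.
        return float(np.sum(targets_onehot * np.log(probs)))

    def aic(mll_value, n_free_params):
        return -2.0 * mll_value + 2.0 * n_free_params

    t = np.array([[1, 0], [0, 1], [1, 0]])                  # one-hot targets
    p_a = np.array([[0.8, 0.2], [0.3, 0.7], [0.9, 0.1]])    # network with a = a1
    p_b = np.array([[0.6, 0.4], [0.4, 0.6], [0.7, 0.3]])    # network with a = a2
    k = 10                                                   # retained weights in both
    print(aic(mll(t, p_a), k), aic(mll(t, p_b), k))          # keep the smaller AIC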
4.3 Practical procedure

The procedure described in this chapter is only an example, and there may be many alternatives. In addition, the procedure described here should be carried out with feedback, or simultaneously, when searching for an optimal architecture. However, it is almost impossible, or computationally too costly, for us to search all possible subsets of the models. Thus, from the viewpoint of practical application, it is to a certain degree reasonable to start by choosing the number of nodes in the hidden layer and to prune the connection weights before tuning the activation function. Then, we choose the model of LNNs which minimizes AIC.
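Read as pseudocode, the staged search might be organized as below; the three stage callables (fit, prune, tune_activation) are hypothetical placeholders for the routines discussed above, each expected to return its AIC-minimizing model together with that AIC.

    import numpy as np

    def staged_search(hidden_sizes, fit, prune, tune_activation):
        # Stage 1: fix the number of hidden nodes by AIC.
        best_model, best_aic = None, np.inf
        for h in hidden_sizes:
            model, score = fit(h)
            if score < best_aic:
                best_model, best_aic = model, score
        # Stage 2: prune connection weights of the chosen network.
        best_model, best_aic = prune(best_model)
        # Stage 3: tune the activation parameter a.
        best_model, best_aic = tune_activation(best_model)
        return best_model, best_aic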
5 LEARNING PARADIGM OF LAYERED NEURAL NETWORK

It is known that AIC has a large variance, so the proposed procedures can be problematic. Given the architecture of an LNN, it is possible that repeated training iterations will successively improve the performance of the LNN on the training data, but due to the limited number of training data and the presence of noise, over-training often occurs. In order to alleviate the over-training problem, we introduce Tikhonov's regularization.

5.1 Tikhonov regularization

Tikhonov's regularization is a method that optimizes a function E + γ z(w), where E is the original error function, z(·) is a smoothing function, and γ is a regularization parameter (Tikhonov et al., 1990). Although there are obviously many possibilities for choosing the form of the smoothing function, the Euclidean norm of the parameters is most commonly used; thus, in this paper, we adopt it as the smoothing function. In the case of regression model (21), we can obtain the modified maximum log-likelihood as follows:

\max_{w,\sigma^2}\ \log L(w, \sigma^2) = -\frac{K}{2}\log 2\pi - \frac{K}{2}\log\sigma^2 - \frac{(y - Xw)'(y - Xw) + \gamma\, w'w}{2\sigma^2}, \qquad (29)

yielding

w_\gamma = (X'X + \gamma I)^{-1} X'y. \qquad (30)
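A minimal numerical sketch of (30) on synthetic data follows; the design matrix, the noise level, and γ = 0.1 are arbitrary illustrative choices.

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(50, 4))                      # design matrix (K x p)
    w_true = np.array([1.0, -2.0, 0.5, 0.0])
    y = X @ w_true + rng.normal(scale=0.5, size=50)   # regression targets

    gamma = 0.1                                       # regularization parameter
    p = X.shape[1]
    # Ridge estimator (30): w_gamma = (X'X + gamma I)^{-1} X'y
    w_gamma = np.linalg.solve(X.T @ X + gamma * np.eye(p), X.T @ y)
    w_ols = np.linalg.solve(X.T @ X, X.T @ y)         # gamma = 0: ordinary least squares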
Solution (30) is called the Ridge estimator and was proposed as a modification of the Least Squares Method by Hoerl and Kennard (1970). The Mean Squared Error (MSE) of the parameters is defined as

\mathrm{MSE} = E\left[(w_0 - w)'(w_0 - w)\right]. \qquad (31)

In the case of regression model (20), the MSE is

\mathrm{MSE}(w_0) = \sigma^2\, \mathrm{tr}\left[(X'X)^{-1}\right]. \qquad (32)

By introducing Tikhonov's regularization, the MSE is modified as

\mathrm{MSE}(w_\gamma) = E\left[(w_\gamma - w)'(w_\gamma - w)\right] = \sigma^2\, \mathrm{tr}\left[(X'X + \gamma I)^{-1} X'X\, (X'X + \gamma I)^{-1}\right]. \qquad (33)
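To illustrate how the trace in (33) behaves relative to (32), the short sketch below evaluates both on a synthetic design matrix (the matrix, σ² = 1, and the γ grid are our own illustrative choices); the trace shrinks as γ grows.

    import numpy as np

    rng = np.random.default_rng(1)
    X = rng.normal(size=(30, 5))
    sigma2 = 1.0
    XtX = X.T @ X

    def mse_ols(XtX, sigma2):
        # (32): sigma^2 * tr[(X'X)^{-1}]
        return sigma2 * np.trace(np.linalg.inv(XtX))

    def mse_ridge(XtX, sigma2, gamma):
        # (33): sigma^2 * tr[(X'X + gamma I)^{-1} X'X (X'X + gamma I)^{-1}]
        A = np.linalg.inv(XtX + gamma * np.eye(XtX.shape[0]))
        return sigma2 * np.trace(A @ XtX @ A)

    print(mse_ols(XtX, sigma2))
    for gamma in (0.1, 1.0, 10.0):
        print(gamma, mse_ridge(XtX, sigma2, gamma))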
Hoerl and Kennard (1970) show that there always exists a constant γ > 0 such that