where P_j is the output of node j in the output layer (corresponding to class j) and L is the number of independently adjusted parameters. The model that minimizes AIC is the best model. If only one model set is used, that is, if the number of parameters is fixed, then AIC reduces to the MLE solution. If two different model sets attain the same maximum-likelihood value, the model with the smaller number of parameters is selected (the principle of parsimony, or Occam's razor).
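As a rough illustration of this selection rule, the following Python sketch compares candidate models by AIC = -2 max log L + 2L; the log-likelihood values and parameter counts are invented purely for illustration.

def aic(max_log_likelihood, n_params):
    # AIC = -2 * (maximized log-likelihood) + 2 * (number of adjustable parameters)
    return -2.0 * max_log_likelihood + 2.0 * n_params

# (maximized log-likelihood, number of parameters) for three hypothetical models
candidates = {
    "small":  (-140.0, 10),
    "medium": (-120.0, 25),
    "large":  (-119.5, 60),
}

for name, (ll, k) in candidates.items():
    print(f"{name:>6}: AIC = {aic(ll, k):.1f}")

best = min(candidates, key=lambda name: aic(*candidates[name]))
print("selected model:", best)  # "medium": fits better than "small", uses fewer parameters than "large"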
3.5 Generalization based on AIC
In the case of LNNs, the number of parameters in (20) is determined by the number of parameters in the activation function and the number of connection weights. The number of parameters is basically determined by the numbers of input nodes, output nodes, hidden layers and hidden nodes. Here, too, the trade-off between the number of parameters and the overall goodness of fit of the model is inevitable.
Although the expectation of AIC is asymptotically unbiased up to terms of order O(1), it has a large variance, so that the procedures discussed in the previous sections are also problematic (Shimohira, 1993).
To give an example of over-training, let y be a linear function of x,

y_k = x_k w + e_k   (k = 1, ..., K),   (21)
where e is the error term and w is the coefficient vector. Let the regression model be expressed in matrix form as

y = Xw + e.   (22)

Suppose e ~ N(0, σ²I), where σ is an unknown standard deviation.
The ML estimator

ŵ = (X'X)^{-1} X'y   (23)

is the solution to

max_{w,σ²} log L(w, σ²) = -(K/2) log 2π - (K/2) log σ² - (1/(2σ²)) (y - Xw)'(y - Xw).   (24)
The situation where the determinant of X'X is nearly zero is called multicollinearity; in this case problem (24) is ill-posed, that is, its solution is unstable. This can be regarded as a form of over-fitting, as illustrated by the sketch below.
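The following sketch (synthetic data, NumPy, all numbers illustrative) shows this instability: when two regressor columns are nearly collinear, det(X'X) is close to zero and the estimate (23) varies between independent noise realizations far more than the true coefficients (1, 1) would suggest.

import numpy as np

rng = np.random.default_rng(0)
K = 50
x1 = rng.normal(size=K)
x2 = x1 + 1e-4 * rng.normal(size=K)          # nearly collinear with x1
X = np.column_stack([x1, x2])
w_true = np.array([1.0, 1.0])

print("det(X'X) =", np.linalg.det(X.T @ X))  # close to zero: multicollinearity

# Refit under two independent noise realizations; the estimates swing widely,
# even though the underlying coefficients are unchanged.
for trial in range(2):
    y = X @ w_true + 0.1 * rng.normal(size=K)
    w_hat = np.linalg.solve(X.T @ X, X.T @ y)  # (X'X)^{-1} X'y, equation (23)
    print(f"trial {trial}: w_hat = {np.round(w_hat, 2)}")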
The generalization of LNN classifiers is considered to be a problem of searching for an appropriate architecture and an appropriate training algorithm so that the LNN performs well and minimizes the error over all unknown data. In the following chapters, we propose an LNN design with some techniques for improving the generalization of LNN classifiers; these are characterized by the architecture and the training algorithm.
4 LAYERED NEURAL NETWORK ARCHITECTURE DESIGN BASED ON AIC
In this chapter, we propose an LNN architecture design for choosing an appropriate model in terms of not only the size of the network but also a suitable activation function, based on AIC.
4.1 Choosing the appropriate size based on AIC
In a three-layered neural network, the number of parameters L is determined by the number of hidden nodes H, the number of input nodes I and the number of output nodes J as follows:

L = I × H + H × J.   (25)
The numbers of nodes in the input and output layers are, in gen
eral, fixed according to the practical application problem. The
users, therefore, are only able to adjust the number of nodes in the
hidden layer.
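For concreteness, a one-line helper implementing (25) is given below; bias terms are not counted, following (25), and the example numbers are arbitrary.

def param_count(n_input, n_hidden, n_output):
    # Number of connection weights of a three-layered LNN, equation (25): L = I*H + H*J
    return n_input * n_hidden + n_hidden * n_output

print(param_count(4, 3, 3))  # 4*3 + 3*3 = 21 adjustable weights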
4.1.1 Choosing the number of hidden nodes If we select a hidden layer with too few nodes, the LNN may not be powerful enough for a given learning task, while too many hidden nodes would lead to over-fitting of the training data. Therefore, an appropriate number of hidden nodes should be chosen so that the LNN can guarantee generalization ability.
The relationship between the number of hidden nodes H and the number of output nodes J has been discussed fairly thoroughly by Mehrotra et al. (1991), Weigend and Rumelhart (1991), and Amirikian and Nishimura (1995). They conclude that an LNN with one hidden layer, in which the number of hidden nodes H equals the number of output nodes J, is of an appropriate size to execute a given classification task.
We suggest choosing an appropriate size of LNN by using AIC, which can simultaneously address both parameter estimation and forecasting of the generalization ability of the model on unknown data during the training process. We choose the appropriate number of hidden nodes by varying the number of hidden nodes to obtain LNNs of different sizes; their AIC values are computed after training is completed. The LNN yielding the minimum value of AIC is chosen as being of the appropriate size to execute a given classification task, as sketched below.
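A sketch of this selection loop follows. train_lnn is a placeholder for the user's own training routine: it is assumed to train a three-layered network with h hidden nodes to completion and return the maximized log-likelihood on the training data; everything else follows (25) and the AIC criterion.

def param_count(n_input, n_hidden, n_output):
    return n_input * n_hidden + n_hidden * n_output            # equation (25)

def select_hidden_nodes(train_lnn, n_input, n_output, candidate_sizes):
    # Train one LNN per candidate hidden-layer size and keep the one with minimum AIC.
    best_h, best_aic = None, float("inf")
    for h in candidate_sizes:
        max_log_lik = train_lnn(h)                              # training run to completion
        aic = -2.0 * max_log_lik + 2.0 * param_count(n_input, h, n_output)
        if aic < best_aic:
            best_h, best_aic = h, aic
    return best_h, best_aic

# Hypothetical usage:
# best_h, best_aic = select_hidden_nodes(train_lnn, n_input=4, n_output=3, candidate_sizes=range(1, 11))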
4.1.2 Pruning the connection weights Network pruning algorithms can be applied to obtain the minimum number of parameters (by removing redundant parameters) so that the LNN is more efficient in both forward computation time and generalization capability. Such algorithms have already been discussed by several researchers, for example Sietsma and Dow (1990).
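As a purely generic illustration of the idea (not the specific procedure of Sietsma and Dow (1990)), the sketch below removes connection weights whose magnitude falls below a threshold, reducing the number of effective parameters.

import numpy as np

def prune_weights(weights, threshold=0.05):
    # Zero out connection weights with magnitude below the threshold;
    # return the pruned matrix and the number of weights removed.
    mask = np.abs(weights) >= threshold
    return weights * mask, int(mask.size - mask.sum())

w = np.array([[0.80, -0.01],
              [0.03, -0.60]])
pruned, n_removed = prune_weights(w)
print(pruned)
print("weights removed:", n_removed)  # 2 of the 4 weights fall below the threshold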