[Equations (2)-(5): second-order approximations of the expected classification-error bias E{Δε} of the linear and the quadratic classifier, expressed in terms of the design-set size N_D, the dimensionality d and the mean difference μ; the expressions are illegible in the source (see Fukunaga 1990).]
In (2)-(5), N_D is the size of the design set, d is the dimensionality of the feature space and μ is the difference between the mean vectors (μ₁ − μ₂). If d >> 1, it can be seen from (2)-(5) that the bias of a linear classifier is proportional to d/N_D and the bias of a quadratic classifier is proportional to d²/N_D. For the linear classifier this means that a number of data vectors that is a fixed multiple of the dimensionality of the feature space (N_D = c·d) is enough to compensate for the effect of the finite sample size. As can be expected, this does not hold for the quadratic case. The size of the constant c depends on the separability of the classes (i.e. the expected error rate): the higher the expected error, the greater the value the constant should have.
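To make the d/N_D versus d²/N_D behaviour concrete, the following sketch (an addition, not part of the original text) runs a small Monte Carlo experiment: two Gaussian classes N(0, I) and N(μ, I) are drawn, plug-in linear and quadratic classifiers are designed from N_D = c·d vectors per class, and the excess error over the Bayes error is measured on a large independent test set. The class model, the values of c and the test-set size are illustrative assumptions.

```python
import numpy as np
from math import erf, sqrt

rng = np.random.default_rng(0)

def bayes_error(delta):
    """Bayes error for two equiprobable classes N(0, I) and N(mu, I), delta = ||mu||."""
    return 0.5 * (1.0 - erf(delta / (2.0 * sqrt(2.0))))

def design_and_test(d, n_design, delta, n_test=20000):
    """Estimate test errors of plug-in linear and quadratic classifiers."""
    mu = np.zeros(d)
    mu[0] = delta                                   # mean difference along the first axis
    X1 = rng.standard_normal((n_design, d))         # design samples, class 1
    X2 = rng.standard_normal((n_design, d)) + mu    # design samples, class 2
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    S1 = np.cov(X1, rowvar=False)
    S2 = np.cov(X2, rowvar=False)
    Sp = 0.5 * (S1 + S2)                            # pooled covariance for the linear rule

    T1 = rng.standard_normal((n_test, d))           # large independent test sets
    T2 = rng.standard_normal((n_test, d)) + mu

    def lin(x):
        # plug-in linear discriminant; positive values vote for class 2
        w = np.linalg.solve(Sp, m2 - m1)
        b = -0.5 * (m1 + m2) @ w
        return x @ w + b

    def quad(x):
        # plug-in quadratic discriminant (estimated log-likelihood ratio)
        def logdens(x, m, S):
            xc = x - m
            q = np.einsum('ij,jk,ik->i', xc, np.linalg.inv(S), xc)
            return -0.5 * q - 0.5 * np.linalg.slogdet(S)[1]
        return logdens(x, m2, S2) - logdens(x, m1, S1)

    def err(f):
        return 0.5 * (np.mean(f(T1) > 0) + np.mean(f(T2) <= 0))

    return err(lin), err(quad)

delta = 2.56                                        # gives a Bayes error of about 10 %
for d in (8, 16, 32):
    for c in (5, 20):                               # N_D = c * d design vectors per class
        e_lin, e_quad = design_and_test(d, c * d, delta)
        print(f"d={d:3d}  N_D={c * d:4d}"
              f"  bias_lin={e_lin - bayes_error(delta):+.3f}"
              f"  bias_quad={e_quad - bayes_error(delta):+.3f}")
```

In such a run the excess error of the linear rule stays roughly constant for a fixed c, while the quadratic rule degrades visibly as d grows, in line with the d/N_D and d²/N_D dependence stated above.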
2.2.2 Nonparametric Parzen classifiers  The nonparametric classifiers are appealing because they make no prior assumptions about the density structures of the underlying problem. However, they are more computation intensive and have been shown to be heavily biased in high-dimensional spaces (see below). They also have some parameters that must be tuned to achieve optimality. The tuning of these parameters is extremely important and can be done by experimental error estimation. These topics are discussed in this chapter. The detailed derivations can be found in (Fukunaga 1990, chapters 6 and 7).
A nonparametric density function can be estimated by the so-called Parzen method. The corresponding classifier is called a Parzen classifier. The Parzen estimate of a density function can be expressed as

\hat{p}(x) = \frac{1}{N_D} \sum_{i=1}^{N_D} K(k, Q, x_i, x) , \qquad (6)

where K is a kernel function, k determines the size of the kernel, Q is a metric used to compute distances and the x_i's are the sample vectors. The corresponding classifier compares the a posteriori probabilities to a decision threshold, t. An analysis of the effect of all these variables is needed.
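As a concrete illustration of (6) and of the classifier built on it, the sketch below (an addition, not the paper's own code) implements a Parzen density estimate with a Gaussian kernel and a classifier that compares the estimated a posteriori probability of class 1 against a decision threshold t. The Gaussian kernel form, the way k and Q enter it, and the priors are illustrative assumptions.

```python
import numpy as np

def parzen_density(x, samples, Q, k):
    """Parzen estimate p_hat(x) = (1/N_D) * sum_i K(k, Q, x_i, x),
    here with a Gaussian kernel of size k under the metric Q."""
    d = samples.shape[1]
    diff = samples - x                                   # x_i - x, one row per sample vector
    dist2 = np.einsum('ij,jk,ik->i', diff, np.linalg.inv(Q), diff)
    norm = (2 * np.pi * k**2) ** (d / 2) * np.sqrt(np.linalg.det(Q))
    return np.mean(np.exp(-0.5 * dist2 / k**2)) / norm

def parzen_classify(x, X1, X2, Q, k, prior1=0.5, t=0.5):
    """Assign x to class 1 if its estimated a posteriori probability
    exceeds the decision threshold t, otherwise to class 2."""
    p1 = parzen_density(x, X1, Q, k) * prior1
    p2 = parzen_density(x, X2, Q, k) * (1.0 - prior1)
    post1 = p1 / (p1 + p2 + 1e-300)                      # guard against underflow
    return 1 if post1 > t else 2
```

With Q = I the kernel reduces to the usual spherical Gaussian window of width k.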
A second-order approximation of the expected value and the variance of the Parzen density estimate can be shown to be (Fukunaga 1990, pp. 258-259)
E\{\hat{p}(x)\} \cong p(x) + \frac{k^{2}}{2}\, p(x)\, \alpha(x) \qquad (7)

and

\mathrm{Var}\{\hat{p}(x)\} \cong \frac{1}{N_D}\, g\bigl(p(x), \alpha(x), Q, p^{2}(x), K^{2}\bigr) , \qquad (8)

where

\alpha(x) = \frac{\mathrm{tr}\{Q\, \nabla\nabla p(x)\}}{p(x)} \qquad (9)
and the precise form of (8) can be found in (Fukunaga 1990, p. 259). Equations (7)-(9) show that only the variance depends on the sample size, being proportional to 1/N_D; it can therefore be reduced by increasing the sample size. The bias, on the contrary, does not depend on the sample size at all and must be minimized by a proper selection of the parameters k and Q. From (9) we can also see that both the bias and the variance depend on the curvature of the density structures. Of course, the bias and variance of the density estimates affect the bias of the classification error.
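A small numerical check (an illustrative addition, using an assumed Gaussian kernel and a standard normal density in d = 4 dimensions) makes the roles of k and N_D in (7)-(9) visible: over repeated trials, the empirical bias of the Parzen estimate at a fixed point follows the kernel size and is essentially unaffected by the sample size, while the empirical variance falls roughly as 1/N_D.

```python
import numpy as np

rng = np.random.default_rng(2)
d = 4
p_true = (2 * np.pi) ** (-d / 2)                    # true N(0, I) density at the origin

def parzen_at_origin(samples, k):
    """Gaussian-kernel Parzen estimate of the density at x = 0."""
    dist2 = np.sum(samples**2, axis=1)
    norm = (2 * np.pi * k**2) ** (d / 2)
    return np.mean(np.exp(-0.5 * dist2 / k**2)) / norm

for k in (0.3, 0.6):                                # two kernel sizes
    for n in (200, 2000):                           # two design-set sizes
        est = np.array([parzen_at_origin(rng.standard_normal((n, d)), k)
                        for _ in range(300)])       # 300 independent trials
        print(f"k={k:.1f}  N_D={n:5d}"
              f"  bias={est.mean() - p_true:+.4f}  var={est.var():.2e}")
```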
After some rather involved mathematics, this effect on the classification error can be expressed as

E\{\Delta\varepsilon\} \cong a_{1} k^{2} + a_{2} k^{4} + a_{3} \frac{k^{-d}}{N_D} + b\,\Delta t , \qquad (10)

where a_1, a_2, a_3 and b are complicated functions depending on α₁(x), α₂(x), Q and p(x), and Δt is the bias of the decision threshold.
Although the constants a_1, a_2 and a_3 are complicated functions, they depend only on the probability structures and on the metric, and are totally independent of the kernel size and of the sample size. When k is small, the term containing a_3 dominates. This term reflects the variance term, Var{p̂(x)}, of the density estimate. When k grows, the terms containing a_1 and a_2 start to dominate. These terms reflect the effect of the bias term of the density estimate, E{p̂(x)}, on the classification error. Both terms should be adjusted properly.
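In practice this adjustment can be made by experimental error estimation, as noted above. The sketch below (an addition, using the same illustrative Gaussian-kernel Parzen rule as in the earlier example) computes the leave-one-out error on the design set over a grid of kernel sizes k, so that the value minimizing the estimated error can be selected; the data, the metric Q and the grid are assumptions made for the demonstration.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 8
X1 = rng.standard_normal((100, d))                  # design set, class 1
X2 = rng.standard_normal((100, d)) + 1.0            # design set, class 2 (shifted mean)
Q = np.eye(d)                                       # Euclidean metric for simplicity

def loo_error(X1, X2, Q, k):
    """Leave-one-out error of an equal-prior Parzen classifier with kernel size k."""
    iQ = np.linalg.inv(Q)

    def density(x, samples):
        dd = samples.shape[1]
        diff = samples - x
        dist2 = np.einsum('ij,jk,ik->i', diff, iQ, diff)
        norm = (2 * np.pi * k**2) ** (dd / 2) * np.sqrt(np.linalg.det(Q))
        return np.mean(np.exp(-0.5 * dist2 / k**2)) / norm

    errors = 0
    for i in range(len(X1)):                        # hold out each class-1 vector in turn
        x = X1[i]
        errors += density(x, np.delete(X1, i, axis=0)) <= density(x, X2)
    for i in range(len(X2)):                        # and each class-2 vector
        x = X2[i]
        errors += density(x, np.delete(X2, i, axis=0)) <= density(x, X1)
    return errors / (len(X1) + len(X2))

for k in (0.1, 0.2, 0.5, 1.0, 2.0, 4.0):            # candidate kernel sizes
    print(f"k={k:4.1f}  leave-one-out error = {loo_error(X1, X2, Q, k):.3f}")
```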
When k is small, the variance term of the density estimate dominates, and the bias of the classification error can thus be reduced by increasing the sample size. However, for larger values of k the bias term of the density estimate dominates, and this effect can no longer be reduced by increasing the sample size. The difficulty is accentuated in higher-dimensional spaces. When the intrinsic dimensionality of the feature space is high, it becomes hopeless to increase the sample size (because the number of samples needed grows so fast; Fukunaga 1990, p. 264), and the only choice for obtaining a reasonable estimate is to increase k. However, we can see from (10) that a