ESTIMATION OF CLASSIFIER’S PERFORMANCE WITH ERROR COUNTING
Jan Heikkilä
Institute of Photogrammetry and Remote Sensing
Helsinki University of Technology
02150 Espoo 15, Finland
e-mail : jani@fifille.hut.fi
Commission III
ABSTRACT
Because of the complexity of error estimation in statistical pattern recognition, error counting methodology has been
extensively used for evaluating the performance of a classifier. The paper compares different error estimation procedures,
both theoretically and experimentally by simulation.
The classical error counting methods, such as resubstitution, hold-out, leave-one-out and bootstrapping, are compared with
the risk averaging methods and with methods using the error-reject trade-off. These second generation estimators have
two clear advantages. First, they are designed to have a smaller variance. Secondly, and more importantly, the test
samples can be unclassified. The risk-averaging methods are therefore economical, and in addition the mislabelling
problems in the test set are avoided. Finally, they are well suited to non-parametric methods and thus provide a possible
solution for a general classification system. The simulation studies show that this is achievable.
1. INTRODUCTION
The problem of estimating the performance of a class-
ifier is a traditional one in statistical pattern recogni-
tion. The performance is usually measured with some
estimate of the error rate, the percentage of errors the
system will produce. Even in the Bayesian case
(known statistical models), the expected error rate is
extremely hard to evaluate, and is known to have an
analytical solution only in some special cases (see e.g.
Devijver & Kittler 1982, chapter 2). In practice the
probability models must be inferred from available
training data. If the true density functions are replaced
by the corresponding estimates, an estimate for the
Bayes error is achieved. This is usually called the
apparent error rate. Even the interpretation of this
measure is ambiguous. It is also known to be optimis-
tically biased (Devijver & Kittler 1982, chapter 10).
The computation of the apparent error rate is extremely
difficult, involving numerical integration over complex
regions, often in high-dimensional spaces. These
difficulties have pushed both practitioners and
theoreticians to look for simpler methods for esti-
mating the error rate of a classifier.
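For reference, these quantities can be written compactly. With class priors P(\omega_i) and class-conditional densities p(x | \omega_i) for an M-class problem, the Bayes error is

\[
  E^{*} \;=\; 1 - \int \max_{1 \le i \le M} P(\omega_i)\, p(x \mid \omega_i)\, dx ,
\]

and substituting the estimated priors and densities into the same expression gives the apparent error rate in the sense used above. The maximum partitions the feature space into decision regions of complex shape, which is the source of the numerical difficulties just mentioned.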
If some finite number of test samples with known
labelling is available, the so called error counting pro-
cedure is a natural choice for error estimation. That is,
the test set is classified and the number of
misclassifications is counted. The properties of this
type of procedure have been extensively discussed in
the literature and a multitude of methods have been
developed. The limited number of labelled samples
usually causes problems. Computation-intensive
methods have been developed to use the small number
of samples cleverly, so that both the design and the test
sets can utilize all available pattern vectors (see e.g.
Fukunaga 1990).
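To make the procedure concrete, the following minimal sketch (in Python, with numpy as the only dependency) counts errors for a hold-out split and for the leave-one-out scheme just mentioned. The 1-NN rule is an illustrative assumption, not the specific classifier evaluated in this paper.

import numpy as np

def one_nn_classify(x, design_X, design_y):
    # Assign x the label of its nearest design sample (1-NN rule).
    distances = np.sum((design_X - x) ** 2, axis=1)
    return design_y[np.argmin(distances)]

def holdout_error(design_X, design_y, test_X, test_y):
    # Error counting: classify the labelled test set and count
    # the relative number of misclassifications.
    errors = sum(one_nn_classify(x, design_X, design_y) != y
                 for x, y in zip(test_X, test_y))
    return errors / len(test_y)

def leave_one_out_error(X, y):
    # Each sample is classified with a design set consisting of
    # all remaining samples, so every pattern vector serves both
    # design and testing.
    n = len(y)
    errors = 0
    for i in range(n):
        mask = np.arange(n) != i
        errors += one_nn_classify(X[i], X[mask], y[mask]) != y[i]
    return errors / n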
There is another alternative for replacing the computa-
tion of the apparent error rate. These methods are
either based on the idea of risk averaging or they use
the relationship between the error and reject rates, the
so called error-reject trade-off (Devivjer & Kittler
1982). These estimators have two advantages over the
traditional error counting procedures. First, their vari-
ance is designed to be smaller (large sample sizes).
Second, they permit to use unclassified samples as test
samples. In remote sensing applications this means
that one can use all the pixels in the satellite image as
a test set. In addition to the great economical benefit,
the problem of outliers in the test set is avoided. A
large test set is particularly essential to accurately eva-
luate a classifier with low error rate (Raudys & Jain
1991).
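A minimal sketch of the risk-averaging idea follows: the conditional error r(x) = 1 - max_i P(\omega_i | x) is averaged over the test samples, so no test labels are needed. The one-dimensional Gaussian class models and their parameters are illustrative assumptions only.

import numpy as np
from scipy.stats import norm

def posteriors(x, priors, means, sigmas):
    # Class posteriors P(omega_i | x) for one-dimensional
    # Gaussian class-conditional densities.
    joint = np.array([p * norm.pdf(x, m, s)
                      for p, m, s in zip(priors, means, sigmas)])
    return joint / joint.sum(axis=0)

def risk_averaging_estimate(samples, priors, means, sigmas):
    # Average the conditional Bayes risk 1 - max_i P(omega_i | x)
    # over the test samples; the labels are never used.
    post = posteriors(np.asarray(samples), priors, means, sigmas)
    return np.mean(1.0 - post.max(axis=0))

# Example: two equiprobable Gaussian classes.
x = np.concatenate([np.random.normal(0.0, 1.0, 500),
                    np.random.normal(2.0, 1.0, 500)])
print(risk_averaging_estimate(x, [0.5, 0.5], [0.0, 2.0], [1.0, 1.0]))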
The difficulty that arises in designing a pattern rec-
ognition system comes from the variety of choices
available. A large number of classification rules, both
parametric and non-parametric, have been developed.
A multitude of optimization criteria for finding a mini-
mum set of features with maximal discrimination capa-
bilities have been established. At least fifteen different
methods are available for empirical error estimation.
The total number of classification methods is thus on
the order of hundreds. One goal of this research pro-
ject is to test by simulation, if a general rule can be
found which can be applied to most cases when no a
priori information about the structure of the probability
density functions is given. A general rule implies the
use of a nonparametric classifier. The optimality of
nonparametric classifiers must be found in practice by
empirical error estimation (Fukunaga 1990, chapter 7).
This error estimation criterion should have as small a
bias (hopefully none) and as small a variance as possible, and
should be robust against outliers.
The paper is divided into five chapters. In chapter two
we will discuss the effect of finite sample sizes
on classifier design. This chapter is a review of the