Variants of Random Forests have been applied in time-critical applications
(Lepetit and Fua, 2006), SLAM (Ozuysal et al., 2010), and single-instance
tracking (Saffari et al., 2009). The latter work suggests the use of Random
Forests in an on-line learning framework. It has been shown that
the classifier may be learnt incrementally with results compara-
ble to offline training. Using Random Forests instead of boosting has
several advantages: on one hand, they are able to learn multiple classes
within a single tree; on the other, they are more robust against label
noise, i.e., single mis-associations vanish in the abundance of trees.
The remainder of the paper is divided into sections on the classification
strategy (Section 3) and the application to data association (Section 4).
We present experimental results in Section 5 and conclude in Section 6.
3 CLASSIFICATION STRATEGY FOR SUPPORTING
DATA ASSOCIATION
For data association we apply a classification strategy that models each
tracked person as an individual class. We apply a variant of Randomized
Forests that selects both the feature index and the split threshold at
random, and is thus referred to as an Extremely Randomized Forest (ERF).
In the following, we explain the principles of Randomized Forests and our
strategy for using them as instance-specific classifiers.
3.1 Randomized Forests
Randomized Forests are ensembles of binary decision trees, each of which
is grown on a different subset of the given training data. Each node in a
tree is assigned a test function consisting of an index pointing to an
element of the feature vector and a threshold that decides whether a
sample passed through the tree propagates to the left or the right child.
The test function is chosen as the one that scores best, according to a
quality measure, among a randomly generated set of test functions with
respect to the index. If the threshold is also chosen at random, the
Random Forest is referred to as an Extremely Randomized Forest (Geurts et
al., 2006).
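To illustrate the split selection just described, the following is a
minimal Python sketch; the function names are ours, and information gain
stands in for the quality measure, which the description above leaves
unspecified:

    import numpy as np

    def draw_candidate_tests(n_features, n_candidates, feat_min, feat_max, rng):
        """Draw random test functions: a feature index and, since thresholds
        are also randomized in an ERF, a threshold drawn uniformly from the
        observed range of that feature."""
        indices = rng.integers(0, n_features, size=n_candidates)
        thresholds = rng.uniform(feat_min[indices], feat_max[indices])
        return list(zip(indices, thresholds))

    def select_test(X, y, candidates):
        """Keep the candidate scoring best under a quality measure;
        information gain is used here as a stand-in."""
        def entropy(labels):
            _, counts = np.unique(labels, return_counts=True)
            p = counts / counts.sum()
            return float(-(p * np.log2(p)).sum())

        best_test, best_gain = None, -np.inf
        for idx, thr in candidates:
            go_left = X[:, idx] < thr
            n_l, n_r = go_left.sum(), (~go_left).sum()
            if n_l == 0 or n_r == 0:
                continue  # degenerate split, all samples on one side
            gain = entropy(y) - (n_l * entropy(y[go_left])
                                 + n_r * entropy(y[~go_left])) / len(y)
            if gain > best_gain:
                best_test, best_gain = (idx, thr), gain
        return best_test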
For our work we employed the Online-Random Forests (ORF)
of Saffari et al. (2009), where extremely randomised trees are
used for online learning. Each leaf node in a tree stores the class
conditional probability density of all samples that have reached
that leaf either during offline-training or in the update-phase of
online learning. A node splits if a purity constraint is violated,
provided that a sufficient number of samples has been seen so far and the
maximum depth has not yet been reached. For classification,
a sample s is passed through each tree t of all T trees in the forest and
obtains the probability p_t(k|s) of class k ∈ Y that is stored in the leaf
node the sample reaches. The classification result over all trees in the
forest amounts to the normalised sum of probabilities,

p(k|s) = \frac{1}{T} \sum_{t=1}^{T} p_t(k|s)    (1)
Class assignment then amounts to finding the maximum of (1) among all
available classes in Y:

C(s) = \arg\max_{k \in Y} p(k|s)    (2)
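A minimal sketch of Eqs. (1) and (2) in Python; leaf_distribution is a
hypothetical per-tree lookup standing in for the stored class-conditional
distribution of the leaf the sample reaches:

    import numpy as np

    def classify(trees, s):
        """Eq. (1): normalised sum of per-tree class probabilities;
        Eq. (2): assign the class with the maximum posterior.
        Each tree is assumed to expose leaf_distribution(s), returning
        the stored distribution p_t(k|s) as a vector over Y."""
        p = np.mean([t.leaf_distribution(s) for t in trees], axis=0)  # Eq. (1)
        return int(np.argmax(p)), p                                   # Eq. (2)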
Adaptivity of the classifier to changing appearance of the targets
is given by the ability of the ORF to discard entire trees. As crite-
rion for discarding a tree, the Out-Of-Bag Error (OOBE) estimate can be
assessed, using the fact that individual trees are not trained on the
entire training data due to the Poisson process of on-line bagging
(bootstrap aggregating); see Saffari et al. (2009)
and Oza and Russell (2001) for details.
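A sketch of the on-line bagging update and the out-of-bag bookkeeping
following the scheme of Oza and Russell (2001); tree.update and
tree.record_oobe are hypothetical methods standing in for the ORF
internals:

    import numpy as np

    def online_bagging_update(trees, sample, label, rng):
        """Each tree sees the new sample k ~ Poisson(1) times. Trees with
        k = 0 treat the sample as out-of-bag, so it can feed their
        Out-Of-Bag Error (OOBE) estimate used for tree discarding."""
        for tree in trees:
            k = rng.poisson(1.0)
            if k == 0:
                tree.record_oobe(sample, label)  # hypothetical OOBE bookkeeping
            for _ in range(k):
                tree.update(sample, label)       # hypothetical incremental update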
Figure 1: Regions used for training, relative to a detection in the image
domain (the white region indicates the foreground belonging to the
detected object).
3.2 Generation of the Feature Vector
For every sample, features are calculated from the hue, saturation and
intensity (HSI) values of all foreground pixels inside the region given by
a detection. Since changes in the articulation of body parts are reflected
in the distribution of HSI values inside the detected region, we
compensate for that effect by calculating row-wise mean values of HSI as
features. We argue that this averaging is reasonable, since the variance
of a person's appearance along a horizontal line is negligible and
articulation mainly affects the distribution of HSI values column-wise.
The feature
vector is resized to 100 entries per channel, i.e. to 300 elements
per sample. The elements of the feature vector are mean centered
and normalised to a standard deviation of one.
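A sketch of this feature computation under stated assumptions: hsi_patch
already holds the HSI values of the detection window, fg_mask marks the
foreground pixels, and the normalisation is applied per vector (the text
does not state over which population mean and standard deviation are
computed):

    import numpy as np

    def feature_vector(hsi_patch, fg_mask, rows_out=100):
        """Row-wise HSI means over foreground pixels, resampled to 100
        entries per channel (300 elements), mean-centred and scaled to
        unit standard deviation."""
        n_rows = hsi_patch.shape[0]
        row_means = np.zeros((n_rows, 3))
        for r in range(n_rows):
            fg = hsi_patch[r][fg_mask[r]]      # foreground pixels of row r
            if len(fg):
                row_means[r] = fg.mean(axis=0)  # mean H, S, I of the row
        # resample each channel to a fixed length of 100 entries
        x_old = np.linspace(0.0, 1.0, n_rows)
        x_new = np.linspace(0.0, 1.0, rows_out)
        f = np.concatenate([np.interp(x_new, x_old, row_means[:, c])
                            for c in range(3)])
        return (f - f.mean()) / f.std()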
3.3 Training and Updating
The purpose of the classifier for data association is to gather
knowledge about the appearance of a tracked object, in order to
assess the similarity when applied to a detection. The classifier is
hence designed to learn one specific class for each object being
actively tracked, and one rejection class, which enables the rejection of
an association. If a detection has been triggered by a
target for which a class already exists, the response of the classi-
fier when tested on that detection should be high for that class. If
the detection stems from a target that has not been trained yet, the
distribution of class responses should be flat. The feature vectors
are calculated from regions inside the detection windows. If a
new object class is introduced, an additional set of six samples is
derived from the original position of the detection. Four samples
are calculated by shifting the detection by one pixel column- and
row-wise, and two further samples are calculated by resizing the detected
region by plus/minus one pixel. The additional samples
make the classifier more robust against the localisation uncer-
tainty of the detector and allow the Forest to apply bagging from
a pool of different features. We also train a rejection class that al-
lows the omission of an association if the rejection class obtains
the maximum likelihood. The green rectangle in Figure 1 marks an original
detection, represented by the white foreground region, from which negative
samples are extracted by shifting (red rectangles). The rejection class is
trained on four adjacent
regions that are defined in the neighbourhood of the detection,
aligned as depicted by the configuration of the red rectangles in
Figure 1.
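A sketch of how the seven training samples per new class could be derived;
box is (x, y, w, h) in pixels, and extract is a hypothetical helper that
crops the region from the image and mask and calls the feature computation
above. The rejection-class samples would be built analogously from the
four neighbouring regions of Figure 1, whose exact offsets are not
specified here:

    def positive_samples(image_hsi, fg_mask, box, extract):
        """Original detection plus six variants: four one-pixel shifts
        (column- and row-wise) and two plus/minus-one-pixel resizes."""
        x, y, w, h = box
        boxes = [
            (x, y, w, h),                        # original detection
            (x - 1, y, w, h), (x + 1, y, w, h),  # shifted one pixel column-wise
            (x, y - 1, w, h), (x, y + 1, w, h),  # shifted one pixel row-wise
            (x, y, w + 1, h + 1),                # enlarged by one pixel
            (x, y, w - 1, h - 1),                # shrunk by one pixel
        ]
        return [extract(image_hsi, fg_mask, b) for b in boxes]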
If a new tracking object is initialised or if one is finished (see
Section 4.3), the forest is retrained from scratch. This re-
quires considerable computational effort but is reasonable, as the