International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences, Volume XL-5/W2, 2013
XXIV International CIPA Symposium, 2 — 6 September 2013, Strasbourg, France
to match an input image to sample images of every class using
densely computed SIFT descriptors. Since legends and depicted
local coin images are highly discriminative, techniques dedicated
to their recognition were developed. The proposed legend recognition
method computes scores for a set of densely sampled candidate
character locations and combines them into meaningful lexicon
words using dynamic programming. Finally, a coin image
recognition providing a coarse-grained classification based on the
depicted images using Bags of Visual Words (BoVW) was re-
searched. Moreover, a fruitful combination of the two former
methods was implemented and showed an increased classifica-
tion accuracy.
4.1 Global Image Matching
Global image matching has the advantage that it does not rely
on machine learning techniques, and thus no training data is required.
This is a considerable advantage when working with ancient coins,
because museums usually prefer to hold a few examples of many
different types in order to give their visitors a broader
overview. Thus, only a few images per type are available, which
impedes the use of machine learning techniques. The lack of
training is compensated for by the flexible matching model for dense
correspondence that can handle the spatial variations of local struc-
tures between coins of the same class. The matching algorithm is
reminiscent of the SIFT flow method (Liu et al., 2011). For every
pixel in the input image, SIFT features are computed and form a
dense field of features. This allows the computation of pixel-to-pixel
correspondences between two images. The Euclidean distance
between two SIFT features is taken as their matching cost.
The matching of the entire SIFT field can be described as
an energy term, which compares every SIFT descriptor of one
field with the descriptors in the respective pixel neighborhood
of another field. Minimizing this term yields the final matching
score for the two images. However, minimizing this function for
large-scale images is a complex operation. In order to accelerate
the classification process, a coarse-to-fine approach was proposed
(Zambanini and Kampel, 2013a). That is, SIFT flow matching is
performed in multiple steps with each subsequent step operating
on images in a higher resolution than the previous one. For each
step k, the acceptance rate defining how many images are passed
on to be inspected at a higher resolution in the classification stage
can be adjusted with a parameter A_k, ranging from 1% to 100%.
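As a rough illustration of the matching cost described above, the following Python sketch sums, for every descriptor of one field, the Euclidean distance to its best match within a small pixel neighborhood of the other field. It is a simplification under stated assumptions and not the SIFT flow energy itself: the function names, the `radius` parameter, and the nested lists standing in for real SIFT descriptors are hypothetical, and any smoothness constraints as well as the coarse-to-fine scheme are omitted.

```python
import math


def descriptor_distance(a, b):
    """Euclidean distance between two descriptors (sequences of floats)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def matching_cost(field_a, field_b, radius=1):
    """Sum of best-match distances from field_a into field_b.

    field_a and field_b are equally sized grids (lists of rows) holding one
    descriptor per pixel. For each pixel of field_a, only descriptors within
    a (2 * radius + 1)^2 window around the same position in field_b are
    compared. Assumption: data term only, no smoothness term.
    """
    height, width = len(field_a), len(field_a[0])
    total = 0.0
    for y in range(height):
        for x in range(width):
            best = math.inf
            for ny in range(max(0, y - radius), min(height, y + radius + 1)):
                for nx in range(max(0, x - radius), min(width, x + radius + 1)):
                    d = descriptor_distance(field_a[y][x], field_b[ny][nx])
                    best = min(best, d)
            total += best
    return total


# Toy fields with 1-D "descriptors"; field_b is field_a shifted by one pixel.
field_a = [[[float(x + 10 * y)] for x in range(4)] for y in range(3)]
field_b = [[row[0]] + row[:-1] for row in field_a]
print(matching_cost(field_a, field_b, radius=0))
print(matching_cost(field_a, field_b, radius=1))
```

A larger neighborhood tolerates the one-pixel shift, so the second cost is lower than the first; this is the kind of spatial variation between coins of the same class that the flexible matching model is meant to absorb.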
The experiments were carried out on 180 images of the reverse
side of coins belonging to 60 different classes. Each class com-
prises 3 different images and the coarse-to-fine granularity was
set to four steps. Moreover, two different test/training set config-
urations were used; the first option uses only one reference image
while the other configuration uses two reference images per class.
The best result of 83.3% classification rate is achieved when two
reference images per class are used and 70% of all images are re-
jected in the first subselection step as well as 50% in the second
and third steps. The same classification rate is achieved when
no subselection is performed during matching, but the computation
then takes approximately 8 times longer.
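The subselection schedule reported above (70% of the candidates rejected in the first step, 50% in each of the second and third steps, followed by a final decision at the highest resolution) can be sketched as a simple cascade. This is a minimal sketch under stated assumptions: `coarse_to_fine_classify`, `cost_at_level`, and `toy_cost` are hypothetical names, and the toy cost on integers merely stands in for SIFT flow matching at increasing image resolutions.

```python
import math


def coarse_to_fine_classify(query, references, cost_at_level, acceptance_rates):
    """Return the label of the best reference after a coarse-to-fine cascade.

    references: list of (label, image) pairs.
    cost_at_level(query, image, level): matching cost, lower is better.
    acceptance_rates: fraction of candidates kept after each non-final step.
    """
    candidates = list(references)
    for level, rate in enumerate(acceptance_rates):
        ranked = sorted(candidates,
                        key=lambda ref: cost_at_level(query, ref[1], level))
        # The small epsilon guards against floating-point round-up in ceil.
        keep = max(1, math.ceil(rate * len(ranked) - 1e-9))
        candidates = ranked[:keep]
    final_level = len(acceptance_rates)
    best = min(candidates,
               key=lambda ref: cost_at_level(query, ref[1], final_level))
    return best[0]


def toy_cost(query, image, level):
    """Hypothetical stand-in: compare integers at a level-dependent precision."""
    step = 2 ** (3 - level)  # level 3 is the full "resolution"
    return abs(query // step - image // step)


# 20 toy reference "images" (integers), labeled by their index.
references = [(i, 10 * i) for i in range(20)]

# Acceptance rates 30%, 50%, 50% correspond to rejecting 70%, 50%, 50%.
print(coarse_to_fine_classify(73, references, toy_cost, [0.3, 0.5, 0.5]))
```

With these rates, the most expensive final step only inspects a small fraction of the references, which is where the reported speed-up comes from; setting every rate to 1.0 disables the subselection.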
4.2 Legend Recognition
The legend recognition method (Kavelar et al., 2012) uses object
recognition techniques rather than standard OCR methods, since
OCR relies on successful binarization for the separation of text
and background. Coin legends have the same color as the rest of
the coin, thus intensity changes result only from the coin surface
relief structure. Therefore, effective binarization is not possible.
Instead, the appearance of the individual letters occurring in the
coin legends is taught to a Support Vector Machine (SVM) using
                   | n=5   | n=10  | n=20  | n=30
Best, initially    | 31.7% | 21.7% | 17.8% | 13.9%
Top 3, initially   | 62.2% | 39.4% | 21.7% | 18.3%
Best, re-scored    | 53.3% | 42.8% | 34.4% | 28.9%
Top 3, re-scored   | 81.1% | 66.1% | 57.8% | 51.7%
Table 1: Legend recognition results. Top 3 indicates that the cor-
rect word is among the three most probable words found.
a single SIFT descriptor (Lowe, 2004) (see Fig. 4 (a)). The leg-
end words considered comprise 18 different characters which the
SVM has to distinguish. The training process uses 50 manually
segmented images of 100 x 100 pixels per class. The recognition
process works as illustrated in Fig. 4 (b). The input image is
first scaled down to a standardized size of 348 x 348 pixels to en-
sure an approximately equal font size for all legends. This spares
the computation of SIFT features in various scales. Next, in the
keypoint extraction step, regions of interest are detected with an
entropy filter. From these regions, a list of candidate character
locations (CCLs) is densely sampled and passed to the character
recognition step, which computes a SIFT descriptor for each of
the CCLs and tests it against all 18 SVMs to receive 18 scores
for every CCL. Hence, this step leads to a dense likelihood map
for every letter. In the final word recognition step, pictorial struc-
tures (Felzenszwalb and Huttenlocher, 2005) are used to generate
word hypotheses for all legend words provided via a lexicon. To
increase the confidence in the hypotheses, they are re-scored by
computing SIFT descriptors for fixed orientations based on the
CCLs in the individual hypothesis. Finally, words that received
scores above a certain threshold are rejected. Thus, for an input
image and a lexicon containing the possible legend words, the
legend recognition algorithm outputs a list of legend words ordered
by their scores. The classification rate depends on the lexicon
size n used and on whether the word hypotheses are re-scored or
not. Table 1 gives an overview of the results achieved for 180 coin
images. Best indicates the cases where the word with the lowest
score (i.e., the best match) equals the ground truth, while Top 3
indicates that the correct word is among the three best matching
words detected.
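The word recognition step can be illustrated with a strongly simplified, one-dimensional version of the idea: given per-character costs at an ordered list of candidate locations, dynamic programming finds the cheapest left-to-right placement of a word's letters, and the lexicon is then ranked by these costs (lower is better, as in Table 1). The sketch below is an assumption-laden illustration, not the published pictorial structures model: `word_cost`, `rank_lexicon`, `expected_gap`, and `gap_weight` are hypothetical names, the locations lie on a line rather than on the coin border, and the costs are made up rather than produced by SVMs.

```python
import math


def word_cost(word, char_costs, expected_gap=1, gap_weight=1.0):
    """Minimal cost of placing the letters of `word` at increasing locations.

    char_costs maps each character to a list of costs, one per candidate
    location (lower is better). Consecutive letters are penalized for
    deviating from the expected spacing between them.
    """
    n = len(next(iter(char_costs.values())))
    # best[i]: cheapest placement of the letters so far, last letter at i.
    best = list(char_costs[word[0]])
    for ch in word[1:]:
        nxt = [math.inf] * n
        for i in range(n):
            for j in range(i):  # the previous letter must lie before i
                gap_penalty = gap_weight * abs((i - j) - expected_gap)
                cand = best[j] + gap_penalty + char_costs[ch][i]
                if cand < nxt[i]:
                    nxt[i] = cand
        best = nxt
    return min(best)


def rank_lexicon(lexicon, char_costs, **kwargs):
    """Order the lexicon words by their placement cost, best match first."""
    return sorted(lexicon, key=lambda w: word_cost(w, char_costs, **kwargs))


# Made-up costs for six locations: "ROMA" fits perfectly at locations 1 to 4.
char_costs = {
    "R": [1.0, 0.0, 1.0, 1.0, 1.0, 1.0],
    "O": [1.0, 1.0, 0.0, 1.0, 1.0, 1.0],
    "M": [1.0, 1.0, 1.0, 0.0, 1.0, 1.0],
    "A": [1.0, 1.0, 1.0, 1.0, 0.0, 1.0],
}
print(rank_lexicon(["AMOR", "ROMA", "MARO"], char_costs))
```

Because every lexicon word is scored against the same cost maps, a larger lexicon offers more opportunities for a wrong word to obtain a low cost, which is consistent with the decreasing rates for larger n in Table 1.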
4.3 Coin Image Recognition
Just like the legend, the coin image provides a highly discrim-
inative feature. For Roman Republican coins, the coin image,
such as a she-wolf or a dolphin, is depicted on the reverse side,
while the obverse usually shows the head of a god. Since mul-
tiple coin types share the same coin images, a fine-grained clas-
sification cannot be performed based solely on the coin image.
However, it can serve as a preselection step, which prunes the set
of classes. The proposed algorithm (Anwar et al., 2013) is based
on a Bag of Visual Words (BoVW) algorithm. For an input im-
age, SIFT features are densely extracted at a constant pixel stride.
The computed features are quantized using k-means clustering;
the number of clusters k determines the size of the visual vocab-
ulary. To describe a novel image, the computed SIFT features are
mapped to their closest visual word in Euclidean space. Finally,
a histogram based on the number of features assigned to the in-
dividual words is constructed and describes the depicted symbol.
This process, however, does not consider spatial relations of the
individual parts of a symbol. Thus, various tiling patterns capable
of capturing spatial relationships (rectangular, circular, log-polar)
were evaluated. The best classification rate of 90.6% is achieved
with a vocabulary size of 100, a pixel stride of 5 and rectangular
tiling.
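The BoVW pipeline described above (k-means vocabulary, nearest-word assignment, histogram, optional rectangular tiling) can be sketched in a few lines of plain Python. This is a minimal sketch under stated assumptions: all function names are hypothetical, the k-means initialization simply takes the first k points, the 2-D toy points stand in for densely extracted SIFT descriptors, and raw counts are used without normalization.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(point, centers):
    """Index of the closest center (visual word) in Euclidean space."""
    return min(range(len(centers)),
               key=lambda i: squared_distance(point, centers[i]))


def kmeans(points, k, iterations=10):
    """Plain Lloyd iterations; the first k points are the initial centers."""
    centers = [list(p) for p in points[:k]]
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[nearest(p, centers)].append(p)
        for i, group in enumerate(groups):
            if group:  # keep the old center if a cluster runs empty
                centers[i] = [sum(col) / len(group) for col in zip(*group)]
    return centers


def bovw_histogram(features, vocabulary):
    """Count how many features fall onto each visual word."""
    hist = [0] * len(vocabulary)
    for f in features:
        hist[nearest(f, vocabulary)] += 1
    return hist


def tiled_histogram(located_features, vocabulary, width, height, grid=2):
    """Concatenate one histogram per tile of a grid x grid rectangular tiling.

    located_features: iterable of (x, y, descriptor) tuples.
    """
    k = len(vocabulary)
    hist = [0] * (grid * grid * k)
    for x, y, desc in located_features:
        tx = min(int(x * grid / width), grid - 1)
        ty = min(int(y * grid / height), grid - 1)
        hist[(ty * grid + tx) * k + nearest(desc, vocabulary)] += 1
    return hist


training = [[0, 0], [10, 10], [0, 1], [10, 11], [1, 0], [11, 10]]
vocabulary = kmeans(training, k=2)
print(bovw_histogram([[0, 0], [1, 1], [10, 10]], vocabulary))
print(tiled_histogram([(1, 1, [0, 0]), (8, 8, [10, 10])], vocabulary, 10, 10))
```

The plain histogram discards where in the image a visual word occurred, whereas the tiled variant keeps one histogram per image region and thereby retains coarse spatial information, at the price of a descriptor that is grid x grid times longer.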