
In: Stilla U, Rottensteiner F, Paparoditis N (Eds) CMRT09. IAPRS, Vol. XXXVIII, Part 3A/V4 — Paris, France, 3-4 September, 2009 
J. Fabrizio 1,2, M. Cord 1, B. Marcotegui 2
1 UPMC Univ Paris 06 
Laboratoire d’informatique de Paris 6, 75016 Paris, France 
2 MINES ParisTech, CMM - Centre de Morphologie Mathématique, Mathématiques et Systèmes,
35 rue Saint Honoré - 77305 Fontainebleau cedex, France 
KEY WORDS: Urban, Text, Extraction, Localization, Detection, Learning, Classification 
In this article, we present a method for text extraction in images of city scenes. The method is used in the French iTowns project (iTowns ANR project, 2008) to automatically enhance cartographic databases by extracting text from geolocalized pictures of town streets. This task is difficult because: 1. text in this environment varies in shape, size, color, and orientation; 2. pictures may be blurred, as they are taken from a moving vehicle, and text may suffer perspective deformations; 3. all pictures are taken outdoors, in unconstrained conditions (in particular, lighting varies from one picture to the other) and among various objects that can lead to false positives. Consequently, no assumption can be made about the text to be found; the only supposition is that the text is not handwritten. Our process is based on two main steps: a new segmentation method based on a morphological operator, and a classification step based on a combination of multiple SVM classifiers. This article describes the process, measures the efficiency of each step, and illustrates the global scheme on an example.
Automatic text localization in images is a major task in computer vision, with various applications (automatic image indexing, assistance for visually impaired people, optical character reading...). Our work deals with text localization and extraction from images in an urban environment and is part of the iTowns project (iTowns ANR project, 2008). This project has two main goals: 1. allowing a user to navigate freely within the image flow of a city; 2. extracting features automatically from this image flow to enhance cartographic databases and to allow the user to make high-level queries on them (go to a given address, generate relevant hybrid text-image navigation maps for an itinerary, find the location of an orphan image, select the images that contain a given object, etc.). To achieve this, geolocalized sets of pictures are taken every meter. All images are processed off line to extract as much semantic data as possible, and cartographic databases are enhanced with these data. At the same time, each mosaic of pictures is assembled into a complete immersive panorama (Figure 1).
Many studies focus on text detection and localization in images. However, most of them are specific to a constrained context, such as automatic localization of postal addresses on envelopes (Palumbo et al., 1992), license plate localization (Arth et al., 2007), text extraction in video sequences (Wolf et al., 2002), automatic form reading (Kavallieratou et al., 2001) and, more generally, documents (Wahl et al., 1982). In such contexts, strong hypotheses may be asserted (blocks of text, alignments, temporal redundancy for video sequences...). In our context (natural scenes in an urban environment), text comes from various sources (road signs, storefronts, advertisements...). Its extraction is difficult: no hypothesis can be made on the text (style, position, orientation, lighting, perspective deformations...) and the amount of data is huge. Today, we work on 1 TB of data for part of a single district in Paris; next year, more districts will be processed (more than 4 TB). Different approaches already exist for text localization in natural scenes. States of the art are found in (Mancas-Thillou, 2006, Retornaz and Marcotegui, 2007, Jung et al., 2004, Jian Liang et al., 2005). Even if preliminary works exist on natural scenes (Retornaz and Marcotegui, 2007, Chen and Yuille, 2004), no standard solution really emerges, and they do not focus on the urban context.

Figure 2: General principle of our system: Segmentation → Fast filters → Classification → Grouping.
The paper presents our method and is organized as follows: the text localization process is presented and every step is detailed, followed by the evaluation of the main steps. In the last part, results are presented; then comes the conclusion.
The goal of our system is to localize text. Once the localization is performed, text recognition is carried out by an external O.C.R. (though the system may improve the quality of a region beforehand, for example by correcting perspective deformations). Our system is a region-based approach: it starts by isolating letters, then groups them to restore words and text zones. A region-based approach seems to be more efficient; such an approach was ranked first (Retornaz and Marcotegui, 2007) during the ImagEval campaign (ImagEval, 2006). Our process is composed of a cascade of filters (Figure 2). It segments the image, then analyses each region to determine whether it corresponds to text or not. The first selection stages eliminate part of the non-text regions but try to keep as many text regions as possible (at the price of many false positives). At the end, detected regions that are close to other text regions are grouped together, and isolated text regions are discarded.
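The cascade above can be sketched as follows, starting from candidate regions produced by the segmentation step. This is a minimal illustration only: the geometric thresholds, the stand-in classifier rule, and all function names are hypothetical assumptions, not the actual system (which uses a morphological segmentation and SVM classifiers).

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Region:
    bbox: Tuple[int, int, int, int]  # (x, y, w, h) in pixels

def fast_filter(r: Region) -> bool:
    # Cheap, permissive geometric test: discard regions whose size is
    # implausible for a letter, while keeping as many text regions as
    # possible (false positives are acceptable at this stage).
    _, _, w, h = r.bbox
    return 2 <= w <= 500 and 2 <= h <= 500

def classify(r: Region) -> bool:
    # Stand-in for the SVM-based text/non-text decision: here, a
    # trivial aspect-ratio rule, purely for illustration.
    _, _, w, h = r.bbox
    return 0.1 <= w / h <= 10

def group(regions: List[Region], max_gap: int = 20) -> List[List[Region]]:
    # Cluster regions that are horizontally close (letters of a word);
    # isolated regions (clusters of size 1) are discarded.
    regions = sorted(regions, key=lambda r: r.bbox[0])
    clusters: List[List[Region]] = []
    for r in regions:
        if clusters:
            last = clusters[-1][-1]
            if r.bbox[0] - (last.bbox[0] + last.bbox[2]) <= max_gap:
                clusters[-1].append(r)
                continue
        clusters.append([r])
    return [c for c in clusters if len(c) > 1]

def detect_text(candidates: List[Region]) -> List[List[Region]]:
    # Cascade: fast filters, then classification, then grouping.
    kept = [r for r in candidates if fast_filter(r) and classify(r)]
    return group(kept)
```

For example, two adjacent letter-sized regions survive the cascade as one text zone, while a distant third region is discarded as isolated.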