In: Stilla U, Rottensteiner F, Paparoditis N (Eds) CMRT09. IAPRS, Vol. XXXVIII, Part 3/W4, Paris, France, 3-4 September 2009
TEXT EXTRACTION FROM STREET LEVEL IMAGES
J. Fabrizio (1,2), M. Cord (1), B. Marcotegui (2)
(1) UPMC Univ Paris 06, Laboratoire d'informatique de Paris 6, 75016 Paris, France
(2) MINES ParisTech, CMM - Centre de Morphologie Mathématique, Mathématiques et Systèmes,
35 rue Saint Honoré, 77305 Fontainebleau cedex, France
KEY WORDS: Urban, Text, Extraction, Localization, Detection, Learning, Classification
ABSTRACT
In this article, we present a method for text extraction from images of city scenes. The method is used in the
French iTowns project (iTowns ANR project, 2008) to automatically enhance cartographic databases by extracting text
from geolocalized pictures of town streets. The task is difficult because: 1. text in this environment varies in shape,
size, color, orientation...; 2. pictures may be blurred, as they are taken from a moving vehicle, and text may show
perspective deformations; 3. all pictures are taken outdoors, in unconstrained conditions (in particular, lighting varies
from one picture to the next), among various objects that can lead to false positives. We therefore cannot make any
assumption about the text we are looking for; our only supposition is that the text is not handwritten. Our process is
based on two main steps: a new segmentation method based on a morphological operator, and a classification step based
on a combination of multiple SVM classifiers. This article describes the process, measures the efficiency of each step,
and illustrates the global scheme on an example.
1 INTRODUCTION
Automatic text localization in images is a major task in
computer vision. Its applications are various (automatic
image indexing, assistance for visually impaired people,
optical character reading...). Our work deals with text
localization and extraction from images in an urban
environment and is part of the iTowns project (iTowns ANR
project, 2008). This project has two main goals: 1. allowing
a user to navigate freely within the image flow of a city;
2. extracting features automatically from this image flow
to enhance cartographic databases and to allow the user to
make high-level queries on them (go to a given address,
generate relevant hybrid text-image navigation maps
(itinerary), find the location of an orphan image, select
the images that contain a given object, etc.). To achieve
this, geolocalized sets of pictures are taken every meter.
All images are processed off line to extract as much
semantic data as possible, and cartographic databases are
enhanced with these data. At the same time, each mosaic
of pictures is assembled into a complete immersive
panorama (Figure 1).
Many studies focus on text detection and localization in
images. However, most of them are specific to a constrained
context such as automatic localization of postal addresses
on envelopes (Palumbo et al., 1992), license plate
localization (Arth et al., 2007), text extraction in video
sequences (Wolf et al., 2002), automatic form reading
(Kavallieratou et al., 2001) and, more generally, documents
(Wahl et al., 1982). In such contexts, strong hypotheses
may be asserted (blocks of text, alignments, temporal
redundancy for video sequences...). In our context (natural
scenes in an urban environment), text comes from various
sources (road signs, storefronts, advertisements...). Its
extraction is difficult: no hypothesis can be made on the
text (style, position, orientation, lighting, perspective
deformations...) and the amount of data is huge. Today, we
work on 1 TB for a part of a single district in Paris. Next
year, more districts will be processed (more than 4 TB).

Figure 2: General principle of our system: Segmentation → Fast filters → Classification → Grouping.

Different approaches already exist for text localization in
natural scenes. Surveys of the state of the art can be found
in (Mancas-Thillou, 2006, Retornaz and Marcotegui, 2007,
Jung et al., 2004, Liang et al., 2005). Even if preliminary
works exist on natural scenes (Retornaz and Marcotegui,
2007, Chen and Yuille, 2004), no standard solution really
emerges, and they do not focus on the urban context.
This paper presents our method and is organized as follows:
the text localization process is presented and each step is
detailed, followed by an evaluation of the main steps. In
the last part, results are presented, and we end with a
conclusion.
2 SEGMENTATION BASED STRATEGY
The goal of our system is to localize text. Once the
localization is performed, text recognition is carried out
by an external OCR (though the system may first improve the
quality of the region, for example by correcting perspective
deformations). Our system is a region-based approach: it
starts by isolating letters, then groups them to restore
words and text zones. A region-based approach seems to be
more efficient; such an approach was ranked first (Retornaz
and Marcotegui, 2007) during the ImagEval campaign
(ImagEval, 2006). Our process is composed of a cascade of
filters (Figure 2). It segments the image, and each region
is then analysed to determine whether it corresponds to
text or not. The first selection stages eliminate part of
the non-text regions while trying to keep as many text
regions as possible (at the price of many false positives).
At the end, detected regions that are close to other text
regions are grouped together; isolated text regions are
discarded.
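
To make this cascade concrete, the sketch below walks through the four stages in Python. It is a minimal illustration under stated assumptions, not the authors' implementation: the global threshold stands in for the paper's morphological segmentation, the area filter and shape features are invented for the example, and clf is assumed to be a pre-trained binary classifier (e.g. a scikit-learn SVM; the paper combines several SVM classifiers).

import numpy as np
from skimage.measure import label, regionprops

def localize_text(image, clf, min_gap=20.0):
    """Return groups of regions likely to be words or text zones.

    image: 2-D grayscale array; clf: pre-trained classifier that
    predicts 1 for text regions (illustrative stand-ins, not the
    components of the actual system).
    """
    # 1. Segmentation: a crude global threshold stands in for the
    #    morphological segmentation used in the paper.
    binary = image < image.mean()
    regions = regionprops(label(binary))

    # 2. Fast filters: cheap geometric tests discard obvious
    #    non-letters while keeping nearly all text regions
    #    (the thresholds here are illustrative).
    candidates = [r for r in regions if 10 < r.area < 10000]

    # 3. Classification: simple shape features fed to the classifier.
    if candidates:
        feats = np.array([[r.area, r.eccentricity, r.solidity, r.extent]
                          for r in candidates])
        keep = [r for r, t in zip(candidates, clf.predict(feats)) if t == 1]
    else:
        keep = []

    # 4. Grouping: regions whose centroids lie within min_gap pixels
    #    are merged into zones (crude single-pass linkage); groups of
    #    a single region are dropped, since isolated text regions are
    #    discarded in the paper's scheme.
    groups = []
    for r in keep:
        for g in groups:
            if any(np.hypot(r.centroid[0] - s.centroid[0],
                            r.centroid[1] - s.centroid[1]) < min_gap
                   for s in g):
                g.append(r)
                break
        else:
            groups.append([r])
    return [g for g in groups if len(g) > 1]

The point of such a cascade is that each stage is cheaper than the next is selective: the fast filters may keep many false positives, because the classifier and the final grouping remove them later.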