International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences, Vol XXXV, Part B3. Istanbul 2004
2.1 Where and What in Vision
A visual scene, perceived by a human or an animal, is so
complex that it is not possible to perceive the whole scene as
one unit. Such a holistic perception would make the scene
unique, rendering associations to other scenes and other
perceptions impossible. Thus it is necessary to have a
mechanism in the perceptual system that breaks down or
fragments the scene into a more appropriate form of
representation.
It has been found (Mishkin et al., 1983) that humans and higher
animals represent visual information in at least two important
subsystems: the where-system and the what-system. The where-
system processes only the location of an object in the scene. It
does not represent the kind of object; that is the task of the
what-system. The two systems work independently of each
other and never converge to one common representation
(Goldman-Rakic, 1993). Physiologically, they are separated
throughout the entire cortical process of visual analysis.
The where-system builds up a spatial relation map, where no
information about the form of the object is represented. This
form of representation can be used for variable binding in
collaboration with the what-system. The what-system
represents categories of objects, without any information about
their spatial location.
In natural environments, a significant problem is to attend to a
stimulus of interest. The where-system is a part of the attention
process, since the where-system supplies information about
where to foveate in the scene. The fovea in the retina is
exclusively concerned with form perception, not with the
location of the objects in the scene.
2.2 Attention
When the brain processes a visual scene, some of the elements
of the scene are put in focus by various attention mechanisms
(Posner, 1990). Attention is clearly an
important property for identification and for learning in
biological as well as artificial systems, and much research
focuses on the attention mechanisms necessary for grasping spatial
relations. In a natural scene, one of the basic problems is to
locate and identify objects and their parts.
2.3 Saccades
When the brain analyses a visual scene, it must combine the
representations obtained from different domains. One
hypothesis underlying the simulations states that attention
shifts from domain to domain in a sequential way (Crick,
1984). Since information about the form and other features of
particular objects can be obtained only when the object is
foveated, different objects can be attended to only through
saccadic movements of the eye.
These rapid eye movements, which are made at the rate of
about three per second, orient the high-acuity foveal region of
the eye over targets of interest in a visual scene. The
characteristic properties of saccadic eye movements (or
saccades) have been well studied (Carpenter, 1983). The high
velocity of saccades, reaching up to 700° per second for large
movements, serves to minimize the time in flight, so that most
of the time is spent fixating chosen targets.
Saccades are known to be ballistic; that is, their final
location is computed prior to making the movement, and the
trajectory of the movement is uninterrupted by incoming visual
signals. Furthermore, owing to the structure of the retina, the
central 1.5° of the visual field is represented with a visual
resolution that is many times greater than that of the periphery.
Saccades serve the important function of bringing the
high-resolution foveal region onto targets of interest in the visual
scene.
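To give a sense of scale, the extent on a display that falls within a given visual angle can be estimated from the viewing distance using basic trigonometry. The following is a minimal sketch in Python; the function name is hypothetical, and the viewing distance and pixel density are illustrative assumptions rather than values taken from this study.

```python
import math

def visual_angle_to_size(angle_deg, viewing_distance):
    """Linear extent subtended by angle_deg at viewing_distance.

    The result is in the same unit as viewing_distance.
    """
    return 2.0 * viewing_distance * math.tan(math.radians(angle_deg) / 2.0)

# Assumed setup (illustrative only): observer 60 cm from the display,
# display resolution of 40 pixels per centimetre.
VIEWING_DISTANCE_CM = 60.0
PIXELS_PER_CM = 40.0

foveal_extent_cm = visual_angle_to_size(1.5, VIEWING_DISTANCE_CM)
foveal_extent_px = foveal_extent_cm * PIXELS_PER_CM
print(f"1.5 deg covers about {foveal_extent_cm:.2f} cm ({foveal_extent_px:.0f} px)")
```

Under these assumed viewing conditions the central 1.5° corresponds to roughly 1.6 cm on the display, which illustrates why many saccades are needed to inspect a full image at high resolution.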
Early eye movement studies suggested that the primary role of
saccades might be to compensate for the lack of resolution over
the visual field by “painting” an image into an internal
memory. It was proposed that the saccadic movements and
their resultant fixations allowed the formation of a visual-
motor memory (“scan path”) that could be used for encoding
objects and scenes (Noton and Stark, 1971). However, a
number of studies, starting from Yarbus’ classical work
(Yarbus, 1967), have suggested that gaze changes are most
often directed according to the ongoing demands of the task at
hand.
The task-specific use of gaze is best understood for reading
text (O'Regan, 1990), where the eyes fixate almost every word,
sometimes skipping over smaller function words. In addition, it
is known that saccade size during reading is modulated
according to the specific nature of the pattern recognition task
at hand (Kowler and Anton, 1987).
2.4 Fixations
It is generally agreed that visual and cognitive processing do
occur during fixations (Just and Carpenter, 1984). Fixation
identification is an inherently statistical description of
observed eye movement behaviors. The process of fixation
identification - separating and labeling fixations and saccades
in eye-tracking protocols - is an essential part of eye-movement
data analysis and can have a dramatic impact on higher-level
analyses.
Common analysis metrics include fixation or gaze durations,
saccadic velocities, saccadic amplitudes, and various
transition-based parameters between fixations and/or regions
of interest (Salvucci and Goldberg, 2000). The analysis of
fixations and saccades requires some form of fixation
identification (or simply identification) - that is, the translation
from raw eye-movement data points to fixation locations (and
implicitly the saccades between them) on the visual display.
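The translation from raw data points to fixations can be illustrated with a simple velocity threshold, in the spirit of the velocity-based identification described by Salvucci and Goldberg (2000). The sketch below is an assumption-laden illustration, not the procedure used in this paper: the function name, the sample format (time in seconds, gaze position in degrees of visual angle), and the 100°/s threshold are all hypothetical choices.

```python
import math

def identify_fixations(samples, velocity_threshold=100.0):
    """Group raw gaze samples into fixations by point-to-point velocity.

    samples: list of (t, x, y) with t in seconds and x, y in degrees
             of visual angle (assumed format).
    velocity_threshold: deg/s; an assumed value that would need tuning.
    Returns a list of dicts with start, end, duration and centroid.
    """
    fixations = []
    group = []  # consecutive samples labelled as fixation points

    def flush(points):
        # Collapse a run of fixation points into one fixation record.
        if points:
            xs = [p[1] for p in points]
            ys = [p[2] for p in points]
            fixations.append({
                "start": points[0][0],
                "end": points[-1][0],
                "duration": points[-1][0] - points[0][0],
                "x": sum(xs) / len(xs),
                "y": sum(ys) / len(ys),
            })

    for i, (t, x, y) in enumerate(samples):
        if i == 0:
            velocity = 0.0  # assumption: treat the first sample as stationary
        else:
            t0, x0, y0 = samples[i - 1]
            dt = t - t0
            velocity = math.hypot(x - x0, y - y0) / dt if dt > 0 else float("inf")
        if velocity < velocity_threshold:
            group.append((t, x, y))   # slow sample: part of a fixation
        else:
            flush(group)              # fast sample: a saccade ends the run
            group = []
    flush(group)
    return fixations

# Synthetic 100 Hz recording: gaze rests at (1, 1), jumps to (6, 1), rests again.
samples = [(i * 0.01, 1.0, 1.0) for i in range(5)]
samples += [(i * 0.01, 6.0, 1.0) for i in range(5, 10)]

result = identify_fixations(samples)
for fixation in result:
    print(fixation)
```

On this synthetic input the function reports two fixations, and the saccade between them is implied by the gap. Metrics such as fixation duration, or saccadic amplitude measured as the distance between consecutive fixation centroids, can then be derived from the returned records.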
While it is generally agreed that visual and cognitive
processing occur during fixations, it is less clear exactly
when fixations start and when they end. Regardless of the
precision and flexibility of identification
algorithms, identification remains a subjective
process. Therefore, one effective way to validate these
algorithms is to compare the resulting fixations to an observer's
subjective impressions.
For spatial characteristics, three criteria have been identified
that distinguish three primary types of algorithms: velocity-
based, dispersion-based, and area-based (Salvucci and