More specifically, the GPU is especially well-suited to address
problems that can be expressed as data-parallel computations,
where the same program is executed on many data elements in
parallel, with high arithmetic intensity, that is, a high ratio of
arithmetic operations to memory operations. Because the same
program is executed for each data element, there is a lower
requirement for sophisticated flow control; and because it is
executed on many data elements and has high arithmetic
intensity, the memory access latency can be hidden with
calculations instead of big data caches.
Data-parallel processing maps data elements to parallel
processing threads. Many applications that process large data
sets can use a data-parallel programming model to speed up the
computations. In 3D rendering, large sets of pixels and vertices
are mapped to parallel threads. Similarly, image and media
processing applications such as post-processing of rendered
images, video encoding and decoding, image scaling, stereo
vision, and pattern recognition can map image blocks and pixels
to parallel processing threads. In fact, many algorithms outside
the field of image rendering and processing are accelerated by
data-parallel processing, from general signal processing or
physics simulation to computational finance or computational
biology.
3. CUDA (COMPUTE UNIFIED DEVICE ARCHITECTURE)
Developers have long tried to use GPUs for parallel computing.
The earliest of these initiatives (such as rasterizing and
Z-buffering) were very primitive and made only limited use of
the hardware's capabilities. Shading calculations, however, did
accelerate matrix calculations.
At the SIGGRAPH conference in 2003, a session called "GPGPU"
was devoted to GPU computing, but it drew almost no
participation. The best-known topic in this session was
"BrookGPU", a stream programming language. Before its
publication, the two available software development interfaces
were Direct3D and OpenGL; however, only a limited number of
GPU applications could be developed with these languages. The
Brook project then made it possible to use GPUs as parallel
processors programmable in the C language. The project was
developed at Stanford University and attracted the attention of
the graphics card companies NVIDIA and ATI, the two major
designers and manufacturers. Later, some of the developers of
Brook joined NVIDIA and began offering parallel computation
as a new marketing strategy. Thus, direct use of graphics
hardware emerged under a framework called NVIDIA CUDA.
Although announcements were made earlier, Nvidia introduced
CUDA to the public in February 2007. The technology was
designed to meet several important requirements for a wide
audience. One of the most important requirements is the ability
to program GPUs easily. Simplicity is necessary to ease GPU
parallel programming and enable its use in more disciplines.
Before CUDA, GPU parallel programming was limited to the
shader models of the graphics APIs. Thus, only problems
well-suited to the nature of vertex and fragment shaders could
be computed using GPU parallel processing. Additionally,
having to express general algorithms in terms of textures, and
the fact that the GPU's 3D operations supported only
floating-point numbers, were among the issues that limited the
popularity of GPU computing. To achieve the goal of making
GPU parallel programming easy and practical, Nvidia offered
the C programming language with minimal extensions. Another
important feature is the heterogeneous computing model, which
makes it possible to use CPU and GPU resources together.
CUDA lets programmers divide the code and data into sub-parts,
considering their suitability to the CPU/GPU architectures and
the respective programming techniques. Such a division is
possible because the host and device have their own memories.
In this sense, it also becomes possible to port existing
implementations gradually from the CPU to the GPU (Yilmaz,
2010). Briefly, CUDA is a software-hardware computing
architecture developed by NVIDIA and based on the C
programming language, which controls GPU commands and
video memory for parallel computation.
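As an illustration of this heterogeneous model, the following
sketch (the function name processOnDevice and its surrounding
setup are assumptions for illustration, not code from the paper)
shows a host routine that allocates video memory, copies data to
the device, and copies results back; the host and device
memories remain separate throughout:

    #include <cuda_runtime.h>

    /* Minimal sketch of the host/device split: the host (CPU)
       owns its memory, the device (GPU) owns video memory, and
       data is moved between them explicitly. */
    void processOnDevice(float *hostData, int n)
    {
        float *devData;                    /* pointer into GPU video memory */
        size_t bytes = n * sizeof(float);

        cudaMalloc((void **)&devData, bytes);          /* allocate on device      */
        cudaMemcpy(devData, hostData, bytes,
                   cudaMemcpyHostToDevice);            /* host -> device transfer */

        /* ... kernels operating on devData would be launched here ... */

        cudaMemcpy(hostData, devData, bytes,
                   cudaMemcpyDeviceToHost);            /* device -> host transfer */
        cudaFree(devData);
    }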
CUDA works with all Nvidia GPUs from the G8x series onwards,
including the newer GeForce, Quadro and Tesla lines. The
data-parallel and thread-parallel architecture introduces
scalability: no extra effort is necessary to run an existing
solution on newer GPUs, which are capable of running more
processing threads. This means that code designed for the
Nvidia 8 series runs faster on the Nvidia GTX series without
any additional coding. Nvidia states that programs developed
for the G8x series will also work without modification on all
future Nvidia video cards, due to binary compatibility.
The three abstractions offered by Nvidia ensure the granularity
required for good data parallelism and thread parallelism. These
abstractions, listed below, are designed to make CUDA
programmers' lives easier (a sketch exercising all three follows
the list).
* Thread group hierarchy: Threads are packed into blocks,
which are in turn packed into a single grid.
* Shared memories: CUDA lets threads use six different
memories that are designed to meet different requirements.
* Barrier synchronization: This abstraction synchronizes the
threads within a single block and makes a thread wait for the
others to finish the related computation before going further.
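The following sketch, written for illustration rather than taken
from the paper, exercises all three abstractions: a grid of
thread blocks, per-block shared memory, and a barrier. It
assumes the input length n is a multiple of the block size.

    #define TILE 512   /* threads per block */

    /* Each block reverses its own TILE-element tile of the input. */
    __global__ void reverseTiles(const int *in, int *out, int n)
    {
        __shared__ int tile[TILE];      /* shared memory: one copy per block */
        int i = blockIdx.x * blockDim.x + threadIdx.x;

        if (i < n)
            tile[threadIdx.x] = in[i];  /* every thread loads one element    */
        __syncthreads();                /* barrier: wait for the whole tile  */

        if (i < n)                      /* write from the opposite end       */
            out[i] = tile[blockDim.x - 1 - threadIdx.x];
    }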
C for CUDA makes it possible to write functions that run on the
GPU using the C language. These functions are called "kernels",
and they are executed once for each thread in a parallel manner,
unlike conventional serial programming functions that run only
once.
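As a minimal sketch (the kernel name scalePixels and its
parameters are assumptions for illustration), a kernel in C for
CUDA looks like an ordinary C function marked with the
__global__ qualifier, and its body is executed once by every
thread:

    __global__ void scalePixels(float *pixels, float factor, int n)
    {
        /* Each thread derives its own global index from the built-in
           block and thread coordinates, then handles one element. */
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            pixels[i] *= factor;
    }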
CUDA's architecture offers a thread hierarchy, in top-down
order, as follows:
1. Grid: A grid contains one- or two-dimensional blocks.
2. Block: A block contains one-, two- or three-dimensional
threads. Current GPUs allow a block to contain 512 threads at
most. The blocks are executed independently, and they are
directed to the available processors to provide scalability.
3. Thread: A thread is the basic execution element.
This hierarchy and structure are depicted in Figure 4. For
example, if 1,048,576 pixels are assumed to be processed
independently in a parallel manner and the block size is set to
512, then the grid contains 1,048,576 / 512 = 2048 blocks.
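A sketch of the corresponding launch configuration, reusing the
hypothetical scalePixels kernel from above and assuming d_pixels
already resides in device memory:

    int n = 1048576;            /* pixels to process                     */
    int threadsPerBlock = 512;  /* block size chosen above               */
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;  /* 2048  */

    /* Grid of 2048 blocks, 512 threads each: one thread per pixel. */
    scalePixels<<<blocks, threadsPerBlock>>>(d_pixels, 0.5f, n);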