A ROBUST PARALLEL FRAMEWORK FOR MASSIVE SPATIAL DATA PROCESSING
ON HIGH PERFORMANCE CLUSTERS
Xuefeng Guan *
State Key Laboratory of Information Engineering in Surveying, Mapping and Remote Sensing, Wuhan University,
129 Luoyu Road, Wuhan 430079, P. R. China - guanxuefeng@whu.edu.cn
Commission IV, WG IV/S
KEY WORDS: Data parallel processing, Split-and-Merge paradigm, Parallel framework, LiDAR
ABSTRACT:
Massive spatial data requires considerable computing power for real-time processing. With advances in multicore
technology and falling computer component costs in recent years, high performance clusters have become the only economically viable
solution for this requirement. However, massive spatial data processing demands heavy I/O operations and should be characterized
as a data-intensive application. Parallelization strategies for data-intensive applications are incompatible with currently available
processing frameworks, which are basically designed for traditional compute-intensive applications. In this paper we introduce a
Split-and-Merge paradigm for spatial data processing and also propose a robust parallel framework in a cluster environment to
support this paradigm. The Split-and-Merge paradigm efficiently exploits data parallelism for massive data processing. The
proposed framework is based on the open-source TORQUE project and hosted on a multicore-enabled Linux cluster. One common
LiDAR point cloud algorithm, Delaunay triangulation, was implemented on the proposed framework to evaluate its efficiency and
scalability. Experimental results demonstrate that the system provides efficient performance speedup.
1. INTRODUCTION
1.1 Introduction
Spatial datasets in many fields, such as laser scanning, continue
to grow with improvements in data acquisition
technologies. The size of LiDAR point clouds has increased
from gigabytes to terabytes, even to petabytes, requiring
significant computing resources to process them in a
short time. This is well beyond the capability of a single
desktop personal computer (PC).
A practical solution to meet this resource requirement is to
design parallel algorithms and run them on a distributed
platform. Parallelism can be exploited by decomposing the
domain into smaller subsets that can be executed concurrently.
Multicore-enabled Central Processing Units (CPUs) are
becoming ubiquitous, from the single desktop PC to clusters
(Borkar and Chien, 2011), while the cost of building a powerful
computing cluster keeps falling. It is natural and
necessary that spatial analysts employ high performance clusters
(HPC) to efficiently process massive LiDAR point clouds.
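The domain decomposition mentioned above can be illustrated with a minimal sketch. It assumes a regular square grid over the x/y plane as the decomposition scheme; the function name `split_into_tiles`, the `tile_size` parameter, and the sample points are hypothetical and are not taken from the framework described in this paper.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Point = Tuple[float, float, float]  # (x, y, z)
TileKey = Tuple[int, int]           # (column, row) index of a grid tile


def split_into_tiles(points: List[Point], tile_size: float) -> Dict[TileKey, List[Point]]:
    """Group points into square tiles of side `tile_size` using their x/y coordinates.

    Assumption: a regular grid is an acceptable decomposition. Real LiDAR data
    is often unevenly dense, so tiles may hold very different point counts.
    """
    tiles: Dict[TileKey, List[Point]] = defaultdict(list)
    for x, y, z in points:
        key = (int(x // tile_size), int(y // tile_size))
        tiles[key].append((x, y, z))
    return dict(tiles)


# Hypothetical sample data: five points spread over a 2 x 2 area.
points = [
    (0.5, 0.5, 10.0),
    (1.5, 0.2, 11.0),
    (0.1, 1.9, 12.0),
    (1.7, 1.8, 9.0),
    (0.9, 0.9, 8.0),
]
tiles = split_into_tiles(points, tile_size=1.0)
for key in sorted(tiles):
    print(key, len(tiles[key]))
```

Each tile is an independent subset that could be handed to a separate core or node. Because the tiles can differ greatly in size, a practical decomposition would also have to consider load balance.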
Current data processing algorithms were designed without
any consideration of concurrency. For applied scientists,
adapting these serial programs to a distributed platform is
challenging and error-prone, since they usually do not have much
knowledge of or experience in parallelization for the distributed
context. Furthermore, processing massive LiDAR point clouds is
inherently different from classical compute-intensive
applications. Such processing devotes most of its
time to Input/Output (I/O) and manipulation of input data. This
type of application should be characterized as a data-intensive
application, as opposed to a traditional compute-intensive
application. Thus, the manipulation of input data must be taken
into consideration during decomposition, scheduling, and load
balancing.
* Corresponding author. Email: guanxuefeng@whu.edu.cn, Tel
A framework that hides low-level thread/process operation
routines and supplies high-level functions/classes through an
application programming interface (API) library would therefore
be helpful and desirable. This paper proposes a general parallel
framework on an HPC platform to facilitate the transition from a
single-core PC to an HPC context. This framework defines a
Split-and-Merge programming paradigm for users/programmers.
With the help of this paradigm, our framework can
automatically parallelize and schedule users' tasks. Finally, we
evaluate this robust framework with one typical massive LiDAR
point cloud processing example, Delaunay triangulation.
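As a rough illustration of what a Split-and-Merge style interface might look like to a programmer, the sketch below splits an input into chunks, processes the chunks concurrently, and merges the partial results. It is not the API of the framework proposed here: the names `split`, `split_and_merge`, `process`, and `merge` are hypothetical, and a local thread pool stands in for cluster job scheduling, which is not shown.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def split(data: Sequence[T], n_parts: int) -> List[Sequence[T]]:
    """Cut `data` into at most `n_parts` contiguous chunks of near-equal size."""
    n_parts = max(1, min(n_parts, len(data)))
    size, extra = divmod(len(data), n_parts)
    chunks, start = [], 0
    for i in range(n_parts):
        # The first `extra` chunks receive one additional element.
        end = start + size + (1 if i < extra else 0)
        chunks.append(data[start:end])
        start = end
    return chunks


def split_and_merge(
    data: Sequence[T],
    n_parts: int,
    process: Callable[[Sequence[T]], R],
    merge: Callable[[List[R]], R],
    max_workers: int = 4,
) -> R:
    """Split the input, run `process` on every chunk concurrently, then merge.

    Assumption: a thread pool on one machine replaces the cluster scheduler.
    """
    chunks = split(data, n_parts)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        partials = list(pool.map(process, chunks))  # results keep chunk order
    return merge(partials)


# Hypothetical usage: find the highest elevation in a list of z values.
heights = [10.0, 11.0, 12.0, 9.0, 8.0, 14.0, 7.5]
highest = split_and_merge(heights, 3, process=max, merge=max)
print(highest)
```

The user supplies only the serial `process` and `merge` functions, and the concurrency stays hidden behind `split_and_merge`. For an algorithm such as Delaunay triangulation the merge step is far more involved than taking a maximum, because neighbouring partial results must be stitched together along their boundaries.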
Section 2 reviews related work on parallel data
processing frameworks. Section 3 introduces the Split-and-Merge
paradigm for the parallel framework. Section 4
describes the detailed implementation of our parallel framework.
Section 5 presents the results and discussion of the experiments.
Section 6 closes the paper with our conclusions.
2. RELATED WORK
Parallel data processing has been an active research field for
many years. Presently, a large body of work on parallel
frameworks for data-intensive applications can be found in the
literature.
Hawick et al. (2003) used grid computing techniques to
build an operational infrastructure for data processing and data
mining. Dean and Ghemawat (2008) proposed a programming
: +86 27 68778311, Fax: +86 27 68778969