also instrumental in introducing us to the large potential
available in parallel computing. One major disadvantage of the
MPI implementation was that it required dedicated nodes for
computation, which left most of the idle high-end workstations
unusable. Even though simple programs could be written easily,
a significant learning curve was required to get the maximum
benefit out of the MPI parallel computing model.
Issues that need to be addressed in parallel programs include
minimization of communication, load balancing and fault
tolerance. Scheduling, monitoring and management of jobs are
not addressed by MPI itself and would require integration with
another tool.
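The load-balancing concern above can be illustrated with a minimal sketch. The snippet below is plain Python rather than MPI, and the function name `partition` and the example file names are our own illustration; it shows the kind of static partitioning a program might use to divide input files evenly among compute nodes before distributing work.

```python
# Minimal sketch of static load balancing: divide a list of work items
# (e.g. input files) as evenly as possible among a fixed number of nodes.
# Names here are illustrative, not taken from any MPI binding.

def partition(items, num_nodes):
    """Split items into num_nodes chunks whose sizes differ by at most one."""
    base, extra = divmod(len(items), num_nodes)
    chunks, start = [], 0
    for node in range(num_nodes):
        size = base + (1 if node < extra else 0)  # early chunks absorb the remainder
        chunks.append(items[start:start + size])
        start += size
    return chunks

# Hypothetical input set: ten XML tiles split across three nodes.
files = [f"tile_{i:03d}.xml" for i in range(10)]
for node, chunk in enumerate(partition(files, 3)):
    print(node, chunk)
```

Keeping chunk sizes within one item of each other is the simplest way to avoid one node finishing long before the others when the items cost roughly the same to process.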
2.2 Parallel Virtual Machine (PVM)
PVM is an integrated set of software tools and libraries that
emulates a general purpose, flexible, heterogeneous parallel
computing framework on interconnected computers of varied
architecture (Sterling, 2002). It is a portable message-passing
programming system, designed to link separate host machines
to form a "virtual machine" which is a single, manageable
computing resource (Geist, 1994).
The biggest advantage of PVM is that it has a large user base
supporting the libraries and tools. Writing our proxies using
PVM allowed us to leverage this knowledge, which resulted in
minimal modification to the existing applications. It also
introduced us to parallel programming. The disadvantages were
that it was not easy to set up and was not mature on the
Windows platform. It has also been superseded by MPI and is
generally considered slower. As in the MPI case, job scheduling
and management are not addressed.
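The master/worker message-passing pattern that PVM supports can be sketched as follows. This is illustrated with Python's `multiprocessing` queues rather than the PVM C API (`pvm_spawn`/`pvm_send`/`pvm_recv`), and the task payloads are purely illustrative: a master sends work items to spawned workers, and each worker sends one result message back.

```python
# Sketch of PVM-style master/worker message passing, using Python
# multiprocessing as a stand-in for PVM's send/receive primitives.
from multiprocessing import Process, Queue

def worker(tasks, results):
    # Receive tasks until a sentinel (None) arrives; send back one result each.
    for item in iter(tasks.get, None):
        results.put(item * item)

if __name__ == "__main__":
    tasks, results = Queue(), Queue()
    procs = [Process(target=worker, args=(tasks, results)) for _ in range(2)]
    for p in procs:
        p.start()
    for n in range(5):          # master "sends" the work items
        tasks.put(n)
    for _ in procs:             # one sentinel per worker to shut it down
        tasks.put(None)
    out = sorted(results.get() for _ in range(5))
    for p in procs:
        p.join()
    print(out)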
2.3 Condor: High Throughput Job Scheduler
Condor is a sophisticated and unique distributed job scheduler
developed by the Condor research project at the University of
Wisconsin-Madison Department of Computer Science (Sterling,
2002). "Condor is a specialized workload management system
for compute-intensive jobs. Like other full-featured batch
systems, Condor provides a job queuing mechanism, scheduling
policy, priority scheme, resource monitoring, and resource
management. Users submit their serial or parallel jobs to
Condor, Condor places them into a queue, chooses when and
where to run the jobs based upon a policy, carefully monitors
their progress, and ultimately informs the user upon
completion." (Condor Team, 2003). It provides a high
throughput computing environment that delivers large amounts
of computational power over a long period of time, even in the
case of machine failures. Condor has many more features, which
are well documented in the manual and in numerous
publications from the research team.
Condor proxies were developed to submit jobs to the Condor
pool and monitor their progress using the Condor command-line
executables. The most impressive thing about Condor was that
setup was a breeze and everything worked right out of the box.
The ability to effectively harness wasted CPU power from idle
desktop workstations was an added benefit that gave Condor an
edge over the others. The scheduling, monitoring and inbuilt
fault tolerance were also features that could not be matched by
any of the other HPC models. Moreover, since it could control
both serial and parallel (MPI, PVM) programs, it provided us
with growth potential.
International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences, Vol XXXV, Part B3. Istanbul 2004
The vanilla universe (which is Condor terminology for serial
job management) directly matched our distribution model. In
writing the proxies, all we had to do was make minor changes
to each application to work on a subset of the input XML files
and modify the Condor job submission text files to kick off a
batch file that sets up the shared drives. The major difficulty in
using Condor was the absence of an API that allowed easy third
party integration to monitor/manipulate jobs and machines.
Another shortcoming was the lack of GUIs for job control and
feedback. Clipped functionality on the Windows platform (no
transparent authentication and failure handling, automatic drive
mappings, etc.) was among the minor problems we came across
at the beginning; these have been partially addressed in newer
versions of Condor.
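A vanilla-universe submit description file of the kind referred to above might look like the following sketch. The executable name, argument, and log file names are illustrative, not taken from the actual LGGM setup:

```
# Hypothetical Condor submit description file (vanilla universe).
universe   = vanilla
executable = setup_shares.bat    # batch file that maps the shared drives
arguments  = subset_01.xml       # subset of the input XML files to process
log        = job.log
output     = job.out
error      = job.err
queue
```

Passing such a file to `condor_submit` places the job in the queue, after which Condor matches it to an available machine according to the pool's policy.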
2.4 Distributed Component Object Model (DCOM)
DCOM is a protocol that extends Microsoft's Component
Object Model (COM), using remote procedure calls to provide
location independence by allowing components to run on a
different machine from the client. It
enables software components to communicate directly over a
network in a reliable, secure, and efficient manner. The theory
of DCOM is great, but the implementation has proved less than
stellar (Templeman, 2003). By the time we started evaluating
distributed computing, DCOM had lost favor to newer
technologies such as web services (Templeman, 2003).
The job management for COM components could be handled
by Microsoft Application Center 2000 that is designed to enable
Web applications to achieve on-demand scalability and
mission-critical availability through software scaling, while
reducing operational complexity and costs (Microsoft, 2001).
As a fully GUI-based product, it appears uncomplicated to
configure and set up Application Center clusters. However, it is
mainly geared towards web services, and our research did not
turn up how it could be used for other types of applications. It
was also not clear how one would write a component that could
be load balanced. Moreover, this solution would also require a
dedicated computation farm and carries a high software price
for each server added to the cluster.
2.5 Selected Solution
Condor, with its ease of setup and high-throughput computing,
was finally selected for implementation. Proxies were
developed for each computationally intensive process. These
proxies submit the job, get the status reported from each node
and summarize the results as shown in Figure 1. The installation
assumes that all the nodes have fast access to the file servers.
The maximum throughput from this configuration would be
obtained with a high-end fiber-array Storage Area Network
(SAN) that provides high read and write performance. Using the
Windows Distributed File System (DFS), or different input and
output locations on various file servers, could also avoid I/O
bottlenecks.
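The status-summarizing role of the proxies can be sketched with a small helper. The summary line format parsed here ("N jobs; I idle, R running, H held") only approximates `condor_q` output and should be treated as an assumption, as should the function name `parse_summary`:

```python
# Sketch of a helper a Condor proxy could use to summarize job status.
# The line format is an assumed approximation of condor_q summary output.
import re

def parse_summary(line):
    """Extract job counts from a condor_q-style summary line."""
    m = re.search(r"(\d+) jobs?; (\d+) idle, (\d+) running, (\d+) held", line)
    if not m:
        raise ValueError("unrecognized summary line")
    total, idle, running, held = map(int, m.groups())
    return {"total": total, "idle": idle, "running": running, "held": held}

print(parse_summary("8 jobs; 3 idle, 5 running, 0 held"))
```

A proxy polling the queue periodically with such a parser can report aggregate progress back to the submitting application without any Condor API, which matches the command-line-only integration described above.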
The following section will show a typical setup that uses
Condor and the proxies developed at LGGM. We will finally
summarize the actual results attained for some projects.