The basics (for people completely new to this)
Compiling and running parallel programs is more complicated than
working with serial programs. There are two ways for a code to run
different tasks in parallel and have communication between them: shared memory
and message passing. Shared memory assumes you're on a machine with multiple
CPUs that effectively runs one copy of the operating system. Every process can
access all of the same memory and disks. Message passing is where you're using
a group of machines, each with its own CPUs, memory, and copy of the operating
system. Filesystems and disks may still be shared.
Shared memory is the easier of the two to use, but multiple-CPU shared memory
machines are far more expensive to buy than distributed clusters. Many machines
do a bit of both, with small SMP nodes combined into bigger clusters, so you can
run in shared memory over a few processors and use message passing if you want
more. Examples of local shared memory machines are
nemesis, defiant, and mek-quake. Examples of purely cluster machines are
destiny and ice. Examples of mixed systems are rho, sword, and rama.
Having said that, there's nothing to stop you from running a
message-passing program on a shared memory machine and in fact people often do
this if their code happened to be originally written to use message-passing.
Compiling and running parallel codes
If you are trying to compile and run someone else's code then the
best advice anyone can give is to read all of their documentation, as chances
are that someone has already tried to get the code working on a system similar
to the one you're using and can tell you exactly how
to compile and run it. There is no substitute for reading the manual.
It also helps to know what parallel libraries and compilers are available
on your machine. This isn't the appropriate document for that information. Most of the
local machines have a 'notes for users' or 'introduction' document or similar that you
should have been given when you got your user account. These usually
contain a list of available software. The latest versions of these
can be found on the page for the appropriate machine linked from the list of local
servers. If your machine hasn't got one you'll just have to ask your
colleagues, or me, what's on the machine. Also look at the files
/etc/motd and /info/new and try the local man pages.
Auto-parallelizing compilers
If you have a serial code and don't wish to parallelize it yourself but
still want it to go faster, then
there are some compilers that can do a degree of shared-memory auto-parallelization for you.
They probably aren't as good at it as a human expert and they work best on
well-written, clean Fortran.
Portland compiler suite (Linux)
The Portland compilers have some autoparallelizing capability.
If you use pgf77 or pgf90 with the -Mconcur option, the compiler will do its
best to parallelize the loops. See the man pages for options to -Mconcur to
tweak the exact behaviour. You must use the same flags when linking as you did
when compiling, of course. To actually make the code run over more than one
CPU, you must set the NCPUS environment variable to the number of CPUs to use.
Intel ifort compiler (Linux)
The Intel ifort Fortran compiler has the ability to do some
autoparallelization and vectorization. If you use the -parallel option it
will attempt to parallelize where possible. See the documentation for many
more details, including options to tweak the parallelization. Set the
environment variable OMP_NUM_THREADS at runtime to the number of CPUs to
run over. Note that by default the number of threads is the number of CPUs
on the machine the code was compiled on, so those compiling on
dual-processor machines can often get away without setting the variable.
Sun compiler suite (Solaris machines)
The SunONE Studio compilers can autoparallelize and our Suns are the ideal
environment to do this in, because they are large shared
memory machines. You need to use the -xautopar switch. This will
only have an effect if you also use -O3 (which is implicit with
-fast). To activate the parallel code set OMP_NUM_THREADS to the
number of CPUs you want to run over before starting your program.
The Sun compiler suite has a parallel shared memory version of the Sun performance
library (highly optimized maths routines). Invoke it by linking with
-xparallel as well as the parallelization options.
OpenMP
This is a standard for use with shared memory machines.
To use it you take a serial code and put
OpenMP directives within it. These are little comments in the code
that an ordinary compiler will ignore, but if you compile with an OpenMP compiler
then you will get a parallelized executable. Again you
need to set an environment variable (OMP_NUM_THREADS) to the appropriate
number of processors. Compilers that support it are Portland (-mp flag),
Intel Fortran (-openmp flag), and the SunONE Studio suite (-openmp flag; see
also -stackvar, and -xautopar to autoparallelize as well where possible).
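Since OMP_NUM_THREADS has to be set for several of the compilers above, a small shell helper can save some typing. This is only a sketch: run_threads is a made-up name, not something installed on the local machines, and my_openmp_program in the usage line is a placeholder.

```shell
# run_threads N command [args...]
# Hypothetical helper: run a command with OMP_NUM_THREADS set to N
# for that command only, leaving the calling shell's environment alone.
run_threads() {
    # Need a thread count and at least one word of command.
    if [ $# -lt 2 ]; then
        echo "usage: run_threads N command [args...]" >&2
        return 2
    fi
    # Reject anything that is not a positive integer.
    case $1 in
        ''|*[!0-9]*|0)
            echo "run_threads: N must be a positive integer" >&2
            return 2
            ;;
    esac
    n=$1
    shift
    env OMP_NUM_THREADS="$n" "$@"
}

# Example (my_openmp_program is a placeholder name):
#   run_threads 4 ./my_openmp_program
```

Setting the variable per command rather than exporting it in your login scripts makes it harder to forget which thread count a given run used.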
HPF
HPF is a superset of Fortran 90 which includes some special directives and
commands for parallelizing code. You can in theory run an HPF source file
through a Fortran 90 compiler and it will still work. The point of HPF is
that you can get the machine to parallelize your code in exactly the way you
want it. You can make it use either a message-passing interface (-Mmpi) or
shared memory (-Msmp) and you can get the Portland pghpf compiler to do a
degree of autoparallelizing (-Mautopar) even if you didn't write your code
with parallelization directives. I don't think we have other HPF
compilers available. I'm not aware of anyone here using it.
MPI codes
MPI is a standard for message passing. To use it, you must explicitly write
your code using calls to the MPI library to farm work out to parallel
processes, and you must link your program with the MPI library.
You then launch the program using the mpirun, mpiexec, or mprun command provided by
whatever library you're using.
This takes various options, usually including one
(-np) which is the number of processes to run. This should normally be the same as the number
of processors you want to use, though you can often set it to higher than the
number of available processors for testing. When using a distributed system you usually have
to provide a list of the nodes that MPI can use too.
MPI-CH
This is an MPI implementation that is installed on most of the local Beowulf clusters.
It can be found in /usr/local/. It has to be configured for
particular compilers and communication methods so there may be several
versions on any given system, controlled by modules. Pick the one you need for your
preferred compiler.
Use the mpif77 and mpif90 commands in the bin directory of the chosen mpi
directory to compile your code. This will automatically link your code
with the correct libraries. To run, use the mpirun or mpiexec command found in the
same directory. You will generally have to supply a correctly configured
list of machines to use as one of the options to the mpirun command.
Example:
mpirun -np 4 -nolocal -machinefile nodelist -v my_mpi_program
This runs over four processors and tells MPICH not to assume the local
processor is free for work but to look in the file
nodelist to see what machines are available for use. If you don't use
-nolocal one process is always started on the local machine, which isn't
necessarily what you want. The line above also uses the
verbose switch (-v) to make MPICH give more information about what's going on.
The -t switch is also worth knowing. It doesn't actually run the program but
prints out what MPICH would do, which helps when things aren't behaving
as you think they should.
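To keep -np consistent with the machine file, the process count can be worked out from the file itself. The sketch below assumes a nodelist with one hostname per line, optionally followed by :N giving the number of processors on that node. count_procs is a hypothetical helper name, not part of MPICH.

```shell
# count_procs [nodelist-file]
# Hypothetical helper: add up the processors named in a nodelist.
# A line "host:N" counts as N processors, a bare "host" counts as 1.
# Reads standard input if no file is given. Blank lines are skipped.
count_procs() {
    awk -F: 'NF { total += (NF > 1 ? $2 : 1) }
             END { print total + 0 }' "$@"
}

# Example, as a dry run with -t (nodelist and my_mpi_program are placeholders):
#   np=$(count_procs nodelist)
#   mpirun -np "$np" -nolocal -machinefile nodelist -t my_mpi_program
```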
The format of the nodelist can be quite important when you have
dual-processor nodes (e.g. sword), because you want to run two processes on
each node and use efficient shared-memory
communication between the two processors in the same node rather than rsh.
To do this you need a list of the form
node02:2
node03:2
instead of
node02
node02
node03
node03
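If you are handed a list in the second form, it can be collapsed into the first with a few lines of shell. A typical source of such a list is the one-line-per-processor node file that PBS usually provides in $PBS_NODEFILE, though that is an assumption about your queueing setup. collapse_nodelist is a hypothetical helper name.

```shell
# collapse_nodelist
# Hypothetical helper: read hostnames on standard input, one per line
# (repeats allowed), and print one "host:count" line per host, in order
# of first appearance.
collapse_nodelist() {
    awk 'NF {
             if (!($1 in count)) order[++k] = $1   # remember first-seen order
             count[$1]++
         }
         END {
             for (i = 1; i <= k; i++) print order[i] ":" count[order[i]]
         }'
}

# Example (the file names are placeholders):
#   collapse_nodelist < "$PBS_NODEFILE" > nodelist
```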
Setting up the nodelist correctly when using a queueing system that is assigning you a
particular set of nodes requires a little
effort, and most of the clusters have wrapper scripts that do this for
you. They are usually called mpichwrapper and you should use them
in your job scripts instead of an mpirun command.
mpichwrapper does not normally need any arguments; it works out its hostlist and number of cpus by talking to the queueing system. Please have a look at the example PBS submission scripts for
MPI jobs on the Beowulf clusters to see sample usage. They can be found in /info/pbs on each cluster. Use
the one from the machine you're running on as they are different. Note
that the rama cluster doesn't allow you to sensibly parallelize over more than two processors:
the local MPI library has been deliberately configured not to understand
non-shared memory communication. This is because rama's interconnect is
not good for parallel work and attempting to use it for this would cause
many problems.
SCore/MPI-CH
SCore is a version of MPI-CH that does not use TCP/IP for communication
between nodes. It has better latency than MPI-CH. It's more complex to
configure than MPI-CH and so is only found on clusters intended for
serious parallel work. Like MPI-CH it requires the use of special
wrapper scripts to compile and run programs. You would compile with
mpif90 -compiler <compilername> -o mpihello mpihello.f
where <compilername> is the name of one of the real Fortran
compilers that SCore has been configured to know about.
You would then run this code by creating a nodefile as for MPI-CH above,
and doing, from the SCore master node:
scout -wait -F nodefile -e scrun -nodes=Nx2 mpihello
where N is the number of nodes you are running over. The x2 tells SCore to
run two processes per node, which is what you'd want for dual-processor
nodes. There is no need to mention each node twice in the nodefile.
In practice, the complexities of starting an SCore job are usually hidden by a workload
management system and you should consult the local documentation for your
machine before trying to run jobs. The local cluster Athens, which uses
SCore, has wrapper scripts (man scrunwrapper) to handle starting jobs. Do not try to invoke
scout directly on Athens.
LAM-MPI
This is another popular MPI implementation that is installed on most of the local
clusters. Like MPI-CH it has to be compiled separately for each underlying
compiler, so use the module for
the compiler you want. It has its own mpirun command with a
slightly different syntax to the one in MPI-CH. Instead of giving a
hostlist to mpirun, you start LAM with the lamboot
command, which takes a hostlist as argument, and then call the mpirun
command after that. When you are done you run lamhalt. Most of
the clusters have a wrapper called lamwrapper to take care of
this.
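Without the wrapper, the sequence looks roughly like the sketch below. lam_run is a hypothetical function name, and the host file and program names in the usage line are placeholders. On a cluster with a queueing system you should still prefer lamwrapper.

```shell
# lam_run HOSTFILE NPROCS PROGRAM [ARGS...]
# Hypothetical helper: boot LAM on the hosts in HOSTFILE, run PROGRAM
# on NPROCS processes, then shut LAM down again even if the run failed.
lam_run() {
    if [ $# -lt 3 ]; then
        echo "usage: lam_run HOSTFILE NPROCS PROGRAM [ARGS...]" >&2
        return 2
    fi
    hostfile=$1
    nprocs=$2
    shift 2
    lamboot "$hostfile" || return 1   # nothing to halt if boot failed
    mpirun -np "$nprocs" "$@"
    lam_rc=$?                         # keep the program's exit status
    lamhalt
    return "$lam_rc"
}

# Example (placeholder names):
#   lam_run nodelist 4 ./my_mpi_program
```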
Other MPI libraries on Linux
There are other Linux MPI libraries; some are better for use with specialised
interconnects. Tardis has OpenMPI and MVAPICH for its Infiniband
interconnect.
Sun Clustertools MPI
Sun provide a highly optimized MPI implementation as part of their
clustertools package. On the Suns the clustertools are installed in /opt/SUNWhpc/bin
which you need to add to your PATH. Compile your MPI program with mpf90 or
mpcc. You must link with the -lmpi switch despite the fact that you're using an MPI
compiler! The run command in this case is mprun. There's no need to supply a
nodelist on an SMP machine: the clustertools handle it all. Just tell it how
many processors you'd like to use. You may also want to read the manpage
for mpinfo.
The Suns also have the Sun Scientific Subroutine Library, which is a
parallel maths library (-ls3l). Within a single node it uses the Sun Performance
Library.