Search A-Z index Help
University of Cambridge Home Chemistry Dept Home CUC3 home
University of Cambridge > Department of Chemistry > Theoretical Chemistry > Computer Support

CUC3 parallel programming notes

The basics (for people completely new to this)

Compiling and running parallel programs is more complicated than working with serial programs. There are two ways for a code to run different tasks in parallel and have communication between them: shared memory and message passing. Shared memory assumes you're on a machine with multiple CPUs that effectively runs one copy of the operating system. Every process can access all of the same memory and disks. Message passing is where you're using a group of machines, each with its own CPUs, memory, and copy of the operating system. Filesystems/disks may still be shared.

Shared memory is the easiest to use, but multiple-CPU shared memory machines are far more expensive to buy than distributed clusters. Many machines do a bit of both, with small SMP nodes combined into bigger clusters so you can run in shared memory over a few processors, and use message passing if you want more. Examples of local shared memory machines are nemesis, defiant, and mek-quake. Examples of purely cluster machines are destiny and ice. Examples of mixed systems are rho, sword, and rama.

Having said that, there's nothing to stop you from running a message-passing program on a shared memory machine and in fact people often do this if their code happened to be originally written to use message-passing.

Compiling and running parallel codes

If you are trying to compile and run some else's code then the best advice anyone can give is to read all of their documentation, as chances are that someone has already tried to get the code working on a similar system to the one you're using and can tell you exactly how to compile and run it. There is no substitute for reading the manual.

It also helps to know what parallel libraries and compilers are available on your machine. This isn't the appropriate document for that information. Most of the local machines have a 'notes for users' or 'introduction' document or similar that you should have been given when you got your user account. These usually contain a list of available software. The latest versions of these can be found on the page for the appropriate machine linked from the list of local servers. If your machine hasn't got one you'll just have to ask your colleagues, or me, what's on the machine. Also look at the files /etc/motd and /info/new and try the local man pages.

Auto-parallelizing compilers

If you have a serial code and don't wish to parallelize it yourself but still want it to go faster, then there are some compilers that can do a degree of shared-memory auto-parallelization for you. They probably aren't as good at it as a human expert and they work best on well-written, clean Fortran.

Portland compiler suite (Linux)

The Portland compilers have some autoparallelizing capability. If you use pgf77 or pgf90 with the -Mconcur option, the compiler will do its best to parallelize the loops. See the man pages for options to -Mconcur to tweak the exact behaviour. You must use the same flags when linking as you did when compiling, of course. To actually make the code run over more than one CPU, you must set the NCPUS environment variable to the number of CPUs to use.

Intel ifort compiler (Linux)

The Intel ifort Fortran compiler has the ability to do some autoparallelization and vectorization. If you use the -parallel option it will attempt to parallelize where possible. See the documentation for many more details, including options to tweak the parallelization. Set the environment variable OMP_NUM_THREADS at runtime to the number of CPUs to run over. Note that by default the number of threads is the number of CPUs on the machine the code was compiled on, so those compiling on dual-processors can often get away without the variable.

Sun compiler suite (Solaris machines)

The SunONE Studio compilers can autoparallelize and our Suns are the ideal environment to do this in, because they are large shared memory machines. You need to use the -xautopar switch. This will only have an effect if you also use -O3 (which is implicit with -fast). To activate the parallel code set OMP_NUM_THREADS to the number of CPUs you want to run over before starting your program.

The Sun compiler suite has a parallel shared memory version of the Sun performance library (highly optimized maths routines). Invoke it by linking with -xparallel as well as the parallelization options.

OpenMP

This is a standard for use with shared memory machines. To use it you take a serial code and put OpenMP directives within it. These are little comments in the code that an ordinary compiler will ignore, but if you compile with an OpenMP compiler then you will get a parallelized executable. Again you need to set an environment variable (OMP_NUM_THREADS) to the appropriate number of processors. Compilers that support it are Portland (-mp flag), Intel Fortran (-openmp flag), SunONE Studio suite (-openmp flag, see also -stackvar and -xautopar to do autoparallelizing too where possible).

HPF

HPF is a superset of Fortran 90 which includes some special directives and commands for parallelizing code. You can in theory run an HPF source file through a Fortan 90 compiler and it will still work. The point of HPF is that you can get the machine to parallelize your code in exactly the way you want it. You can make it use either a message-passing interface (-Mmpi) or shared memory (-Msmp) and you can get the Portland pghpf compiler to do a degree of autoparallelizing (-Mautopar) even if you didn't write your code with parallelization directives. I don't think we have other HPF compilers available. I'm not aware of anyone here using it.

MPI codes

MPI is a standard for message passing. To use it, you must explictly write your code using calls to the MPI library to farm off threads to go and do parallel work, and you must link your program with the MPI library. You then launch the program using the mpirun, mpiexec, or mprun command provided by whatever library you're using. This takes various options, usually including one (-np) which is the number of processes to run. This should normally be the same as the number of processors you want to use, though you can often set it to higher than the number of available processors for testing. When using a distributed system you usually have to provide a list of the nodes that MPI can use too.

MPI-CH

This is an MPI implementation that is installed on most of the local Beowulf clusters. It can be found in /usr/local/. It has to be configured for particular compilers and communication methods so there may be several versions on any given system, controlled by modules. Pick the one you need for your preferred compiler.

Use the mpif77 and mpif90 commands in the bin directory of the chosen mpi directory to compile your code. This will automatically link your code with the correct libraries. To run, use the mpirun or mpiexec command found in the same directory. You will generally have to supply a correctly configured list of machines to use as one of the options to the mpirun command. Example:

mpirun -np 4 -nolocal -machinefile nodelist -v my_mpi_program
This runs over four processors and tells MPICH not to assume the local processor is free for work but to look in the file nodelist to see what machines are available for use. If you don't use -nolocal one thread is always started on the local machine, which isn't necessarily what you want. The line above also uses the verbose switch to make MPICH give more information about what's going on. The -t switch is also useful. It doesn't actually run the program but prints out what MPICH would do, so is very useful when things aren't behaving as you think they should.

The format of the nodelist can be quite important when you have dual-processor nodes (eg sword), because you want to run two threads on each node and use effcient shared-memory communication between the two processors in the same node rather than rsh. To do this you need a list of the form

node02:2
node03:2
instead of
node02
node02
node03
node03
Setting up the nodelist correctly when using a queueing system that is assigning you a particular set of nodes requires a little effort, and most of the clusters have wrapper scripts that do this for you. They are usually called mpichwrapper and you should use them in your job scripts instead of an mpirun command. mpichwrapper does not normally need any arguments; it works out its hostlist and number of cpus by talking to the queueing system. Please have a look at the example PBS submission scripts for MPI jobs on the Beowulf clusters to see sample usage. They can be found in /info/pbs on each cluster. Use the one from the machine you're running on as they are different. Note that the rama cluster doesn't allow you to sensibly parallelize over more than two processors: the local MPI library has been deliberately configured not to understand non-shared memory communication. This is because rama's interconnnect is not good for parallel work and attempting to use it for this would cause many problems.

SCore/MPI-CH

SCore is a version of MPI-CH that does not use TCP/IP for communication between nodes. It has better latency than MPI-CH. It's more complex to configure than MPI-CH and so is only found on clusters intended for serious parallel work. Like MPI-CH it requires the use of special wrapper scripts to compile and run programs. You would compile with

mpif90 -compiler <compilername> -o mpihello mpihello.f
where <compilername> is the name of one of the real Fortran compilers that SCore has been configured to know about.

You would then run this code by creating a nodefile as for MPI-CH above, and doing, from the SCore master node:

scout -wait -F nodefile -e scrun -nodes=Nx2 mpihello
where N is the number of nodes you are running over. The x2 tells SCore to run two processes per node, which is what you'd want for dual-processor nodes. There is no need to mention each node twice in the nodefile.

In practice, the complexities of starting an SCore job are usually hidden by a workload management system and you should consult the local documentation for your machine before trying to run jobs. The local cluster Athens, which uses SCore, has wrapper scripts (man scrunwrapper) to handle starting jobs. Do not try to invoke scout directly on Athens.

LAM-MPI

This is another popular MPI implementation that is installed on most of the local clusters. Like MPI-CH it has be compiled separately for each underlying compiler, so use the module for the compiler you want. It has its own mpirun command with a slightly different syntax to the one in MPI-CH. Instead of giving a hostlist to mpirun, you start using LAM with the lamboot command, which takes a hostlist as argument, and then call mpirun command after that. When you are done you run lamhalt. Most of the clusters have a wrapper called lamwrapper to take care of this.

Other MPI libraries on Linux

There are other Linux MPI libraries; some are better for use with specialised interconnects. Tardis has OpenMPI and MVAPICH for its Infiniband interconnect.

Sun Clustertools MPI

Sun provide a highly optimized MPI implementation as part of their clustertools package. On the Suns the clustertools are installed in /opt/SUNWhpc/bin which you need to add to your PATH. Compile your MPI program with mpf90 or mpcc. You must link with the -lmpi switch despite the fact that you're using an MPI compiler! The run command in this case is mprun. There's no need to supply a nodelist on an SMP machine- the clustertools handle it all. Just tell it how many processors you'd like to use. You may also want to read the manpage for mpinfo.

The Suns also have the Sun Scientific Subroutine Library, which is a parallel maths library (-ls3l). Within a single node it uses the Sun Performance Library.