Local use of OpenPBS
OpenPBS is the queueing system we use on Rama and Athens. The source code I have
at the moment is for version 2.3.16, but with patches (see below). OpenPBS
does not seem to be under development any more, but there is a successor called
Torque which is closely based on OpenPBS. This is installed on Sword, Destiny,
Nimbus, Mek-Quake, Tardis, and Clust. We use OpenPBS and Torque with the Maui
Scheduler, which is a drop-in replacement for PBS's native scheduler.
Advice for users
I wrote some notes for users
on how to use the queueing system.
The existence of different kinds of parallel codes has caused some
confusion. Here's an introduction I wrote, originally for users of Corona and subsequently expanded.
Compiling and installing
I have found it helpful to apply some patches to the OpenPBS source code. I used the following:
I haven't needed to patch Torque yet.
Very useful web sites:
I have not needed to do anything particularly unusual with the
configuration of mom and server. I found a few useful things on the web
that don't seem to be in the manual, or maybe I missed them:
- You can make a queue restricted to certain users only by setting
'acl_user_enable' to True and 'acl_users' to a list of people who can use
that queue (there's a rough qmgr sketch after this list). This would only
really be of use with the various patches available to make a queue have
required node properties; at the moment you can set a default but I believe
that a user can reset it.
- 'resources_max.mem' is a scheduling device that causes PBS to query
the nodes and run the job on one that reports sufficient physical
memory. It cannot be used to restrict the size of user jobs on Linux, as
there's no code in the pbs_mom source to do that. Instead you have to
set 'resources_max.vmem', which is all very well, but the code to report
the used vmem is broken. See my patch above for a fix.
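For reference, here is a rough sketch of the corresponding qmgr commands; the
queue name 'long' and the user names are invented, and the attribute names
should be checked against your PBS version:

    # restrict a queue to named users
    qmgr -c "set queue long acl_user_enable = True"
    qmgr -c "set queue long acl_users = alice,bob"
    # cap per-job virtual memory; resources_max.mem is only a scheduling
    # hint, as pbs_mom on Linux does not enforce physical memory limits
    qmgr -c "set queue long resources_max.vmem = 2gb"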
You need to be careful with parallel jobs (I didn't find this one on the
web but discovered it the hard way!) if you are setting CPU time limits.
PBS attempts to tot up the used CPU time of all processes in a session (i.e.
with the same Linux session ID),
but it can't tell what time is being used by processes spawned by (say)
MPI over ssh on other nodes. Therefore if you set the cput resource you
get strange behaviour. There is a per-process CPU time resource (pcput),
but I think that walltime is probably safer.
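As an illustration, a submission script for a parallel job might request
walltime rather than cput; the resource values and program name below are
just placeholders:

    #!/bin/sh
    #PBS -l nodes=4:ppn=2
    #PBS -l walltime=12:00:00
    # cput would only count processes in the local session, so it misses
    # the MPI processes started over ssh on the other nodes
    cd $PBS_O_WORKDIR
    mpirun -np 8 ./my_mpi_program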
Here are a few things that the manual was unclear about, or that took me a
while to spot, and that are useful to know.
I rewrote the pbsacct utility in Python, fixing a couple of problems it
had with my accounting files and adding another measure of job efficiency.
You can get the script here.
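The script itself is linked above; purely as a sketch of the idea, reading
the accounting files might look something like this (the field names follow
the usual 'E' end-of-job records, and cput divided by walltime times CPUs is
just one possible efficiency measure):

    import sys

    def parse_record(line):
        # records look like: date time;type;jobid;key=value key=value ...
        timestamp, rectype, jobid, rest = line.rstrip("\n").split(";", 3)
        fields = dict(f.split("=", 1) for f in rest.split() if "=" in f)
        return rectype, jobid, fields

    def hms(value):
        # convert HH:MM:SS to seconds
        h, m, s = (int(x) for x in value.split(":"))
        return 3600 * h + 60 * m + s

    for line in open(sys.argv[1]):
        rectype, jobid, fields = parse_record(line)
        if rectype != "E":        # only look at finished jobs
            continue
        cput = hms(fields.get("resources_used.cput", "0:0:0"))
        wall = hms(fields.get("resources_used.walltime", "0:0:0"))
        cpus = len(fields.get("exec_host", "x").split("+"))
        eff = float(cput) / (wall * cpus) if wall else 0.0
        print("%-20s %-10s %5.2f" % (jobid, fields.get("user", "?"), eff))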
Scheduling policy (for the PBS native scheduler, so now out of date)
The PBS manual suggests that you do not set up separate queues but instead
get the users to submit to a central pool and have the scheduler run jobs
out of that pool. This prevents users from cheating by claiming they need
different resources from those they actually require. I think that this is
a nice idea, but in practice it seems to break down when the users have
conflicting requirements (see for example the horribly restrictive
setup on the SP, which was required to accommodate the mixed workload). It
also expects the users to learn PBS's syntax for requesting resources.
Originally I went for separate queues with different priorities. The
extra long jobs require a node with a special property, and there are
deliberately only a few of these so as to prevent the long jobs from
dominating the machine when a user submits a batch of twenty (you know who
you are). You then give this queue a high priority to prevent the short jobs
from completely crowding out the long jobs by always grabbing their
special nodes. The short-job users will not submit into the long queue
because they have less chance of getting a node in that queue. The long-job
users can't submit into the short queue because they won't get the CPU time.
This approach also
works well where you have a few nodes with a special hardware property and
you want a certain type of job to be able to get on those nodes. It fails
where you have overlapping sets of 'special' nodes.
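As a rough sketch of that kind of setup (the queue names, limits and the
'longjob' node property are all invented):

    # high-priority queue for the extra long jobs, steered onto the few
    # nodes carrying the longjob property via a default resource
    qmgr -c "create queue long queue_type = execution"
    qmgr -c "set queue long priority = 100"
    qmgr -c "set queue long resources_max.walltime = 168:00:00"
    qmgr -c "set queue long resources_default.neednodes = longjob"
    # lower-priority queue for the short jobs
    qmgr -c "create queue short queue_type = execution"
    qmgr -c "set queue short priority = 10"
    qmgr -c "set queue short resources_max.walltime = 12:00:00"
    # (both queues also need enabled and started set to True)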
This setup has the problem that a user gets the machine on a quiet day
and submits forty jobs, thus occupying the machine for the rest of the
week. The next day when the other users submit they find they have to wait
a week, and are
annoyed. You can't limit the first user to only a few running jobs because
that makes inefficient use of the machine when it is quiet. The best
solution I have found to the dilemma is the mechanism on the SP, where users have
a max_queued limit. Jobs queued in excess of the limit are held and do not
start to work their way up the FIFO pipe, allowing another user to slip a
few jobs in ahead. The max_queued resource on PBS simply causes the job
to be rejected, which is not as helpful, and you can't set a per-user limit
anyway.
A nice theoretical way to deal with this is to give each user a number of tokens. A
job can only go on the FIFO queue when it has a token; if there isn't a
token it waits until its owner has another job leave the FIFO queue, either through being run or cancelled or whatever, and then it joins the queue at the
end. This is exactly what the SP's max_queued restriction implements, and it has proved popular (at least,
no one's complained about it). I cannot see how
to implement this with the PBS FIFO scheduler as there seem to be no
per-user limits in the default queue attributes. I suspect you have to
replace pbs_sched. Maui, which I'm now using on all the clusters, has this
feature.
Eventually I did replace pbs_sched with Maui, and went for the 'one big
pool' approach. This works well in conjunction with the more sophisticated
scheduling policies available with Maui, including the token-based approach
above (known to Maui as idle job throttling).
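In Maui the relevant knob is a soft throttling policy in maui.cfg; a minimal
sketch, with the limit of four idle jobs per user picked arbitrarily:

    # count at most 4 of each user's queued jobs as eligible (idle);
    # the rest stay blocked and join the eligible pool as earlier jobs leave
    USERCFG[DEFAULT]    MAXIJOB=4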
I found a useful introduction to
queu(e)ing theory (both spellings are correct, OED says the extra e one is
popular with queueing researchers) which proved that it's usually better
to have a few fast processors than lots of slow ones even if you get the
same total GFlops. This seems to make intuitive sense as well. However it
does depend on how you measure 'betterness'. For the users it is how quickly
they can get onto a CPU, and then how quickly their job runs. For me,
higher utilisation is better. The two are not compatible.
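A toy illustration of the fast-versus-slow result (the numbers are invented,
and the four slow processors are treated as separate M/M/1 queues, which is
the pessimistic case for them):

    % arrival rate \lambda = 8 jobs/hour; one fast CPU at \mu = 10 jobs/hour
    % versus four slow CPUs at 2.5 jobs/hour each (same aggregate rate)
    T_{\mathrm{fast}} = \frac{1}{\mu - \lambda} = \frac{1}{10 - 8} = 0.5 \text{ hours}
    \qquad
    T_{\mathrm{slow}} = \frac{1}{2.5 - 8/4} = 2 \text{ hours}
    % a single shared queue for the four slow CPUs narrows this to roughly
    % 0.7 hours, but the single fast CPU still comes out ahead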