Nodes Down in Torque

From SysadminWiki

Torque does not seem to cope well with nodes marked as down in the batch system. If a node goes down with jobs still on it, the server cannot properly clean up the files that keep track of those jobs and keeps reporting that the jobs are still there. This affects the scheduling of further jobs, which are left in an idle state. Torque seems much happier if the node is marked offline.

In short, if you find many jobs queued while nodes are free, first check whether any node is down. If one is, check whether any jobs are still recorded as running on that node, remove the stale files (normally located in /var/spool/pbs/server_priv/jobs), and put the node offline.
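
The first two checks above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the logic of the script described below: the sample text imitates the usual layout of pbsnodes -a output, the node and job names are made up, and all function names are hypothetical. It only reports what it finds; it does not remove any files.

```python
import subprocess

# Hypothetical sample in the layout printed by `pbsnodes -a`; the node and
# job names are made up for illustration.
SAMPLE = """\
node01
     state = free
     np = 4
     jobs = 0/101.server.example

node02
     state = down
     np = 4
     jobs = 0/102.server.example, 1/103.server.example

node03
     state = down,offline
     np = 4
"""


def parse_pbsnodes(text):
    """Return {node_name: {attribute: value}} from `pbsnodes -a` output."""
    nodes = {}
    current = None
    for line in text.splitlines():
        if not line.strip():
            continue
        if not line[0].isspace():
            # An unindented line starts a new node record.
            current = line.strip()
            nodes[current] = {}
        elif current is not None and " = " in line:
            key, value = line.strip().split(" = ", 1)
            nodes[current][key] = value
    return nodes


def job_ids(attrs):
    """Extract job ids from a 'jobs' value such as '0/102.srv, 1/103.srv'."""
    jobs = attrs.get("jobs", "")
    return [j.strip().split("/", 1)[-1] for j in jobs.split(",") if j.strip()]


def down_not_offline(nodes):
    """Names of nodes flagged down that are not yet marked offline."""
    result = []
    for name, attrs in nodes.items():
        states = attrs.get("state", "").split(",")
        if "down" in states and "offline" not in states:
            result.append(name)
    return result


def read_pbsnodes():
    """Run the real command; only works on a host with Torque installed."""
    return subprocess.run(["pbsnodes", "-a"], capture_output=True,
                          text=True, check=True).stdout


nodes = parse_pbsnodes(SAMPLE)
for name in down_not_offline(nodes):
    print(name, "is down; jobs still recorded:", job_ids(nodes[name]))
    print("  to mark it offline: pbsnodes -o", name)
```

On a real Torque server, replace SAMPLE with the result of read_pbsnodes(). The job ids it reports are the ones to look for among the stale files before the node is put offline with pbsnodes -o.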

Tired of constantly checking for down nodes and doing the cleanup by hand for each one, I wrote a script that does it for me. The script, torque-node-down.pl (http://www.sysadmin.hep.ac.uk/svn/fabric-management/torque/jobs/torque-node-down.pl), can be downloaded from the repository.

How to use it

The script can be run as a cron job on the Torque server. Because of the tasks it carries out, it has to be run by root. The cron entry I installed looks like this:

cat /etc/cron.d/torque-nodes-down 
04 4,10,16,22 * * * root /usr/local/admin/bin/torque-nodes-down.pl
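
To see when that entry fires, the minute and hour fields can be expanded by hand or with a small helper. The Python sketch below is an illustration only: the function names are hypothetical, and it handles just '*', single numbers and comma-separated lists, not the ranges or step values that cron also accepts.

```python
def expand(field, low, high):
    """Expand a cron field made of '*', a single number or a comma list."""
    if field == "*":
        return list(range(low, high + 1))
    return sorted(int(part) for part in field.split(","))


def daily_run_times(entry):
    """Return the HH:MM times at which a cron entry runs each day."""
    minute, hour = entry.split()[:2]
    return ["%02d:%02d" % (h, m)
            for h in expand(hour, 0, 23)
            for m in expand(minute, 0, 59)]


ENTRY = "04 4,10,16,22 * * * root /usr/local/admin/bin/torque-nodes-down.pl"
print(daily_run_times(ENTRY))  # ['04:04', '10:04', '16:04', '22:04']
```

So the script runs four times a day, six hours apart.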

The script logs every action in /var/log/messages unless the --debug option is used.

The --enable option removes the offline flag from every node that is marked only as offline, i.e. whose state field in the pbsnodes output is not down,offline. See the Notes for when this might be useful.
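
The selection rule for --enable can be illustrated with a short Python sketch. It assumes the node states have already been read from pbsnodes into a dictionary; the function name and the sample states are hypothetical. The command it prints, pbsnodes -c, is the standard Torque way of clearing the offline flag; whether the script itself uses it is an assumption.

```python
def nodes_to_enable(states):
    """Return the nodes whose state is exactly 'offline'.

    Nodes reported as 'down,offline' are skipped: they are still down,
    so the offline flag should stay on them.
    """
    return sorted(name for name, state in states.items()
                  if state.strip() == "offline")


# Hypothetical node states, keyed by node name.
states = {
    "node01": "free",
    "node02": "offline",
    "node03": "down,offline",
    "node04": "job-exclusive",
}

for name in nodes_to_enable(states):
    print("pbsnodes -c", name)  # prints: pbsnodes -c node02
```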

There is also a --help option.

Notes

The script looks for nodes that are down according to the PBS/Torque server, as reported by pbsnodes -a. Nodes might look down when in fact they are not. If you find that the script has put a considerable number of nodes offline at the same time, you might want to check your network, DHCP, or any other component that might be making them appear down. You can use the --enable option to remove the offline flag from this type of node.

It will die if it is not run by root, if it can't use pbsnodes, or if it can't open /var/log/messages.
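
These three startup conditions are easy to reproduce when writing a similar tool. The Python sketch below shows one way to check them. It is not taken from the script: the function names are hypothetical, and looking pbsnodes up in PATH and testing write access on the log file are assumptions about what "can't use" and "can't open" mean.

```python
import os
import shutil
import sys

LOG_FILE = "/var/log/messages"


def startup_problems(euid, pbsnodes_path, log_writable):
    """Return a list of reasons why the tool should refuse to start."""
    problems = []
    if euid != 0:
        problems.append("must be run by root")
    if pbsnodes_path is None:
        problems.append("pbsnodes not found in PATH")
    if not log_writable:
        problems.append("cannot open " + LOG_FILE)
    return problems


def check_or_die():
    """Check the real environment and exit with a message on failure."""
    problems = startup_problems(os.geteuid(),
                                shutil.which("pbsnodes"),
                                os.access(LOG_FILE, os.W_OK))
    if problems:
        sys.exit("refusing to start: " + "; ".join(problems))
```

Keeping the decision in startup_problems(), separate from the calls that inspect the system, makes it possible to test the rule without being root or having Torque installed.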

Requires

Torque: pbsnodes
Perl: Sys::Syslog, File::Basename, Getopt::Long