Processes On Batch Nodes
Sometimes user processes escape being killed by the batch system. This often happens, for example, on Torque nodes, where pbs_mom keeps track only of processes belonging to the original SID (Session ID) tree. There are cases in which the child processes of a batch job do not keep the same SID: for example, if the application starts a daemon, the daemon now has the init process as its parent and a SID different from the original. Another case is the sudden death of one of the intermediate parents, which leaves its children orphaned.
It is debatable whether processes escaping the batch job SID tree are legitimate, even if the user had every intention of starting a daemon or spawning processes and then cleanly killing them at the end of the job. In both of the cases above, the surviving processes can keep running indefinitely, causing at best 'only' a misuse of resources that escapes the batch system accounting and at worst a security hole.
A recent example of this type of job on grid clusters was BOINC daemons started unintentionally by biomed processes. Sometimes you can find orphans from other VOs as well.
If the batch system is Torque, the cleanup of surviving processes has to be done either in the pbs_mom epilogue (http://www.clusterresources.com/wiki/doku.php?id=torque:appendix:g_prologue_and_epilogue_scripts) or by running a cleanup script in a cron job. There are currently two scripts in the repository that reflect these two methods: the PBS MOM epilogue script and the Standalone script. If the batch system is Sun Grid Engine, a configuration parameter must be set instead (see Sun Grid Engine method).
PBS MOM epilogue script
This script (http://www.sysadmin.hep.ac.uk/svn/fabric-management/torque/epilogue/prune_userprocs) can be called from the pbs_mom epilogue (http://www.clusterresources.com/wiki/doku.php?id=torque:appendix:g_prologue_and_epilogue_scripts) script, so it is batch system dependent. This solution is quite neat: the control of the termination stays within the batch system boundaries, and it uses batch system tools to find the SIDs associated with the job IDs of the other batch processes on a multi-CPU node. It will:
- reconstruct the process trees (and thereby all legitimate processes) for jobs related to any currently running batch job on the node
- find all user jobs (with uid>99) that are running
- subtract the legitimate jobs from the full job set
- kill off the remainder (i.e. stray and daemonized jobs)
- write the result to syslog for (central) capture.
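The selection logic in the list above can be sketched in shell. This is only an illustration, not the code of prune_userprocs: the function name list_strays, its input format, and the simplification of treating a process as legitimate whenever its SID belongs to a running batch job are all assumptions made for the example.

```shell
#!/bin/sh
# Illustrative sketch only; not the logic of prune_userprocs itself.
# list_strays is a hypothetical helper.
# stdin: one process per line as "PID UID SID"
# $1:    minimum UID treated as a user account
# $2:    space-separated SIDs that belong to running batch jobs (assumed known)
# Prints the PIDs of user processes that are outside every legitimate session.
list_strays() {
    awk -v min="$1" -v legit="$2" '
        BEGIN {
            n = split(legit, a, " ")
            for (i = 1; i <= n; i++) ok[a[i]] = 1
        }
        # Keep user processes (UID >= min) whose SID is not a legitimate one.
        $2 + 0 >= min + 0 && !($3 in ok) { print $1 }
    '
}
```

On a Linux node with procps, the input could come from ps -e -o pid= -o uid= -o sid=, with the list of legitimate SIDs obtained from the batch system. The sketch only prints PIDs and leaves killing and logging to the caller.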
How to use it
Add the script to /usr/local/bin or equivalent.
Add to /var/spool/pbs/mom_priv/epilogue (or your local equivalent which may be in "/var/torque/"):
    PRUNE_USERPROCS=/usr/local/bin/prune_userprocs
    if [ -x "$PRUNE_USERPROCS" ]; then
        $PRUNE_USERPROCS -a -k 9
    else
        echo "Cannot execute $PRUNE_USERPROCS" >&2
    fi
Use epilogue.parallel if you are dealing with MPI jobs.
This script uses UID=100 as the minimum UID to separate system processes from user ones. If you are running any daemons on the Worker Nodes under a UID at or above this threshold, for example xrootd, the script will kill them. You can add the option -u YOUR-MIN-UID when you call prune_userprocs in the epilogue script to override the default value.
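As a sketch of how the -u option might be wired into the epilogue: the value 500 below is only an assumed example threshold, chosen because it matches the lower bound quoted for the standalone script. Pick a value above the UIDs of any daemons running on your own Worker Nodes.

```shell
#!/bin/sh
# Epilogue fragment with an explicit minimum UID.
# MIN_UID=500 is an assumed example value, not a recommendation.
PRUNE_USERPROCS=/usr/local/bin/prune_userprocs
MIN_UID=500
if [ -x "$PRUNE_USERPROCS" ]; then
    # -a, -k 9 and -u are the options named in the text above.
    "$PRUNE_USERPROCS" -a -k 9 -u "$MIN_UID"
else
    echo "Cannot execute $PRUNE_USERPROCS" >&2
fi
```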
Required Perl modules: Sys::Syslog and Getopt::Long
(David Groep, Nikhef)
Standalone script
This script (http://www.sysadmin.hep.ac.uk/svn/fabric-management/processes/management/KillBatchOrphans.pl) is independent of the batch system. It kills all user processes (UID>=500) belonging to SID trees whose first ancestor is init (PPID 1). The killed processes are logged via syslog.
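The selection rule described above can be illustrated with a small shell sketch. It is one possible reading of the rule, not the code of KillBatchOrphans.pl: the helper name list_orphans and its input format are hypothetical, and for simplicity it follows parent links only and ignores SIDs. A process is reported when the topmost user-owned process above it is a direct child of init.

```shell
#!/bin/sh
# Illustrative sketch only; not the logic of KillBatchOrphans.pl itself.
# list_orphans is a hypothetical helper.
# stdin: one process per line as "PID PPID UID"
# $1:    minimum UID treated as a user account (defaults to 500)
# Prints, in numeric order, the PIDs of user processes whose topmost
# user-owned ancestor is a direct child of init (PPID 1).
list_orphans() {
    awk -v min="${1:-500}" '
        { ppid[$1] = $2; uid[$1] = $3 }
        END {
            for (p in ppid) {
                if (uid[p] + 0 < min + 0) continue   # skip system processes
                top = p
                # Climb while the parent is also a user-owned process.
                # The depth limit guards against malformed input.
                for (d = 0; d < 1024; d++) {
                    parent = ppid[top]
                    if (!(parent in uid) || uid[parent] + 0 < min + 0) break
                    top = parent
                }
                if (ppid[top] + 0 == 1) print p
            }
        }
    ' | sort -n
}
```

In this reading, a live batch job is left alone because its topmost user process is a child of the root-owned pbs_mom rather than of init. On Linux with procps, the input could come from ps -e -o pid= -o ppid= -o uid=.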
How to use it
It can be used interactively or in a cron job. It has a few options to restrict the processes selected, based on the CPU time or wall time consumed or on the minimum UID. It also has a debug option, which just lists on STDOUT the processes that would be killed. It can be run only by root.
KillBatchOrphans.pl --help will give a brief help.
It can also be used on other machines, such as a CE, to monitor or eliminate stray processes left behind by users. For example, I found jobmanagers started by some monitoring script that had not been killed and were using a lot of CPU.
Required Perl modules: Sys::Syslog, Getopt::Long and File::Basename
(Alessandra Forti, University of Manchester)
Sun Grid Engine method
SGE has an API to deal with orphans and daemons. It is activated by setting a parameter in the execd_params list of the SGE configuration.
If this parameter is set, the supplementary group IDs are used to identify all processes that are to be terminated when a job is deleted. The parameter can be set by running qconf -mconf from any admin host and either adding the line anywhere in the file or adding the parameter to the existing execd_params list if there is one. There is no need to restart anything for the changes to take effect. It should work with MPI jobs as well.
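Before editing, it can help to check whether an execd_params list already exists. The following is a minimal sketch under two assumptions: that the configuration is dumped with qconf -sconf (the show counterpart of qconf -mconf) and that each setting appears as a single "name value" line. The helper name current_execd_params is hypothetical.

```shell
#!/bin/sh
# Illustrative sketch only; current_execd_params is a hypothetical helper.
# stdin: configuration text, e.g. the output of: qconf -sconf
# Prints the current value of execd_params, or nothing if the line is absent.
# Assumes the setting is on one line (continuation lines are not handled).
current_execd_params() {
    awk '$1 == "execd_params" { $1 = ""; sub(/^ +/, ""); print }'
}
```

For example, qconf -sconf | current_execd_params would show whether the parameter needs to be added as a new line or appended to the existing list.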
(Kostas Georgiou, Imperial College)