Pilot Jobs
From SysadminWiki
Definition
A pilot job is a small script that downloads a real job from a repository once it starts executing. It is therefore not committed to any particular task, or perhaps even to a particular user, until that point. If no tasks are waiting, the pilot job exits immediately. In principle, if the time limits on the queue are long enough, a single pilot job could run more than one real job.
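The behaviour described above can be sketched as a small loop. This is only an illustration under stated assumptions: the in-memory `deque` stands in for the VO repository (a real pilot would fetch jobs over the network), and `run_pilot`, `repository` and `time_limit_s` are hypothetical names that do not come from any grid middleware.

```python
import time
from collections import deque


def run_pilot(task_queue, time_limit_s, clock=time.monotonic):
    """Pull and run tasks until the queue is empty or the time limit is reached.

    Returns the results of the tasks this pilot executed.
    """
    started = clock()
    results = []
    while clock() - started < time_limit_s:
        if not task_queue:
            # No task is waiting: the pilot exits immediately.
            break
        # Late binding: the pilot learns which task it runs only now.
        task = task_queue.popleft()
        results.append(task())
    return results


# Hypothetical repository holding two real jobs.
repository = deque([lambda: "job-A done", lambda: "job-B done"])

first = run_pilot(repository, time_limit_s=60)   # runs both jobs
second = run_pilot(repository, time_limit_s=60)  # finds nothing and exits

print(first)   # ['job-A done', 'job-B done']
print(second)  # []
```

The first pilot runs more than one real job because its time limit allows it; the second finds the repository empty and returns at once.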
Distinction between User pilot jobs and VO pilot jobs
- User pilot jobs: each pilot job pulls only jobs belonging to its own owner. This was more or less the case when only a few people were running simulation production jobs at sites.
- VO pilot jobs: someone authorized by the VO submits ALL the pilot jobs. These can pull real jobs belonging to any other user in the VO, so whoever submits the pilot jobs is not the "real" owner of the job that is running.
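The difference between the two kinds of pilot comes down to which queued jobs a pilot is allowed to take. The following is a minimal sketch, assuming a simple list as the VO's job queue; `Job`, `pull` and the user names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Job:
    owner: str
    name: str


def pull(queue, pilot_owner, vo_pilot):
    """Remove and return the first job this pilot may run, or None."""
    for i, job in enumerate(queue):
        # A user pilot only takes its owner's jobs; a VO pilot takes any job.
        if vo_pilot or job.owner == pilot_owner:
            return queue.pop(i)
    return None


queue = [Job("alice", "sim-1"), Job("bob", "reco-7")]

user_job = pull(queue, pilot_owner="bob", vo_pilot=False)
vo_job = pull(queue, pilot_owner="production", vo_pilot=True)

print(user_job)  # Job(owner='bob', name='reco-7')
print(vo_job)    # Job(owner='alice', name='sim-1')
```

In the second call the pilot was submitted by `production` but ends up running a job owned by `alice`, which is exactly the mismatch between submitter and real owner described above.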
Why they are used
In general they allow faster execution: if, say, 50 real jobs are waiting and you submit 200 pilot jobs, the first 50 pilots to execute will run the actual jobs and the others will abort. You can also decide which jobs you want to run after the pilot jobs have gone in. If the user submitted only as many pilot jobs as there are real jobs, there would be little advantage over submitting the normal way.
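The speed-up can be illustrated with a toy model. It assumes, purely for illustration, that every submission leaves the batch queue at an independent random time; the function name and the uniform distribution are assumptions, not measurements from any site.

```python
import random


def last_start_time(start_times, n_jobs):
    """Time at which the last of n_jobs real jobs begins running."""
    return sorted(start_times)[n_jobs - 1]


rng = random.Random(0)
# Assumption: each submission leaves the batch queue at a random time in [0, 1).
all_starts = [rng.random() for _ in range(200)]

n_jobs = 50
# 200 pilots: the 50 real jobs go to the 50 pilots that start earliest.
with_pilots = last_start_time(all_starts, n_jobs)
# 50 direct submissions: every job must wait for its own slot.
direct = last_start_time(all_starts[:n_jobs], n_jobs)
# Pilots that start after all the work has been taken.
exited = len(all_starts) - n_jobs

print(with_pilots <= direct, exited)  # True 150
```

With over-submission the last real job starts at the 50th earliest of 200 start times, which can never be later than the latest of any 50 of them. The price is the 150 pilots that start, find nothing and exit.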
VO pilot jobs allow the VO to do intra-VO scheduling. Since the pull is done at the last second, the VO can prioritize the jobs and change these internal priorities whenever it needs to.
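A minimal sketch of such last-second scheduling follows. The `VOScheduler` class and its methods are hypothetical and only model the idea: priorities can change freely until a pilot actually pulls a job.

```python
class VOScheduler:
    """Toy VO-side queue: the highest-priority job is chosen at pull time."""

    def __init__(self):
        self._priority = {}  # job name -> priority (higher runs first)

    def submit(self, name, priority):
        self._priority[name] = priority

    def reprioritize(self, name, priority):
        if name not in self._priority:
            raise KeyError(name)
        self._priority[name] = priority

    def pull(self):
        """Called by a pilot once it starts running on a WN."""
        if not self._priority:
            return None
        name = max(self._priority, key=self._priority.get)
        del self._priority[name]
        return name


scheduler = VOScheduler()
scheduler.submit("simulation", priority=5)
scheduler.submit("analysis", priority=1)

# The pilots are already sitting in the site queue when the VO changes its mind.
scheduler.reprioritize("analysis", priority=10)

pulled = [scheduler.pull(), scheduler.pull(), scheduler.pull()]
print(pulled)  # ['analysis', 'simulation', None]
```

Because the decision is taken inside `pull`, nothing has to be resubmitted to the site when the priorities change.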
What they look like
They look like any other job submitted on the grid. The difference is that most of them will abort after a short time. They can't be stopped. There is no way to know what they will pull onto the system, but this is also true of any other grid job that could execute wget, scp, etc. to copy an executable to the local system and run it. This type of behaviour could be stopped only by blocking outbound connections on the WNs, which is not a viable solution.
Site accounting
Pilot jobs mostly don't use CPU resources unless they pull a job from the VO scheduler/repository. However, once they are queued they occupy wall clock time on the system, i.e. a place in the queue and a short time on the WNs as well. If someone decides to make a massive job submission to a site, the pilot jobs can clog the system and keep other jobs from getting through. This is equivalent to occupying the system.
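The gap between wall clock time and CPU time can be made concrete with some arithmetic. All figures below are invented for illustration (50 pilots that run a one-hour job, 150 empty pilots that hold a WN slot for 30 seconds each), and `PilotRecord` is a hypothetical record type, not a real accounting format.

```python
from dataclasses import dataclass


@dataclass
class PilotRecord:
    wall_s: int  # seconds the pilot held a slot on a WN
    cpu_s: int   # seconds of CPU actually consumed


def summarize(records):
    """Return total wall clock time, total CPU time and their ratio."""
    wall = sum(r.wall_s for r in records)
    cpu = sum(r.cpu_s for r in records)
    efficiency = cpu / wall if wall else 0.0
    return wall, cpu, efficiency


# Hypothetical figures: 50 pilots pulled a job, 150 found nothing to do.
busy = [PilotRecord(wall_s=3600, cpu_s=3420)] * 50
empty = [PilotRecord(wall_s=30, cpu_s=1)] * 150

wall, cpu, efficiency = summarize(busy + empty)
empty_wall, empty_cpu, _ = summarize(empty)

print(wall, cpu)              # 184500 171150
print(empty_wall, empty_cpu)  # 4500 150
```

Under these assumptions the empty pilots hold WN slots for 4500 seconds while using only 150 seconds of CPU, so they show up in wall clock accounting but barely register in CPU accounting.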
Intra-VO accounting
This is the VO's responsibility. Sites have nothing to do with it. All that sites remain concerned with is global VO accounting. They are not required to supply information about which VO group has used their resources.
Traceability
At the moment it is not possible to distinguish between the real owner and the pilot job owner. This presents three problems for sites:
- Some sites need to know, for legal reasons, who the real owner is.
- If a security incident happens, the only way to block the culprit is to blacklist the pilot job owner, and therefore potentially the whole VO.
- Very likely two jobs from the same pilot job user will run on the same node, which makes traceability difficult.
JSPG (Joint Security Policy Group) Recommendation
As is well known and being discussed in several EGEE bodies, sites are concerned about the security implications of this method of workload management.
- JSPG agrees that we REQUIRE suitable auditing and traceability at the individual user level, on both the WN and the VO scheduler, available on demand.
- Sites may hold the submitter of the Pilot Job responsible for all actions of that job.
- VOs should be aware that the controls to ban users will result in the blocking of the whole VO, instead of just one user.
- VOs with a VO scheduler asked what information should be made available from their side.
glexec
To solve the Intra-VO accounting and Traceability issues, it was proposed to use glexec on the WN.
