GLite-WMS and glite-LB
From SysadminWiki
gLite-WMS Admin Notes
Last Update: Oct, 23rd 2007
WMS release: gLite 3.0.2 U34 (13.09.07)
Managing Daemons
On a typical WMS node the following services must be running:
- glite-lb-locallogger:
- glite-lb-logd running
- glite-lb-interlogd running
- glite-lb-proxy:
- glite-lb-proxy running as 14183
- glite-proxy-renewald:
- glite-proxy-renewd running
- glite-wms-ftpd:
- glite-wms-in.ftpd (pid 18861 18440 14883) is running...
- glite-wms-jc:
- JobController running in pid: 14928
- CondorG master running in pid: 15262
- CondorG schedd running in pid: 15267
- glite-wms-lm:
- Logmonitor running...
- glite-wms-wm:
- /opt/glite/bin/glite-wms-workload_manager (pid 23411) is running...
- glite-wms-wmproxy:
- WMProxy httpd listening on port 7443
- httpd (pid 3030 508 24285 24221 16409 27941 27940 28742 19018 20796 16017 16016) is running ....
- WMProxy Server running instances:
- ...
Scripts to check the status of the daemons and to start or stop them are located in the /opt/glite/etc/init.d/ directory (e.g. /opt/glite/etc/init.d/glite-wms-wm start/stop/status).
A gLite production installation also provides a more generic service, called gLite, that manages all of them at once: try service gLite status/start/stop
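A quick way to check all the daemons listed above is to call each init script with the status argument and collect the exit codes. The following is only a minimal sketch: the function name check_services is made up for illustration, and it assumes that an exit code of 0 means "running", which not every init script guarantees, so the printed output of the scripts should still be read.

```python
import subprocess
from pathlib import Path

# Directory and script names are the ones listed in the text above.
INIT_DIR = Path("/opt/glite/etc/init.d")
SERVICES = [
    "glite-lb-locallogger",
    "glite-lb-proxy",
    "glite-proxy-renewald",
    "glite-wms-ftpd",
    "glite-wms-jc",
    "glite-wms-lm",
    "glite-wms-wm",
    "glite-wms-wmproxy",
]


def check_services(init_dir=INIT_DIR, services=SERVICES, run=subprocess.run):
    """Run '<script> status' for each service and return {name: exit_code}.

    Assumption: exit code 0 means the service is running.
    A value of None means the script could not be executed at all.
    """
    results = {}
    for name in services:
        script = Path(init_dir) / name
        try:
            proc = run([str(script), "status"], capture_output=True, text=True)
            results[name] = proc.returncode
        except OSError:
            # Script missing or not executable on this node.
            results[name] = None
    return results
```

On a WMS node, `check_services()` can be called without arguments and any entry whose value is not 0 deserves a closer look.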
Location of Log Files
Important log files are located in the /var/log/glite/ directory:
- wmproxy.log
Used in case of authentication or submission errors
- workload_manager_events.log
Used to check the status of the matchmaking process (from Waiting to Ready status) and the queries to the information system that fill the InformationSupermarket
- jobcontoller_events.log
Used to check job events once the job has reached Condor
- httpd-wmproxy-errors_2007-09-06-12.log
Used in case of problems contacting the WMProxy service
- httpd-wmproxy-access_2007-10-08-21.log
- logmonitor_events.log
Aggregates information about each job coming from the various log files
- glite-wms-wmproxy-purge-proxycache.log
- lcmaps.log
Used when there are problems in the mapping of remote users to local pool accounts
Other log files that can be useful in case of trouble are the Condor logs in:
- /var/local/condor/log/
- /var/glite/logmonitor/CondorG.log/
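When following a single job through the WMS it can help to search several of the log files above for the same string (for instance a job identifier) in one pass. The sketch below is an illustration only: find_in_logs is a hypothetical helper, and the choice of which files to scan is an assumption taken from the list above.

```python
from pathlib import Path

LOG_DIR = Path("/var/log/glite")
LOG_NAMES = [
    "wmproxy.log",
    "workload_manager_events.log",
    "jobcontoller_events.log",  # spelled this way in glite_wms.conf
    "logmonitor_events.log",
]


def find_in_logs(needle, log_dir=LOG_DIR, names=LOG_NAMES):
    """Yield (file name, line number, line) for every line containing needle."""
    for name in names:
        path = Path(log_dir) / name
        if not path.is_file():
            continue  # this log does not exist on the node
        with path.open(errors="replace") as handle:
            for number, line in enumerate(handle, start=1):
                if needle in line:
                    yield name, number, line.rstrip("\n")
```

The files are read line by line, so the helper also works on logs that have grown to several GB, although it will be slow on them.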
Important Configuration Parameters Definition in glite_wms.conf
The general configuration file for the WMS is located in /opt/glite/etc/glite_wms.conf
This file is organized in sections, one for every running service plus a Common section.
Common Section
In general there is no need to change this section
Common = [
DGUser = "${GLITE_WMS_USER}";
HostProxyFile = "${GLITE_LOCATION_VAR}/wms.proxy";
];
WMProxy Section
WorkloadManagerProxy = [
LBProxy = true;
EnableServiceDiscovery = false;
LBServiceDiscoveryType = "org.glite.lb.server";
ListMatchRootPath = "/tmp";
OperationLoadScripts = [
jobStart = "${GLITE_LOCATION}/sbin/glite_wms_wmproxy_load_monitor --oper jobStart --load1 10 --load5 10 --load15 10 --memusage 99 --diskusage 95 --fdnum 500";
jobRegister = "${GLITE_LOCATION}/sbin/glite_wms_wmproxy_load_monitor --oper jobRegister --load1 10 --load5 10 --load15 10 --memusage 99 --diskusage 95 --fdnum 500";
jobSubmit = "${GLITE_LOCATION}/sbin/glite_wms_wmproxy_load_monitor --oper jobSubmit --load1 10 --load5 10 --load15 10 --memusage 99 --diskusage 95 --fdnum 500";
];
LogLevel = 5;
LBServer = {"wms007.cnaf.infn.it:9000"};
SandboxStagingPath = "${GLITE_LOCATION_VAR}/SandboxDir";
LBLocalLogger = "localhost:9002";
MinPerusalTimeInterval = 5;
WeightsCacheValidity = 86400;
LogFile = "${GLITE_LOCATION_LOG}/wmproxy.log";
AsyncJobStart = true;
GridFTPPort = 2811;
MaxServedRequests = 50;
ServiceDiscoveryInfoValidity = 3600;
MaxInputSandboxSize = 100000000;
SDJRequirements = RegExp("*sdj$", other.GlueCEUniqueID);
];
For a complete description of every parameter please refer to the following DATAMAT page:
http://trinity.datamat.it/projects/EGEE/wiki/wiki.php?n=GliteWms.Conf
Note that the parameters that configure the so-called limiter (OperationLoadScripts) are very important: the limiter inhibits submission when certain system load limits are hit.
Since the suggested deployment is to have the WMS and the LB on two separate physical machines, LBServer is also a fundamental parameter.
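To give an idea of what the limiter thresholds mean, the sketch below compares the system load averages with the --load1, --load5 and --load15 values used in the configuration above. It is not the implementation of glite_wms_wmproxy_load_monitor: the function names are invented, the comparison with "greater than" is an assumption, and the memory, disk and file descriptor checks are left out.

```python
import os

# Thresholds copied from the OperationLoadScripts lines above.
LIMITS = {"load1": 10.0, "load5": 10.0, "load15": 10.0}


def exceeded_limits(loadavg, limits=LIMITS):
    """Return the names of the load thresholds exceeded by loadavg.

    loadavg is a (1 min, 5 min, 15 min) tuple, as returned by os.getloadavg().
    """
    names = ("load1", "load5", "load15")
    return [name for name, value in zip(names, loadavg) if value > limits[name]]


def submission_allowed(loadavg=None, limits=LIMITS):
    """True when none of the load thresholds is exceeded."""
    if loadavg is None:
        loadavg = os.getloadavg()  # available on Unix only
    return not exceeded_limits(loadavg, limits)
```

With the values above, a node whose 1 minute load average is 12 would be reported as over the load1 limit even if the 5 and 15 minute averages are still low.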
Workload Manager Section
WorkloadManager = [
CeMonitorAsynchPort = 0;
DisablePurchasingFromGris = true;
EnableBulkMM = true;
EnableIsmDump = false;
MaxRetryCount = 10;
SiServiceName = "org.glite.SEIndex";
ExpiryPeriod = 86400;
PipeDepth = 200;
JobWrapperTemplateDir = "/opt/glite/etc/templates";
IsmDump = "${GLITE_LOCATION_VAR}/workload_manager/ismdump.fl";
DispatcherType = "filelist";
PboxHostName = "";
MatchRetryPeriod = 21600;
CeForwardParameters = {"GlueHostMainMemoryVirtualSize","GlueHostMainMemoryRAMSize","GlueCEPolicyMaxCPUTime"};
CeMonitorServices = {};
PboxPortNum = 6699;
Input = "${GLITE_LOCATION_VAR}/workload_manager/input.fl";
LogLevel = 5;
LogFile = "${GLITE_LOCATION_LOG}/workload_manager_events.log";
WorkerThreads = 5;
DliServiceName = "data-location-interface";
MaxOutputSandboxSize = -1;
IsmBlackList = {};
IsmUpdateRate = 600;
];
Important parameters in this section are:
EnableBulkMM = true; // enables bulk matchmaking for collections; it is advised to set it to true
MaxOutputSandboxSize = -1; // sets a limit on the output sandbox size in order to avoid filling the disk; -1 means unlimited.
Note that for now only -1 should be used because of bug https://savannah.cern.ch/bugs/?func=detailitem&item_id=27215
IsmBlackList // allows a list of banned CEs to be set
IsmUpdateRate = 600; // information supermarket update rate (in seconds)
WorkerThreads = 5; // enables multithreading in the WM component, which speeds up the matchmaking process. 5 is a good compromise between machine load and speed.
LogMonitor Section
LogMonitor = [
MainLoopDuration = 5;
GlobusDownTimeout = 7200;
CondorLogRecycleDir = "${GLITE_LOCATION_VAR}/logmonitor/CondorG.log/recycle";
LockFile = "${GLITE_LOCATION_VAR}/logmonitor/lock";
LogLevel = 5;
JobsPerCondorLog = 1000;
LogFile = "${GLITE_LOCATION_LOG}/logmonitor_events.log";
ExternalLogFile = "${GLITE_LOCATION_LOG}/logmonitor_external.log";
RemoveJobFiles = true;
AbortedJobsTimeout = 600;
IdRepositoryName = "irepository.dat";
CondorLogDir = "${GLITE_LOCATION_VAR}/logmonitor/CondorG.log";
MonitorInternalDir = "${GLITE_LOCATION_VAR}/logmonitor/internal";
];
Usually there is no need to change the default parameters, with the exception of: RemoveJobFiles = true;
By default it is set to false.
Setting it to true forces Condor to remove unused internal files once jobs are in a final state.
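To verify how a parameter such as RemoveJobFiles is currently set, the value can be read from the relevant section of glite_wms.conf. The following sketch is not a real ClassAd parser: section_body and get_value are hypothetical helpers that rely on simple bracket counting, and they assume that no "[" or "]" characters appear inside quoted strings.

```python
import re


def section_body(conf_text, section):
    """Return the text between 'Section = [' and its matching ']', or None."""
    start = re.search(rf"^\s*{re.escape(section)}\s*=\s*\[", conf_text, re.M)
    if not start:
        return None
    depth, i = 1, start.end()
    while i < len(conf_text) and depth:
        if conf_text[i] == "[":
            depth += 1
        elif conf_text[i] == "]":
            depth -= 1
        i += 1
    if depth:
        return None  # unbalanced brackets
    return conf_text[start.end():i - 1]


def get_value(conf_text, section, key):
    """Return the raw value of 'key = value;' inside a section, or None."""
    body = section_body(conf_text, section)
    if body is None:
        return None
    match = re.search(rf"^\s*{re.escape(key)}\s*=\s*(.*?);\s*$", body, re.M)
    return match.group(1).strip() if match else None
```

For example, `get_value(open("/opt/glite/etc/glite_wms.conf").read(), "LogMonitor", "RemoveJobFiles")` would return the string found in the file, to be compared with "true".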
Job Controller Section
JobController = [
CondorRelease = "${CONDORG_INSTALL_PATH}/bin/condor_release";
OutputFileDir = "${GLITE_LOCATION_VAR}/jobcontrol/condorio";
CondorRemove = "${CONDORG_INSTALL_PATH}/bin/condor_rm";
DagmanMaxPre = 10;
Input = "${GLITE_LOCATION_VAR}/jobcontrol/queue.fl";
CondorQuery = "${CONDORG_INSTALL_PATH}/bin/condor_q";
MaximumTimeAllowedForCondorMatch = 1800;
SubmitFileDir = "${GLITE_LOCATION_VAR}/jobcontrol/submit";
LogFile = "${GLITE_LOCATION_LOG}/jobcontoller_events.log";
CondorDagman = "${CONDORG_INSTALL_PATH}/bin/condor_dagman";
CondorSubmit = "${CONDORG_INSTALL_PATH}/bin/condor_submit";
LogLevel = 5;
ContainerRefreshThreshold = 1000;
LockFile = "${GLITE_LOCATION_VAR}/jobcontrol/lock";
];
Usually there is no need to change default Job Controller configuration parameters.
Network Server Section
Although the Network Server is no longer installed on WMS nodes, some configuration parameters in its section of the global configuration file are still needed
NetworkServer = [
QuotaInsensibleDiskPortion = 2.0;
SandboxStagingPath = "${GLITE_LOCATION_VAR}/SandboxDir";
Gris_Port = 2170;
II_Timeout = 100;
EnableQuotaManagement = false;
DLI_SI_CatalogTimeout = 60;
Gris_DN = "mds-vo-name=local, o=grid";
EnableDynamicQuotaAdjustment = false;
II_DN = "mds-vo-name=local, o=grid";
MasterThreads = 8;
ConnectionTimeout = 300;
LogFile = "${GLITE_LOCATION_LOG}/networkserver_events.log";
ListMatchParadise = "${GLITE_LOCATION_TMP}/MatchArea";
LogLevel = 5;
II_Contact = "egee-bdii.cnaf.infn.it";
Gris_Timeout = 20;
BacklogSize = 64;
II_Port = 2170;
ListeningPort = 7772;
QuotaAdjustmentAmount = 10000;
MaxInputSandboxSize = 10000000;
DispatcherThreads = 10;
];
The important parameters in this section are those regarding contact with the information system. In particular:
- II_Contact = "egee-bdii.cnaf.infn.it"; sets the hostname of the BDII to be contacted
- II_Port = 2170; sets the port on which the BDII is contacted
- Gris_DN = "mds-vo-name=local, o=grid"; sets the path where the BDII publishes information
- II_Timeout = 100; sets the timeout for the BDII query. It is important that this value is not too small: it is very dangerous if many BDII queries fail because of timeouts. The risk is that all the information in the InformationSupermarket expires, so that jobs in Waiting status do not match any CE (they remain in Waiting status for a long time, until a query to the BDII succeeds). By default this value is set to 30, but 100 is safer.
- MaxInputSandboxSize = 10000000; sets a limit on the size of the input sandbox of the JDL. The unit is bytes.
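When choosing a value for II_Timeout it can be useful to measure how long a query to the configured BDII takes from the WMS node. The sketch below builds an OpenLDAP ldapsearch command line from the parameters above and times it. The helper names, the GlueCE search filter and the use of the -l time limit as a stand-in for II_Timeout are assumptions made for illustration; the query that the WMS itself issues is probably different.

```python
import subprocess
import time


def bdii_query_command(host="egee-bdii.cnaf.infn.it", port=2170,
                       base="mds-vo-name=local, o=grid", timeout=100):
    """Build an ldapsearch command that lists the CE identifiers in a BDII."""
    return [
        "ldapsearch", "-x", "-LLL",        # anonymous bind, plain LDIF output
        "-H", f"ldap://{host}:{port}",
        "-b", base,
        "-l", str(timeout),                # search time limit in seconds
        "(objectClass=GlueCE)", "GlueCEUniqueID",  # assumed filter and attribute
    ]


def time_query(command, run=subprocess.run):
    """Run the command and return (elapsed seconds, exit code)."""
    started = time.monotonic()
    proc = run(command, capture_output=True, text=True)
    return time.monotonic() - started, proc.returncode
```

Running `time_query(bdii_query_command())` a few times at different hours gives a rough idea of how close the typical query time is to the configured timeout.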
A full description of the WMS configuration file can be found here:
https://twiki.cnaf.infn.it/cgi-bin/twiki/view/EgeeJra1It/WMSConfFile
Garbage Collection
Major sources of garbage in a WMS node are:
- Sandbox directories
- Condor Sandbox directories
- Log Files
Sandbox Directories
They are located in /var/glite/SandboxDir/ and are automatically purged when the output of a job is retrieved (get-output).
If a VO does not take care of retrieving job output from the WMS node, the sandbox directories can become a serious problem for disk occupancy.
The sandbox directories can also become problematic when a job (or a certain number of jobs) has a huge output sandbox and the limit on the output sandbox size is not enabled in the glite_wms.conf file (see MaxOutputSandboxSize in the Workload Manager section).
In any case it is a good habit to purge the sandbox directories periodically.
By default, when installing with YAIM, the glite-wms-purge cron job is installed in /etc/cron.d and should do the job. Unfortunately there is no written documentation for this script other than its runtime help.
Although the official purger is to be preferred, the sandbox directories can also simply be deleted by a homemade script that removes directories older than a certain date.
If very recent directories need to be purged the official purger is the better choice, since it checks the current job status before removing anything.
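A homemade purge script of the kind mentioned above could start from something like the following sketch, which only lists candidate directories and deletes nothing. The helper name old_directories is hypothetical, and the code assumes that the directories to examine are the immediate children of the given root; the real layout under /var/glite/SandboxDir/ may be nested, in which case the function has to be applied one level further down.

```python
import time
from pathlib import Path


def old_directories(root, max_age_days, now=None):
    """Return the sub-directories of root not modified for max_age_days days."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400
    return sorted(
        entry for entry in Path(root).iterdir()
        if entry.is_dir()
        and not entry.is_symlink()       # never follow symbolic links
        and entry.stat().st_mtime < cutoff
    )
```

Printing the result of `old_directories("/var/glite/SandboxDir", 30)` and reviewing it by hand before removing anything is a reasonable first step, since unlike the official purger this sketch does not check the job status.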
Condor Sandbox Directories
They are located in:
- /var/glite/jobcontrol/condorio/
- /var/glite/jobcontrol/submit/
By default the WMS is not configured to purge those directories automatically when jobs are in a final state.
To enable their purging, set:
RemoveJobFiles = true;
in the LogMonitor section of the glite_wms.conf file (note that a restart of LM and JC is also needed).
It is strongly advised to set this parameter to true, since the Condor sandbox directories can make disk occupancy a problem.
If the WMS is already overloaded by the Condor sandbox directories, they can be removed manually (with a script). Jobs that have not finished could be lost, so only directories older than at least two weeks should be removed manually.
Log Files
gLite log files are located in /var/log/glite/.
On a heavily used WMS this directory can become quite big, on the order of tens of GB.
Old rotated log files should be removed manually.
WMS User Authentication and Mapping
WMS user authentication is performed by the WMProxy component, based on a GACL module.
The fundamental file used to manage WMS authentication is /opt/glite/etc/glite_wms_wmproxy.gacl.
This file contains the names of the VOs that are allowed to use the WMS. An example .gacl file that allows the dteam and ops VOs is:
<gacl version='0.0.1'>
<entry>
<voms>
<fqan>ops/Role=NULL</fqan>
</voms>
<allow>
<exec/>
</allow>
</entry>
<entry>
<voms>
<fqan>dteam/Role=NULL</fqan>
</voms>
<allow>
<exec/>
</allow>
</entry>
</gacl>
This file can also contain the DNs of single users who are allowed to use the WMS resources. The following entry in the .gacl file allows Daniele Cesini to use the WMS even if he is not in one of the allowed VOs:
<entry>
<user>
<dn>/C=IT/O=INFN/OU=Personal Certificate/L=CNAF/CN=Daniele Cesini/Email=daniele.cesini@cnaf.infn.it/</dn>
</user>
<allow><exec/></allow>
</entry>
An entry with a deny tag can be used to ban users or VOs:
<entry>
<voms>
<fqan>/dteam/Role=admin/Capability=NULL</fqan>
</voms>
<deny><exec/></deny>
</entry>
As the previous examples show, it is possible to allow or ban users and VOs on the basis of their FQANs (i.e. those returned by the voms-proxy-info --fqan command).
For more information on the .gacl file syntax please refer to the DATAMAT WMProxy pages at: http://trinity.datamat.it/projects/EGEE/wiki/wiki.php?n=GliteWmsWmproxy.Gacl
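After editing the .gacl file by hand it is easy to lose track of who is allowed and who is banned. The sketch below parses a GACL document with the Python standard library and lists every credential with the action applied to it. It only looks at the exec permission and at the voms/fqan and user/dn elements used in the examples above; summarize_gacl is a hypothetical helper, the DN in the sample is invented, and the way WMProxy itself evaluates the entries is not reproduced here.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<gacl version='0.0.1'>
<entry>
<voms><fqan>dteam/Role=NULL</fqan></voms>
<allow><exec/></allow>
</entry>
<entry>
<voms><fqan>/dteam/Role=admin/Capability=NULL</fqan></voms>
<deny><exec/></deny>
</entry>
<entry>
<user><dn>/C=IT/O=Example/CN=Some User</dn></user>
<allow><exec/></allow>
</entry>
</gacl>"""


def summarize_gacl(xml_text):
    """Return a list of (action, kind, value) tuples, one per credential."""
    rows = []
    for entry in ET.fromstring(xml_text).findall("entry"):
        for action in ("allow", "deny"):
            if entry.find(f"{action}/exec") is None:
                continue  # this entry does not allow/deny exec
            for fqan in entry.findall("voms/fqan"):
                rows.append((action, "fqan", (fqan.text or "").strip()))
            for dn in entry.findall("user/dn"):
                rows.append((action, "dn", (dn.text or "").strip()))
    return rows
```

A file that is not well-formed XML (for instance one with a missing closing gacl tag) makes ET.fromstring raise a ParseError, which is itself a useful check before restarting the service.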
User mapping on a WMS node is done through LCMAPS as in any other gLite service, so the fundamental places to look in case of mapping problems are:
- the gridmap file: /etc/grid-security/grid-mapfile;
- the lcmaps log: /var/log/glite/lcmaps.log;
- the gridmapdir: /etc/grid-security/gridmapdir/;
- the existing pool accounts for a VO or a VO group/role
Managing the WMProxy
Admin Tools
As already stated, the WMProxy behavior is controlled by editing the .gacl file.
This file can be edited by hand with any text editor or via the administrative tools that are packaged and distributed with the WMProxy itself.
These administrative tools are:
- glite-wms-wmproxy-gacladmin -- WMProxy GACL files handling
- glite-wms-wmproxy-gridmapfile2gacl -- Generation of WMProxy AuthZ GACL file from grid-mapfile
For a detailed description of the usage of these tools please refer to http://trinity.datamat.it/projects/EGEE/wiki/wiki.php?n=WMProxyService.Tools
A third tool can be used to purge leftover delegated proxies:
- glite-wms-wmproxy-purge-proxycache -- Purging of expired proxies from WMProxy delegation cache area
A YAIM-installed WMS comes with a cron job (/etc/cron.d/glite-wms-wmproxy-purge-proxycache.cron) that makes use of this tool.
Draining a WMS
It is sometimes necessary to put the WMS in draining mode, so that it does not accept new submission requests but still allows any other operation, such as output retrieval. This can easily be achieved using the so-called drain file, ${GLITE_LOCATION_VAR}/.drain:
<gacl>
<entry>
<any-user/>
<deny><exec/></deny>
</entry>
</gacl>
Further information can be found here: http://trinity.datamat.it/projects/EGEE/wiki/wiki.php?n=WMProxyService.Drain
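Switching draining on and off can be scripted by creating and removing the drain file. The sketch below writes exactly the content shown above. The helper names are hypothetical, /var/glite is used as a fallback when GLITE_LOCATION_VAR is not set in the environment, and it is an assumption that deleting the file is enough to leave draining mode; whether WMProxy needs a restart to notice the change should be checked on the page linked above.

```python
import os
from pathlib import Path

# Content of the drain file, as shown in the text above.
DRAIN_XML = """<gacl>
<entry>
<any-user/>
<deny><exec/></deny>
</entry>
</gacl>
"""


def drain_path(glite_location_var=None):
    """Return the path of the drain file."""
    base = glite_location_var or os.environ.get("GLITE_LOCATION_VAR", "/var/glite")
    return Path(base) / ".drain"


def set_draining(enabled, glite_location_var=None):
    """Create the drain file (enabled=True) or remove it (enabled=False)."""
    path = drain_path(glite_location_var)
    if enabled:
        path.write_text(DRAIN_XML)
    elif path.exists():
        path.unlink()
    return path
```

Calling `set_draining(True)` before a scheduled intervention and `set_draining(False)` afterwards keeps the operation short and repeatable.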
Common Troubleshooting
Monitoring the WMS
A first version of the gLite WMS and LCG RB monitoring tool, written by Yvan Calas (Yvan.Calas at cern dot ch), is currently available on the following wiki page: http://wiki.egee-see.org/index.php/RB/WMS_Monitoring.
Two RPMs are needed: one for the client side and one for the server side. Many thanks to Dusan Vudragovic (dusan at cern ch) who optimized and packaged this tool for usage outside CERN.
Note that a new version is coming soon that will be able to monitor stand-alone LB nodes.
Any questions can be sent to Dusan or Yvan.
