BDII
From SysadminWiki
The LCG information system consists of four distinct parts: the Generic Information Provider (GIP), the MDS GRIS, the site BDII and the top level BDII. The BDII documentation maintained at CERN is quite good; see the BDII links paragraph.
BDII problems
The Information System runs as a hierarchy of slapd daemons, each aggregating the information from the level below.
There are 3 levels:
- resource level (BDII/globus-mds), running on SEs and CEs
- site BDII, which collects the information for all the services at a site
- top level BDII, which collects the information from all the sites
Previously, all the services ran globus-mds, which started a slapd on port 2135. Recently the DPM globus-mds has been converted to a resource BDII, which runs a slapd on port 2170.
All the BDIIs, whatever their level, run on port 2170, and their processes look alike and take the same arguments. You therefore cannot tell them apart with a ps (http://www.linux.ie/newusers/beginners-linux-guide/ps.php) command, which gives something like the following in every case:
    ps -aux|grep bdii
    edguser 3308 0.1 0.6 6100 3384 ? S Oct08 5:05 /usr/bin/perl -w /opt/bdii//sbin/bdii-update /opt/bdii/etc/bdii.conf
    edguser 3494 0.1 0.5 5636 2764 ? S Oct08 3:18 bdii-fwd [accepting proxy for localhost]
    edguser 6212 0.4 1.0 77944 5108 ? S 13:21 0:00 /usr/sbin/slapd -f /opt/bdii//var/2171/bdii-slapd.conf -h ldap://localhost:2171 -u edguser
    edguser 6383 0.5 1.0 76920 5116 ? S 13:22 0:00 /usr/sbin/slapd -f /opt/bdii//var/2172/bdii-slapd.conf -h ldap://localhost:2172 -u edguser
    edguser 6620 1.1 1.0 77936 5112 ? S 13:22 0:00 /usr/sbin/slapd -f /opt/bdii//var/2173/bdii-slapd.conf -h ldap://localhost:2173 -u edguser
    edguser 6716 0.0 0.5 5636 2864 ? S 13:23 0:00 bdii-fwd [131.154.100.4:49417 --> 127.0.0.1:2173]
    edguser 6718 0.0 0.5 5636 2876 ? S 13:23 0:00 bdii-fwd [131.154.100.4:49417 <-- 127.0.0.1:2173]
Starting the wrong BDII on a node
The system administrator now has to pay particular attention to which slapd they start, because the different BDII levels can conflict on the port they use. For example, if a site BDII (or indeed a top level BDII) is started by mistake on a DPM machine, the DPM resource BDII will not start. This is not easy to spot at first sight from the process list, as it was when the service level was handled by globus-mds and could be recognised by the port number or by the different arguments passed to the slapd process. The error does not seem to be recorded in the log files.
To find out what is actually running on a machine you can, however, run an ldapsearch (http://www.novell.com/documentation/nas4nw/usnas4nw/nasnwenu/ldapsrch.html) query.
Different levels of BDII return different amounts of information. However, the first 15 lines of the output are enough to tell which type of BDII you are running. The queries differ slightly: for a resource BDII the mds-vo-name relative DN has the value resource, rather than local (top level BDII) or MY-SITE-NAME (site BDII).
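The rule in the paragraph above can be sketched as a small shell function. This is only an illustration, not part of any BDII package: the name bdii_level_from_ldif is made up here, and it assumes the ldapsearch output contains a line of the form dn: mds-vo-name=NAME,o=grid, where NAME is resource, local or a site name.

```shell
# Hypothetical helper (not shipped with the BDII): read ldapsearch output on
# stdin and print the BDII level implied by the first mds-vo-name entry that
# sits directly under o=grid.
bdii_level_from_ldif() {
  awk '
    tolower($0) ~ /^dn: *mds-vo-name=[^,]+,o=grid$/ {
      name = $0
      sub(/^[^=]*=/, "", name)    # drop the leading "dn: mds-vo-name="
      sub(/,.*/, "", name)        # drop the trailing ",o=grid"
      level = tolower(name)
      if (level == "resource")   print "resource"
      else if (level == "local") print "top"
      else                       print "site " name
      found = 1
      exit
    }
    END { if (!found) print "unknown" }
  '
}
```

It could be used as ldapsearch -x -H ldap://my-dpm.my-domain:2170 -b o=grid | head -20 | bdii_level_from_ldif, where the host name is the placeholder used on this page. A result of "unknown" means that no matching dn line was seen in the input.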
Resource BDII
A successful answer from your DPM resource BDII should return information only about the resources it refers to (for example DPM resources). The query should contain the relative DN mds-vo-name=resource,o=grid.
    ldapsearch -x -H ldap://my-dpm.my-domain:2170 -b mds-vo-name=resource,o=grid| head -20
    version: 2
    #
    # filter: (objectclass=*)
    # requesting: ALL
    #
    # resource, grid
    dn: mds-vo-name=resource,o=grid
    objectClass: GlueTop
    # httpg://my-dpm.my-domain:8443/srm/managerv1, resource, grid
    dn: GlueServiceUniqueID=httpg://my-dpm.my-domain:8443/srm/managerv1,mds-vo-name=resource,o=grid
    objectClass: GlueTop
Site BDII
A successful answer from your site BDII should return information about all the resources at your site. The query contains the relative DN mds-vo-name=MY-SITE-NAME,o=grid.
    ldapsearch -x -H ldap://my-bdii.my-domain:2170 -b mds-vo-name=MY-SITE-NAME,o=grid| head -20
    version: 2
    #
    # filter: (objectclass=*)
    # requesting: ALL
    #
    # MY-SITE-NAME, grid
    dn: mds-vo-name=MY-SITE-NAME,o=grid
    objectClass: GlueTop
    # my-ce.my-domain:2119/jobmanager-lcgpbs-atlas, MY-SITE-NAME, grid
    dn: GlueCEUniqueID=my-ce.my-domain:2119/jobmanager-lcgpbs-atlas,mds-vo-name=MY-SITE-NAME,o=grid
    objectClass: GlueCETop
Top level BDII
A successful answer from a top level BDII should return information on all the sites in the grid. So if your BDII is reporting information about other sites, you know that you have started a top level BDII. The query contains the relative DN mds-vo-name=local,o=grid.
    ldapsearch -x -H ldap://my-bdii.my-domain:2170 -b mds-vo-name=local,o=grid| head -20
    version: 2
    #
    # filter: (objectclass=*)
    # requesting: ALL
    #
    # local, grid
    dn: mds-vo-name=local,o=grid
    objectClass: GlueTop
    # SOME-OTHER-SITE-NAME, local, grid
    dn: mds-vo-name=SOME-OTHER-SITE-NAME,mds-vo-name=local,o=grid
    objectClass: GlueTop
    # someone-else-ce.someone-else-domain:2119/jobmanager-pbs-aegis, SOME-OTHER-SITE-NAME, local, grid
    dn: GlueCEUniqueID=someone-else-ce.someone-else-domain:2119/jobmanager-pbs-atlas,mds-vo-name=SOME-OTHER-SITE-NAME,mds-vo-name=local,o=grid
    objectClass: GlueCETop
NOTE
All the queries could have been run simply with o=grid; however, it is useful to know whether each BDII is correctly configured and answers the correct query. For example, you could have a misconfigured DPM resource BDII responding to a query with the old mds-vo-name=local,o=grid. In this case the site BDII would not see the DPM SE resource, and the storage resources would not be published at all.
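The three queries above can be combined into one check that shows which base DN a given host actually answers. The function below is a sketch under assumptions: bdii_probe is a made-up name, port 2170 and the three base DNs are taken from this page, and an answer is taken to mean that ldapsearch prints at least one dn: line for the base entry.

```shell
# Hypothetical helper: probe HOST on port 2170 with each of the three base DNs
# used on this page and report which of them return an entry.
# Usage: bdii_probe HOST SITE_NAME
bdii_probe() (
  # The body runs in a subshell, so these variables stay local to the call.
  host=$1
  site=$2
  for name in resource "$site" local; do
    base="mds-vo-name=${name},o=grid"
    # -s base asks only for the base entry itself; -LLL prints plain LDIF
    # without the comment lines.
    if ldapsearch -x -LLL -H "ldap://${host}:2170" -b "$base" -s base \
         '(objectclass=*)' 2>/dev/null | grep -qi '^dn:'; then
      printf 'answers: %s\n' "$base"
    else
      printf 'no answer: %s\n' "$base"
    fi
  done
)
```

For example, bdii_probe my-dpm.my-domain MY-SITE-NAME on a correctly configured DPM node would be expected to report an answer only for mds-vo-name=resource,o=grid. An answer for mds-vo-name=local,o=grid on that node would point to the misconfiguration described in the note above.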
How To Fix it
If you are running the wrong level of BDII in the wrong place, you need to kill all the slapd processes manually, make sure there are no leftover processes around (ps aux|grep bdii will help), and reconfigure the machine with the correct BDII. If you are using YAIM (https://twiki.cern.ch/twiki//bin/view/LCG/YaimGuide311), check that all the BDII variables are correct in your site-info.def.
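The leftover check can be made a little more precise than a plain grep. The filter below is a sketch only: bdii_leftovers is a made-up name, and it assumes the usual ps aux column layout, in which the PID is the second field and the command starts at the eleventh.

```shell
# Hypothetical helper: read `ps aux` output on stdin and print the PID and the
# command line of every process whose command mentions bdii.
bdii_leftovers() {
  awk '
    # Skip the header line (non-numeric PID) and the grep/awk processes
    # that are doing the searching.
    $2 ~ /^[0-9]+$/ && $11 !~ /(^|\/)(awk|grep)$/ {
      cmd = ""
      for (i = 11; i <= NF; i++) cmd = cmd (i > 11 ? " " : "") $i
      if (cmd ~ /bdii/) print $2, cmd
    }
  '
}
```

Run it as ps aux | bdii_leftovers after killing the daemons. Empty output means that no process with bdii in its command line was found.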
