Full Partitions


Completely full partitions should be avoided. The consequences of running out of space depend on which partition fills up, but there are a few partitions (or directories that fill their partitions) whose filling can partially or completely stop the machine from working. Depending on your configuration, these directories can belong to the same partition, so that filling one automatically affects the others, or each can be mapped to its own partition, which limits the damage if one fills.
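To see which of the two layouts a machine has, you can ask df where each directory is mounted. The following is a minimal sketch; the function names mount_of and same_partition are made up for this example, and it assumes mount points without spaces in their names.

```shell
#!/bin/sh
# Print the mount point of the filesystem that holds a directory.
# df -P uses the POSIX output format, where the mount point is the last field.
mount_of() {
    df -P "$1" | awk 'NR == 2 { print $NF }'
}

# Succeed if two directories live on the same partition.
same_partition() {
    [ "$(mount_of "$1")" = "$(mount_of "$2")" ]
}

# Example: does filling /tmp also fill /var on this machine?
if same_partition /var /tmp; then
    echo "/var and /tmp share a partition"
else
    echo "/var and /tmp are on separate partitions"
fi
```

If the directories share a partition, a single alert on that partition covers both, but a runaway file in one will also hurt the other.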

What to monitor

/var

/var contains many subdirectories that are used by many processes for all sorts of purposes, and those processes can stop functioning correctly if the space is exhausted. Examples of processes that stop working correctly when /var fills up are pbs_server, pbs_mom, maui, and the mysql and postgres databases. Grid services and other services that write their logs under /var/log will continue to work, but they will stop logging; see savannah bug 17418 (https://savannah.cern.ch/bugs/?func=detailitem&item_id=17418). The syslog facility will also close its log files and stop logging.

None of the processes mentioned above automatically resumes correct behaviour when space becomes available again, so it is extremely important to monitor /var and send an alert when the used space reaches roughly 80-90% (the exact threshold is a judgment call). One reason to leave so much headroom is that some processes can fill /var with temporary files stored in /var/tmp. Sometimes a very large temporary file is written, all the services are affected, and shortly afterwards the file is deleted, leaving the sysadmin puzzled as to why the services no longer work properly. On top of this, /var cannot be cleaned up automatically the way /tmp can, because it contains files that must not be deleted, such as the log files themselves or the database files. Manual intervention after the alert is required. Automatically stopping the services when the available space becomes too small should be evaluated, in order to avoid services running without logging.
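The alert described above can be sketched with df alone. This is only an illustration: the function names used_percent and over_threshold are hypothetical, the 80% threshold is just the lower end of the range suggested above, and how the warning reaches you (mail, a monitoring system) is left open.

```shell
#!/bin/sh
# Print the used-space percentage (without the % sign) for a mount point.
# In df -P output the fifth column is the capacity, e.g. "42%".
used_percent() {
    df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Print a warning and return 0 if usage is at or above the threshold;
# return non-zero otherwise.
over_threshold() {
    mount_point=$1
    threshold=$2
    used=$(used_percent "$mount_point")
    if [ "$used" -ge "$threshold" ]; then
        echo "WARNING: $mount_point is ${used}% full (threshold ${threshold}%)"
        return 0
    fi
    return 1
}

# Example: check /var against a hypothetical 80% threshold.
# "|| true" keeps the script's exit status clean when there is nothing to report.
over_threshold /var 80 || true
```

Run from cron every few minutes, a check like this gives a chance of catching short-lived large files in /var/tmp, although a file that appears and disappears between two runs will still go unnoticed.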

/tmp

/tmp is also used by processes to store temporary files, and it is often used as scratch space by users. While /var cannot really be cleaned up, especially if it is filled by database tables or log files rather than by /var/tmp files or cached rpms, /tmp can and should be cleaned up. There are many ways to keep /tmp clean; one of the most common is to use tmpwatch (http://www.linuxcommand.org/man_pages/tmpwatch8.html) in a cron job to clean up /tmp and other directories periodically. An example of a cron job found in /etc/cron.daily on RedHat-like systems is:

  /usr/sbin/tmpwatch 240 /tmp
  /usr/sbin/tmpwatch 720 /var/tmp
  for d in /var/{cache/man,catman}/{cat?,X11R6/cat?,local/cat?}; do
     if [ -d "$d" ]; then
        /usr/sbin/tmpwatch -f 720 "$d"
     fi
  done

As the man page explains, the time is expressed in hours, so 240 corresponds to 10 days and 720 to 30 days.

Scratch space

Particularly on batch nodes, it can be useful to create a dedicated scratch space partition for users' temporary files, so that they do not unintentionally interfere with the correct behaviour of the machine. If this partition fills up, it affects only the users' applications and not the system. The scratch space can have the same monitoring and cleanup procedures as /tmp but with different parameters, for example tailored to the walltime of the longest queue of the batch system. If you use Cfengine, it can also be used to clean /scratch.
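As a rough illustration of tailoring the cleanup to the batch system, the tmpwatch age can be derived from the longest queue's walltime. All the values here are assumptions made for the example: a longest walltime of 72 hours, a 24 hour safety margin, a scratch area mounted at /scratch, and the helper name scratch_max_age.

```shell
#!/bin/sh
# Hypothetical values: walltime of the longest queue and a safety margin,
# both in hours (tmpwatch takes its age argument in hours).
MAX_WALLTIME_HOURS=72
MARGIN_HOURS=24

# Files not accessed for longer than the longest possible job plus a margin
# should no longer belong to a running job.
scratch_max_age() {
    echo $(( $1 + $2 ))
}

age=$(scratch_max_age "$MAX_WALLTIME_HOURS" "$MARGIN_HOURS")

# Dry run: print the command instead of executing it.
# Remove the leading "echo" once the age looks right for your site.
echo /usr/sbin/tmpwatch "$age" /scratch
```

Keeping the margin separate from the walltime makes it easy to adjust the cleanup when the queue configuration changes.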