(Added link to the mana page for the command queue Nagios alert) |
Our buildbot masters use several queuedirs to perform out-of-process tasks such as pushing events to pulse or uploading logs.
== Queuedirs ==
A queuedir is a simple directory structure on disk where individual jobs are stored in files. The files are moved between directories depending on what state they're in.
* <tt>'''dead'''</tt>: failed jobs go here. This is bad.
We currently have two queuedirs: '''/dev/shm/queue/commands''' and '''/dev/shm/queue/pulse'''.
== Processors ==
We currently have two processors: '''command_runner.py''' and '''pulse_publisher.py'''. They run as services, start on boot, and are managed by Puppet. Both run out of a virtualenv in /builds/buildbot/queue.
We have Nagios checks in place to ensure that the queue processors are running and that there are no dead jobs.
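As a rough illustration of what a dead-job check could look like, here is a minimal sketch. It is not the actual Nagios plugin. It assumes that each queuedir has a <tt>dead</tt> subdirectory directly beneath it and that per-job log files can be told apart by a <tt>.log</tt> suffix; the function names are hypothetical.

```python
import os

# Queuedir paths as listed on this page.
QUEUEDIRS = ["/dev/shm/queue/commands", "/dev/shm/queue/pulse"]


def count_dead_jobs(queuedir, log_suffix=".log"):
    """Return the number of job files in <queuedir>/dead.

    Assumption: per-job log files end in log_suffix and are not counted.
    """
    dead_dir = os.path.join(queuedir, "dead")
    if not os.path.isdir(dead_dir):
        return 0
    return sum(
        1
        for name in os.listdir(dead_dir)
        if not name.endswith(log_suffix)
        and os.path.isfile(os.path.join(dead_dir, name))
    )


def check_queuedirs(queuedirs=QUEUEDIRS):
    """Return (exit_code, message) using Nagios plugin codes: 0 OK, 2 CRITICAL."""
    counts = {q: count_dead_jobs(q) for q in queuedirs}
    total = sum(counts.values())
    if total:
        return 2, "CRITICAL: %d dead job(s): %r" % (total, counts)
    return 0, "OK: no dead jobs"
```

Calling <tt>check_queuedirs()</tt> on a master returns the exit code and message a wrapper script could print before exiting.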
== Troubleshooting ==
If a processor isn't running, it can be restarted via /etc/init.d/command_runner or /etc/init.d/pulse_publisher.
If there are dead jobs, you can read the per-job log files in the dead directory. After resolving the issue, move the job files back into <tt>'''new'''</tt> to be retried, or delete them.
* ''Note: there is a <tt>retry_dead_queue</tt> sub-command for [http://hg.mozilla.org/build/tools/file/5053b0ea4564/buildfarm/maintenance/manage_masters.py manage_masters.py].'' For Ansible users, there is also <tt>deadqueue.yml</tt> in [https://github.com/mozilla/build-ansible build-ansible]. Example invocation: <tt>ansible-playbook -i master-inventory.py deadqueue.yml</tt>
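Moving dead jobs back into <tt>'''new'''</tt> by hand can be sketched as follows. This is only an illustration, not the code behind <tt>retry_dead_queue</tt> or <tt>deadqueue.yml</tt>. It assumes <tt>dead</tt> and <tt>new</tt> sit directly under the queuedir and that log files end in <tt>.log</tt>; the function name is hypothetical.

```python
import os
import shutil


def retry_dead_jobs(queuedir, log_suffix=".log"):
    """Move job files from <queuedir>/dead into <queuedir>/new.

    Assumption: per-job logs end in log_suffix; they are left in place.
    Returns the list of job file names that were moved.
    """
    dead_dir = os.path.join(queuedir, "dead")
    new_dir = os.path.join(queuedir, "new")
    moved = []
    for name in sorted(os.listdir(dead_dir)):
        src = os.path.join(dead_dir, name)
        # Skip logs and anything that is not a regular file.
        if name.endswith(log_suffix) or not os.path.isfile(src):
            continue
        shutil.move(src, os.path.join(new_dir, name))
        moved.append(name)
    return moved
```

For example, <tt>retry_dead_jobs("/dev/shm/queue/commands")</tt> would requeue every dead command job. Only do this once the underlying failure has been fixed, otherwise the jobs will most likely end up dead again.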
== Implementation ==
http://hg.mozilla.org/build/tools/file/739018ba9ff1/lib/python/buildtools/queuedir.py
http://hg.mozilla.org/build/puppet-manifests/file/0deb57fc17ae/modules/buildmaster/manifests/queue.pp
== See also ==
https://mana.mozilla.org/wiki/display/NAGIOS/Command+Queue
{{Release Engineering How To|Queue Directories}}