Our buildbot masters use several queuedirs to perform out-of-process tasks, such as pushing events to pulse or uploading logs.
A queuedir is a simple directory structure on disk where individual jobs are stored in files. The files are moved between directories depending on what state they're in.
- tmp: write out new job files here before moving into new.
- new: when job files are moved here, the queue processors pick them up and move them into cur to indicate that they are being processed.
- cur: jobs currently in progress are here.
- logs: output and debugging information for current and finished jobs is stored here. Logs older than 5 minutes get deleted.
- dead: failed jobs go here. This is bad.
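The job lifecycle above can be sketched in Python as follows. This is only an illustration of the tmp/new/cur/dead pattern, not the code our processors actually run: the function names, the uuid-based job naming, and the choice to delete a job once it succeeds are all assumptions made for the example.

```python
import os
import uuid

STATES = ("tmp", "new", "cur", "logs", "dead")


def make_queuedir(root):
    """Create the state directories under root if they are missing."""
    for state in STATES:
        os.makedirs(os.path.join(root, state), exist_ok=True)


def add_job(root, data):
    """Write a job into tmp, then rename it into new.

    Writing to tmp first means a processor never sees a half-written file.
    """
    job_id = uuid.uuid4().hex  # hypothetical naming scheme
    tmp_path = os.path.join(root, "tmp", job_id)
    with open(tmp_path, "wb") as f:
        f.write(data)
    os.rename(tmp_path, os.path.join(root, "new", job_id))
    return job_id


def claim_job(root):
    """Move one job from new to cur; return (job_id, data) or None."""
    for job_id in sorted(os.listdir(os.path.join(root, "new"))):
        src = os.path.join(root, "new", job_id)
        dst = os.path.join(root, "cur", job_id)
        try:
            os.rename(src, dst)
        except FileNotFoundError:
            continue  # another processor claimed it first
        with open(dst, "rb") as f:
            return job_id, f.read()
    return None


def finish_job(root, job_id, ok):
    """Remove a successful job (an assumption), or move a failed one to dead."""
    cur_path = os.path.join(root, "cur", job_id)
    if ok:
        os.remove(cur_path)
    else:
        os.rename(cur_path, os.path.join(root, "dead", job_id))
```

The renames stay within one filesystem, so on POSIX each move is atomic and a job file is only ever visible in one state directory at a time.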
We currently have two queuedirs: /dev/shm/queue/commands and /dev/shm/queue/pulse.
We currently have two processors: command_runner.py and pulse_publisher.py. These are run as services, start on boot, and are managed by puppet. They both run out of a virtualenv in /builds/buildbot/queue.
We have nagios checks in place to ensure that the queue processors are running, and that there are no dead jobs.
If a processor isn't running, it can be restarted via /etc/init.d/command_runner or /etc/init.d/pulse_publisher.
If there are dead jobs, you can read the per-job log files in the dead directory. After resolving the issue, move the job files back into new to be retried, or delete them.
- Note: there is a retry_dead_queue sub-command for manage_masters.py. For ansible lovers, there is also deadqueue.yml in build-ansible. Example invocation: ansible-playbook -i master-inventory.py deadqueue.yml
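For illustration, retrying dead jobs by hand amounts to something like the Python sketch below. It is not the implementation of retry_dead_queue or deadqueue.yml. The function name is hypothetical, and the default filter assumes that per-job log files end in ".log", which may not match the real naming; pass a different is_job_file if it doesn't.

```python
import os


def retry_dead_jobs(root, is_job_file=lambda name: not name.endswith(".log")):
    """Move job files from dead back into new; return the names moved.

    The default is_job_file is an assumption about how logs are named.
    """
    dead_dir = os.path.join(root, "dead")
    new_dir = os.path.join(root, "new")
    moved = []
    for name in sorted(os.listdir(dead_dir)):
        if not is_job_file(name):
            continue  # leave per-job logs in place for later inspection
        os.rename(os.path.join(dead_dir, name), os.path.join(new_dir, name))
        moved.append(name)
    return moved
```

For example, retry_dead_jobs("/dev/shm/queue/pulse") would requeue every dead pulse job, so only run something like this after the underlying issue is fixed.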