ReleaseEngineering/Queue directories
Revision as of 21:32, 15 November 2014
Our buildbot masters use several queuedirs to perform out-of-process tasks such as pushing events to pulse or uploading logs.
queuedirs
A queuedir is a simple directory structure on disk where individual jobs are stored in files. The files are moved between directories depending on what state they're in.
- tmp: write out new job files here before moving into new.
- new: when job files are moved into here, the queue processors will pick them up and move them into cur to indicate they're currently being processed.
- cur: jobs currently in progress are here.
- logs: output and debugging information for current and finished jobs are here. Logs older than 5 minutes get deleted.
- dead: failed jobs go here. Any job in this directory indicates a problem that needs to be investigated.
We currently have two queuedirs: /dev/shm/queue/commands and /dev/shm/queue/pulse.
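The directory states above can be sketched in a few lines of Python. This is only an illustration of the tmp -> new -> cur -> dead flow, not the code in queuedir.py: the function names (make_queuedir, add_job, claim_job, finish_job) are hypothetical, and deleting a job file on success is an assumption, since the page only says where failed jobs go. Writing into tmp and then renaming into new means a processor never sees a half-written job file, because a rename within one filesystem is atomic.

```python
import os

STATES = ("tmp", "new", "cur", "logs", "dead")


def make_queuedir(root):
    """Create the five state directories under root."""
    for state in STATES:
        os.makedirs(os.path.join(root, state), exist_ok=True)


def add_job(root, name, payload):
    """Write the job file into tmp, then rename it into new."""
    tmp_path = os.path.join(root, "tmp", name)
    with open(tmp_path, "w") as f:
        f.write(payload)
    new_path = os.path.join(root, "new", name)
    os.rename(tmp_path, new_path)  # atomic on the same filesystem
    return new_path


def claim_job(root):
    """Move one job from new into cur; return its path, or None if idle."""
    new_dir = os.path.join(root, "new")
    for name in sorted(os.listdir(new_dir)):
        dst = os.path.join(root, "cur", name)
        try:
            os.rename(os.path.join(new_dir, name), dst)
        except FileNotFoundError:
            continue  # something else picked it up first
        return dst
    return None


def finish_job(root, name, ok):
    """Assumption: remove the job on success, move it to dead on failure."""
    cur_path = os.path.join(root, "cur", name)
    if ok:
        os.remove(cur_path)
    else:
        os.rename(cur_path, os.path.join(root, "dead", name))
```

For example, add_job(root, "job1", "...") followed by claim_job(root) leaves job1 in cur, and finish_job(root, "job1", ok=False) leaves it in dead.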
processors
We currently have two processors: command_runner.py and pulse_publisher.py. These are run as services, start on boot, and are managed by puppet. They both run out of a virtualenv in /builds/buildbot/queue.
We have nagios checks in place to ensure that the queue processors are running, and that there are no dead jobs.
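As a rough idea of what a dead-job check could look like, here is a minimal sketch. It is not the check we actually deploy: the function name check_dead_jobs is hypothetical, and it assumes that any file in the dead directory counts as a problem. The return codes follow the usual Nagios plugin convention (0 = OK, 2 = CRITICAL, 3 = UNKNOWN).

```python
import os

# Standard Nagios plugin exit codes.
OK, CRITICAL, UNKNOWN = 0, 2, 3


def check_dead_jobs(queue_root):
    """Return (exit_code, message) for one queuedir.

    Assumption: every entry in <queue_root>/dead is treated as a failure.
    """
    dead = os.path.join(queue_root, "dead")
    try:
        entries = os.listdir(dead)
    except OSError as e:
        return UNKNOWN, "UNKNOWN: cannot read %s: %s" % (dead, e)
    if entries:
        return CRITICAL, "CRITICAL: %d file(s) in %s" % (len(entries), dead)
    return OK, "OK: no dead jobs in %s" % dead
```

A wrapper script would call this once per queuedir (for example /dev/shm/queue/commands and /dev/shm/queue/pulse), print the messages, and exit with the highest code returned.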
troubleshooting
If a processor isn't running, it can be restarted via /etc/init.d/command_runner or /etc/init.d/pulse_publisher, respectively.
If there are dead jobs, you can read the log files in the dead directory. After resolving the issue, either move the job files back into new to be retried, or delete them. Note: manage_masters.py (http://hg.mozilla.org/build/tools/file/5053b0ea4564/buildfarm/maintenance/manage_masters.py) has a retry_dead_queue sub-command.
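Moving dead jobs back by hand amounts to a rename from dead into new. The sketch below shows that step for a single queuedir; it is not the retry_dead_queue implementation, and the helper names (list_dead, retry_dead) are hypothetical. Because the dead directory also holds log files and this page doesn't say how they are named, the sketch takes an explicit list of job file names rather than moving everything it finds.

```python
import os


def list_dead(queue_root):
    """Return the sorted names of everything in <queue_root>/dead."""
    return sorted(os.listdir(os.path.join(queue_root, "dead")))


def retry_dead(queue_root, job_names):
    """Move the named job files from dead back into new.

    Returns the names that were actually moved; names that are not
    regular files in dead are skipped.
    """
    moved = []
    for name in job_names:
        src = os.path.join(queue_root, "dead", name)
        if not os.path.isfile(src):
            continue
        os.rename(src, os.path.join(queue_root, "new", name))
        moved.append(name)
    return moved
```

Run it as the user that owns the queuedir, first calling list_dead("/dev/shm/queue/commands") to see what is there, then passing only the job files you want retried to retry_dead.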
implementation
http://hg.mozilla.org/build/tools/file/739018ba9ff1/lib/python/buildtools/queuedir.py
http://hg.mozilla.org/build/tools/file/739018ba9ff1/buildbot-helpers/command_runner.py
http://hg.mozilla.org/build/tools/file/739018ba9ff1/buildbot-helpers/pulse_publisher.py