- 1 Why no reboots:
- 2 Results:
- 3 No reboots timeline (puppet):
- 4 How no-reboot mode is enabled (idllizer, post_flight)
- 5 What issues have been noted since no-reboot work started?
- 6 How are we tracking the status of machines, and measuring effectiveness?
Why no reboots:
Turning off reboots saves machine time in several ways: 70-120 seconds of reboot time per job, greater potential for FS cache utilization, and more opportunity to share work (like clobbering) across jobs. As such, if successful, we should expect to see fewer instances launched per day as the throughput of the infrastructure rises.
Results:
A preliminary survey of the time saved by non-rebooting spot instances suggests that, provided jobs are not failing more often as a result of the changes, a great deal of wasted time is being recovered:
minimum time saved (seconds) = (<iterations_seen> - <halts_seen>) * <reboot_sec_avg>
One hour sample from 01-13-2015 01 CST (Raw Data)
The test/try machines spend more time rebooting (79s) and less time doing pre-flight tasks (~45s) on average, while builders show the opposite skew (82s of pre-flight tasks and 67s of reboot time). We appear to have saved around 60 hours of machine time during this interval.
| 47 | 79 | 132104 | 178 | 6 | 2804 | all spot instances | 14070 |
| 45 | 79 | 121606 | 167 | 6 | 2676 | test and try | 13329 |
Three hour sample from 01-13-2015 02 CST (Raw Data)
During the longer sample, builder reboot times increased dramatically; however, this is likely the result of builder issues which were occurring around the time the data was taken. Test/try results remained stable. We appear to have saved around 204 hours of machine time during this interval.
| 47 | 92 | 410104 | 605 | 7 | 8547 | all spot instances | 55769 |
| 45 | 80 | 362177 | 553 | 7 | 7884 | test and try | 44648 |
NOTE: The data above was gathered via runner's InfluxDB logging and verified by spot-checking runner logs (/var/log/runner.log). The reports themselves were generated by this script: http://pastebin.mozilla.org/8191890
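The minimum-time-saved formula can be sanity-checked against the sample rows above. This sketch assumes the iteration, halt, and average-reboot columns correspond to <iterations_seen>, <halts_seen>, and <reboot_sec_avg> (an inference from the ~60 and ~204 hour figures, not a documented schema):

```python
def min_time_saved_sec(iterations_seen, halts_seen, reboot_sec_avg):
    """minimum time saved (seconds) = (iterations - halts) * avg reboot time"""
    return (iterations_seen - halts_seen) * reboot_sec_avg

# One-hour sample, all spot instances: 2804 iterations, 178 halts, 79s avg reboot
one_hour = min_time_saved_sec(2804, 178, 79)    # 207454 s, about 57.6 h (~60 h reported)

# Three-hour sample, all spot instances: 8547 iterations, 605 halts, 92s avg reboot
three_hour = min_time_saved_sec(8547, 605, 92)  # 730664 s, about 203 h (~204 h reported)

print(one_hour / 3600.0, three_hour / 3600.0)
```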
No reboots timeline (puppet):
date: Fri Jan 16 17:47:33 2015 +0000 summary: Bug 1122601 - Coerce runner to reboot after particular job types; r=rail
date: Tue Jan 13 21:07:53 2015 +0000 summary: Bug 1109932 - Enable reboots for all try, talos, and test slaves; r=Callek
date: Tue Jan 06 14:26:55 2015 -0600 summary: Bug 1118125 - Turn off osx reboots; r=Callek
date: Tue Dec 30 19:40:23 2014 +0000 summary: Bug 1103123 - Turn off rebooting of all linux slaves; r=callek
date: Thu Dec 18 18:45:10 2014 +0000 summary: Bug 1113245 - Remove cleanslate process list on Linux and Mac machines during reboots with halt.py; r=rail
date: Fri Dec 12 21:33:04 2014 +0000 summary: Bug 1103123 - Turn off rebooting of talos machines; r=catlee
How no-reboot mode is enabled (idllizer, post_flight)
Buildbot is now started and managed by runner, which runs tasks in an infinite loop in a specified order (each task blocks until it completes). As such, buildbot initiates a graceful shutdown immediately after accepting a job, so that the runner tasks can loop around again once the job finishes. A single runner loop looks like this:
<tasks before buildbot> -> buildbot.py [graceful shutdown] -> <tasks after buildbot> -> post_flight.py
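The loop above can be sketched as follows. This is an illustrative model, not the actual runner source; the task callables and their names are placeholders:

```python
# Sketch of a runner-style loop: each task blocks until it finishes.
# run_buildbot() returns after buildbot's graceful shutdown (one job),
# at which point the post-flight tasks run and the loop starts over.

def run_forever(pre_flight_tasks, post_flight_tasks, run_buildbot):
    while True:
        for task in pre_flight_tasks:    # <tasks before buildbot>
            task()
        run_buildbot()                   # blocks until graceful shutdown
        for task in post_flight_tasks:   # <tasks after buildbot>, ending with post_flight.py
            task()
```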
Any machine whose hostname matches a regular expression in this list will be rebooted by post_flight. For example: ["^tst-", "^t-"] would reboot all test machines after any job.
BuildAPI is used to fetch data about the most recent job; if that job failed, the slave is rebooted. This feature may need to be disabled, since it could mask failures. The thinking, in turning it on, was that we could track problems via logging and avoid tree closures when problems occur. This is likely too optimistic; failing hard may be better in the end.
This works like the hostname blacklist, except that it acts on the name of the most recently run job (as reported by BuildAPI).
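The three reboot triggers described above can be sketched as a single decision function. This is a hypothetical illustration of the logic, not the actual post_flight.py; the names and list contents are placeholders:

```python
import re

# Placeholder pattern lists; the real ones live in puppet/post_flight config.
HOSTNAME_PATTERNS = ["^tst-", "^t-"]   # reboot all test machines after any job
JOB_NAME_PATTERNS = []                 # same idea, matched against the last job's name

def should_reboot(hostname, last_job_name, last_job_failed):
    # 1. Hostname blacklist: reboot if the hostname matches any pattern.
    if any(re.search(p, hostname) for p in HOSTNAME_PATTERNS):
        return True
    # 2. Job failure (via BuildAPI): reboot on failure; note this may mask failures.
    if last_job_failed:
        return True
    # 3. Job-name blacklist: reboot if the last job's name matches any pattern.
    return any(re.search(p, last_job_name) for p in JOB_NAME_PATTERNS)
```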
What issues have been noted since no-reboot work started?
These bugs, noted since December '14, have a possible connection to Runner/NoReboots:
How are we tracking the status of machines, and measuring effectiveness?
Runner constantly uploads task stats to InfluxDB; for dashboards, see: https://stats.taskcluster.net/grafana/#/dashboard/db/runner
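For illustration, a per-task stat point might look something like the following. The series and column names here are assumptions; the real schema lives in runner's InfluxDB logging code:

```python
import time

def task_stat(task_name, duration_sec, exit_code):
    # Hypothetical point in the InfluxDB 0.8-style JSON write format
    # (series name and columns are illustrative, not runner's actual schema).
    return {
        "name": "runner_tasks",
        "columns": ["time", "task", "duration_sec", "exit_code"],
        "points": [[int(time.time()), task_name, duration_sec, exit_code]],
    }
```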