ReleaseEngineering/NoReboots

Warning: This RelEng page is obsolete! It is largely based on the Buildbot infrastructure. Though some of it may apply to Taskcluster, this page needs to be updated.

1 Why no reboots:
2 Results:
- 2.1 One hour sample from 01-13-2015 01 CST Raw Data
- 2.2 Three hour sample from 01-13-2015 02 CST Raw Data
3 No reboots timeline (puppet):
4 How no-reboot mode is enabled (idleizer, post_flight)
- 4.1 post_flight checks:
  - 4.1.1 hostname blacklist
  - 4.1.2 build api
5 What issues have been noted since no-reboot work started?
6 How are we tracking the status of machines, and measuring effectiveness?

Why no reboots:

Turning off reboots saves machine time in several ways: 70-120 seconds of reboot time, greater potential for FS cache utilization, and opportunities for sharing work (like clobbering) across jobs more effectively. As such, if successful, we should expect to see fewer instances launched per day as the throughput of the infrastructure rises.

Results:

A preliminary survey of the time saved by non-rebooting spot instances suggests that -- if jobs are not failing more often as a result of the changes -- a great deal of wasted time is being recovered:

   minimum time saved (seconds) = (<iterations_seen> - <halts_seen>) * <reboot_sec_avg>
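As a rough illustration, the formula can be applied to the "all spot instances" rows of the two samples reported on this page. This is only a sketch: the function name is hypothetical, and the inputs are copied from the tables in this section.

```python
def min_time_saved_sec(iterations_seen, halts_seen, reboot_sec_avg):
    """Lower bound on seconds saved: each loop that did not end in a halt skipped one reboot."""
    return (iterations_seen - halts_seen) * reboot_sec_avg


# "all spot instances" rows from the one hour and three hour samples
one_hour = min_time_saved_sec(2804, 178, 79)
three_hour = min_time_saved_sec(8547, 605, 92)

print(round(one_hour / 3600, 1))    # hours saved, one hour sample
print(round(three_hour / 3600, 1))  # hours saved, three hour sample
```

This gives roughly 58 and 203 hours, in line with the "around 60" and "around 204" hour figures quoted for the samples; small differences are expected because the averages in the tables are rounded.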

One hour sample from 01-13-2015 01 CST Raw Data

The test/try machines spend more time rebooting (79s) and less time doing pre-flight tasks (~45s) on average while builders have an opposite skew (82s pre-flight tasks and 67s reboot time). We seem to have saved around 60 hours of machine time during this interval.

start_to_bb_sec_avg	reboot_sec_avg	start_to_bb_sec_total	halts_seen	reboot_percentage	iterations_seen	type	reboot_sec_total
47	79	132104	178	6	2804	all spot instances	14070
45	79	121606	167	6	2676	test and try	13329
82	67	10498	11	8	128	builders	741

Three hour sample from 01-13-2015 02 CST Raw Data

During the longer sample, builder reboot times increased dramatically. However, this is likely the result of builder issues that were occurring around the time the data was taken; test/try results remained stable. We seem to have saved around 204 hours of machine time during this interval.

start_to_bb_sec_avg	reboot_sec_avg	start_to_bb_sec_total	halts_seen	reboot_percentage	iterations_seen	type	reboot_sec_total
47	92	410104	605	7	8547	all spot instances	55769
45	80	362177	553	7	7884	test and try	44648
72	213	47927	52	7	663	builders	11121

NOTE: The data above was gathered via runner's InfluxDB logging and verified by spot-checking runner logs (/var/log/runner.log). The reports themselves were generated by this script: http://pastebin.mozilla.org/8191890

No reboots timeline (puppet):

 date:        Fri Jan 16 17:47:33 2015 +0000
 summary:     Bug 1122601 - Coerce runner to reboot after particular job types; r=rail

 date:        Tue Jan 13 21:07:53 2015 +0000
 summary:     Bug 1109932 - Enable reboots for all try, talos, and test slaves; r=Callek

 date:        Tue Jan 06 14:26:55 2015 -0600
 summary:     Bug 1118125 - Turn off osx reboots; r=Callek

 date:        Tue Dec 30 19:40:23 2014 +0000
 summary:     Bug 1103123 - Turn off rebooting of all linux slaves; r=callek

 date:        Thu Dec 18 18:45:10 2014 +0000
 summary:     Bug 1113245 - Remove cleanslate process list on Linux and Mac machines during reboots with halt.py; r=rail

 date:        Fri Dec 12 21:33:04 2014 +0000
 summary:     Bug 1103123 - Turn off rebooting of talos machines; r=catlee

How no-reboot mode is enabled (idleizer, post_flight)

Buildbot is now started and managed by runner, which runs tasks in an infinite loop in a specified order (each task is blocking). Buildbot therefore initiates a graceful shutdown immediately after accepting a job, so that the runner tasks can loop around again after it has finished. A single runner loop looks like this:

   <tasks before buildbot> -> buildbot.py [graceful shutdown] -> <tasks after buildbot> -> post_flight.py

The graceful shutdown is initiated by idleizer.py. Then post_flight.py decides whether to shut down the machine or go forward with another loop.
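The loop described above can be sketched as follows. This is a minimal model and not runner's actual code: run_loop, the task names other than buildbot, and the should_halt callback (standing in for the decision made by post_flight.py) are all assumptions made for illustration.

```python
def run_loop(tasks, should_halt, max_iterations=None):
    """Run blocking tasks in order, repeating until should_halt() returns True.

    tasks: ordered list of zero-argument callables (hypothetical stand-ins
    for the runner tasks).
    should_halt: hypothetical stand-in for the post_flight.py decision.
    Returns the number of completed loops.
    """
    iterations = 0
    while max_iterations is None or iterations < max_iterations:
        for task in tasks:
            task()  # each task blocks until it finishes
        iterations += 1
        if should_halt():
            break  # the real system would halt or reboot the machine here
    return iterations


log = []
tasks = [
    lambda: log.append("pre_flight"),  # hypothetical task before buildbot
    lambda: log.append("buildbot"),    # takes one job, then shuts down gracefully
    lambda: log.append("cleanup"),     # hypothetical task after buildbot
]
decisions = iter([False, False, True])  # halt after the third loop
loops = run_loop(tasks, lambda: next(decisions))
print(loops, len(log))
```

Because every task is blocking, a machine only reaches the halt decision once the job it accepted has run to completion.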

post_flight checks:

hostname blacklist

Any machine whose hostname matches a regular expression in this list will be rebooted by post_flight. For example, ["^tst-", "^t-"] would reboot all test machines after any job.
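A minimal sketch of this check, assuming the patterns are applied with Python's re.search (whether post_flight uses search or match is not stated here). The function name and the example hostnames are hypothetical.

```python
import re


def hostname_requires_reboot(hostname, blacklist):
    """Return True if any blacklist pattern matches the hostname."""
    return any(re.search(pattern, hostname) for pattern in blacklist)


blacklist = ["^tst-", "^t-"]

# Hypothetical hostnames, used only to exercise the patterns
print(hostname_requires_reboot("tst-linux64-spot-001", blacklist))
print(hostname_requires_reboot("bld-linux64-spot-001", blacklist))
```

Since both patterns are anchored with ^, only hostnames that start with the given prefix match, so a builder whose name merely contains "t-" somewhere in the middle is not rebooted.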

build api

BuildAPI is used to fetch data about the most recent job; if the job failed, the slave is rebooted. This feature may need to be disabled, since it could mask failures. The thinking behind turning it on was that we could track problems via logging and avoid tree closures when problems occur. This is likely too optimistic; failing hard may be better in the end.

jobname blacklist

Works like the hostname blacklist, except that it acts on the name of the most recently run job (which is known to BuildAPI).
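Taken together, the three checks could be combined along these lines. This is a sketch under stated assumptions, not the real post_flight.py: the function name, the order of the checks, and the shape of the job record (a dict with "name" and "failed" keys in place of a BuildAPI response) are all hypothetical.

```python
import re


def reboot_reason(hostname, last_job, hostname_blacklist, jobname_blacklist):
    """Return why the machine should reboot, or None to start another loop.

    last_job: dict with hypothetical keys "name" (str) and "failed" (bool),
    standing in for whatever BuildAPI reports about the most recent job.
    """
    if any(re.search(p, hostname) for p in hostname_blacklist):
        return "hostname blacklist"
    if last_job["failed"]:
        return "last job failed"
    if any(re.search(p, last_job["name"]) for p in jobname_blacklist):
        return "jobname blacklist"
    return None


# Hypothetical hostname, job names, and jobname pattern
job_ok = {"name": "Linux x86-64 build", "failed": False}
job_bad = {"name": "Linux x86-64 build", "failed": True}
job_listed = {"name": "Linux x86-64 valgrind", "failed": False}

for job in (job_ok, job_bad, job_listed):
    print(reboot_reason("bld-linux64-spot-001", job, ["^tst-", "^t-"], ["valgrind"]))
```

Returning the reason instead of a bare boolean makes it easier to log why a machine was rebooted, which matters if the failure-based check ends up masking problems.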

What issues have been noted since no-reboot work started?

Since December '14, these bugs have been noted as possibly connected to Runner/NoReboots:

bug 1114541 bug 989048 bug 1109932 bug 1114688 bug 1111137

How are we tracking the status of machines, and measuring effectiveness?

Runner constantly uploads task stats to InfluxDB. For dashboards, see: https://stats.taskcluster.net/grafana/#/dashboard/db/runner
