Difference between revisions of "Sheriffing/Job Visibility Policy"

From MozillaWiki
m (4) Scheduled on every push: Clarify coalescing)
m (7) Low intermittent failure rate: Grammar)
(2 intermediate revisions by the same user not shown)
Line 2: Line 2:
 
== Requirements for being shown in the default TBPL view ==
 
== Requirements for being shown in the default TBPL view ==
  
This page was created to clarify the existing requirements that a platform/test-suite has to meet, before its jobs can be shown in the default [https://tbpl.mozilla.org/ TBPL] view. To propose changes to this policy, please speak to the sheriffs and/or post to [https://lists.mozilla.org/listinfo/dev-platform dev.platform].
+
This page was created to clarify the requirements that a platform/test-suite has to meet, before its jobs can be shown in the default [https://tbpl.mozilla.org/ TBPL] view. Common sense will apply in cases where some of the requirements are not applicable for a particular platform/build/test type.
 +
 
 +
To propose changes to this policy, please speak to the sheriffs and/or post to [https://lists.mozilla.org/listinfo/dev-platform dev.platform].
  
 
=== 1) Has an active owner ===
 
=== 1) Has an active owner ===
Line 30: Line 32:
 
=== 5) Easily run on try server ===
 
=== 5) Easily run on try server ===
 
* Otherwise developers who have had their landing backed out for breaking the job type will be unable to easily debug/fix the failures, particularly if they only reproduce on our infrastructure.
 
* Otherwise developers who have had their landing backed out for breaking the job type will be unable to easily debug/fix the failures, particularly if they only reproduce on our infrastructure.
* Developers should not be expected to guess try chooser options, so http://trychooser.pub.build.mozilla.org/ must have been updated.
+
* Developers should not be expected to guess try chooser options, so http://trychooser.pub.build.mozilla.org/ should be updated if appropriate.
  
 
=== 6) Outputs failures in a TBPL-starrable format ===
 
=== 6) Outputs failures in a TBPL-starrable format ===
* Failures must appear in the TBPL annotated summary (ie: use the standard TEST-UNEXPECTED-{FAIL,PASS}, PROCESS-CRASH, ... format), otherwise sheriffs & devs have to open the full logs.
+
* It is highly recommended that new test harnesses do not reinvent the wheel and instead use parts of MozBase (eg: mozcrash) if at all possible - speak to the A-Team for more info.
* Failures must output the test names correctly, so TBPL can perform the BzAPI intermittent-failure searches for bug suggestions.
+
* Failures must appear in the TBPL annotated summary (ie: they must match the [https://hg.mozilla.org/webtools/tbpl/file/tip/php/inc/GeneralErrorFilter.php log parsing regexp]), otherwise the full log will have to be opened for every failure.
* Exceptions & timeouts must be caught and handled with a TBPL compatible failure message.
+
* Failure output must be in the format expected by TBPL's [https://hg.mozilla.org/webtools/tbpl/file/tip/php/inc/AnnotatedSummaryGenerator.php bug suggestion generator] (otherwise sheriffs have to manually search Bugzilla when starring intermittent failures):
* The sheriffs will be happy to help advise how to meet this requirement.
+
** For in-tree/product issues (eg: test failures, crashes):
 +
*** Pipe symbol used as delimiter.
 +
*** 1st token: One of {TEST-UNEXPECTED-FAIL, TEST-UNEXPECTED-PASS, PROCESS-CRASH}.
 +
*** 2nd token: A unique test name/filepath (not a generic test loader that runs 100s of other test files, since otherwise bug suggestions will return too many results).
 +
*** 3rd token: The specific failure message (eg: the test part that failed, the top frame of a crash or the leaked objects list for a leak).
 +
** For non test-specific issues (eg: infra/automation/harness):
 +
*** TBPL falls back to searching Bugzilla for the entire failure line (excluding mozharness logging prefix), so it should be both unique to that failure type & repeatable (ie: no use of process IDs for which there will rarely be a repeat match against a bug summary).
 +
** Exceptions & timeouts must be handled with appropriate log output (eg: the failure line must state in which test the timeout occurred, not just that the entire run has timed out).
 +
* The sheriffs will be happy to advise regarding TBPL log output compatibility.
  
=== 7) Per job intermittent failure rate of less than 5% ===
+
=== 7) Low intermittent failure rate ===
 
* A high failure rate:
 
* A high failure rate:
 
** Causes unnecessary sheriff workload.
 
** Causes unnecessary sheriff workload.
 
** Affects the ability to sheriff the trees as a whole, particularly during times of heavy coalescing.
 
** Affects the ability to sheriff the trees as a whole, particularly during times of heavy coalescing.
 
** Undermines devs confidence in the platform/test-suite - which as demonstrated by Firefox for Android, permanently affects their willingness to believe any future failures, even once the intermittent-failure rate is lowered.
 
** Undermines devs confidence in the platform/test-suite - which as demonstrated by Firefox for Android, permanently affects their willingness to believe any future failures, even once the intermittent-failure rate is lowered.
* A mozilla-central push results in ~400 jobs. A 5% failure rate would mean 20 failures on that push - ie: an OrangeFactor of 20. The typical OrangeFactor across all trunk trees is normally 3-5, so a 5% failure rate is extremely generous.
+
* A mozilla-central push results in ~400 jobs. The typical OrangeFactor across all trunk trees is normally (excluding the recent spike) 3-4, ie: a failure rate of ~1%.
 +
* Therefore as a rough guide a new platform/testsuite must have at most a 5% per job failure rate initially, and ideally <1% longer term.
 +
* However, sheriffs will make the final determination of whether a job type has too many intermittent failures. This will be a based on a combination of factors including failure rate, length of time the failures have been occurring, owner interest in fixing them & whether TBPL is able to make bug suggestions.
  
 
=== 8) Must avoid patterns known to cause non deterministic failures ===
 
=== 8) Must avoid patterns known to cause non deterministic failures ===
 +
* Must avoid pulling the tip of external repositories as part of the build - since landings there can cause non-obvious failures (legacy exception being gaia). If an external repository is absolutely necessary, instead reference the desired changeset from a manifest in mozilla-central (like talos does).
 
* Must not rely on resources outside of the build network:
 
* Must not rely on resources outside of the build network:
 
** Since these will cause failures when the external site is unavailable, as well as impacting end to end times & adding noise to performance tests.
 
** Since these will cause failures when the external site is unavailable, as well as impacting end to end times & adding noise to performance tests.
Line 64: Line 77:
  
 
=== 11) Easy for a dev to run locally ===
 
=== 11) Easy for a dev to run locally ===
* Is supported by mach.
+
* Supported by mach (if appropriate).
* Ideally part of mozilla-central (legacy exception being Talos).  
+
* Ideally part of mozilla-central (legacy exceptions being Talos, gaia).
  
 
== Requesting changes in visibility ==
 
== Requesting changes in visibility ==
Line 73: Line 86:
 
* Your platform/test-suite will still be being run, just not shown on the default view. This model has worked well for many projects/build types (eg jetpack, xulrunner, spidermonkey).
 
* Your platform/test-suite will still be being run, just not shown on the default view. This model has worked well for many projects/build types (eg jetpack, xulrunner, spidermonkey).
 
* To see it, append '&showall=1' to the URL ({{bug|748833}} will add a checkbox for this to the TBPL UI).
 
* To see it, append '&showall=1' to the URL ({{bug|748833}} will add a checkbox for this to the TBPL UI).
* To filter the jobs displayed, under the 'Filters' menu use the 'job name' field (which supports regex).
+
* To filter the jobs displayed, under the 'Filters' menu use the 'job name' field (which supports regexp).
 
* eg: to see both ASan & Valgrind jobs on mozilla-central (neither of which are shown by default), use: [https://tbpl.mozilla.org/?showall=1&jobname=(asan|valgrind) https://tbpl.mozilla.org/?showall=1&jobname=(asan|valgrind)]
 
* eg: to see both ASan & Valgrind jobs on mozilla-central (neither of which are shown by default), use: [https://tbpl.mozilla.org/?showall=1&jobname=(asan|valgrind) https://tbpl.mozilla.org/?showall=1&jobname=(asan|valgrind)]
  
 
== The future ==
 
== The future ==
 
* Planned improvements to our tooling will likely mean that some of these requirements can be relaxed in the future, as well as making it easier for maintainers of non-default-view job types to track their success/failure without having to monitor TBPL continuously.
 
* Planned improvements to our tooling will likely mean that some of these requirements can be relaxed in the future, as well as making it easier for maintainers of non-default-view job types to track their success/failure without having to monitor TBPL continuously.
* The successor to TBPL ([[Auto-tools/Projects/TBPL2]]) will support:
+
* Planned features for the successor to TBPL ([[Auto-tools/Projects/TBPL2]]) include:
** Multiple dashboards/views for different use cases/teams (giving us more flexibility than just "default view" or "&showall=1").
+
** Multiple dashboards/views for different use-cases/teams (giving us more flexibility than just "default view" or "&showall=1").
 
** Opt-in notifications (email, IRC, dashboard, ...?) of failures for desired job types (see proposal in {{bug|851061}}).
 
** Opt-in notifications (email, IRC, dashboard, ...?) of failures for desired job types (see proposal in {{bug|851061}}).
 
* [[Auto-tools/Projects/Bisect_in_the_cloud]] will allow sheriffs to more easily narrow regression ranges for job types that do not run on every push, making it more viable to accept them into certain views/dashboards.
 
* [[Auto-tools/Projects/Bisect_in_the_cloud]] will allow sheriffs to more easily narrow regression ranges for job types that do not run on every push, making it more viable to accept them into certain views/dashboards.

Revision as of 16:25, 1 April 2013

Requirements for being shown in the default TBPL view

This page was created to clarify the requirements that a platform/test-suite has to meet, before its jobs can be shown in the default TBPL view. Common sense will apply in cases where some of the requirements are not applicable for a particular platform/build/test type.

To propose changes to this policy, please speak to the sheriffs and/or post to dev.platform.

1) Has an active owner

  • Who is committed to ensuring the other requirements are met not just initially, but over the long term.
  • Who will ensure the new job type is switched off to save resources, should we stop finding it useful in the future.

2) Breakage is expected to be followed by tree closure or backout

  • Failures visible in the default view (other than known intermittent/transient ones) must have their cause backed out in a timely fashion, or else the tree must be closed until the cause is diagnosed.
  • Why? If tier != 1 jobs were instead made visible in the default view, they would:
    • Interfere with ability to sheriff the tree:
      • Indistinguishable from tier-1 failures.
      • Appear in the failure count/cause the tab to glow.
      • Slow down navigation of failures when using keyboard shortcuts.
    • Cause extra workload for sheriffs by making them perform initial diagnosis/bug filing & then starring of the failure on every push until it is fixed an indeterminate amount of time later.
    • Cause confusion for non-sheriffs using project branches/try-server, as well as on all trees at the weekends when there are no employed sheriffs.
  • If your platform/test falls under the category of "someone should just file a bug and it will be investigated by our team later", then it unfortunately does not meet this requirement. From past requests this normally translates to "group X think this job type is important but we want to delegate the task of monitoring it to someone else".

3) Runs on all trees that merge into mozilla-central

  • Otherwise job failures when tree X merges into mozilla-central will not be attributable to a single changeset, resulting in either tree closure or backout of the entire merge (see requirement #2).
  • When filing the release engineering bug to enable your job on all the required trees, ask to enable it on "mozilla-central based trees" and release engineering will enable it in the default config from which all trunk trees inherit (unless the various tree owners have explicitly opted out). As a rough guide, mozilla-central based trees include mozilla-inbound, fx-team, services-central, ionmonkey, graphics as well as many of the other project/disposable repositories.

4) Scheduled on every push

  • Otherwise job failures will not be attributable to a single changeset, resulting in either tree closure or backout of multiple pushes (see requirement #2).
  • An exception is made for nightly builds with a virtually equivalent non-nightly variant that is built on every push, and for tests run on PGO builds (given that PGO builds take an inordinate amount of time, we still schedule them every 3/6 hours depending on tree, and relatively speaking there are not too many PGO-only test failures).
  • Note also that coalescing (buildbot queue collapsing when there is more than one queued job of the exact same tree/type) may mean that not all scheduled jobs actually get run. Whilst coalescing makes sheriffing harder, it's a necessary evil given that automation infrastructure demand frequently outstrips supply.
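
As a rough illustration of why coalescing makes failures harder to attribute, here is a minimal sketch of queue collapsing. It is not buildbot's implementation: the `coalesce` helper, the job names and the push IDs are all hypothetical, and keeping the newest request per tree/type is an assumption of the sketch.

```python
from collections import OrderedDict


def coalesce(queue):
    """Collapse queued jobs that share the exact same (tree, job_type).

    `queue` is a list of (tree, job_type, push_id) tuples, oldest first.
    Only the newest request per (tree, job_type) survives; keeping the
    newest one is an assumption of this sketch.
    """
    newest = OrderedDict()
    for tree, job_type, push_id in queue:
        key = (tree, job_type)
        newest.pop(key, None)  # drop the older queued request, if any
        newest[key] = push_id
    return [(tree, job_type, push_id)
            for (tree, job_type), push_id in newest.items()]


# Hypothetical queue: three pushes land before a mochitest slave frees up.
queue = [
    ("mozilla-inbound", "linux64 mochitest-1", 101),
    ("mozilla-inbound", "linux64 mochitest-1", 102),
    ("mozilla-inbound", "linux64 xpcshell", 102),
    ("mozilla-inbound", "linux64 mochitest-1", 103),
]
ran = coalesce(queue)
skipped = [job for job in queue if job not in ran]

print("ran:", ran)
print("skipped:", skipped)
```

In this example mochitest-1 only runs on push 103, so a new failure there could have been caused by any of pushes 101 to 103.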

5) Easily run on try server

  • Otherwise developers who have had their landing backed out for breaking the job type will be unable to easily debug/fix the failures, particularly if they only reproduce on our infrastructure.
  • Developers should not be expected to guess try chooser options, so http://trychooser.pub.build.mozilla.org/ should be updated if appropriate.

6) Outputs failures in a TBPL-starrable format

  • It is highly recommended that new test harnesses do not reinvent the wheel and instead use parts of MozBase (eg: mozcrash) if at all possible - speak to the A-Team for more info.
  • Failures must appear in the TBPL annotated summary (ie: they must match the log parsing regexp), otherwise the full log will have to be opened for every failure.
  • Failure output must be in the format expected by TBPL's bug suggestion generator (otherwise sheriffs have to manually search Bugzilla when starring intermittent failures):
    • For in-tree/product issues (eg: test failures, crashes):
      • Pipe symbol used as delimiter.
      • 1st token: One of {TEST-UNEXPECTED-FAIL, TEST-UNEXPECTED-PASS, PROCESS-CRASH}.
      • 2nd token: A unique test name/filepath (not a generic test loader that runs 100s of other test files, since otherwise bug suggestions will return too many results).
      • 3rd token: The specific failure message (eg: the test part that failed, the top frame of a crash or the leaked objects list for a leak).
    • For non test-specific issues (eg: infra/automation/harness):
      • TBPL falls back to searching Bugzilla for the entire failure line (excluding mozharness logging prefix), so it should be both unique to that failure type & repeatable (ie: no use of process IDs for which there will rarely be a repeat match against a bug summary).
    • Exceptions & timeouts must be handled with appropriate log output (eg: the failure line must state in which test the timeout occurred, not just that the entire run has timed out).
  • The sheriffs will be happy to advise regarding TBPL log output compatibility.
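
To make the three-token format above concrete, here is a minimal sketch of emitting and checking a failure line. The `format_failure` and `parse_failure` helpers and the test path are hypothetical, and the check is far simpler than TBPL's own log parsing regexp, so treat it as an aid for harness authors rather than as a description of what TBPL does.

```python
# Status tokens listed in the requirement above.
STATUSES = ("TEST-UNEXPECTED-FAIL", "TEST-UNEXPECTED-PASS", "PROCESS-CRASH")


def format_failure(status, test_path, message):
    """Build a pipe-delimited failure line: status | test | message."""
    if status not in STATUSES:
        raise ValueError(f"unknown status token: {status!r}")
    return f"{status} | {test_path} | {message}"


def parse_failure(line):
    """Split a failure line into its three tokens, or return None.

    Only the first two pipes are treated as delimiters, so the failure
    message itself may contain further pipe symbols.
    """
    tokens = [token.strip() for token in line.split("|", 2)]
    if len(tokens) != 3 or tokens[0] not in STATUSES:
        return None
    if not tokens[1] or not tokens[2]:
        return None
    return {"status": tokens[0], "test": tokens[1], "message": tokens[2]}


# Hypothetical test path and message, for illustration only.
line = format_failure(
    "TEST-UNEXPECTED-FAIL",
    "dom/tests/mochitest/test_example.html",
    "Test timed out",
)
print(line)
print(parse_failure(line))
print(parse_failure("Automation Error: run timed out"))
```

The last call returns None because the line names neither a status token nor the test in which the timeout occurred, which is the kind of output the timeout requirement above asks harnesses to avoid.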

7) Low intermittent failure rate

  • A high failure rate:
    • Causes unnecessary sheriff workload.
    • Affects the ability to sheriff the trees as a whole, particularly during times of heavy coalescing.
    • Undermines devs' confidence in the platform/test-suite, which, as demonstrated by Firefox for Android, permanently affects their willingness to believe any future failures, even once the intermittent-failure rate is lowered.
  • A mozilla-central push results in ~400 jobs. The typical OrangeFactor across all trunk trees is normally (excluding the recent spike) 3-4, ie: a failure rate of ~1%.
  • Therefore, as a rough guide, a new platform/testsuite must have at most a 5% per job failure rate initially, and ideally <1% longer term.
  • However, sheriffs will make the final determination of whether a job type has too many intermittent failures. This will be based on a combination of factors including failure rate, length of time the failures have been occurring, owner interest in fixing them & whether TBPL is able to make bug suggestions.
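
The arithmetic behind these figures can be checked with a short sketch. It treats OrangeFactor as the average number of failed jobs per push and uses the approximate figure of 400 jobs per push from the text; the function names are hypothetical.

```python
JOBS_PER_PUSH = 400  # approximate figure quoted above


def failure_rate(orange_factor, jobs=JOBS_PER_PUSH):
    """Per job failure rate implied by an average number of failures per push."""
    return orange_factor / jobs


def failures_per_push(rate, jobs=JOBS_PER_PUSH):
    """Average number of failed jobs per push implied by a per job failure rate."""
    return rate * jobs


for orange_factor in (3, 4):
    rate = failure_rate(orange_factor)
    print(f"OrangeFactor {orange_factor}: {rate:.2%} per job")

# If every job type failed at the 5% ceiling, each push would carry about
# 20 failures, several times the typical OrangeFactor.
print(round(failures_per_push(0.05), 2))
```

An OrangeFactor of 3-4 works out to 0.75-1% per job, which is where the "~1%" figure comes from, and it shows why 5% is only an initial ceiling.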

8) Must avoid patterns known to cause non deterministic failures

  • Must avoid pulling the tip of external repositories as part of the build - since landings there can cause non-obvious failures (legacy exception being gaia). If an external repository is absolutely necessary, instead reference the desired changeset from a manifest in mozilla-central (like talos does).
  • Must not rely on resources outside of the build network:
    • Since these will cause failures when the external site is unavailable, as well as impacting end to end times & adding noise to performance tests.
    • eg: Emulator/driver binaries direct from a vendor's site, package downloads from PyPi or page assets for unit/performance tests.
  • Must not contain time bombs, e.g. tests that will fail after a certain date or when run at certain times.
  • See the guide on avoiding intermittent failures.

9) Supports the disabling of individual tests

  • It must be possible for sheriffs to disable an individual test per platform or entirely, by either annotating the test or editing a manifest/moz.build/Makefile. (See also requirement #10).

10) Has sufficient documentation

11) Easy for a dev to run locally

  • Supported by mach (if appropriate).
  • Ideally part of mozilla-central (legacy exceptions being Talos, gaia).

Requesting changes in visibility

  • Please file a bug using this template, so that changes in visibility are more discoverable (vs IRC or asking as part of a bug in another product/component).

My platform/test-suite does not meet the requirements, what now?

  • Your platform/test-suite will still be run, just not shown on the default view. This model has worked well for many projects/build types (eg jetpack, xulrunner, spidermonkey).
  • To see it, append '&showall=1' to the URL (bug 748833 will add a checkbox for this to the TBPL UI).
  • To filter the jobs displayed, under the 'Filters' menu use the 'job name' field (which supports regexp).
  • eg: to see both ASan & Valgrind jobs on mozilla-central (neither of which are shown by default), use: https://tbpl.mozilla.org/?showall=1&jobname=(asan|valgrind)
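
For anyone generating such links from a script, here is a minimal sketch that builds the URL above and previews which job names a regexp would select. The `tbpl_url` and `matching_jobs` helpers and the job names are hypothetical, and matching case-insensitively anywhere in the name is an assumption about how the 'job name' filter behaves.

```python
import re
from urllib.parse import urlencode

TBPL = "https://tbpl.mozilla.org/"


def tbpl_url(job_regexp, show_all=True):
    """Build a TBPL URL that filters jobs by name."""
    params = []
    if show_all:
        params.append(("showall", "1"))
    params.append(("jobname", job_regexp))
    # Keep "(", ")" and "|" unescaped so the link matches the form shown above.
    return TBPL + "?" + urlencode(params, safe="()|")


def matching_jobs(job_regexp, job_names):
    """Preview locally which job names the regexp would select."""
    pattern = re.compile(job_regexp, re.IGNORECASE)
    return [name for name in job_names if pattern.search(name)]


# Hypothetical job names, for illustration only.
jobs = [
    "Linux x86-64 mozilla-central asan build",
    "Linux mozilla-central valgrind",
    "WINNT 6.2 mozilla-central opt test mochitest-1",
]
url = tbpl_url("(asan|valgrind)")
selected = matching_jobs("(asan|valgrind)", jobs)
print(url)
print(selected)
```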

The future

  • Planned improvements to our tooling will likely mean that some of these requirements can be relaxed in the future, as well as making it easier for maintainers of non-default-view job types to track their success/failure without having to monitor TBPL continuously.
  • Planned features for the successor to TBPL (Auto-tools/Projects/TBPL2) include:
    • Multiple dashboards/views for different use-cases/teams (giving us more flexibility than just "default view" or "&showall=1").
    • Opt-in notifications (email, IRC, dashboard, ...?) of failures for desired job types (see proposal in bug 851061).
  • Auto-tools/Projects/Bisect_in_the_cloud will allow sheriffs to more easily narrow regression ranges for job types that do not run on every push, making it more viable to accept them into certain views/dashboards.