Sheriffing/Job Visibility Policy: Difference between revisions

Tweaks from dev.tree-management discussions
m (→‎4) Scheduled on every push: Clarify coalescing)
(Tweaks from dev.tree-management discussions)
Line 2: Line 2:
== Requirements for being shown in the default TBPL view ==
== Requirements for being shown in the default TBPL view ==


This page was created to clarify the existing requirements that a platform/test-suite has to meet, before its jobs can be shown in the default [https://tbpl.mozilla.org/ TBPL] view. To propose changes to this policy, please speak to the sheriffs and/or post to [https://lists.mozilla.org/listinfo/dev-platform dev.platform].
This page was created to clarify the existing requirements that a platform/test-suite has to meet, before its jobs can be shown in the default [https://tbpl.mozilla.org/ TBPL] view. Common sense will apply in cases where some of the requirements are not applicable for a particular platform/build/test type.
 
To propose changes to this policy, please speak to the sheriffs and/or post to [https://lists.mozilla.org/listinfo/dev-platform dev.platform].


=== 1) Has an active owner ===
=== 1) Has an active owner ===
Line 30: Line 32:
=== 5) Easily run on try server ===
=== 5) Easily run on try server ===
* Otherwise developers who have had their landing backed out for breaking the job type will be unable to easily debug/fix the failures, particularly if they only reproduce on our infrastructure.
* Otherwise developers who have had their landing backed out for breaking the job type will be unable to easily debug/fix the failures, particularly if they only reproduce on our infrastructure.
* Developers should not be expected to guess try chooser options, so http://trychooser.pub.build.mozilla.org/ must have been updated.
* Developers should not be expected to guess try chooser options, so http://trychooser.pub.build.mozilla.org/ should be updated if appropriate.


=== 6) Outputs failures in a TBPL-starrable format ===
=== 6) Outputs failures in a TBPL-starrable format ===
* Failures must appear in the TBPL annotated summary (ie: use the standard TEST-UNEXPECTED-{FAIL,PASS}, PROCESS-CRASH, ... format), otherwise sheriffs & devs have to open the full logs.
* It is highly recommended that new test harnesses do not reinvent the wheel and instead use parts of MozBase (eg: mozcrash) if at all possible - speak to the A-Team for more info.
* Failures must output the test names correctly, so TBPL can perform the BzAPI intermittent-failure searches for bug suggestions.
* Failures must appear in the TBPL annotated summary (ie: they must match the [https://hg.mozilla.org/webtools/tbpl/file/tip/php/inc/GeneralErrorFilter.php log parsing regexp]), otherwise the full log will have to be opened for every failure.
* Exceptions & timeouts must be caught and handled with a TBPL compatible failure message.
* Failure output must be in the format expected by TBPL's [https://hg.mozilla.org/webtools/tbpl/file/tip/php/inc/AnnotatedSummaryGenerator.php bug suggestion generator] (otherwise sheriffs have to manually search Bugzilla when starring intermittent failures):
* The sheriffs will be happy to help advise how to meet this requirement.
** For in-tree/product issues (eg: test failures, crashes):
*** Pipe symbol used as delimiter.
*** 1st token: One of {TEST-UNEXPECTED-FAIL, TEST-UNEXPECTED-PASS, PROCESS-CRASH}.
*** 2nd token: A unique test name/filepath (not a generic test loader that runs 100s of other test files, since otherwise bug suggestions will return too many results).
*** 3rd token: The specific failure message (eg: the test part that failed, the top frame of a crash or the leaked objects list for a leak).
** For non test-specific issues (eg: infra/automation/harness):
*** TBPL falls back to searching Bugzilla for the entire failure line (excluding mozharness logging prefix), so it should be both unique to that failure type & repeatable (ie: no use of process IDs for which there will rarely be a repeat match against a bug summary).
** Exceptions & timeouts must be handled with appropriate log output (eg: the failure line must state in which test the timeout occurred, not just that the entire run has timed out).
* The sheriffs will be happy to advise regarding TBPL log output compatibility.
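
As an illustration of the pipe-delimited format described above, here is a minimal Python sketch that splits a failure line into its three tokens. It is not TBPL's actual parser (that lives in the PHP files linked above); the function name <code>parse_failure_line</code> and the sample test path are hypothetical.

```python
# Hypothetical helper, not part of TBPL: checks that a log line follows the
# "<status> | <test name> | <failure message>" layout described in this section.
KNOWN_STATUSES = ("TEST-UNEXPECTED-FAIL", "TEST-UNEXPECTED-PASS", "PROCESS-CRASH")


def parse_failure_line(line):
    """Return (status, test_name, message), or None if the line does not match."""
    # Split on the first two pipes only, so pipes inside the message survive.
    parts = [token.strip() for token in line.split("|", 2)]
    if len(parts) != 3:
        return None
    status, test_name, message = parts
    if status not in KNOWN_STATUSES:
        return None
    if not test_name or not message:
        return None
    return status, test_name, message


if __name__ == "__main__":
    # Sample line with an invented test path, for illustration only.
    sample = "TEST-UNEXPECTED-FAIL | dom/tests/test_example.html | expected 1, got 2"
    print(parse_failure_line(sample))
```

A harness author could run a check like this over sample output to confirm that every failure line carries a specific test name and message before asking the sheriffs to review it.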


=== 7) Per job intermittent failure rate of less than 5% ===
=== 7) Low intermittent failure rate ===
* A high failure rate:
* A high failure rate:
** Causes unnecessary sheriff workload.
** Causes unnecessary sheriff workload.
** Affects the ability to sheriff the trees as a whole, particularly during times of heavy coalescing.
** Affects the ability to sheriff the trees as a whole, particularly during times of heavy coalescing.
** Undermines devs' confidence in the platform/test-suite - which, as demonstrated by Firefox for Android, permanently affects their willingness to believe any future failures, even once the intermittent-failure rate is lowered.
** Undermines devs' confidence in the platform/test-suite - which, as demonstrated by Firefox for Android, permanently affects their willingness to believe any future failures, even once the intermittent-failure rate is lowered.
* A mozilla-central push results in ~400 jobs. A 5% failure rate would mean 20 failures on that push - ie: an OrangeFactor of 20. The typical OrangeFactor across all trunk trees is normally 3-5, so a 5% failure rate is extremely generous.
* A mozilla-central push results in ~400 jobs. The typical OrangeFactor across all trunk trees is normally (excluding the recent spike) 3-4, ie: a failure rate of ~1%.
* Therefore, as a rough guide, a new platform/test-suite must have at most a 5% failure rate initially, and ideally <1% longer term.
* However, sheriffs will make the final determination of whether a job type has too many intermittent failures. This will be based on a combination of factors including failure rate, length of time the failures have been occurring, owner interest in fixing them & whether TBPL is able to make bug suggestions.
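
The arithmetic behind these figures can be sketched in a few lines of Python. This assumes, as the text above does, roughly 400 jobs per mozilla-central push and that OrangeFactor counts failures per push; the function names are illustrative only.

```python
# Rough figures taken from the text above; both are approximations.
JOBS_PER_PUSH = 400


def expected_failures_per_push(failure_rate, jobs=JOBS_PER_PUSH):
    """Failures expected on one push for a given per-job failure rate."""
    return failure_rate * jobs


def failure_rate_from_orange_factor(orange_factor, jobs=JOBS_PER_PUSH):
    """Per-job failure rate implied by an OrangeFactor (failures per push)."""
    return orange_factor / jobs


if __name__ == "__main__":
    # A 5% per-job failure rate corresponds to about 20 failures per push.
    print(expected_failures_per_push(0.05))
    # An OrangeFactor of 3-4 corresponds to roughly 0.75%-1% per job.
    for orange_factor in (3, 4):
        print(orange_factor, failure_rate_from_orange_factor(orange_factor))
```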


=== 8) Must avoid patterns known to cause non deterministic failures ===
=== 8) Must avoid patterns known to cause non deterministic failures ===
* Must avoid pulling the tip of external repositories as part of the build - since landings there can cause non-obvious failures (legacy exception being gaia). If an external repository is absolutely necessary, instead reference the desired changeset from a manifest in mozilla-central (like talos does).
* Must not rely on resources outside of the build network:
* Must not rely on resources outside of the build network:
** Since these will cause failures when the external site is unavailable, as well as impacting end to end times & adding noise to performance tests.
** Since these will cause failures when the external site is unavailable, as well as impacting end to end times & adding noise to performance tests.
Line 64: Line 77:


=== 11) Easy for a dev to run locally ===
=== 11) Easy for a dev to run locally ===
* Is supported by mach.
* Supported by mach (if appropriate).
* Ideally part of mozilla-central (legacy exception being Talos).  
* Ideally part of mozilla-central (legacy exceptions being Talos, gaia).  


== Requesting changes in visibility ==
== Requesting changes in visibility ==
Line 73: Line 86:
* Your platform/test-suite will still be run, just not shown on the default view. This model has worked well for many projects/build types (eg: jetpack, xulrunner, spidermonkey).
* Your platform/test-suite will still be run, just not shown on the default view. This model has worked well for many projects/build types (eg: jetpack, xulrunner, spidermonkey).
* To see it, append '&showall=1' to the URL ({{bug|748833}} will add a checkbox for this to the TBPL UI).
* To see it, append '&showall=1' to the URL ({{bug|748833}} will add a checkbox for this to the TBPL UI).
* To filter the jobs displayed, under the 'Filters' menu use the 'job name' field (which supports regex).
* To filter the jobs displayed, under the 'Filters' menu use the 'job name' field (which supports regexp).
* eg: to see both ASan & Valgrind jobs on mozilla-central (neither of which are shown by default), use: [https://tbpl.mozilla.org/?showall=1&jobname=(asan|valgrind) https://tbpl.mozilla.org/?showall=1&jobname=(asan|valgrind)]
* eg: to see both ASan & Valgrind jobs on mozilla-central (neither of which are shown by default), use: [https://tbpl.mozilla.org/?showall=1&jobname=(asan|valgrind) https://tbpl.mozilla.org/?showall=1&jobname=(asan|valgrind)]
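
For anyone generating such links from a script, the following Python sketch builds the same URL from a list of job-name fragments. The helper name <code>tbpl_filter_url</code> is hypothetical; only the <code>showall</code> and <code>jobname</code> parameters come from the example above, and the fragments are assumed to be plain text with no regexp metacharacters.

```python
from urllib.parse import urlencode

TBPL_BASE_URL = "https://tbpl.mozilla.org/"


def tbpl_filter_url(job_name_fragments):
    """Build a TBPL URL that shows hidden jobs matching any of the fragments."""
    # Join the fragments into a regexp alternation, eg: (asan|valgrind).
    pattern = "(" + "|".join(job_name_fragments) + ")"
    # Keep the regexp characters readable instead of percent-encoding them,
    # to match the form of the link given above.
    query = urlencode({"showall": 1, "jobname": pattern}, safe="()|")
    return TBPL_BASE_URL + "?" + query


if __name__ == "__main__":
    print(tbpl_filter_url(["asan", "valgrind"]))
```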


== The future ==
== The future ==
* Planned improvements to our tooling will likely mean that some of these requirements can be relaxed in the future, as well as making it easier for maintainers of non-default-view job types to track their success/failure without having to monitor TBPL continuously.
* Planned improvements to our tooling will likely mean that some of these requirements can be relaxed in the future, as well as making it easier for maintainers of non-default-view job types to track their success/failure without having to monitor TBPL continuously.
* The successor to TBPL ([[Auto-tools/Projects/TBPL2]]) will support:
* Planned features for the successor to TBPL ([[Auto-tools/Projects/TBPL2]]) include:
** Multiple dashboards/views for different use cases/teams (giving us more flexibility than just "default view" or "&showall=1").
** Multiple dashboards/views for different use-cases/teams (giving us more flexibility than just "default view" or "&showall=1").
** Opt-in notifications (email, IRC, dashboard, ...?) of failures for desired job types (see proposal in {{bug|851061}}).
** Opt-in notifications (email, IRC, dashboard, ...?) of failures for desired job types (see proposal in {{bug|851061}}).
* [[Auto-tools/Projects/Bisect_in_the_cloud]] will allow sheriffs to more easily narrow regression ranges for job types that do not run on every push, making it more viable to accept them into certain views/dashboards.
* [[Auto-tools/Projects/Bisect_in_the_cloud]] will allow sheriffs to more easily narrow regression ranges for job types that do not run on every push, making it more viable to accept them into certain views/dashboards.