Sheriffing/Job Visibility Policy: Difference between revisions

Sheriffing/Job Visibility Policy (view source)

Revision as of 16:18, 1 April 2013

2,116 bytes added, 1 April 2013

Tweaks from dev.tree-management discussions

Edmorley (canmove, Confirmed users; 1,126 edits)

@@ Line 2: / Line 2: @@
 == Requirements for being shown in the default TBPL view ==
-This page was created to clarify the existing requirements that a platform/test-suite has to meet, before its jobs can be shown in the default [https://tbpl.mozilla.org/ TBPL] view. To propose changes to this policy, please speak to the sheriffs and/or post to [https://lists.mozilla.org/listinfo/dev-platform dev.platform].
+This page was created to clarify the existing requirements that a platform/test-suite has to meet, before its jobs can be shown in the default [https://tbpl.mozilla.org/ TBPL] view. Common sense will apply in cases where some of the requirements are not applicable for a particular platform/build/test type.
+To propose changes to this policy, please speak to the sheriffs and/or post to [https://lists.mozilla.org/listinfo/dev-platform dev.platform].
 === 1) Has an active owner ===
@@ Line 30: / Line 32: @@
 === 5) Easily run on try server ===
 * Otherwise developers who have had their landing backed out for breaking the job type will be unable to easily debug/fix the failures, particularly if they only reproduce on our infrastructure.
-* Developers should not be expected to guess try chooser options, so http://trychooser.pub.build.mozilla.org/ must have been updated.
+* Developers should not be expected to guess try chooser options, so http://trychooser.pub.build.mozilla.org/ should be updated if appropriate.
 === 6) Outputs failures in a TBPL-starrable format ===
-* Failures must appear in the TBPL annotated summary (ie: use the standard TEST-UNEXPECTED-{FAIL,PASS}, PROCESS-CRASH, ... format), otherwise sheriffs & devs have to open the full logs.
+* It is highly recommended that new test harnesses do not reinvent the wheel and instead use parts of MozBase (eg: mozcrash) if at all possible - speak to the A-Team for more info.
-* Failures must output the test names correctly, so TBPL can perform the BzAPI intermittent-failure searches for bug suggestions.
+* Failures must appear in the TBPL annotated summary (ie: they must match the [https://hg.mozilla.org/webtools/tbpl/file/tip/php/inc/GeneralErrorFilter.php log parsing regexp]), otherwise the full log will have to be opened for every failure.
-* Exceptions & timeouts must be caught and handled with a TBPL compatible failure message.
+* Failure output must be in the format expected by TBPL's [https://hg.mozilla.org/webtools/tbpl/file/tip/php/inc/AnnotatedSummaryGenerator.php bug suggestion generator] (otherwise sheriffs have to manually search Bugzilla when starring intermittent failures):
-* The sheriffs will be happy to help advise how to meet this requirement.
+** For in-tree/product issues (eg: test failures, crashes):
+*** Pipe symbol used as delimiter.
+*** 1st token: One of {TEST-UNEXPECTED-FAIL, TEST-UNEXPECTED-PASS, PROCESS-CRASH}.
+*** 2nd token: A unique test name/filepath (not a generic test loader that runs 100s of other test files, since otherwise bug suggestions will return too many results).
+*** 3rd token: The specific failure message (eg: the test part that failed, the top frame of a crash or the leaked objects list for a leak).
+** For non test-specific issues (eg: infra/automation/harness):
+*** TBPL falls back to searching Bugzilla for the entire failure line (excluding mozharness logging prefix), so it should be both unique to that failure type & repeatable (ie: no use of process IDs for which there will rarely be a repeat match against a bug summary).
+** Exceptions & timeouts must be handled with appropriate log output (eg: the failure line must state in which test the timeout occurred, not just that the entire run has timed out).
+* The sheriffs will be happy to advise regarding TBPL log output compatibility.
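As a rough illustration of the three-token format described in requirement 6, the following Python sketch builds and splits a pipe-delimited failure line. It is not TBPL's actual parser (which lives in the PHP files linked above). The helper names, the spaces around the pipe and the example test path are assumptions made for illustration only.

```python
# Status tokens listed in the policy for in-tree/product issues.
STATUS_TOKENS = {"TEST-UNEXPECTED-FAIL", "TEST-UNEXPECTED-PASS", "PROCESS-CRASH"}


def format_failure_line(status, test_name, message):
    """Build a pipe-delimited failure line (hypothetical helper).

    Assumption: tokens are joined with " | "; the policy only says that the
    pipe symbol is the delimiter.
    """
    if status not in STATUS_TOKENS:
        raise ValueError("unknown status token: %r" % status)
    return " | ".join([status, test_name, message])


def parse_failure_line(line):
    """Return (status, test_name, message), or None if the line does not fit.

    Only the first two pipes are treated as delimiters, so the failure
    message itself may contain further pipe symbols.
    """
    parts = [part.strip() for part in line.split("|", 2)]
    if len(parts) != 3 or parts[0] not in STATUS_TOKENS:
        return None
    return tuple(parts)


if __name__ == "__main__":
    # The test path below is made up for the example.
    example = format_failure_line(
        "TEST-UNEXPECTED-FAIL", "dom/tests/test_example.html", "Test timed out"
    )
    print(example)
    print(parse_failure_line(example))
```

A harness that emits the unique test path as the 2nd token (rather than the name of a generic loader) gives the bug suggestion search something specific to match against.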
-=== 7) Per job intermittent failure rate of less than 5% ===
+=== 7) Low intermittent failure rate ===
 * A high failure rate:
 ** Causes unnecessary sheriff workload.
 ** Affects the ability to sheriff the trees as a whole, particularly during times of heavy coalescing.
 ** Undermines devs' confidence in the platform/test-suite, which, as demonstrated by Firefox for Android, permanently affects their willingness to believe any future failures, even once the intermittent-failure rate is lowered.
-* A mozilla-central push results in ~400 jobs. A 5% failure rate would mean 20 failures on that push - ie: an OrangeFactor of 20. The typical OrangeFactor across all trunk trees is normally 3-5, so a 5% failure rate is extremely generous.
+* A mozilla-central push results in ~400 jobs. The typical OrangeFactor across all trunk trees is normally (excluding the recent spike) 3-4, ie: a failure rate of ~1%.
+* Therefore, as a rough guide, a new platform/test-suite must have at most a 5% failure rate initially, and ideally <1% longer term.
+* However, sheriffs will make the final determination of whether a job type has too many intermittent failures. This will be based on a combination of factors including failure rate, length of time the failures have been occurring, owner interest in fixing them & whether TBPL is able to make bug suggestions.
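The arithmetic behind the figures in requirement 7 can be checked with a short Python sketch. It assumes that OrangeFactor means the average number of failures per push, and that every push runs the ~400 jobs quoted above. The function names are made up for illustration.

```python
JOBS_PER_PUSH = 400  # approximate figure quoted in the policy


def failures_per_push(failure_rate, jobs=JOBS_PER_PUSH):
    """Expected failures on one push for a given per-job failure rate."""
    return failure_rate * jobs


def rate_from_orange_factor(orange_factor, jobs=JOBS_PER_PUSH):
    """Per-job failure rate implied by an OrangeFactor (failures per push)."""
    return orange_factor / jobs


if __name__ == "__main__":
    # A 5% per-job failure rate across ~400 jobs.
    print(failures_per_push(0.05))
    # The typical trunk OrangeFactor range of 3-4.
    print(rate_from_orange_factor(3), rate_from_orange_factor(4))
```

Under these assumptions an OrangeFactor of 3-4 corresponds to a per-job failure rate of 0.75-1%, and a 5% rate would mean about 20 failures on a single push, which is why 5% is only acceptable as an initial ceiling.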
 === 8) Must avoid patterns known to cause non deterministic failures ===
+* Must avoid pulling the tip of external repositories as part of the build - since landings there can cause non-obvious failures (legacy exception being gaia). If an external repository is absolutely necessary, instead reference the desired changeset from a manifest in mozilla-central (like talos does).
 * Must not rely on resources outside of the build network:
 ** Since these will cause failures when the external site is unavailable, as well as impacting end to end times & adding noise to performance tests.
@@ Line 64: / Line 77: @@
 === 11) Easy for a dev to run locally ===
-* Is supported by mach.
+* Supported by mach (if appropriate).
-* Ideally part of mozilla-central (legacy exception being Talos).
+* Ideally part of mozilla-central (legacy exceptions being Talos, gaia).
 == Requesting changes in visibility ==
@@ Line 73: / Line 86: @@
 * Your platform/test-suite will still be run, just not shown on the default view. This model has worked well for many projects/build types (eg jetpack, xulrunner, spidermonkey).
 * To see it, append '&showall=1' to the URL ({{bug|748833}} will add a checkbox for this to the TBPL UI).
-* To filter the jobs displayed, under the 'Filters' menu use the 'job name' field (which supports regex).
+* To filter the jobs displayed, under the 'Filters' menu use the 'job name' field (which supports regexp).
 * eg: to see both ASan & Valgrind jobs on mozilla-central (neither of which are shown by default), use: [https://tbpl.mozilla.org/?showall=1&jobname=(asan|valgrind) https://tbpl.mozilla.org/?showall=1&jobname=(asan|valgrind)]
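For anyone scripting links to non-default job types, the URL above can be assembled programmatically. The Python sketch below uses only the two query parameters shown in this section (showall and jobname). The helper name is hypothetical, and it is an assumption that TBPL accepts the percent-encoded form of the regexp as well as the literal form shown in the link.

```python
from urllib.parse import urlencode

TBPL_BASE = "https://tbpl.mozilla.org/"


def tbpl_url(jobname_regexp, showall=True):
    """Build a TBPL URL that filters jobs by name (hypothetical helper)."""
    params = {}
    if showall:
        # Needed to see job types that are hidden from the default view.
        params["showall"] = "1"
    params["jobname"] = jobname_regexp
    # urlencode percent-encodes the parentheses and the pipe in the regexp.
    return TBPL_BASE + "?" + urlencode(params)


if __name__ == "__main__":
    print(tbpl_url("(asan|valgrind)"))
```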
 == The future ==
 * Planned improvements to our tooling will likely mean that some of these requirements can be relaxed in the future, as well as making it easier for maintainers of non-default-view job types to track their success/failure without having to monitor TBPL continuously.
-* The successor to TBPL ([[Auto-tools/Projects/TBPL2]]) will support:
+* Planned features for the successor to TBPL ([[Auto-tools/Projects/TBPL2]]) include:
-** Multiple dashboards/views for different use cases/teams (giving us more flexibility than just "default view" or "&showall=1").
+** Multiple dashboards/views for different use-cases/teams (giving us more flexibility than just "default view" or "&showall=1").
 ** Opt-in notifications (email, IRC, dashboard, ...?) of failures for desired job types (see proposal in {{bug|851061}}).
 * [[Auto-tools/Projects/Bisect_in_the_cloud]] will allow sheriffs to more easily narrow regression ranges for job types that do not run on every push, making it more viable to accept them into certain views/dashboards.
