Auto-tools/Projects/Stockwell/Ideas

From MozillaWiki
  • Actions that might reduce OF (Orange Factor)
    • ongoing triage of frequent intermittent failures
      • check if test recently modified
      • find regressing changeset
      • find an owner
      • identify and call out platform correlations, interesting logs, or other information that might encourage resolution of the bug
    • make it easier to find regressions
    • longer ActiveData history
    • fail-per-run stats
      • OF tracks failures and pushes but not successful test runs
      • ActiveData provides failures-per-test-run, but lacks long-term history
    • find long-term neglected intermittent failures
      • identify ongoing bugs that don't exceed the 30/week threshold but contribute to high OF over time
    • publicize best practices for tests
    • race reduction strategy
      • develop and publicize more best practices
      • lint rules?
      • logging/diagnostics to make it easier to recognize and debug races in tests
    • enable ESLint on more test files - bug 1357557
    • test verification - bug 1357513
      • test verification of changed files on check-in
      • test verification for mach
      • test verification for backfill
    • easier backouts
      • backout button on Treeherder
    • make intermittent failure count a searchable parameter in Bugzilla
      • ideally, failures could be searched over a range of dates, but simple 1-day and 7-day counts would already be valuable
      • the actual failure rate (failures/run) would be useful too
    • component intermittent triage dashboard
      • provide a tool for component owners to find intermittent failures requiring attention
    • ownership clarity for meta-bugs
      • who is responsible for intermittent failures on browser startup and shutdown, leaks, hangs, and other failures not closely associated with an identifiable subset of tests?
    • strategy for meta-bugs
      • ...since they usually cannot be resolved by skipping any single test
    • avoid failures from long-running tests/jobs
      • monitor test/job durations to identify those that are approaching timeout thresholds
    • autoclassification
      • to reduce impact of intermittents
    • run fewer tests
      • could we reduce redundancy, consolidate tests?
      • do we need to run everything on all platforms?
    • quarantine new/modified tests
    • put intermittently failing tests in "purgatory"
    • evolve triage/skipping policy
    • track disabled test counts
      • ideally by component/owner
      • ...to check that we haven't disabled the world for intermittent failures
    • improve one-click loaner
      • ...easier to reproduce/fix failures
    • get more people fixing bugs
      • get more resources?
      • policy/nudges?
    • reduce infrastructure failures
      • ...somehow, since they often have a big impact
    • reproduce failures in rr
    • provide automatic feedback for try pushes:
      • did my try push cause unexpected failures? ...an autoclassify issue?
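
Two of the ideas above, the failures/run rate and the search for long-term neglected intermittents, can be illustrated with a minimal sketch. It assumes weekly failure counts per bug are already available as plain Python data. The function names, the `min_total` cut-off, and the sample bug ids are hypothetical and do not come from OrangeFactor, ActiveData, or Bugzilla. Only the 30/week threshold is taken from the list.

```python
WEEKLY_THRESHOLD = 30  # the 30/week triage threshold mentioned in the list


def failure_rate(failures, runs):
    """Return failures per run, or None when there were no runs."""
    if runs <= 0:
        return None
    return failures / runs


def neglected_bugs(weekly_counts, min_total=100, threshold=WEEKLY_THRESHOLD):
    """Find bugs that never exceed the weekly threshold but add up over time.

    weekly_counts maps a bug id to a list of failure counts, one per week.
    min_total is a hypothetical cut-off for "contributes a lot over time".
    """
    flagged = []
    for bug, counts in weekly_counts.items():
        total = sum(counts)
        # Keep bugs that stay at or below the threshold every week
        # yet still accumulate at least min_total failures overall.
        if counts and max(counts) <= threshold and total >= min_total:
            flagged.append((bug, total))
    # Largest cumulative contributors first.
    return sorted(flagged, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Made-up sample data for illustration only.
    history = {
        "bug-A": [12, 15, 9, 20, 18, 14, 16, 11],  # steady, never above 30
        "bug-B": [45, 2, 0, 0, 0, 0, 0, 0],        # exceeded the threshold once
        "bug-C": [1, 0, 2, 0, 1, 0, 0, 1],         # rare
    }
    print(neglected_bugs(history))
    print(failure_rate(5, 200))
```

With this sample data only "bug-A" is flagged: it never crosses the weekly threshold, so it would escape threshold-based triage, yet it accumulates more failures than the bug that spiked once. A real tool would need both failure counts and successful run counts, which is the gap the fail-per-run item describes.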