
War on Orange Notes

Easing pain of current process (things to do now)

  • make it faster to download build logs
  • make the Tinderbox pushlog robot less spammy in bugs
  • make tbpl summary view and starring faster and less prone to timeouts
  • 5.5 hour turnaround time - can we just expand the hardware? Where would we put it if we did?
  • fix the brief error parser in tinderbox & maybe unify it with tbplrobot (paraphrased correctly?) - see the sketch after this list
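
A minimal sketch of what a unified brief error parser might look like, in Python. The patterns below are illustrative stand-ins; the real rules would have to come from reconciling tinderbox's brief-log parser with tbplrobot's.

  import re

  # Illustrative failure patterns; the real tinderbox brief-log rules and
  # tbplrobot's patterns would need to be reconciled. These are placeholders.
  FAILURE_PATTERNS = [
      re.compile(r"^TEST-UNEXPECTED-(FAIL|PASS|TIMEOUT)"),
      re.compile(r"^PROCESS-CRASH"),
      re.compile(r"^command timed out:"),
  ]

  def brief_errors(log_lines):
      """Yield only the lines a brief error summary should keep."""
      for line in log_lines:
          if any(p.search(line) for p in FAILURE_PATTERNS):
              yield line.rstrip("\n")

  if __name__ == "__main__":
      import sys
      with open(sys.argv[1], encoding="utf-8", errors="replace") as log:
          for err in brief_errors(log):
              print(err)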

Education & Accountability

  • checklist (or automatic code analysis) of things that often go wrong - see the sketch after this list
  • Have people be responsible for new intermittent oranges - an orange tag-team
  • Close the tree when an orange occurs more than N times
  • MDC page on how to write more robust tests
  • MDC page on best practices for debugging/analyzing intermittent tests
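
For the "automatic code analysis" idea, even a toy lint pass could flag the usual suspects in test files before they land. Both checks below are hypothetical examples, not an agreed checklist.

  import re
  import sys

  # Hypothetical checks for patterns that tend to cause intermittent oranges.
  CHECKS = [
      (re.compile(r"setTimeout\s*\(\s*[^,]+,\s*[1-9]\d*\s*\)"),
       "timed wait: prefer an event or explicit completion callback"),
      (re.compile(r"\bsynthesizeKey\b|\bsynthesizeMouse\b"),
       "input synthesis: make sure the window actually has focus first"),
  ]

  def lint_test(path):
      with open(path, encoding="utf-8", errors="replace") as f:
          for lineno, line in enumerate(f, 1):
              for pattern, advice in CHECKS:
                  if pattern.search(line):
                      print(f"{path}:{lineno}: {advice}")

  if __name__ == "__main__":
      for p in sys.argv[1:]:
          lint_test(p)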

Better Diagnostics

  • Topfails::highlight changes in failure frequency for a test - see the sketch after this list
  • Topfails::track failures in tests marked as random (want to know if the failure rate changes drastically)
  • Topfails::classify failures better (what does this mean? Is it group by failure mode? focus, timeout, crash, etc?)
  • Topfails::ignore hidden tinderboxes (doesn't it already do this?)
  • permanent R & R box, continuously running, VNC into it
  • ability to take full-screen snapshots when focus-based tests fail (depends on being able to tag tests as relying on focus, or do we do this on every hang?)
  • Link to previous good log on which the test in question passed
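
One way to highlight frequency changes: compare a test's recent failure rate against its longer-term baseline and flag large jumps. The per-day counts stand in for whatever Topfails actually records, and the window sizes and threshold are arbitrary knobs.

  # Sketch: flag a test whose recent failure rate departs from its baseline.
  def flag_rate_change(daily_failures, recent_days=7, baseline_days=28, factor=3.0):
      """daily_failures: per-day failure counts, oldest first."""
      recent = daily_failures[-recent_days:]
      baseline = daily_failures[-(recent_days + baseline_days):-recent_days]
      recent_rate = sum(recent) / max(len(recent), 1)
      baseline_rate = sum(baseline) / max(len(baseline), 1)
      flagged = recent_rate > factor * max(baseline_rate, 0.1)
      return recent_rate, baseline_rate, flagged

  counts = [0, 1, 0, 0, 1, 0, 0] * 4 + [2, 3, 1, 4, 2, 3, 2]
  print(flag_rate_change(counts))  # ~2.4/day vs ~0.29/day baseline -> flagged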

Measurements

  • Need to be confident enough in tracking our failures that we feel OK with turning off tbpl robot's bug spam (or bringing it down to something manageable)
  • What is the platform distribution of intermittent oranges?
  • Can we determine how frequently an intermittent orange occurs? On a given push you have an xx% chance that a test goes orange; aggregating over the entire set of oranges should give us that number, and monitoring it can show whether our efforts have made an impact - see the sketch after this list
  • Of the top 20 intermittent failures, what is the likelihood that any of them will happen on a given checkin? This tells us whether we've succeeded in making a failure more or less random while working on it
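
The per-push number can be estimated with simple arithmetic: if failure i hits a given push with probability p_i (failures seen divided by pushes in the window), and we assume failures strike independently, the chance a push shows at least one orange is 1 - product(1 - p_i). Independence is a simplification (failures correlate by platform and suite), and the rates below are made up.

  # Sketch of the per-push orange probability described above.
  def orange_probability(per_failure_rates):
      p_clean = 1.0
      for p in per_failure_rates:
          p_clean *= (1.0 - p)
      return 1.0 - p_clean

  # e.g. top 20 failures, each seen on 1-3% of pushes in the window
  rates = [0.03, 0.02, 0.02, 0.015, 0.01] + [0.01] * 15
  print(f"chance a push shows at least one orange: {orange_probability(rates):.0%}")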

Prioritization Schemes

  • below a one-month threshold, we tend not to care about the orange as much
  • prioritize on frecency (frequency weighted by recency) - see the sketch after this list
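
A frecency score just means counting failures with older ones weighted down. A sketch, with an arbitrary 14-day half-life:

  import math
  import time

  HALF_LIFE_DAYS = 14.0  # arbitrary knob

  def frecency(failure_timestamps, now=None):
      """Each failure contributes 1.0, decaying by half every HALF_LIFE_DAYS."""
      now = now or time.time()
      score = 0.0
      for ts in failure_timestamps:
          age_days = (now - ts) / 86400.0
          score += math.exp(-math.log(2) * age_days / HALF_LIFE_DAYS)
      return score

  # A failure today is worth 1.0; one from two weeks ago is worth 0.5.
  now = time.time()
  print(frecency([now, now - 14 * 86400], now=now))  # ~1.5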

Core Code Changes

  • Replacing timer-based APIs (or timer-based mechanisms/effects that people use) with event-based mechanisms - see the sketch after this list
  • Make setTimeout less flaky - bug 473680
  • be able to toggle on/off "extreme" logging at will on a per test basis
  • JS stack traces included in failures (bug 516733)
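
The setTimeout work itself is JavaScript, but the underlying principle (wait for the completion event, never a guessed delay) is language-agnostic. A purely illustrative Python version using threading.Event:

  import threading
  import time

  done = threading.Event()

  def background_work():
      time.sleep(0.2)  # simulated variable-length work
      done.set()       # signal completion instead of relying on timing

  threading.Thread(target=background_work).start()

  # Flaky, timer-based: assumes the work always finishes within 100 ms.
  #   time.sleep(0.1); assert done.is_set()   # fails intermittently

  # Robust, event-based: blocks until the completion event actually fires.
  assert done.wait(timeout=10)  # generous timeout only as a safety net
  print("completion observed via event, not a timing guess")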

Test Harness Code Changes

  • focus issues affect many tests
  • enable a way to mark a mochitest as failing randomly / tag it as intermittent orange
    • be able to run those marked mochitests over and over in a "random orange" suite (start with mochitest but enable this for all test harnesses) - see the first sketch after this list
  • capture more verbose data per test; discard it if the test passes, include it in the log if the test fails - see the second sketch after this list
  • reftest - integrate the visual diff viewer into tbpl/log viewer so that it's easy for people to see what has changed
  • ignore intentional crashes (bug 539823)
  • distinguish plugin crashes (bug 571436)
  • crash signatures (bug 570723)
  • leak summaries (bug 571423)
  • compiler error summaries (bug 482177)
  • buildbot error summaries (bug 457976, bug 522792)
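
For the "random orange" suite, the driver could be as simple as rerunning each marked test many times and reporting the failure rate. The test list and runner command below are placeholders, not the real mochitest invocation.

  import subprocess

  # Hypothetical list of tests marked intermittent.
  MARKED_INTERMITTENT = ["test_focus_blur.html", "test_timer_drift.html"]

  def rerun(test, iterations=100):
      failures = 0
      for _ in range(iterations):
          result = subprocess.run(
              ["python", "runtests.py", "--test-path", test],  # placeholder command
              capture_output=True,
          )
          if result.returncode != 0:
              failures += 1
      return failures

  for test in MARKED_INTERMITTENT:
      print(f"{test}: failed {rerun(test)}/100 runs")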
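
And for "capture more verbose data, discard on pass": buffer debug-level output in memory per test and only flush it into the log when the test fails. A sketch using Python's logging module:

  import io
  import logging

  def run_with_buffered_log(test_fn, logger):
      buf = io.StringIO()
      handler = logging.StreamHandler(buf)
      handler.setLevel(logging.DEBUG)
      logger.addHandler(handler)
      try:
          test_fn(logger)
      except Exception:
          # Test failed: flush the verbose buffer into the real log.
          print("---- verbose log for failed test ----")
          print(buf.getvalue(), end="")
          raise
      finally:
          logger.removeHandler(handler)
      # Test passed: the buffer is simply discarded.

  logger = logging.getLogger("suite")
  logger.setLevel(logging.DEBUG)

  def passing(log):
      log.debug("lots of detail nobody needs")

  def failing(log):
      log.debug("detail that explains the failure")
      raise AssertionError("intermittent!")

  run_with_buffered_log(passing, logger)      # buffer discarded silently
  try:
      run_with_buffered_log(failing, logger)  # buffer printed, error re-raised
  except AssertionError:
      pass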

Tools for a Better Process

  • Push-based process (Mozilla Pulse, perhaps) that notifies developers when things go south with their checkin - see the sketch below
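
A consumer sketch using a generic AMQP client (pika). The exchange name, routing key, and message shape below are placeholders, not Pulse's actual schema; the real thing would presumably go through the mozillapulse library.

  import json
  import pika  # generic AMQP client

  conn = pika.BlockingConnection(pika.ConnectionParameters(host="pulse.mozilla.org"))
  channel = conn.channel()
  queue = channel.queue_declare(queue="", exclusive=True).method.queue
  channel.queue_bind(exchange="exchange/build/",  # placeholder exchange name
                     queue=queue,
                     routing_key="build.#")       # placeholder routing key

  def on_message(ch, method, properties, body):
      msg = json.loads(body)
      if msg.get("status") == "orange":           # placeholder payload shape
          print(f"notify {msg.get('who')}: {msg.get('builder')} went orange")

  channel.basic_consume(queue=queue, on_message_callback=on_message, auto_ack=True)
  channel.start_consuming()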

Preventative Automation

  • if we had a test suite that exercised negative pressures (stalled threads, NTP adjustments, clock skew, skipped execution, missed reads, etc.) and really ran the gamut of crazy issues, then we could run it on try server builds and see if it picks up anything in a patch (or do this on an R & R box to help people debug these issues) - see the sketch below
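
A crude sketch of what such a suite could do in-process: run the test while background threads stall the CPU and a monkeypatched clock injects jitter. Real coverage of NTP steps, missed reads, and the rest would need OS-level fault injection; everything here is illustrative.

  import random
  import threading
  import time

  def cpu_stall(stop, max_stall=0.05):
      """Busy-wait in short random bursts to steal cycles from the test."""
      while not stop.is_set():
          end = time.perf_counter() + random.uniform(0, max_stall)
          while time.perf_counter() < end:
              pass
          time.sleep(random.uniform(0, 0.1))

  def skewed_clock(real_time, max_skew=0.5):
      return lambda: real_time() + random.uniform(-max_skew, max_skew)

  def run_under_pressure(test_fn, stallers=4):
      stop = threading.Event()
      threads = [threading.Thread(target=cpu_stall, args=(stop,))
                 for _ in range(stallers)]
      for t in threads:
          t.start()
      original = time.time
      time.time = skewed_clock(original)  # inject jittery wall-clock reads
      try:
          test_fn()
      finally:
          time.time = original
          stop.set()
          for t in threads:
              t.join()

  run_under_pressure(lambda: print("test ran under induced stalls and clock skew"))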