

Mozilla has a problem with intermittent failures, commonly known as "Oranges". They occur at an ever-increasing rate, and over the years many tools have been built to make the volume of these failures easier to manage. Stockwell will introduce a series of changes to the engineering workflow at Mozilla to provide a sustainable, longer-term solution to these failures. In effect, we will turn intermittent failures into actionable items, so that all developers feel empowered and responsible for making things better.


  • gbrown - triage, reproducibility
  • jmaher - policy, experimentation
  • wlach - developer hacking, tools


Meetings are held fortnightly on Tuesdays at 8:30am PDT

Information and notes are on the meeting wiki


Goal of the project

Reduce intermittent failures to fewer than 5 per push.

We will build tools, processes, and relationships to change the way we view testing and automation. When a failure occurs, we need to understand how realistic it is to fix the issue and, where a fix is possible, give developers the tools to make it. If a fix is not realistic, we need to reduce the failure's visibility while still tracking it, in case it becomes severe and warrants a more serious investment of time.
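As a rough illustration of how the per-push target could be tracked, here is a minimal sketch. It assumes the metric is an average (total intermittent failures divided by total pushes). The sample data, the push and test names, and the helper functions are all hypothetical and are not part of any Stockwell or Treeherder tooling.

```python
from collections import Counter

# Target from the project goal: fewer than 5 intermittent failures per push.
TARGET_PER_PUSH = 5

# Hypothetical sample data: one (push_id, test_name) pair per intermittent failure.
failures = [
    ("push-1", "test_a"),
    ("push-1", "test_b"),
    ("push-2", "test_a"),
    ("push-4", "test_a"),
    ("push-4", "test_c"),
    ("push-4", "test_b"),
]

# Every push in the period, including pushes with no failures (push-3 here).
push_ids = ["push-1", "push-2", "push-3", "push-4"]


def failures_per_push(failures, push_ids):
    """Average number of intermittent failures per push (assumed definition)."""
    if not push_ids:
        raise ValueError("at least one push is required")
    return len(failures) / len(push_ids)


def meets_target(rate, target=TARGET_PER_PUSH):
    """True when the average rate is strictly below the target."""
    return rate < target


def most_frequent(failures, n=3):
    """The n tests that fail most often, as (test_name, count) pairs."""
    return Counter(test for _, test in failures).most_common(n)


rate = failures_per_push(failures, push_ids)
print(f"failures per push: {rate:.2f}, meets target: {meets_target(rate)}")
print("most frequent:", most_frequent(failures, 2))
```

Ranking tests by failure count is one possible way to decide which intermittents deserve developer time and which can be tracked at lower visibility. The actual triage criteria are not specified here.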

Non Goals

We will not be fixing every intermittent, nor will we be disabling all tests.

Dependencies / Who will use this

  • sheriffs - they typically identify new intermittents
  • autostar - will be automatically categorizing intermittents
  • developers - will be fixing intermittents
  • testduty - new role (short term/long term) to triage and keep orange factor down

Design / Approach

  • TBD in December

Milestones and Dates

  • December 10th, 2016 - deliver plan to Mozilla Developers including plan for Q1, metrics to track
  • Q1, 2017 - implement plan, continue experiments
  • Q2, 2017 - repeat Q1, deliver report and 2017 Q3/Q4 plan in San Francisco to Mozilla developers


Getting Involved

We don't have a clear list of bugs, but when we do, they will show up here. If we determine we have mentored bugs, they will be easily discoverable here as well.

ID | Summary | Priority | Status
1241535 | find a way to quickly "retrigger" a job to generate a sps profile for talos | -- | NEW
1316113 | tracking bug for tests which fail to run solo or repeat successfully | -- | NEW
1322433 | Make it easier to retrigger a job with failing test with extra logging and debugging options | -- | REOPENED
1337839 | consider logging all information from test logs now that we are on taskcluster | -- | NEW
1337844 | make reproducing test failures in context of automation much easier | -- | NEW
1337977 | Display failure frequency in Treeherder | -- | NEW
1357513 | [meta] New/modified test verification | -- | NEW
1357557 | [meta] Enable eslint on more test files | -- | NEW

8 Total; 8 Open (100%); 0 Resolved (0%); 0 Verified (0%);