Auto-tools/Projects/Stockwell

From MozillaWiki

Overview

Mozilla has a problem with intermittent failures, commonly known as "Oranges". These occur at an ever-increasing rate, and over the years many tools have been built and refined to make the volume of intermittent failures easier to manage. Stockwell will provide a series of changes to the engineering workflow at Mozilla that give us a longer-term, sustainable solution to these failures. In effect, we will turn intermittent failures into actionable items, so that all developers feel empowered and responsible for making things better.

Team

  • gbrown - test quality, tooling, deep analysis
  • jmaher - policy, experimentation, triage, docs

Meetings

Meetings are held fortnightly on Tuesdays at 8:30am PDT.

Information and notes are on the meeting wiki.

Problem

Goal of the project

Reduce intermittent failures to fewer than 5 per push.

We will build tools, processes, and relationships to change the way we view testing and automation. When a failure occurs, we need to understand how realistic it is to fix the issue and, where a fix is possible, give developers the tools to make it. If a fix is not realistic, we need to reduce the failure's visibility and keep track of it in case it becomes severe and needs more serious time invested in it.
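As a rough illustration of the metric behind this goal, the sketch below computes failures per push as a simple average and compares it with the target of 5. It assumes the goal is read as an average across pushes; the function names, the push identifiers, and the sample counts are all hypothetical and do not come from any Stockwell tool.

```python
from typing import Mapping

# Target taken from the stated goal: fewer than 5 intermittent failures per push.
TARGET_PER_PUSH = 5


def failures_per_push(failures_by_push: Mapping[str, int]) -> float:
    """Return the average number of intermittent failures per push.

    `failures_by_push` maps a push identifier to the number of
    intermittent failures seen on that push (hypothetical input shape).
    """
    if not failures_by_push:
        return 0.0
    return sum(failures_by_push.values()) / len(failures_by_push)


def meets_goal(failures_by_push: Mapping[str, int]) -> bool:
    """True when the average is strictly below the target."""
    return failures_per_push(failures_by_push) < TARGET_PER_PUSH


# Made-up sample data, for illustration only.
sample = {"push-a": 3, "push-b": 9, "push-c": 6}
print(failures_per_push(sample))  # 6.0
print(meets_goal(sample))         # False
```

With the made-up sample above, the average is 6 failures per push, so the goal is not met.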

Non Goals

We will not be fixing every intermittent, nor will we be disabling all tests.

Dependencies / Who will use this

  • sheriffs - they typically identify new intermittents
  • autostar - will be automatically categorizing intermittents
  • developers - will be fixing intermittents
  • testduty - new role (short term/long term) to triage and keep orange factor down

Design / Approach

Milestones and Dates

  • December 10th, 2016 - deliver plan to Mozilla Developers including plan for Q1, metrics to track
  • Q1, 2017 - implement plan, continue experiments
  • Q2, 2017 - repeat Q1, deliver report and 2017 Q3/Q4 plan in San Francisco to Mozilla developers

Triage

Triage is one of the first things we did, and it is still required for the project to succeed. Triage will change in scope over time as we adjust processes, expectations, tools, and robots.

Getting Involved

We don't have a clear list of bugs, but when we do, they will show up here. If we determine we have mentored bugs, they will be easily discoverable here as well.

Full Query
ID | Summary | Priority | Status
1316113 | tracking bug for tests which fail to run solo or repeat successfully | -- | NEW
1322433 | Make it easier to retrigger a job with failing test with extra logging and debugging options | P3 | REOPENED
1337839 | consider logging all information from test logs now that we are on taskcluster | -- | NEW
1337844 | make reproducing test failures in context of automation much easier | -- | NEW
1337977 | Display failure frequency in Treeherder | P3 | NEW
1357513 | [meta] New/modified test verification | P3 | NEW
1395696 | Consider adding a commit hook for tests.yml changes | P3 | NEW

7 Total; 7 Open (100%); 0 Resolved (0%); 0 Verified (0%);