Buildbot/Talos/Sheriffing/Alert FAQ
FAQ
What alerts are displayed in Alert Manager
[Alert Manager] is designed to show users the alerts that need attention first. The view is based on a time-based series, and only alerts that are regressions are shown. When an alert is added to the system, it is marked with a status of 'NEW'. Once you have investigated it and added a bug, the status changes to 'Investigating'. By default, the main view of [Alert Manager] shows only 'NEW' and 'Investigating' alerts.
Keep in mind there are other types of alerts. We do our best to automatically detect [merges]. When we do detect merges, those are hidden from the default view and a 'show/hide merged alerts' link shows up under the revision id. Clicking that will show all the merged alerts.
To see all alerts related to a change, switch to the 'Show All' view by checking the show all checkbox in the top nav bar and clicking filter. That brings in all alerts (resolved, wontfix, duplicate, etc.), so it is recommended that you only do this while viewing alerts for a single revision.
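The default-view rules described above can be summarized as a small filter. This is only a sketch: the `Alert` record, its field names, and the `default_view` function are hypothetical and are not taken from the Alert Manager code.

```python
from dataclasses import dataclass

# Statuses shown in the default view, per the text above.
DEFAULT_STATUSES = {"NEW", "Investigating"}


@dataclass
class Alert:
    # Hypothetical record; the field names are assumptions for illustration.
    test: str
    status: str
    is_regression: bool
    is_merge: bool = False


def default_view(alerts, show_all=False, show_merged=False):
    """Return the alerts the main view would list under the stated rules."""
    if show_all:
        # 'Show All' brings in every alert, whatever its status.
        return list(alerts)
    return [
        a for a in alerts
        if a.is_regression
        and a.status in DEFAULT_STATUSES
        and (show_merged or not a.is_merge)
    ]
```

Under these assumptions, a resolved alert or an improvement only appears once `show_all` is set, and a detected merge only appears once `show_merged` is set.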
Do we care about all alerts/tests
Yes, we do. Some tests are more useful to look at than others, mostly due to the noise in the tests. Here are some alerts which should be looked at with a lower priority or ignored:
- LibXUL Memory during link - not a talos test; the data is posted to the graph server from the builds. In fact, I rarely look at these.
- %CPU/MainRSS - these are counters. Unless the change is really high (>10%), there is no need to investigate these.
- Dromaeo DOM - we only alert on >10%. These are more of a sanity test than a real performance test at this point.
- a11y - not a noisy test, but historically it has had a lot of ups and downs unrelated to checkins
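As a rough illustration of the list above, the triage notes could be encoded like this. It is a sketch only: the function name, the 'low'/'normal' labels, and matching tests by substring are assumptions made for the example, not part of any existing tool.

```python
# Threshold from the list above: counters and Dromaeo DOM matter above 10%.
THRESHOLD_PERCENT = 10.0

# Name fragments taken from the list above; matching by substring is an
# assumption made for this sketch.
THRESHOLDED = ("%CPU", "MainRSS", "Dromaeo DOM")
LOW_PRIORITY = ("LibXUL Memory during link", "a11y")


def triage_priority(test_name, percent_change):
    """Return 'low' or 'normal' for an alert, following the notes above."""
    magnitude = abs(percent_change)
    if any(name in test_name for name in THRESHOLDED):
        # Only worth investigating when the change is larger than 10%.
        return "normal" if magnitude > THRESHOLD_PERCENT else "low"
    if any(name in test_name for name in LOW_PRIORITY):
        return "low"
    return "normal"
```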
Here are some platforms/tests which are exceptions:
- OSX 10.6 - we only run tests that are related to graphics rendering. We don't run tp5o, ts_paint, sessionrestore*, tart/cart.
- Windows XP - we don't run dromaeo*, kraken, v8
- Linux64 - the only platform which supports media_tests
- Windows 7 - the only platform that supports xperf (toolchain is only installed there)
Lastly, we should prioritize alerts on the Mozilla-Beta and Mozilla-Aurora branches, since those affect more people.
What does a regression look like on the graph
On almost all of our tests, we are measuring time. This means that the lower the score, the better. Whenever the graph increases in value, that is a regression.
We also have some tests which measure internal metrics. A few of those are reported so that a higher score is better, which means a decrease in the graph is the regression. This is confusing, but we refer to these as reverse tests. The reverse tests are:
- dromaeo_css
- dromaeo_dom
- v8 version 7
- canvasmark
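The direction rule can be written down in a few lines. This is a minimal sketch, assuming each alert compares a value before the change with a value after it; the function name and the test-name strings are illustrative.

```python
# Tests where a higher score is better ("reverse" tests), as listed above.
REVERSE_TESTS = {"dromaeo_css", "dromaeo_dom", "v8 version 7", "canvasmark"}


def is_regression(test, old_value, new_value):
    """Return True if going from old_value to new_value is a regression."""
    if test in REVERSE_TESTS:
        # Higher is better, so a drop in the score is a regression.
        return new_value < old_value
    # Time-based test: lower is better, so a rise is a regression.
    return new_value > old_value
```

For example, a tp5o result going from 100 to 110 is a regression, while a canvasmark score going from 100 to 110 is an improvement.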
Why does Alert Manager print -xx%
An alert is either a regression or an improvement. The alerts we show by default are regressions only. It is important to know the severity of an alert. For example, a 3% regression is important to understand, but a 30% regression probably needs to be fixed ASAP. The severity is shown as XX% in the UI. We use -XX% to denote a regression, and +XX% for the improvement alerts.
NOTE: the sign already takes reverse tests into account, so whenever there is a -XX%, that is the percent of the regression.
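The sign convention can be illustrated as follows. This is a sketch under assumptions: it takes the percentage relative to the old value, which is a guess at how the number is derived rather than something the text confirms, and the function names are hypothetical.

```python
# Tests where a higher score is better ("reverse" tests).
REVERSE_TESTS = {"dromaeo_css", "dromaeo_dom", "v8 version 7", "canvasmark"}


def signed_percent(test, old_value, new_value):
    """Percent change where negative means regression, positive improvement."""
    if test in REVERSE_TESTS:
        # Higher is better: a gain in score is positive.
        delta = new_value - old_value
    else:
        # Lower is better: a drop in the measured time is positive.
        delta = old_value - new_value
    # Assumption: the percentage is taken relative to the old value.
    return delta / old_value * 100.0


def format_alert(test, old_value, new_value):
    """Render the change the way the UI text describes, e.g. '-3%'."""
    return f"{signed_percent(test, old_value, new_value):+.0f}%"
```

With this convention, a tp5o time going from 100 to 103 and a canvasmark score going from 100 to 97 both render with a minus sign, because both are regressions.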