Buildbot/Talos/Sheriffing/Alert FAQ: Difference between revisions

m
june 2016 update
m (june 2016 update)
Line 2: Line 2:


= What alerts are displayed in Alert Manager =
[[http://alertmanager.allizom.org:8080/alerts.html# Alert Manager]] is designed to show users the alerts that need attention first.  This is based on a time-based series, and we only show alerts that are regressions.  When an alert is added to the system, it is marked with a status of 'NEW'.  Once you have investigated it and added a bug, the status changes to 'Investigating'.  You will only see 'NEW' or 'Investigating' alerts by default when pulling up the main view of [[http://alertmanager.allizom.org:8080/alerts.html# Alert Manager]].
[[https://treeherder.mozilla.org/perf.html#/alerts Perfherder Alerts]] defaults to Talos alerts that are untriaged.  It is a goal to keep this list empty!  You can view alerts that are improvements or in any other state (e.g. investigating, fixed, etc.) by using the drop-down at the top of the page.
 
Keep in mind there are other types of alerts.  We do our best to automatically detect [[https://wiki.mozilla.org/Buildbot/Talos/Sheriffing/Tree_FAQ merges]].  When we do detect merges, those are hidden from the default view and a 'show/hide merged alerts' link shows up under the revision id. Clicking that will show all the merged alerts.
 
To see all alerts related to a change, you can view the 'Show All' version by clicking the show all checkbox in the top nav bar and clicking filter.  That will bring in all alerts (resolved, wontfix, duplicate, etc.) so it is recommended that you only do this while viewing alerts for a single revision.
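The default-view and 'Show All' behaviour described above can be sketched as a simple filter. This is only an illustration, not the Alert Manager implementation: the `Alert` class, its field names, and the `default_view` function are hypothetical; only the 'NEW' and 'Investigating' statuses, the regressions-only rule, and the hiding of merged alerts come from the text.

```python
from dataclasses import dataclass

# Statuses the text says are visible by default.
DEFAULT_STATUSES = {"NEW", "Investigating"}


@dataclass
class Alert:
    # Hypothetical record layout, for illustration only.
    revision: str
    test: str
    status: str
    is_regression: bool
    is_merge: bool = False


def default_view(alerts, show_all=False, revision=None):
    """Return the alerts to display, optionally limited to one revision."""
    selected = [a for a in alerts if revision is None or a.revision == revision]
    if show_all:
        # 'Show All' brings in every status, merges and improvements included.
        return selected
    return [
        a
        for a in selected
        if a.is_regression and not a.is_merge and a.status in DEFAULT_STATUSES
    ]
```

Passing a `revision` together with `show_all=True` mirrors the recommendation to use 'Show All' only while viewing alerts for a single revision.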
 


= Do we care about all alerts/tests =
Yes we do.  Some tests are more useful to look at than others, mostly due to the noise in the tests.  Here are some alerts which should be looked at with a lower priority or ignored:
Yes we do.  Some tests are more commonly invalid, mostly due to the noise in the tests.  We also adjust the threshold per test: the default is 2%, but for dromaeo it is 5%.
* LibXUL Memory during link - not a talos test; data is posted to graph server from the builds.  In fact I rarely look at these.
* %CPU/MainRSS - these are counters; unless the change is really high (>10%), there is no need to investigate these.
* Dromaeo DOM - we only alert on >10%.  These are more of a sanity test than a real performance test at this point.
* a11y - not a noisy test, but historically it has had a lot of ups and downs unrelated to checkins.
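As a rough sketch of the per-test threshold mentioned above (2% by default, 5% for dromaeo), the check could look like the following. The function names, the prefix matching on test names, and the use of `>=` at the boundary are assumptions made for illustration; the actual alerting code may compare values differently.

```python
DEFAULT_THRESHOLD_PCT = 2.0

# Per-test overrides. Only the dromaeo value comes from the text;
# matching by test-name prefix is an assumption.
THRESHOLD_OVERRIDES_PCT = {"dromaeo": 5.0}


def threshold_for(test_name):
    """Return the alert threshold (in percent) for a test."""
    for prefix, pct in THRESHOLD_OVERRIDES_PCT.items():
        if test_name.startswith(prefix):
            return pct
    return DEFAULT_THRESHOLD_PCT


def should_alert(test_name, old_value, new_value):
    """True when the relative change meets the threshold for this test."""
    if old_value == 0:
        return False  # relative change is undefined; skip rather than guess
    change_pct = abs(new_value - old_value) * 100.0 / abs(old_value)
    return change_pct >= threshold_for(test_name)
```

With these numbers, a 3% change in tp5o would raise an alert while a 4% change in dromaeo_dom would not.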


Here are some platforms/tests which are exceptions:
Here are some platforms/tests which are exceptions about what we run:
* OSX 10.6, we only run tests that are related to graphics rendering.  We don't run tp5o, ts_paint, sessionrestore*, tart/cart.
* Windows XP - we don't run dromaeo*, kraken, v8
* Linux64 - the only platform which supports media_tests
* Linux64 - the only platform on which we run dromaeo_dom
* Windows 7 - the only platform that supports xperf (toolchain is only installed there)


Line 42: Line 32:


= Why does Alert Manager print -xx% =
The alert will either be a regression or an improvement.  For the alerts we show by default, it is regressions only.  It is important to know the severity of an alert.  For example a 3% regression is important to understand, but a 30% regression probably needs to be fixed ASAP.  This is annotated as a XX% in the UI.  We use -XX% to denote a regression, and for the improvement alerts it is +XX%.
The alert will either be a regression or an improvement.  For the alerts we show by default, it is regressions only.  It is important to know the severity of an alert.  For example a 3% regression is important to understand, but a 30% regression probably needs to be fixed ASAP.  This is annotated as a XX% in the UI.  There is no + or - to indicate improvement or regression; this is an absolute number.  Use the bar graph to the side to determine which type of alert this is.


NOTE: for the reverse tests we take that into account, so whenever there is a -XX%, that is the percent of the regression.
NOTE: for the reverse tests we take that into account, so the bar graph will know to look in the correct direction.
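To make the relationship between the displayed percentage and the direction of the change concrete, here is a minimal sketch. It assumes that "reverse" tests are those where a higher value is better and that all other tests are lower-is-better; the `describe_change` function and its `higher_is_better` flag are hypothetical names, not part of either UI.

```python
def describe_change(old_value, new_value, higher_is_better=False):
    """Return (magnitude_pct, kind) for a change between two measurements.

    magnitude_pct is the absolute percentage shown in the UI; kind says
    whether the change is a regression or an improvement.
    """
    if old_value == 0:
        raise ValueError("cannot compute a relative change from zero")
    delta_pct = (new_value - old_value) * 100.0 / abs(old_value)
    if delta_pct == 0:
        return (0.0, "no change")
    # For reverse (higher-is-better) tests an increase is good;
    # for everything else a decrease is good.
    got_better = delta_pct > 0 if higher_is_better else delta_pct < 0
    return (abs(delta_pct), "improvement" if got_better else "regression")
```

For a lower-is-better test, going from 100 to 130 is reported as a 30% regression; the same numbers on a reverse test would be a 30% improvement.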