Auto-tools/Projects/AlertManager

Team

  • Talos Sheriffs (:jmaher, :vaibhav1994, :mishravikas)
  • Developers (:jmaher, :kaustabh93, :mishravikas, :vaibhav1994)
  • Others (:avih, :wlach, :dminor)

Problem

Alert Manager is a simple, single-purpose tool for managing automated Talos alerts.

When a developer checks in code to the Mozilla trunk branch, we produce 15+ builds and run a series of unit and performance tests on them. We only report a regression when it is sustained, which means we need data from later check-ins before we can detect one reliably. As code moves between branches, we get duplicate alerts. Some alerts are just noise, and others become moot because the offending patch is backed out for unit test failures. Overall, about 1 in every 33 alerts results in a bug that requires some form of developer attention.
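
To illustrate why later check-ins are needed, here is a minimal sketch of a sustained-change check. It is not the analysis the graph server actually runs; the function name, the window size, and the threshold are all assumptions made for illustration. It compares the average of a few data points before a check-in with the average of the same number of points from that check-in onward, and declines to answer until enough later data exists.

```python
from statistics import mean


def sustained_change(values, index, window=3, threshold_pct=2.0):
    """Return the percent change at `index` if it looks sustained, else None.

    `values` is a list of test results ordered by check-in. The window size
    and threshold are illustrative assumptions, not Talos settings.
    """
    before = values[max(0, index - window):index]
    after = values[index:index + window]
    if len(before) < window or len(after) < window:
        # Not enough data on one side: we must wait for future check-ins.
        return None
    base = mean(before)
    if base == 0:
        return None  # avoid dividing by zero on degenerate data
    change = (mean(after) - base) / base * 100.0
    return change if abs(change) >= threshold_pct else None
```

With `window=3`, a change at a given check-in can only be evaluated once that check-in and the two following it have results, which is why alerts arrive some time after the offending push.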

Alert Manager provides a WebUI for us to triage, categorize, investigate, and manage the large alert volume while keeping our performance regressions well documented.


Goals & Considerations

The goal of Alert Manager is to enable sheriffs to quickly and reliably investigate alerts and translate real regressions into bugs.

Alert Manager should be able to work with a variety of data sources (graph server alerts, dzalerts, perfherder).

Alert Manager needs to work around noise in the tests and missing data in job scheduling.

Non-Goals

  • graphing software
  • scheduling of jobs
  • how jobs are run
  • what tests are run
  • noise in the test jobs
  • automated bug filing

Dependencies / Who will use this

Dependencies

  • Graph server to generate alerts and post to mozilla.dev.tree-alerts
  • leafnode/fetchnews on alert manager server to get messages from mozilla.dev.tree-alerts and insert into local database
  • parse_news.py creates links to treeherder and depends on the treeherder API to get the pushlog for a focused view
  • every day we update a table of bug IDs and statuses for the open bugs associated with alerts; this requires the Bugzilla API

Users

  • Talos Sheriffs (:jmaher, :vaibhav1994, :mishravikas)
  • Release Managers
  • Developers who will look at the impact of their changes while investigating an alert

Design and Approach

The automated alerts are sent to the [mozilla.dev.tree-management newsgroup]. We have a script which parses all the alerts and puts them into a SQL database. For each alert we care about:

  • revision range (set of changes between last known good data point and the one that shows the regression)
  • platform
  • test name
  • percent regressed
  • branch reported on
  • date reported
  • link to graph server showing the regression graphically
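
As a rough sketch of how one parsed alert could be represented and stored, the fields above map naturally onto a small record type and a single table. The class name, column names, and the use of SQLite are assumptions for illustration; the real schema lives in the alert_manager repository and may differ.

```python
import sqlite3
from dataclasses import dataclass, astuple


@dataclass
class Alert:
    first_rev: str   # last known good changeset (start of revision range)
    last_rev: str    # changeset that shows the regression (end of range)
    platform: str
    test: str
    percent: float   # percent regressed, as reported in the alert
    branch: str
    date: str        # date reported, e.g. "2014-10-01"
    graph_url: str   # link to the graph server view


# Hypothetical table layout mirroring the fields listed above.
SCHEMA = """
CREATE TABLE IF NOT EXISTS alerts (
    first_rev TEXT, last_rev TEXT, platform TEXT, test TEXT,
    percent REAL, branch TEXT, date TEXT, graph_url TEXT
)
"""


def store_alert(conn, alert):
    """Insert one alert, creating the table on first use."""
    conn.execute(SCHEMA)
    conn.execute(
        "INSERT INTO alerts VALUES (?, ?, ?, ?, ?, ?, ?, ?)", astuple(alert)
    )
    conn.commit()
```

A caller would open a connection with `sqlite3.connect(...)`, build an `Alert` from the parsed newsgroup message, and pass both to `store_alert`.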

With that information we can triage alerts successfully. When the script parses the newsgroup messages and inserts new alerts into the database, it first checks whether each alert is a duplicate produced by a merge. An alert whose revision range has more than 10 bugs associated with it (bug numbers appear in the changeset comments) is treated as a likely merge. For such an alert, we check whether its list of changesets includes the set of changesets from a previously inserted alert. If it does, and the platform and test match, we mark the alert as merged. Merged alerts are hidden by default in the UI.
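
The merge check described above can be sketched roughly as follows. This is a simplified illustration rather than the code in parse_news.py: the `Alert` class, the helper names, and the bug-number regular expression are assumptions, and each alert is assumed to carry a mapping of changeset id to commit comment.

```python
import re
from dataclasses import dataclass

# Assumed comment convention: bug numbers appear as "Bug 123456".
BUG_RE = re.compile(r"\bbug\s+(\d+)", re.IGNORECASE)
MERGE_BUG_THRESHOLD = 10  # more than 10 bugs suggests a merge


@dataclass
class Alert:
    platform: str
    test: str
    changesets: dict  # changeset id -> commit comment
    merged: bool = False


def count_bugs(alert):
    """Count distinct bug numbers mentioned in the changeset comments."""
    bugs = set()
    for comment in alert.changesets.values():
        bugs.update(BUG_RE.findall(comment))
    return len(bugs)


def is_merged_duplicate(new, existing):
    """True if `new` looks like a merge that repeats an earlier alert."""
    if count_bugs(new) <= MERGE_BUG_THRESHOLD:
        return False
    new_ids = set(new.changesets)
    for old in existing:
        same_target = old.platform == new.platform and old.test == new.test
        # The earlier alert's changesets must all be contained in the new one.
        if same_target and old.changesets and set(old.changesets) <= new_ids:
            return True
    return False
```

On insert, a caller would set `new.merged = is_merged_duplicate(new, stored_alerts)` and leave it to the UI to hide merged alerts by default.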

The alerts that we show by default in the UI are usually regressions (we ignore improvements by default), and the merge detection has a decent success rate in filtering out duplicates (we merge to 4 other branches daily). In the UI, we can view the graph, follow a link to the hg view of the code changes, and follow a link to tbpl, where we can retrigger jobs, look at the raw logs, and see the data before and after the given changeset.

An end user takes action on an alert by looking at the data before and after the changeset in question. It is common practice to retrigger the job at least once, if not 3 times, to confirm that this is really the offending changeset. At this stage we also build any builds that were skipped, where that is possible. Once we have determined the changeset that caused the problem, we file a bug and add it to the webUI so we can easily reference it. Here is a link to a graphic view of this workflow.

Milestones and Dates

Our goal is to provide the best set of tools to manage all performance regressions at Mozilla. To achieve that, here are some changes we need to consider implementing:

  • automatic build backfilling
  • automatic retriggering
  • improve automatic merge detection
  • improve performance of webUI
  • support per page alerts (high resolution)
    • adapt to different database table
    • adjust UI to handle the large number of tests, bundle by test suite
  • detect backouts and mark regressions + improvements accordingly (happens about 5% of the time)
  • detect pgo vs non pgo (pgo has priority but more backfilling)

Eventually this will be integrated into Treeherder, the replacement for tbpl. Until Treeherder is proven and has a solid performance data storage backend and display UI, we are continuing our work here. All features implemented here will act as a beta version, with much of the logic and code to be reused in Treeherder.


Implementation

Technical notes, plans, and designs detailing how the project will be realized. The specifics of "how". This should also include how we expect this to be used and typical use cases

Getting Involved

Getting Started

You can always contact :jmaher, :dminor, or :Kaustabh93 on IRC. To get the code, run:

  • git clone https://github.com/jmaher/alert_manager.git

The [readme] on GitHub has good directions, along with sample data and everything else needed to get going.

Expectations

Links to coding styles, patch/pull request guidelines, unittest requirements

Bugs

Here are some bugs to get your feet wet and start making progress:

No results.


If you are set up with Alert Manager and are ready for a more challenging bug, check out some of these bugs:

No results.