Auto-tools/Projects/AlertManager: Difference between revisions

= Team =
 
* Talos Sheriffs (:jmaher, :vaibhav1994, :mishravikas)
* Developers (:jmaher, :kaustabh93, :mishravikas, :vaibhav1994)
* Others (:avih, :wlach, :dminor)
 
= Problem =


Alert Manager is a simple, single-purpose tool for managing Talos-based automated alerts.
Alert Manager provides a WebUI for us to triage, categorize, investigate, and manage the large alert volume while keeping our performance regressions well documented.


 
= Goals & Considerations =
 
The goal of Alert Manager is to enable sheriffs to quickly and reliably investigate alerts and translate real regressions into bugs.
 
Alert Manager should be able to work with a variety of data sources (graph server alerts, dzalerts, perfherder).
 
Alert Manager needs to work around noise in the tests and missing data in job scheduling.
 
= Non-Goals =
 
* graphing software
* scheduling of jobs
* how jobs are run
* what tests are run
* noise in the test jobs
* automated bug filing
 
= Dependencies / Who will use this =
== Dependencies ==
* Graph server to generate alerts and post to mozilla.dev.tree-alerts
* leafnode/fetchnews on alert manager server to get messages from mozilla.dev.tree-alerts and insert into local database
* parse_news.py creates links to treeherder and depends on the treeherder API to get the pushlog for a focused view
* every day we update a table of bug IDs and statuses for the open bugs linked to alerts; this requires the Bugzilla API
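
The daily bug-status refresh in the last dependency above could look roughly like the sketch below. This is an illustration only, not the project's actual code: the table name <code>bug_status</code>, its columns, and the <code>fetch_status</code> callable (a stand-in for a Bugzilla API lookup) are all assumptions.

```python
import sqlite3
from typing import Callable, Iterable


def refresh_bug_status(
    conn: sqlite3.Connection,
    bug_ids: Iterable[int],
    fetch_status: Callable[[int], str],
) -> int:
    """Store the current status of each bug; return how many rows were written."""
    # Hypothetical schema: one row per bug that is linked to an alert.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS bug_status ("
        "bug_id INTEGER PRIMARY KEY, status TEXT NOT NULL)"
    )
    # fetch_status is assumed to wrap a Bugzilla API call; it is injected so
    # the sketch stays runnable without network access.
    rows = [(bug_id, fetch_status(bug_id)) for bug_id in bug_ids]
    # INSERT OR REPLACE overwrites the previously stored status for a bug.
    conn.executemany(
        "INSERT OR REPLACE INTO bug_status (bug_id, status) VALUES (?, ?)", rows
    )
    conn.commit()
    return len(rows)
```

Keying the table on the bug ID means a daily run simply overwrites the previous status instead of accumulating history.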
 
== Users ==
* Talos Sheriffs (:jmaher, :vaibhav1994, :mishravikas)
* Release Managers
* Developers who will look at the impact of their changes while investigating an alert
 
= Design and Approach =


The automated alerts are sent to the [https://groups.google.com/forum/#!forum/mozilla.dev.tree-management mozilla.dev.tree-management newsgroup].  We have a script which parses all the alerts and puts them into a SQL database.  We care about:
An end user acts on an alert by looking at the data before and after the changeset in question.  It is common practice to retrigger at least once, if not three times, to show that this is really the offending changeset.  At this point we also build any builds that were skipped, if we can.  Once we have determined which changeset caused the problem, we file a bug and add it to the webUI so we can easily reference it.  Here is a link to a [http://elvis314.wordpress.com/2014/05/08/the-lifecycle-of-a-talos-performance-regression/ graphic view of this workflow].


= Milestones and Dates =


Our goal is to provide the best set of tools to manage all performance regressions at Mozilla.  To achieve that, here are some changes we need to consider implementing:
* automatic build backfilling
* automatic retriggering
* improve automatic merge detection
* improve performance of parse_news script
* improve performance of webUI
* generate reports of common release metrics
* query in the UI for older alerts which have already been released
* query bugzilla for all bugs to get state, make a workflow based on that
* support per page alerts (high resolution)
** adapt to different database table
Eventually this will be integrated into [https://treeherder.mozilla.org/ui/#/jobs Treeherder], the replacement for tbpl.  Until Treeherder is proven and has a solid performance data storage backend and display UI, we are continuing our work here.  All features implemented here will act as a beta version, with much of the logic and code to be reused in Treeherder.




= Implementation =
''Technical notes, plans, and designs detailing how the project will be realized.  The specifics of "how". This should also include how we expect this to be used and typical use cases''
= Getting Involved =
== Getting Started ==
You can always contact :jmaher, :dminor, or :Kaustabh93 on IRC.  To find the code, go here:
  * git clone https://github.com/jmaher/alert_manager.git


Good directions are in the [https://github.com/jmaher/alert_manager/blob/master/README.md README] on GitHub, which includes sample data and everything needed to get going.
== Expectations ==
''Links to coding styles, patch/pull request guidelines, unittest requirements''
== Bugs ==


Here are some bugs to get your feet wet and start making progress:
         "quicksearch": "status:new,assigned,reopened,unconfirmed",
         "product": "Testing",
         "whiteboard": "[good first bug",
         "blocks": 1029516
     }

         "quicksearch": "status:new,assigned,reopened,unconfirmed",
         "product": "Testing",
         "whiteboard": "[good next bug",
         "blocks": 1029516
     }
</bugzilla>
</bugzilla>
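
To run the same searches outside the wiki, the query parameters above can be turned into a Bugzilla REST search URL. The sketch below is an assumption-laden illustration: the endpoint constant and the helper name <code>build_bug_query</code> are not part of this page, and the wiki's bugzilla tag may issue its request differently.

```python
from urllib.parse import urlencode

# Assumed endpoint: the public Bugzilla REST bug search at bugzilla.mozilla.org.
BUGZILLA_SEARCH = "https://bugzilla.mozilla.org/rest/bug"


def build_bug_query(whiteboard: str) -> str:
    """Return a search URL that mirrors the parameters of the queries above."""
    params = {
        "quicksearch": "status:new,assigned,reopened,unconfirmed",
        "product": "Testing",
        "whiteboard": whiteboard,  # e.g. "[good first bug" or "[good next bug"
        "blocks": 1029516,
    }
    # urlencode percent-escapes characters such as "[" and ":" for us.
    return f"{BUGZILLA_SEARCH}?{urlencode(params)}"
```

Calling <code>build_bug_query("[good first bug")</code> and <code>build_bug_query("[good next bug")</code> gives one URL per list; fetching them is left to whichever HTTP client you prefer.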