Auto-tools/Projects/Automation Papercuts
NOTE: This is only a proposal at this time
Some parts of this document will refer to the Art of Unix Programming, which I (wlach) recommend highly.
Problem Statement
Many of our automation harnesses (talos, mochitest, etc.) are overly involved to set up and don't give clear and understandable error messages and return codes when they fail. This in turn leads to several issues:
- Difficult to bring new members of automation & release engineering up to speed on projects
- Difficult to attract new developers to work on our stuff (when things fail on first try for unknown reasons, it's discouraging!)
- Difficult for developers to test their software (hard to distinguish problems in their own code from failures in the automation)
- RelEng has to write additional code that parses logs just to determine why a run failed.
In most cases, this isn't actually due to systemic problems within the infrastructure but rather just small "papercuts" where we're not doing some combination of the following:
- Automating setup steps
- Checking preconditions before executing code
- Handling errors when they occur (see the rule of repair at http://catb.org/~esr/writings/taoup/html/ch01s06.html#id2878538: "when you must fail, fail noisily and as soon as possible")
- Giving clear and understandable output and debugging messages, but only when the user asks for them (see the rule of silence at http://catb.org/~esr/writings/taoup/html/ch01s06.html#id2878450)
- Returning appropriate exit codes when programs finish (a sketch of these practices together follows this list)
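As a rough sketch, here's what those practices look like together in a harness entry point. The option names and structure below are illustrative assumptions, not code from any of our harnesses:

    import argparse
    import logging
    import os
    import sys

    def main():
        # Hypothetical harness entry point illustrating the practices above.
        parser = argparse.ArgumentParser()
        parser.add_argument("--binary", required=True,
                            help="path to the browser binary under test")
        parser.add_argument("--verbose", action="store_true",
                            help="print debugging output")
        args = parser.parse_args()

        # Rule of silence: stay quiet unless the user asks for detail.
        logging.basicConfig(
            level=logging.DEBUG if args.verbose else logging.WARNING,
            format="%(levelname)s: %(message)s")

        # Check preconditions before doing anything expensive.
        if not os.path.isfile(args.binary):
            # Rule of repair: fail noisily and as soon as possible.
            print("FATAL: browser binary not found: %s" % args.binary,
                  file=sys.stderr)
            return 2  # "failure" in the return code schema below

        logging.debug("preconditions ok, starting test run")
        # ... run the actual tests here ...
        return 0  # success

    if __name__ == "__main__":
        sys.exit(main())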
Goals
The first goal is to make our automation give consistent error codes when things fail. We'll use the existing schema defined by Release Engineering in buildbot's results.py (https://github.com/buildbot/buildbot/blob/master/master/buildbot/status/results.py#L16), which is:
0 = success
1 = warning
2 = failure
3 = skipped
4 = exception
5 = retry
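A minimal sketch of how a harness could adopt this schema. The constant names match buildbot's results.py, but the helper function is a hypothetical illustration, not an existing API:

    import sys

    # Constant names as in buildbot's status/results.py.
    SUCCESS, WARNINGS, FAILURE, SKIPPED, EXCEPTION, RETRY = range(6)

    def exit_with_result(num_failures, harness_exception=False):
        # Translate a test run's outcome into a buildbot-style exit code.
        if harness_exception:
            sys.exit(EXCEPTION)  # 4: the harness itself blew up
        if num_failures > 0:
            sys.exit(FAILURE)    # 2: tests ran, some failed
        sys.exit(SUCCESS)        # 0: everything passed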
Beyond that, the goal of this project is simply to correct these problems bug by bug. There is no systemic problem in our code causing the issues outlined above: individual problems are usually fixable with small patches. In some cases we may need to refactor a subsystem in our automation to fix a problem, but this should be rare.
As a bonus, often these issues make "good first bugs", which will hopefully get the community excited about hacking on our infrastructure.
Work to be done
Phase 1: Consistent Return Codes
FIXME: Add bug #'s for these
- Talos
- Mochitest
- Mochitest Remote (mobile/fennec)
- Reftest
- Reftest Remote
- XPCShell
- Any others? Mozmill? Peptest?
Phase 2: Better error messages/diagnostics
We will work on this for the remainder of Q4 2011 and Q1 2012. The goal is to get as many issues as possible reported and fixed in that time!
- bug 698561 (fixed): run_tests.py should err out early if firefox is already running (a sketch of this kind of check follows this list)
- bug 695937: pageloader extension/talos should die more gracefully when pageset doesn't exist (potential mentors: wlach, jmaher, jhammel)
- FIXME: Need more of these!
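For example, bug 698561 above boils down to a precondition check. One possible shape for it, sketched with psutil; the psutil dependency and the process-name match are assumptions, not the actual patch:

    import sys
    import psutil

    def firefox_already_running():
        # Return True if any firefox process is currently alive.
        for proc in psutil.process_iter(["name"]):
            name = (proc.info["name"] or "").lower()
            if name.startswith("firefox"):
                return True
        return False

    if firefox_already_running():
        print("FATAL: firefox is already running; close it before running tests",
              file=sys.stderr)
        sys.exit(2)  # "failure" in the return code schema above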
If you fix one of these issues, be sure to flag philor to let him know that certain error messages may change.
Questions/Comments/Concerns
Feel free to drop us a line over at the #ateam channel!