Auto-tools/Projects/Automation Papercuts
NOTE: This is only a proposal at this time
Some parts of this document will refer to the Art of Unix Programming, which I (wlach) recommend highly.
Problem Statement
Many of our automation harnesses (talos, mochitest, etc.) are overly involved to set up and don't give clear and understandable error messages and return codes when they fail. This in turn leads to several issues:
- Difficult to bring new members of automation & release engineering up to speed on projects
- Difficult to attract new developers to work on our stuff (when things fail on first try for unknown reasons, it's discouraging!)
- Difficult for developers to test their software (hard to distinguish problems in their own code from failures in the automation)
- RelEng has to write additional code that parses logs just to determine why a run failed.
In most cases, this isn't actually due to systemic problems within the infrastructure but rather just small "papercuts" where we're not doing some combination of the following:
- Automating setup steps
- Checking preconditions before executing code
- Handling errors when they occur (see the rule of repair at http://catb.org/~esr/writings/taoup/html/ch01s06.html#id2878538: "when you must fail, fail noisily and as soon as possible")
- Giving clear and understandable output and debugging messages, but only when the user asks for them (see the rule of silence at http://catb.org/~esr/writings/taoup/html/ch01s06.html#id2878450)
- Returning appropriate exit codes when programs finish (a sketch of these practices together follows this list)
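As a rough sketch, here's what those practices look like together in a harness entry point. The option names and structure below are illustrative assumptions, not code from any of our harnesses:

    import argparse
    import logging
    import os
    import sys

    def main():
        # Hypothetical harness entry point illustrating the practices above.
        parser = argparse.ArgumentParser()
        parser.add_argument("--binary", required=True,
                            help="path to the browser binary under test")
        parser.add_argument("--verbose", action="store_true",
                            help="print debugging output")
        args = parser.parse_args()

        # Rule of silence: stay quiet unless the user asks for detail.
        logging.basicConfig(
            level=logging.DEBUG if args.verbose else logging.WARNING,
            format="%(levelname)s: %(message)s")

        # Check preconditions before doing anything expensive.
        if not os.path.isfile(args.binary):
            # Rule of repair: fail noisily and as soon as possible.
            print("FATAL: browser binary not found: %s" % args.binary,
                  file=sys.stderr)
            return 2  # "failure" in the return code schema below

        logging.debug("preconditions ok, starting test run")
        # ... run the actual tests here ...
        return 0  # success

    if __name__ == "__main__":
        sys.exit(main())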
Goals
The first goal is to make our automation give consistent error codes when things fail. We'll use the existing schema defined by Release Engineering in buildbot's results.py (https://github.com/buildbot/buildbot/blob/master/master/buildbot/status/results.py#L16), which is:
0 = success
1 = warning
2 = failure
3 = skipped
4 = exception
5 = retry
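A minimal sketch of how a harness could adopt this schema. The constant names match buildbot's results.py, but the helper function is a hypothetical illustration, not an existing API:

    import sys

    # Constant names as in buildbot's status/results.py.
    SUCCESS, WARNINGS, FAILURE, SKIPPED, EXCEPTION, RETRY = range(6)

    def exit_with_result(num_failures, harness_exception=False):
        # Translate a test run's outcome into a buildbot-style exit code.
        if harness_exception:
            sys.exit(EXCEPTION)  # 4: the harness itself blew up
        if num_failures > 0:
            sys.exit(FAILURE)    # 2: tests ran, some failed
        sys.exit(SUCCESS)        # 0: everything passed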
Beyond that, the goal of this project is simply to correct these problems bug by bug. There is no systemic problem in our code causing the issues outlined above: individual problems are usually fixable with small patches. In some cases we may need to refactor a subsystem in our automation to fix a problem, but this should be rare.
As a bonus, often these issues make "good first bugs", which will hopefully get the community excited about hacking on our infrastructure.
Work to be done
Phase 1: Consistent Return Codes
FIXME: Add bug #'s for these
- Talos
- Mochitest
- Mochitest Remote (mobile/fennec)
- Reftest
- Reftest Remote
- XPCShell
- Any others? Mozmill? Peptest?
Phase 2: Better error messages/diagnostics
We will work on this for the remainder of Q4 2011 and Q1 2012. The goal is to get as many issues as possible reported and fixed in that time!
- bug 698561 (fixed): run_tests.py should err out early if firefox is already running (a sketch of this kind of check follows this list)
- bug 695937: pageloader extension/talos should die more gracefully when pageset doesn't exist (potential mentors: wlach, jmaher, jhammel)
- FIXME: Need more of these!
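For example, bug 698561 above boils down to a precondition check. One possible shape for it, sketched with psutil; the psutil dependency and the process-name match are assumptions, not the actual patch:

    import sys
    import psutil

    def firefox_already_running():
        # Return True if any firefox process is currently alive.
        for proc in psutil.process_iter(["name"]):
            name = (proc.info["name"] or "").lower()
            if name.startswith("firefox"):
                return True
        return False

    if firefox_already_running():
        print("FATAL: firefox is already running; close it before running tests",
              file=sys.stderr)
        sys.exit(2)  # "failure" in the return code schema above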
If you fix one of these issues, be sure to flag philor to let him know that certain error messages may change.
Questions/Comments/Concerns
Feel free to drop us a line over at the #ateam channel!