NOTE: This is only a proposal at this time.
Parts of this document refer to The Art of Unix Programming, which I (wlach) highly recommend.
Many of our automation harnesses (talos, mochitest, etc.) are overly involved to set up and don't give clear and understandable error messages and return codes when they fail. This in turn leads to several issues:
- Difficult to bring new members of automation & release engineering up to speed on projects
- Difficult to attract new developers to work on our stuff (when things fail on first try for unknown reasons, it's discouraging!)
- Difficult for developers to test their software (hard to distinguish problems in their own code from failures in the automation)
- Additional code needs to be written by RelEng to parse logs to determine reasons for failure
In most cases, this isn't actually due to systemic problems within the infrastructure but rather just small "papercuts" where we're not doing some combination of the following:
- Automating setup steps
- Checking preconditions before executing code
- Handling errors when they occur (see the rule of repair: "when you must fail, fail noisily and as soon as possible")
- Giving clear and understandable output and debugging messages (but only when the user asks for it: see the rule of silence)
- Sending out appropriate return codes when programs exit
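As a rough illustration of the last three points, here is a minimal Python sketch. The logger name, the `configure_logging` and `die` helpers, and the `--verbose` flag are all hypothetical names chosen for this example; none of them come from an existing harness.

```python
import argparse
import logging
import sys

log = logging.getLogger("harness")  # hypothetical logger name


def configure_logging(verbose):
    """Rule of silence: only warnings and errors unless the user asks for more."""
    level = logging.DEBUG if verbose else logging.WARNING
    logging.basicConfig(format="%(levelname)s: %(message)s")
    log.setLevel(level)
    return level


def die(message, code=2):
    """Rule of repair: fail noisily, as soon as possible, with a non-zero code."""
    log.error(message)
    sys.exit(code)


def parse_args(argv):
    """Parse the (assumed) command line options for this example."""
    parser = argparse.ArgumentParser(description="Example harness entry point")
    parser.add_argument("-v", "--verbose", action="store_true",
                        help="print debugging output")
    return parser.parse_args(argv)
```

An entry point would call `parse_args(sys.argv[1:])`, pass `args.verbose` to `configure_logging`, and call `die(...)` as soon as something unrecoverable is detected, so that debugging chatter stays hidden by default but failures are never silent.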
The first goal is to make our automation give consistent return codes when things fail. We'll use the existing schema defined by Release Engineering here, which is:
- 0 = success
- 1 = warning
- 2 = failure
- 3 = skipped
- 4 = exception
- 5 = retry
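One way a Python harness could use this schema is sketched below. The `Status` enum and the `run_harness` wrapper are hypothetical names for illustration, and the assumption that a test runner returns its number of failed tests is mine, not a description of how any existing harness works.

```python
import sys
from enum import IntEnum


class Status(IntEnum):
    """Return codes from the Release Engineering schema listed above."""
    SUCCESS = 0
    WARNING = 1
    FAILURE = 2
    SKIPPED = 3
    EXCEPTION = 4
    RETRY = 5


def run_harness(run_tests):
    """Call run_tests (assumed to return the number of failed tests)
    and translate the outcome into a Status."""
    try:
        failures = run_tests()
    except Exception as exc:
        # The harness itself broke: report it, don't pretend tests failed.
        print("harness error: %s" % exc, file=sys.stderr)
        return Status.EXCEPTION
    return Status.FAILURE if failures else Status.SUCCESS
```

A script would then finish with something like `sys.exit(int(run_harness(my_tests)))`, so that callers can tell a test failure (2) apart from a harness crash (4) without parsing logs.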
The goal of this project is simply to correct these problems bug by bug. No systemic problem in our code is causing the issues outlined above: individual problems are usually fixable with small patches. In some cases we may need to refactor a subsystem in our automation to fix a problem, but this should be rare.
As a bonus, these issues often make "good first bugs", which will hopefully get the community excited about hacking on our infrastructure.
Work to be done
Phase 1: Consistent Return Codes
FIXME: Add bug #'s for these
- Mochitest Remote (mobile/fennec)
- Reftest Remote
- Any others?? Mozmill? Peptest?
Phase 2: Better error messages/diagnostics
We will work on this for the remainder of Q4 2011 and Q1 2012. The goal is to get as many issues as possible reported and fixed in that time!
- bug 698561: run_tests.py should err out early if firefox is already running
- bug 695937: pageloader extension/talos should die more gracefully when pageset doesn't exist (potential mentors: wlach, jmaher, jhammel)
- FIXME: Need more of these!
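To give a flavour of what a fix for this kind of bug might look like, here is a minimal sketch of a precondition check for a missing pageset. It is not the actual talos code: `PreconditionError`, `check_pageset`, and the idea that the pageset is described by a single manifest file on disk are assumptions made for the example.

```python
import os


class PreconditionError(Exception):
    """Raised when the environment is not ready for a test run (hypothetical)."""


def check_pageset(manifest_path):
    """Verify the pageset manifest before launching the browser.

    Failing here gives the user a clear message instead of an obscure
    error much later in the run.
    """
    if not os.path.isfile(manifest_path):
        raise PreconditionError(
            "pageset manifest not found: %s" % manifest_path)
    if os.path.getsize(manifest_path) == 0:
        raise PreconditionError(
            "pageset manifest is empty: %s" % manifest_path)
    return manifest_path
```

The caller can catch `PreconditionError` at the top level, print the message, and exit with the appropriate return code, which keeps the check itself free of any exit logic.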
If you fix one of these issues, be sure to flag philor to let him know that certain error messages may change.
Feel free to drop us a line over at the #ateam channel!