ReleaseEngineering/Mozpool/Handling Panda Failures
This page describes the appropriate process to handle panda failures.
Which Foopy is a Panda Connected To
See http://hg.mozilla.org/build/tools/file/default/buildfarm/mobile/devices.json
No Builds but Panda in 'free' State
Make sure a build slave is running on the foopy. Example:
$ ssh cltbld@foopy45
$ ps auxwww | grep panda-0169
If no results, start up the build slave:
$ /builds/manage_buildslave.sh start panda-0169
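The check-then-start steps above can be combined into a small helper. This is only a sketch: the function names `slave_running` and `ensure_slave` are hypothetical, and it assumes (as the `grep` above does) that the panda name appears in the build slave's command line.

```shell
#!/bin/sh
# Hypothetical helpers; neither function is part of the foopy tooling.

# slave_running PANDA: reads `ps auxwww` output on stdin and succeeds if a
# process mentioning PANDA is listed. Lines containing "grep" are dropped so
# the search itself is not counted, and -w keeps panda-0169 from matching
# a longer name such as panda-01690.
slave_running() {
    grep -v 'grep' | grep -qw -- "$1"
}

# ensure_slave PANDA: start the build slave only if none is running.
# The script path is the one used in the command above.
ensure_slave() {
    if ps auxwww | slave_running "$1"; then
        echo "build slave for $1 is already running"
    else
        /builds/manage_buildslave.sh start "$1"
    fi
}
```

After sourcing the file on the foopy, `ensure_slave panda-0169` would perform both steps in one go.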
Failure condition
Nagios State Check failure
Nagios does not ping panda boards directly; instead, it checks for failure states through mozpool.
Misbehavior In Tests
See #Log Failure In Interim Tracking Bug.
Mozpool failed_ state
See ReleaseEngineering/Mozpool/How To Interpret Device State in Mozpool
Responses
Known Issues and Handling
There are a couple of known issues that require manual intervention.
- failed_b2g_pinging: At this time mozpool cannot detect whether this state was caused by a bad panda, a bad b2g build, or a failed hardware (probably ethernet) initialization. To handle this state, find the failed panda in the lifeguard UI and execute a "please_self_test". This should test the board and put it back into a free state. If the board is a repeat offender and keeps ending up in a failed_b2g_pinging state, follow the #Log Failure In Interim Tracking Bug instructions and log it under the Interim Bad Panda bug.
Look for common failure modes in logs
In the BMM or Lifeguard UIs, click the device's "log" link and look at the logs.
- If you see "Invalid bootconfig", this is probably user error. Scroll up a bit to find "writing pxe_config .. boot config ..", and look at the boot config. Is it well-formed JSON? Is it using the right key name ("b2gbase" for B2G)?
- If you see "Fetching.." followed by "wget failed:" (wget doesn't log very well, sorry), then the fetch URL is likely wrong. Try accessing it from a mobile-imaging server to check that the proper flows are in place, and verify that the directory exists and contains the appropriate files. Downloads can also fail due to network problems or the like, which is why there are retries.
- If you see an "Imaging complete" line, and then big fat nothing -- no pinging or anything -- then that's a repeat of bug 817762, which makes pandas sad (literally). Re-open the bug and find Dustin.
If it's not any of these, copy the relevant portion of the logs (from the last "entering state pxe_power_cycling") and proceed to #Log Failure In Interim Tracking Bug.
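If you have saved a device's log to a file, the checks above can be roughed out in shell. This is a sketch under assumptions: the function names and output labels are hypothetical, the log is assumed to be plain text with one message per line, and "no pinging" is approximated as "no later line mentions ping".

```shell
#!/bin/sh
# Hypothetical helpers for a device log saved to a local file.

# last_attempt FILE: print the log from the last
# "entering state pxe_power_cycling" line onward (the portion to paste into
# the bug). If the line never appears, the whole file is printed.
last_attempt() {
    awk '/entering state pxe_power_cycling/ { buf = "" }
         { buf = buf $0 "\n" }
         END { printf "%s", buf }' "$1"
}

# classify_log FILE: print a rough label for the failure modes listed above.
classify_log() {
    if grep -q 'Invalid bootconfig' "$1"; then
        echo "bad-bootconfig"
    elif grep -q 'wget failed:' "$1"; then
        echo "fetch-failed"
    elif awk '/Imaging complete/ { seen = 1; pinged = 0; next }
              /ping/ { pinged = 1 }
              END { exit !(seen && !pinged) }' "$1"; then
        # Imaging finished, but nothing afterwards mentions pinging.
        echo "silent-after-imaging"
    else
        echo "unknown"
    fi
}
```

Running `last_attempt` first and classifying its output keeps messages from earlier, unrelated attempts from skewing the result. An "unknown" label means the log needs to go into the tracking bug.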
Log Failure In Interim Tracking Bug
Add a comment to bug bad-panda-log containing as much data as you have about the failure:
- panda name
- failure type
- failure time (copy/paste the nagios alert from IRC, if nagios noticed)
- what that panda was doing when it failed (if known)
Ack any corresponding nagios alerts with "added to bad panda log".
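To keep comments in the tracking bug consistent, a small template helper can print the fields listed above. This is only a sketch; the function name and field labels are hypothetical, and the output is meant to be pasted into the bug by hand.

```shell
#!/bin/sh
# bad_panda_comment PANDA TYPE TIME [ACTIVITY]
# Hypothetical helper: prints a comment body with the fields listed above.
# TIME can be the nagios alert pasted from IRC; ACTIVITY defaults to "unknown".
bad_panda_comment() {
    printf 'panda name: %s\n' "$1"
    printf 'failure type: %s\n' "$2"
    printf 'failure time: %s\n' "$3"
    printf 'activity at failure: %s\n' "${4:-unknown}"
}
```

For example, `bad_panda_comment panda-0169 failed_b2g_pinging "<nagios alert text>"` prints four labelled lines, with the activity recorded as unknown.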
IT Actions
Jake will visit scl1 periodically and consult bug bad-panda-log, as well as mozpool itself. He will fix whatever's bad, and in the process develop knowledge of the common failure modes and their remediations, leading to a robust long-term process.