ReleaseEngineering/Mozpool/Handling Panda Failures

This page describes the appropriate process to handle panda failures.

Warning: There are two parts to this process: the interim process, and the in-development long-term process. Relops will communicate clearly when the long-term process is put in place.

Which Foopy is a Panda Connected To

See http://hg.mozilla.org/build/tools/file/default/buildfarm/mobile/devices.json

No Builds but Panda in 'free' State

Make sure a build slave is running on the foopy. Example:

$ ssh cltbld@foopy45
$ ps auxwww | grep panda-0169

If there are no results (other than the grep command itself), start up the build slave:

$ /builds/manage_buildslave.sh start panda-0169
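
The check-then-start steps above can be combined into one helper. This is a minimal sketch, not part of the existing tooling: the function name ensure_slave and the PS_CMD and MANAGE override variables are assumptions introduced here for illustration. Only "ps auxwww" and "/builds/manage_buildslave.sh start" come from this page.

```shell
# Hypothetical helper: start a device's build slave only if no process mentions it.
# PS_CMD and MANAGE are illustrative overrides (useful for dry runs), not real settings.
PS_CMD="${PS_CMD:-ps auxwww}"
MANAGE="${MANAGE:-/builds/manage_buildslave.sh}"

ensure_slave() {
    device="$1"
    # Capture the process list first, so the grep below never sees itself in it.
    procs="$($PS_CMD)"
    if printf '%s\n' "$procs" | grep -F -q -- "$device"; then
        echo "$device: build slave process found"
    else
        echo "$device: no build slave process, starting one"
        "$MANAGE" start "$device"
    fi
}
```

Run on the foopy as cltbld, for example "ensure_slave panda-0169". Setting MANAGE=echo beforehand prints the start command instead of running it.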

Failure condition

Nagios State Check failure

Nagios does not ping panda boards directly but instead checks for failure states through mozpool.

Misbehavior In Tests

See #Log Failure In Interim Tracking Bug.

Mozpool failed_ state

See ReleaseEngineering/Mozpool/How To Interpret Device State in Mozpool

Responses

Known Issues and Handling

There are a couple of known issues that require manual intervention.

  1. failed_b2g_pinging: At this time, mozpool cannot tell whether this state was caused by a bad panda, a bad b2g build, or a failed hardware initialization (probably ethernet). To handle this state, find the failed panda in the lifeguard UI and execute a "please_self_test". This should test the board and put it back into a free state. If the board is a repeat offender and keeps ending up in the failed_b2g_pinging state, follow the #Log Failure In Interim Tracking Bug instructions and log it under the Interim Bad Panda bug.

Look for common failure modes in logs

In the BMM or Lifeguard UIs, click the device's "log" link and look at the logs.

  • If you see "Invalid bootconfig", this is probably user error. Scroll up a bit to find "writing pxe_config .. boot config ..", and look at the boot config. Is it well-formed JSON? Is it using the right key name ("b2gbase" for B2G)?
  • If you see "Fetching.." followed by "wget failed:" (wget doesn't log very well, sorry), then the fetch URL is likely wrong. Try accessing it from a mobile-imaging server to check that the proper flows are in place, and verify that the directory exists and contains the appropriate files. Downloads can also fail due to network problems or the like, which is why there are retries.
  • If you see an "Imaging complete" line, and then big fat nothing -- no pinging or anything -- then that's a repeat of bug 817762, which makes pandas sad (literally). Re-open the bug and find Dustin.

If it's not any of these, copy the relevant portion of the logs (from the last "entering state pxe_power_cycling") and proceed to #Log Failure In Interim Tracking Bug.
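
The triage above can be roughed out against a saved copy of a device log. The sketch below assumes the log text has been copied from the UI into a local file. The function names and the labels they print are hypothetical. The matched strings are the ones quoted on this page.

```shell
# classify_log FILE: print a label for the first known failure signature found.
classify_log() {
    log="$1"
    if grep -q 'Invalid bootconfig' "$log"; then
        echo "invalid-bootconfig"
    elif grep -q 'wget failed:' "$log"; then
        echo "fetch-failed"
    elif awk '/Imaging complete/ { seen = 1; after = 0; next }
              seen && NF         { after++ }
              END                { exit !(seen && after == 0) }' "$log"; then
        # "Imaging complete" is the last non-blank line: nothing happened afterwards.
        echo "silent-after-imaging"
    else
        echo "unknown"
    fi
}

# last_attempt FILE: print the log from the last "entering state pxe_power_cycling"
# line onward (or the whole file if that line never appears).
last_attempt() {
    awk '/entering state pxe_power_cycling/ { buf = "" }
         { buf = buf $0 "\n" }
         END { printf "%s", buf }' "$1"
}
```

A result of "unknown" is the case where the output of last_attempt is worth pasting into the tracking bug.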

Log Failure In Interim Tracking Bug

Add a comment to bug bad-panda-log containing as much data as you have about the failure:

  • panda name
  • failure type
  • failure time (copy/paste the nagios alert from IRC, if nagios noticed)
  • what that panda was doing when it failed (if known)

Ack any corresponding nagios alerts with "added to bad panda log".
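
To keep comments in the tracking bug consistent, the four fields listed above can be filled into a fixed template. This is only an illustrative sketch. The function name and the field labels are assumptions, not an established format.

```shell
# bad_panda_comment NAME TYPE TIME [ACTIVITY]: print a comment body for the bug.
# The last argument is optional because the activity is not always known.
bad_panda_comment() {
    printf 'panda name: %s\n' "$1"
    printf 'failure type: %s\n' "$2"
    printf 'failure time: %s\n' "$3"
    printf 'was doing: %s\n' "${4:-unknown}"
}
```

For example, "bad_panda_comment panda-0169 failed_b2g_pinging '<pasted nagios alert>'" prints four lines ready to paste into the bug.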

IT Actions

Jake will visit scl1 periodically and consult bug bad-panda-log, as well as mozpool itself. He will fix whatever's bad, and in the process develop knowledge of the common failure modes and their remediations, leading to a robust long-term process.