ReleaseEngineering/Mozpool/Handling Panda Failures


This page describes the appropriate process to handle panda failures.

Warning: There are two parts to this process: the interim process, and the in-development long-term process. Relops will communicate clearly when the long-term process is put in place.

Which Foopy is a Panda Connected To

See http://hg.mozilla.org/build/tools/file/default/buildfarm/mobile/devices.json
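
A quick way to find a device's entry (and its foopy) from a shell; the raw-file URL follows the standard hgweb pattern, and the panda name and context-line count are just examples:

$ wget -qO- http://hg.mozilla.org/build/tools/raw-file/default/buildfarm/mobile/devices.json \
    | python -m json.tool | grep -A 5 '"panda-0169"'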

No Builds but Panda in 'free' State

Make sure a build slave is running on the foopy. Example:

$ ssh cltbld@foopy45
$ ps auxwww | grep panda-0169

If no results, start up the build slave:

$ /builds/manage_buildslave.sh start panda-0169
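
To confirm the slave came back up, re-run the ps check and tail the slave's twistd log (the log path below assumes the standard per-device directory under /builds; adjust if the slave's basedir differs):

$ ps auxwww | grep panda-0169
$ tail -20 /builds/panda-0169/twistd.log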

Failure conditions

Nagios State Check failure

Nagios does not ping panda boards directly, but instead checks for failure states through mozpool.
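
To see the state mozpool is reporting for a device, you can also query its HTTP API directly; this sketch assumes mozpool's device-state endpoint is available in this form, and the server name is a placeholder for the mobile-imaging server handling the device:

$ curl http://<mobile-imaging-server>/api/device/panda-0169/state/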

Misbehavior In Tests

See #Log Failure In Interim Tracking Bug.

Mozpool failed_* states

failed_power_cycling: "The power-cycle operation itself has failed or timed out multiple times"

  • Explanation: This is typically caused by a relay board failure. See Relay Board Failures

failed_pxe_booting: "While PXE booting, the device repeatedly failed to contact the imaging server from the live image."

  • Explanation: Mozpool successfully power-cycled the relay associated with the panda board, but the panda board did not check in with mozpool within the allotted time.
    • Reasons:
      1. wrong relay power-cycled: This can happen if the system.relay.0 k/v store entry does not match the relay that is actually attached.
      2. u-boot fails to start: This can be caused by a bad preseed image or failed hardware.
      3. u-boot fails to obtain a DHCP IP: Make sure the inventory information is correct. If the MAC address is incorrect or is not in the correct VLAN scope, the DHCP request will fail.
      4. u-boot fails to find the local VLAN TFTP server: DHCP relies on the vendor-class in the DHCP discovery packet to return the proper TFTP server for PXE booting. A mismatched vendor-class can indicate that a different version of u-boot, or a non-preseed image, is installed on the SD card.
      5. u-boot fails to find the PXE config: This can happen if the MAC address is incorrect in inventory. Since PXE configs are generated from the MAC address, this will cause u-boot to load the previously installed OS.
      6. squashfs file failed to download: This can be caused by the file being renamed on the apache server or being misspelled in the pxe_config db entry. If the squashfs fails to download, check the apache error logs and the pxe_config in the mozpool db associated with the failed request for a mismatched squashfs URL (see the sketch after this list).
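
When the squashfs download is the suspect, a quick manual check confirms whether the URL recorded in the pxe_config is actually fetchable; the URL below is a placeholder taken from the failed request's pxe_config, and the apache log path is an assumption that may differ on your imaging server:

$ wget --spider http://<mobile-imaging-server>/<path-from-pxe_config>/<image>.squashfs
$ tail -50 /var/log/httpd/error_log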

failed_mobile_init_started: "While executing mobile-init, the device repeatedly failed to contact the imaging server from the live image."

  • Explanation: The panda board successfully PXE booted into the live environment but failed to continue executing the second-stage script.

failed_sut_verifying: "Could not connect to SUT agent." There is a known bug (bug 836417) causing all sut_verifying checks to fail after a reimage. The workaround is to force the device into the "sut_verifying" state a few minutes after the first failure (see the sketch below).

  • Explanation: Mozpool was unable to connect to a panda running SUTAgent. This may be an indication that SUTAgent has failed to start or has crashed.
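
A minimal sketch of the workaround, assuming mozpool exposes its state-change endpoint in this form (the hostname is a placeholder; the "force state" control in the Lifeguard UI accomplishes the same thing):

$ curl -X POST http://<mobile-imaging-server>/api/device/panda-0169/state-change/failed_sut_verifying/to/sut_verifying/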

failed_android_downloading: "While installing Android, the device timed out repeatedly while downloading Android"

  • Explanation: One or more android artifacts failed to download during the second stage script.
    • Reasons:
      1. URL in the android second stage does not match the location of the android artifacts: Check the generated URL in the second-stage android script and the apache logs of the mobile-imaging server local to the panda in question (see the sketch after this list).
      2. partition formatting failed: This can be caused by a bad preseed image or failed panda hardware.
      3. partition failed to mount: Formatting succeeded but failed to mount filesystems. This can be caused by a faulty SD card or a bad preseed image.
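
To check whether the panda ever requested the artifacts, grep the apache access log on that imaging server for the device's IP address; the log path is an assumption and may differ depending on the apache configuration:

$ grep <panda-ip-address> /var/log/httpd/access_log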

Responses

Look for common failure modes in logs

In the BMM or Lifeguard UIs, click the device's "log" link and look at the logs.

  • If you see "Invalid bootconfig", this is probably user error. Scroll up a bit to find "writing pxe_config .. boot config ..", and look at the boot config. Is it well-formed JSON? Is it using the right key name ("b2gbase" for B2G)?
  • If you see "Fetching.." followed by "wget failed:" (wget doesn't log very well, sorry), then likely the fetch URL is wrong. Try accessing it from a mobile-imaging server to check that the proper flows are in place, and verify that the directory exists and contains the appropriate files. It's certainly possible that downloads will fail due to network problems or the like, which is why there are retries.
  • If you see an "Imaging complete" line, and then big fat nothing -- no pinging or anything -- then that's a repeat of bug 817762, which makes pandas sad (literally). Re-open the bug and find Dustin.

If it's not any of these, copy the relevant portion of the logs (from the last "entering state pxe_power_cycling") and proceed to #Log Failure In Interim Tracking Bug.

Log Failure In Interim Tracking Bug

Add a comment to bug bad-panda-log containing as much data as you have about the failure:

  • panda name
  • failure type
  • failure time (copy/paste the nagios alert from IRC, if nagios noticed)
  • what that panda was doing when it failed (if known)

Ack any corresponding nagios alerts with "added to bad panda log".

IT Actions

Jake will visit scl1 periodically and consult bug bad-panda-log, as well as mozpool itself. He will fix whatever's bad, and in the process develop knowledge of the common failure modes and their remediations, leading to a robust long-term process.