ReleaseEngineering/Mozpool/How To Interpret Device State in Mozpool

From MozillaWiki

If you look at the Lifeguard or BMM UIs in Mozpool, you'll see that each device has a state. This page explains what those states mean.

Note: This is not an exhaustive list of states. For that, see devicemachine.py.

Normal States

These are states that a panda will normally stay in for a while.

new

Newly-added pandas show up in this state. Mozpool will not allocate these. They will automatically be self-tested, and either enter the 'free' state or be marked failed.

ready

The device is functional, but currently not allocated to any request. Devices in this state may be allocated by Mozpool at any time.

maintenance_mode

The device is in maintenance mode: live-booted to the Ubuntu image and awaiting login via SSH. You can re-image or power-cycle it from this state when you're done with your maintenance.

locked_out

The device is in use by other automation (most commonly relay.py in SUTtools) and should remain untouched by Mozpool. The "please" events don't work in this state -- the device must be forced back to one of the other normal states to get Mozpool to care about it again.

troubleshooting

The device is non-functional and is being worked on. This state is similar to locked_out but allows "please" events -- the device can only be forced into and out of this state.

Action States

When lifeguard is doing things to a device, it cycles the device through a number of states. Devices don't stay in these states for very long. Lifeguard has timeouts for each state, and will only retry each state a configured number of times before it decides the device has failed (see the failure states, below).

If a device is in an action state, the correct behavior on your part is to wait patiently until it gets to a normal or failed state. Lifeguard's still doing its thing, and you'll only get in the way. Don't make the lifeguard angry.
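The timeout-and-retry behavior described above can be sketched roughly as follows. This is an illustration only, not Lifeguard's actual code: the class name ActionState, the max_attempts parameter and the on_timeout method are all hypothetical, and the real logic lives in devicemachine.py.

```python
class ActionState:
    """Hypothetical model of one action state with a retry limit."""

    def __init__(self, name, max_attempts):
        self.name = name                  # e.g. "pxe_booting"
        self.max_attempts = max_attempts  # assumed per-state retry limit
        self.attempts = 1                 # entering the state is the first attempt

    def on_timeout(self):
        """Return the name of the next state after this state times out."""
        if self.attempts < self.max_attempts:
            self.attempts += 1
            return self.name              # retry the same action
        return "failed_" + self.name      # retries exhausted: device has failed


state = ActionState("pxe_booting", max_attempts=3)
history = [state.on_timeout() for _ in range(3)]
print(history)
```

With a limit of three attempts, the first two timeouts retry the action and the third moves the device to a failed state.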

The details of the action states will change as we develop the tool, but there's a general pattern:

Lifeguard operations begin with one of two actions: either cycle the power and boot from the sdcard (states with a pc_ prefix), or cycle the power with a PXE config in place (states with a pxe_ prefix). Depending on the PXE config selected, the latter moves into purpose-specific states. These have prefixes like android_, b2g_ or maintenance. For installs, the suffixes roughly track the progress of the install:

  • download the binaries
  • extract them onto the sdcard
  • reboot
  • come up with an active network connection

You can see these state transitions in the second-stage scripts at http://hg.mozilla.org/build/puppet/file/tip/modules/bmm/templates/.
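As a rough aid for reading action-state names, the prefixes mentioned above can be mapped to what Lifeguard is likely doing. This is a sketch under assumptions: the function name describe_action_state and the description strings are made up for illustration, and it only applies to action states (the normal state maintenance_mode also starts with "maintenance").

```python
def describe_action_state(state):
    """Map an action-state name to a rough description, using only its prefix."""
    prefixes = [
        ("pc_", "power-cycling and booting from the sdcard"),
        ("pxe_", "power-cycling with a PXE config in place"),
        ("android_", "installing Android"),
        ("b2g_", "installing B2G"),
        ("maintenance", "maintenance-related"),
    ]
    for prefix, description in prefixes:
        if state.startswith(prefix):
            return description
    # Prefixes change as the tool develops; devicemachine.py is authoritative.
    return "unknown (check devicemachine.py)"


for name in ("pxe_booting", "android_downloading", "b2g_extracting"):
    print(name, "->", describe_action_state(name))
```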

Failed States

Sometimes pandas go bad. Bad panda! When that happens, Lifeguard will generally detect it by not seeing the expected behavior, and will assign the device to a failed_ state. Many of these states are named by adding "failed_" to the name of the action state in which the device failed repeatedly.

When pandas fail, it's not always a hardware problem. See Handling Panda Failures for the process to follow; this section just describes the states.
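Putting the three groups of states together, a small triage helper might look like the sketch below. The names NORMAL_STATES, ADVICE and triage are hypothetical. The set of normal states is copied from this page and is not exhaustive, so treating every other non-failed state as an action state is an assumption.

```python
# Normal states as listed on this page (not exhaustive; see devicemachine.py).
NORMAL_STATES = {"new", "ready", "maintenance_mode", "locked_out", "troubleshooting"}

ADVICE = {
    "normal": "nothing to do",
    "action": "wait for Lifeguard to finish",
    "failed": "follow the Handling Panda Failures process",
}


def triage(state):
    """Classify a device state name as 'normal', 'failed' or 'action'."""
    if state in NORMAL_STATES:
        return "normal"
    if state.startswith("failed_"):
        return "failed"
    # Assumption: anything else is a short-lived action state.
    return "action"


for name in ("ready", "b2g_extracting", "failed_pxe_booting"):
    print(name, "->", ADVICE[triage(name)])
```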

failed_power_cycling

"The power-cycle operation itself has failed or timed out multiple times"

  • Explanation: This is typically caused by a relay board failure. See Relay Board Failures.

failed_pxe_booting

"While PXE booting, the device repeatedly failed to contact the imaging server from the live image."

  • Explanation: Mozpool successfully power-cycled the relay associated with the panda board, but the panda board did not check in with Mozpool within the allotted time.
    • Reasons:
      1. wrong relay power-cycled: This can happen if the system.relay.0 k/v store does not match the relay that is actually attached.
      2. u-boot fails to start: This can be caused by a bad preseed image or failed hardware.
      3. u-boot fails to obtain dhcp IP: Make sure the inventory information is correct. If the mac address is incorrect or is not in the correct vlan scope, the dhcp request will fail.
      4. u-boot fails to find the local vlan tftp server: dhcp relies on the vendor-class within the dhcp discovery packet to return the proper tftp server for pxe booting. A mismatched vendor-class can indicate that a different version of u-boot or a non-preseed image is installed on the SD card.
      5. u-boot fails to find pxe config: This can happen if the mac address is incorrect in inventory. Since pxe configs are generated from the mac address, this will cause u-boot to load the previously installed OS.
      6. squashfs file failed to download: This can be caused by the file being renamed on the apache server or being misspelled in the pxe_config db entry. If the squashfs fails to download, check the apache error logs and the pxe-config in the mozpool db associated with the failed request for a mismatched squashfs URL.
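For reason 5, it can help to work out which config file name a given mac address maps to and compare it with what is on the tftp server. The sketch below uses the common PXELINUX naming convention ("01-" followed by the lowercase mac address with dashes). Whether Mozpool names its pxe configs exactly this way is an assumption, and the function name pxe_config_name is hypothetical.

```python
import re


def pxe_config_name(mac):
    """Return the PXELINUX-style config file name for a 48-bit mac address."""
    # Drop separators such as ":" or "-" and keep only the hex digits.
    digits = re.sub(r"[^0-9A-Fa-f]", "", mac)
    if len(digits) != 12:
        raise ValueError("not a 48-bit mac address: %r" % mac)
    octets = [digits[i:i + 2].lower() for i in range(0, 12, 2)]
    # "01" is the ARP hardware type for ethernet in the PXELINUX convention.
    return "01-" + "-".join(octets)


print(pxe_config_name("AA:BB:CC:00:11:22"))
```

If the name derived from the inventory mac address does not match any config file on the tftp server, the inventory entry is a likely suspect.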

failed_mobile_init_started

"While executing mobile-init, the device repeatedly failed to contact the imaging server from the live image."

  • Explanation: The panda board successfully pxe booted into the live environment but failed to continue executing a second stage script.

failed_sut_verifying

"Could not connect to SUT agent." There is a known bug (bug 836417) that causes all sut_verifying checks to fail after a reimage. See #Known Issues and Handling.

  • Explanation: Mozpool was unable to connect to SUTAgent on the panda. This may indicate that SUTAgent has failed to start or has crashed.
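A quick way to reproduce this check by hand is to try a plain TCP connection to the agent. The sketch below is not the check Mozpool itself performs, and the function name sut_agent_reachable is hypothetical. The port is left as a parameter because the right value depends on how SUTAgent is configured; 20701 is commonly used as its command port, but treat that as an assumption.

```python
import socket


def sut_agent_reachable(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name-resolution failures.
        return False


# Example (hypothetical hostname, assumed port):
# sut_agent_reachable("panda-0001.example.com", 20701)
```

A False result only shows that nothing accepted the connection. It does not distinguish a crashed SUTAgent from a panda that never booted.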

failed_android_downloading

"While installing Android, the device timed out repeatedly while downloading Android"

  • Explanation: One or more android artifacts failed to download during the second stage script.
    • Reasons:
      1. URL in the android second stage does not match the location of the android artifacts: Check the generated URL in the second stage android script and the apache logs of the mobile-imaging server local to the panda in question.
      2. partition formatting failed: This can be caused by a bad preseed image or failed panda hardware.
      3. partition failed to mount: Formatting succeeded but failed to mount filesystems. This can be caused by a faulty SD card or a bad preseed image.

failed_android_extracting

"While installing Android, the device timed out repeatedly while extracting Android"

  • Explanation: The android artifacts were successfully downloaded, but one or more failed to extract.
    • Reasons:
      1. artifact corrupted: If one of the android artifact tarballs is corrupt, extraction will fail.
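To rule out a corrupt artifact, you can read a copy of the tarball end to end on another machine. The sketch below uses Python's standard tarfile module. The function name tarball_is_readable is hypothetical, and a successful read-through only shows the archive can be unpacked, not that its contents match a known-good build.

```python
import tarfile
import zlib


def tarball_is_readable(path):
    """Return True if every regular file in the tarball can be read to the end."""
    try:
        with tarfile.open(path) as archive:      # compression is auto-detected
            for member in archive:
                if member.isfile():
                    fileobj = archive.extractfile(member)
                    # Read in chunks so large artifacts don't fill memory.
                    while fileobj.read(1024 * 1024):
                        pass
        return True
    except (tarfile.TarError, OSError, EOFError, zlib.error):
        # Truncated, corrupt or missing archives end up here.
        return False


# Example (hypothetical file name):
# tarball_is_readable("system.tar.bz2")
```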

failed_b2g_downloading

"While installing B2G, the device timed out repeatedly while downloading B2G"

  • Explanation: One or more B2G artifacts failed to download during the second stage script.
    • Reasons:
      1. Artifact URL provided does not match the location of the B2G artifacts: Check the URL and the apache logs of the server hosting the B2G artifacts.
      2. partition formatting failed: This can be caused by a bad preseed image or failed panda hardware.
      3. partition failed to mount: Formatting succeeded but failed to mount filesystems. This can be caused by a faulty SD card or a bad preseed image.

failed_b2g_extracting

"While installing B2G, the device timed out repeatedly while extracting B2G"

  • Explanation: The B2G artifacts were successfully downloaded, but one or more failed to extract.
    • Reasons:
      1. artifact corrupted: If one of the B2G artifact tarballs is corrupt, extraction will fail.

failed_b2g_pinging

"While installing B2G, the device timed out repeatedly while pinging the new image waiting for it to come up"

  • Explanation: After the B2G installation, the device is power-cycled, given time to boot and then pinged. This failure state is reached after repeated ping attempts fail.
    • Reasons:
      1. bad image: If a B2G image installs correctly but fails to boot, this can simply be a bad build.
      2. ethernet driver fails to initialize: Once in a while, the ethernet device on the panda board fails to initialize.