ReleaseEngineering/Mozpool/Handling Panda Failures/Long-Term Process: Difference between revisions

From MozillaWiki
(Created page with " The first level of diagnosis for a panda is which failed state lifeguard has chosen for the panda. The next level is usually to look at the logs in lifeguard. ... Check the ...")
 
No edit summary
 
= Handling Panda Failures =


Failing pandas are passed between three groups.


The first level of diagnosis for a panda is which failed state lifeguard has chosen for the panda.  The next level is usually to look at the logs in lifeguard.
== Release Engineering ==
* Check for causes in release engineering automation
* Hand to relops using <<bug process>>


...
== Release Operations ==
* Look for new problems
* Look for and work around known but unfixed failure states (e.g., sut_verify)
* Hand off to DCOps <<bug process>>


Check the logs for anything from the board, tagged "syslog".  If you see that, then the board is booting and has network.  Otherwise, you'll need to investigate starting at the beginning:
== DC Operations ==
* check for power
** inspect green lights on the panda board
** if no lights, unplug power cable and check with a voltmeter.  Insert the positive probe inside the barrel plug and touch the negative probe to the outside of the barrel plug.  This should read approx 5 volts.
*** <b> DO NOT LET BARREL PLUG TOUCH THE CHASSIS OR OTHER PARTS.</b> This will cause a short and blow the fuse.
*** if no voltage is present, check fuse.  If fuse is blown, remove fuse and file a bug with relops


* does the board have power? (blinkenlights)
* check cat5 cables
* does the board have link? (lights on the switch)
** inspect internal and external cat5 cables (use fluke)
* does the board have an sdcard?
** make sure cables are securely inserted into RJ45 jacks
* does the board become pingable even briefly if you force a power-cycle from the BMM UI?
** if so, then power, link, and sdcard are all working at least a little bit - try a new sdcard


after all of that, if you haven't found the problem, then it's time for some serial diagnostics.
* check SD Card
** If power and cat5 do not have any obvious problems, proceed with replacing the SD Card
*** the replacement SD Card should be new and have a fresh preseed image installed
*** make sure SD Card is securely inserted
*** Deliver used SD Cards to Relops for testing or decommission


... failed_*_downloading
If all of the above has been performed and the panda board still shows problems, reopen the tracking bug to have DCOPS replace panda board.  The failed panda should be given to relops for further diagnostics or decommissioning.
 
Most of the time, this will be either a corrupt or dead sdcard.  Try swapping in another card.
 
Let's also try re-writing the u-boot image to the card, in case it was corrupted, but marking the sdcard somehow.  If it turns out that the u-boot image gets corrupted sometimes, but re-writing it fixes that, then we can avoid trashing a lot of good sdcards.  If it never helps, delete this paragraph.

Latest revision as of 17:24, 12 March 2013

Handling Panda Failures

Failing pandas are passed between three groups.

Release Engineering

  • Check for causes in release engineering automation
  • Hand to relops using <<bug process>>

Release Operations

  • Look for new problems
  • Look for and work around known but unfixed failure states (e.g., sut_verify)
  • Hand off to DCOps <<bug process>>

DC Operations

  • check for power
    • inspect green lights on the panda board
    • if no lights, unplug power cable and check with a voltmeter. Insert the positive probe inside the barrel plug and touch the negative probe to the outside of the barrel plug. This should read approx 5 volts.
      • DO NOT LET BARREL PLUG TOUCH THE CHASSIS OR OTHER PARTS. This will cause a short and blow the fuse.
      • if no voltage is present, check fuse. If fuse is blown, remove fuse and file a bug with relops
  • check cat5 cables
    • inspect internal and external cat5 cables (use fluke)
    • make sure cables are securely inserted into RJ45 jacks
  • check SD Card
    • If power and cat5 do not have any obvious problems, proceed with replacing the SD Card
      • the replacement SD Card should be new and have a fresh preseed image installed
      • make sure SD Card is securely inserted
      • Deliver used SD Cards to Relops for testing or decommission

If all of the above has been performed and the panda board still shows problems, reopen the tracking bug to have DCOPS replace panda board. The failed panda should be given to relops for further diagnostics or decommissioning.
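
The DC Operations checklist above is an ordered decision procedure, so it can be sketched as a small function that returns the next step for a given set of observations. This is only an illustrative sketch, not part of Mozpool or lifeguard: the `Observations` class, the `next_step` function, and the 0.5 V tolerance are all hypothetical, and the page itself only says "approx 5 volts". Where the checklist is silent (for example, no lights but good voltage, or no voltage with an intact fuse), the sketch simply falls through to the remaining checks, which is an assumption.

```python
from dataclasses import dataclass
from typing import Optional

NOMINAL_VOLTS = 5.0   # "approx 5 volts" from the checklist
VOLT_TOLERANCE = 0.5  # assumption: the page does not give a tolerance


@dataclass
class Observations:
    """What the technician has seen so far (hypothetical structure)."""
    lights_on: bool                       # green lights on the panda board
    barrel_volts: Optional[float] = None  # voltmeter reading, None if not taken
    fuse_blown: Optional[bool] = None     # None if the fuse was not inspected
    cables_ok: Optional[bool] = None      # None if cat5 was not inspected
    sdcard_replaced: bool = False         # new card with fresh preseed image


def next_step(obs: Observations) -> str:
    """Return the next checklist action for the given observations."""
    # 1. Power: lights first, then voltage at the barrel plug, then the fuse.
    if not obs.lights_on:
        if obs.barrel_volts is None:
            return "measure barrel plug voltage"
        if abs(obs.barrel_volts - NOMINAL_VOLTS) > VOLT_TOLERANCE:
            if obs.fuse_blown is None:
                return "check fuse"
            if obs.fuse_blown:
                return "remove fuse and file a bug with relops"
        # Assumption: cases the checklist does not cover fall through.
    # 2. Network: internal and external cat5 cables and RJ45 seating.
    if obs.cables_ok is None:
        return "inspect cat5 cables"
    if not obs.cables_ok:
        return "reseat cat5 cables"
    # 3. Storage: only after power and cat5 show no obvious problems.
    if not obs.sdcard_replaced:
        return "replace SD card"
    # 4. Everything above has been done and the board still fails.
    return "reopen tracking bug for board replacement"
```

For example, `next_step(Observations(lights_on=True, cables_ok=True))` returns `"replace SD card"`, matching the rule that the SD card is swapped only once power and cabling look fine.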