CIDuty/How To/Troubleshoot Hardware

About

Often we need to troubleshoot the hardware workers for various reasons:

  • falling off the network
  • machine has shut down
  • generic worker/OCC fails to start or is not running
  • hardware failure
  • not picking up tasks

If you notice any releng-hardware workers missing or not picking up tasks, escalate to ciduty in #ci.

Monitoring

To find the workers with issues, we use the following tools:

  • Grafana
  • Taskcluster Worker Checker
  • Nagios MDC1 and MDC2
  • Taskcluster

Logs

Sometimes a log can give us a better overview of the server or machine in question. For checking logs we use Papertrail.

Workers

Windows 10

When a Windows machine needs attention, the best place to start is its logs. If the logs aren't showing the worker as ready for tasks, reboot it from the worker explorer. If that does not help, there are two ways to fix this:

  • connect to it via iLO using the username and password that can be found under releng GPG private/passwords and click on Power Switch > Cold Boot
  • reimage it

Sometimes a reboot won't do the trick, and in this case the machine needs to be re-imaged.
Following the moonshot spreadsheet, re-image the machine through the HP iLO Integrated Remote Console. Click Here to learn how.
Be sure to follow the process until it completes, then check the worker explorer again to see if the machine is picking up tasks.

Linux 64

When a Linux machine needs attention, the best place to start is its logs. If a Linux worker has stopped picking up tasks, there are four ways to fix this:

  • connect to it via ssh using the username and password that can be found under releng GPG private/passwords and run the following command: reboot
  • connect to it via iLO using the username and password that can be found under releng GPG private/passwords and click on Power Switch > Cold Boot
  • go to worker explorer and reboot it via Roller
  • reimage the machine

Click Here to learn how to reboot/ping via Roller. Machines usually recover after a reboot; if not, re-image them following the moonshot spreadsheet and this page. When the procedure finishes, you should get a puppet e-mail about the re-imaged worker. Remember to check back on it in the worker explorer to see if it's picking up tasks again.
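
The recovery options for a Linux worker can be treated as an escalation list. The sketch below is only an illustration: the step labels and the next_step helper are hypothetical names, and the ordering simply follows the list above, with re-imaging kept as the last resort as the text describes.

```python
# Recovery options for a Linux worker, in the order listed on this page.
# The labels are illustrative only; they are not names of real tools or APIs.
RECOVERY_STEPS = [
    "ssh_reboot",      # ssh to the machine and run `reboot`
    "ilo_cold_boot",   # iLO: Power Switch > Cold Boot
    "roller_reboot",   # worker explorer: reboot via Roller
    "reimage",         # last resort, following the moonshot spreadsheet
]


def next_step(already_tried):
    """Return the first recovery step not yet tried, or None if all were tried."""
    for step in RECOVERY_STEPS:
        if step not in already_tried:
            return step
    return None
```

For example, next_step(["ssh_reboot"]) suggests the iLO cold boot next, and a return value of None means every option has been exhausted and the worker should be escalated.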

OSX 10.10

If an OSX machine stops taking tasks, there are two ways to reboot it:

  • connect to it via ssh using the username and password that can be found under releng GPG private/passwords and run the following command: reboot
  • go to worker explorer and reboot it via Roller

Click Here to learn how to reboot/ping via Roller. Most of the time this recovers the worker. Otherwise, re-image it following this.

Troubleshooting

Worker Actual Status

Sometimes while checking logs you may not find the machine. This mostly happens when the machine has been offline for many days or has been taken down (maintenance, hardware issues, etc.).
To learn more about a machine (whether it is loaned, has hardware issues, etc.), look it up in the Moonshot Inventory and/or in the node definition (here we have searched for T-W1064-MS-072). If you don't find enough information there, check Bugzilla using the following keywords: ALL machine_name.
You can also check the actual status of the machine Here.

How to add/define a worker if it is missing from Taskcluster

If we cannot ssh into OSX nodes, we can try to restart them from Taskcluster.
If they are not visible in the Taskcluster worker explorer, you can create them using this version of the quarantine script, which will add/define a worker if it is missing.
After setting up the taskcluster CLI and the script, run a command like the following:
    python quarantine_tc.py --enable -p releng-hardware -w gecko-t-osx-1010 -g mdc2 t-yosemite-r7-449
After the steps above, the worker explorer will show the machine and you can reboot it from there using Roller.
If the issue is not fixed (the machine does not take jobs and SSH is still not working), create a bug for RelOps to physically reboot and reimage/netboot the machine.
If the restart fails, the Automatic Bug Generator will create a bug for RelOps.
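
When several workers are missing, it can help to generate the quarantine script invocation rather than type it by hand. The following is a minimal sketch that only assembles the command line shown in this section. The helper name build_quarantine_command is hypothetical, and reading -p, -w and -g as provisioner, worker type and worker group is an assumption based on the example values, not on the script's documentation.

```python
import shlex


def build_quarantine_command(provisioner, worker_type, group, worker_id):
    """Assemble the quarantine_tc.py invocation as an argument list.

    Assumption: -p is the provisioner, -w the worker type and -g the worker
    group, inferred from the example command on this page.
    """
    return [
        "python", "quarantine_tc.py", "--enable",
        "-p", provisioner,
        "-w", worker_type,
        "-g", group,
        worker_id,
    ]


cmd = build_quarantine_command(
    "releng-hardware", "gecko-t-osx-1010", "mdc2", "t-yosemite-r7-449"
)
# Print the command so it can be reviewed before running it manually.
print(shlex.join(cmd))
```

The argument list can also be passed to subprocess.run once the taskcluster CLI and the script are set up; printing it first keeps the step reviewable.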

No video on all cartridges from a chassis

If we see any connection problems to iLO, we can try the `reset cm` command to reset the iLO manager.

  • Connect to the moon-chassis over SSH
  • Run the following command: reset cm

For more details, check Bug 1504942.

SSH not working

  • Check the Papertrail logs
  • Reboot it from Taskcluster. It may have old auth keys or may not have completed re-imaging
  • Create a tracking bug or update the existing one.
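
Before working through the steps above, it can be useful to confirm whether the SSH port is reachable at all, which separates a network or power problem from an authentication problem. This is a minimal sketch using only the Python standard library. The function name ssh_port_open is hypothetical, and the assumption that workers listen on the default port 22 is not confirmed by this page.

```python
import socket


def ssh_port_open(host, port=22, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within timeout.

    This only checks reachability of the port; it does not attempt to log in,
    so stale auth keys will not be detected here.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False
```

Call it with the worker's hostname. If it returns False, the machine is likely down or off the network, so a reboot from Taskcluster or a bug for RelOps is the next step. If it returns True but login still fails, old auth keys or an incomplete re-image are more likely.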