CIDuty/How To/Troubleshoot Hardware: Difference between revisions

From MozillaWiki
(Added content to the page)
 
m (minor update)
 
(6 intermediate revisions by 2 users not shown)
Line 1: Line 1:
===== About =====
Often we need to troubleshoot the hardware workers for various reasons:
* falling off the network
* machine has shut down
* generic worker/OCC fail to start or not running
* hardware failure
Line 7: Line 11:
If you notice any releng-hardware workers missing or not picking tasks, escalate to ciduty in #ci.


= Windows 10 =
===== Monitoring =====
 
To find the workers with issues, we use the following tools:
 
'''[https://grafana.relops.mozops.net/d/3PORALriz/workers?orgId=1&from=now-3h&to=now&refresh=1m&var-provisioner=releng-hardware&var-workerType=All Grafana]'''<br />
'''[https://github.com/Akhliskun/taskcluster-worker-checker Taskcluster Worker Checker]'''<br />
'''Nagios [https://nagios1.private.releng.mdc1.mozilla.com MDC1] and [https://nagios1.private.releng.mdc1.mozilla.com MDC2]'''<br />
'''[https://tools.taskcluster.net/provisioners/ Taskcluster]'''<br />
 
===== Logs =====
 
Sometimes a log can give us a better overview of the server or machine in question.
For checking logs we use '''[https://papertrailapp.com/ Papertrail]'''.
 
==== Workers ====
 
===== Windows 10 =====
 
When a Windows machine needs attention, the best place to start is its [https://papertrailapp.com/ logs].
If the logs don't show the worker as ready for tasks, reboot it from the [https://tools.taskcluster.net/provisioners/releng-hardware/worker-types/gecko-t-win10-64-hw worker explorer].
Sometimes a reboot won't do the trick and in this case machine needs to be re-imaged. Following the [https://docs.google.com/spreadsheets/d/1IPTmppvqDw0PQV-O1LgXLJg_7TC-H_IAAnSxcur8c7I/edit?ts=5ad7748a#gid=562893333 moonshot spreadsheet], re-image the machine through the HP iLO Integrated Remote Console. Be sure to follow the process until it completes and check back on it in the worker explorer to see if it's picking tasks again.
Below are two ways to fix this:
* connect to it via iLO using the ''username and password'' that can be found under releng GPG ''private/passwords'' and click on Power Switch > Cold Boot
* reimage it
Sometimes a reboot won't do the trick, and in this case the machine needs to be re-imaged. <br />
Following the [https://docs.google.com/spreadsheets/d/1IPTmppvqDw0PQV-O1LgXLJg_7TC-H_IAAnSxcur8c7I/edit?ts=5ad7748a#gid=562893333 moonshot spreadsheet], re-image the machine through the HP iLO Integrated Remote Console; click [[CIDuty/How_To/Reimage_Windows_Workers|'''Here''']] to learn how. <br />
Be sure to follow the process until it completes, then check back in the [https://tools.taskcluster.net/provisioners/releng-hardware/worker-types/gecko-t-win10-64-hw worker explorer] to see if it's picking tasks again.
 
===== Linux 64 =====


= Linux 64 =
When a Linux machine needs attention, the best place to start is its [https://papertrailapp.com/ logs].
If a linux worker stopped picking tasks, reboot it from the [https://tools.taskcluster.net/provisioners/releng-hardware/worker-types/gecko-t-linux-talos worker explorer]. Machines usually recover from this, if not re-image them following the [https://docs.google.com/spreadsheets/d/1IPTmppvqDw0PQV-O1LgXLJg_7TC-H_IAAnSxcur8c7I/edit?ts=5ad7748a#gid=562893333 moonshot spreadsheet] and [https://mana.mozilla.org/wiki/display/DC/How+To+Reimage+Releng+HP+Moonshot+Linux+Machines these steps]. You should be getting a puppet e-mail regarding the re-imaged worker.
If a Linux worker has stopped picking tasks, there are four ways to fix this:
* connect to it via SSH using the ''username and password'' that can be found under releng GPG ''private/passwords'' and run the following command: <code>reboot</code>
* connect to it via iLO using the ''username and password'' that can be found under releng GPG ''private/passwords'' and click on Power Switch > Cold Boot
* go to the [https://tools.taskcluster.net/provisioners/releng-hardware/worker-types/gecko-t-linux-talos worker explorer] and reboot it via Roller
* reimage the machine
Click [[CIDuty/How_To/Take_actions_to_RelEng_Hardware_from_TaskCluster_UI|'''Here''']] to learn how to reboot/ping via Roller
Machines usually recover from this; if not, re-image them following the [https://docs.google.com/spreadsheets/d/1IPTmppvqDw0PQV-O1LgXLJg_7TC-H_IAAnSxcur8c7I/edit?ts=5ad7748a#gid=562893333 moonshot spreadsheet] and [[CIDuty/How_To/Reimage_Linux_Workers|this page]].
When the procedure completes, you should get a puppet e-mail about the re-imaged worker.
Remember to check back on it in the [https://tools.taskcluster.net/provisioners/releng-hardware/worker-types/gecko-t-linux-talos worker explorer] to see if it's picking tasks again.


= OSX 10.10 =
===== OSX 10.10 =====
When OSX machines stop taking tasks reboot them from the [https://tools.taskcluster.net/provisioners/releng-hardware/worker-types/gecko-t-osx-1010 worker explorer]. Most of the time this recovers the worker.  
 
Otherwise re-image it by running the appropriate line as per [https://mana.mozilla.org/wiki/pages/viewpage.action?pageId=34014655 this].
If an OSX machine stops taking tasks, there are two ways to fix this:
* connect to it via SSH using the ''username and password'' that can be found under releng GPG ''private/passwords'' and run the following command: <code>reboot</code>
* go to [https://tools.taskcluster.net/provisioners/releng-hardware/worker-types/gecko-t-osx-1010 worker explorer] and reboot it via Roller
Click [[CIDuty/How_To/Take_actions_to_RelEng_Hardware_from_TaskCluster_UI|'''Here''']] to learn how to reboot/ping via Roller
Most of the time this recovers the worker.  
Otherwise re-image it following [[CIDuty/How_To/Reimage_OSX_Workers|this]].
 
==== Troubleshooting ====
 
===== Worker Actual Status =====
 
Sometimes while [https://papertrailapp.com/ checking logs] you may not find the machine. This mostly happens when the machine has been offline for many days or has been taken down (maintenance, hardware issues, etc.).<br />
To learn more about a machine (whether it is loaned, has hardware issues, etc.), check the [https://docs.google.com/spreadsheets/d/1IPTmppvqDw0PQV-O1LgXLJg_7TC-H_IAAnSxcur8c7I/edit?pli=1#gid=562893333 Moonshot Inventory] and/or the [https://github.com/mozilla-releng/build-puppet/search?q=T-W1064-MS-072&unscoped_q=T-W1064-MS-072 node definition] (here we have searched for T-W1064-MS-072). If you don't find enough information, search [https://bugzilla.mozilla.org/ Bugzilla] using the following keywords: ALL machine_name.<br />
You can also check the actual status of the machine [https://mozilla.service-now.com/nav_to.do here].
 
===== How to add/define a worker if it is missing from Taskcluster =====
 
If we cannot ssh into OSX nodes, we can try to restart them from Taskcluster.<br />
But if they are not visible in the Taskcluster worker explorer, then you can create them using this version of [[CIDuty/How_To/QuarantineMultipleInstances|quarantine script]] that will add/define a worker if it is missing.<br />
After setting up the taskcluster cli and the script, run the following command, e.g.: <code>python quarantine_tc.py --enable -p releng-hardware -w gecko-t-osx-1010 -g mdc2 t-yosemite-r7-449</code><br />
After the steps above, the worker explorer will show the machine and you can reboot it from there using Roller.<br />
If the issue is not fixed (the machine does not take jobs and SSH is still not working), create a bug for RelOps to physically reboot and reimage/netboot the machines.<br />
If the restart fails, the Automatic Bug Generator will create a bug for RelOps.
 
===== No video on all cartridges from a chassis =====
 
If we see any connection problems to iLO, we can try the <code>reset cm</code> command to reset the iLO manager.
 
* Connect to the moon-chassis over SSH
* Run the following command: <code>reset cm</code>
   
For more details check [https://bugzilla.mozilla.org/show_bug.cgi?id=1504942#c6 Bug 1504942]
 
===== SSH not working =====
 
* Check the [https://papertrailapp.com/ Papertrail logs]
* Reboot it from [[CIDuty/How_To/Take_actions_to_RelEng_Hardware_from_TaskCluster_UI|Taskcluster]]. It may have old auth keys or may not have completed re-imaging.
* Create a tracking bug or update the existing one.

Latest revision as of 01:59, 30 April 2019

About

Often we need to troubleshoot the hardware workers for various reasons:

  • falling off the network
  • machine has shut down
  • generic worker/OCC fail to start or not running
  • hardware failure
  • not picking up tasks

If you notice any releng-hardware workers missing or not picking tasks, escalate to ciduty in #ci.

Monitoring

To find the workers with issues, we use the following tools:

Grafana
Taskcluster Worker Checker
Nagios MDC1 and MDC2
Taskcluster

Logs

Sometimes a log can give us a better overview of the server or machine in question. For checking logs we use Papertrail.

Workers

Windows 10

When a Windows machine needs attention, the best place to start is its logs. If the logs don't show the worker as ready for tasks, reboot it from the worker explorer. Below are two ways to fix this:

  • connect to it via iLO using the username and password that can be found under releng GPG private/passwords and click on Power Switch > Cold Boot
  • reimage it

Sometimes a reboot won't do the trick, and in this case the machine needs to be re-imaged.
Following the moonshot spreadsheet, re-image the machine through the HP iLO Integrated Remote Console; click Here to learn how.
Be sure to follow the process until it completes, then check back in the worker explorer to see if it's picking tasks again.

Linux 64

When a Linux machine needs attention, the best place to start is its logs. If a Linux worker has stopped picking tasks, there are four ways to fix this:

  • connect to it via SSH using the username and password that can be found under releng GPG private/passwords and run the following command: reboot
  • connect to it via iLO using the username and password that can be found under releng GPG private/passwords and click on Power Switch > Cold Boot
  • go to the worker explorer and reboot it via Roller
  • reimage the machine

Click Here to learn how to reboot/ping via Roller. Machines usually recover from this; if not, re-image them following the moonshot spreadsheet and this page. When the procedure completes, you should get a puppet e-mail about the re-imaged worker. Remember to check back on it in the worker explorer to see if it's picking tasks again.
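The SSH option in the list above could be scripted roughly as follows. This is only a minimal sketch, not existing CIDuty tooling: the function names, the user name and the host name are hypothetical placeholders, and it assumes a local ssh client that prompts for the credentials stored under releng GPG private/passwords. It defaults to a dry run so nothing is rebooted by accident.

```python
import shlex
import subprocess


def build_reboot_command(host, user):
    """Return the argv for rebooting a worker over SSH.

    Only the `reboot` command comes from the procedure above;
    `host` and `user` are placeholders supplied by the caller.
    """
    if not host or not user:
        raise ValueError("host and user are required")
    return ["ssh", f"{user}@{host}", "reboot"]


def reboot_worker(host, user, dry_run=True):
    """Print the command, and run it only when dry_run is False."""
    argv = build_reboot_command(host, user)
    print(shlex.join(argv))
    if dry_run:
        return None
    # ssh asks for the password interactively if one is required.
    return subprocess.run(argv, check=False).returncode


if __name__ == "__main__":
    # Hypothetical worker and user names, shown in dry-run mode only.
    reboot_worker("worker-001.example.com", "someuser")
```

After a real run, check the worker explorer as described above, since a zero exit status only means the command was accepted.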

OSX 10.10

If an OSX machine stops taking tasks, there are two ways to fix this:

  • connect to it via SSH using the username and password that can be found under releng GPG private/passwords and run the following command: reboot
  • go to worker explorer and reboot it via Roller

Click Here to learn how to reboot/ping via Roller. Most of the time this recovers the worker. Otherwise re-image it following this.

Troubleshooting

Worker Actual Status

Sometimes while checking logs you may not find the machine. This mostly happens when the machine has been offline for many days or has been taken down (maintenance, hardware issues, etc.).
To learn more about a machine (whether it is loaned, has hardware issues, etc.), check the Moonshot Inventory and/or the node definition (here we have searched for T-W1064-MS-072). If you don't find enough information, search Bugzilla using the following keywords: ALL machine_name.
You can also check the actual status of the machine Here.
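The two lookups above can be turned into ready-made links. The sketch below is illustrative only: the helper names are made up, the build-puppet search URL follows the node definition link used on this page, and it assumes that Bugzilla's buglist.cgi accepts the search text in a quicksearch parameter.

```python
from urllib.parse import urlencode

BUGZILLA_BUGLIST = "https://bugzilla.mozilla.org/buglist.cgi"
PUPPET_SEARCH = "https://github.com/mozilla-releng/build-puppet/search"


def bugzilla_search_url(machine_name):
    """Build a Bugzilla search for the keywords 'ALL <machine_name>'."""
    query = urlencode({"quicksearch": f"ALL {machine_name}"})
    return f"{BUGZILLA_BUGLIST}?{query}"


def node_definition_url(machine_name):
    """Build the build-puppet search link for a machine name."""
    query = urlencode({"q": machine_name, "unscoped_q": machine_name})
    return f"{PUPPET_SEARCH}?{query}"


if __name__ == "__main__":
    # Machine name taken from the example on this page.
    print(bugzilla_search_url("T-W1064-MS-072"))
    print(node_definition_url("T-W1064-MS-072"))
```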

How to add/define a worker if it is missing from Taskcluster

If we cannot ssh into OSX nodes, we can try to restart them from Taskcluster.
But if they are not visible in the Taskcluster worker explorer, then you can create them using this version of quarantine script that will add/define a worker if it is missing.
After setting up the taskcluster cli and the script, run the following command, e.g.: python quarantine_tc.py --enable -p releng-hardware -w gecko-t-osx-1010 -g mdc2 t-yosemite-r7-449
After the steps above, the worker explorer will show the machine and you can reboot it from there using Roller.
If the issue is not fixed (the machine does not take jobs and SSH is still not working), create a bug for RelOps to physically reboot and reimage/netboot the machines.
If the restart fails, the Automatic Bug Generator will create a bug for RelOps.
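When several workers are missing, it may help to generate the command line instead of typing it each time. The sketch below is an assumption-laden helper, not part of the quarantine script: it only reproduces the flags shown in the example above (--enable, -p, -w, -g), emits one command per worker, and the helper name is hypothetical.

```python
import shlex


def quarantine_enable_command(provisioner, worker_type, group, worker):
    """Return the argv matching the documented quarantine_tc.py example."""
    if not worker:
        raise ValueError("a worker name is required")
    return [
        "python", "quarantine_tc.py", "--enable",
        "-p", provisioner,
        "-w", worker_type,
        "-g", group,
        worker,
    ]


if __name__ == "__main__":
    # Values taken from the example on this page.
    argv = quarantine_enable_command(
        "releng-hardware", "gecko-t-osx-1010", "mdc2", "t-yosemite-r7-449"
    )
    print(shlex.join(argv))
```

The printed line can be pasted into a shell where the taskcluster cli and the script are already set up.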

No video on all cartridges from a chassis

If we see any connection problems to iLO, we can try the `reset cm` command to reset the iLO manager.

  • Connect to the moon-chassis over SSH
  • Run the following command: reset cm

For more details check Bug 1504942

SSH not working
  • Check the Papertrail logs
  • Reboot it from Taskcluster. It may have old auth keys or may not have completed re-imaging.
  • Create a tracking bug or update the existing one.
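Before filing the bug, it can be useful to note whether the machine is unreachable or merely rejecting logins (for example because of old auth keys). The check below is not part of the documented procedure; it is a small sketch that assumes SSH listens on the default port 22 and only tests whether a TCP connection can be opened. The host name in the usage example is a placeholder.

```python
import socket


def ssh_port_open(host, port=22, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and DNS failures.
        return False


if __name__ == "__main__":
    # Hypothetical host name; replace with the worker being checked.
    host = "worker-001.example.com"
    state = "reachable" if ssh_port_open(host) else "unreachable"
    print(f"{host}: SSH port {state}")
```

An open port with failing logins points towards keys or an unfinished re-image, while a closed port points towards the network or the machine being down; either result is worth recording in the tracking bug.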