CIDuty/How To/Troubleshoot Hardware
Latest revision as of 01:59, 30 April 2019
About
Often we need to troubleshoot the hardware workers for various reasons:
- falling off the network
- machine has shut down
- generic-worker/OCC fails to start or is not running
- hardware failure
- not picking up tasks
If you notice any releng-hardware workers missing or not picking up tasks, escalate to ciduty in #ci.
Monitoring
To find workers with issues, we use the following tools:
- Grafana
- Taskcluster Worker Checker
- Nagios (MDC1 and MDC2)
- Taskcluster
Logs
Logs often give a clearer picture of what is wrong with the server or machine in question. To check logs we use Papertrail.
Workers
Windows 10
When a Windows machine needs action, the best place to start is its logs. If the logs do not show the worker ready for tasks, reboot it from the worker explorer. If that does not help, there are two ways to fix it:
- connect to it via iLO using the username and password stored under releng GPG private/passwords, then click Power Switch > Cold Boot
- re-image it
Sometimes a reboot won't do the trick, and the machine needs to be re-imaged.
Following the moonshot spreadsheet, re-image the machine through the HP iLO Integrated Remote Console; see CIDuty/How_To/Reimage_Windows_Workers for the procedure.
Follow the process until it completes, then check the worker explorer to confirm the machine is picking up tasks again.
Linux 64
When a Linux machine needs action, the best place to start is its logs. If a Linux worker has stopped picking up tasks, there are four ways to fix it:
- connect to it via ssh using the username and password stored under releng GPG private/passwords and run the following command: reboot
- connect to it via iLO using the username and password stored under releng GPG private/passwords, then click Power Switch > Cold Boot
- go to the worker explorer and reboot it via Roller
- re-image the machine
To learn how to reboot/ping via Roller, see CIDuty/How_To/Take_actions_to_RelEng_Hardware_from_TaskCluster_UI.
Machines usually recover after a reboot. If not, re-image them following the moonshot spreadsheet and CIDuty/How_To/Reimage_Linux_Workers. When the procedure finishes, you should get a puppet e-mail about the re-imaged worker. Remember to check back on it in the worker explorer to see if it's picking up tasks again.
OSX 10.10
If an OSX machine stops taking tasks, there are two ways to reboot it:
- connect to it via ssh using the username and password stored under releng GPG private/passwords and run the following command: reboot
- go to the worker explorer and reboot it via Roller
To learn how to reboot/ping via Roller, see CIDuty/How_To/Take_actions_to_RelEng_Hardware_from_TaskCluster_UI.
Most of the time this recovers the worker. Otherwise, re-image it following CIDuty/How_To/Reimage_OSX_Workers.
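The recovery options above can be read as an ordered ladder per platform. The following is a minimal sketch of that idea only; the dictionary, the step labels, and the function name are illustrative assumptions and are not part of any CIDuty tool.

```python
# Recovery options per platform, in the order this page lists them.
# Names and labels here are illustrative; they do not come from an existing tool.
RECOVERY_STEPS = {
    "windows": ["reboot via worker explorer", "iLO cold boot", "re-image"],
    "linux": ["ssh reboot", "iLO cold boot", "reboot via Roller", "re-image"],
    "osx": ["ssh reboot", "reboot via Roller", "re-image"],
}


def next_step(platform, attempts_so_far):
    """Return the next recovery action to try, or None when all are exhausted."""
    steps = RECOVERY_STEPS[platform.lower()]
    if attempts_so_far < len(steps):
        return steps[attempts_so_far]
    return None  # nothing left to try: escalate, e.g. file a bug for RelOps


print(next_step("linux", 0))
```

A return value of None means every listed option has been tried and the machine should be escalated.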
Troubleshooting
Worker Actual Status
Sometimes while checking logs you may not find the machine. This mostly happens when the machine has been offline for many days or has been taken down (maintenance, hardware issues, etc.).
To learn more about a machine (whether it is loaned, has hardware issues, etc.), check the Moonshot Inventory and/or its node definition in the build-puppet repository (search for the machine name, e.g. T-W1064-MS-072). If you don't find enough information there, search Bugzilla using the keywords: ALL machine_name.
You can also check the actual status of the machine in ServiceNow.
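The Bugzilla search described above can also be opened directly as a URL. This is a small sketch that assumes Bugzilla's standard quicksearch parameter on buglist.cgi; the function name is hypothetical, and the machine name is the example used on this page.

```python
from urllib.parse import urlencode

BUGZILLA_BUGLIST = "https://bugzilla.mozilla.org/buglist.cgi"


def bugzilla_search_url(machine_name):
    """Build a quicksearch URL for 'ALL <machine_name>' (ALL also matches closed bugs)."""
    query = urlencode({"quicksearch": f"ALL {machine_name}"})
    return f"{BUGZILLA_BUGLIST}?{query}"


print(bugzilla_search_url("T-W1064-MS-072"))
```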
How to add/define a worker if it is missing from Taskcluster
If we cannot ssh into OSX nodes, we can try to restart them from Taskcluster.
If they are not visible in the Taskcluster worker explorer, you can create them using the version of the quarantine script (CIDuty/How_To/QuarantineMultipleInstances) that adds/defines a worker if it is missing.
After setting up the taskcluster cli and the script, run a command such as: python quarantine_tc.py --enable -p releng-hardware -w gecko-t-osx-1010 -g mdc2 t-yosemite-r7-449
After these steps, the worker explorer will show the machine and you can reboot it from there using Roller.
If the issue is not fixed (the machine does not take jobs and SSH is still not working), create a bug for RelOps to physically reboot and re-image/netboot the machine.
If the restart fails, the Automatic Bug Generator will create a bug for RelOps.
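When several workers are missing, it can help to generate one command per worker. The sketch below only builds and prints the command lines; it uses exactly the flags from the example on this page and does not assume the script accepts more than one worker name per call. The helper name and the second worker name are hypothetical.

```python
import shlex


def quarantine_enable_command(worker, worker_type="gecko-t-osx-1010",
                              provisioner="releng-hardware", group="mdc2"):
    """Return the argument list for one quarantine_tc.py invocation."""
    return ["python", "quarantine_tc.py", "--enable",
            "-p", provisioner, "-w", worker_type, "-g", group, worker]


# The second worker name is a made-up example.
workers = ["t-yosemite-r7-449", "t-yosemite-r7-450"]
for name in workers:
    # Print instead of executing, so each command can be reviewed first.
    print(shlex.join(quarantine_enable_command(name)))
```

Printing the commands rather than running them keeps the operator in control; the same argument list could be passed to subprocess.run once the output looks right.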
No video on all cartridges from a chassis
If we see any connection problems to iLO, we can try the `reset cm` command to reset the iLO manager.
- Connect to the moon-chassis via SSH
- Run the following command: reset cm
For more details, see Bug 1504942.
SSH not working
- Check the Papertrail logs
- Reboot it from Taskcluster. It may have old auth keys or an incomplete re-image
- Create a tracking bug or update the existing one.
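Before rebooting, it can be useful to tell "machine unreachable" apart from "SSH reachable but login fails". The following is a minimal sketch using only the Python standard library; the function name is hypothetical, and port 22 is assumed to be the SSH port.

```python
import socket


def ssh_port_open(host, port=22, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False
```

For example, ssh_port_open("t-yosemite-r7-449") returning True while login still fails would be consistent with old auth keys, whereas False suggests the machine is down or still re-imaging. Treat this as a hint only and confirm in the Papertrail logs.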