CIDuty/How To/Troubleshoot Hardware: Difference between revisions

info updated and added a new topic
(modified info, added links and formating)
(info updated and added a new topic)
Line 26: Line 26:


==== Workers ====
===== How to add/define a worker if it is missing from Taskcluster =====
If you cannot SSH into the OSX nodes, try restarting them from Taskcluster.<br />
If they are not visible in the Taskcluster worker explorer, you can create them with this version of the [[CIDuty/How_To/QuarantineMultipleInstances|quarantine script]], which adds/defines a worker if it is missing.<br />
After setting up the Taskcluster CLI and the script, run a command such as: <code>python quarantine_tc.py --enable -p releng-hardware -w gecko-t-osx-1010 -g mdc2 t-yosemite-r7-449</code><br />
After these steps, the worker explorer will show the machine and you can reboot it from there using roller.<br />
If the issue is not fixed (the machine does not take jobs and SSH still does not work), create a bug for RelOps to physically reboot and reimage/netboot the machine.<br />
The Automatic Bug Generator will create a bug for RelOps if the restart fails.
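
When several workers are missing, the command above can be assembled per machine. The following is a minimal sketch, not part of the quarantine script itself: the helper name <code>build_quarantine_command</code> is hypothetical, and it assumes only the flags shown in the example command (<code>--enable</code>, <code>-p</code>, <code>-w</code>, <code>-g</code>, followed by the worker name). It prints the commands rather than running them, so they can be reviewed first.

```python
import shlex


def build_quarantine_command(provisioner, worker_type, worker_group, worker_id,
                             script="quarantine_tc.py"):
    """Return the argument list for one worker (hypothetical helper).

    Only the flags from the documented example command are used.
    """
    return [
        "python", script, "--enable",
        "-p", provisioner,
        "-w", worker_type,
        "-g", worker_group,
        worker_id,
    ]


# Worker names to define; append further host names as needed.
workers = ["t-yosemite-r7-449"]

for worker in workers:
    command = build_quarantine_command(
        "releng-hardware", "gecko-t-osx-1010", "mdc2", worker
    )
    # Print only; run the command by hand once it looks right.
    print(shlex.join(command))
```

Issuing one command per worker keeps the sketch aligned with the documented example and avoids assuming anything about how the script handles several names at once.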


===== Windows 10 =====
Line 42: Line 50:
When a Linux machine needs attention, the best place to start is its [https://papertrailapp.com/ logs].
If a Linux worker has stopped picking up tasks, there are four ways to fix this:
* connect to it via SSH using the ''username and password'' that can be found under releng GPG ''private/passwords'' and run the following command: <code>reboot</code>
* connect to it via iLO using the ''username and password'' that can be found under releng GPG ''private/passwords'' and click on Power Switch > Cold Boot
* go to [https://tools.taskcluster.net/provisioners/releng-hardware/worker-types/gecko-t-linux-talos worker explorer] and reboot it via roller.
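
Before choosing between the options above, it can help to check whether the machine still answers on the SSH port. The sketch below is only an illustration: the function names are hypothetical, port 22 is assumed to be the SSH port, and the suggested route simply mirrors the options listed above rather than any official policy.

```python
import socket


def ssh_port_open(host, port=22, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and DNS failures.
        return False


def suggest_reboot_route(ssh_ok):
    """Map the SSH check onto the reboot options described above."""
    if ssh_ok:
        return "ssh: run reboot"
    return "iLO cold boot or worker explorer roller"


# Example with a known result, no network access needed:
print(suggest_reboot_route(False))
```

To check a real machine, call <code>ssh_port_open</code> with its host name and pass the result to <code>suggest_reboot_route</code>.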
Line 62: Line 70:
===== Worker Actual Status =====


Sometimes while [https://papertrailapp.com/ checking logs] you may not find the machine. This mostly happens when the machine has been offline for many days or has been taken down (maintenance, hardware issues, etc.).<br />
To learn more about a machine (whether it is loaned, has hardware issues, etc.), check the [https://docs.google.com/spreadsheets/d/1IPTmppvqDw0PQV-O1LgXLJg_7TC-H_IAAnSxcur8c7I/edit?pli=1#gid=562893333 Moonshot Inventory] and/or the [https://github.com/mozilla-releng/build-puppet/search?q=T-W1064-MS-072&unscoped_q=T-W1064-MS-072 node definition] (this example searches for T-W1064-MS-072). If you don't find enough information there, search [https://bugzilla.mozilla.org/ Bugzilla] using the following keywords: ALL machine_name.<br />
You can also check the actual status of the machine [https://mozilla.service-now.com/nav_to.do here].
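
The Bugzilla search with the keywords <code>ALL machine_name</code> can also be opened directly as a link. The following is a small sketch that builds such a link; the helper name <code>bugzilla_search_url</code> is hypothetical, and it assumes the keywords are passed through Bugzilla's <code>quicksearch</code> parameter on <code>buglist.cgi</code>.

```python
from urllib.parse import urlencode

BUGZILLA_BUGLIST = "https://bugzilla.mozilla.org/buglist.cgi"


def bugzilla_search_url(machine_name):
    """Build a search link for 'ALL <machine_name>' (hypothetical helper)."""
    # "ALL" widens the search beyond open bugs, as in the keywords above.
    query = urlencode({"quicksearch": f"ALL {machine_name}"})
    return f"{BUGZILLA_BUGLIST}?{query}"


# Example using the machine name from the node definition search above.
print(bugzilla_search_url("T-W1064-MS-072"))
```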