Connect and Troubleshoot workers in CI: Difference between revisions
Revision as of 12:01, 29 August 2018
NOTE: This page is under construction. If the information is not clear, please ask the CIduty team.
How to add/define a worker if it is missing from Taskcluster
If we cannot SSH into OSX nodes, we can try to restart them from Taskcluster. If they are not visible in the Taskcluster worker explorer, you can create them with this version of the quarantine script, which adds/defines a worker if it is missing.
- Step 1: Connect to Taskcluster CLI
- Step 2: Use this command: e.g.
python quarantine_tc.py --enable -p releng-hardware -w gecko-t-osx-1010 -g mdc2 t-yosemite-r7-449
After the steps above, the worker explorer will show the machine and you can reboot it from there using roller.
If the issue is not fixed (the machine does not take jobs and SSH is still not working), create a bug for DCOps to physically reboot and reimage/netboot the machines. The Automatic Bug Generator will create a bug for DCOps if the restart fails.
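When several machines are missing from the worker explorer, the Step 2 command has to be repeated for each host. The sketch below only assembles the command line shown above, using the flags and values from that example. The helper name `build_quarantine_cmd` is hypothetical, and calling the script once per host is an assumption, since this page does not say whether quarantine_tc.py accepts several hosts at once.

```python
import shlex


def build_quarantine_cmd(host, provisioner="releng-hardware",
                         worker_type="gecko-t-osx-1010", group="mdc2"):
    """Return the argv list for one `quarantine_tc.py --enable` call.

    The defaults are copied from the example on this page; adjust them
    for the worker type and datacenter you are dealing with.
    """
    return ["python", "quarantine_tc.py", "--enable",
            "-p", provisioner, "-w", worker_type, "-g", group, host]


if __name__ == "__main__":
    # Print one command per missing host so it can be reviewed before running.
    for host in ["t-yosemite-r7-449"]:
        print(shlex.join(build_quarantine_cmd(host)))
```

Printing the commands instead of executing them keeps the step reviewable; the list form can be passed to `subprocess.run` once the output looks right.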
Taskcluster Checker
Using the client.py script from the GitHub Repository, you can find all TC workers that are missing and need to be debugged.
The README file explains how to use the checker.
Machine Quick Check
Here are a few methods to check a worker:
- Check the problem tracking bug: e.g. problem tracking bug
- Check the node definition in the puppet repo: e.g. node definition
- Look into Papertrail for logs: e.g. papertrail logs
- Check if the host responds to ping.
- Connect to the worker using SSH:
- check if the worker process is running:
ps -ef|grep
- check CPU usage:
top -u
to see if there is high CPU usage from something other than python or firefox
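The CPU check above can also be done on a saved process listing. The following is a minimal sketch that assumes the listing comes from `ps -A -o pcpu,comm` (POSIX options, not the `top -u` call above). The function name `unexpected_cpu_hogs` and the 50% threshold are illustrative assumptions, not values from this page.

```python
# Process names that are expected to use a lot of CPU on a test worker.
EXPECTED = ("python", "firefox")


def unexpected_cpu_hogs(ps_output, threshold=50.0, expected=EXPECTED):
    """Return (cpu, command) pairs at or above `threshold` percent CPU
    whose executable name does not start with an expected prefix.

    `ps_output` is the text printed by `ps -A -o pcpu,comm`.
    """
    hogs = []
    for line in ps_output.splitlines():
        parts = line.split(None, 1)
        if len(parts) != 2:
            continue
        try:
            cpu = float(parts[0])
        except ValueError:
            continue  # header line or anything that is not a number
        command = parts[1].strip()
        # `comm` may be a full path on OS X; compare the basename only.
        name = command.rsplit("/", 1)[-1].lower()
        if cpu >= threshold and not name.startswith(expected):
            hogs.append((cpu, command))
    return sorted(hogs, reverse=True)
```

An empty result means that whatever is busy is python or firefox, which is normal while a job is running.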
Rebooting workers
Here are the methods to reboot a worker:
- Mac OS X
- Reboot from Taskcluster using roller
- Connect to the worker using SSH and reboot it from the console.
- Windows and Linux Moonshot
- Reboot from Taskcluster using roller
- Connect to the management web interface and start the Java KVM console, or use the HP iLO Integrated Remote Console application; then, in the upper left corner, choose Power Switch/Cold Boot to restart the machine.
Re-imaging workers
Most of the time, when we find a worker that is not in Taskcluster or that has not taken jobs for more than one day, we try rebooting it, but this does not always help. The next step is to re-image the worker. Currently, the image used in MDC1 is Generic Worker 10. Occasionally re-imaging also fails; in that case we need to reset the BIOS and then re-image the machine again. After this step, the problem is solved most of the time.
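The rule of thumb in this paragraph can be written down as a small sketch. The one-day limit and the order reboot, re-image, BIOS reset come from the text above; the function names and the final fallback of filing a bug are assumptions made for illustration.

```python
from datetime import datetime, timedelta

# A worker that has not taken jobs for longer than this is considered stale.
STALE_AFTER = timedelta(days=1)

# Escalation order described above, mildest action first.
ESCALATION = ["reboot", "re-image", "reset BIOS, then re-image"]


def needs_attention(visible_in_taskcluster, last_job_at, now):
    """True if the worker is missing from Taskcluster or has been idle too long."""
    if not visible_in_taskcluster or last_job_at is None:
        return True
    return now - last_job_at > STALE_AFTER


def next_action(failed_attempts):
    """Return the next step after `failed_attempts` earlier steps did not help."""
    if failed_attempts < len(ESCALATION):
        return ESCALATION[failed_attempts]
    # Assumption: once every step has failed, the problem goes into a bug.
    return "file or update the problem tracking bug"
```

For example, `next_action(0)` is the first thing to try on a stale worker, and `next_action(1)` is what to try once the reboot has not helped.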
How to re-image:
- Windows MS: How To Image or Reimage a Windows Linux Server
- Linux MS: How To Reimage Releng HP Moonshot Linux Machines
- Mac OS X: How To Reimage Mac Minis [Remotely]
SSH not working
- Step 1: Check the Papertrail logs.
- Step 2: Reboot it from Taskcluster. It may have old auth_keys or may not have completed re-imaging.
- Step 3: File a problem tracking bug or update the existing problem tracking bug.
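Before working through the steps above, it can help to know whether the SSH port is reachable at all. The sketch below is not part of the documented procedure. It assumes the worker listens for SSH on the default port 22, and the name `ssh_port_open` is hypothetical.

```python
import socket


def ssh_port_open(host, port=22, timeout=5.0):
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False
```

If the port is open but login still fails, stale keys are a likely cause. If the port is closed, the machine is more likely down or still in the middle of re-imaging.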