Connect and Troubleshoot workers in CI


NOTE: This page is under construction. If anything is unclear, please ask the CIduty team.

How to add/define a worker if it is missing from Taskcluster

If we cannot SSH into the OSX nodes, we can try to restart them from Taskcluster. If they are not visible in the Taskcluster worker explorer, you can create them with this version of the quarantine script, which adds/defines a worker if it is missing.

  • Step 1: Connect to Taskcluster CLI
  • Step 2: Run the quarantine script, e.g. python quarantine_tc.py --enable -p releng-hardware -w gecko-t-osx-1010 -g mdc2 t-yosemite-r7-449

After the steps above, the worker explorer will show the machine and you can reboot it from there using roller.

If the issue is not fixed (the machine does not take jobs and SSH is still not working), create a bug for DCOps to physically reboot and reimage/netboot the machine. The Automatic Bug Generator will create a bug for DCOps if the restart fails.
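When several workers are missing, the Step 2 command can be assembled once per host. The following is only a minimal sketch: build_quarantine_cmd is a hypothetical helper, the default values are copied from the example command above, and it assumes the script is invoked once per worker name. It only builds and prints the command line; it does not run the script.

```python
import shlex


def build_quarantine_cmd(worker_id, provisioner="releng-hardware",
                         worker_type="gecko-t-osx-1010", group="mdc2"):
    """Return the argv list for one quarantine_tc.py call.

    Hypothetical helper: the flags and defaults mirror the example command
    on this page; adjust them for other worker pools.
    """
    return ["python", "quarantine_tc.py", "--enable",
            "-p", provisioner, "-w", worker_type, "-g", group, worker_id]


if __name__ == "__main__":
    # Print one ready-to-paste command per missing worker.
    for host in ["t-yosemite-r7-449"]:
        print(shlex.join(build_quarantine_cmd(host)))
```

Printing the commands first makes it easy to review them before running anything against Taskcluster.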

Taskcluster Checker

Using the client.py script from the GitHub repository, you can find all Taskcluster workers that are missing and need to be debugged.

The README file explains how to use the checker.

Machine Quick Check

Here are a few methods to check a worker:

  • Check the problem tracking bug: e.g. problem tracking bug
  • Check the node definition in the puppet repo: e.g. node definition
  • Look into Papertrail for logs: e.g. papertrail logs
  • Check if the host responds to ping.
  • Connect to the worker using SSH:
    • check if the worker process is running: ps -ef|grep
    • check CPU usage: run top -u to see whether anything other than python or firefox is using a lot of CPU
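The ping and SSH reachability checks above can be scripted before logging in by hand. This is a minimal sketch, not an existing CIduty tool: ping_cmd and tcp_port_open are hypothetical helper names, the SSH port is assumed to be the default 22, and the -c flag assumes a Linux or macOS ping.

```python
import socket


def ping_cmd(host, count=3):
    """Build a ping command (Linux/macOS syntax; Windows uses -n instead of -c)."""
    return ["ping", "-c", str(count), host]


def tcp_port_open(host, port=22, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout.

    An open port 22 only shows that something is listening there; it does
    not prove that SSH login will work.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts and name-resolution failures.
        return False
```

A host that answers ping but has port 22 closed is a reasonable candidate for the "SSH not working" steps further down this page.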

Rebooting workers

Here are the methods to reboot a worker:

  • Mac OS X
    • Reboot from Taskcluster using roller
    • Connect to the worker using SSH and reboot it from console.
  • Windows and Linux Moonshot
    • Reboot from Taskcluster using roller
    • Connect to the management web interface and start the Java KVM console, or use the HP iLO Integrated Remote Console application; then use Power Switch/Cold Boot in the upper left corner to restart the machine.

Re-imaging workers

Most of the time, when we find a worker that is not in Taskcluster or has not taken jobs for more than one day, we try rebooting it, but this does not always help. The next step is to re-image the worker. Currently, the image used in MDC1 is Generic Worker 10. Occasionally re-imaging also fails, in which case we need to reset the BIOS and then re-image the machine. After this step, the problem is solved most of the time.
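The escalation order described above (reboot, then re-image, then BIOS reset plus re-image) can be written down as a small lookup. This is an illustrative sketch only: Step, LADDER and next_step are hypothetical names, and falling back to the problem tracking bug once every step has been tried is an assumption, not a documented rule.

```python
from enum import Enum


class Step(Enum):
    REBOOT = "reboot the worker"
    REIMAGE = "re-image the worker"
    RESET_BIOS_THEN_REIMAGE = "reset the BIOS, then re-image"
    UPDATE_BUG = "file or update the problem tracking bug"  # assumed fallback


# Order taken from the paragraph above.
LADDER = [Step.REBOOT, Step.REIMAGE, Step.RESET_BIOS_THEN_REIMAGE]


def next_step(already_tried):
    """Return the first step in the ladder that has not been tried yet."""
    for step in LADDER:
        if step not in already_tried:
            return step
    return Step.UPDATE_BUG


if __name__ == "__main__":
    # Rebooting did not help, so the next action is a re-image.
    print(next_step({Step.REBOOT}).value)
```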

How to re-image:

SSH not working

  • Step 1: Check the Papertrail logs.
  • Step 2: Reboot it from Taskcluster. It may have old auth_keys or may not have completed re-imaging.
  • Step 3: File a problem tracking bug or update the existing one.

No video on all cartridges from a chassis

If we see any connection problems to iLO, we can try the `reset cm` command to reset the iLO manager.

  • Step 1: Connect to the moon-chassis using SSH
  • Step 2: Run `reset cm`
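The two steps above can also be expressed as a single SSH invocation. This is a sketch under assumptions: build_reset_cm_cmd is a hypothetical helper, the host name in the usage line is a placeholder, and it assumes the chassis manager accepts a command passed on the ssh command line; if it does not, log in interactively and type the command as described above. The helper only builds the argument list and does not connect to anything.

```python
def build_reset_cm_cmd(chassis_host, user=None):
    """Build an ssh command that runs `reset cm` on a chassis manager.

    chassis_host and user are supplied by the caller; nothing here is
    looked up from inventory.
    """
    target = f"{user}@{chassis_host}" if user else chassis_host
    return ["ssh", target, "reset cm"]


if __name__ == "__main__":
    # Placeholder host name, for illustration only.
    print(build_reset_cm_cmd("moon-chassis.example"))
```

The resulting list can be passed to subprocess.run once the host name has been confirmed.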

See also Bug 1504942