NOTE: This page is under construction. If anything is unclear, please ask the CIduty team.

How to add/define a worker if it is missing from Taskcluster

If we cannot SSH into the OSX nodes, we can try to restart them from Taskcluster. If they are not visible in the Taskcluster worker explorer, you can create them with this version of the quarantine script, which adds/defines a worker if it is missing.

Step 1: Connect to the Taskcluster CLI
Step 2: Run the quarantine script, e.g. python quarantine_tc.py --enable -p releng-hardware -w gecko-t-osx-1010 -g mdc2 t-yosemite-r7-449
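
When several workers are missing, it can help to generate the Step 2 command instead of typing it by hand. The following is a minimal sketch, not part of the quarantine script itself: the helper name build_quarantine_command is hypothetical, and it only uses the flags that appear in the example above (--enable, -p, -w, -g and one hostname).

```python
import shlex


def build_quarantine_command(provisioner, worker_type, worker_group, hostname,
                             script="quarantine_tc.py"):
    """Return the argument list for the quarantine script call shown in Step 2.

    Hypothetical helper: it only assembles the command, it does not run it.
    """
    if not hostname:
        raise ValueError("a worker hostname is required")
    return [
        "python", script, "--enable",
        "-p", provisioner,    # provisioner, e.g. releng-hardware
        "-w", worker_type,    # worker type, e.g. gecko-t-osx-1010
        "-g", worker_group,   # worker group (datacenter), e.g. mdc2
        hostname,
    ]


cmd = build_quarantine_command(
    "releng-hardware", "gecko-t-osx-1010", "mdc2", "t-yosemite-r7-449"
)
# Print the command so it can be reviewed before it is pasted into a shell.
print(shlex.join(cmd))
```

The list can be passed to subprocess.run(cmd, check=True) from the directory that contains the script, once you are connected to the Taskcluster CLI as in Step 1.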

After the steps above, the worker explorer will show the machine and you can reboot it from there using roller.

If the issue is not fixed (the machine does not take jobs and SSH is still not working), create a bug for DCOps to physically reboot and reimage/netboot the machine. The Automatic Bug Generator will create a bug for DCOps if the restart fails.

Taskcluster Checker

Using the client.py script from the GitHub repository, you can find all TC workers that are missing and need to be debugged.

In the README file you can find how to use the checker.

Machine Quick Check

Here are a few methods to check a worker:

Check the problem tracking bug: e.g. problem tracking bug
Check the node definition in the puppet repo: e.g. node definition
Look into Papertrail for logs: e.g. papertrail logs
Check if the host responds to ping.
Connect to the worker using SSH:
- check if the worker process is running: ps -ef|grep
- check the load: run top -u to see whether something other than python or firefox is causing high CPU usage
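
The ping and SSH checks from the list above can be scripted before logging in by hand. This is a minimal sketch using only the Python standard library. The function names are hypothetical, the ping flag "-c 1" assumes a Linux or macOS machine, and an open port 22 only shows that something is listening there, not that your key will be accepted.

```python
import socket
import subprocess


def responds_to_ping(host, timeout_s=5):
    """Return True if one ICMP echo request to host gets a reply."""
    try:
        result = subprocess.run(
            ["ping", "-c", "1", host],  # "-c 1": send a single packet
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,
        )
    except (subprocess.TimeoutExpired, FileNotFoundError):
        # No reply within the timeout, or no ping binary on this machine.
        return False
    return result.returncode == 0


def ssh_port_open(host, port=22, timeout_s=5):
    """Return True if a TCP connection to the SSH port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        # Covers refused connections, timeouts and DNS failures.
        return False
```

For example, call responds_to_ping(host) and ssh_port_open(host) with the worker's fully qualified hostname. If the host answers ping but the SSH port is closed, continue with the "SSH not working" section.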

Rebooting workers

Here are the methods to reboot a worker:

Mac OS X
- Reboot from Taskcluster using roller
- Connect to the worker using SSH and reboot it from console.
Windows and Linux Moonshot
- Reboot from Taskcluster using roller
- Connect to the management web interface and start the Java KVM console, or use the HP iLO Integrated Remote Console application; then use Power Switch/Cold Boot in the upper left corner to restart the machine.

Re-imaging workers

Most of the time, when we find a worker that is not in Taskcluster or that has not taken jobs for more than one day, we try rebooting it, but this does not always help. The next step is to re-image the worker. Currently, the image used in MDC1 is Generic Worker 10. Occasionally re-imaging also fails; in that case we reset the BIOS and then re-image the machine again. After this step, the problem is usually solved.

How to re-image:

Windows MS: How To Image or Reimage a Windows Linux Server
Linux MS: How To Reimage Releng HP Moonshot Linux Machines
Mac OS X: How To Reimage Mac Minis [Remotely]
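
The escalation described in this section (reboot, then re-image, then reset the BIOS and re-image) can be written down as a small decision helper. This is an illustrative sketch only, not an existing tool. The names needs_attention and next_step are hypothetical, and the one-day threshold is taken from the paragraph above.

```python
from datetime import datetime, timedelta

# Recovery steps in the order this section describes them.
ESCALATION_STEPS = ["reboot", "re-image", "reset BIOS, then re-image"]

MAX_IDLE = timedelta(days=1)


def needs_attention(in_taskcluster, last_job_at, now):
    """True if the worker is missing from Taskcluster or idle for over a day."""
    if not in_taskcluster:
        return True
    return (now - last_job_at) > MAX_IDLE


def next_step(failed_attempts):
    """Return the next recovery step after failed_attempts unsuccessful steps.

    Returns None once every step has been tried, which is the point at
    which the machine needs a bug for physical intervention.
    """
    if failed_attempts < 0:
        raise ValueError("failed_attempts must not be negative")
    if failed_attempts < len(ESCALATION_STEPS):
        return ESCALATION_STEPS[failed_attempts]
    return None
```

For a worker that has already been rebooted once without success, next_step(1) returns "re-image".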

SSH not working

Step 1: Check the Papertrail logs
Step 2: Reboot it from Taskcluster. It may have old auth_keys or an incomplete re-image.
Step 3: File a problem tracking bug or update the existing problem tracking bug.

No video on all cartridges from a chassis

If we see any connection problems to iLO, we can try the `reset cm` command to reset the iLO manager.

Step 1: Connect to the moon-chassis over SSH
Step 2: Run "reset cm"

Connect and Troubleshoot workers in CI
