CIDuty/How To/High Pending Counts: Difference between revisions

updated with new info and made some format changes
Line 83: Line 83:
* load on upload/stage: this can affect the download of artifacts for builds and tests, leading to retries and high pending counts


If there are no alerts, it is worth asking in #MOC and/or #infra to see if IT is tracking any events not currently on our nagios radar.
If there are no alerts, it is worth asking in #netops and/or #systems to see if IT is tracking any events not currently on our nagios radar.
You can also check [https://mozilla.statuspage.io/history Mozilla Status] to see whether any planned or unplanned action is in progress.


=== TaskCluster ===
Line 100: Line 101:


==== If pending goes to CRITICAL ====
1. Make sure that workers are picking up tasks by looking at the specific worker type in TaskCluster. Machines may go "lazy" after a job ends as an exception (those might need a reboot).
1. Make sure that workers are picking up tasks by looking at the specific worker type in TaskCluster. Machines may go "lazy" after a job ends as an exception (those might need a [[CIDuty/How_To/Take_actions_to_RelEng_Hardware_from_TaskCluster_UI|reboot]]).


2. Read backscroll in #taskcluster and search Bugzilla under the Taskcluster component for anything that correlates with the pending alerts.


3. If no correlation can be found, let people know in #taskcluster about the spike in case it is not expected.
3. If no correlation can be found, let people know in #ci about the spike in case it is not expected.
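
The "lazy worker" check in step 1 can be sketched as a small filter. This is only an illustration under stated assumptions: the <code>WorkerStatus</code> record, its field names, and the one-hour idle threshold are all hypothetical and are not part of any TaskCluster API; the data would have to be gathered by hand from the worker-type view.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import List, Optional


@dataclass
class WorkerStatus:
    """Hypothetical record for one worker (not a TaskCluster type)."""
    worker_id: str
    last_task_state: Optional[str]       # e.g. "completed", "failed", "exception"
    last_task_ended: Optional[datetime]  # None if the worker never ran a task


def find_lazy_workers(
    workers: List[WorkerStatus],
    now: datetime,
    idle_threshold: timedelta = timedelta(hours=1),  # assumed value, tune as needed
) -> List[str]:
    """Return IDs of workers whose last task ended as an exception and
    that have been idle for at least idle_threshold since then."""
    lazy = []
    for worker in workers:
        if worker.last_task_state != "exception" or worker.last_task_ended is None:
            continue
        if now - worker.last_task_ended >= idle_threshold:
            lazy.append(worker.worker_id)
    return lazy
```

Workers returned by this filter are only candidates for a reboot; a worker that finished normally and is idle because nothing is pending would not be flagged.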


= Rebooting taskcluster workers =
Rebooting taskcluster workers has to be done manually depending on the type of machine. For gecko-t-win10-64 and gecko-t-linux-talos it has to be done via ILO. For gecko-t-osx-1010 it can be done via SSH.
Rebooting taskcluster workers can be done manually or automatically, depending on the type of machine. For gecko-t-win10-64 and gecko-t-linux-talos it can be done via iLO or using [https://github.com/Akhliskun/taskcluster-worker-checker Taskcluster Worker Checker]. For gecko-t-osx-1010 it can be done via SSH or via the [https://tools.taskcluster.net/provisioners/releng-hardware/worker-types/gecko-t-osx-1010 roller on Taskcluster].


= See also =
https://mana.mozilla.org/wiki/display/NAGIOS/Backlog+Age
[https://docs.google.com/spreadsheets/d/1pUFq6Z5M5a1ydbSzxNjQivFfryVoksdXa9xXTg9gtzc/edit#gid=0 CIDuty Escalation Path]
[https://mana.mozilla.org/wiki/display/NAGIOS/Backlog+Age Backlog Age]