CIDuty/How To/High Pending Counts

Dealing with high pending counts

Demand will sometimes outstrip supply in the worker pools, and a number of issues can result in a test/build backlog. At this point, Mozilla's automation uses a single CI tool, TaskCluster, for validating changes to the code base. Hence, the first step in dealing with high pending counts is finding out which worker pools are affected. We have a Nagios check in place that constantly monitors the number of jobs in each worker pool and alerts in #platform-ops-alerts when that number exceeds certain thresholds.

<nagios-releng> Tue 23:01:26 UTC [8020] [Unknown] nagios1.private.releng.mdc1.mozilla.com:Pending tests is WARNING: WARNING Pending tests: 2151 on gecko-t-linux-xlarge (http://m.mozilla.org/Pending+tests)

For TaskCluster, you can find the pending count for an individual worker pool by accessing the corresponding endpoint under https://queue.taskcluster.net/v1/pending/{provisionerId}/{workerType}.
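If you want to check this from a script instead of the browser, the following is a minimal sketch in Python. It assumes the endpoint returns JSON with a pendingTasks field; the aws-provisioner-v1/gecko-t-linux-xlarge pair is only an example worker pool.

# Minimal sketch: query the queue's pending endpoint for one worker pool.
# Assumes the response is JSON containing a "pendingTasks" count; the
# provisioner and worker type below are example values only.
import json
import urllib.request

PROVISIONER = "aws-provisioner-v1"
WORKER_TYPE = "gecko-t-linux-xlarge"

url = "https://queue.taskcluster.net/v1/pending/{}/{}".format(PROVISIONER, WORKER_TYPE)
with urllib.request.urlopen(url) as response:
    data = json.load(response)

print("{}: {} pending tasks".format(WORKER_TYPE, data.get("pendingTasks")))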

There's also a handy Python script that can be used to see the pending counts for each worker pool at a given point in time:

$ python check_pending_jobs.py -h
usage: check_pending_jobs.py [-h] [-B] [-t] [-C CRITICAL] [-W WARNING]
                             [-c CRITICAL] [-w WARNING] [-b] [-T]

optional arguments:
  -h, --help            show this help message and exit
  -B, --builds          compute number of pending builds per machine pool
  -t, --tests           compute number of pending tests per machine pool
  -C CRITICAL, --builds_critical CRITICAL
                        Set builds CRITICAL level as integer eg. 300
  -W WARNING, --builds_warning WARNING
                        Set builds WARNING level as integer eg. 200
  -c CRITICAL, --tests_critical CRITICAL
                        Set tests CRITICAL level as integer eg. 3000
  -w WARNING, --tests_warning WARNING
                        Set tests WARNING level as integer eg. 2000
  -b, --buildbot        Display pending jobs on buildbot machine pools
  -T, --taskcluster     Display pending jobs on taskcluster workers
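
For example, to display TaskCluster pending test counts with custom warning and critical thresholds, an invocation might look like this:

$ python check_pending_jobs.py -T -t -w 2000 -c 3000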

What platforms are affected?

Some platforms, notably talos tests for all OSes, have finite pools of hardware. Once all the machines are running jobs, any additional work will be queued up.

Spikes in the number of pending requests for Linux jobs that run on AWS instances (build, try, and test) can also occur. AWS instances are terminated when not required, and it can take a while (30 mins?) to spin up new instances to meet sudden demand.

Where is the load coming from?

Did nightlies just get triggered? Did nightlies just trigger dependent l10n jobs?

These are predictable daily sources of spiky load.

Did the trees just open following a closure?

There is usually a big pulse of activity right after a tree closure as developers start landing code again.

Is someone abusing the try server?

Occasionally someone will (re)trigger a large number of jobs on a single changeset. People who do this often and with good reason (e.g. jmaher) usually do so on weekends when there is less contention for the infrastructure. If someone does this mid-week, it's best to find them on IRC and figure out why they've done it. You may need to cancel some or all of their extra jobs if it's impacting other developers.

There is no designated tool for finding these kinds of abuse. When looking for abuse, check which branch the pending jobs are coming from (usually try), and then walk back through the Treeherder history for that branch looking for revisions with multiple jobs triggered. A common backlog source is running the same set of tests multiple times (e.g. by pushing with "--rebuild 20"). That is generally needed when dealing with intermittent failures that are difficult to reproduce and require multiple runs, but it will sometimes unfairly block other users from getting their test results in time. To confirm the backlog is coming from such pushes, check the current jobs in the pending queue, look for duplicated entries, and then go to Treeherder to see which user those pushes belong to.
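If you want to script that search rather than click through Treeherder, the sketch below walks the most recent try pushes and flags the ones with an unusually large number of jobs. The endpoint paths and response shapes are assumptions based on the public Treeherder REST API, so double-check them against the current API documentation before relying on the output.

# Rough sketch: flag try pushes with an unusually large number of jobs.
# The endpoints and response fields below are assumptions about the
# Treeherder REST API; verify them against the current API docs.
import json
import urllib.request

BASE = "https://treeherder.mozilla.org/api/project/try"

def get_json(url):
    with urllib.request.urlopen(url) as response:
        return json.load(response)

pushes = get_json(BASE + "/push/?count=20")["results"]
for push in pushes:
    jobs = get_json(BASE + "/jobs/?push_id={}&count=2000".format(push["id"]))
    n_jobs = len(jobs["results"])
    if n_jobs > 500:  # arbitrary threshold, tune as needed
        print("push {} by {}: {} jobs".format(push["revision"], push["author"], n_jobs))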

Are infrastructure problems causing retries?

Builds and tests will retry when infrastructure problems prevent them from completing, for example when they cannot fetch a package or cannot upload their resulting binaries. In one case, IT implemented a DNS redirect to a server where we didn't have ssh keys to upload the resulting binaries (see bug 1198296): builds failed, they retried, and pending counts rose. If the failures trace back to instances spawned from a problematic AMI, those instances can be terminated:

cd /builds/aws_manager/cloud-tools
# terminate all instances spawned from the given AMI
python scripts/aws_terminate_by_ami_id.py <your ami id>

For an example of this problem, see bug 1203104.

You can also check by running the Firefox Infra Changelog script. The tool builds a changelog of commits on git and hg repositories that could affect Firefox CI infra. The script can be found here: https://github.com/mozilla-releng/firefox-infra-changelog. The following command will check for new commits and update the files.

python update-files.py

The client script can also be run with the following arguments, depending on your needs:

Short Flag 	Long Flag 	Description
-c 	       --complete 	Runs script for all available repositories
-g 	       --git 	        Runs script only for repos that are on GitHub
-hg 	       --mercurial 	Runs script only for repos that are on Mercurial
-m 	       --manual 	Let the user choose for which repositories the script will run
-l 	       --logger 	Activate logger output in the console
-d 	       --days 	        Generate the changelog.md for <int> amount of days.
-u 	       --update 	Runs script for all available repositories and auto push the changes to github
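
For instance, to build the changelog for the last 7 days, using only the GitHub repositories and with logger output enabled, the invocation would look something like this (the client script's filename may differ; check the repository's README):

python client.py -g -d 7 -l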

If an infra issue has been found, it should be escalated to the appropriate team (network, data center, etc.), depending on the case. Please check the escalation path document (see the links at the bottom of this page) for the right team to contact.

Are we underbidding for AWS spot instances?

We use AWS spot instances for a large proportion of our continuous integration farm's capacity. We have an algorithm that bids for the different instance types within a range of prices. The prices are defined at https://github.com/mozilla/build-cloud-tools/blob/master/configs/watch_pending.cfg#L63 and the algorithm lives at https://github.com/mozilla/build-cloud-tools/blob/master/cloudtools/aws/spot.py. If we are underbidding the current spot prices, we won't get any new AWS instances and pending counts will go up. There's a Nagios check in place that should notify us in #platform-ops-alerts when that happens.
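Conceptually the failure mode is simple: if our configured maximum bid for an instance type is below its current spot price, we win no capacity of that type. The toy illustration below is not the actual cloud-tools algorithm (that lives in spot.py, linked above); the prices are made-up example values.

# Toy illustration of the underbidding failure mode; the real bidding logic
# lives in cloudtools/aws/spot.py and is considerably more involved.
max_bids = {            # our configured price ceilings (example values)
    "c3.xlarge": 0.25,
    "m3.xlarge": 0.20,
}
current_spot_prices = { # current AWS spot prices (example values)
    "c3.xlarge": 0.32,
    "m3.xlarge": 0.18,
}

for instance_type, bid in max_bids.items():
    price = current_spot_prices[instance_type]
    if bid < price:
        # Our bid loses: no new instances of this type, so pending counts climb.
        print("underbidding for {}: bid {:.2f} < spot {:.2f}".format(instance_type, bid, price))
    else:
        print("bid for {} covers the current spot price".format(instance_type))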

Are we unable to connect to AWS due to network issues?

Chaos really. We depend heavily on AWS.

What code/systems have changed?

Figure out how long the backlog has been building, and then check the Maintenance wiki to see if code has recently landed that would be affecting that platform.

Other sources to check include:

Has there been a network or systems event?

We get nagios alerts in #platform-ops-alerts for the following:

  • BGP flaps from AWS: this can affect connectivity between slaves and masters in AWS
  • load on upload/stage: this can affect the download of artifacts for builds and tests, leading to retries and high pending counts

If there are no alerts, it is worth asking in #netops and/or #systems to see if IT is tracking any events not currently on our nagios radar. Also, you can check on Mozilla Status to see if there is any planned or unplanned action.

TaskCluster

Is coalescing working?

We have SETA configured to coalesce (run certain test jobs less often) on TaskCluster on the autoland, mozilla-inbound and graphics branches. This coalescing does not apply to mac tests until bug 1382204 is resolved. If a large number of new test jobs have recently been added, their profile might not be in SETA yet, and they will thus contribute to a higher load. See bug 1386405 for an example of how to resolve this issue.
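As a rough illustration of what coalescing buys us (this is a simplification, not SETA's actual algorithm, and the job names are hypothetical), low-value jobs only run on every Nth push, so a burst of pushes generates far fewer test tasks:

# Simplified illustration of coalescing: low-value jobs run only on every
# Nth push. Not SETA's real implementation; job names are hypothetical.
HIGH_VALUE_JOBS = ["mochitest-1", "xpcshell"]
LOW_VALUE_JOBS = ["reftest-no-accel", "jsreftest"]
N = 5  # run low-value jobs on every 5th push

def jobs_for_push(push_number):
    jobs = list(HIGH_VALUE_JOBS)
    if push_number % N == 0:
        jobs += LOW_VALUE_JOBS
    return jobs

for push_number in range(1, 11):
    print(push_number, jobs_for_push(push_number))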

Are we hitting EBS limits?

When demand for running workers is high, we may hit certain EBS limits on our AWS account and won't be able to spawn new instances. If that happens, we should coordinate with the TaskCluster team and investigate why we have such a large number of running workers. If needed, they can contact Amazon to increase those limits. Bug 1391564 serves as a good example.
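To get a rough sense of how much EBS we are currently consuming in a region, a boto3 sketch like the one below can help; the actual account limits still have to be checked in the AWS console or with the TaskCluster team, and the region shown is only an example.

# Rough sketch: total up EBS volume count and size in one region using boto3.
# Compare the totals with the account limits shown in the AWS console; the
# limits themselves are not queried here, and the region is an example.
import boto3

ec2 = boto3.client("ec2", region_name="us-west-2")

total_volumes = 0
total_gib = 0
for page in ec2.get_paginator("describe_volumes").paginate():
    for volume in page["Volumes"]:
        total_volumes += 1
        total_gib += volume["Size"]

print("{} EBS volumes, {} GiB total".format(total_volumes, total_gib))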

Bad AMIs?

Workers spawned from bad AMIs may not be able to take any jobs, which will in turn result in a growing backlog. In such cases, we should ping someone in #taskcluster to roll back the problematic AMIs to the last known good ones. It can also help to temporarily bump the capacity of certain pools until the pending counts drop back to reasonable values.

Is autologin not working?

Similar to the case above, this prevents the existing workers from running any new tasks; it may be a consequence of changing the cltbld passwords on those machines. The most recent example is bug 1376807.

If pending goes to CRITICAL

1. Make sure that workers are picking up tasks by looking at the specific worker type in TaskCluster (see the sketch after this list for one quick way to check whether a pool is draining). Machines may go "lazy" after a job ends with an exception; those might need a reboot.

2. Read the backscroll in #taskcluster and search Bugzilla under the Taskcluster component to find a correlation with the pending alerts.

3. If no correlation can be found, let people know in #ci about the spike in case it is not expected.
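
One quick way to tell whether a pool is draining at all is to sample its pending count twice and compare, reusing the queue pending endpoint mentioned earlier. In the minimal sketch below the provisioner and worker type are example values.

# Minimal sketch: sample the pending count for one worker pool twice and
# report whether it is draining. Provisioner/worker type are example values.
import json
import time
import urllib.request

URL = "https://queue.taskcluster.net/v1/pending/aws-provisioner-v1/gecko-t-linux-xlarge"

def pending_count():
    with urllib.request.urlopen(URL) as response:
        return json.load(response).get("pendingTasks")

first = pending_count()
time.sleep(300)  # wait five minutes
second = pending_count()

trend = "draining" if second < first else "not draining"
print("pending went from {} to {} ({})".format(first, second, trend))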

Rebooting taskcluster workers

Rebooting taskcluster workers can be done manually or automatically, depending on the type of machine. For gecko-t-win10-64 and gecko-t-linux-talos it can be done via iLO or using the Taskcluster Worker Checker. For gecko-t-osx-1010 it can be done via SSH or via the roller on Taskcluster.

See also

  • CIDuty Escalation Path
  • Backlog Age