CIDuty/How To/High Pending Counts

There is no designated tool for finding these types of abuse. When looking for abuse, I check which branch the pending jobs are coming from (usually try), and then walk back through the [https://treeherder.mozilla.org/#/jobs?repo=try treeherder] history for that branch looking for revisions with multiple jobs triggered.
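If you prefer the command line, recent pushes can also be pulled from Treeherder's REST API to spot revisions carrying an unusual number of jobs. This is only a rough sketch; the endpoint path and parameters are assumptions and may differ from the current API:
<pre>
# Sketch only: assumes Treeherder exposes recent pushes for a repo at
# /api/project/<repo>/push/ (endpoint and parameters are assumptions).
curl -s 'https://treeherder.mozilla.org/api/project/try/push/?count=20' | python -m json.tool | less
</pre>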
=== Are infrastructure problems causing retries? ===
If builds cannot fetch a package, they retry. Retries spike the load, and pending counts rise.
Builds may also be unable to upload their resulting binaries. This happened once when IT made DNS redirect to a server where we didn't have ssh keys to upload the resulting binaries; see {{bug|1198296}} for an example. In this case, builds fail.
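To gauge whether retries are actually spiking, you can grep recent build logs on a worker for retry messages. The log path below is an assumption about where build logs live on the machine:
<pre>
# Rough spot check on a build machine (log location is an assumption)
grep -ci 'retry' /builds/slave/*/build/*.log 2>/dev/null
</pre>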
=== Is coalescing working? ===
We have SETA configured to coalesce (run certain test jobs less often). If this breaks, we will see a spike in load and high pending counts. There is a bug open to fix this: {{bug|1199347}}. To see if this is a problem, tail /builds/buildbot/tests_scheduler/master/twistd.log on buildbot-master81 and ensure there are lines indicating that jobs are being skipped on mozilla-inbound and fx-team, the two branches where SETA is currently enabled. For example:
<pre>
[kmoir@buildbot-master81.bb.releng.scl3.mozilla.com master]$ tail -f twistd.log
2015-09-10 08:17:54-0700 [-] tests-mozilla-inbound-ubuntu32_vm-opt-unittest-7-3600: skipping with 4/7 important changes since only 81/3600s have elapsed
2015-09-10 08:17:54-0700 [-] tests-mozilla-inbound-snowleopard-debug-unittest-7-3600: skipping with 3/7 important changes since only 2190/3600s have elapsed
</pre>
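To get a rough sense of how often SETA is actually skipping jobs, you can count the "skipping" lines in the same log:
<pre>
# Count coalescing decisions in the scheduler log (message format as shown above)
grep -c 'skipping with' /builds/buildbot/tests_scheduler/master/twistd.log
</pre>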
=== Are new AWS instances starting and running buildbot? ===
There will be an alert in #buildduty regarding aws_watch_pending.log not being updated if new instances are not being created. A common cause is a typo in configs/watch_pending.cfg. Look at the logs on the aws manager instance (/var/log/messages); there should be an error message about the typo in the json file. We shouldn't really get to that point because there are tests to verify this, but sometimes it happens. For example, https://bugzilla.mozilla.org/show_bug.cgi?id=1195893#c8. If there are AWS instances starting, ssh to an instance that has recently started and look at /var/log/runner.log for errors. Does /builds/slave/twistd.log indicate that builds are completing on this machine?
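For the config-typo case, a quick sanity check on the aws manager host is to parse the file and look for related errors in the system log. The cloud-tools path matches the one used elsewhere on this page; treating the file as plain JSON is an assumption based on the description above:
<pre>
# Validate the pending config (assumes it parses as plain JSON when correct)
python -m json.tool /builds/aws_manager/cloud-tools/configs/watch_pending.cfg > /dev/null && echo OK
# Look for recent related errors in the system log
grep -i watch_pending /var/log/messages | tail -n 20
</pre>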
=== Is there a problem with the AMI golden master? ===
Each night, we create new AMIs for Amazon instances from our puppet configs. Once an AMI is ready, all new instances are created with this image. If there is a problem with the image, it has to be corrected and new AMIs generated. If the image is broken to the extent that it should be pulled, you can deregister the AMI in the Amazon console so the previous night's AMI is used instead. To quickly bring down the instances that were launched with the problem AMI, you can use this script on aws-manager2.srv.releng.scl3.mozilla.com:
<pre>
cd /builds/aws_manager/cloud-tools
python scripts/aws_terminate_by_ami_id.py  <your ami id>
</pre>
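If you prefer the AWS CLI to the console for the deregistration step, the following standard commands should work, assuming the CLI is installed and configured with suitable credentials on the host you run it from:
<pre>
# Confirm which image you are about to pull
aws ec2 describe-images --image-ids <your ami id>
# Deregister it so the previous night's AMI is used for new instances
aws ec2 deregister-image --image-id <your ami id>
</pre>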
For an example of this problem, see {{bug|1203104}}.
=== Are we underbidding for AWS spot instances? ===
We use AWS spot instances for a large proportion of our continuous integration farm's capacity. We have an algorithm that bids for the different instance types within a range of prices. The prices are defined in https://github.com/mozilla/build-cloud-tools/blob/master/configs/watch_pending.cfg#L50 and the bidding algorithm lives in https://github.com/mozilla/build-cloud-tools/blob/master/cloudtools/aws/spot.py. If we are underbidding relative to the current spot prices, we won't get any new AWS instances and pending counts will go up.
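To compare our configured bids against current market prices, you can pull recent spot prices with the AWS CLI. The region and instance type below are placeholders; substitute the ones we actually bid on from watch_pending.cfg:
<pre>
# Recent spot prices for one instance type (region/type are placeholders)
aws ec2 describe-spot-price-history \
    --region us-east-1 \
    --instance-types c3.xlarge \
    --product-descriptions "Linux/UNIX" \
    --max-items 10
</pre>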
=== Are ssh keys a problem on the masters? ===
The Buildbot ssh keys may have a problem; see {{bug|1198332}} for an example.
=== Are there problems connecting to the Buildbot database? ===
Buildbot cannot connect to its database due to network or other issues. The pending count will probably not increase; it will just stay the same because jobs aren't deleted from the database as they complete.
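A simple way to check connectivity from a master is to query the database directly. The host, user, database name, and exact schema below are assumptions (a standard Buildbot schema keeps pending jobs in a buildrequests table with complete = 0):
<pre>
# From a buildbot master: does the database answer, and how many requests are pending?
# (host, user, database name, and schema details are assumptions)
mysql -h <db host> -u <db user> -p -e 'SELECT COUNT(*) FROM buildrequests WHERE complete = 0;' <db name>
</pre>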
=== Cannot connect to AWS due to network issues ===
Chaos, really. We depend heavily on AWS.
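As a quick sanity check that the EC2 API is reachable, a harmless read-only call can be made from the aws manager host, assuming the AWS CLI is available and configured there:
<pre>
# Read-only call; only succeeds if we can reach and authenticate to the EC2 API
aws ec2 describe-regions --region us-east-1
</pre>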


== What code/systems have changed? ==