CIDuty/How To/High Pending Counts: Difference between revisions

Line 2: Line 2:
{{Release Engineering How To|Dealing with high pending counts}}
{{Release Engineering How To|Dealing with high pending counts}}
= Dealing with high pending counts =
= Dealing with high pending counts =
Demand will sometimes outstrip supply in the slave pool. A high number of pending builds can indicate a problem with the scheduler, (a set of) buildbot-masters, or a particular pool of slaves (and hence possibly puppet).
Demand will sometimes outstrip supply in the worker pool. A high number of pending builds can indicate a problem with the scheduler, (a set of) buildbot-masters, or a particular pool of slaves (and hence possibly puppet).


The number of pending builds is available in [http://builddata.pub.build.mozilla.org/reports/pending/pending.html graphs] and is also displayed per slave type in [https://secure.pub.build.mozilla.org/builddata/reports/slave_health/index.html slave health].  The graphs are helpful for noticing anomalous behavior. You will also see an alert in #buildduty, regarding the high number of pending jobs, for example
The number of pending builds is available in [http://builddata.pub.build.mozilla.org/reports/pending/pending.html graphs] and is also displayed per worker type in [https://secure.pub.build.mozilla.org/builddata/reports/slave_health/index.html slave health].  The graphs are helpful for noticing anomalous behavior. You will also see an alert in #buildduty, regarding the high number of pending jobs, for example


<pre>
<pre>
Line 11: Line 11:


Here are some steps you can use to help figure out why it's happening:
Here are some steps you can use to help figure out why it's happening:
== Is there a spike in pending jobs? ==
https://www.hostedgraphite.com/da5c920d/86a8384e-d9cf-4208-989b-9538a1a53e4b/grafana/dashboard/db/pending


== What platforms are affected? ==
== What platforms are affected? ==
Line 22: Line 26:


These are predictable daily sources of spiky load.
These are predictable daily sources of spiky load.
=== Are the pending jobs in [https://secure.pub.build.mozilla.org/builddata/reports/slave_health/jacuzzis.html jacuzzis]? ===
Lots of pending jobs in a given jacuzzi is generally fine. That's what [https://secure.pub.build.mozilla.org/builddata/reports/slave_health/jacuzzis.html jacuzzis] are for: to make sure a single job type doesn't soak up capacity from the entire pool. If there is an anomalously high number of pending jobs for a single jacuzzi, it's best to look for signs of abuse (below).


=== Did the trees just open following a closure? ===
=== Did the trees just open following a closure? ===
Line 38: Line 39:


=== Is coalescing working? ===
=== Is coalescing working? ===
We have SETA configured to coalesce (run certain test jobs less often). If this breaks, we will see a spike in load and high pending counts. There is a bug open to fix this {{bug|1199347}}.  To see if this is a problem, tail the /builds/buildbot/tests_scheduler/master/twistd.log on buildbot-master81 and ensure there are lines indicating the jobs are being skipped on mozilla-inbound and fx-team, the two branches where SETA is currently enabled.  For example:
We have SETA configured to coalesce (run certain test jobs less often) on taskcluster on the autoland, mozilla-inbound and graphics branches. This coalescing does not apply to mac tests until {{bug|1382204}} is resolved. If a large number of new test jobs have been added recently, their profiles might not be in SETA yet, which contributes to a higher load. See {{bug|1386405}} for an example of how to resolve this issue.
 
<pre>
[kmoir@buildbot-master81.bb.releng.scl3.mozilla.com master]$ tail -f twistd.log
2015-09-10 08:17:54-0700 [-] tests-mozilla-inbound-ubuntu32_vm-opt-unittest-7-3600: skipping with 4/7 important changes since only 81/3600s have elapsed
2015-09-10 08:17:54-0700 [-] tests-mozilla-inbound-snowleopard-debug-unittest-7-3600: skipping with 3/7 important changes since only 2190/3600s have elapsed
</pre>
 
As an aside, another reason that the logs may not show tests being skipped is that both mozilla-inbound and fx-team have been closed for a while. If this is the case, no jobs are being skipped because no jobs are being scheduled.
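
If watching the tail by eye is tedious, the check can be roughed out as a script. The following is a minimal sketch, not an official releng tool: the regular expression is inferred only from the two sample log lines above, and the function name <code>count_skips</code> and the <code>SAMPLE</code> list are hypothetical names introduced for illustration. It counts "skipping" lines per branch so that a branch with zero skips stands out.

```python
import re
import sys
from collections import Counter

# ASSUMPTION: this pattern is inferred from the two sample twistd.log lines
# shown above; it is not a documented log format and may need adjusting.
SKIP_RE = re.compile(
    r"\[-\] tests-(?P<branch>mozilla-inbound|fx-team)-\S+: "
    r"skipping with (?P<seen>\d+)/(?P<needed>\d+) important changes "
    r"since only (?P<elapsed>\d+)/(?P<timeout>\d+)s have elapsed"
)

# The two log lines quoted in this section, plus one unrelated line.
SAMPLE = [
    "2015-09-10 08:17:54-0700 [-] tests-mozilla-inbound-ubuntu32_vm-opt-unittest-7-3600: "
    "skipping with 4/7 important changes since only 81/3600s have elapsed",
    "2015-09-10 08:17:54-0700 [-] tests-mozilla-inbound-snowleopard-debug-unittest-7-3600: "
    "skipping with 3/7 important changes since only 2190/3600s have elapsed",
    "an unrelated log line",
]


def count_skips(lines):
    """Return a Counter mapping branch name to the number of skip lines seen."""
    counts = Counter()
    for line in lines:
        match = SKIP_RE.search(line)
        if match:
            counts[match.group("branch")] += 1
    return counts


if __name__ == "__main__":
    # With a path argument, scan that log file; otherwise scan the sample lines.
    if len(sys.argv) > 1:
        with open(sys.argv[1], errors="replace") as handle:
            counts = count_skips(handle)
    else:
        counts = count_skips(SAMPLE)
    for branch in ("mozilla-inbound", "fx-team"):
        print(f"{branch}: {counts[branch]} skip lines")
```

A count of zero for a branch is only a hint: as noted above, it can also mean the tree is closed and nothing is being scheduled.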


=== Are new AWS instances starting and running buildbot? ===
=== Are new AWS instances starting and running buildbot? ===