ReleaseEngineering/Buildduty/StandupMeetingNotesQ12015

'''2015-03-31'''<br />


* [[ReleaseEngineering/Buildduty/Reconfigs]]
* [http://coop.deadsquid.com/2015/03/the-changing-face-of-buildduty/ http://coop.deadsquid.com/2015/03/the-changing-face-of-buildduty/]
'''2015-03-30'''<br />


* buildduty report cleanup <- lots




'''2015-03-27'''<br />


* [https://bugzil.la/1138234 '''https://bugzil.la/1138234'''] - b2g_bumper stalling frequently
** is this still happening or can we resolve this? RESOLVED




'''2015-03-26'''<br />


* [https://bugzil.la/1147853 '''https://bugzil.la/1147853'''] - Widespread "InternalError: Starting video failed" failures across all trees on AWS-based test instances
** possibly related to runslave changes ([https://bugzil.la/1143018 '''https://bugzil.la/1143018''']) and interaction with runner
* Q1 is almost done
** what do we need to document/update prior to buildduty hand-off next week?
* testing out latest mh prod rev on cedar in a canary fashion :)
** be better if releng stays under the radar for at least the rest of the day
* jlund|buildduty> nagios-releng: downtime vcssync2.srv.releng.usw2.mozilla.com 1h "bug 1135266"
* disabling two foopies for recovery
** [https://bugzil.la/1146130 https://bugzil.la/1146130]


'''2015-03-25'''<br />


* no-op reconfig for fubar - [https://bugzil.la/1147314 https://bugzil.la/1147314]
* full reconfig for bhearsum
* [https://bugzil.la/1143018 '''https://bugzil.la/1143018'''] - Update runslave.py with current machine types and basedirs
** updated basedirs in slavealloc db
** deployed puppet change to runslave.py
* ryanvm wants a number of bugs looked at that result in windows slaves ending up at the start screen
** namely [https://bugzil.la/1135545 https://bugzil.la/1135545]
*** see also
**** [https://bugzil.la/924728 https://bugzil.la/924728]
**** [https://bugzil.la/1090633 https://bugzil.la/1090633]
** should we ping markco/Q?
*** ni: Q
*** commented {{bug|1135545#c89}}
* did another reconfig
* landed ec2 windows slave health fix
** needs second patch
* backed out [https://bugzil.la/1146379 https://bugzil.la/1146379] hgtool should avoid pulling if it already has a revision
** context comment 3-5
* landed [https://bugzil.la/1146855 https://bugzil.la/1146855] across all trees


'''2015-03-24'''<br />


* [https://bugzil.la/1146855 '''https://bugzil.la/1146855'''] - added dlls to tooltool to make minidump_stackwalk work (manifest sketch at the end of this entry)
* broken reconfig this morning?
** [https://hg.mozilla.org/build/buildbot-configs/rev/22f9faca403c https://hg.mozilla.org/build/buildbot-configs/rev/22f9faca403c] - missing comma
* aws instances stuck in long running or something? high pending
** "high" is relative: there were *maybe* 200 jobs pending on AWS pools
** [2:32pm] Callek: coop|buildduty: RyanVM|sheriffduty jlund|buildduty: fyi -- [https://github.com/mozilla/build-cloud-tools/pull/54 https://github.com/mozilla/build-cloud-tools/pull/54] mgerva just updated our limits for aws linux64 testers by about 50%, nick cautioned to keep an eye on master health with this increase.
* [https://bugzil.la/1145387 '''https://bugzil.la/1145387'''] - t-yosemite-r5-0073 can't connect to a master (specifically bm107)
** notes in bug, gracefully restarting bm107 to see if that helps
** slave connected to bm108 and is fine now
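For the tooltool change above: a minimal sketch of how one of those dlls could be described in a tooltool manifest, assuming the usual record layout (size/digest/algorithm/filename with a sha512 digest). The zip name below is hypothetical, not the actual artifact from bug 1146855.
<pre>
import hashlib
import json
import os

def manifest_entry(path):
    """Build a tooltool-style manifest record (sha512 digest) for one file."""
    digest = hashlib.sha512()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(1024 * 1024), b''):
            digest.update(chunk)
    return {
        'size': os.path.getsize(path),
        'digest': digest.hexdigest(),
        'algorithm': 'sha512',
        'filename': os.path.basename(path),
    }

if __name__ == '__main__':
    # hypothetical dll bundle; the real artifact is attached to bug 1146855
    print(json.dumps([manifest_entry('minidump_stackwalk-dlls.zip')], indent=2))
</pre>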


'''2015-03-23'''<br />


* coop working on [https://bugzil.la/978928 '''https://bugzil.la/978928'''] - Reconfigs should be automatic, and scheduled via a cron job




'''2015-03-20'''<br />


* chemspill in progress, ***NO UNNECESSARY CHANGES***
* coop going through "All dependencies resolved" section of buildduty report
** doing all non-pandas first
** will do a second, panda-only pass after


'''2015-03-19'''<br />


* [https://etherpad.mozilla.org/bz1144762 https://etherpad.mozilla.org/bz1144762]
* chemspill coming from pwn2own


'''2015-03-18'''<br />


* [https://bugzil.la/1144762 https://bugzil.la/1144762] - more hg timeouts, new bug filed for today's fun
** happening again this morning, possibly different cause
** from fox2mike: [https://fox2mike.pastebin.mozilla.org/8826256 https://fox2mike.pastebin.mozilla.org/8826256]
** from cyliang: [https://graphite-scl3.mozilla.org/dashboard/#http-zlbs https://graphite-scl3.mozilla.org/dashboard/#http-zlbs]
*** notice table-top on outbound
** tackling from another end by re-visiting this bug: {{bug|1113460#c16}}
** [https://etherpad.mozilla.org/bz1144762 https://etherpad.mozilla.org/bz1144762]
** killed a ton of RETRY l10n and fuzzer jobs that were hitting hg.m.o and 503'ing


'''2015-03-17'''<br />


* tst-linux64-spot pending >4500 as of 6:30am PT (!)
** inadvertent puppet change: [http://hg.mozilla.org/build/puppet/rev/dfe40f33e6d0#l2.1 http://hg.mozilla.org/build/puppet/rev/dfe40f33e6d0#l2.1]
*** mrrrgn deploying fix and rescuing instances
* [https://bugzil.la/1143681 '''https://bugzil.la/1143681'''] - Some AWS test slaves not being recycled as expected
** found a way to track these down (see the sketch at the end of this entry)
*** in the AWS console, search for the instances in the Spot Requests tab. You can click on the instance ID to get more info.
*** e.g. for tst-linux64-spot-233, the instance has no name associated and is marked as "shutting-down"
* [https://bugzil.la/1144362 https://bugzil.la/1144362] - massive spike in hg load
** possibly due to new/rescued instances from the morning (re)cloning mh & tools
*** negative feedback loop?
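The console lookup for the non-recycled spot instances (bug 1143681 above) can also be scripted. A rough sketch using boto3; the region, the filter choice, and the "shutting-down or missing Name tag" heuristic are assumptions for illustration, not an existing releng tool.
<pre>
import boto3

def find_stuck_spot_instances(region='us-east-1'):
    """Print spot instances that are shutting down or have lost their Name tag."""
    ec2 = boto3.client('ec2', region_name=region)
    pages = ec2.get_paginator('describe_instances').paginate(
        Filters=[{'Name': 'instance-lifecycle', 'Values': ['spot']}])
    for page in pages:
        for reservation in page['Reservations']:
            for inst in reservation['Instances']:
                name = next((t['Value'] for t in inst.get('Tags', [])
                             if t['Key'] == 'Name'), None)
                state = inst['State']['Name']
                if state == 'shutting-down' or not name:
                    print(inst['InstanceId'], state, name or '<no Name tag>')

if __name__ == '__main__':
    find_stuck_spot_instances()
</pre>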


'''2015-03-16'''<br />


* filed [https://bugzil.la/1143681 '''https://bugzil.la/1143681'''] - Some AWS test slaves not being recycled as expected
* how can we stop tests from being part of --all on try: {{bug|1143259#c4}}


'''2015-03-13'''<br />


* buildbot DB "too many connections" again (perhaps the DBAs can increase the connection pool limits?)
* need a button in slave health that automatically files a diagnostics bug for a given slave (see the API sketch at the end of this entry)
** should disable the slave if not already disabled
** should do the bug linking automatically
** should have a small text entry box for the description of the diagnostics bug, i.e. why are we asking for diagnostics
** would hopefully prevent sheriffs from just taking slaves offline and waiting for us to perform the next step(s)
* filed [https://bugzil.la/1143018 '''https://bugzil.la/1143018'''] - Update runslave.py with current machine types and basedirs
** we essentially guess at the builddir in most cases these days(!)
* [https://bugzil.la/1142825 https://bugzil.la/1142825] - high windows test pending
** rebooting idle win7 machines, re-imaged 2 others
** found some try commits from Wednesday, Mar 11, with duplicate jobs:
*** [https://treeherder.mozilla.org/#/jobs?repo=try&revision=768730b3ae1c https://treeherder.mozilla.org/#/jobs?repo=try&revision=768730b3ae1c]
*** [https://treeherder.mozilla.org/#/jobs?repo=try&revision=fb39553b0473 https://treeherder.mozilla.org/#/jobs?repo=try&revision=fb39553b0473]
*** both from mchang@m.o
*** some jobs look like they're still running in treeherder, buildapi says no
*** checking Windows test masters for slaves that have been running jobs for a while (probably hung)
* investigated tree closure bugs that resulted from reconfig and mh :(
** [https://bugzil.la/1143227 https://bugzil.la/1143227]
** {{bug|1142553#c11}}
** merge mh to prod and bump mh for separate failure: [http://hg.mozilla.org/build/mozharness/rev/18a18416de6a http://hg.mozilla.org/build/mozharness/rev/18a18416de6a]
* [https://bugzil.la/1143259 '''https://bugzil.la/1143259'''] - tests run by default that are failing more than 80 percent of the time
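For the slave-health diagnostics-button idea above: a rough sketch of the bug-filing half, assuming the standard Bugzilla REST create-bug endpoint and an API key. The product/component values and the "blocks" linkage are illustrative guesses, not settled details.
<pre>
import requests

BUGZILLA = 'https://bugzilla.mozilla.org/rest/bug'

def file_diagnostics_bug(slave, reason, tracking_bug, api_key):
    """File a diagnostics bug for a slave and link it to the slave's tracking bug."""
    payload = {
        'product': 'Infrastructure & Operations',  # assumed product/component
        'component': 'DCOps',
        'version': 'other',
        'summary': '%s needs diagnostics' % slave,
        'description': reason,          # the small text box from the wishlist above
        'blocks': [tracking_bug],       # the automatic bug linking
    }
    resp = requests.post(BUGZILLA, params={'api_key': api_key}, json=payload)
    resp.raise_for_status()
    return resp.json()['id']
</pre>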




'''2015-03-12'''<br />


* filed [https://bugzil.la/1142493 https://bugzil.la/1142493] - panda-relay-037 is down
* Win7 test pending >2000 (unclear on why)
* tree closure
** caused by RyanVM


'''2015-03-11'''<br />


* filed [https://bugzil.la/1142103 '''https://bugzil.la/1142103'''] - Scheduling issues with Win64 xulrunner nightlies on try
** getting daily dead command queue items from this
* blog post: [http://coop.deadsquid.com/2015/03/better-releng-patch-contribution-workflow/ http://coop.deadsquid.com/2015/03/better-releng-patch-contribution-workflow/]
* [https://bugzil.la/1088032 https://bugzil.la/1088032] - Test slaves sometimes fail to start buildbot after a reboot
** coop investigating


'''2015-03-10'''<br />


* [https://bugzil.la/1141396 https://bugzil.la/1141396] - Mulet Nightlies all failing with FATAL
* [https://bugzil.la/1141416 https://bugzil.la/1141416] - Fix the slaves broken by talos's inability to deploy an update
** longstanding issue, we should really fix this
* a few BGP flaps reported in #buildduty this morning
* [https://bugzil.la/1139764 '''https://bugzil.la/1139764'''] - terminated tst-ubuntu14-ec2-shu
* sent mail to group re: AWS sanity checker long-running instances
* [https://bugzil.la/1141454 https://bugzil.la/1141454] - Buildbot DB max connections overnight
* priority backlog triage:
** [https://bugzil.la/1139763 https://bugzil.la/1139763] - add windows to jacuzzi
*** patch r-, need follow up
** [https://bugzil.la/1060214 <s>https://bugzil.la/1060214</s>] <s>- Intermittent command timed out: 10800</s>
** [https://bugzil.la/1123025 '''<s>https://bugzil.la/1123025</s>'''] <s>- b2g emulator nightlies (sometimes?) use a test package from a previous nightly</s>
** [https://bugzil.la/1055912 <s>https://bugzil.la/1055912</s>] <s>- Clobberer on try is apparently not working.</s>
** [https://bugzil.la/1141416 '''<s>https://bugzil.la/1141416</s>'''] <s>- Fix the slaves broken by talos's inability to deploy an update</s>


'''2015-03-09'''<br />
 


* [https://bugzil.la/1140989 '''https://bugzil.la/1140989'''] - zlb8.ops.phx1.mozilla.com:Load is CRITICAL
** {{bug|1126825#c11}}
* reconfig
* [https://bugzil.la/1140539 '''https://bugzil.la/1140539'''] - slow query log report for buildbot2
** have some info now to start reducing load on db
* filed [https://bugzil.la/1141217 '''https://bugzil.la/1141217'''] - nagios alerts for unassigned blocker bugs in all releng bugzilla components
** hope to avoid situation from last Friday where RyanVM's blocker bug sat for hours
* [https://bugzil.la/1139764 '''https://bugzil.la/1139764'''] - Slave loan request for a tst-linux32-spot instance
** created an Ubuntu 14.04 instance for Shu to test a kernel theory on


'''2015-03-06'''<br />


* [https://bugzil.la/1140304 https://bugzil.la/1140304] - All Gij jobs are permanently red for v2.2 branch
** uploaded [http://pypi.pub.build.mozilla.org/pub/mozprofile-0.21.tar.gz http://pypi.pub.build.mozilla.org/pub/mozprofile-0.21.tar.gz]
* filed [https://bugzil.la/1140398 '''https://bugzil.la/1140398'''] - Send nagios load alerts for upload1.dmz.scl3.mozilla.com to #buildduty IRC channel
* closed out tree closure bugs from this week after making sure bugs for follow-up issues were on file
* filed [https://bugzil.la/1140419 '''https://bugzil.la/1140419'''] - [tracking] Switch all releng RoR to github
** will send mail/write blogpost today
* working on [https://bugzil.la/1140479 '''https://bugzil.la/1140479'''] - Improvements to end_to_end_reconfig.sh script


'''2015-03-05'''<br />


* never got to adding win64 m-a nightlies to jacuzzi [https://bugzil.la/1139763 https://bugzil.la/1139763]
* need to enable slaves from [https://bugzil.la/1138672 '''https://bugzil.la/1138672''']
* the end_to_end script comments in bugs that mozharness changes are live in production. this is no longer the case for all our build + test jobs (most things aside from vcs-sync, bumper, etc.)
** should we still be automatically updating bugs for mh after a reconfig?
** we need a way to roll out changes to mh on a regular cadence. right now it's up to the individual to update mozharness.json with a REV they want applied and, consequently, whatever mh patches are in between are also applied... (see the pin-bump sketch at the end of this entry)
** coop to drop mozharness from end-to-end-reconfig script and email public list
* added [http://pypi.pvt.build.mozilla.org/pub/mozrunner-6.6.tar.gz http://pypi.pvt.build.mozilla.org/pub/mozrunner-6.6.tar.gz]
* talked to catlee re: releng-try pipeline
** fully supportive
** one wrinkle: how to tackle release tagging
** coop will get bugs filed today
* add 4-repo view to slave health?
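On the mozharness.json point above: a minimal sketch of what a scripted pin bump could look like, assuming the pin file is a small JSON blob with a "repo" URL and a "revision" field. The exact schema and location of the file aren't spelled out in these notes, so treat the field names as assumptions.
<pre>
import json

def bump_mozharness_pin(path, new_rev):
    """Point the production mozharness pin at a specific revision."""
    with open(path) as f:
        pin = json.load(f)   # e.g. {"repo": "...", "revision": "..."} (assumed shape)
    old_rev = pin.get('revision')
    pin['revision'] = new_rev
    with open(path, 'w') as f:
        json.dump(pin, f, indent=2, sort_keys=True)
        f.write('\n')
    print('mozharness pin: %s -> %s' % (old_rev, new_rev))
</pre>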


'''2015-03-04'''<br />


* [https://bugzil.la/1138937 '''https://bugzil.la/1138937'''] - Slave loan request for a t-w864-ix machine
* reconfig in progress
* buildduty report:
** re-imaging a bunch of slaves to help with capacity
* [https://bugzil.la/1138672 '''https://bugzil.la/1138672'''] - vlan request - move bld-lion-r5-[006-015] machines from prod build pool to try build pool (needs to be enabled)
* test master upgrades (done)
** [https://bugzil.la/1136527 '''https://bugzil.la/1136527'''] - upgrade ec2 linux64 test masters from m3.medium to m3.large (again)
** [https://bugzil.la/1135664 '''https://bugzil.la/1135664'''] - Some masters don't have swap enabled
* (hwine) meeting with Linda (head of #moc)
** make more specific requests from #moc
** share top issues with #moc
** when: next meeting is 13th
*** come up with prioritized list of releng needs by early next week
* coop to file bugs re: releng-try improvements
** add builderlists/dumpmasters diff to travis
** switch RoR for key repos to github
*** reverse VCS sync flow
*** enable travis testing for forks - this is done on a per-fork basis by the owners of the forks. PR's will get travis jobs regardless.
* no-op reconfig on schedulers
* [https://bugzil.la/1123911 '''https://bugzil.la/1123911'''] - fw1.releng.scl3.mozilla.net routing failures - BGP use1


upgrade test linux masters ([https://bugzil.la/1136527 https://bugzil.la/1136527]):<br />


* bm51 (complete)
* bm53 (complete)
* bm117-tests1-linux64 (complete)
* bm52-tests1-linux64 (complete)
* bm54-tests1-linux64 (complete)
* use1
** bm67-tests1-linux64 (complete)
** bm113-tests1-linux64 (complete)
** bm114-tests1-linux64 (complete)
** bm120-tests1-linux64 (complete)
** bm121-tests1-linux64 (complete)
* usw2
** bm68-tests1-linux64 (complete)
** bm115-tests1-linux64 (complete)
** bm116-tests1-linux64 (complete)
** bm118-tests1-linux64 (complete)
** bm122-tests1-linux64 (complete)
** bm123-tests1-linux64 (started)


add swap ([https://bugzil.la/1135664 https://bugzil.la/1135664]):<br />


* bm53 (complete)
* buildbot-master54 (complete)
* use1
** buildbot-master117 BAD
** buildbot-master120 BAD (complete)
** buildbot-master121 BAD (complete)
* usw2
** buildbot-master68 (complete)
** buildbot-master115 (complete)
** buildbot-master116 BAD (complete)
** buildbot-master118 BAD (complete)
** buildbot-master122 BAD (complete)
** buildbot-master123 BAD


buildbot-master04 BAD<br />
buildbot-master05 BAD<br />
buildbot-master06 BAD<br />
buildbot-master66 BAD<br />
buildbot-master72 BAD<br />
buildbot-master73 BAD<br />
buildbot-master74 BAD<br />
buildbot-master78 BAD<br />
buildbot-master79 BAD<br />
buildbot-master91 BAD<br />
'''2015-03-03'''<br />


* [https://bugzil.la/1138955 '''https://bugzil.la/1138955'''] - Slow Builds and lagginess
** tree closure due to backlog (10:00am ET)
** *mostly* unexpected load (extra poorly-timed pushes to try), although a bunch of test instances not recycling properly
*** coop is investigating these
* [https://bugzil.la/1041763 https://bugzil.la/1041763] - upgrade ec2 linux64 test masters from m3.medium to m3.large
** jlund starting to iterate through list today
* [https://bugzil.la/1139029 '''https://bugzil.la/1139029'''] - Turn off OSX Gip (Gaia UI tests) on all branches
* [https://bugzil.la/1139023 '''https://bugzil.la/1139023'''] - Turn off Fx desktop OSX 10.8 tests on the B2G release branches
** coop landing patches from RyanVM to reduce b2g test load on 10.8


'''2015-03-02'''<br />


* [https://bugzil.la/1138155 https://bugzil.la/1138155] - set up replacement masters for Fallen
* [https://bugzil.la/1137047 https://bugzil.la/1137047] - Rebalance the Mac build slaves between buildpool and trybuildpool


'''2015-02-27'''<br />


* queue issues on build masters due to graphene jobs
** should be resolved by reconfig this morning
* re-imaging some 10.8 machines as 10.10
** 10.10 will be running opt jobs on inbound, 10.8 debug on inbound + opt on release branches
** sheriffs are understandably worried about capacity issues in both pools
* re-re-imaging talos-linux32-ix-0[01,26]
** may have an underlying issue with the re-imaging process for linux hw




'''2015-02-26'''<br />


* things to discuss:
** [https://bugzil.la/1137047 https://bugzil.la/1137047] - Rebalance the Mac build slaves between buildpool and trybuildpool
* filed: [https://bugzil.la/1137322 '''https://bugzil.la/1137322'''] - osx test slaves are failing to download a test zip from similar rev


'''2015-02-25'''<br />


* things to circle back on today:
** [https://bugzil.la/1136195 https://bugzil.la/1136195] - Frequent download timeouts across all trees
*** [https://bugzil.la/1130242 https://bugzil.la/1130242] - request for throughput data on the SCL3 ZLBs for the past 12 hours
** [https://bugzil.la/1136465 https://bugzil.la/1136465] - New: Spot instances failing with remoteFailed: [Failure instance: Traceback (failure with no frames): <class 'twisted.spread.pb.PBConnectionLost'>:
** [Bug 1041763] upgrade ec2 linux64 test masters from m3.medium to m3.large
* [https://bugzil.la/1136531 <s>https://bugzil.la/1136531</s>] <s>- Slave loan request for a tst-linux64-spot vm</s>


'''2015-02-24'''<br />


* [https://bugzil.la/1136195 '''https://bugzil.la/1136195'''] - Frequent download timeouts across all trees
** related to release traffic?
* release reconfigs don't log themselves
** should probably reconfig everything not just build/scheduler masters
*** i think this takes care of itself once masters start updating themselves based on tag updates
* tree closure
** symptom: {{bug|1136465#c0}}
** diagnosis: {{bug|1136465#c1}}
** an aid to make recovery faster:
*** [https://bugzil.la/1136527 https://bugzil.la/1136527]
**** note this accidentally happened as part of the ghost work:
***** {{bug|1126428#c66}}


10:35:13 &lt;hwine&gt; ah, I see coop already asked Usul about 0900PT <br />
 
10:36:34 &lt;hwine&gt; ashlee: sounds like our theory of load isn't right - can someone check further, please? [https://bugzil.la/1136195#c1 https://bugzil.la/1136195#c1]<br />
 
10:38:02 &lt;•pir&gt; hwine: check... what?<br />
10:38:56 &lt;hwine&gt; ftp.m.o is timing out and has closed trees. Our guess was release day load, but that appears not to be it<br />
 
10:39:33 &lt;•pir&gt; hwine: I can't see any timeouts in that link, I may be missing something.<br />
10:39:48 &lt;jlund&gt; ashlee: hwine catlee-lunch we have {{bug|1130242#c4}} to avail of now too it seems. might provide some insight to health or even a possible cause as to why we are hitting timeouts since the the change time lines up within the time of reported timeouts.<br />
 
10:40:35 &lt;jlund&gt; pir: that's the bug tracking timeouts. there are timeouts across many of our continuous integration jobs: [http://ftp.mozilla.org/pub/mozilla.org/firefox/tinderbox-builds/mozilla-inbound-linux/1424790646/mozilla-inbound_ubuntu32_vm_test-jittest-1-bm01-tests1-linux32-build102.txt.gz http://ftp.mozilla.org/pub/mozilla.org/firefox/tinderbox-builds/mozilla-inbound-linux/1424790646/mozilla-inbound_ubuntu32_vm_test-jittest-1-bm01-tests1-linux32-build102.txt.gz]<br />
10:40:39 mbrandt → mbrandt|lunch<br />
 
10:40:49 &lt;•pir&gt; hwine: we don't have a lot of visibility into how ftp.m.o works or is not working. This isn't a good situation, but sadly how it is.<br />
10:41:28 &lt;hwine&gt; pir: right, my understanding is that you (moc) coordinates all the deeper dives for IT infrastructure (which ftp.m.o still is)<br />
 
10:42:15 &lt;•pir&gt; hwine: To clarify, I don't think anyone has a lot of visibility into how ftp.m.o is working :(<br />
 
10:42:19 &lt;•pir&gt; it's a mess<br />
10:42:34 &lt;•pir&gt; ashlee: want to loop in C ?<br />
 
10:43:00 &lt;•pir&gt; (and I think mixing continuous build traffic and release traffic is insane, personally)<br />
10:43:26 &lt;•pir&gt; jlund: yes, that's what I was reading and not seeing anythig<br />
 
10:43:43 &lt;hwine&gt; pir: system should handle it fine (has in the past) release traffic s/b minimal since we use CDNs<br />
10:44:12 &lt;•pir&gt; hwine: should be. isn't.<br />
 
10:44:18 &lt;•ashlee&gt; pir sure<br />
10:47:53 &lt;•pir&gt; the load on the ftp servers is... minimal<br />
 
10:48:27 &lt;•fox2mike&gt; jlund: may I ask where these timeouts are happening from? <br />
10:49:15 &lt;jlund&gt; hm, so load may not be the issue. begs the question &quot;what's changed&quot;<br />
 
10:49:17 &lt;•pir&gt; and what the timeouts actually are. I can't see anything timing out in the listed logs<br />
10:49:37 &lt;•pir&gt; jlund: for ftp.m.o? nothing that I'm aware of<br />
 
10:50:06 &lt;cyliang&gt; no bandwith alerts from zeus. looking at the load balancers to see if anything pops out.<br />
10:50:09 &lt;•ashish&gt; from [http://ftp.mozilla.org/pub/mozilla.org/firefox/tinderbox-builds/mozilla-inbound-linux/1424790646/mozilla-inbound_ubuntu32_vm_test-jittest-1-bm01-tests1-linux32-build102.txt.gz http://ftp.mozilla.org/pub/mozilla.org/firefox/tinderbox-builds/mozilla-inbound-linux/1424790646/mozilla-inbound_ubuntu32_vm_test-jittest-1-bm01-tests1-linux32-build102.txt.gz]<br />
 
10:50:13 &lt;•ashish&gt; i see<br />
 
10:50:14 &lt;•ashish&gt; 08:00:28 WARNING - Timed out accessing [http://ftp.mozilla.org.proxxy1.srv.releng.use1.mozilla.com/pub/mozilla.org/firefox/tinderbox-builds/mozilla-inbound-linux/1424790646/firefox-39.0a1.en-US.linux-i686.tests.zip http://ftp.mozilla.org.proxxy1.srv.releng.use1.mozilla.com/pub/mozilla.org/firefox/tinderbox-builds/mozilla-inbound-linux/1424790646/firefox-39.0a1.en-US.linux-i686.tests.zip]: timed out<br />
10:50:18 &lt;•ashish&gt; what is that server?<br />
 
10:50:31 &lt;•fox2mike&gt; USE1<br />
10:50:34 &lt;•fox2mike&gt; FUCK YEAH! :p <br />
 
10:50:35 → agibson joined (agibson@moz-j04gi9.cable.virginm.net)<br />
10:50:36 &lt;•fox2mike&gt; the cloud baby<br />
 
10:50:55 &lt;•fox2mike&gt; jlund: I bet if you were to try this from other amazon regions, you might not his this <br />
10:50:57 &lt;•ashish&gt; i don't see timeouts for [http://ftp.mozilla.org/ http://ftp.mozilla.org/]*<br />
 
10:50:59 &lt;cyliang&gt; fox2mike: Is this the same timeout stuff as last time?<br />
10:51:03 &lt;•ashish&gt; (in that log)<br />
 
10:51:03 &lt;•fox2mike&gt; I'm guessing<br />
10:51:06 &lt;•fox2mike&gt; cyliang: ^ <br />
 
10:51:17 &lt;•fox2mike&gt; because the last time we saw random issues<br />
10:51:21 &lt;•fox2mike&gt; it was all us-east1 <br />
 
10:51:39 &lt;•fox2mike&gt; jlund: for reference - {{bug|1130386}} <br />
10:52:11 &lt;•fox2mike&gt; our infra is the same, we can all save time by trying to see if you guys hit this from any other amazon region (if that's possible) <br />
 
10:53:08 &lt;jlund&gt; proxxy is a host from aws but after failing to try that a few times, we poke ftp directly and timeout after 30 min:<br />
10:53:12 &lt;jlund&gt; [https://www.irccloud.com/pastebin/WmSehqzj https://www.irccloud.com/pastebin/WmSehqzj]<br />
 
10:53:13 &lt;wesley&gt; jlund's shortened url is [http://tinyurl.com/q47zbvl http://tinyurl.com/q47zbvl]<br />
10:53:15 &lt;•pir&gt; yay cloud<br />
10:54:36 &lt;•pir&gt; jlund: that download from ftp-ssl works fine from anywhere I have access to test it<br />
10:55:06 &lt;•fox2mike&gt; jlund: where did that fail from? <br />
10:55:40 &lt;jlund&gt; sure, and it doesn't always timeout, but of our thousands of jobs, a bunch have failed and timed out.<br />
10:55:54 &lt;unixfairy&gt; jlund can you be more specific <br />
10:56:10 &lt;jlund&gt; fox2mike: same log example: [http://ftp.mozilla.org/pub/mozilla.org/firefox/tinderbox-builds/mozilla-inbound-linux/1424790646/mozilla-inbound_ubuntu32_vm_test-jittest-1-bm01-tests1-linux32-build102.txt.gz http://ftp.mozilla.org/pub/mozilla.org/firefox/tinderbox-builds/mozilla-inbound-linux/1424790646/mozilla-inbound_ubuntu32_vm_test-jittest-1-bm01-tests1-linux32-build102.txt.gz]<br />
10:56:56 &lt;jlund&gt; sorry, I don't know exact failure rate numbers. RyanVM|sheriffduty may know more.<br />
10:57:03 &lt;•fox2mike&gt; jlund: so<br />
10:57:03 &lt;•fox2mike&gt; builder: mozilla-inbound_ubuntu32_vm_test-jittest-1<br />
10:57:04 &lt;•fox2mike&gt; slave: tst-linux32-spot-105<br />
10:57:10 &lt;•fox2mike&gt; that's from amazon again<br />
10:57:18 &lt;•fox2mike&gt; tst-linux32-spot-105 <br />
10:57:23 &lt;•fox2mike&gt; that's a spot instance <br />
10:57:34 &lt;•pir&gt; yep, master: [http://buildbot-master01.bb.releng.use1.mozilla.com:8201/ http://buildbot-master01.bb.releng.use1.mozilla.com:8201/]<br />
10:57:40 &lt;•fox2mike&gt; us-east1 <br />
10:57:45 &lt;•pir&gt; so far the connection I see is use1 as fox2mike says<br />
10:58:14 &lt;•fox2mike&gt; we've been through this before :) <br />
10:58:17 &lt;•fox2mike&gt; is all I'm saying<br />
10:59:02 &lt;jlund&gt; sure. let's make sure we can narrow it down to that. I'll see if I can track down more jobs that have hit the timeout where slaves are not in aws.<br />
10:59:08 &lt;jlund&gt; thanks for your help so far.<br />
10:59:49 &lt;•fox2mike&gt; jlund: aws is fine, anything that's a non use1 failure<br />
10:59:55 &lt;•fox2mike&gt; before we go to non aws failure <br />
11:00:06 &lt;•fox2mike&gt; but your case will narrow it down further<br />
11:00:07 &lt;•fox2mike&gt; thanks! <br />
11:00:11 &lt;jlund&gt; rgr<br />
11:00:15 &lt;RyanVM|sheriffduty&gt; fox2mike: things have been quiet for a little while now<br />
11:00:26 &lt;RyanVM|sheriffduty&gt; but we had a lull awhile ago too before another spike<br />
11:00:36 &lt;RyanVM|sheriffduty&gt; so I'm not feeling overly inclined to say that things are resolved<br />
11:00:55 jp-food → jp<br />
11:01:00 &lt;jlund&gt; RyanVM|sheriffduty: have any mac or windows jobs hit this timeout?<br />
11:01:08 &lt;RyanVM|sheriffduty&gt; yes<br />
11:01:13 &lt;RyanVM|sheriffduty&gt; windows definitely<br />
11:01:26 &lt;jlund&gt; k, fox2mike ^ we don't have any windows machines in the cloud<br />
11:01:48 &lt;RyanVM|sheriffduty&gt; random example - [https://treeherder.mozilla.org/logviewer.html#?job_id=6928327&repo=mozilla-inbound https://treeherder.mozilla.org/logviewer.html#?job_id=6928327&amp;repo=mozilla-inbound]<br />
11:01:54 &lt;•ashish&gt; are there logs from thoes machines?<br />
11:01:55 &lt;•ashish&gt; ty<br />
11:02:00 → KaiRo joined (robert@moz-dqe9u3.highway.telekom.at)<br />
11:02:17 &lt;RyanVM|sheriffduty&gt; OSX - [https://treeherder.mozilla.org/logviewer.html#?job_id=6924712&repo=mozilla-inbound https://treeherder.mozilla.org/logviewer.html#?job_id=6924712&amp;repo=mozilla-inbound]<br />
11:02:51 jlund → jlund|mtg<br />
11:02:57 ⇐ agibson quit (agibson@moz-j04gi9.cable.virginm.net)<br />
11:04:28 jlund|mtg → jlund<br />
11:04:36 &lt;KaiRo&gt; who is the right contact for getting HTTP requests to a Mozilla-owned domain set up to redirect to a different website (another Mozilla-owned domain)?<br />
11:04:50 &lt;KaiRo&gt; the case in question is bug 998793<br />
11:05:27 → agibson joined (agibson@moz-j04gi9.cable.virginm.net)<br />
11:06:48 &lt;•ashish&gt; KaiRo: looks like that IP is hosted/maintained by the community<br />
11:07:05 &lt;•pir&gt; KaiRo: 173.5.47.78.in-addr.arpa domain name pointer static.173.5.47.78.clients.your-server.de.<br />
11:07:09 &lt;•pir&gt; KaiRo: not ours<br />
11:07:39 &lt;jlund&gt; so, it sounds like we have confirmed that this outside aws. for completeness, I'll see if I can find this happening on usw-2 instances too.<br />
11:08:01 agibson → agibson|brb<br />
11:08:23 &lt;KaiRo&gt; ashish: yes, the IP is right now not Mozilla-hosted (atopal, who does host it and actually is an employee nowadays, will be working on getting it moved to Mozilla in the next months) but the domains are both Mozilla-owned<br />
11:09:02 &lt;•pir&gt; KaiRo: the server isn't, though, and you do redirects on server<br />
11:09:54 &lt;KaiRo&gt; pir: well, what we want in that bug is to have mozilla.at point to the same IP as mozilla.de (or CNAME to it or whatever)<br />
11:10:33 &lt;•pir&gt; KaiRo: ah, that's not the same question<br />
11:11:17 &lt;KaiRo&gt; and the stuff hosted by atopal that I was referring to is actually the .de one - I have no idea what the .at one even points to<br />
11:11:45 &lt;•ashish&gt; KaiRo: ok, file a bug with webops. they'll have to change nameservers, setup dns and then put up redirects as needed<br />
11:12:06 &lt;•pir&gt; that<br />
11:13:27 &lt;KaiRo&gt; ashish: OK, thanks!<br />
11:13:56 &lt;•pir&gt; KaiRo: www.mozilla.de or www.mozilla.com/de/ ?<br />
11:14:13 &lt;•pir&gt; KaiRo: the former is community, the latter is mozilla corp<br />
11:17:04 agibson|brb → agibson<br />
11:17:54 &lt;KaiRo&gt; pir: the former, we want both .at and .de point to the same community site<br />
11:18:57 &lt;•pir&gt; KaiRo: then you need someone in corp to do the dns change and someone who runs the de community site to make sure their end is set up<br />
11:20:08 &lt;KaiRo&gt; pir: sure<br />
11:21:00 &lt;KaiRo&gt; pir: I was mostly concerned about who to contact for the crop piece, I know the community people, we just met this last weekend<br />
11:21:35 &lt;•pir&gt; KaiRo: file a child bug into infra &amp; ops :: moc: service requests<br />
11:21:46 &lt;•pir&gt; KaiRo: if we can't do it directly then we can find someone who can<br />
11:22:10 &lt;KaiRo&gt; pir: thanks, good to know<br />
11:22:18 ⇐ agibson quit (agibson@moz-j04gi9.cable.virginm.net)<br />
11:22:31 &lt;•pir&gt; KaiRo: I'd suggest asking for a CNAME from mozilla.at to mozilla.de so if the de site's IP changes it doesn't break<br />
11:23:03 jlund → jlund|mtg<br />
11:23:51 &lt;KaiRo&gt; pir: yes, that's what I would prefer as well, esp. given the plans to move that communitxy website from atopal's server to Mozilla Community IT<br />
11:25:12 &lt;•ashish&gt; KaiRo: will mozilla.at always remain a direct? (in the near future, at least)<br />
11:25:39 &lt;KaiRo&gt; ashish: in the near future for sure, yes<br />
11:25:45 &lt;•ashish&gt; KaiRo: if so, we can have our static cluster handle the redirect<br />
11:25:59 &lt;•ashish&gt; that migh save some resources for the community<br />
11:26:18 &lt;•pir&gt; if it's ending up on the same server, how does that save resources?<br />
11:27:03 &lt;•ashish&gt; if it's all the same server then yeah, not a huge benefit<br />
11:27:20 Fallen|away → Fallen, hwine → hwine|mtg, catlee-lunch → catlee <br />
11:37:07 &lt;KaiRo&gt; ashish, pir: thanks for your help, I filed bug 1136318 as a result, I hope that moves this forward :)<br />
11:38:31 &lt;•pir&gt; np<br />
11:38:46 <•ashish> KaiRo: yw
11:38:46 &lt;•ashish&gt; KaiRo: yw<br />
11:40:23 coop|lunch → coop|mtg
11:40:23 coop|lunch → coop|mtg<br />
Tuesday, February 24th, 2015
Tuesday, February 24th, 2015


'''2015-02-20'''<br />


* reimaging a bunch of linux talos machines that have sat idle for 6 months
** <s>talos-linux32-ix-001</s>
** <s>talos-linux64-ix-[003,004,008,092]</s>
* [https://bugzil.la/1095300 '''https://bugzil.la/1095300''']
** working on slaveapi code for &quot;is this slave currently running a job?&quot;
* pending is up over 5000 again
** mostly try
** Callek: What caused this, just large amounts of pushing? What OS's were pending? etc.


'''2015-02-19'''<br />


* another massive gps push to try, another poorly-terminated json prop
** [https://bugzil.la/1134767 https://bugzil.la/1134767]
** rows excised from db by jlund
** jobs canceled by jlun/nthomas/gps
** master exception logs cleaned up with:
*** python manage_masters.py -f production-masters.json -R scheduler -R try -j16 update_exception_timestamp
* saw more exceptions related to get_unallocated_slaves today
** filed [https://bugzil.la/1134958 https://bugzil.la/1134958]
** symptom of pending jobs?


'''2015-02-18'''<br />


* filed [https://bugzil.la/1134316 https://bugzil.la/1134316] for tst-linux64-spot-341
* been thinking about builder mappings since last night
** simplest way may be to augment current allthethings.json output
*** need display names for slavepools
*** need list of regexps matched to language for each slavepool
*** this can be verified internally very easily: can run regexp against all builders in slavepool
*** external apps can pull down allthethings.json daily(?) and process file to strip out only what they need, e.g. slavepool -&gt; builder regexp mapping
*** would be good to publish hash of allthethings.json so consumers can easily tell when it has updated
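A minimal sketch (not an existing tool) of the consumer flow described above, assuming allthethings.json at its usual published location and an approximate layout where each entry in a top-level "builders" dict carries a "slavepool" id:
<pre>
# Hedged sketch of the consumer side of the builder-mapping idea above.
# Assumptions: the URL below, a "builders" dict keyed by builder name whose
# entries carry a "slavepool" id, and a cached hash from the previous run.
import hashlib
import json
import re

import requests

ALLTHETHINGS = "https://secure.pub.build.mozilla.org/builddata/reports/allthethings.json"


def fetch_if_changed(cached_sha1):
    """Download the dump and parse it only when it differs from last run.

    Publishing this hash next to allthethings.json (as suggested above)
    would let consumers skip the download entirely.
    """
    raw = requests.get(ALLTHETHINGS).content
    sha1 = hashlib.sha1(raw).hexdigest()
    if sha1 == cached_sha1:
        return None, cached_sha1
    return json.loads(raw), sha1


def builders_in_pool(data, slavepool):
    """Strip the dump down to just the builder names attached to one slavepool."""
    return sorted(name for name, props in data.get("builders", {}).items()
                  if props.get("slavepool") == slavepool)


def pool_regexp_is_consistent(data, slavepool, regexp):
    """Internal check from the notes: the advertised regexp should match
    every builder currently assigned to that slavepool."""
    pattern = re.compile(regexp)
    return all(pattern.search(name) for name in builders_in_pool(data, slavepool))
</pre>
An external app could run this daily and only reprocess the file when the hash changes.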




'''2015-02-17'''<br />


* b2g_bumper process hung for a few hours
** killed off python processes on bm66 per [[ReleaseEngineering/Applications/Bumper#Troubleshooting]]
* 3 masters (bm71, bm77, bm94) hitting exceptions related to jacuzzis:
** [https://hg.mozilla.org/build/buildbotcustom/annotate/a89f8a5ccd59/misc.py#l352 https://hg.mozilla.org/build/buildbotcustom/annotate/a89f8a5ccd59/misc.py#l352]
** unsure how serious this is
* taking the quiet moment to hammer out buildduty report and buildduty dashboard
* yesterday callek (while all the presidents frowned at him) added more linux masters: 120-124. They seemed to be trucking along fine.
* reconfig happened (for releases?) at 10:31 PT and that caused a push to b2g-37 to get lost
** related: [https://bugzil.la/1086961 https://bugzil.la/1086961]
* very high pending job count (again)
** enabled 4 new masters yesterday: bm[120-123], added 400 new AWS test slaves later in the day, but pending still shot past 3000
** graph of AWS capacity: [http://cl.ly/image/2r1b0C1q0g3p http://cl.ly/image/2r1b0C1q0g3p]
** nthomas has ipython tools that indicated many AWS builders were being missed in watch_pending.cfg
*** Callek wrote a patch: [https://github.com/mozilla/build-cloud-tools/commit/e2aba3500482f7b293455cf64bedfb1225bb3d7e https://github.com/mozilla/build-cloud-tools/commit/e2aba3500482f7b293455cf64bedfb1225bb3d7e]
*** seems to have helped, now around 2000 pending (21:06 PT)
* philor found a spot instance that hadn't taken work since dec 23:
** [https://secure.pub.build.mozilla.org/builddata/reports/slave_health/slave.html?class=test&type=tst-linux64-spot&name=tst-linux64-spot-341 https://secure.pub.build.mozilla.org/builddata/reports/slave_health/slave.html?class=test&amp;type=tst-linux64-spot&amp;name=tst-linux64-spot-341]
** no status from aws-manager, can't be started
*** needs more investigation tomorrow, may indicate a bigger problem if we aren't recycling nodes as we expect


'''2015-02-13'''<br />


* [https://bugzil.la/1132792 https://bugzil.la/1132792] - new tree closure
** current state: {{bug|1132792#c11}}
*** reverted db change from yesterday, buildbot apparently needs a beefy physical machine
* going through buildduty report


'''2015-02-12'''<br />


* buildbot db failover by sheeri (planned)
** [https://bugzil.la/1131637 https://bugzil.la/1131637]
* [https://bugzil.la/1132469 https://bugzil.la/1132469] - tree closure
** lots of idle slaves connected to masters despite high pending counts
** have rebooted some masters so far:
*** bm70, bm71, bm72, bm73, bm74, bm91, bm94
** coop looking into windows builders
*** found 2 builders that hadn't run *any* jobs ever (since late sept at least)


'''2015-02-11'''<br />


* reconfig is needed. last one was on thurs. blocked on the 10th from planned reconfig
** will kick off a reconfig at 10am ET
** bm118 ended up with 2 reconfig procs running
*** disabled in slavealloc, initiated clean shutdown. Will restart when jobs drain.
* went through aws_sanity_checker backlog
** lots of unnamed hosts up for multiple days
*** I'm assuming this is mostly for Windows AWS work based on the platform of the image, but we should really push people to tag instances more rigorously, or expect them to get killed randomly
* recovering &quot;broken&quot; slaves in slave health list
* Currently from jacuzzi report, 28 pending windows builds (for non-try) that are not in a jacuzzi
** 18 of them are disabled for varying reasons, should cull that list to see if any of them can/should be turned on.


'''2015-02-10'''<br />


* ghost patching fallout
** signing servers rejecting new masters
** masters enabled (bm69) but not running buildbot
*** dns issues
** context:
*** {{bug|1126428#c55}}
*** [https://gist.github.com/djmitche/0c2c968fa1f6a5b5e0ca#file-masters-md https://gist.github.com/djmitche/0c2c968fa1f6a5b5e0ca#file-masters-md]


* a patch ended up in manage_masters.py that blocked amy from continuing rollout and doing a reconfig
** [http://hg.mozilla.org/build/tools/rev/e7e3c7bf6efa http://hg.mozilla.org/build/tools/rev/e7e3c7bf6efa]
** so much time spent on debugging :( callek ended up finding the rogue accidental patch


'''2015-02-09'''<br />


* STAT for jlund


'''2015-02-05'''<br />


* tree closures
** [Bug 1130024] New: Extremely high Linux64 test backlog
*** chalking that one up to ~20% more pushes than we had previously
** [Bug 1130207] Several tests failing with &quot;command timed out: 1800 seconds without output running&quot; while downloading from ftp-ssl.mozilla.org
*** again, likely load related but nothing too obvious. worked with netops, suspect we were hitting load balancer issues (ZLB) since hg and ftp share balancers and hg was under heavy load today
*** dcurado will follow up
**** and his follow up: bug 1130242
* two reconfigs
* dev-stage01 was running low on disk space
* loan for sfink


'''2015-02-04'''<br />


* treeherder master db node is getting rebooted for ghost patching
** I asked mpressman to do it tomorrow and confirm with #treeherder folks first, as there weren't many people around who were familiar with the system
* puppet win 2008 slaves are ready for the big leagues (prod)!
** I will be coordinating with markco the testing on that front
* did a reconfig. Lots landed.
* investigated the 13 win builders that got upgraded RAM. 4 of them have been disabled for various issues
* dustin ghost patched bm103 and signing5/6


'''2015-02-03'''<br />


* fallout from: '''Bug 1127482''' - Make Windows B2G Desktop builds periodic
** caused a ~dozen dead command items every 6 hours
** patched: {{bug|1127482#c15}}
** moved current dead items to my own special dir in case I need to poke them again
** more dead items will come every 6 hours till above patch lands
* arr/dustin ghost slave work
** pod4 and 5 of pandas was completed today
*** 1 foopy failed to clone tools (/build/sut_tools/) on re-image
**** it was a timeout and puppet wasn't smart enough to re-clone it without a removal first
** try linux ec2 instances completed
* maybe after ami / cloud-tools fallout we should have nagios alerts for when aws spins up instances and kills them right away
* pro-tip when looking at ec2 graphs:
** zoom in or out to a time you care about and click on individual colour headings in legend below graph
*** last night I did not click on individual moz-types under the running graph and since there are so few bld-linux builders that run normally anyway, it was hard to notice any change


'''2015-02-02'''<br />


* [https://bugzil.la/1088032 '''https://bugzil.la/1088032'''] - Test slaves sometimes fail to start buildbot after a reboot
** I thought the problem had solved itself, but philor has been rebooting windows slaves every day, which is why we haven't run out of windows slaves yet
** may require some attention next week
* panda pod round 2 and 3 started today
** <s>turns out disabling pandas in slavealloc can kill its current job</s>
** Calling fabric's stop command (disable.flg) on foopies kills its current job
** This was a misunderstanding in terms of plan last week, but is what we did pods 1-&gt;3 with, and will be the continued plan for next sets
** we'll inform sheriffs at start and end of each pod's work
* reconfig is failing as masters won't update local repos
** vcs error: 500 ISE
** fallout from vcs issues. gps/hwine kicked a webhead and all is merry again
* added a new report link to slave health for runner's dashboard
* late night Tree Closures
** Bug 1128780
*** test pending skyrocketed, builds not running, builder graphs broken
*** tests were just linux test capacity (with ~1600 pending in &lt;3 hours)
*** graphs relating to running.html were just a fallout from dead-code removal
*** builds not running brought together mrrrgn, dustin and catlee; they determined it was fallout from dustin's CentOS 6.5 AMI work causing earlier AMIs to get shut off automatically on us
** generic.scl3 got rebooted, causing mozpool to die out and restart, leaving many panda jobs dead
* B2G nightlies busted, unknown cause
** Bug 1128826


'''2015-01-30'''<br />


* loan for markco: [https://bugzil.la/1127411 https://bugzil.la/1127411]
* GHOST
** dustin is upgrading 7 foopies + 1 image host to make more of our infra haunted with ghosts
** [https://bugzil.la/1126428 https://bugzil.la/1126428]
* started reconfig 11:00 PT
* fewer pandas decomm-ed than anticipated, will have final numbers today
* [https://bugzil.la/1109862 https://bugzil.la/1109862] - re-assigned to relops for dll deployment
* buildapi + new buildbot passwd: do we know what went wrong here?
** catlee suspects he updated the wrong config
* positive feedback from philor on Callek's jacuzzi changes


'''2015-01-29'''<br />


* [https://bugzil.la/1126879 https://bugzil.la/1126879] Slaveapi not filing unreachable/problem-tracking bugs
** Theorize we might fix by [https://github.com/bhearsum/bzrest/pull/1 https://github.com/bhearsum/bzrest/pull/1] or at least get better error reporting.
* Did some intermittent bug triage by using jordan's tool for giggles
** [https://bugzilla.mozilla.org/page.cgi?id=user_activity.html&action=run&who=bugspam.Callek%40gmail.com&from=2015-01-28&to=2015-01-29&group=bug https://bugzilla.mozilla.org/page.cgi?id=user_activity.html&amp;action=run&amp;who=bugspam.Callek%40gmail.com&amp;from=2015-01-28&amp;to=2015-01-29&amp;group=bug]
* GHOST
** cyliang wants to patch and restart rabbitmq
*** [https://bugzil.la/1127433 https://bugzil.la/1127433]
** nameservers restarted this morning
*** no detectable fallout, modulo ntp syncing alerts for 30min
** :dustin will be upgrading the foopies to CentOS 6.5
*** this will be done per VLAN, and will mean a small, rolling decrease in capacity
** mothballing linux build hardware actually helped us here!
* package-tests target is becoming increasingly unreliable. may have a parallel bug in it
** [https://bugzil.la/1122746 '''https://bugzil.la/1122746'''] - package-tests step occasionally fails on at least win64 with 'find: Filesystem loop detected'
* coop is doing a panda audit
** cleaning up devices.json for recently decomm-ed pandas
** figuring out how much capacity we've lost since we disabled those racks back in the fall
*** will determine when we need to start backfill
* {{bug|1127699}} (Tree Closure at ~10:30pm PT)


'''2015-01-28'''<br />


* [https://bugzil.la/1109862 https://bugzil.la/1109862] - ran some tests with new dll installed
* working on slave loan for [https://bugzil.la/1126547 https://bugzil.la/1126547]
* jacuzzi changes this morning: Android split apk and win64 debug
* planning to decomm a bunch more pandas over the next few days
** may need to start a backfill process soon (we have lots waiting). may be able to hold out until Q2
* hacked a script to scrape tbpl bot comments on intermittent bugs and apply metrics
** [https://hg.mozilla.org/build/braindump/file/8d723bd901f2/buildduty/diagnose_intermittent_bug.py https://hg.mozilla.org/build/braindump/file/8d723bd901f2/buildduty/diagnose_intermittent_bug.py]
*** BeautifulSoup is not required, but BeautifulSoup4 is! (said here rather than editing the doc like I should ~ Callek)
** applied here:
*** {{bug|1060214#c51}}
*** {{bug|1114541#c345}}
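For illustration only, a rough sketch of the same metric-gathering idea against the Bugzilla REST API instead of scraping TBPL pages (the linked braindump script parses TBPL HTML with BeautifulSoup4; the slave-name pattern below is an assumption):
<pre>
# Hedged sketch, not the linked diagnose_intermittent_bug.py: tally which
# slaves are mentioned in the comments of an intermittent-failure bug.
import collections
import re

import requests

COMMENTS_URL = "https://bugzilla.mozilla.org/rest/bug/{bug_id}/comment"
# Rough guess at slave naming, e.g. talos-linux64-ix-012, tst-linux64-spot-341
SLAVE_RE = re.compile(r"\b(?:[a-z0-9]+-)+(?:ix|spot|ec2|r[0-9])-\d+\b")


def slave_counts(bug_id):
    """Count slave-name mentions across all comments on a public bug."""
    resp = requests.get(COMMENTS_URL.format(bug_id=bug_id))
    resp.raise_for_status()
    comments = resp.json()["bugs"][str(bug_id)]["comments"]
    counts = collections.Counter()
    for comment in comments:
        counts.update(SLAVE_RE.findall(comment["text"]))
    return counts


if __name__ == "__main__":
    for slave, hits in slave_counts(1060214).most_common(10):
        print("%4d  %s" % (hits, slave))
</pre>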


'''2015-01-27'''<br />


* [https://bugzil.la/1126181 https://bugzil.la/1126181] - slave health jacuzzi patch review for Callek
* [https://bugzil.la/1109862 https://bugzil.la/1109862] - Distribute update dbghelp.dll to all Windows XP talos machines for more usable profiler pseudostacks
** pinged in bug by Rail
* some slave health display consistency fixes
* [https://bugzil.la/1126370 https://bugzil.la/1126370]


'''2015-01-26'''<br />


* audited windows pool for RAM: {{bug|1122975#c6}}
** tl;dr 13 slaves have 4gb RAM and they have been disabled and dep'd on 1125887
** dcops bug: [https://bugzil.la/1125887 https://bugzil.la/1125887]
* 'over the weekend': small hiccup with bgp router swap bug: killed all of scl3 for ~10min not on purpose.
** tl;dr - everything came back magically and I only had to clear up ~20 command queue jobs
* which component is nagios bugs these days? seems like mozilla.org::Infrastructure &amp; Operations bounced [https://bugzil.la/1125218 https://bugzil.la/1125218] back to releng::other. do we (releng) play with nagios now?
** &quot;MOC: Service Requests&quot; - refreshed assurance per chat with MOC manager (linda)
* terminated loan with slaveapi: {{bug|1121319#c4}}
* attempted reconfig but hit conflict in merge: {{bug|1110286#c13}}
* catlee is changing buildapi r/o sql pw now (11:35 PT)
** I restarted buildapi
** updated wiki to show how we can restart buildapi without bugging webops
*** [[ReleaseEngineering/How_To/Restart_BuildAPI]]
** ACTION: should we delete [[ReleaseEngineering/How_To/Update_BuildAPI]] since it is basically a less verbose copy than: [[ReleaseEngineering/BuildAPI#Updating_code]] ?
* updated trychooser to fix bustage


'''2015-01-23'''<br />


* deployed new bbot r/o pw to aws-manager and 'million other non puppetized tools'
** do we have a list? We should puppetize them *or* replace them
* '''filed: Bug 1125218''' - disk space nagios alerts are too aggressive for signing4.srv.releng.scl3.mozilla.com
* investigated: Bug 1124200 - Android 4 L10n Nightly Broken
* report-4hr hung at 10:42 - coop killed the cron task
** sheeri reported that mysql slave is overworked right now and she will add another node
** should we try to get a more self-serve option here, or a quicker view into the db state?
*** for DB state we have [https://rpm.newrelic.com/accounts/263620/dashboard/3101982 https://rpm.newrelic.com/accounts/263620/dashboard/3101982] and similar
* [https://bugzil.la/1125269 https://bugzil.la/1125269] - survey of r5s uncovered two machines running slower RAM
* [http://callek.pastebin.mozilla.org/8314860 http://callek.pastebin.mozilla.org/8314860] &lt;- jacuzzi patch (saved in pastebin for 1 day)


'''2015-01-22'''<br />


* reconfig
** required backout of mozharness patch from [https://bugzil.la/1123443 https://bugzil.la/1123443] due to bustage
** philor reported spidermonkey bustage: [https://treeherder.mozilla.org/logviewer.html#?job_id=5771113&repo=mozilla-inbound https://treeherder.mozilla.org/logviewer.html#?job_id=5771113&amp;repo=mozilla-inbound]
*** change by sfink - {{bug|1106707#c11}}
* [https://bugzil.la/1124705 https://bugzil.la/1124705] - tree closure due to builds-4hr not updating
** queries and replication blocked in db
** sheeri flushed some tables, builds-4hr recovered
** re-opened after 20min
* [https://bugzil.la/1121516 https://bugzil.la/1121516] - sheeri initiated buildbot db failover after reconfig (per email)
* philor complaining about panda state:
** &quot;I lied about the panda state looking totally normal - 129 broken then, fine, exactly 129 broken for all time, not so normal&quot;
** filed '''Bug 1124863''' - more than 100 pandas have not taken a job since 2015-01-20 around reconfig
*** status: fixed
* filed '''Bug 1124850''' - slaveapi get_console error handling causes an exception when log formatting
** status: wontfix but pinged callek before closing
* filed '''Bug 1124843''' - slaveapi cltbld creds are out of date
** status: fixed, also improved root pw list order
* did a non merge reconfig for armen/bustage
* b2g37 fix for bustage I (jlund) caused. reconfiged [https://bugzil.la/1055919 https://bugzil.la/1055919]




'''2015-01-21'''<br />


* landed fix for [https://bugzil.la/1123395 https://bugzil.la/1123395] - Add ability to reboot slaves in batch on the slavetype page
* many of our windows timeouts ('''2015-01-16''') may be the result of not having enough RAM. Need to look into options like doubling page size: {{bug|1110236#c20}}


'''2015-01-20'''<br />


* reconfig, mostly to test IRC notifications
* master
* grabbed 2 bugs:
** <s>[https://bugzil.la/1122379 https://bugzil.la/1122379] - Loan some slaves to :Fallen for his master</s>
** [https://bugzil.la/1122859 https://bugzil.la/1122859] - Slave loan request for a bld-linux64-ec2 vm to try installing gstreamer 1.0
*** use releng loan OU?
* network flapping throughout the day
** filed: [https://bugzil.la/1123911 https://bugzil.la/1123911]
** starting discussion in #netops
** aws ticket opened
* b2g bumper bustage
** fix in [https://bugzil.la/1122751 https://bugzil.la/1122751]
* rebooted ~100 pandas that stopped taking jobs after reconfig
* '''Bug 1124059''' - create a buildduty dashboard that highlights current infra health
* TODO: Bug for &quot;make it painfully obvious when slave_health testing mode is enabled, thus is displaying stale data&quot;
** hurt philor in #releng this evening when an old patch with testing mode on was deployed.
** i have a precommit hook for this now, shouldn't happen again
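A minimal sketch of the kind of precommit guard mentioned above (the real hook isn't in these notes; the file path and flag pattern are made up for illustration):
<pre>
# check_testing_mode.py - hypothetical pre-commit guard: refuse to commit
# slave health with its testing/staging mode still switched on.
# Wire it up as an external Mercurial hook, e.g. in .hg/hgrc:
#   [hooks]
#   precommit.testingmode = python /path/to/check_testing_mode.py
# A non-zero exit aborts the commit.
import re
import sys

FILES_TO_CHECK = ["js/slave_health.js"]                     # illustrative path
FLAG = re.compile(r"testing\s*[:=]\s*true", re.IGNORECASE)  # illustrative flag


def main():
    for path in FILES_TO_CHECK:
        try:
            with open(path) as fh:
                contents = fh.read()
        except IOError:
            continue  # file not present in this checkout
        if FLAG.search(contents):
            sys.stderr.write("%s still has testing mode enabled; refusing to commit\n" % path)
            return 1
    return 0


if __name__ == "__main__":
    sys.exit(main())
</pre>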


'''2015-01-19'''<br />


* Filed bugs for issues discussed on Friday:
** [https://bugzil.la/1123395 https://bugzil.la/1123395] - Add ability to reboot slaves in batch on the slavetype page
** [https://bugzil.la/1123371 https://bugzil.la/1123371] - provide access to more timely data in slave health
** [https://bugzil.la/1123390 https://bugzil.la/1123390] - Synchronize the running/pending parsing algorithms between slave health and nthomas' reports
* fixed slavealloc datacenter issue for some build/try linux instances - {{bug|1122582#c7}}
* re-imaged b-2008-ix-0006, b-2008-ix-0020, b-2008-ix-0172
* deployed 'terminate' to slaveapi and then broke slaveapi for bonus points
* re-patched 'other' aws end points for slaveapi - deploying that today (20th)
* fixed nical's troublesome loan


'''2015-01-16 (rollup of below scratchpad)'''


JLUND<br />
sheriffs requested I investigate:<br />


* spike in win64 filesystem loops:
** sheriffs suggested they have pinged many times recently and they will start disabling slaves if objdir nuking is not preferable
** nuking b-2008-ix-0114 objdir of related builder
** filed '''bug 1122746'''
* '''Bug 916765''' - Intermittent &quot;command timed out: 600 seconds without output, attempting to kill&quot; running expandlibs_exec.py in libgtest
** diagnosis: {{bug|916765#c193}}
** follow up: I will post a patch but it is not buildduty actionable from here on out IMO
* '''Bug 1111137''' - Intermittent test_user_agent_overrides.html | Navigator UA not overridden at step 1 - got Mozilla/5.0 (Android; Mobile; rv:37.0) Gecko/37.0 Firefox/37.0, expected DummyUserAgent
** diagnosis: {{bug|1111137#c679}}
** follow up: nothing for buildduty
* '''Bug 1110236''' - Intermittent &quot;mozmake.exe[6]: *** [xul.dll] Error 1318&quot; after &quot;fatal error LNK1318: Unexpected PDB error&quot;
** diagnosis: {{bug|1110236#c17}}
* '''there was a common trend from the above 3 bugs with certain slaves'''
** filed tracker and buildduty follow up bug: [https://bugzil.la/1122975 '''https://bugzil.la/1122975''']


loans:<br />


* fallen for setting up slaves on his master [https://bugzil.la/1122379 https://bugzil.la/1122379]
* nical tst ec2 [https://bugzil.la/1121992 https://bugzil.la/1121992]


CALLEK<br />
Puppet Issues: <br />


* Had a db_cleanup puppet failure on bm81, catlee fixed with [http://hg.mozilla.org/build/puppet/rev/d88423d7223f http://hg.mozilla.org/build/puppet/rev/d88423d7223f]
* There is a MIG puppet issue blocking our golden AMI's from completing. Ulfr pinged in #releng and I told him he has time to investigate (rather than asking for an immediate backout)


Tree Closure:<br />


* {{bug|1122582}}
* Linux jobs, test and build were pending far too long
* I (Callek) got frustrated trying to find out what the problem was and trying to get other releng assistance to look at it
* Boils down to capacity issues, but was darn hard to pinpoint


Action Items<br />


* Find some way to identify more easily when we're at capacity in AWS (my jacuzzi slave health work should help with that, at least a bit)
* Get &lt;someone&gt; to increase our AWS capacity or find out if/why we're not using existing capacity. If increasing we'll need more masters.


'''2015-01-16'''


arr<br />
12:08:38 any of the eu folks around? looks like someone broke puppet last night.


mgerva<br />
12:20:01 arr: i'm here


arr<br />
12:21:10 mgerva: looks like a problem with someone who's trying to upgrade mig<br />
12:21:32 mgerva: it's been sending out mail about failing hosts<br />
12:21:39 wasn't sure if it was also taking them offline eventually<br />
12:21:48 (so I think this is limited to linux)<br />
12:32:43 mgerva is now known as mgerva|afk


pmoore<br />
12:47:16 arr: mgerva|afk: since the sheriffs aren't complaining yet, we can probably leave this for build duty which should start in a couple of hours


arr<br />
12:47:46 pmoore: okay!


pmoore<br />
12:47:51 i don't think anyone is landing puppet changes at the moment, so hopefully it should affect anything… i hope!<br />
12:48:02 *shouldn't*


I see two different errors impacting different types of machines:<br />


* Issues with mig: Puppet (err): Could not update: Execution of '/usr/bin/apt-get -q -y -o DPkg::Options::=--force-confold --force-yes install mig-agent=20150109+a160729.prod' returned 100
* Issues with a different config file: Puppet (err): Could not retrieve catalog from remote server: Error 400 on SERVER: Invalid parameter user on File[/builds/buildbot/db_maint/config.ini] at /etc/puppet/production/modules/buildmaster/manifests/db_maintenance.pp:48


'''2015-01-15'''<br />


* reviewed and deployed [https://hg.mozilla.org/build/mozharness/rev/3a6062cbd177 https://hg.mozilla.org/build/mozharness/rev/3a6062cbd177] (to fix vcs sync for gecko l10n)
* enabled mochitest-chrome on B2G emulators on cedar (bug 1116187) as part of the merge default -&gt; production for mozharness


'''2015-01-14'''<br />


* slave loan for tchou
* started patch to reboot slaves that have not reported in X hours (slave health)
* reconfig for catlee/ehsan
* recovered 2 windows builders with circular directory structure


'''2015-01-13'''<br />


* reconfig for ehsan
* [https://bugzil.la/1121015 https://bugzil.la/1121015] - dolphin non-eng nightlies busted after merge
** bhearsum took it (fallout from retiring update.boot2gecko.org)
* scheduler reconfig for fubar
* [https://bugzil.la/1117811 https://bugzil.la/1117811] - continued master setup for Fallen
* clearing buildduty report backlog


'''2015-01-12'''<br />


* recovering loaned slaves
* setting up Tb test master for Fallen
** already has one apparently, some commentary in bug 1117811
* reconfig took almost 4hr (!)
* some merge day fallout with split APK


'''2015-01-08'''<br />


* {{bug|1119447}} - All buildbot-masters failing to connect to MySQL: Too many connections
** caused 3-hour tree closure


'''2015-01-07'''<br />


* wrote [[ReleaseEngineering/Buildduty/Other_Duties#Marking_jobs_as_complete_in_the_buildbot_database]]
* adjusted retention time on signing servers to 4 hours (from 8) to deal with nagios disk space alerts




'''2015-01-06'''<br />


* {{bug|1117395}} - Set RETRY on &quot;timed out waiting for emulator to start&quot; and &quot;We have not been able to establish a telnet connection with the emulator&quot;
** trees closed because emulator jobs won't start
** {{bug|1013634}} - libGL changes &lt;- not related according to rail
** backed out Armen's mozharness patch: [http://hg.mozilla.org/build/mozharness/rev/27e55b4b5c9a http://hg.mozilla.org/build/mozharness/rev/27e55b4b5c9a]
* reclaiming loaned machines based on responses to yesterday's notices


'''2015-01-05'''<br />


* sent reminder notices to people with loaned slaves


'''2014-12-30'''<br />


*


<br />


'''2014-12-29'''<br />


* returning spot nodes disabled by philor
** these terminate pretty quickly after being disabled (which is why he does it)
** to re-enable en masse, run 'update slaves set enabled=1 where name like '%spot%' and enabled=0' in the slavealloc db
** use the buildduty report, click on the 'View list in Bugzilla' button, and then close all the spot node bugs at once
* started going through bugs in the dependencies resolved section based on age. Here is a rundown of state:
** b-2008-ix-0010: kicked off a re-image, but I did this before fire-and-forget in early Dec and it doesn't seem to have taken. will check back in later
*** :markco using to debug Puppet on Windows issues
** panda-0619: updated relay info, but unclear in bug whether there are further issues with panda or chassis
**


<br />


'''2014-12-14'''<br />


*


<br />


'''2014-12-19'''<br />


* what did we accomplish?
** vcssync, b2g bumper ready to hand off to dev services(?)
** increased windows test capacity
** moved staging slaves to production
** disabled 10.8 on try
*** PoC for further actions of this type
** investigations with jmaher re: turning off &quot;useless&quot; tests
** opening up releng for contributions:
*** new public distribution list
*** moved tests over to travis
*** mentored bugs
** improved reconfigs
** set up CI for b2g bumper
* what do we need to accomplish next quarter?
** self-serve slave loan
** turn off &quot;useless&quot; tests
*** have a way to do this easily and regularly
** better ability to correlate tree state changes with releng code changes
** better platform change pipeline
*** proper staging env
** task cluster tackles most of the above, therefore migration of jobs to task cluster should enable these as a consequence
* what tools do we need?
** self-serve slave loan
** terminate AWS instances from slave health (slaveapi)
** ability to correlate releng changes with tree state changes
*** e.g. linux tests started timing out at Thursday at 8:00am: what changed in releng repos around that time?
*** armen's work on pinning mozharness tackles the mozharness part - migrating to task cluster puts build configs in-tree, so is also solved mostly with task cluster move
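A rough sketch of the "what changed in releng repos around that time?" query, using hg.mozilla.org's json-pushes API (the repo list, window handling and output are assumptions):
<pre>
# Hedged sketch: list pushes to a few releng repos inside a time window,
# so a tree-state change can be lined up against releng landings.
import requests

RELENG_REPOS = [            # illustrative subset
    "build/buildbot-configs",
    "build/buildbotcustom",
    "build/mozharness",
    "build/puppet",
    "build/tools",
]


def pushes_between(repo, start, end):
    """Push log for one repo between two date strings, e.g.
    '2015-01-15 06:00:00' and '2015-01-15 10:00:00' (UTC)."""
    url = "https://hg.mozilla.org/%s/json-pushes" % repo
    resp = requests.get(url, params={"startdate": start, "enddate": end, "full": 1})
    resp.raise_for_status()
    return resp.json()


def summarize(start, end):
    for repo in RELENG_REPOS:
        pushes = pushes_between(repo, start, end)
        for push_id in sorted(pushes, key=int):
            for cset in pushes[push_id]["changesets"]:
                print("%-25s %s %s" % (repo, cset["node"][:12],
                                       cset["desc"].splitlines()[0]))
</pre>
Pinning what the automation actually ran (as noted above for mozharness) is still needed to turn "a patch landed" into "that code was live at that time".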


'''2014-11-27'''<br />


* [https://bugzil.la/1105826 https://bugzil.la/1105826]
** trees closed most of the day due to Armen's try jobs run amok
** reporting couldn't handle the volume of retried jobs, affected buildapi and builds-4hr
*** disabled buildapi cronjobs until solution found
** db sync between master-&gt;slave lost for 5 hours
*** filed [https://bugzil.la/1105877 https://bugzil.la/1105877] to fix db sync; paged sheeri to fix
*** fixed ~8pm ET
** re-ran buildapi cronjobs incrementally by hand in order to warm the cache for builds-4hr
** all buildapi cronjobs re-enabled
** catlee picked up [https://bugzil.la/733663 https://bugzil.la/733663] for the long-term fix
** didn't get to deploy [https://bugzil.la/961279 https://bugzil.la/961279] as planned :(


'''2014-11-26'''<br />


* [https://bugzil.la/961279 https://bugzil.la/961279] - Mercurial upgrade - how to proceed?
** yes, we should have time to deploy it Thursday/Friday this week


'''2014-11-25'''<br />


* [https://bugzil.la/1104741 '''https://bugzil.la/1104741'''] - Tree closed for Windows 8 Backlog
** caused by nvidia auto-updates (probably not the first time)
** Q found fix to disable
** rebooted all w864 machines
* [https://bugzil.la/1101285 https://bugzil.la/1101285] - slaveapi doesn't handle 400 status from bugzilla
** needed to deploy this today so we could reboot the 80+ w8 slaves that didn't have problem tracking bugs yet
** also deployed logging fix ([https://bugzil.la/1073630 https://bugzil.la/1073630]) and component change for filing new bugs ([https://bugzil.la/1104451 https://bugzil.la/1104451])
* [https://bugzil.la/1101133 '''https://bugzil.la/1101133'''] - Intermittent Jit tests fail with &quot;No tests run or test summary not found&quot;
** too many jit_tests!
*


<br />


'''2014-11-24'''<br />


* kanban tool?
* [https://bugzil.la/1104113 https://bugzil.la/1104113] - Intermittent mozprocess timed out after 330 seconds


'''2014-11-21'''<br />


* work on 10.10
** running in staging
* restarted bm84
* reconfig for bhearsum/rail for pre-release changes for Fx34
* setup foopy56 after returning from diagnostics


'''2014-11-20''' a.k.a &quot;BLACK THURSDAY&quot;<br />


* {{bug|1101133}}
* {{bug|1101285}}


'''2014-11-19'''<br />


* [https://bugzil.la/1101786 '''https://bugzil.la/1101786'''] - Mac fuzzer jobs failing to unzip tests.zip
* bm85 - BAD REQUEST exceptions
** gracefully shutdown and restarted to clear
* [https://bugzil.la/1092606 https://bugzil.la/1092606]




'''2014-11-18'''<br />


* bm82 - BAD REQUEST exceptions
** gracefully shutdown and restarted to clear
* updated tools on foopys to pick up Callek's patch to monitor for old pywebsocket processes
* sent foopy56 for diagnostics
* [https://bugzil.la/1082852 https://bugzil.la/1082852] - slaverebooter hangs
** had been hung since Nov 14
** threads aren't terminating, need to figure out why
** have I mentioned how much i hate multi-threading?
* [https://bugzil.la/1094293 https://bugzil.la/1094293] - 10.10 support
** patches waiting for review


'''2014-11-17'''<br />


* meeting with A-team
** reducing test load
** [http://alertmanager.allizom.org/seta.html http://alertmanager.allizom.org/seta.html]


'''2014-11-14'''<br />


* ???


'''2014-11-13'''<br />


* ???


'''2014-11-12'''<br />


* ???


'''2014-11-11'''<br />


* ???


'''2014-11-10'''<br />


* release day
** ftp is melting under load; trees closed
*** dev edition went unthrottled
**** catlee throttled background updates to 25%
*** dev edition not on CDN
**** {{bug|1096367}}


'''2014-11-07'''<br />


* shared mozharness checkout
* jlund hg landings
* b2g_bumper travis tests working
* buildbot-master52
** hanging on every reconfig
** builder limits, hitting PB limits
** split masters: Try + Everything Else?
** graceful not working -> nuke it from orbit
** structured logging in mozharness has landed
* coop to write docs:
** moving slaves from production to staging
** dealing with bad slaves


'''2014-11-06'''<br />


* b2g_bumper issues
* [https://bugzil.la/1094922 '''https://bugzil.la/1094922'''] - Widespread hg.mozilla.org unresponsiveness
* buildduty report queue
* some jobs pending for more than 4 hours
** aws tools needed to have the new cltbld password added to their json file, idle instances not being reaped
** need some monitoring here
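A rough sketch of the kind of pending-age check meant by "need some monitoring here"; the URL and JSON field names are placeholders, not the real buildapi interface:

<pre>
# Hypothetical pending-job age check; PENDING_URL and the JSON shape are assumptions.
import json
import time
import urllib.request

PENDING_URL = "https://example.mozilla.org/buildapi/pending?format=json"  # placeholder
MAX_AGE = 4 * 3600  # alert on anything pending longer than 4 hours

with urllib.request.urlopen(PENDING_URL) as resp:
    data = json.load(resp)
now = time.time()
for job in data.get("pending", []):           # assumed structure
    age = now - job.get("submitted_at", now)  # assumed epoch-seconds field
    if age > MAX_AGE:
        print("ALERT: %s pending for %.1f hours" % (job.get("buildername", "?"), age / 3600.0))
</pre>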


'''2014-11-05'''<br />


* sorry for the last few days, something important came up and i've barely been able to focus on buildduty
* [https://bugzil.la/foopy56 https://bugzil.la/foopy56]
** hitting load spikes
* [https://bugzil.la/990173 https://bugzil.la/990173] - Move b2g bumper to a dedicated host
** bm66 hitting load spikes
** what is best solution: beefier instance? multiple instances?
* PT - best practices for buildduty?
** keep "Current" column accurate


'''2014-11-04'''<br />


* t-snow-r4-0002 hit an hdiutil error and is now unreachable
* t-w864-ix-026 destroying jobs, disabled
* {{bug|1093600}}
** bugzilla api updates were failing, fixed now
** affected reconfigs (script could not update bugzilla)
* {{bug|947462}}
** tree outage when this landed
** backed it out
** probably it can be relanded, just needs a clobber


'''2014-11-03'''<br />


*


<br />


'''2014-10-31'''<br />


* valgrind busted on Try
** only build masters reconfig-ed last night by nthomas
*** reconfig-ed try masters this morning


* {{bug|1071281}}
** mshal has metrics patch for landing
* windows release repacks failing on b-2008-ix-0094
** [https://secure.pub.build.mozilla.org/builddata/reports/slave_health/slave.html?name=b-2008-ix-0094 https://secure.pub.build.mozilla.org/builddata/reports/slave_health/slave.html?name=b-2008-ix-0094]
* failing to download rpms for centos
** {{bug|1085348}}
** [http://netops2.private.scl3.mozilla.com/smokeping/sm.cgi?target=Datacenters.RELENG-SCL3 http://netops2.private.scl3.mozilla.com/smokeping/sm.cgi?target=Datacenters.RELENG-SCL3]
* mapper changes
** docs in progress
** WSME doesn't support non-JSON requests
* b2g bumper change
** in progress
* kanban tool?
** sent mail to group
** we'll use pivotal tracker
* reconfig for jlund
* re-image b-2008-ix-0094
* blog post:
** [http://coop.deadsquid.com/2014/10/10-8-testing-disabled-by-default-on-try/ http://coop.deadsquid.com/2014/10/10-8-testing-disabled-by-default-on-try/]


'''2014-10-30'''<br />


* how best to handle broken manifests?
** difference of opinion w/ catlee
** catlee does see the human cost of not fixing this properly
* mapper docs
* b2g bumper: log rotation
* [https://bugzil.la/1091707 '''https://bugzil.la/1091707''']
** Frequent FTP/proxxy timeouts across all trees
*** network blip?


* [https://bugzil.la/1091696 '''https://bugzil.la/1091696''']
** swap on fwunit1.private.releng.scl3.mozilla.com is CRITICAL: SWAP CRITICAL - 100% free (0 MB out of 0 MB)
** these are dustin's firewall unit tests: ping him when we get these alerts
* reconfig


'''2014-10-29'''<br />


* b2g bumper
** b2g manifests
*** no try for manifests


* All new w864 boxes have wrong resolution
** Q started to investigate, resurrected 3
** slave bugs linked against [https://bugzil.la/1088839 https://bugzil.la/1088839]
* started thread about disabling try testing on mtnlion by default
** {{bug|1091368}}


'''2014-10-28'''<br />


* testing new hg 3.1.2 GPO
** {{bug|1056981}}
** failing to find pem files
* cleaned up loaner list from yesterday
** closed 2 bugs that were unused
** added 2 missing slavealloc notes
** terminated 11 instances
** removed many, many out-of-date names & hosts from ldapadmin
* lots of bigger scope bugs getting filed under the buildduty category
** most belong in general automation or tools IMO
** I don't think buildduty bugs should have a scope bigger than what can be accomplished in a single day. thoughts?
* reconfig to put new master (bm119) and new Windows test slaves into production
* massive spike in pending jobs around 6pm ET
** 2000->5000
** closed trees
* waded through the buildduty report a bit




'''2014-10-27'''<br />


* 19 *running* loan instances

dev-linux64-ec2-jlund2<br />
dev-linux64-ec2-kmoir<br />
dev-linux64-ec2-pmoore<br />
<s>dev-linux64-ec2-rchien</s><br />
dev-linux64-ec2-sbruno<br />
<s>tst-linux32-ec2-evold</s><br />
tst-linux64-ec2-evanxd<br />
<s>tst-linux64-ec2-gbrown</s><br />
<s>tst-linux64-ec2-gweng</s><br />
<s>tst-linux64-ec2-jesup</s><br />
<s>tst-linux64-ec2-jesup2</s><br />
tst-linux64-ec2-jgilbert<br />
<s>tst-linux64-ec2-kchen</s><br />
tst-linux64-ec2-kmoir<br />
<s>tst-linux64-ec2-mdas</s><br />
<s>tst-linux64-ec2-nchen</s><br />
<s>tst-linux64-ec2-rchien</s><br />
<s>tst-linux64-ec2-sbruno</s><br />
tst-linux64-ec2-simone<br />

27 open loan bugs:<br />
[http://mzl.la/1nJGtTw http://mzl.la/1nJGtTw]

We should reconcile. Should also cleanup entries in ldapadmin.<br />

'''2014-10-24'''<br />


* [https://bugzil.la/1087013 '''https://bugzil.la/1087013'''] - Move slaves from staging to production
** posted patch to cleanup mozilla-tests
* filed [https://bugzil.la/1088839 https://bugzil.la/1088839]
** get new slaves added to configs, dbs


'''2014-10-23'''<br />


* test slaves sometimes fail to start buildbot on reboot
** {{bug|1088032}}
* re-imaging a bunch of w864 machines that were listed as only needing a re-image to be recovered:
** t-w864-ix-0[04,33,51,76,77]
** re-image didn't help any of these slaves
** {{bug|1067062}}
* investigated # of windows test masters required for arr
** 500 windows test slaves, 4 existing windows test masters
*** filed {{bug|1088146}} to create a new master


* OMG load
** ~7000 pending builds at 4pm ET
** KWierso killed off lots of try load: stuff that had already landed, stuff with followup patches
*** developer hygiene is terrible here


* created [https://releng.etherpad.mozilla.org/platform-management-known-issues https://releng.etherpad.mozilla.org/platform-management-known-issues] to track ongoing issues with various slave classes


'''2014-10-22'''<br />


* no loaners
* no bustages
* remove slave lists from configs entirely
** pete to add see also to [https://bugzil.la/1087013 https://bugzil.la/1087013]
* merge all (most?) releng repos into a single repo
** {{bug|1087335}}
* mac-v2-signing3 alerting in #buildduty <- not dealt with
* git load spiked: {{bug|1087640}}
** caused by Rail's work with new AMIs
*** [https://bugzil.la/1085520 '''https://bugzil.la/1085520''']? <- confirm with Rail
** [https://graphite-scl3.mozilla.org/render/?width=586&height=308&_salt=1414027003.262&yAxisSide=right&title=git%201%20mem%20used%20%26%20load&from=-16hours&xFormat=%25a%20%25H%3A%25M&tz=UTC&target=secondYAxis%28hosts.git1_dmz_scl3_mozilla_com.load.load.shortterm%29&target=hosts.git1_dmz_scl3_mozilla_com.memory.memory.used.value&target=hosts.git1_dmz_scl3_mozilla_com.swap.swap.used.value https://graphite-scl3.mozilla.org/render/?width=586&amp;height=308&amp;_salt=1414027003.262&amp;yAxisSide=right&amp;title=git%201%20mem%20used%20%26%20load&amp;from=-16hours&amp;xFormat=%25a%20%25H%3A%25M&amp;tz=UTC&amp;target=secondYAxis%28hosts.git1_dmz_scl3_mozilla_com.load.load.shortterm%29&amp;target=hosts.git1_dmz_scl3_mozilla_com.memory.memory.used.value&amp;target=hosts.git1_dmz_scl3_mozilla_com.swap.swap.used.value]
** similar problem a while ago (before
*** led to creation of golden master


** many win8 machines "broken" in slave health
** working theory is that 64-bit browser is causing them to hang somehow
** [https://bugzil.la/1080134 https://bugzil.la/1080134]
** same for mtnlion
** same for win7
** we really need to find out why these slaves will simply fail to start buildbot and then sit waiting to be rebooted
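A minimal sketch of how one might spot-check a single slave in that state; the basedir is an assumption and varies by slave class:

<pre>
# Hypothetical slave spot-check: is a buildslave process alive, and what did it log last?
import os
import subprocess

BASEDIR = "/builds/slave"                     # assumed slave basedir
pidfile = os.path.join(BASEDIR, "twistd.pid")
log = os.path.join(BASEDIR, "twistd.log")

if os.path.exists(pidfile):
    pid = int(open(pidfile).read().strip())
    try:
        os.kill(pid, 0)                       # signal 0 only checks that the process exists
        print("buildslave appears to be running, pid %d" % pid)
    except OSError:
        print("stale twistd.pid (pid %d not running)" % pid)
else:
    print("no twistd.pid - buildslave never started after reboot?")

if os.path.exists(log):
    subprocess.call(["tail", "-n", "20", log])  # last few log lines for context
</pre>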


'''2014-10-21'''<br />


* {{bug|1086564}} Trees closed
** alerted Fubar - he is working on it
* {{bug|1084414}} Windows loaner for ehsan
* killing esr24 branch
* [https://bugzil.la/1066765 https://bugzil.la/1066765] - disabling foopy64 for disk replacement
* [https://bugzil.la/1086620 '''https://bugzil.la/1086620'''] - Migrate slave tools to bugzilla REST API
** wrote patches and deployed to slavealloc, slave health
* trimmed Maintenance page to Q4 only, moved older to 2014 page
* filed [https://bugzil.la/1087013 https://bugzil.la/1087013] - Move slaves from staging to production
** take some of slave logic out of configs, increase capacity in production
* helped mconley in #build with a build config issue
* [https://bugzil.la/973274 '''https://bugzil.la/973274'''] - Install GStreamer 1.x on linux build and test slaves
** this may have webrtc implications, will send mail to laura to check


'''2014-10-20'''<br />


* reconfig (jetpack fixes, alder l10n, holly e10s)
** several linux64 test masters hit the PB limit
*** put out a general call to disable branches, jobs
*** meanwhile, set masters to gracefully shutdown, and then restarted them. Took about 3 hours.


* 64-bit Windows testing
** clarity achieved!
*** testing 64-bit browser on 64-bit Windows 8, no 32-bit testing on Windows 8 at all
*** this means we can divvy the incoming 100 machines between all three Windows test pools to improve capacity, rather than just beefing up the Win8 platform and splitting it in 2


'''2014-10-17'''<br />


* blocklist changes for graphics (Sylvestre)
* code for bug updating in reconfigs is done
** review request coming today
* new signing server is up
** pete is testing, configuring masters to use it
* some classes of slaves not reconnecting to masters after reboot
** e.g. mtnlion
** need to find a slave in this state and figure out why
*** puppet problem? runslave.py problem (connection to slavealloc)? runner issue (connection to hg)? (see the triage sketch after this list)


* patch review for {{bug|1004617}}
* clobbering m-i for rillian
* helping Tomcat cherry-pick patches for m-r
* reconfig for Alder + mac-signing
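A triage sketch for the three suspects above; the slavealloc URL is a placeholder and the puppet state path assumes a stock agent layout:

<pre>
# Hypothetical triage for a slave that won't reconnect after reboot.
import os
import subprocess
import time
import urllib.request

SLAVEALLOC_URL = "https://example.mozilla.org/slavealloc/api"   # placeholder
PUPPET_STATE = "/var/lib/puppet/state/last_run_summary.yaml"    # assumed default path

# 1. puppet: how long since the last run?
if os.path.exists(PUPPET_STATE):
    age_hrs = (time.time() - os.path.getmtime(PUPPET_STATE)) / 3600.0
    print("puppet last ran %.1f hours ago" % age_hrs)
else:
    print("no puppet state file found - puppet may never have run")

# 2. slavealloc: can we reach it at all?
try:
    urllib.request.urlopen(SLAVEALLOC_URL, timeout=10)
    print("slavealloc reachable")
except Exception as e:
    print("slavealloc unreachable: %s" % e)

# 3. hg: can we talk to hg.mozilla.org?
rc = subprocess.call(["hg", "identify", "https://hg.mozilla.org/build/tools"])
print("hg identify exit code: %d" % rc)
</pre>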


'''2014-10-16'''<br />


* Updated all windows builders with new ffxbld_rsa key
* Patched reconfig code to publish to bugzilla - will test on next reconfig
* Working on set up of mac v2 signing server
* Fixed sftp.py script
* Set up meeting with J Lal, H Wine, et al for vcs sync handover
* lost my reconfig logs from yesterday in order to validate against [https://hg.mozilla.org/build/tools/file/a8eb2cdbe82e/buildfarm/maintenance/watch_twistd_log.py#l132 https://hg.mozilla.org/build/tools/file/a8eb2cdbe82e/buildfarm/maintenance/watch_twistd_log.py#l132] - will do so with next reconfig
* spoke to Amy about windows reimaging problem, and requested a single windows reimage to validate GPO setup
* reconfig for alder and esr24 changes
* rebooting mtnlion slaves that had been idle for 4 hours (9 of them)
** this seems to be a common occurrence. If I can find a slave in this state today, I'll file a bug and dive in. Not sure why the machine is rebooting and not launching buildbot.


'''2014-10-15'''<br />


* re-started instance for evold - {{bug|1071125}}
* landed wiki formatting improvements - {{bug|1079893}}
* landed slaverebooter timeout fix + logging improvements - {{bug|1082852}}
* disabled instance for dburns
* reconfig for jlund - [https://bugzil.la/1055918 https://bugzil.la/1055918]


'''2014-10-14'''<br />


* {{bug|1081825}} b2gbumper outage / mirroring problem - backed out - new mirroring request in bug {{bug|1082466}}
** symptoms: b2g_bumper lock file is stale
** should mirror new repos automatically rather than fail
*** bare minimum: report which repo is affected


* {{bug|962863}} rolling out l10n gecko and l10n gaia vcs sync - still to do: wait for first run to complete, update wiki, enable cron
* {{bug|1061188}} rolled out, and had to backout due to puppet changes not hitting spot instances yet, and gpo changes not hitting all windows slaves yet - for spot instances, just need to wait, for GPO i have a needinfo on :markco
** need method to generate new golden AMIs on demand, e.g. when puppet changes land
* mac signing servers unhappy - probably not unrelated to higher load due to tree closure - have downtimed in #buildduty for now due to extra load
** backlog of builds on Mac
*** related to slaverebooter hang?
*** many were hung for 5+ hours trying to run signtool.py on repacks
**** not sure whether this was related to (cause? symptom?) of signing server issues
**** could also be related to reconfigs + key changes (ffxbld_rsa)
*** rebooted idle&hung mac builders by hand
*** {{bug|1082770}} - getting another mac v2 signing machine into service


* sprints for this week:
** [pete] bug updates from reconfigs
** [coop] password updates?
* slaverebooter was hung but not alerting, *however* I did catch the failure mode: indefinitely looping waiting for an available worker thread
** added a 30min timeout waiting for a worker, running locally on bm74 (sketch after this list)
** filed {{bug|1082852}}
* put foopy64 back into service - {{bug|1066765}}
* [https://bugzil.la/1082818 https://bugzil.la/1082818] - t-w864-ix loaner for Armen
* [https://bugzil.la/1082784 https://bugzil.la/1082784] - tst-linux64-ec2 loaner for dburns
* emptied buildduty bug queues
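Roughly the shape of that worker-wait timeout (a sketch, not the actual slaverebooter patch):

<pre>
# Sketch of a bounded wait for a free worker slot, so the main loop cannot spin
# forever the way slaverebooter did.
import time

MAX_WORKERS = 8
WORKER_WAIT_TIMEOUT = 30 * 60   # give up after 30 minutes


def wait_for_worker(active_threads, timeout=WORKER_WAIT_TIMEOUT, poll=30):
    """Return True once a worker slot frees up, False if the timeout is hit."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        # prune finished threads, then check whether there is room for one more
        active_threads[:] = [t for t in active_threads if t.is_alive()]
        if len(active_threads) < MAX_WORKERS:
            return True
        time.sleep(poll)
    return False
</pre>

The point is only that the caller can log and bail out instead of hanging when False comes back.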


'''2014-10-13'''<br />


* [https://bugzil.la/1061861 https://bugzil.la/1061861] Merge day ongoing - keep in close contact with Rail
* [https://bugzil.la/1061589 https://bugzil.la/1061589] Deployed ffxbld_rsa key to hiera and landed Ben's changes
* Awaiting merge day activities before reconfiging outstanding changes:
** [https://bugzil.la/1077154 https://bugzil.la/1077154]
** [https://bugzil.la/1080134 https://bugzil.la/1080134]
** [https://bugzil.la/885331 https://bugzil.la/885331]
* mac-v2-signing1 complained a couple of times in #buildduty, but self-resolved


'''2014-10-10'''<br />


* kgrandon reported not getting updates for flame-kk
** some investigation
** cc-ed him on {{bug|1063237}}
* work




'''2014-10-09'''<br />


* db issue <s>this morning</s> all day
** sheeri ran an errant command on the slave db that inadvertently propagated to the master db
** trees closed for about 2 hours until jobs started
** however, after the outage while the trees were still closed, we did a live fail over between the master and the slave without incident
** later in the day, we tried to fail back over to the master db from the slave db, but we ended up with inconsistent data between the two databases. This resulted in a bunch of jobs not starting because they were in the wrong db.
** fixed with a hot copy
** filed [https://bugzil.la/1080855 https://bugzil.la/1080855] for RFO
* [https://bugzil.la/1079396 https://bugzil.la/1079396] - loaner win8 machine for :jrmuziel
* [https://bugzil.la/1075287 https://bugzil.la/1075287] - loaner instance for :rchien
** after some debugging over the course of the day, determined he needed a build instance after all
* filed [https://bugzil.la/1080951 '''https://bugzil.la/1080951'''] - Add fabric action to reset the timestamp used by buildbot-master exception log reporting


'''2014-10-08'''<br />


* [https://bugzil.la/1079778 https://bugzil.la/1079778] - Disabled pandas taking jobs
** [https://secure.pub.build.mozilla.org/builddata/reports/slave_health/slave.html?name=panda-0425 https://secure.pub.build.mozilla.org/builddata/reports/slave_health/slave.html?name=panda-0425]


'''2014-10-07'''<br />


* {{bug|1079256}} - B2G device image nightlies (non-eng only) constantly failing/retrying due to failure to upload to update.boot2gecko.org
** cleaned up, now 5% free
** fix is to stop creating/publishing/uploading b2g mars for all branches *except* 1.3 <- {{bug|1000217}}
* fallout from PHX outage?
** golden images (AMIs) keep getting re-puppetized: arr and rail discussing
*** cert issue should be fixed now
** slave loan command line tool: [https://github.com/petemoore/build-tools/tree/slave_loan_command_line https://github.com/petemoore/build-tools/tree/slave_loan_command_line] (scripts/slave_loan.sh)
* cleared buildduty report module open dependencies
* filed [https://bugzil.la/1079468 '''https://bugzil.la/1079468'''] - [tracking][10.10] Continuous integration testing on OS X 10.10 Yosemite


'''2014-10-06'''<br />


* {{bug|1078300#c3}}
** hg push showing on tbpl and treeherder with no associated builders generated
* sprints for this week:
** slaverebooter
*** [coop] determine why it sometimes hangs on exit
** [pete] end_to_end_reconfig.sh
*** add bug updates


'''2014-10-03'''<br />


* slaverebooter hung
** added some extra instrumentation locally to try to find out why, then removed the lockfile and restarted
** hasn't failed again today, will see whether it fails around the same time tonight (~11pm PT)
* [https://bugzil.la/1077432 '''https://bugzil.la/1077432'''] - Skip more unittests on capacity-starved platforms
** now skipping opt tests for mtnlion/win8/win7/xp
* reconfig
* [https://bugzil.la/1065677 https://bugzil.la/1065677] - started rolling restarts of all masters
** done


'''2014-10-02'''<br />


* mozmill CI not receiving pulse messages
** some coming through now
** no logging around pulse notifications
* mtnlion machines
** lost half of pool last night, not sure why
** dbs shutdown last night <- related?
** reboot most, leave 2 for diagnosis
** snow affected? yes! rebooted
** also windows
*** rebooted XP and W8
*** can't ssh to w7
**** rebooted via IPMI (-mgmt via web)


* pulse service - nthomas restarted pulse on masters
** multiple instances running, not checking PID file
** {{bug|1038006}}
* bug 1074147 - increasing test load on Windows
** coop wants to start using our skip-test functionality on windows (every 2) and mtnlion (every 3)
* uploads timing out
** led to (cause of?) running out of space on upload1/stage
** cleared out older uploads (>2 hrs) (see the dry-run sketch after this list)
*** find /tmp -type d -mmin +60 -exec rm -rf "{}" \;
** think it might be related to network blips (BGP flapping) from AWS: upload gets interrupted, doesn't get cleaned up for 2+ hours. With the load we've had today, that wouldn't be surprising
** smokeping is terrible: [http://netops2.private.scl3.mozilla.com/smokeping/sm.cgi?target=Datacenters.RELENG-SCL3.nagios1-releng-use1 http://netops2.private.scl3.mozilla.com/smokeping/sm.cgi?target=Datacenters.RELENG-SCL3.nagios1-releng-use1]
** filed {{bug|1077187}}
** restarted slaveapi to clear any bad state from today
*** re-ran slaverebooter by hand (bm74)


* [https://bugzil.la/1069429 https://bugzil.la/1069429] - Upload mozversion to internal pypi
* [https://bugzil.la/1076934 '''https://bugzil.la/1076934'''] - Temporarily turn off OTA on FX OS Master branch
** don't have a proper buildid to go on here, may have time to look up later
* added report links to slave health: hgstats, smokepings
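A dry-run sketch of the same /tmp cleanup, for sanity-checking before the rm -rf:

<pre>
# List /tmp directories older than 60 minutes instead of deleting them,
# matching the find -mmin +60 one-liner above.
import os
import time

CUTOFF = time.time() - 60 * 60   # 60 minutes

for name in os.listdir("/tmp"):
    path = os.path.join("/tmp", name)
    if os.path.isdir(path) and os.path.getmtime(path) < CUTOFF:
        print(path)
</pre>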


'''2014-10-01'''<br />


* [https://bugzil.la/1074267 '''https://bugzil.la/1074267'''] - Slave loan request for a bld-lion-r5 machine
* [https://bugzil.la/1075287 '''https://bugzil.la/1075287'''] - Requesting a loaner machine b2g_ubuntu64_vm to diagnose Bug 942411




'''2014-09-30'''<br />


* [https://bugzil.la/1074827 https://bugzil.la/1074827] - buildbot-configs_tests failing on Jenkins due to problem with pip install of master-pip.txt
** non-frozen version of OpenSSL being used - rail fixing and trying again
* [https://bugzil.la/943932 https://bugzil.la/943932] - T-W864-IX-025 having blue jobs
** root cause not known - maybe faulty disk - removed old mozharness checkout, now has a green job
* [https://bugzil.la/1072434 https://bugzil.la/1072434] - balrog submitter doesn't set previous build number properly
** this caused bustage with locale repacks - nick and massimo sorted it out
* [https://bugzil.la/1050808 https://bugzil.la/1050808] - several desktop repack failures today - I proposed we apply patch in this bug
* [https://bugzil.la/1072872 https://bugzil.la/1072872] - last machines rebooted
* [https://bugzil.la/1074655 '''https://bugzil.la/1074655'''] - Requesting a loaner machine b2g_ubuntu64_vm to diagnose bug 1053703
* going through buildduty report
** filed new panda-recovery bug, added pandas to it
** t-snow-r4-0075: reimaged, returned to production
** talos-linux64-ix-027: reimaged, returned to production
** emptied 3 sections (stopped listing the individual bugs)
* reconfig
* [https://bugzil.la/1062465 '''https://bugzil.la/1062465'''] - returned foopy64 and attached pandas to production
** disk is truly failing on foopy64, undid all that work


'''2014-09-29'''<br />


* [https://bugzil.la/1073653 https://bugzil.la/1073653] - bash on OS X
** dustin landed fix, watched for fallout
** complications with signing code bhearsum landed
*** all Macs required nudging (manual puppet runs + reboot). Mercifully dustin and bhearsum took care of this.


* [https://bugzil.la/1072405 https://bugzil.la/1072405] - Investigate why backfilled pandas haven't taken any jobs
** checking failure logs for patterns
** looks like mozpool is still trying to reboot using old relay info: needs re-sync from inventory?
** tools checkout on foopies hadn't been updated, despite a reconfig on Saturday
** enabled 610-612
** each passed 2 tests in a row, re-enabled the rest
* cleaned up resolved loans for bld-lion, snow, and mtnlion machines
* [https://bugzil.la/1074358 '''https://bugzil.la/1074358'''] - Please loan OS X 10.8 Builder to dminor
* [https://bugzil.la/1074267 '''https://bugzil.la/1074267'''] - Slave loan request for a talos-r4-snow machine
* [https://bugzil.la/1073417 '''https://bugzil.la/1073417'''] - Requesting a loaner machine b2g_ubuntu64_vm to diagnose


'''2014-09-26'''<br />


 
* cleared dead pulse queue items after pulse publisher issues in the morning
* [https://bugzil.la/1072405 https://bugzil.la/1072405] - Investigate why backfilled pandas haven't taken any jobs
** updated devices.json
** created panda dirs on foopies
** had to re-image panda-0619 by hand
** all still failing, need to investigate on Monday
* [https://bugzil.la/1073040 https://bugzil.la/1073040] - loaner for mdas
 




Besides the handover notes for last week, which I received from pete, there are the following issues:


{{bug|1038063}} Running out of space on dev-stage01:/builds<br />
The root cause of the alerts was the addition of folder /builds/data/ftp/pub/firefox/releases/31.0b9 by Massimo in order to run some tests.<br />
Nick did some further cleanup, the bug has been reopened this morning by pete, proposing to automate some of the steps nick did manually.
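A possible starting point for automating that cleanup is just reporting the age of everything under the releases directory (path taken from the note above; nothing is deleted here):

<pre>
# Hypothetical report of entries under the dev-stage01 releases directory,
# oldest first, so a human can decide what to prune.
import os
import time

RELEASES = "/builds/data/ftp/pub/firefox/releases"

entries = []
for name in os.listdir(RELEASES):
    entries.append((os.path.getmtime(os.path.join(RELEASES, name)), name))

for mtime, name in sorted(entries):
    age_days = (time.time() - mtime) / 86400.0
    print("%8.1f days old  %s" % (age_days, name))
</pre>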


[https://bugzilla.mozilla.org/show_bug.cgi?id=1036176 '''https://bugzilla.mozilla.org/show_bug.cgi?id=1036176'''] Some spot instances in us-east-1 are failing to connect to hg.mozilla.org<br />
Some troubleshooting has been done by Nick, and case 222113071 has been opened with AWS


'''2014-07-07 to 2014-07-11'''


Hi Simone,

Open issues at end of week:


foopy117 is playing up ({{bug|1037441}})<br />
this is also affecting end_to_end_reconfig.sh (solution: comment out manage_foopies.py lines from this file and run manually)<br />
Foopy 117 seems to be back and working normally


Major problems with pending queues ({{bug|1034055}}) - this should hopefully be fixed relatively soon. most notably linux64 in-house ix machines. Not a lot you can do about this - just be aware of it if people ask.<br />
Hopefully this is solved after Kim's recent work


Two changes currently in queue for next reconfig: {{bug|1019962}} (armenzg) and {{bug|1025322}} (jford)


Some changes to update_maintenance_wiki.sh from aki will be landing when the review passes ({{bug|1036573}}) - potential is to impact the wiki update in end_to_end_reconfig.sh as it has been refactored - be aware of this.


Currently no outstanding loaner requests at time of handover, but there are some that need to be checked or returned to the pool.


See the 18 open loan requests: [https://bugzilla.mozilla.org/buglist.cgi?bug_id=989521%2C1036768%2C1035313%2C1035193%2C1036254%2C1006178%2C1035270%2C876013%2C818198%2C981095%2C977190%2C880893%2C1017303%2C1023856%2C1017046%2C1019135%2C974634%2C1015418&list_id=10700493 https://bugzilla.mozilla.org/buglist.cgi?bug_id=989521%2C1036768%2C1035313%2C1035193%2C1036254%2C1006178%2C1035270%2C876013%2C818198%2C981095%2C977190%2C880893%2C1017303%2C1023856%2C1017046%2C1019135%2C974634%2C1015418&list_id=10700493]

I've pinged all the people in this list (except for requests less than 5 days old) to ask for status.


Pete


'''2014-05-26 to 2014-05-30'''<br />
* Monday<br />
* Tuesday<br />
** reconfig<br />
*** catlee's patch had bustage, and armenzg's had unintended consequences<br />
*** they each reconfiged again for their own problems<br />
** buildduty report:<br />
*** tackled bugs without dependencies<br />
*** tackled bugs with all dependencies resolved<br />
* Wednesday<br />
** new nightly for Tarako<br />
*** was actually a b2g code issue: Bug 1016157 - updated the version of vold<br />
** resurrecting tegras to deal with load<br />
* Thursday<br />
** AWS slave loan for ianconnoly<br />
** puppet patch for talos-linux64-ix-001 reclaim<br />
** resurrecting tegras<br />
* Friday<br />
** tegras continue to fall behind, pinged Pete very late Thursday with symptoms. Filed [https://bugzil.la/1018118 https://bugzil.la/1018118]<br />
** reconfig<br />
*** chiefly to deploy [https://bugzil.la/1017599 https://bugzil.la/1017599] <- reduce # of tests on tegras<br />
*** fallout:<br />
**** non-unified mozharness builds are failing in post_upload.py <- causing queue issues on masters<br />
**** panda tests are retrying more than before<br />
***** hitting "Caught Exception: Remote Device Error: unable to connect to panda-0402 after 5 attempts", but it *should* be non-fatal, i.e. test runs fine afterwards but still gets flagged for retry<br />
***** filed: [https://bugzil.la/1018531 https://bugzil.la/1018531]<br />
**** reported by sheriffs (RyanVM)<br />
***** timeouts on OS X 10.8 tests - "Timed out while waiting for server startup."<br />
****** [https://tbpl.mozilla.org/php/getParsedLog.php?id=40752197&tree=Fx-Team https://tbpl.mozilla.org/php/getParsedLog.php?id=40752197&tree=Fx-Team]<br />
***** similar timeouts on android 2.3


'''2014-05-05 to 2014-05-09'''

* Monday<br />
** new tarako nightly for nhirata<br />
** reconfig<br />
* Tuesday<br />
** began rolling master restarts for [https://bugzil.la/1005133 https://bugzil.la/1005133]<br />
* Wednesday<br />
** finished rolling master restarts for [https://bugzil.la/1005133 https://bugzil.la/1005133]<br />
** reconfig for fubar<br />
** dealt with bugs with no dependencies from the buildduty report (bdr)<br />
* Thursday<br />
** loan to :jib for [https://bugzil.la/1007194 https://bugzil.la/1007194]: talos-linux64-ix-004<br />
** loan to jmaher, failed. Filed [https://bugzil.la/1007967 https://bugzil.la/1007967]<br />
*** reconfig for bhearsum/jhford<br />
* Friday<br />
** tree closure(s) due to buildbot db slowdown<br />
*** was catlee's fault<br />
** follow-up on [https://bugzil.la/1007967 https://bugzil.la/1007967]: slave loaned to jmaher<br />
** bugs with no dependencies (bdr)<br />
** deployed tmp file removal fix for bld-lion in [https://bugzil.la/880003 https://bugzil.la/880003]


'''2014-04-21 to 2014-04-25'''


follow up:


buildduty report:<br />
'''Bug 999930''' - put tegras that were on loan back onto a foopy and into production


action items:<br />
* '''Bug 1001518''' - bld-centos6-hp-* slaves are running out of disk space<br />
** this pool had 4 machines run out of disk space all within the last week<br />
** I scrubbed a ton of space (a bandaid) but the core issue will need to be addressed


load:<br />
(high load, keep an eye on) '''Bug 999558''' - high pending for ubuntu64-vm try test jobs on Apr 22 morning PT<br />
(keep an eye on) {{bug|997702}}<br />
** git.m.o high load, out of RAM, see recent emails from hwine with subj 'git.m.o'<br />
** graph to watch: [https://graphite-scl3.mozilla.org/render/?width=586&height=308&_salt=1397724372.602&yAxisSide=right&title=git%201%20mem%20used%20&%20load&from=-8hours&target=secondYAxis(hosts.git1_dmz_scl3_mozilla_com.load.load.shortterm)&target=hosts.git1_dmz_scl3_mozilla_com.memory.memory.used.value&target=hosts.git1_dmz_scl3_mozilla_com.swap.swap.used.value git1 memory and load graph]
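
For quick checks from a terminal, a minimal sketch (Python; assumes the requests library and network access to graphite-scl3; the output filename is arbitrary and the title/salt parameters are omitted) that pulls the graph URL above and saves the image:

<pre>
# Sketch only: fetch the git1 memory/load graph referenced above and save it
# locally. Output filename is arbitrary; the render URL comes from the notes.
import requests

GRAPH_URL = (
    "https://graphite-scl3.mozilla.org/render/"
    "?width=586&height=308&yAxisSide=right&from=-8hours"
    "&target=secondYAxis(hosts.git1_dmz_scl3_mozilla_com.load.load.shortterm)"
    "&target=hosts.git1_dmz_scl3_mozilla_com.memory.memory.used.value"
    "&target=hosts.git1_dmz_scl3_mozilla_com.swap.swap.used.value"
)

resp = requests.get(GRAPH_URL, timeout=30)
resp.raise_for_status()
with open("git1-mem-load.png", "wb") as out:
    out.write(resp.content)
print("saved git1-mem-load.png (%d bytes)" % len(resp.content))
</pre>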
 


<br />
<s>(jlund) reboot all these xp stuck slaves - </s>[https://bugzil.la/977341 <s>https://bugzil.la/977341</s>]<s> - XP machines out of action</s><br />
<s>** </s>[https://secure.pub.build.mozilla.org/builddata/reports/slave_health/slavetype.html?class=test&type=t-xp32-ix <s>https://secure.pub.build.mozilla.org/builddata/reports/slave_health/slavetype.html?class=test&type=t-xp32-ix</s>]<s> THERE ARE ONLY 4 MACHINES TODAY THAT ARE "BROKEN" AND ONLY HUNG TODAY.</s><br />
<s>pmoore: there is only 1 now</s><br />
* (jlund) iterate through old disabled slaves in these platform lists - [https://bugzil.la/984915 https://bugzil.la/984915] - Improve slave health for disabled slaves <- THIS WAS NOT DONE. I ASKED PMOORE TO HELP<br />
** pmoore: I'm not entirely sure which platform lists this means, as the bug doesn't contain a list of platforms. So I am looking at all light blue numbers on [https://secure.pub.build.mozilla.org/builddata/reports/slave_health/index.html https://secure.pub.build.mozilla.org/builddata/reports/slave_health/index.html] (i.e. the disabled totals per platform). When jlund is online later I'll clarify with him.<br />
** jlund: thanks pete. Sorry, I meant all platform lists, I suppose starting with whatever platform held our worst wait times. I have started going through the disabled hosts looking for 'forgotten ones'


'''2014-04-10 to 2014-04-11''' (Thursday and Friday)


{{bug|995060}}<br />
Nasty tree closure lasting several hours<br />
browser-chrome (b-c) jobs taking a very long time and producing log files too large for buildbot to handle<br />
Timeout for MOCHITEST_BC_3 increased from 4200s to 12000s<br />
When Joel's patch ({{bug|984930}}) has landed, we should "undo" the changes from bug 995060 and put the timeout back down to 4200s (it was just a temporary workaround). Align with edmorley on this.


{{bug|975006}}<br />
{{bug|938872}}<br />
after much investigation, it turns out a monitor is attached to this slave - can you raise a bug to DC ops to get it removed?


<br />
Loaners returned:<br />
[https://bugzil.la/978054 https://bugzil.la/978054]<br />
[https://bugzil.la/990722 https://bugzil.la/990722]<br />
[https://bugzil.la/977711 https://bugzil.la/977711] (discovered this older one)


Loaners created:<br />
{{bug|994283}}


<br />
{{bug|994321#c7}}<br />
Still problems with 7 slaves that can't be rebooted:<br />
[https://secure.pub.build.mozilla.org/builddata/reports/slave_health/slave.html?class=test&type=talos-mtnlion-r5&name=talos-mtnlion-r5-005 talos-mtnlion-r5-005]<br />
[https://secure.pub.build.mozilla.org/builddata/reports/slave_health/slave.html?class=test&type=talos-mtnlion-r5&name=talos-mtnlion-r5-006 talos-mtnlion-r5-006]<br />
[https://secure.pub.build.mozilla.org/builddata/reports/slave_health/slave.html?class=test&type=talos-mtnlion-r5&name=talos-mtnlion-r5-061 talos-mtnlion-r5-061]<br />
[https://secure.pub.build.mozilla.org/builddata/reports/slave_health/slave.html?class=test&type=talos-mtnlion-r5&name=talos-mtnlion-r5-065 talos-mtnlion-r5-065]<br />
[https://secure.pub.build.mozilla.org/builddata/reports/slave_health/slave.html?class=test&type=talos-mtnlion-r5&name=talos-mtnlion-r5-074 talos-mtnlion-r5-074]<br />
[https://secure.pub.build.mozilla.org/builddata/reports/slave_health/slave.html?class=test&type=talos-mtnlion-r5&name=talos-mtnlion-r5-086 talos-mtnlion-r5-086]<br />
[https://secure.pub.build.mozilla.org/builddata/reports/slave_health/slave.html?class=test&type=talos-mtnlion-r5&name=talos-mtnlion-r5-089 talos-mtnlion-r5-089]<br />
Slave API has open bugs on all of these.


<br />
[https://bugzil.la/977341 https://bugzil.la/977341]<br />
Stuck Win XP slaves - only one was stuck (t-xp32-ix-073). Rebooted.


<br />
Thanks Callek!


'''Week of 2014-04-05 to 2014-04-09 (thurs-wed)'''


Hiya pete


* van has been working hard at troubleshooting winxp 085: {{bug|975006#c21}}<br />
** this needs to be put back into production along with 002 and reported back to van on findings<br />
** note this is a known failing machine. Please try to catch it failing before the sheriffs do.


* loans that can be returned that I have not got to:<br />
** [https://bugzil.la/978054 https://bugzil.la/978054]<br />
** {{bug|990722}}


* we should reconfig either Thursday or by Friday at the latest
* latest aws sanity check runthrough yielded better results than before. Very few long-running lazy instances. Very few unattended loans. This should be checked again on Friday


* there was a try push that broke a series of mtnlion machines this afternoon. Callek, nthomas, and Van worked hard at helping me diagnose and solve the issue.<br />
** there are some slaves that failed to reboot via slaveapi. This is worth following up on, especially since we barely have any 10.8 machines to begin with:<br />
** {{bug|994321#c7}}


* on Tuesday we started having github/vcs-sync issues where sheriffs noticed that bumper bot wasn't keeping up with csets on github.<br />
** looks like things have been worked on and possibly fixed but just a heads up<br />
** {{bug|993632}}


* I never got around to doing this:<br />
** iterate through old disabled slaves in these platform lists - [https://bugzil.la/984915 https://bugzil.la/984915] - Improve slave health for disabled slaves. This was discussed in our buildduty mtg. Could you please look at it: [https://etherpad.mozilla.org/releng-buildduty-meeting https://etherpad.mozilla.org/releng-buildduty-meeting]


* (jlund) reboot all these xp stuck slaves - [https://bugzil.la/977341 https://bugzil.la/977341] - XP machines out of action<br />
** [https://secure.pub.build.mozilla.org/builddata/reports/slave_health/slavetype.html?class=test&type=t-xp32-ix https://secure.pub.build.mozilla.org/builddata/reports/slave_health/slavetype.html?class=test&type=t-xp32-ix]<br />
** broken machines rebooted. Looking at the list now, there are 4 machines that are 'broken' and stopped doing a job today.


* as per {{bug|991259#c1}}, I checked on these and the non-green ones should be followed up on<br />
tegra-063.tegra.releng.scl3.mozilla.com is alive <- up and green<br />
tegra-050.tegra.releng.scl3.mozilla.com is alive <- up and green<br />
tegra-028.tegra.releng.scl3.mozilla.com is alive <- up but not running jobs<br />
tegra-141.tegra.releng.scl3.mozilla.com is alive <- up and green<br />
tegra-117.tegra.releng.scl3.mozilla.com is alive <- up but not running jobs<br />
tegra-187.tegra.releng.scl3.mozilla.com is alive <- up and green<br />
tegra-087.tegra.releng.scl3.mozilla.com is alive <- up and green<br />
tegra-299.tegra.releng.scl3.mozilla.com is alive <- up and green<br />
tegra-309.tegra.releng.scl3.mozilla.com is alive <- up and green<br />
tegra-335.tegra.releng.scl3.mozilla.com is alive <- up and green


as per jhopkins last week and reformatting: again, the non-green ones should be followed up on.<br />
tegra-108 - bug 838425 - SD card reformat was successful <- can't write to sd<br />
tegra-091 - bug 778886 - SD card reformat was successful <- sdcard issues again<br />
tegra-073 - bug 771560 - SD card reformat was successful <- lockfile issues<br />
tegra-210 - bug 890337 - SD card reformat was successful <- green in prod<br />
tegra-129 - bug 838438 - SD card reformat was successful <- fail to connect to telnet<br />
tegra-041 - bug 778813 - SD card reformat was successful <- sdcard issues again<br />
tegra-035 - bug 772189 - SD card reformat was successful <- sdcard issues again<br />
tegra-228 - bug 740440 - SD card reformat was successful <- fail to connect to telnet<br />
tegra-133 - bug 778923 - SD card reformat was successful <- green in prod<br />
tegra-223 - bug 740438 - SD card reformat was successful <- unable to properly cleanup foopy processes<br />
tegra-080 - bug 740426 - SD card reformat was successful <- green in prod<br />
tegra-032 - bug 778899 - SD card reformat was successful <- sdcard issues again<br />
tegra-047 - bug 778909 - SD card reformat was successful; have not got past here<br />
tegra-038 - bug 873677 - SD card reformat was successful<br />
tegra-264 - bug 778841 - SD card reformat was successful<br />
tegra-092 - bug 750835 - SD card reformat was successful<br />
tegra-293 - bug 819669 - SD card reformat was successful


'''Week of 2014-03-31 to 2014-04-04'''


Wednesday:<br />
Someone will need to follow up on how these tegras did since I reformatted their SD cards:


tegra-108 - bug 838425 - SD card reformat was successful<br />
tegra-091 - bug 778886 - SD card reformat was successful<br />
tegra-073 - bug 771560 - SD card reformat was successful<br />
tegra-210 - bug 890337 - SD card reformat was successful<br />
tegra-129 - bug 838438 - SD card reformat was successful<br />
tegra-041 - bug 778813 - SD card reformat was successful<br />
tegra-035 - bug 772189 - SD card reformat was successful<br />
tegra-228 - bug 740440 - SD card reformat was successful<br />
tegra-133 - bug 778923 - SD card reformat was successful<br />
tegra-223 - bug 740438 - SD card reformat was successful<br />
tegra-080 - bug 740426 - SD card reformat was successful<br />
tegra-032 - bug 778899 - SD card reformat was successful<br />
tegra-047 - bug 778909 - SD card reformat was successful<br />
tegra-038 - bug 873677 - SD card reformat was successful<br />
tegra-264 - bug 778841 - SD card reformat was successful<br />
tegra-092 - bug 750835 - SD card reformat was successful<br />
tegra-293 - bug 819669 - SD card reformat was successful


 
<br />
'''Week of 2014-03-17 to 2014-03-21'''<br />
buildduty: armenzg


'''Monday'''<br />
* bugmail and deal with broken slaves<br />
* mergeday


'''Tuesday'''<br />
* reviewed aws sanity check<br />
* cleaned up and assigned some buildduty bugs<br />
* reconfig


TODO:<br />
* bug 984944<br />
* swipe through problem tracking bugs


'''Wednesday'''


'''Thursday'''


'''Friday'''


'''Week of 2014-01-20 to 2014-01-24'''<br />
buildduty: armenzg


'''Monday'''<br />
* deal with space warnings<br />
* loan to dminor<br />
* terminated returned loan machines


'''Tuesday'''<br />
* loan win64 builder<br />
* Callek helped with the tegras


TODO<br />
* add more EC2 machines






'''Week of 2014-01-20 to 2014-01-24'''<br />
buildduty: jhopkins


'''Bugs filed:'''<br />
* Bug 962269 (dupe) - DownloadFile step does not retry status 503 (server too busy)
* Bug 962698 - Expose aws sanity report data via web interface in json format
* Bug 963267 - aws_watch_pending.py should avoid region/instance combinations that lack capacity


'''Monday'''<br />
* Nightly updates are disabled (bug 908134 comment 51)
* loan bug 961765


'''Tuesday'''<br />
* added new buildduty task: [[ReleaseEngineering:Buildduty#Semi-Daily]]
* Bug 934938 - Intermittent ftp.m.o "ERROR 503: Server Too Busy"


'''Wednesday'''<br />
* added AWS instance tag "moz-used-by" to the nat-gateway instance to help with processing the long-running instances report (see the sketch below)
* would be nice if we could get the aws sanity report data to be produced by slave api so it could be pulled by a web page and correlated with recent job history, for example
* Bug 934938 - jakem switched to round-robin DNS (see {{bug|934938#c1519}} for technical details) to avoid "thundering herd" problem.
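
For reference, a minimal sketch of that kind of tagging (Python with boto; the region, instance id, and tag value below are placeholders, not taken from these notes):

<pre>
# Sketch only: add a "moz-used-by" tag so the long-running-instances report can
# attribute the instance. Region, instance id, and tag value are placeholders.
import boto.ec2

conn = boto.ec2.connect_to_region("us-east-1")
conn.create_tags(["i-00000000"], {"moz-used-by": "nat-gateway (releng)"})
</pre>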




'''Thursday'''<br />
* AWS lacking capacity and slowing down instance startup. Filed 963267.


'''Friday'''<br />
* missed some loan requests b/c I thought they were being included in the buildduty report (2 previous ones seemed to be). Can we add loans to the buildduty report?
* some automated slave recovery not happening due to Bug 963171 - please allow buildbot-master65 to talk to production slaveapi






'''Week of 2014-01-13 to 2014-01-17'''<br />
buildduty: bhearsum


'''Bugs filed (not a complete list):'''<br />
* '''Bug 960535''' - Increase bouncerlatestchecks Nagios script timeout


'''Week of 2014-01-06 to 2014-01-10'''<br />
buildduty: armenzg


'''Bugs filed:'''<br />
* [https://bugzil.la/956788 https://bugzil.la/956788] - Allow slaveapi to clobber the basedir to fix machines<br />
* [https://bugzil.la/957630 https://bugzil.la/957630] - Invalid tokens<br />
* [https://bugzil.la/930897 https://bugzil.la/930897] - mochitest-browser-chrome timeouts


'''Monday'''<br />
* loan machines<br />
* deal with some broken slaves


'''Tuesday'''<br />
* loan machines<br />
* deal with some broken slaves<br />
* reconfig<br />
* second reconfig for backout


'''Wednesday'''<br />
* enable VNC for a Mac loaner<br />
* check signing issues filed by Tomcat<br />
* mozharness merge<br />
* help RyanVM with some timeout


'''Thursday'''<br />
* do reconfig with jlund


'''Friday'''<br />
* restart redis<br />
* loan 2 machines<br />
* process problem tracking bugs


'''Week of 2013-12-16 to 2013-12-20'''<br />
buildduty: jhopkins


Bugs filed:<br />
* '''Bug 950746''' - Log aws_watch_pending.py operations to a machine-parseable log or database<br />
* '''Bug 950780''' - Start AWS instances in parallel<br />
* '''Bug 950789''' - MozpoolException should be retried<br />
* '''Bug 952129''' - download_props step can hang indefinitely<br />
* '''Bug 952517''' - Run l10n repacks on a smaller EC2 instance type


'''Monday'''<br />
* several talos-r3-fed* machines have a date of 2001<br />
* adding hover-events to our slave health pages would be helpful to get quick access to recent job history<br />
* other interesting possibilities:<br />
** a page showing the last 50-100 jobs for all slaves in a class<br />
** ability to filter on a certain builder to spot patterns/anomalies, e.g. "robocop tests always fail on this slave but not the other slaves"


'''Wednesday'''<br />
* taking over 15 minutes for some changes to show as 'pending' in tbpl. Delay from the scheduler master seeing the change in twistd.log<br />
** could be Bug 948426 - Random failed transactions for [http://hg.mozilla.org/ http://hg.mozilla.org/]<br />
* Bug 951558 - buildapi-web2 RabbitMQ queue is high


'''Friday'''


* '''Bug 952448''' - Integration Trees closed, high number of pending linux compile jobs<br />
** AWS instance 'start' requests returning "Error starting instances - insufficient capacity"<br />
* dustin has fixed '''Bug 951558''' - buildapi-web2 RabbitMQ queue is high


'''Week of 2013-11-04 to 2013-11-08'''<br />
buildduty: jhopkins


'''Tuesday'''<br />
* rev2 migrations going fairly smoothly (biggest issue is some IPMI interfaces being down and requiring a power cycle by DCOps)


'''Wednesday'''<br />
* needs attention (per RyanVM): '''Bug 935246''' - Graphserver doesn't know how to handle the talos results from non-PGO builds on the B2G release branches<br />
* IT's monitoring rollout happening today<br />
* request to build "B2G device image nightlies": non-obvious what the builders are or which masters they live on. No howto I could find. How do we automate this and keep it<br />
** stopgap added to wiki: [[ReleaseEngineering:Buildduty:Other_Duties#Trigger_B2G_device_image_nightlies]]<br />
** see also: {{bug|793989}}


'''Friday'''


* RyanVM reports that pushing mozilla-beta to Try is a fairly normal thing to do but it fails on the win64-rev2 build slaves. He has been helping with backporting the fixes in [[User:Jhopkins/win64rev2Uplift]] to get this addressed.<br />
* catlee's buildbot checkconfig improvement went into production but we need a restart on all the masters to get the full benefit. No urgency, however.


'''Week of 2013-10-14 to 2013-10-18'''<br />
buildduty: armenzg


'''Monday'''<br />
* uploaded mozprocess<br />
* landed a puppet change that made all Linux64 hosts install libvirt-bin and caused them to fail to sync with puppet<br />
** I had to back out and land a patch to uninstall the package<br />
** we don't know why it got installed<br />
* redis issues<br />
** build-4hr issues<br />
** signing issues


'''Tuesday'''<br />
* returned some slaves to the pool<br />
* investigated some cronmail<br />
* uploaded talos.zip<br />
* reclaimed machines and requested reimages


'''Wednesday'''<br />
* put machines back into production<br />
* loan<br />
* process delays email


<br />
'''Week of 2013-10-14 to 2013-10-18'''<br />
buildduty: coop (callek on monday)


'''Monday'''<br />
* Went through [https://secure.pub.build.mozilla.org/builddata/reports/slave_health/buildduty_report.html https://secure.pub.build.mozilla.org/builddata/reports/slave_health/buildduty_report.html] to reduce bug lists<br />
** Bunch of win build machines requested to be reimaged as rev2 machines<br />
*** Will need buildbot-configs changes for slave list changes before being re-enabled.<br />
* Two loaners<br />
** one w64 rev2 with an open question on whether we need to manually remove secrets ourselves


'''Tuesday'''<br />
* meetings! At least one of them was about buildduty.


'''Wednesday'''<br />
* shut down long-running AWS instance for hverschore: bug 910368<br />
* investigating disabled mtnlion slaves<br />
** many needed the next step taken: filing the IT bug for recovery


'''Thursday'''<br />
* filed '''Bug 927951''' - Request for smoketesting Windows builds from the cedar branch<br />
* filed '''Bug 927941''' - Disable IRC alerts for issues with individual slaves<br />
* reconfig for new Windows in-house build master


'''Friday'''<br />
*


'''Week of 2013-09-23 to 2013-09-27'''


buildduty: armenzg


'''Monday'''<br />
* help marcia debug some b2g nightly questions<br />
* meetings and meetings and distractions<br />
* started patches to transfer 11 win64 hosts to become try ones


'''Tuesday'''<br />
* ran a reconfig<br />
* did a backout for buildbotcustom and ran another reconfig<br />
* started work on moving win64 hosts from the build pool to the try pool<br />
* analyzed a bug filed by a sheriff wrt clobberer<br />
** no need for buildduty to fix (moved to Platform Support)<br />
** asked for people's input on the proper fix<br />
* reviewed some patches for jmaher and kmoir<br />
* assisted edmorley with a clobberer issue<br />
* assisted edmorley with a git.m.o issue<br />
* assisted RyanVM with a git.m.o issue<br />
* put w64-ix-slave64 in the production pool<br />
* updated the buildduty wiki page<br />
* updated the wiki page on moving machines from one pool to another


'''Wednesday'''<br />
* messy


'''Thursday'''<br />
* messy


'''Friday'''<br />
* messy


'''Week of 2013-09-16 to 2013-09-20'''


buildduty: Callek


Monday:<br />
* (pmoore) batch 1 of watcher update [Bug 914302]<br />
* MERGE DAY<br />
** We missed having a point person for merge day again, rectified (thanks armen/rail/aki)<br />
* 3 reconfigs or so:<br />
** Merge day<br />
** Attempted to fix talos-mozharness (broken by the panda-mozharness landing)<br />
** Backed out the talos-mozharness change for continued bustage<br />
** Also backed out emulator-ics for in-tree (crossing-tree) bustage relating to the name change.<br />
*** Thanks to aki for helping while I had to slip out for a few minutes<br />
* Loaner bug poking/assigning<br />
* Did one high priority loaner needed for a tree closure which blocked MERGE DAY


Tuesday:<br />
* (pmoore) batch 2 of watcher update [Bug 914302]<br />
* Buildduty Meeting<br />
* Bug queue churn-through<br />
* Hgweb OOM: Bug 917668


Wednesday:<br />
* (pmoore) batch 3 of watcher update [Bug 914302]<br />
* Reconfig<br />
* Hgweb OOM continues (IT downtimed it, bkero is on PTO today, no easy answer)<br />
** Very low visible tree impact at present<br />
* Bug queue churn-through<br />
* Discovered the last-job-per-slave view of slave_health is out of date.<br />
* Discovered reboot-history is either out of date or reboots are not running for tegras


Thursday:<br />
* (pmoore) batch 4 [final] of watcher update [Bug 914302]<br />
* Hgweb OOM continues<br />
* Bug queue churn... focus on tegras today<br />
* Coop fixed last-job-per-slave generation, and slaves_needing_reboots<br />
* Downtime (and problems) for the scl1 nameserver and scl3 zeus nodes<br />
** Caused a tree closure due to buildapi01 being in scl1 and a long delay






'''Week of 2013-09-09 to 2013-09-13'''


buildduty: coop


Monday:<br />
* meetings<br />
* kittenherder hanging on bld-centos - why?<br />
** multiple processes running, killed off (not sure if that's the root cause)


Tuesday:<br />
* buildduty meeting<br />
** filed {{bug|913606}} to stop running cronjobs to populate mobile-dashboard<br />
* wrote reboot_tegras.py, a quick-n-dirty script to kill buildbot processes and reboot tegras listed as hung (see the sketch below)
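
The real reboot_tegras.py was an internal quick-n-dirty script; the sketch below is a hypothetical reconstruction only. It assumes the hung tegras are listed one hostname per line in a tegras.txt file and that each device runs a SUT agent accepting a reboot command on port 20701 (both assumptions); the kill-buildbot-processes-on-the-foopy half is omitted.

<pre>
# Hypothetical sketch, not the original reboot_tegras.py. Assumes a SUT agent
# listening on port 20701 that accepts a "rebt" command; adjust to taste.
import socket

SUT_PORT = 20701

def reboot_tegra(host, timeout=30):
    """Ask the device's SUT agent to reboot; return True if we could connect."""
    try:
        sock = socket.create_connection((host, SUT_PORT), timeout=timeout)
    except socket.error as e:
        print("%s: could not connect (%s)" % (host, e))
        return False
    try:
        sock.sendall(b"rebt\r\n")
    finally:
        sock.close()
    print("%s: reboot requested" % host)
    return True

if __name__ == "__main__":
    with open("tegras.txt") as hosts:
        for line in hosts:
            host = line.strip()
            if host:
                reboot_tegra(host)
</pre>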


Wednesday:<br />
* re-enabled kittenherder rebooting of tegras<br />
* wrote bugzilla shared queries for releng-buildduty, releng-buildduty-tagged, and releng-buildduty-triage<br />
* playing with bztools/bzrest to try to get a query that considers dependent bugs<br />
* meetings<br />
* deploying registry change for bug 897768 (fuzzing dumps)


Thursday:<br />
* broke up kittenherder rebooting of tegras into 4 batches to improve turnaround time<br />
* got a basic buildduty query working with bztools<br />
* respun Android nightly<br />
* resurrecting as many Mac testers as possible to deal with load<br />
* filed bug 915766 to audit pdu2.r102-1.build.scl1<br />
* resurrected a bunch of talos-r3-fed machines that were not running buildbot


Friday:<br />
* AWS US-East-1 outage<br />
* reconfig for nexus4 changes<br />
** re-reconfig to back out kmoir's changes that closed the tree: {{bug|829211}}<br />
* Mac tester capacity


'''Week of 2013-09-02 to 2013-09-06'''<br />
buildduty: bhearsum


Monday<br />


* US/Canada holiday


Tuesday<br />


* '''Bug 912225''' - Intermittent B2G emulator image "command timed out: 14400 seconds elapsed, attempting to kill" or "command timed out: 3600 seconds without output, attempting to kill" during the upload step
** Worked around by lowering the priority of use1. The acute issue is fixed; we suspect there are still symptoms from time to time.
** Windows disconnects may be the early warning sign.
'''Week of 2013-08-26 to 2013-08-30'''<br />
buildduty: jhopkins


Monday<br />


* many talos-r3-w7 slaves have a broken session which prevents a new SSH login session (you can authenticate but it kicks you out right away). Needed to RDP in, open a terminal as Administrator, and delete the files in c:\program files\kts\log\ip-ban\* and active-sessions\* (see the cleanup sketch below).
* many other talos-r3-w7 slaves had just ip-ban\* files (no active-sessions\* files) which prevented kittenherder from managing the slave, since there are no IPMI or PDU mechanisms to manage these build slaves.
* trying slaveapi
** IPMI reboot failing (no netflow)
<pre>
$ curl -dwaittime=60 http://cruncher.srv.releng.scl3.mozilla.com:8000/slave/talos-r3-xp-076/action/reboot
{
  "requestid": 46889168,
  "state": 0,
  "text": ""
}
</pre>
* slavealloc-managed devices are live (per Callek)
* ec2 slave loan to eflores (909186)
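
A small cleanup sketch for the KTS ban/session files mentioned in the first bullet (Python, run from an Administrator shell on the slave itself; the paths are the ones from the note, nothing else is assumed):

<pre>
# Sketch only: remove KTS ip-ban and active-session files that block new SSH
# logins on talos-r3-w7 slaves. Run as Administrator on the slave itself.
import glob
import os

KTS_LOG = r"c:\program files\kts\log"

for pattern in (r"ip-ban\*", r"active-sessions\*"):
    for path in glob.glob(os.path.join(KTS_LOG, pattern)):
        print("removing %s" % path)
        os.remove(path)
</pre>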




Tuesday<br />


* osx 10.7 loan to jmaher (909510)
* [[ReleaseEngineering/Managing_Buildbot_with_Fabric]] takes awhile to set up. Also, "suggestions" seems like the de facto way to do a reconfig.


Wednesday<br />


* many tegras down for weeks. documentation needs improvement
** when to do a step
** what to do if the step fails
* linux64 slave loan to h4writer (bug 909986)
* created [https://intranet.mozilla.org/RelEngWiki/index.php/How_To/Disable_Updates https://intranet.mozilla.org/RelEngWiki/index.php/How_To/Disable_Updates] (bug 910378)


Thursday<br />


* filed '''Bug 910818''' - Please investigate cause of network disconnects 2013-08-29 10:22-10:24 Pacific
** need to '''automate''' gathering of details for filing this type of bug
* '''Bug 910662''' - B2G Leo device image builds broken with "error: patch failed: include/hardware/hwcomposer.h:169" during application of B2G patches to android source


Friday<br />


* Fennec nightly updates disabled again due to startup crasher. Bug 911206
* These tegras need attention (rebooting hasn't helped):
** [https://releng.etherpad.mozilla.org/191 https://releng.etherpad.mozilla.org/191]


'''Week of 2013-08-19 to 2013-08-23'''<br />
buildduty: armenzg


Monday<br />


* 40+ Windows builders had not been rebooted for several days
** {{bug|906660}}
** rebooted a bunch with csshX
** I gave a couple to jhopkins to look into
*** cruncher was banned from being able to ssh
*** ipmi said that it was successful
*** more investigation happening
* edmorley requested that I look into fixing the TMPDIR removal issue
** {{bug|880003}}
** ted to land a fix to disable the test that causes it
** filed a bug for IT to clean up the TMPDIR through puppet
** cleaned up tmpdir manually on 2 hosts and put them back in production to check
* do a reconfig for mihneadb
* rebooted talos-r4-lion-041 upon philor's request due to hdutil
* promote unagi
* upload talos.zip for mobile
** updated docs for talos-bundles [[ReleaseEngineering:Buildduty:Other_Duties#How_to_update_the_talos_zips]]


Tuesday<br />


* I see a bunch of these:
** nagios-releng: Tue 06:04:58 PDT [4193] buildbot-master92.srv.releng.use1.mozilla.com:ntp time is CRITICAL: CHECK_NRPE: Socket timeout after 15 seconds. (http://m.allizom.org/ntp+time)
** {{bug|907158}}
** We disabled all use1 masters
** We reduced use1's priority for host generation
** Fixed Queue dirs for one of the masters by restarting the init service
*** had to kill the previous pid
* rebooted remaining win64 machines
* rebooted remaining bld-lion-r5 machines
* buildapi issues
** Callek took care of it and moved it to Tools
* Queue dir issues
** it seems that the Amazon issues caused this
** I have been stopping and starting the pulse_publisher
** /etc/init.d/pulse_publisher {stop|start}
** the /dev/shm/queue/pulse/new dir will start decreasing (see the sketch below)
* Re-enabled all aws-us-east-1 masters
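
A minimal monitoring sketch for the queue-dir situation above (Python; the queue path comes from the note, the one-minute interval is arbitrary): after an /etc/init.d/pulse_publisher stop/start cycle, the count should trend downward.

<pre>
# Sketch only: watch the pulse publisher backlog drain after restarting
# pulse_publisher. The queue path is taken from the note above.
import os
import time

QUEUE_DIR = "/dev/shm/queue/pulse/new"

def queued():
    """Number of queued message files, or -1 if the directory is missing."""
    try:
        return len(os.listdir(QUEUE_DIR))
    except OSError:
        return -1

if __name__ == "__main__":
    while True:
        print("%s queued: %d" % (time.strftime("%H:%M:%S"), queued()))
        time.sleep(60)
</pre>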


Wednesday<br />


* one of the b2g repos has a 404 bundle and intermittent ISE 500
** {{bug|907693}}
** bhearsum has moved it to IT
** fubar is looking into it
* did a reconfig
** graceful restart for buildbot-master69
*** I had to do a less graceful restart
** graceful restart for buildbot-master70
** graceful restart for buildbot-master71
* loan t-xp32-ix006 in [https://bugzil.la/904219 https://bugzil.la/904219]
* promoted b2g build
* deploy graphs change for jmaher
* vcs major alert
** hwine to look into it


Thursday<br />


* disable production-opsi
* reconfig
* trying to clean up nagios


Friday<br />


* a bunch of win64 machines are not taking jobs
** deployed the fix to all machines
** filed a bug for IT to add it to the task sequence
* some win64 imaging bugs filed
* reconfig for hwine
* Callek deployed his slavealloc change
* merged mozharness
* investigated a bunch of hosts that were down
* we might be having some DNS tree-closing issues
** {{bug|907981#c9}}
** it got cleared within an hour or so




'''Week of 2013-08-12 to 2013-08-16'''


buildduty: coop


Monday:<br />


* ran kittenherder against w64-ix pool
** 56 of 87 build slaves were hung
** suspect shutdown event tracker dialog
* deployed shutdown event tracker fix to all w64-ix slaves
** {{bug|893888#c4}}
* cleaned up [[ReferencePlatforms]]
** [[ReferencePlatforms/Test/Lion]]
** added [[ReferencePlatforms/Win64#Disable_shutdown_event_tracker]]
* tried to promote unagi build, hit problem caused by extra symlinks (aki) last week
** hwine debugged, updated docs: [https://intranet.mozilla.org/RelEngWiki/index.php?title=How_To/Perform_b2g_dogfood_tasks#Trouble_shooting https://intranet.mozilla.org/RelEngWiki/index.php?title=How_To/Perform_b2g_dogfood_tasks#Trouble_shooting]
* re-imaged (netboot) talos-r4-lion-0[30-60]: {{bug|891880}}


Tuesday:<br />


* more cleanup in [[ReferencePlatforms]]
** [[ReferencePlatforms/Test/MountainLion]]
** moved lots of platforms to "Historical/Other"
* re-imaged (netboot) talos-r4-lion-0[61-90]: {{bug|891880}}
* investigated lion slaves linked from {{bug|903462}}
** all back in service
* buildduty meeting @ 1:30pm EDT
* reconfig for Standard8: {{bug|900549}}


Wednesday:<br />

* fixed buildfaster query to cover helix builds
* set aside talos-r4-lion-001 for dustin: {{bug|902903#c16}}
* promoted unagi build for dogfood: 20130812041203
* closed {{bug|891880}}
** [[ReferencePlatforms]] updated to remove need for extra reboot for Mac platforms
* investigated talos failures affecting (primarily) lion slaves: {{bug|739089}}




Thursday:<br />

* fixed up wait times report
** merged two Snow Leopard categories
** added jetpack to Win8 match
* added helix nightlies to buildfaster report
* [[ReleaseEngineering/Buildduty#Meeting_Notes]]
* help with WinXP DST issues: {{bug|878391#c32}}
* compiling [https://github.com/vvuk/winrm https://github.com/vvuk/winrm] to test deploy on w64-ix-slave03
** {{bug|727551}}


Friday:<br />

* investigation into {{bug|905350}}
** basedir wrong in slavealloc
* reconfigs for catlee, aki




<br />
Issues/Questions:


* many/most of the tree closure reasons on treestatus don't have bug#'s. Should we encourage sheriffs to enter bug#'s so others can follow along more easily?<br />
* uncertainty around the difference between mozpool-managed and non-mozpool-managed pandas. How do I take one offline - do they both use disabled.flg? '''A: yes, all pandas use disabled.flg on the foopy'''<br />
** when is it ok to use Lifeguard to force the state to "disabled"? per dustin: [it's] for testing, and working around issues like pandas that are still managed by old releng stuff. it's to save us loading up mysql and writing UPDATE queries<br />
*** what's "old releng stuff"?<br />
* Windows test slaves


* another example of PDU reboot not working correctly: {{bug|737408#c5}} and {{bug|885969#c5}}. We need to automate power off, pause, power on to increase reliability.


Bustages:


* bug 900273 landed and backed out

Latest revision as of 04:47, 3 April 2015


2015-03-23


2015-03-20

  • chemspill in progress, ***NO UNNECESSARY CHANGES***
  • coop going through "All dependencies resolved" section of buildduty report
    • doing all non-pandas first
    • will do a second, panda-only pass after


2015-03-19


2015-03-18


2015-03-17

  • https://bugzil.la/1143681 - Some AWS test slaves not being recycled as expected
    • found a way to track these down (see the sketch after this list)
      • in the AWS console, search for the instances in the Spot Requests tab. You can click on the instance ID to get more info.
      • e.g. for tst-linux64-spot-233, the instance has no name associated and is marked as "shutting-down"
  • https://bugzil.la/1144362 - massive spike in hg load
    • possibly due to new/rescued instances from the morning (re)cloning mh & tools
      • negative feedback loop?
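
A rough sketch of doing the same lookup from a script instead of the AWS console (assumes boto3 and the generic EC2 describe-instances filters; this is not the tooling we actually use, just an illustration):

  # sketch only: list spot instances stuck in "shutting-down" with no Name tag,
  # i.e. the ones that are not being recycled as expected
  import boto3

  def stuck_spot_instances(region="us-east-1"):
      ec2 = boto3.client("ec2", region_name=region)
      stuck = []
      for page in ec2.get_paginator("describe_instances").paginate(
          Filters=[
              {"Name": "instance-lifecycle", "Values": ["spot"]},
              {"Name": "instance-state-name", "Values": ["shutting-down"]},
          ]
      ):
          for reservation in page["Reservations"]:
              for inst in reservation["Instances"]:
                  tags = {t["Key"]: t["Value"] for t in inst.get("Tags", [])}
                  if not tags.get("Name"):
                      stuck.append(inst["InstanceId"])
      return stuck

  if __name__ == "__main__":
      print("\n".join(stuck_spot_instances()))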


2015-03-16


2015-03-13

  • buildbot DB "too many connections" again. (perhaps DBA's are able to increase the conn pool limits?)
  • need a button in slave health that automatically files a diagnostics bug for a given slave (see the sketch after this list)
    • should disable the slave if not already disabled
    • should do the bug linking automatically
    • should have a small text entry box for the description of the diagnostics bug, i.e. why are we asking for diagnostics
    • would hopefully prevent sheriffs from just taking slaves offline and waiting for us to perform the next step(s)
  • file https://bugzil.la/1143018 - Update runslave.py with current machine types and basedirs
    • we essentially guess at the builddir in most cases these days(!)
  • https://bugzil.la/1142825 - high windows test pending
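
A minimal sketch of what the proposed button could do behind the scenes, assuming the standard Bugzilla REST API on bugzilla.mozilla.org; the product/component values below are placeholders, not necessarily where diagnostics bugs actually get filed:

  # sketch only: file a diagnostics bug for a slave via the Bugzilla REST API
  import requests

  def file_diagnostics_bug(slave_name, why, api_key):
      payload = {
          "product": "Infrastructure & Operations",   # placeholder product
          "component": "DCOps",                       # placeholder component
          "version": "other",
          "op_sys": "All",
          "platform": "All",
          "summary": "%s needs diagnostics" % slave_name,
          "description": why,
      }
      resp = requests.post(
          "https://bugzilla.mozilla.org/rest/bug",
          json=payload,
          headers={"X-BUGZILLA-API-KEY": api_key},
          timeout=30,
      )
      resp.raise_for_status()
      return resp.json()["id"]   # new bug id, for linking from slave health

The button would then disable the slave (if not already disabled) and attach the returned bug id to the slave's record, so sheriffs don't have to take slaves offline and wait for us to do the next step.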


2015-03-12

  • filed https://bugzil.la/1142493 - panda-relay-037 is down
  • Win7 test pending >2000 (unclear on why)
  • tree closure
    • caused by RyanVM


2015-03-11


2015-03-10


2015-03-09


2015-03-06


2015-03-05

  • never got to adding win64 m-a nightlies to jacuzzi https://bugzil.la/1139763
  • need to enable slaves from https://bugzil.la/1138672
  • the end_to_end script comments in bugs that mozharness changes are live in production. this is no longer the case for all our build + test jobs (most things aside from vcs-sync, bumper, etc).
    • should we be still automatically updating bugs for mh after reconfig?
    • we need a way to roll out changes to mh on a regular cadence. right now it's up to the individual to update mozharness.json with a REV they want applied and consequently, whatever mh patches are in between are also applied...
    • coop to drop mozharness from end-to-end-reconfig script and email public list
  • added http://pypi.pvt.build.mozilla.org/pub/mozrunner-6.6.tar.gz
  • talked to catlee re: releng-try pipeline
    • fully supportive
    • one wrinkle: how to tackle release tagging
    • coop will get bugs filed today
  • add 4-repo view to slave health?


2015-03-04

  • https://bugzil.la/1138937 - Slave loan request for a t-w864-ix machine
  • reconfig in progress
  • buildduty report:
    • re-imaging a bunch of slaves to help with capacity
  • https://bugzil.la/1138672 - vlan request - move bld-lion-r5-[006-015] machines from prod build pool to try build pool (needs to be enabled)
  • test master upgrades (done)
  • (hwine) meeting with Linda (head of #moc)
    • make more specific requests from #moc
    • share top issues with #moc
    • when: next meeting is 13th
      • come up with prioritized list of releng needs by early next week
  • coop to file bugs re: releng-try improvements
    • add builderlists/dumpmasters diff to travis
    • switch RoR for key repos to github
      • reverse VCS sync flow
      • enable travis testing for forks - this is done on a per-fork basis by the owners of the forks. PR's will get travis jobs regardless.


upgrade test linux masters (https://bugzil.la/1136527):

  • bm51 (complete)
  • bm53 (complete)
  • bm117-tests1-linux64 (complete)
  • bm52-tests1-linux64 (complete)
  • bm54-tests1-linux64 (complete)
  • use1
    • bm67-tests1-linux64 (complete)
    • bm113-tests1-linux64 (complete)
    • bm114-tests1-linux64 (complete)
    • bm120-tests1-linux64 (complete)
    • bm121-tests1-linux64 (complete)
  • usw2
    • bm68-tests1-linux64 (complete)
    • bm115-tests1-linux64 (complete)
    • bm116-tests1-linux64 (complete)
    • bm118-tests1-linux64 (complete)
    • bm122-tests1-linux64 (complete)
    • bm123-tests1-linux64 (started)


add swap (https://bugzil.la/1135664):

  • bm53 (complete)
  • buildbot-master54 (complete)
  • use1
    • buildbot-master117 BAD
    • buildbot-master120 BAD (complete)
    • buildbot-master121 BAD (complete)
  • usw2
    • buildbot-master68 (complete)
    • buildbot-master115 (complete)
    • buildbot-master116 BAD (complete)
    • buildbot-master118 BAD (complete)
    • buildbot-master122 BAD (complete)
    • buildbot-master123 BAD


buildbot-master04 BAD
buildbot-master05 BAD
buildbot-master06 BAD
buildbot-master66 BAD
buildbot-master72 BAD
buildbot-master73 BAD
buildbot-master74 BAD
buildbot-master78 BAD
buildbot-master79 BAD
buildbot-master91 BAD



2015-03-03

  • https://bugzil.la/1138955 - Slow Builds and lagginess
    • tree closure due to backlog (10:00am ET)
    • *mostly* unexpected load (extra poorly-timed pushes to try), although a bunch of test instances not recycling properly
      • coop is investigating these


2015-03-02


2015-02-27

  • queue issues on build masters due to graphene jobs
    • should be resolved by reconfig this morning
  • re-imaging some 10.8 machines as 10.10
    • 10.10 will be running opt jobs on inbound, 10.8 debug on inbound + opt on release branches
    • sheriffs are understandably worried about capacity issues in both pools
  • re-re-imaging talos-linux32-ix-0[01,26]
    • may have an underlying issue with the re-imaging process for linux hw


2015-02-26


2015-02-25

  • things to circle back on today:
    • https://bugzil.la/1136195 - Frequent download timeouts across all trees
    • https://bugzil.la/1136465 - New: Spot instances failing with remoteFailed: [Failure instance: Traceback (failure with no frames): <class 'twisted.spread.pb.PBConnectionLost'>:
    • [Bug 1041763] upgrade ec2 linux64 test masters from m3.medium to m3.large
  • https://bugzil.la/1136531 - Slave loan request for a tst-linux64-spot vm


2015-02-24

  • https://bugzil.la/1136195 - Frequent download timeouts across all trees
    • related to release traffic?
  • release reconfigs don't log themselves
    • should probably reconfig everything not just build/scheduler masters
      • i think this takes care of itself once masters start updating themselves based on tag updates



10:35:13 <hwine> ah, I see coop already asked Usul about 0900PT
10:36:34 <hwine> ashlee: sounds like our theory of load isn't right - can someone check further, please? https://bugzil.la/1136195#c1
10:38:02 <•pir> hwine: check... what?
10:38:56 <hwine> ftp.m.o is timing out and has closed trees. Our guess was release day load, but that appears not to be it
10:39:33 <•pir> hwine: I can't see any timeouts in that link, I may be missing something.
10:39:48 <jlund> ashlee: hwine catlee-lunch we have bug 1130242#c4 to avail of now too it seems. might provide some insight to health or even a possible cause as to why we are hitting timeouts since the the change time lines up within the time of reported timeouts.
10:40:35 <jlund> pir: that's the bug tracking timeouts. there are timeouts across many of our continuous integration jobs: http://ftp.mozilla.org/pub/mozilla.org/firefox/tinderbox-builds/mozilla-inbound-linux/1424790646/mozilla-inbound_ubuntu32_vm_test-jittest-1-bm01-tests1-linux32-build102.txt.gz
10:40:39 mbrandt → mbrandt|lunch
10:40:49 <•pir> hwine: we don't have a lot of visibility into how ftp.m.o works or is not working. This isn't a good situation, but sadly how it is.
10:41:28 <hwine> pir: right, my understanding is that you (moc) coordinates all the deeper dives for IT infrastructure (which ftp.m.o still is)
10:42:15 <•pir> hwine: To clarify, I don't think anyone has a lot of visibility into how ftp.m.o is working :(
10:42:19 <•pir> it's a mess
10:42:34 <•pir> ashlee: want to loop in C ?
10:43:00 <•pir> (and I think mixing continuous build traffic and release traffic is insane, personally)
10:43:26 <•pir> jlund: yes, that's what I was reading and not seeing anythig
10:43:43 <hwine> pir: system should handle it fine (has in the past) release traffic s/b minimal since we use CDNs
10:44:12 <•pir> hwine: should be. isn't.
10:44:18 <•ashlee> pir sure
10:47:53 <•pir> the load on the ftp servers is... minimal
10:48:27 <•fox2mike> jlund: may I ask where these timeouts are happening from?
10:49:15 <jlund> hm, so load may not be the issue. begs the question "what's changed"
10:49:17 <•pir> and what the timeouts actually are. I can't see anything timing out in the listed logs
10:49:37 <•pir> jlund: for ftp.m.o? nothing that I'm aware of
10:50:06 <cyliang> no bandwith alerts from zeus. looking at the load balancers to see if anything pops out.
10:50:09 <•ashish> from http://ftp.mozilla.org/pub/mozilla.org/firefox/tinderbox-builds/mozilla-inbound-linux/1424790646/mozilla-inbound_ubuntu32_vm_test-jittest-1-bm01-tests1-linux32-build102.txt.gz
10:50:13 <•ashish> i see
10:50:14 <•ashish> 08:00:28 WARNING - Timed out accessing http://ftp.mozilla.org.proxxy1.srv.releng.use1.mozilla.com/pub/mozilla.org/firefox/tinderbox-builds/mozilla-inbound-linux/1424790646/firefox-39.0a1.en-US.linux-i686.tests.zip: timed out
10:50:18 <•ashish> what is that server?
10:50:31 <•fox2mike> USE1
10:50:34 <•fox2mike> FUCK YEAH! :p
10:50:35 → agibson joined (agibson@moz-j04gi9.cable.virginm.net)
10:50:36 <•fox2mike> the cloud baby
10:50:55 <•fox2mike> jlund: I bet if you were to try this from other amazon regions, you might not his this
10:50:57 <•ashish> i don't see timeouts for http://ftp.mozilla.org/*
10:50:59 <cyliang> fox2mike: Is this the same timeout stuff as last time?
10:51:03 <•ashish> (in that log)
10:51:03 <•fox2mike> I'm guessing
10:51:06 <•fox2mike> cyliang: ^
10:51:17 <•fox2mike> because the last time we saw random issues
10:51:21 <•fox2mike> it was all us-east1
10:51:39 <•fox2mike> jlund: for reference - bug 1130386
10:52:11 <•fox2mike> our infra is the same, we can all save time by trying to see if you guys hit this from any other amazon region (if that's possible)
10:53:08 <jlund> proxxy is a host from aws but after failing to try that a few times, we poke ftp directly and timeout after 30 min:
10:53:12 <jlund> https://www.irccloud.com/pastebin/WmSehqzj



10:53:13 <wesley> jlund's shortened url is http://tinyurl.com/q47zbvl
10:53:15 <•pir> yay cloud
10:54:36 <•pir> jlund: that download from ftp-ssl works fine from anywhere I have access to test it
10:55:06 <•fox2mike> jlund: where did that fail from?
10:55:40 <jlund> sure, and it doesn't always timeout, but of our thousands of jobs, a bunch have failed and timed out.
10:55:54 <unixfairy> jlund can you be more specific
10:56:10 <jlund> fox2mike: same log example: http://ftp.mozilla.org/pub/mozilla.org/firefox/tinderbox-builds/mozilla-inbound-linux/1424790646/mozilla-inbound_ubuntu32_vm_test-jittest-1-bm01-tests1-linux32-build102.txt.gz
10:56:56 <jlund> sorry, I don't know exact failure rate numbers. RyanVM|sheriffduty may know more.
10:57:03 <•fox2mike> jlund: so
10:57:03 <•fox2mike> builder: mozilla-inbound_ubuntu32_vm_test-jittest-1
10:57:04 <•fox2mike> slave: tst-linux32-spot-105
10:57:10 <•fox2mike> that's from amazon again
10:57:18 <•fox2mike> tst-linux32-spot-105
10:57:23 <•fox2mike> that's a spot instance
10:57:34 <•pir> yep, master: http://buildbot-master01.bb.releng.use1.mozilla.com:8201/
10:57:40 <•fox2mike> us-east1
10:57:45 <•pir> so far the connection I see is use1 as fox2mike says
10:58:14 <•fox2mike> we've been through this before :)
10:58:17 <•fox2mike> is all I'm saying
10:59:02 <jlund> sure. let's make sure we can narrow it down to that. I'll see if I can track down more jobs that have hit the timeout where slaves are not in aws.
10:59:08 <jlund> thanks for your help so far.
10:59:49 <•fox2mike> jlund: aws is fine, anything that's a non use1 failure
10:59:55 <•fox2mike> before we go to non aws failure
11:00:06 <•fox2mike> but your case will narrow it down further
11:00:07 <•fox2mike> thanks!
11:00:11 <jlund> rgr
11:00:15 <RyanVM|sheriffduty> fox2mike: things have been quiet for a little while now
11:00:26 <RyanVM|sheriffduty> but we had a lull awhile ago too before another spike
11:00:36 <RyanVM|sheriffduty> so I'm not feeling overly inclined to say that things are resolved
11:00:55 jp-food → jp
11:01:00 <jlund> RyanVM|sheriffduty: have any mac or windows jobs hit this timeout?
11:01:08 <RyanVM|sheriffduty> yes
11:01:13 <RyanVM|sheriffduty> windows definitely
11:01:26 <jlund> k, fox2mike ^ we don't have any windows machines in the cloud
11:01:48 <RyanVM|sheriffduty> random example - https://treeherder.mozilla.org/logviewer.html#?job_id=6928327&repo=mozilla-inbound
11:01:54 <•ashish> are there logs from thoes machines?
11:01:55 <•ashish> ty
11:02:00 → KaiRo joined (robert@moz-dqe9u3.highway.telekom.at)
11:02:17 <RyanVM|sheriffduty> OSX - https://treeherder.mozilla.org/logviewer.html#?job_id=6924712&repo=mozilla-inbound
11:02:51 jlund → jlund|mtg
11:02:57 ⇐ agibson quit (agibson@moz-j04gi9.cable.virginm.net)
11:04:28 jlund|mtg → jlund
11:04:36 <KaiRo> who is the right contact for getting HTTP requests to a Mozilla-owned domain set up to redirect to a different website (another Mozilla-owned domain)?
11:04:50 <KaiRo> the case in question is bug 998793
11:05:27 → agibson joined (agibson@moz-j04gi9.cable.virginm.net)
11:06:48 <•ashish> KaiRo: looks like that IP is hosted/maintained by the community
11:07:05 <•pir> KaiRo: 173.5.47.78.in-addr.arpa domain name pointer static.173.5.47.78.clients.your-server.de.
11:07:09 <•pir> KaiRo: not ours
11:07:39 <jlund> so, it sounds like we have confirmed that this outside aws. for completeness, I'll see if I can find this happening on usw-2 instances too.
11:08:01 agibson → agibson|brb
11:08:23 <KaiRo> ashish: yes, the IP is right now not Mozilla-hosted (atopal, who does host it and actually is an employee nowadays, will be working on getting it moved to Mozilla in the next months) but the domains are both Mozilla-owned
11:09:02 <•pir> KaiRo: the server isn't, though, and you do redirects on server
11:09:54 <KaiRo> pir: well, what we want in that bug is to have mozilla.at point to the same IP as mozilla.de (or CNAME to it or whatever)
11:10:33 <•pir> KaiRo: ah, that's not the same question
11:11:17 <KaiRo> and the stuff hosted by atopal that I was referring to is actually the .de one - I have no idea what the .at one even points to
11:11:45 <•ashish> KaiRo: ok, file a bug with webops. they'll have to change nameservers, setup dns and then put up redirects as needed
11:12:06 <•pir> that
11:13:27 <KaiRo> ashish: OK, thanks!
11:13:56 <•pir> KaiRo: www.mozilla.de or www.mozilla.com/de/ ?
11:14:13 <•pir> KaiRo: the former is community, the latter is mozilla corp
11:17:04 agibson|brb → agibson
11:17:54 <KaiRo> pir: the former, we want both .at and .de point to the same community site
11:18:57 <•pir> KaiRo: then you need someone in corp to do the dns change and someone who runs the de community site to make sure their end is set up
11:20:08 <KaiRo> pir: sure
11:21:00 <KaiRo> pir: I was mostly concerned about who to contact for the crop piece, I know the community people, we just met this last weekend
11:21:35 <•pir> KaiRo: file a child bug into infra & ops :: moc: service requests
11:21:46 <•pir> KaiRo: if we can't do it directly then we can find someone who can
11:22:10 <KaiRo> pir: thanks, good to know
11:22:18 ⇐ agibson quit (agibson@moz-j04gi9.cable.virginm.net)
11:22:31 <•pir> KaiRo: I'd suggest asking for a CNAME from mozilla.at to mozilla.de so if the de site's IP changes it doesn't break
11:23:03 jlund → jlund|mtg
11:23:51 <KaiRo> pir: yes, that's what I would prefer as well, esp. given the plans to move that communitxy website from atopal's server to Mozilla Community IT
11:25:12 <•ashish> KaiRo: will mozilla.at always remain a direct? (in the near future, at least)
11:25:39 <KaiRo> ashish: in the near future for sure, yes
11:25:45 <•ashish> KaiRo: if so, we can have our static cluster handle the redirect
11:25:59 <•ashish> that migh save some resources for the community
11:26:18 <•pir> if it's ending up on the same server, how does that save resources?
11:27:03 <•ashish> if it's all the same server then yeah, not a huge benefit
11:27:20 Fallen|away → Fallen, hwine → hwine|mtg, catlee-lunch → catlee
11:37:07 <KaiRo> ashish, pir: thanks for your help, I filed bug 1136318 as a result, I hope that moves this forward :)
11:38:31 <•pir> np
11:38:46 <•ashish> KaiRo: yw
11:40:23 coop|lunch → coop|mtg
Tuesday, February 24th, 2015

2015-02-20

  • reimaging a bunch of linux talos machines that have sat idle for 6 months
    • talos-linux32-ix-001
    • talos-linux64-ix-[003,004,008,092]
  • https://bugzil.la/1095300
    • working on slaveapi code for "is this slave currently running a job?"
  • pending is up over 5000 again
    • mostly try
    • Callek: What caused this, just large amounts of pushing? What OS's were pending? etc.


2015-02-19

  • another massive gps push to try, another poorly-terminated json prop
    • https://bugzil.la/1134767
    • rows excised from db by jlund
    • jobs canceled by jlund/nthomas/gps
    • master exception logs cleaned up with:
      • python manage_masters.py -f production-masters.json -R scheduler -R try -j16 update_exception_timestamp


2015-02-18

  • filed https://bugzil.la/1134316 for tst-linux64-spot-341
  • been thinking about builder mappings since last night (see the sketch after this list)
    • simplest way may be to augment current allthethings.json output
      • need display names for slavepools
      • need list of regexps matched to language for each slavepool
      • this can be verified internally very easily: can run regexp against all builders in slavepool
      • external apps can pull down allthethings.json daily(?) and process file to strip out only what they need, e.g. slavepool -> builder regexp mapping
      • would be good to publish hash of allthethings.json so consumers can easily tell when it has updated
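
A sketch of that idea; the URL and the assumed layout of the "builders" section are illustrative only, and the slavepool names and regexps here are hypothetical:

  # sketch only: fetch allthethings.json, publish a hash so consumers can tell
  # when it changed, and boil it down to slavepool -> matching builders
  import hashlib
  import json
  import re
  import urllib.request

  ALLTHETHINGS = "https://secure.pub.build.mozilla.org/builddata/reports/allthethings.json"

  # hypothetical slavepool display names and their builder-name regexps
  SLAVEPOOL_REGEXPS = {
      "tst-linux64-spot": re.compile(r"Ubuntu VM 12\.04 x64 .* test"),
      "t-w864-ix": re.compile(r"Windows 8 64-bit .* test"),
  }

  def fetch():
      raw = urllib.request.urlopen(ALLTHETHINGS).read()
      digest = hashlib.sha256(raw).hexdigest()   # publish alongside the file
      return json.loads(raw), digest

  def slavepool_to_builders(data):
      builders = data.get("builders", {})        # assumed: builder name -> properties
      return {
          pool: sorted(name for name in builders if regexp.search(name))
          for pool, regexp in SLAVEPOOL_REGEXPS.items()
      }

External consumers could compare the published digest against the one they last processed instead of re-downloading and re-parsing the whole file on a fixed schedule.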


2015-02-17


2015-02-13

  • going through buildduty report


2015-02-12

  • buildbot db failover by sheeri (planned)
  • https://bugzil.la/1132469 - tree closure
    • lots of idle slaves connected to masters despite high pending counts
    • have rebooted some masters so far:
      • bm70, bm71, bm72, bm73, bm74, bm91, bm94
    • coop looking into windows builders
      • found 2 builders that hadn't run *any* jobs ever (since late sept at least)


2015-02-11

  • reconfig is needed. last one was on thurs; the planned reconfig on the 10th was blocked
    • will kick off a reconfig at 10am ET
    • bm118 ended up with 2 reconfig procs running
      • disabled in slavealloc, initiated clean shutdown. Will restart when jobs drain.
  • went through aws_sanity_checker backlog
    • lots of unnamed hosts up for multiple days
      • I'm assuming this is mostly for Windows AWS work based on the platform of the image, but we should really push people to tag instances more rigorously, or expect them to get killed randomly
  • recovering "broken" slaves in slave health list
  • Currently from jacuzzi report, 28 pending windows builds (for non-try) that are not in a jacuzzi
    • 18 of them are disabled for varying reasons, should cull that list to see if any of them can/should be turned on.


2015-02-10


2015-02-09

  • STAT for jlund


2015-02-05

  • tree closures
    • [Bug 1130024] New: Extremely high Linux64 test backlog
      • chalking that one up to a ~20% increase in pushes over what we had previously
    • [Bug 1130207] Several tests failing with "command timed out: 1800 seconds without output running" while downloading from ftp-ssl.mozilla.org
      • again, likely load related but nothing too obvious. worked with netops, suspect we were hitting load balancer issues (ZLB) since hg and ftp share balancers and hg was under heavy load today
      • dcurado will follow up
        • and his follow up: bug 1130242
  • two reconfigs
  • dev-stage01 was running low on disk space
  • loan for sfink



2015-02-04

  • treeherder master db node is getting rebooted for ghost patching
    • I asked mpressman to do it tomorrow and confirm with #treeherder folks first, as there were not many people on who were familiar with the system
  • puppet win 2008 slaves are ready for the big leagues (prod)!
    • I will be coordinating with markco the testing on that front
  • did a reconfig. lots landed
  • investigated the 13 win builders that got upgraded RAM. 4 of them have been disabled for various issues
  • dustin ghost patched bm103 and signing5/6


2015-02-03

  • fallout from: Bug 1127482 - Make Windows B2G Desktop builds periodic
    • caused a ~dozen dead command items every 6 hours
    • patched: bug 1127482#c15
    • moved current dead items to my own special dir in case I need to poke them again
    • more dead items will come every 6 hours till above patch lands
  • arr/dustin ghost slave work
    • pod4 and 5 of pandas was completed today
      • 1 foopy failed to clone tools (/build/sut_tools/) on re-image
        • it was a timeout and puppet wasn't smart enough to re-clone it without a removal first
    • try linux ec2 instances completed
  • maybe after ami / cloud-tools fallout we should have nagios alerts for when aws spins up instances and kills them right away
  • pro-tip when looking at ec2 graphs:
    • zoom in or out to a time you care about and click on individual colour headings in legend below graph
      • last night I did not click on individual moz-types under the running graph, and since there are so few bld-linux builders that run normally anyway, it was hard to notice any change



2015-02-02

  • https://bugzil.la/1088032 - Test slaves sometimes fail to start buildbot after a reboot
    • I thought the problem had solved itself, but philor has been rebooting windows slaves every day, which is why we haven't run out of windows slaves yet
    • may require some attention next week
  • panda pod round 2 and 3 started today
    • turns out disabling pandas in slavealloc can kill its current job
    • Calling fabric's stop command (disable.flg) on foopies kills its current job
    • This was a misunderstanding in terms of plan last week, but is what we did pods 1->3 with, and will be the continued plan for next sets
    • we'll inform sheriffs at start and end of each pod's work
  • reconfig is failing as masters won't update local repos
    • vcs error: 500 ISE
    • fallout from vcs issues. gps/hwine kicked a webhead and all is merry again
  • added a new report link to slave health for runner's dashboard
  • late night Tree Closures
    • Bug 1128780
      • test pending skyrocketed, builds not running, builder graphs broken
      • tests were just linux test capacity (with ~1600 pending in <3 hours)
      • graphs relating to running.html were just a fallout from dead-code removal
      • builds not running brought together mrrrgn dustin and catlee and determined it was fallout from dustin's AMI work with cent 6.5 causing earlier AMI's to get shut off automatically on us
    • generic.scl3 got rebooted, causing mozpool to die out and restart, leaving many panda jobs dead
  • B2G nightlies busted, unknown cause
    • Bug 1128826



2015-01-30

  • loan for markco: https://bugzil.la/1127411
  • GHOST
  • started reconfig 11:00 PT
  • fewer pandas decomm-ed than anticipated, will have final numbers today
  • https://bugzil.la/1109862 - re-assigned to relops for dll deployment
  • buildapi + new buildbot passwd: do we know what went wrong here?
    • catlee suspects he updated the wrong config
  • positive feedback from philor on Callek's jacuzzi changes


2015-01-29


2015-01-28


2015-01-27


2015-01-26

  • audited windows pool for RAM: bug 1122975#c6
  • 'over the weekend': small hiccup with the bgp router swap bug: killed all of scl3 for ~10 min, not on purpose.
    • tl;dr - everything came back magically and I only had to clear up ~20 command queue jobs
  • which component is for nagios bugs these days? seems like mozilla.org::Infrastructure & Operations bounced https://bugzil.la/1125218 back to releng::other. do we (releng) play with nagios now?
    • "MOC: Service Requests" - refreshed assurance per chat with MOC manager (Linda)
  • terminated loan with slaveapi: bug 1121319#c4
  • attempted reconfig but hit conflict in merge: bug 1110286#c13
  • catlee is changing buildapi r/o sql pw now (11:35 PT)
  • updated trychooser to fix bustage


2015-01-23

  • deployed new bbot r/o pw to aws-manager and 'million other non puppetized tools'
    • do we have a list? We should puppetize them *or* replace them
  • filed: Bug 1125218 - disk space nagios alerts are too aggressive for signing4.srv.releng.scl3.mozilla.com
  • investigated: Bug 1124200 - Android 4 L10n Nightly Broken
  • report-4hr hung at 10:42 - coop killed the cron task



2015-01-22

  • https://bugzil.la/1124705 - tree closure due to builds-4hr not updating
    • queries and replication blocked in db
    • sheeri flushed some tables, builds-4hr recovered
    • re-opened after 20min
  • https://bugzil.la/1121516 - sheeri initiated buildbot db failover after reconfig (per email)
  • philor complaining about panda state:
    • "I lied about the panda state looking totally normal - 129 broken then, fine, exactly 129 broken for all time, not so normal"
    • filed Bug 1124863 - more than 100 pandas have not taken a job since 2015-01-20 around reconfig
      • status: fixed
  • filed Bug 1124850 - slaveapi get_console error handling causes an exception when log formatting
    • status: wontfix but pinged callek before closing
  • filed Bug 1124843 - slaveapi cltbld creds are out of date
    • status: fixed, also improved root pw list order
  • did a non merge reconfig for armen/bustage
  • b2g37 fix for bustage I (jlund) caused. reconfiged https://bugzil.la/1055919


2015-01-21

  • landed fix for https://bugzil.la/1123395 - Add ability to reboot slaves in batch on the slavetype pag
  • many of our windows timeouts (2015-01-16) may be the result of not having enough RAM. Need to look into options like doubling page size: bug 1110236#c20


2015-01-20

  • reconfig, mostly to test IRC notifications
  • master
  • grabbed 2 bugs:
  • network flapping thoughout the day
  • b2g bumper bustage
  • rebooted ~100 pandas that stopped taking jobs after reconfig
  • Bug 1124059 - create a buildduty dashboard that highlights current infra health
  • TODO: Bug for "make it painfully obvious when slave_health testing mode is enabled, thus is displaying stale data"
    • hurt philor in #releng this evening when an old patch with testing mode on was deployed.
    • i have a precommit hook for this now, shouldn't happen again


2015-01-19

  • Filed bugs for issues discussed on Friday:
  • fixed slavealloc datacenter issue for some build/try linux instances - bug 1122582#c7
  • re-imaged b-2008-ix-0006, b-2008-ix-0020, b-2008-ix-0172
  • deployed 'terminate' to slaveapi and then broke slaveapi for bonus points
  • re-patched 'other' aws end points for slaveapi - deploying that today (20th)
  • fixed nical's troublesome loan


2015-01-16 (rollup of below scratchpad)

JLUND
sheriffs requested I investigate:

  • spike in win64 filesystem loops:
    • sheriffs suggested they have pinged many times recently and they will start disabling slaves if objdir nuking is not preferable
    • nuking b-2008-ix-0114 objdir of related builder
    • filed bug 1122746
  • Bug 916765 - Intermittent "command timed out: 600 seconds without output, attempting to kill" running expandlibs_exec.py in libgtest
    • diagnosis: bug 916765#c193
    • follow up: I will post a patch but it is not buildduty actionable from here on out IMO
  • Bug 1111137 - Intermittent test_user_agent_overrides.html | Navigator UA not overridden at step 1 - got Mozilla/5.0 (Android; Mobile; rv:37.0) Gecko/37.0 Firefox/37.0, expected DummyUserAgent
  • Bug 1110236 - Intermittent "mozmake.exe[6]: *** [xul.dll] Error 1318" after "fatal error LNK1318: Unexpected PDB error"
  • there was a common trend from the above 3 bugs with certain slaves


loans:



CALLEK
Puppet Issues:

  • Had a db_cleanup puppet failure on bm81, catlee fixed with http://hg.mozilla.org/build/puppet/rev/d88423d7223f
  • There is a MIG puppet issue blocking our golden AMI's from completing. Ulfr pinged in #releng and I told him he has time to investigate (rather than asking for an immediate backout)


Tree Closure:

  • bug 1122582
  • Linux jobs, test and build were pending far too long
  • I (Callek) got frustrated trying to find out what the problem was and trying to get other releng assistance to look at it
  • Boils down to capacity issues, but was darn hard to pinpoint


Action Items

  • Find some way to identify we're at capacity in AWS easier (my jacuzzi slave health work should help with that, at least a bit)
  • Get <someone> to increase our AWS capacity or find out if/why we're not using existing capacity. If increasing we'll need more masters.




2015-01-16

arr
12:08:38 any of the eu folks around? looks like someone broke puppet last night.

mgerva
12:20:01 arr: i'm here

arr
12:21:10 mgerva: looks like a problem with someone who's trying to upgrade mig
12:21:32 mgerva: it's been sending out mail about failing hosts
12:21:39 wasn't sure if it was also taking them offline eventually
12:21:48 (so I think this is limited to linux)
12:32:43 mgerva is now known as mgerva|afk

pmoore
12:47:16 arr: mgerva|afk: since the sheriffs aren't complaining yet, we can probably leave this for build duty which should start in a couple of hours

arr
12:47:46 pmoore: okay!

pmoore
12:47:51 i don't think anyone is landing puppet changes at the moment, so hopefully it should affect anything… i hope!
12:48:02 *shouldn't*

I see two different errors impacting different types of machines:

  • Issues with mig: Puppet (err): Could not update: Execution of '/usr/bin/apt-get -q -y -o DPkg::Options::=--force-confold --force-yes install mig-agent=20150109+a160729.prod' returned 100
  • Issues with a different config file: Puppet (err): Could not retrieve catalog from remote server: Error 400 on SERVER: Invalid parameter user on File[/builds/buildbot/db_maint/config.ini] at /etc/puppet/production/modules/buildmaster/manifests/db_maintenance.pp:48


2015-01-15


2015-01-14

  • slave loan for tchou
  • started patch to reboot slaves that have not reported in X hours (slave health)
  • reconfig for catlee/ehsan
  • recovered 2 windows builders with circular directory structure


2015-01-13

  • reconfig for ehsan
  • https://bugzil.la/1121015 - dolphin non-eng nightlies busted after merge
    • bhearsum took it (fallout from retiring update.boot2gecko.org)
  • scheduler reconfig for fubar
  • https://bugzil.la/1117811 - continued master setup for Fallen
  • clearing buildduty report backlog


2015-01-12

  • recovering loaned slaves
  • setting up Tb test master for Fallen
    • already has one apparently, some commentary in bug 1117811
  • reconfig took almost 4hr (!)
  • some merge day fallout with split APK


2015-01-08

  • bug 1119447 - All buildbot-masters failing to connect to MySQL: Too many connections
    • caused 3-hour tree closure


2015-01-07


2015-01-06

  • bug 1117395 - Set RETRY on "timed out waiting for emulator to start" and "We have not been able to establish a telnet connection with the emulator"
  • reclaiming loaned machines based on responses to yesterday's notices


2015-01-05

  • sent reminder notices to people with loaned slaves


2014-12-30


2014-12-29

  • returning spot nodes disabled by philor
    • these terminate pretty quickly after being disabled (which is why he does it)
    • to re-enable en masse, run 'update slaves set enabled=1 where name like '%spot%' and enabled=0' in the slavealloc db (see the sketch after this list)
    • use the buildduty report, click on the 'View list in Bugzilla' button, and then close all the spot node bugs at once
  • started going through bugs in the dependencies resolved section based on age. Here is a rundown of state:
    • b-2008-ix-0010: kicked off a re-image, but I did this before fire-and-forget in early Dec and it doesn't seem to have taken. will check back in later
      • :markco using to debug Puppet on Windows issues
    • panda-0619: updated relay info, but unclear in bug whether there are further issues with panda or chassis
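
For reference, the same bulk re-enable as a script (a sketch only; the host and credentials are placeholders, and you still close the tracking bugs via the buildduty report afterwards):

  # sketch only: re-enable all disabled spot nodes in the slavealloc db
  import pymysql

  conn = pymysql.connect(host="slavealloc.db.example.internal",   # placeholder
                         user="slavealloc", password="...", database="slavealloc")
  try:
      with conn.cursor() as cur:
          cur.execute("UPDATE slaves SET enabled=1 "
                      "WHERE name LIKE %s AND enabled=0", ("%spot%",))
          print("re-enabled %d spot nodes" % cur.rowcount)
      conn.commit()
  finally:
      conn.close()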


2014-12-14


2014-12-19

  • what did we accomplish?
    • vcssync, b2g bumper ready to hand off to dev services(?)
    • increased windows test capacity
    • moved staging slaves to production
    • disabled 10.8 on try
      • PoC for further actions of this type
    • investigations with jmaher re: turning off "useless" tests
    • opening up releng for contributions:
      • new public distribution list
      • moved tests over to travis
      • mentored bugs
    • improved reconfigs
    • set up CI for b2g bumper
  • what do we need to accomplish next quarter?
    • self-serve slave loan
    • turn off "useless" tests
      • have a way to do this easily and regularly
    • better ability to correlate tree state changes with releng code changes
    • better platform change pipeline
      • proper staging env
    • task cluster tackles most of the above, therefore migration of jobs to task cluster should enable these as a consequence
  • what tools do we need?
    • self-serve slave loan
    • terminate AWS instances from slave health (slaveapi)
    • ability to correlate releng changes with tree state changes
      • e.g. linux tests started timing out at Thursday at 8:00am: what changed in releng repos around that time?
      • armen's work on pinning mozharness tackles the mozharness part - migrating to task cluster puts build configs in-tree, so is also solved mostly with task cluster move


2014-11-27

  • https://bugzil.la/1105826
    • trees closed most of the day due to Armen's try jobs run amok
    • reporting couldn't handle the volume of retried jobs, affected buildapi and builds-4hr
      • disabled buildapi cronjobs until solution found
    • db sync between master->slave lost for 5 hours
    • re-ran buildapi cronjobs incrementally by hand in order to warm the cache for build-4hr
    • all buildapi cronjobs re-enabled
    • catlee picked up https://bugzil.la/733663 for the long-term fix
    • didn't get to deploy https://bugzil.la/961279 as planned :(


2014-11-26

  • https://bugzil.la/961279 - Mercurial upgrade - how to proceed?
    • yes, we should have time to deploy it Thursday/Friday this week


2014-11-25


2014-11-24


2014-11-21

  • work on 10.10
    • running in staging
  • restarted bm84
  • reconfig for bhearsum/rail for pre-release changes for Fx34
  • setup foopy56 after returning from diagnostics


2014-11-20 a.k.a "BLACK THURSDAY"



2014-11-19


2014-11-18

  • bm82 - BAD REQUEST exceptions
    • gracefully shutdown and restarted to clear
  • updated tools on foopys to pick up Callek's patch to monitor for old pywebsocket processes
  • sent foopy56 for diagnostics
  • https://bugzil.la/1082852 - slaverebooter hangs
    • had been hung since Nov 14
    • threads aren't terminating, need to figure out why
    • have I mentioned how much i hate multi-threading?
  • https://bugzil.la/1094293 - 10.10 support
    • patches waiting for review


2014-11-17


2014-11-14

  • ???


2014-11-13

  • ???


2014-11-12

  • ???


2014-11-11

  • ???


2014-11-10

  • release day
    • ftp is melting under load; trees closed
      • dev edition went unthrottled
        • catlee throttled background updates to 25%
      • dev edition not on CDN


2014-11-07

  • shared mozharness checkout
  • jlund hg landings
  • b2g_bumper travis tests working
  • buildbot-master52
    • hanging on every reconfig
    • builder limits, hitting PB limits
    • split masters: Try + Everything Else?
    • graceful not working -> nuke it from orbit
    • structured logging in mozharness has landed
  • coop to write docs:
    • moving slaves from production to staging
    • dealing with bad slaves


2014-11-06

  • b2g_bumper issues
  • https://bugzil.la/1094922 - Widespread hg.mozilla.org unresponsiveness
  • buildduty report queue
  • some jobs pending for more than 4 hours
    • aws tools needed to have the new cltbld password added to their json file, idle instances not being reaped
    • need some monitoring here


2014-11-05

  • sorry for the last few days, something important came up and i've barely been able to focus on buildduty
  • https://bugzil.la/foopy56
    • hitting load spikes
  • https://bugzil.la/990173 - Move b2g bumper to a dedicated host
    • bm66 hitting load spikes
    • what is best solution: beefier instance? multiple instances?
  • PT - best practices for buildduty?
    • keep "Current" column accurate


2014-11-04

  • t-snow-r4-0002 hit an hdiutil error and is now unreachable
  • t-w864-ix-026 destroying jobs, disabled
  • bug 1093600
    • bugzilla api updates were failing, fixed now
    • affected reconfigs (script could not update bugzilla)
  • bug 947462
    • tree outage when this landed
    • backed it out
    • probably it can be relanded, just needs a clobber


2014-11-03


2014-10-31

  • valgrind busted on Try
    • only build masters reconfig-ed last night by nthomas
      • reconfig-ed try masters this morning


2014-10-30

  • how best to handle broken manifests?
    • difference of opinion w/ catlee
    • catlee does see the human cost of not fixing this properly
  • mapper docs
  • b2g bumper: log rotation
  • https://bugzil.la/1091707
    • Frequent FTP/proxxy timeouts across all trees
      • network blip?
  • https://bugzil.la/1091696
    • swap on fwunit1.private.releng.scl3.mozilla.com is CRITICAL: SWAP CRITICAL - 100% free (0 MB out of 0 MB)
    • these are dustin's firewall unit tests: ping him when we get these alerts
  • reconfig


2014-10-29

  • b2g bumper
    • b2g manifests
      • no try for manifests
  • All new w864 boxes have wrong resolution
  • started thread about disabling try testing on mtnlion by default


2014-10-28

  • testing new hg 3.1.2 GPO
  • cleaned up loaner list from yesterday
    • closed 2 bugs that were unused
    • added 2 missing slavealloc notes
    • terminated 11 instances
    • removed many, many out-of-date names & hosts from ldapadmin
  • lots of bigger scope bugs getting filed under the buildduty category
    • most belong in general automation or tools IMO
    • I don't think buildduty bugs should have a scope bigger than what can be accomplished in a single day. thoughts?
  • reconfig to put new master (bm119) and new Windows test slaves into production
  • massive spike in pending jobs around 6pm ET
    • 2000->5000
    • closed trees
  • waded through the buildduty report a bit


2014-10-27

  • 19 *running* loan instances

dev-linux64-ec2-jlund2
dev-linux64-ec2-kmoir
dev-linux64-ec2-pmoore
dev-linux64-ec2-rchien
dev-linux64-ec2-sbruno
tst-linux32-ec2-evold
tst-linux64-ec2-evanxd
tst-linux64-ec2-gbrown
tst-linux64-ec2-gweng
tst-linux64-ec2-jesup
tst-linux64-ec2-jesup2
tst-linux64-ec2-jgilbert
tst-linux64-ec2-kchen
tst-linux64-ec2-kmoir
tst-linux64-ec2-mdas
tst-linux64-ec2-nchen
tst-linux64-ec2-rchien
tst-linux64-ec2-sbruno
tst-linux64-ec2-simone

27 open loan bugs:
http://mzl.la/1nJGtTw

We should reconcile. Should also cleanup entries in ldapadmin.


2014-10-24


2014-10-23

  • test slaves sometimes fail to start buildbot on reboot
  • re-imaging a bunch of w864 machines that were listed as only needing a re-image to be recovered:
    • t-w864-ix-0[04,33,51,76,77]
    • re-image didn't help any of these slaves
    • bug 1067062
  • investigated # of windows test masters required for arr
    • 500 windows test slaves, 4 existing windows test masters
  • OMG load
    • ~7000 pending builds at 4pm ET
    • KWierso killed off lots of try load: stuff that had already landed, stuff with followup patches
      • developer hygiene is terrible here


2014-10-22

  • many win8 machines "broken" in slave health
    • working theory is that 64-bit browser is causing them to hang somehow
    • https://bugzil.la/1080134
    • same for mtnlion
    • same for win7
    • we really need to find out why these slaves will simply fail to start buildbot and then sit waiting to be rebooted


2014-10-21

  • bug 1086564 Trees closed
    • alerted Fubar - he is working on it
  • bug 1084414 Windows loaner for ehsan
  • killing esr24 branch
  • https://bugzil.la/1066765 - disabling foopy64 for disk replacement
  • https://bugzil.la/1086620 - Migrate slave tools to bugzilla REST API
    • wrote patches and deployed to slavealloc, slave health
  • trimmed Maintenance page to Q4 only, moved older to 2014 page
  • filed https://bugzil.la/1087013 - Move slaves from staging to production
    • take some of slave logic out of configs, increase capacity in production
  • helped mconley in #build with a build config issue
  • https://bugzil.la/973274 - Install GStreamer 1.x on linux build and test slaves
    • this may have webrtc implications, will send mail to laura to check


2014-10-20

  • reconfig (jetpack fixes, alder l10n, holly e10s)
    • several liunx64 test masters hit the PB limit
      • put out a general call to disable branches, jobs
      • meanwhile, set masters to gracefully shutdown, and then restarted them. Took about 3 hours.
  • 64-bit Windows testing
    • clarity achieved!
      • testing 64-bit browser on 64-bit Windows 8, no 32-bit testing on Window 8 at all
      • this means we can divvy the incoming 100 machines between all three Windows test pools to improve capacity, rather than just beefing up the WIn8 platform and splitting it in 2


2014-10-17

  • blocklist changes for graphics (Sylvestre)
  • code for bug updating in reconfigs is done
    • review request coming today
  • new signing server is up
    • pete is testing, configuring masters to use it
  • some classes of slaves not reconnecting to masters after reboot
    • e.g. mtnlion
    • need to find a slave in this state and figure out why
      • puppet problem? runslave.py problem (connection to slavealloc)? runner issue (connection to hg)?
  • patch review for bug 1004617
  • clobbering m-i for rillian
  • helping Tomcat cherry-pick patches for m-r
  • reconfig for Alder + mac-signing


2014-10-16

  • Updated all windows builders with new ffxbld_rsa key
  • Patched reconfig code to publish to bugzilla - will test on next reconfig
  • Working on set up of mac v2 signing server
  • Fixed sftp.py script
  • Set up meeting the J Lal, H Wine, et al for vcs sync handover
  • lost my reconfig logs from yesterday, which I needed in order to validate against https://hg.mozilla.org/build/tools/file/a8eb2cdbe82e/buildfarm/maintenance/watch_twistd_log.py#l132 - will do so with the next reconfig
  • spoke to Amy about windows reimaging problem, and requested a single windows reimage to validate GPO setup
  • reconfig for alder and esr24 changes
  • rebooting mtnlion slaves that had been idle for 4 hours (9 of them)
    • this seems to be a common occurrence. If I can find a slave in this state today, I'll file a bug and dive in. Not sure why the machine is rebooting and not launching buildbot.


2014-10-15


2014-10-14

  • bug 1081825 b2gbumper outage / mirroring problem - backed out - new mirroring request in bug 1082466
    • symptoms: b2g_bumper lock file is stale
    • should mirror new repos automatically rather than fail
      • bare minimum: report which repo is affected
  • bug 962863 rolling out l10n gecko and l10n gaia vcs sync - still to do: wait for first run to complete, update wiki, enable cron
  • bug 1061188 rolled out, and had to backout due to puppet changes not hitting spot instances yet, and gpo changes not hitting all windows slaves yet - for spot instances, just need to wait, for GPO i have a needinfo on :markco
    • need method to generate new golden AMIs on demand, e.g. when puppet changes land
  • mac signing servers unhappy - probably not unrelated to higher load due to tree closure - have downtimed in #buildduty for now due to extra load
    • backlog of builds on Mac
      • related to slaverebooter hang?
      • many were hung for 5+ hours trying to run signtool.py on repacks
        • not sure whether this was related to (a cause or a symptom of) the signing server issues
        • could also be related to reconfigs + key changes (ffxbld_rsa)
      • rebooted idle&hung mac builders by hand
      • bug 1082770 - getting another mac v2 signing machine into service
  • sprints for this week:
    • [pete] bug updates from reconfigs
    • [coop] password updates?
  • slaverebooter was hung but not alerting, *however* I did catch the failure mode: indefinitely looping waiting for an available worker thread
    • added a 30min timeout waiting for a worker, running locally on bm74 (see the sketch after this list)
    • filed bug 1082852
  • put foopy64 back into service - bug 1066765
  • https://bugzil.la/1082818 - t-w864-ix loaner for Armen
  • https://bugzil.la/1082784 - tst-linux64-ec2 loaner for dburns
  • emptied buildduty bug queues
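
The timeout idea, as a sketch (not the actual slaverebooter code; uses Python 3 semantics for Semaphore.acquire(timeout=...), and the concurrency limit is assumed):

  # sketch only: bound the wait for a free worker slot so the main loop
  # can't spin forever when worker threads stop terminating
  import threading

  MAX_WORKERS = 8                # assumed concurrency limit
  WORKER_TIMEOUT = 30 * 60       # 30 minutes, as above
  slots = threading.BoundedSemaphore(MAX_WORKERS)

  def run_in_worker(reboot_fn, slave):
      if not slots.acquire(timeout=WORKER_TIMEOUT):
          raise RuntimeError("no free worker after %ds, giving up on %s"
                             % (WORKER_TIMEOUT, slave))

      def work():
          try:
              reboot_fn(slave)
          finally:
              slots.release()

      threading.Thread(target=work, name="reboot-%s" % slave).start()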


2014-10-13



2014-10-10:

  • kgrandon reported not getting updates for flame-kk
  • work


2014-10-09

  • db issue this morning all day
    • sheeri ran an errant command on the slave db that inadvertently propagated to the master db
    • trees closed for about 2 hours until jobs started
    • however, after the outage while the trees were still closed, we did a live fail over between the master and the slave without incident
    • later in the day, we tried to fail back over to the master db from the slave db, but we ended up with inconsistent data between the two databases. This resulted in a bunch of jobs not starting because they were in the wrong db.
    • fixed with a hot copy
    • filed https://bugzil.la/1080855 for RFO
  • https://bugzil.la/1079396 - loaner win8 machine for :jrmuziel
  • https://bugzil.la/1075287 - loaner instance for :rchien
    • after some debugging over the course of the day, determined he needed a build instance after all
  • filed https://bugzil.la/1080951 - Add fabric action to reset the timestamp used by buildbot-master exception log reporting


2014-10-08


2014-10-07

  • bug 1079256 - B2G device image nightlies (non-eng only) constantly failing/retrying due to failure to upload to update.boot2gecko.org
    • cleaned up, now 5% free
    • fix is to stop creating/publishing/uploading b2g mars for all branches *except* 1.3 <- bug 1000217
  • fallout from PHX outage?
  • cleared buildduty report module open dependencies
  • filed https://bugzil.la/1079468 - [tracking][10.10] Continuous integration testing on OS X 10.10 Yosemite


2014-10-06

  • bug 1078300#c3
    • hg push showing on tbpl and treeherder with no associated builders generated
  • sprints for this week:
    • slaverebooter
      • [coop] determine why it sometimes hangs on exit
    • [pete] end_to_end_reconfig.sh
      • add bug updates


2014-10-03

  • slaverebooter hung
    • added some extra instrumentation locally to try to find out why, then removed the lockfile and restarted
    • hasn't failed again today, will see whether it fails around the same time tonight (~11pm PT)
  • https://bugzil.la/1077432 - Skip more unittests on capacity-starved platforms
    • now skipping opt tests for mtnlion/win8/win7/xp
  • reconfig
  • https://bugzil.la/1065677 - started rolling restarts of all masters
    • done


2014-10-02

  • mozmill CI not receiving pulse messages
    • some coming through now
    • no logging around pulse notifications
  • mtnlion machines
    • lost half of pool last night, not sure why
    • dbs shutdown last night <- related?
    • reboot most, leave 2 for diagnosis
    • snow affected? yes! rebooted
    • also windows
      • rebooted XP and W8
      • can't ssh to w7
        • rebooted via IPMI (-mgmt via web)
  • pulse service - nthomas restarted pulse on masters
    • multiple instances running, not checking PID file
    • bug 1038006
  • bug 1074147 - increasing test load on Windows
    • coop wants to start using our skip-test functionality on windows (every 2) and mtnlion (every 3)
  • uploads timing out
  • https://bugzil.la/1069429 - Upload mozversion to internal pypi
  • https://bugzil.la/1076934 - Temporarily turn off OTA on FX OS Master branch
    • don't have a proper buildid to go on here, may have time to look up later
  • added report links to slave health: hgstats, smokepings


2014-10-01


2014-09-30

  • https://bugzil.la/1074827 - buildbot-configs_tests failing on Jenkins due to problem with pip install of master-pip.txt
    • non-frozen version of OpenSSL being used - rail fixing and trying again
  • https://bugzil.la/943932 - T-W864-IX-025 having blue jobs
    • root cause not known - maybe faulty disk - removed old mozharness checkout, now has a green job
  • https://bugzil.la/1072434 - balrog submitter doesn't set previous build number properly
    • this caused bustage with locale repacks - nick and massimo sorted it out
  • https://bugzil.la/1050808 - several desktop repack failures today - I proposed we apply patch in this bug
  • https://bugzil.la/1072872 - last machines rebooted
  • https://bugzil.la/1074655 - Requesting a loaner machine b2g_ubuntu64_vm to diagnose bug 1053703
  • going through buildduty report
    • filed new panda-recovery bug, added pandas to it
    • t-snow-r4-0075: reimaged, returned to production
    • talos-linux64-ix-027: reimaged, returned to production
    • emptied 3 sections (stopped listing the individual bugs)
  • reconfig
  • https://bugzil.la/1062465 - returned foopy64 and attached pandas to production
    • disk is truly failing on foopy64, undid all that work


2014-09-29

  • https://bugzil.la/1073653 - bash on OS X
    • dustin landed fix, watched for fallout
    • complications with signing code bhearsum landed
      • all Macs required nudging (manual puppet runs + reboot). Mercifully dustin and bhearsum took care of this.
  • https://bugzil.la/1072405 - Investigate why backfilled pandas haven't taken any jobs
    • checking failure logs for patterns
    • looks like mozpool is still trying to reboot using old relay info: needs re-sync from inventory?
    • tools checkout on foopies hadn't been updated, despite a reconfig on Saturday
    • enabled 610-612
    • each passed 2 tests in a row, re-enabled the rest
  • cleaned up resolved loans for bld-lion, snow, and mtnlion machines
  • https://bugzil.la/1074358 - Please loan OS X 10.8 Builder to dminor
  • https://bugzil.la/1074267 - Slave loan request for a talos-r4-snow machine
  • https://bugzil.la/1073417 - Requesting a loaner machine b2g_ubuntu64_vm to diagnose


2014-09-26

  • cleared dead pulse queue items after pulse publisher issues in the morning
  • https://bugzil.la/1072405 - Investigate why backfilled pandas haven't taken any jobs
    • updated devices.json
    • created panda dirs on foopies
    • had to re-image panda-0619 by hand
    • all still failing, need to investigate on Monday
  • https://bugzil.la/1073040 - loaner for mdas


Besides the handover notes for last week, which I received from pete, there are the following issues:

bug 1038063 Running out of space on dev-stage01:/builds
The root cause of the alerts was the addition of folder /builds/data/ftp/pub/firefox/releases/31.0b9 by Massimo in order to run some tests.
Nick did some further cleanup, the bug has been reopened this morning by pete, proposing to automate some of the steps nick did manually.

https://bugzilla.mozilla.org/show_bug.cgi?id=1036176 Some spot instances in us-east-1 are failing to connect to hg.mozilla.org
Some troubleshooting has been done by Nick, and case 222113071 has been opened with AWS

2014-07-07 to 2014-07-11

Hi Simone,

Open issues at end of week:

foopy117 is playing up (bug 1037441)
this is also affecting end_to_end_reconfig.sh (solution: comment out manage_foopies.py lines from this file and run manually)
Foopy 117 seems to be back and working normally

Major problems with pending queues (bug 1034055) - this should hopefully be fixed relatively soon. most notably linux64 in-house ix machines. Not a lot you can do about this - just be aware of it if people ask.
Hopefully this is solved after Kim's recent work

Two changes currently in queue for next reconfig: bug 1019962 (armenzg) and bug 1025322 (jford)

Some changes to update_maintenance_wiki.sh from aki will land when the review passes (bug 1036573) - these could impact the wiki update in end_to_end_reconfig.sh, as it has been refactored - be aware of this.

Currently no outstanding loaner requests at time of handover, but there are some that need to be checked or returned to the pool.

See the 18 open loan requests: https://bugzilla.mozilla.org/buglist.cgi?bug_id=989521%2C1036768%2C1035313%2C1035193%2C1036254%2C1006178%2C1035270%2C876013%2C818198%2C981095%2C977190%2C880893%2C1017303%2C1023856%2C1017046%2C1019135%2C974634%2C1015418&list_id=10700493


I've pinged all the people in this list (except for requests less than 5 days old) to ask for status.

Pete

2014-05-26 to 2014-05-30

  • Monday
  • Tuesday
    • reconfig
      • catlee's patch had bustage, and armenzg's had unintended consequences
      • they each reconfiged again for their own problems
    • buildduty report:
      • tackled bugs without dependencies
      • tackled bugs with all dependencies resolved
  • Wednesday
    • new nightly for Tarako
      • was actually a b2g code issue: Bug 1016157 - updated the version of vold
    • resurrecting tegras to deal with load
  • Thursday
    • AWS slave loan for ianconnoly
    • puppet patch for talos-linux64-ix-001 reclaim
    • resurrecting tegras
  • Friday
    • tegras continue to fall behind; pinged Pete very late Thursday with symptoms. Filed https://bugzil.la/1018118
    • reconfig
      • chiefly to deploy https://bugzil.la/1017599 <- reduce # of tests on tegras
      • fallout:
        • non-unified mozharness builds are failing in post_upload.py <- causing queue issues on masters
        • panda tests are retrying more than before
          • hitting "Caught Exception: Remote Device Error: unable to connect to panda-0402 after 5 attempts", but it *should* be non-fatal, i.e. the test runs fine afterwards but still gets flagged for retry
          • filed: https://bugzil.la/1018531
        • reported by sheriffs (RyanVM)

2014-05-05 to 2014-05-09

2014-04-21 to 2014-04-25

follow up:

buildduty report:
Bug 999930 - put tegras that were on loan back onto a foopy and into production

action items:
  • Bug 1001518 - bld-centos6-hp-* slaves are running out of disk space
    • this pool had 4 machines run out of disk space all within the last week
    • I scrubbed a ton of space (bandaid) but the core issue will need to be addressed

load:
(high load keep an eye on) Bug 999558 - high pending for ubuntu64-vm try test jobs on Apr 22 morning PT
(keep an eye on) bug 997702


  • (jlund) reboot all these stuck XP slaves - https://bugzil.la/977341 - XP machines out of action
    • https://secure.pub.build.mozilla.org/builddata/reports/slave_health/slavetype.html?class=test&type=t-xp32-ix - there are only 4 machines today that are "broken", and they only hung today
    • pmoore: there is only 1 now

  • (jlund) iterate through old disabled slaves in these platform lists - https://bugzil.la/984915 - Improve slave health for disabled slaves <- THIS WAS NOT DONE. I ASKED PMOORE TO HELP
    • pmoore: I'm not entirely sure which platform lists this means, as the bug doesn't contain a list of platforms. So I am looking at all light blue numbers on https://secure.pub.build.mozilla.org/builddata/reports/slave_health/index.html (i.e. the disabled totals per platform). When jlund is online later I'll clarify with him.
    • jlund: thanks pete. Sorry I meant all platform lists I suppose starting with whatever platform held our worst wait times. I have started going through the disabled hosts looking for 'forgotten ones'

2014-04-10 to 2014-04-11 (Thursday and Friday)

bug 995060
Nasty nasty tree closure lasting several hours
b-c taking loooooong time and log files too large for buildbot to handle
Timeout for MOCHITEST_BC_3 increased from 4200s to 12000s
When Joel's patch (bug 984930) has landed, we should "undo" the changes from bug 995060 and put the timeout back down to 4200s (it was just a temporary workaround). Align with edmorley on this.

bug 975006
bug 938872
after much investigation, it turns out a monitor is attached to this slave - can you raise a bug to dc ops to get it removed?


Loaners returned:
https://bugzil.la/978054
https://bugzil.la/990722
https://bugzil.la/977711 (discovered this older one)

Loaners created:
bug 994283


bug 994321#c7
Still problems with 7 slaves that can't be rebooted:
https://secure.pub.build.mozilla.org/builddata/reports/slave_health/slave.html?class=test&type=talos-mtnlion-r5&name=talos-mtnlion-r5-005
https://secure.pub.build.mozilla.org/builddata/reports/slave_health/slave.html?class=test&type=talos-mtnlion-r5&name=talos-mtnlion-r5-006
https://secure.pub.build.mozilla.org/builddata/reports/slave_health/slave.html?class=test&type=talos-mtnlion-r5&name=talos-mtnlion-r5-061
https://secure.pub.build.mozilla.org/builddata/reports/slave_health/slave.html?class=test&type=talos-mtnlion-r5&name=talos-mtnlion-r5-065
https://secure.pub.build.mozilla.org/builddata/reports/slave_health/slave.html?class=test&type=talos-mtnlion-r5&name=talos-mtnlion-r5-074
https://secure.pub.build.mozilla.org/builddata/reports/slave_health/slave.html?class=test&type=talos-mtnlion-r5&name=talos-mtnlion-r5-086
https://secure.pub.build.mozilla.org/builddata/reports/slave_health/slave.html?class=test&type=talos-mtnlion-r5&name=talos-mtnlion-r5-089
Slave API has open bugs on all of these.


https://bugzil.la/977341
Stuck Win XP slaves - only one was stuck (t-xp32-ix-073). Rebooted.


Thanks Callek!

Week of 2014-04-05 to 2014-04-09 (thurs-wed)

Hiya pete

  • van has been working hard at troubleshooting winxp 085: bug 975006#c21
    • this needs to be put back into production along with 002 and reported back to van on findings
    • note this is a known failing machine. please try to catch it fail before sheriffs.
  • we should reconfig either thurs or by fri at latest
  • latest aws sanity check runthrough yielded better results than before: very few long-running lazy instances, very few unattended loans. This should be checked again on Friday
  • there was a try push that broke a series of mtnlion machines this afternoon. Callek, nthomas, and Van worked hard at helping me diagnose and solve the issue.
    • there are some slaves that failed to reboot via slaveapi. This is worth following up on especially since we barely have any 10.8 machines to begin with:
    • bug 994321#c7
  • on tues we started having github/vcs-sync issues where sheriffs noticed that bumper bot wasn't keeping up with csets on github.
    • looks like things have been worked on and possibly fixed, but just a heads up
    • bug 993632
  • as per bug 991259#c1, I checked on these and the non-green ones should be followed up on

tegra-063.tegra.releng.scl3.mozilla.com is alive <- up and green
tegra-050.tegra.releng.scl3.mozilla.com is alive <- up and green
tegra-028.tegra.releng.scl3.mozilla.com is alive <- up but not running jobs
tegra-141.tegra.releng.scl3.mozilla.com is alive <- up and green
tegra-117.tegra.releng.scl3.mozilla.com is alive <- up but not running jobs
tegra-187.tegra.releng.scl3.mozilla.com is alive <- up and green
tegra-087.tegra.releng.scl3.mozilla.com is alive <- up and green
tegra-299.tegra.releng.scl3.mozilla.com is alive <- up and green
tegra-309.tegra.releng.scl3.mozilla.com is alive <- up and green
tegra-335.tegra.releng.scl3.mozilla.com is alive <- up and green

As per jhopkins last week, and after reformatting: again, the non-green ones should be followed up on.
tegra-108 - bug 838425 - SD card reformat was successful <- cant write to sd
tegra-091 - bug 778886 - SD card reformat was successful <- sdcard issues again
tegra-073 - bug 771560 - SD card reformat was successful <- lockfile issues
tegra-210 - bug 890337 - SD card reformat was successful <- green in prod
tegra-129 - bug 838438 - SD card reformat was successful <- fail to connect to telnet
tegra-041 - bug 778813 - SD card reformat was successful <- sdcard issues again
tegra-035 - bug 772189 - SD card reformat was successful <- sdcard issues again
tegra-228 - bug 740440 - SD card reformat was successful <- fail to connect to telnet
tegra-133 - bug 778923 - SD card reformat was successful <- green in prod
tegra-223 - bug 740438 - SD card reformat was successful <- Unable to properly cleanup foopy processes
tegra-080 - bug 740426 - SD card reformat was successful <- green in prod
tegra-032 - bug 778899 - SD card reformat was successful <- sdcard issues again
tegra-047 - bug 778909 - SD card reformat was successful <- have not got past here
tegra-038 - bug 873677 - SD card reformat was successful
tegra-264 - bug 778841 - SD card reformat was successful
tegra-092 - bug 750835 - SD card reformat was successful
tegra-293 - bug 819669 - SD card reformat was successful

Week of 2014-03-31 to 2014-04-04

Wednesday:

Someone will need to follow up on how these tegras did since I reformatted their SD cards:

tegra-108 - bug 838425 - SD card reformat was successful
tegra-091 - bug 778886 - SD card reformat was successful
tegra-073 - bug 771560 - SD card reformat was successful
tegra-210 - bug 890337 - SD card reformat was successful
tegra-129 - bug 838438 - SD card reformat was successful
tegra-041 - bug 778813 - SD card reformat was successful
tegra-035 - bug 772189 - SD card reformat was successful
tegra-228 - bug 740440 - SD card reformat was successful
tegra-133 - bug 778923 - SD card reformat was successful
tegra-223 - bug 740438 - SD card reformat was successful
tegra-080 - bug 740426 - SD card reformat was successful
tegra-032 - bug 778899 - SD card reformat was successful
tegra-047 - bug 778909 - SD card reformat was successful
tegra-038 - bug 873677 - SD card reformat was successful
tegra-264 - bug 778841 - SD card reformat was successful
tegra-092 - bug 750835 - SD card reformat was successful
tegra-293 - bug 819669 - SD card reformat was successful


Week of 2014-03-17 to 2014-03-21
buildduty: armenzg

Monday

  • bugmail and deal with broken slaves
  • mergeday

Tuesday

  • reviewed aws sanity check
  • cleaned up and assigned some buildduty bugs
  • reconfig

TODO:

  • bug 984944
  • swipe through problem tracking bugs

Wednesday

Thursday

Friday

Week of 2014-01-20 to 2014-01-24
buildduty: armenzg

Monday

  • deal with space warnings
  • loan to dminor
  • terminated returned loan machines

Tuesday

  • loan win64 builder
  • Callek helped with the tegras

TODO

  • add more EC2 machines


Week of 2014-01-20 to 2014-01-24
buildduty: jhopkins

Bugs filed:

  • Bug 962269 (dupe) - DownloadFile step does not retry status 503 (server too busy)
  • Bug 962698 - Expose aws sanity report data via web interface in json format
  • Bug 963267 - aws_watch_pending.py should avoid region/instance combinations that lack capacity


Monday

  • Nightly updates are disabled (bug 908134 comment 51)
  • loan bug 961765


Tuesday


Wednesday

  • added AWS instance Tag "moz-used-by" to the nat-gateway instance to help with processing the long-running instances report
  • would be nice if we could get the aws sanity report data to be produced by slave api so it could be pulled by a web page and correlated with recent job history, for example
  • Bug 934938 - jakem switched to round-robin DNS (see bug 934938#c1519 for technical details) to avoid "thundering herd" problem.


Thursday

  • AWS lacking capacity and slowing down instance startup. Filed bug 963267.


Friday

  • missed some loan requests b/c I thought they were being included in the buildduty report (2 previous ones seemed to be). Can we add loans to the buildduty report?
  • some automated slave recovery not happening due to Bug 963171 - please allow buildbot-master65 to talk to production slaveapi



Week of 2014-01-13 to 2014-01-17
buildduty: bhearsum

Bugs filed (not a complete list):

  • Bug 960535 - Increase bouncerlatestchecks Nagios script timeout


Week of 2014-01-06 to 2014-01-10
buildduty: armenzg

Bugs filed:

Monday

  • loan machines
  • deal with some broken slaves

Tuesday

  • loan machines
  • deal with some broken slaves
  • reconfig
  • second reconfig for backout

Wednesday

  • enable VNC for a Mac loaner
  • check signing issues filed by Tomcat
  • mozharness merge
  • help RyanVM with some timeout

Thursday

  • do reconfig with jlund

Friday

  • restart redis
  • loan 2 machines
  • process problem tracking bugs

Week of 2013-12-16 to 2013-12-20
buildduty: jhopkins

Bugs filed:

  • Bug 950746 - Log aws_watch_pending.py operations to a machine-parseable log or database
  • Bug 950780 - Start AWS instances in parallel
  • Bug 950789 - MozpoolException should be retried
  • Bug 952129 - download_props step can hang indefinitely

  • Bug 952517 - Run l10n repacks on a smaller EC2 instance type

Monday

  • several talos-r3-fed* machines have a date of 2001
  • adding hover-events to our slave health pages would be helpful to get quick access to recent job history
  • other interesting possibilities:
    • a page showing last 50-100 jobs for all slaves in a class
    • ability to filter on a certain builder to spot patterns/anomalies. eg. "robocop tests always fail on this slave but not the other slaves"

Wednesday

  • taking over 15 minutes for some changes to show as 'pending' in tbpl; the delay is in the scheduler master seeing the change (per twistd.log)
  • Bug 951558 - buildapi-web2 RabbitMQ queue is high

Friday

  • Bug 952448 - Integration Trees closed, high number of pending linux compile jobs
    • AWS instance 'start' requests returning "Error starting instances - insufficient capacity"
  • dustin has fixed Bug 951558 - buildapi-web2 RabbitMQ queue is high

Week of 2013-11-04 to 2013-11-08
buildduty: jhopkins

Tuesday

  • rev2 migrations going fairly smoothly (biggest issue is some IPMI interfaces being down and requiring a power cycle by DCOPs)

Wednesday

  • needs attention (per RyanVM): Bug 935246 - Graphserver doesn't know how to handle the talos results from non-PGO builds on the B2G release branches
  • IT's monitoring rollout happening today
  • request to build "B2G device image nightlies" - it's non-obvious what the builders are or which masters they live on. No howto I could find. How do we automate this and keep it

Friday

  • RyanVM reports that pushing mozilla-beta to Try is a fairly normal thing to do but fails on the win64-rev2 build slaves. He has been helping with backporting the fixes in User:Jhopkins/win64rev2Uplift to get this addressed.
  • catlee's buildbot checkconfig improvement went into production but we need a restart on all the masters to get the full benefit. no urgency, however.

Week of 2013-10-14 to 2013-10-18
buildduty: armenzg

Monday

  • uploaded mozprocess
  • landed a puppet change that made all Linux64 hosts install libvirt-bin and made them fall out of sync with puppet
    • I had to back out and land a patch to uninstall the package
    • we don't know why it got installed
  • redis issues
    • build-4hr issues
    • signing issues

Tuesday

  • returned some slaves to the pool
  • investigated some cronmail
  • uploaded talos.zip
  • reclaimed machines and requested reimages

Wednesday

  • put machines back into production
  • loan
  • process delays email


Week of 2013-10-14 to 2013-10-18

buildduty: coop (callek on monday)

Monday

Tuesday

  • meetings! At least one of them was about buildduty.

Wednesday

  • shutdown long-running AWS instance for hverschore: bug 910368
  • investigating disabled mtnlion slaves
    • many needed the next step taken: filing the IT bug for recovery

Thursday

  • filed Bug 927951 - Request for smoketesting Windows builds from the cedar branch

  • filed Bug 927941 - Disable IRC alerts for issues with individual slaves
  • reconfig for new Windows in-house build master

Friday

Week of 2013-09-23 to 2013-09-27

buildduty: armenzg

Monday

  • help marcia debug some b2g nightly questions
  • meetings and meetings and distractions
  • started patches to transfer 11 win64 hosts to become try ones

Tuesday

  • ran a reconfig
  • did a backout for buildbotcustom and ran another reconfig
  • started work on moving win64 hosts from the build pool to the try pool
  • analyzed a bug filed by a sheriff wrt clobberer
    • no need for buildduty to fix (moved to Platform Support)
    • asked for people's input on the proper fix
  • reviewed some patches for jmaher and kmoir
  • assist edmorley with clobberer issue
  • assist edmorley with git.m.o issue
  • assist RyanVM with git.m.o issue
  • put w64-ix-slave64 in the production pool
  • updated buildduty wiki page
  • updated wiki page to move machines from one pool to another

Wednesday

  • messy

Thursday

  • messy

Monday

  • messy

Week of 2013-09-16 to 2013-09-20

buildduty: Callek

Monday:

  • (pmoore) batch 1 of watcher update [Bug 914302]
  • MERGE DAY
    • We missed having a point person for merge day again, rectified (thanks armen/rail/aki)
  • 3 reconfigs or so:
    • Merge day
    • Attempted to fix talos-mozharness (broken by panda-mozharness landing)
    • Backed out talos-mozharness change for continued bustage
    • Also backed out emulator-ics for in-tree (cross-tree) bustage relating to the name change.
      • Thanks to aki for helping while I had to slip out for a few minutes
  • Loaner bug poking/assigning
  • Did one high priority loaner needed for tree-closure which blocked MERGE DAY

Tuesday:

  • (pmoore) batch 2 of watcher update [Bug 914302]
  • Buildduty Meeting
  • Bug queue churn-through
  • Hgweb OOM: Bug 917668

Wednesday:

  • (pmoore) batch 3 of watcher update [Bug 914302]
  • Reconfig
  • Hgweb OOM continues (IT downtimed it, bkero is on PTO today, no easy answer)
    • Very Low visible tree impact at present
  • Bug queue churn-through
  • Discovered last-job-per-slave view of slave_health is out of date.
  • Discovered reboot-history is either out of date or reboots not running for tegras

Thursday

  • (pmoore) batch 4 [final] of watcher update [Bug 914302]
  • Hgweb OOM continues
  • Bug queue churn... focus on tegras today
  • Coop fixed last-job-per-slave generation, and slaves_needing_reboots
  • Downtime (and problems) for scl1 nameserver and scl3 zeus nodes
    • Caused a tree closure, due to buildapi01 being in scl1 and the long delay


Week of 2013-09-09 to 2013-09-13

buildduty: coop

Monday:

  • meetings
  • kittenherder hanging on bld-centos - why?
    • multiple processes running, killed off (not sure if that's root cause)

Tuesday:

  • buildduty meeting
    • filed bug 913606 to stop running cronjobs to populate mobile-dashboard
  • wrote reboot_tegras.py quick-n-dirty script to kill buildbot processes and reboot tegras listed as hung (rough shape sketched below)
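
Rough shape of that quick-n-dirty script, rendered here as shell; the tegra-to-foopy mapping, user, hostnames and process matching below are assumptions for illustration, not the contents of the real reboot_tegras.py:

#!/bin/bash
# For each tegra flagged as hung, kill its buildbot/clientproxy processes on the
# owning foopy; the actual power-cycle (PDU / kittenherder) step is omitted here.
HUNG_TEGRAS="tegra-028 tegra-117"                               # taken from the slave health "hung" list
declare -A FOOPY=( [tegra-028]=foopy109 [tegra-117]=foopy110 )  # hypothetical tegra->foopy mapping
for tegra in $HUNG_TEGRAS; do
    ssh cltbld@${FOOPY[$tegra]}.build.mozilla.org "pkill -f $tegra"   # user and domain assumed
done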

Wednesday:

  • re-enabled kittenherder rebooting of tegras
  • wrote bugzilla shared queries for releng-buildduty, releng-buildduty-tagged, and releng-buildduty-triage
  • playing with bztools/bzrest to try to get a query that considers dependent bugs (an example of the kind of query is sketched after this list)
  • meetings
  • deploying registry change for bug 897768 (fuzzing dumps)
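
The goal is a buildduty query that also pulls each bug's depends_on list, so that bugs whose blockers are all resolved can be surfaced. bztools/bzrest wrap Bugzilla's API; the line below only illustrates the shape of such a request against Bugzilla's native REST endpoint (the bug ids are placeholders taken from elsewhere in these notes):

$ curl "https://bugzilla.mozilla.org/rest/bug?id=999930,1001518&include_fields=id,status,depends_on"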

Thursday:

  • broke up kittenherder rebooting of tegras into 4 batches to improve turnaround time
  • got basic buildduty query working with bztools
  • respun Android nightly
  • resurrecting as many Mac testers as possible to deal with load
  • filed bug 915766 to audit pdu2.r102-1.build.scl1
  • resurrected a bunch of talos-r3-fed machines that were not running buildbot

Friday:

  • AWS US-East-1 outage
  • reconfig for nexus4 changes
    • re-reconfig to back out kmoir's changes that closed the tree: bug 829211
  • Mac tester capacity

Week of 2013-09-02 to 2013-09-06
buildduty: bhearsum

Monday

  • US/Canada holiday


Tuesday

  • Bug 912225 - Intermittent B2G emulator image "command timed out: 14400 seconds elapsed, attempting to kill" or "command timed out: 3600 seconds without output, attempting to kill" during the upload step
    • Worked around by lowering the priority of use1. The acute issue is fixed; we suspect there are still symptoms from time to time.
    • Windows disconnects may be the early warning sign.


Week of 2013-08-26 to 2013-08-30
buildduty: jhopkins

Monday

  • many talos-r3-w7 slaves have a broken session which prevents a new SSH login session (you can authenticate but it kicks you out right away). Needed to RDP in, open a terminal as Administrator, and delete the files in c:\program files\kts\log\ip-ban\* and active-sessions\*.
  • many other talos-r3-w7 slaves had just ip-ban\* files (no active-sessions\* files) which prevented kittenherder from managing the slave, since there are no IPMI or PDU mechanisms to manage these build slaves.
  • trying slaveapi
    • IPMI reboot failing (no netflow)


$ curl -d waittime=60 http://cruncher.srv.releng.scl3.mozilla.com:8000/slave/talos-r3-xp-076/action/reboot
{
  "requestid": 46889168,
  "state": 0,
  "text": ""
}
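
"state": 0 here appears to mean the reboot request was accepted and is still pending, rather than that it failed; the returned requestid can presumably be used to check on it later. A hedged example of polling, assuming slaveapi exposes the request under the same action URL (endpoint form not verified):

$ curl http://cruncher.srv.releng.scl3.mozilla.com:8000/slave/talos-r3-xp-076/action/reboot/46889168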

  • slavealloc-managed devices are live (per Callek)
  • ec2 slave loan to eflores (909186)


Tuesday


Wednesday


Thursday

  • filed Bug 910818 - Please investigate cause of network disconnects 2013-08-29 10:22-10:24 Pacific
    • need to automate gathering of details for filing this type of bug
  • Bug 910662 - B2G Leo device image builds broken with "error: patch failed: include/hardware/hwcomposer.h:169" during application of B2G patches to android source


Friday



Week of 2013-08-19 to 2013-08-23
buildduty: armenzg

Monday

  • 40+ Windows builders had not been rebooted for several days
    • bug 906660
    • rebooted a bunch with csshX
    • I gave a couple to jhopkins to look into
      • cruncher was banned from being able to ssh
      • ipmi said that it was successful
      • more investigation happening
  • edmorley requested that I look into fixing the TMPDIR removal issue
    • bug 880003
    • ted to land a fix to disable the test that causes it
    • filed a bug for IT to clean up the TMPDIR through puppet
    • cleaned up tmpdir manually on 2 hosts and put them back in production to check
  • do a reconfig for mihneadb
  • rebooted talos-r4-lion-041 upon philor's request due to hdiutil
  • promote unagi
  • upload talos.zip for mobile


Tuesday

  • I see a bunch of these:
    • nagios-releng: Tue 06:04:58 PDT [4193] buildbot-master92.srv.releng.use1.mozilla.com:ntp time is CRITICAL: CHECK_NRPE: Socket timeout after 15 seconds. (http://m.allizom.org/ntp+time)
    • bug 907158
    • We disabled all use1 masters
    • We reduced use1's priority for host generation
    • Fixed Queue dirs for one of the masters by restarting the init service
      • had to kill the previous pid
  • rebooted remaining win64 machines
  • rebooted remaining bld-lion-r5 machines
  • buildapi issues
    • Callek took care of it and move it to Tools
  • Queue dir issues
    • it seems that the Amazon issues caused this
    • I have been stopping and starting the pulse_publisher (see the sketch after this list)
    • /etc/init.d/pulse_publisher {stop|start}
    • the backlog in /dev/shm/queue/pulse/new will then start decreasing
  • Re-enabled all use1 (aws-us-east-1) masters
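
A minimal sketch of that restart-and-drain procedure (the watch interval is arbitrary, and killing the stale pid by pattern is an assumption about how the old process had to be cleaned up):

$ /etc/init.d/pulse_publisher stop || pkill -f pulse_publisher   # the stale pid sometimes has to be killed by hand
$ /etc/init.d/pulse_publisher start
$ watch -n 30 'ls /dev/shm/queue/pulse/new | wc -l'              # the backlog should start decreasing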


Wednesday

  • one of the b2g repos has a 404 bundle and intermittent ISE 500
    • bug 907693
    • bhearsum has moved it to IT
    • fubar is looking into it
  • done a reconfig
    • graceful restart for buildbot-master69
      • I had to do a less graceful restart
    • graceful restart for buildbot-master70
    • graceful restart for buildbot-master71
  • loan t-xp32-ix006 in https://bugzil.la/904219
  • promoted b2g build
  • deploy graphs change for jmaher
  • vcs major alert
    • hwine to look into it


Thursday

  • disable production-opsi
  • reconfig
  • trying to cleanup nagios


Friday

  • a bunch of win64 machines are not taking jobs
    • deployed the fix to all machines
    • filed a bug for IT to add it to the task sequence
  • some win64 imaging bugs filed
  • reconfig for hwine
  • Callek deployed his slavealloc change
  • merged mozharness
  • investigated a bunch of hosts that were down
  • we might be having some DNS tree-closing issues



Week of 2013-08-12 to 2013-08-16

buildduty: coop

Monday:


Tuesday:


Wednesday:

  • fixed buildfaster query to cover helix builds
  • set aside talos-r4-lion-001 for dustin: bug 902903#c16
  • promoted unagi build for dogfood: 20130812041203
  • closed bug 891880
  • investigated talos failures affecting (primarily) lion slaves: bug 739089


Thursday:


Friday:

  • investigation into bug 905350
    • basedir wrong in slavealloc
  • reconfigs for catlee, aki



Issues/Questions:

  • many/most of the tree closure reasons on treestatus don't have bug#'s. should we encourage sheriffs to enter bug#'s so others can follow along more easily?
  • uncertainty around the difference between mozpool-managed and non-mozpool-managed pandas. How do I take one offline - do they both use disabled.flg? A: yes, all pandas use disabled.flg on the foopy
    • when is it ok to use Lifeguard to force the state to "disabled"? per dustin: [it's] for testing, and working around issues like pandas that are still managed by old releng stuff. it's to save us loading up mysql and writing UPDATE queries
      • what's "old releng stuff"?
  • Windows test slaves
  • another example of PDU reboot not working correctly: bug 737408#c5 and bug 885969#c5. We need to automate power off, pause, power on to increase reliability (a rough sketch of such a sequence is below).
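
A rough sketch of what that automated sequence could look like, assuming an SNMP-controllable PDU; the outlet OID, community string and on/off values are placeholders, not the real ones for our PDUs, and the hostname simply reuses a PDU mentioned elsewhere in these notes as an example:

$ PDU=pdu2.r102-1.build.scl1.mozilla.com
$ OUTLET=.1.3.6.1.4.1.1718.3.2.3.1.11.1.1.5     # placeholder outlet-control OID
$ snmpset -v1 -c private $PDU $OUTLET i 2       # 2 = off (assumed)
$ sleep 15                                      # pause so the slave fully loses power
$ snmpset -v1 -c private $PDU $OUTLET i 1       # 1 = on (assumed)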

Bustages:

  • bug 900273 landed and backed out