ReleaseEngineering/Buildduty/StandupMeetingNotesQ12015
2015-03-31
https://wiki.mozilla.org/ReleaseEngineering/Buildduty/Reconfigs
http://coop.deadsquid.com/2015/03/the-changing-face-of-buildduty/
2015-03-30
buildduty report cleanup <- lots
2015-03-27
https://bugzil.la/1138234 - b2g_bumper stalling frequently
is this still happening or can we resolve this? RESOLVED
2015-03-26
https://bugzil.la/1147853 - Widespread "InternalError: Starting video failed" failures across all trees on AWS-based test instances
possibly related to runslave changes ( https://bugzil.la/1143018 ) and interaction with runner
Q1 is almost done
what do we need to document/update prior to buildduty hand-off next week?
testing out latest mh prod rev on cedar in a canary fashion :)
be better if releng stays under the radar for at least the rest of the day
jlund|buildduty> nagios-releng: downtime vcssync2.srv.releng.usw2.mozilla.com 1h "bug 1135266"
disabling two foopies for recovery
https://bugzil.la/1146130
2015-03-25
no-op reconfig for fubar - https://bugzil.la/1147314
full reconfig for bhearsum
https://bugzil.la/1143018 - Update runslave.py with current machine types and basedirs
updated basedirs in slavealloc db
deployed puppet change to runslave.py
ryanvm wants a number of bugs looked at that result in windows machines ending up at the start screen
namely https://bugzil.la/1135545
see also
https://bugzil.la/924728
https://bugzil.la/1090633
should we ping markco/Q?
ni: Q
commented https://bugzilla.mozilla.org/show_bug.cgi?id=1135545#c89
did another reconfig
landed ec2 windows slave health fix
needs second patch
backed out https://bugzil.la/1146379 hgtool should avoid pulling if it already has a revision
context comment 3-5
landed https://bugzil.la/1146855 across all trees
2015-03-24
https://bugzil.la/1146855 - added dlls to tooltool to make minidump_stackwalk work
broken reconfig this morning?
https://hg.mozilla.org/build/buildbot-configs/rev/22f9faca403c - missing comma
aws instances stuck in long-running jobs or something? high pending
"high" is relative: there were *maybe* 200 jobs pending on AWS pools
[2:32pm] Callek: coop|buildduty: RyanVM|sheriffduty jlund|buildduty: fyi -- https://github.com/mozilla/build-cloud-tools/pull/54 mgerva just updated our limits for aws linux64 testers by about 50%, nick cautioned to keep an eye on master health with this increase.
https://bugzil.la/1145387 - t-yosemite-r5-0073 can't connect to a master (specifically bm107)
notes in bug, gracefully restarting bm107 to see if that helps
slave connected to bm108 and is fine now
2015-03-23
coop working on https://bugzil.la/978928 - Reconfigs should be automatic, and scheduled via a cron job
2015-03-20
chemspill in progress, ***NO UNNECESSARY CHANGES***
coop going through "All dependencies resolved" section of buildduty report
doing all non-pandas first
will do a second, panda-only pass after
2015-03-19
https://etherpad.mozilla.org/bz1144762
chemspill coming from pwn2own
2015-03-18
https://bugzil.la/1144762 - more hg timeouts, new bug filed for today's fun
happening again this morning, possibly different cause
from fox2mike: https://fox2mike.pastebin.mozilla.org/8826256
from cyliang: https://graphite-scl3.mozilla.org/dashboard/#http-zlbs
notice table-top on outbound
tackling from another end by re-visiting this bug: https://bugzilla.mozilla.org/show_bug.cgi?id=1113460#c16
https://etherpad.mozilla.org/bz1144762
killed a ton of RETRY l10n and fuzzer jobs that were hitting hg.m.o and 503'ing
2015-03-17
tst-linux64-spot pending >4500 as of 6:30am PT (!)
inadvertent puppet change: http://hg.mozilla.org/build/puppet/rev/dfe40f33e6d0#l2.1
mrrrgn deploying fix and rescuing instances
https://bugzil.la/1143681 - Some AWS test slaves not being recycled as expected
found a way to track these down
in the AWS console, search for the instances in the Spot Requests tab. You can click on the instance ID to get more info.
e.g. for tst-linux64-spot-233, the instance has no name associated and is marked as "shutting-down"
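for reference, the same lookup can be scripted instead of clicking through the console; a minimal sketch, assuming boto3 and configured AWS credentials (the in-house cloud-tools used a different client, so this is only illustrative):

# find spot instances stuck mid-teardown, i.e. state "shutting-down"
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")
resp = ec2.describe_instances(
    Filters=[{"Name": "instance-state-name", "Values": ["shutting-down"]}]
)
for reservation in resp["Reservations"]:
    for inst in reservation["Instances"]:
        # instances like tst-linux64-spot-233 show up here with no Name tag
        name = next((t["Value"] for t in inst.get("Tags", [])
                     if t["Key"] == "Name"), "<no name>")
        print(inst["InstanceId"], name, inst["State"]["Name"])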
https://bugzil.la/1144362 - massive spike in hg load
possibly due to new/rescued instances from the morning (re)cloning mh & tools
negative feedback loop?
2015-03-16
filed https://bugzil.la/1143681 - Some AWS test slaves not being recycled as expected
how can we stop tests from being part of --all on try: https://bugzilla.mozilla.org/show_bug.cgi?id=1143259#c4
2015-03-13
buildbot DB "too many connections" again. (perhaps DBA's are able to increase the conn pool limits?)
need a button in slave health that automatically files a diagnostics bug for a given slave
should disable the slave if not already disabled
should do the bug linking automatically
should have a small text entry box for the description of the diagnostics bug, i.e. why are we asking for diagnostics
would hopefully prevent sheriffs from just taking slaves offline and waiting for us to perform the next step(s)
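a rough sketch of the bug-filing half of that button, assuming the native Bugzilla REST API (POST /rest/bug) via requests; the product/component values below are placeholders for illustration, and the real slave tools go through slaveapi/bzrest instead:

import requests

def file_diagnostics_bug(slave, description, api_key):
    payload = {
        "product": "Infrastructure & Operations",  # placeholder values for illustration
        "component": "DCOps",
        "version": "other",
        "summary": "%s needs diagnostics" % slave,
        "description": description,  # the "why are we asking for diagnostics" text from the form
    }
    resp = requests.post("https://bugzilla.mozilla.org/rest/bug",
                         json=payload, params={"api_key": api_key})
    resp.raise_for_status()
    return resp.json()["id"]  # new bug id, to be linked from slave health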
filed https://bugzil.la/1143018 - Update runslave.py with current machine types and basedirs
we essentially guess at the builddir in most cases these days(!)
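illustration of what that guessing looks like (not the actual runslave.py table; the prefixes and paths here are made up for the example):

# hostname-prefix -> basedir guesses, used only when slavealloc has no basedir recorded
BASEDIR_GUESSES = {
    "tst-linux64-spot": "/builds/slave",
    "t-w864-ix": r"C:\slave",
    "bld-lion-r5": "/builds/slave",
}

def guess_basedir(hostname):
    for prefix, basedir in BASEDIR_GUESSES.items():
        if hostname.startswith(prefix):
            return basedir
    return "/builds/slave"  # last-resort default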
https://bugzil.la/1142825 - high windows test pending
rebooting idle win7 machines, re-imaged 2 others
found some try commits from wednesday mar 11 with some duplicate jobs:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=768730b3ae1c
https://treeherder.mozilla.org/#/jobs?repo=try&revision=fb39553b0473
both from mchang@m.o
some jobs look like they're still running in treeherder, buildapi says no
checking Windows test masters for slaves that have been running jobs for a while (probably hung)
investigated tree closure bugs that resulted from reconfig and mh :(
https://bugzil.la/1143227
https://bugzilla.mozilla.org/show_bug.cgi?id=1142553#c11
merge mh to prod and bump mh for separate failure: http://hg.mozilla.org/build/mozharness/rev/18a18416de6a
https://bugzil.la/1143259 - tests run by default that are failing more than 80 percent of the time
2015-03-12
filed https://bugzil.la/1142493 - panda-relay-037 is down
Win7 test pending >2000 (unclear on why)
tree closure
caused by RyanVM
2015-03-11
filed https://bugzil.la/1142103 - Scheduling issues with Win64 xulrunner nightlies on try
getting daily dead command queue items from this
blog post: http://coop.deadsquid.com/2015/03/better-releng-patch-contribution-workflow/
https://bugzil.la/1088032 - Test slaves sometimes fail to start buildbot after a reboot
coop investigating
2015-03-10
https://bugzil.la/1141396 - Mulet Nightlies all failing with FATAL
https://bugzil.la/1141416 - Fix the slaves broken by talos's inability to deploy an update
longstanding issue, we should really fix this
a few BGP flaps reported in #buildduty this morning
https://bugzil.la/1139764 - terminated tst-ubuntu14-ec2-shu
sent mail to group re: AWS sanity checker long-running instances
https://bugzil.la/1141454 - Buildbot DB max connections overnight
priority backlog triage:
https://bugzil.la/1139763 - add windows to jacuzzi
patch r-, need follow up
https://bugzil.la/1060214 - Intermittent command timed out: 10800
https://bugzil.la/1123025 - b2g emulator nightlies (sometimes?) use a test package from a previous nightly
https://bugzil.la/1055912 Clobberer on try is apparently not working.
https://bugzil.la/1141416 - Fix the slaves broken by talos's inability to deploy an update
2015-03-09
https://bugzil.la/1140989 - zlb8.ops.phx1.mozilla.com:Load is CRITICAL
https://bugzilla.mozilla.org/show_bug.cgi?id=1126825#c11
reconfig
https://bugzil.la/1140539 - slow query log report for buildbot2
have some info now to start reducing load on db
filed https://bugzil.la/1141217 - nagios alerts for unassigned blocker bugs in all releng bugzilla components
hope to avoid situation from last Friday where RyanVM's blocker bug sat for hours
https://bugzil.la/1139764 - Slave loan request for a tst-linux32-spot instance
created an Ubuntu 14.04 instance for Shu to test a kernel theory on
2015-03-06
https://bugzil.la/1140304 - All Gij jobs are permanently red for v2.2 branch
uploaded http://pypi.pub.build.mozilla.org/pub/mozprofile-0.21.tar.gz
filed https://bugzil.la/1140398 - Send nagios load alerts for upload1.dmz.scl3.mozilla.com to #buildduty IRC channel
closed out tree closure bugs from this week after making sure bugs for follow-up issues were on file
filed https://bugzil.la/1140419 - [tracking] Switch all releng RoR to github
will send mail/write blogpost today
working on https://bugzil.la/1140479 - Improvements to end_to_end_reconfig.sh script
2015-03-05
never got to adding win64 m-a nightlies to jacuzzi https://bugzil.la/1139763
need to enable slaves from https://bugzil.la/1138672
the end_to_end script comments in bugs that mozharness changes are live in production. that's no longer the case for our build + test jobs (i.e. most things aside from vcs-sync, bumper, etc.).
should we still be automatically updating bugs for mh after reconfig?
we need a way to roll out changes to mh on a regular cadence. right now it's up to the individual to update mozharness.json with a REV they want applied and consequently, whatever mh patches are in between are also applied...
coop to drop mozharness from end-to-end-reconfig script and email public list
added http://pypi.pvt.build.mozilla.org/pub/mozrunner-6.6.tar.gz
talked to catlee re: releng-try pipeline
fully supportive
one wrinkle: how to tackle release tagging
coop will get bugs filed today
add 4-repo view to slave health?
2015-03-04
https://bugzil.la/1138937 - Slave loan request for a t-w864-ix machine
reconfig in progress
buildduty report:
re-imaging a bunch of slaves to help with capacity
https://bugzil.la/1138672 - vlan request - move bld-lion-r5-[006-015] machines from prod build pool to try build pool (needs to be enabled)
test master upgrades (done)
https://bugzil.la/1136527 - upgrade ec2 linux64 test masters from m3.medium to m3.large (again)
https://bugzil.la/1135664 - Some masters don't have swap enabled
(hwine) meeting with Linda (head of #moc)
make more specific requests from #moc
share top issues with #moc
when: next meeting is 13th
come up with prioritized list of releng needs by early next week
coop to file bugs re: releng-try improvements
add builderlists/dumpmasters diff to travis
switch RoR for key repos to github
reverse VCS sync flow
enable travis testing for forks - this is done on a per-fork basis by the owners of the forks. PR's will get travis jobs regardless.
no-op reconfig on schedulers
https://bugzil.la/1123911 - fw1.releng.scl3.mozilla.net routing failures - BGP use1
upgrade test linux masters (https://bugzil.la/1136527):
bm51 (complete)
bm53 (complete)
bm117-tests1-linux64 (complete)
bm52-tests1-linux64 (complete)
bm54-tests1-linux64 (complete)
use1
bm67-tests1-linux64 (complete)
bm113-tests1-linux64 (complete)
bm114-tests1-linux64 (complete)
bm120-tests1-linux64 (complete)
bm121-tests1-linux64 (complete)
usw2
bm68-tests1-linux64 (complete)
bm115-tests1-linux64 (complete)
bm116-tests1-linux64 (complete)
bm118-tests1-linux64 (complete)
bm122-tests1-linux64 (complete)
bm123-tests1-linux64 (started)
add swap (https://bugzil.la/1135664):
bm53 (complete)
buildbot-master54 (complete)
use1
buildbot-master117 BAD
buildbot-master120 BAD (complete)
buildbot-master121 BAD (complete)
usw2
buildbot-master68 (complete)
buildbot-master115 (complete)
buildbot-master116 BAD (complete)
buildbot-master118 BAD (complete)
buildbot-master122 BAD (complete)
buildbot-master123 BAD
buildbot-master04 BAD
buildbot-master05 BAD
buildbot-master06 BAD
buildbot-master66 BAD
buildbot-master72 BAD
buildbot-master73 BAD
buildbot-master74 BAD
buildbot-master78 BAD
buildbot-master79 BAD
buildbot-master91 BAD
2015-03-03
https://bugzil.la/1138955 - Slow Builds and lagginess
tree closure due to backlog (10:00am ET)
*mostly* unexpected load (extra poorly-timed pushes to try), although a bunch of test instances not recycling properly
coop is investigating these
https://bugzil.la/1041763 - upgrade ec2 linux64 test masters from m3.medium to m3.large
jlund starting to iterate through list today
https://bugzil.la/1139029 - Turn off OSX Gip (Gaia UI tests) on all branches
https://bugzil.la/1139023 - Turn off Fx desktop OSX 10.8 tests on the B2G release branches
coop landing patches from RyanVM to reduce b2g test load on 10.8
2015-03-02
https://bugzil.la/1138155 - set up replacement masters for Fallen
https://bugzil.la/1137047 - Rebalance the Mac build slaves between buildpool and trybuildpool
2015-02-27
queue issues on build masters due to graphene jobs
should be resolved by reconfig this morning
re-imaging some 10.8 machines as 10.10
10.10 will be running opt jobs on inbound, 10.8 debug on inbound + opt on release branches
sheriffs are understandably worried about capacity issues in both pools
re-re-imaging talos-linux32-ix-0[01,26]
may have an underlying issue with the re-imaging process for linux hw
2015-02-26
things to discuss:
https://bugzil.la/1137047 - Rebalance the Mac build slaves between buildpool and trybuildpool
filed: https://bugzil.la/1137322 - osx test slaves are failing to download a test zip from similar rev
2015-02-25
things to circle back on today:
https://bugzil.la/1136195 - Frequent download timeouts across all trees
https://bugzil.la/1130242 - request for throughput data on the SCL3 ZLBs for the past 12 hours
https://bugzil.la/1136465 - New: Spot instances failing with remoteFailed: [Failure instance: Traceback (failure with no frames): <class 'twisted.spread.pb.PBConnectionLost'>:
[Bug 1041763] upgrade ec2 linux64 test masters from m3.medium to m3.large
https://bugzil.la/1136531 - Slave loan request for a tst-linux64-spot vm
2015-02-24
https://bugzil.la/1136195 - Frequent download timeouts across all trees
related to release traffic?
release reconfigs don't log themselves
should probably reconfig everything not just build/scheduler masters
i think this takes care of itself once masters start updating themselves based on tag updates
tree closure
symptom: https://bugzilla.mozilla.org/show_bug.cgi?id=1136465#c0
diagnosis: https://bugzilla.mozilla.org/show_bug.cgi?id=1136465#c1
an aid to make recovery faster:
https://bugzil.la/1136527
note this accidentally happened as fallout from the ghost work:
https://bugzilla.mozilla.org/show_bug.cgi?id=1126428#c66
10:35:13 <hwine> ah, I see coop already asked Usul about 0900PT
10:36:34 <hwine> ashlee: sounds like our theory of load isn't right - can someone check further, please? https://bugzil.la/1136195#c1
10:38:02 <•pir> hwine: check... what?
10:38:56 <hwine> ftp.m.o is timing out and has closed trees. Our guess was release day load, but that appears not to be it
10:39:33 <•pir> hwine: I can't see any timeouts in that link, I may be missing something.
10:39:48 <jlund> ashlee: hwine catlee-lunch we have https://bugzilla.mozilla.org/show_bug.cgi?id=1130242#c4 to avail of now too it seems. might provide some insight to health or even a possible cause as to why we are hitting timeouts since the the change time lines up within the time of reported timeouts.
10:40:35 <jlund> pir: that's the bug tracking timeouts. there are timeouts across many of our continuous integration jobs: http://ftp.mozilla.org/pub/mozilla.org/firefox/tinderbox-builds/mozilla-inbound-linux/1424790646/mozilla-inbound_ubuntu32_vm_test-jittest-1-bm01-tests1-linux32-build102.txt.gz
10:40:39 mbrandt → mbrandt|lunch
10:40:49 <•pir> hwine: we don't have a lot of visibility into how ftp.m.o works or is not working. This isn't a good situation, but sadly how it is.
10:41:28 <hwine> pir: right, my understanding is that you (moc) coordinates all the deeper dives for IT infrastructure (which ftp.m.o still is)
10:42:15 <•pir> hwine: To clarify, I don't think anyone has a lot of visibility into how ftp.m.o is working :(
10:42:19 <•pir> it's a mess
10:42:34 <•pir> ashlee: want to loop in C ?
10:43:00 <•pir> (and I think mixing continuous build traffic and release traffic is insane, personally)
10:43:26 <•pir> jlund: yes, that's what I was reading and not seeing anythig
10:43:43 <hwine> pir: system should handle it fine (has in the past) release traffic s/b minimal since we use CDNs
10:44:12 <•pir> hwine: should be. isn't.
10:44:18 <•ashlee> pir sure
10:47:53 <•pir> the load on the ftp servers is... minimal
10:48:27 <•fox2mike> jlund: may I ask where these timeouts are happening from?
10:49:15 <jlund> hm, so load may not be the issue. begs the question "what's changed"
10:49:17 <•pir> and what the timeouts actually are. I can't see anything timing out in the listed logs
10:49:37 <•pir> jlund: for ftp.m.o? nothing that I'm aware of
10:50:06 <cyliang> no bandwith alerts from zeus. looking at the load balancers to see if anything pops out.
10:50:09 <•ashish> from http://ftp.mozilla.org/pub/mozilla.org/firefox/tinderbox-builds/mozilla-inbound-linux/1424790646/mozilla-inbound_ubuntu32_vm_test-jittest-1-bm01-tests1-linux32-build102.txt.gz
10:50:13 <•ashish> i see
10:50:14 <•ashish> 08:00:28 WARNING - Timed out accessing http://ftp.mozilla.org.proxxy1.srv.releng.use1.mozilla.com/pub/mozilla.org/firefox/tinderbox-builds/mozilla-inbound-linux/1424790646/firefox-39.0a1.en-US.linux-i686.tests.zip: timed out
10:50:18 <•ashish> what is that server?
10:50:31 <•fox2mike> USE1
10:50:34 <•fox2mike> FUCK YEAH! :p
10:50:35 → agibson joined (agibson@moz-j04gi9.cable.virginm.net)
10:50:36 <•fox2mike> the cloud baby
10:50:55 <•fox2mike> jlund: I bet if you were to try this from other amazon regions, you might not his this
10:50:57 <•ashish> i don't see timeouts for http://ftp.mozilla.org/*
10:50:59 <cyliang> fox2mike: Is this the same timeout stuff as last time?
10:51:03 <•ashish> (in that log)
10:51:03 <•fox2mike> I'm guessing
10:51:06 <•fox2mike> cyliang: ^
10:51:17 <•fox2mike> because the last time we saw random issues
10:51:21 <•fox2mike> it was all us-east1
10:51:39 <•fox2mike> jlund: for reference - https://bugzilla.mozilla.org/show_bug.cgi?id=1130386
10:52:11 <•fox2mike> our infra is the same, we can all save time by trying to see if you guys hit this from any other amazon region (if that's possible)
10:53:08 <jlund> proxxy is a host from aws but after failing to try that a few times, we poke ftp directly and timeout after 30 min:
10:53:12 <jlund> https://www.irccloud.com/pastebin/WmSehqzj
10:53:13 <wesley> jlund's shortened url is http://tinyurl.com/q47zbvl
10:53:15 <•pir> yay cloud
10:54:36 <•pir> jlund: that download from ftp-ssl works fine from anywhere I have access to test it
10:55:06 <•fox2mike> jlund: where did that fail from?
10:55:40 <jlund> sure, and it doesn't always timeout, but of our thousands of jobs, a bunch have failed and timed out.
10:55:54 <unixfairy> jlund can you be more specific
10:56:10 <jlund> fox2mike: same log example: http://ftp.mozilla.org/pub/mozilla.org/firefox/tinderbox-builds/mozilla-inbound-linux/1424790646/mozilla-inbound_ubuntu32_vm_test-jittest-1-bm01-tests1-linux32-build102.txt.gz
10:56:56 <jlund> sorry, I don't know exact failure rate numbers. RyanVM|sheriffduty may know more.
10:57:03 <•fox2mike> jlund: so
10:57:03 <•fox2mike> builder: mozilla-inbound_ubuntu32_vm_test-jittest-1
10:57:04 <•fox2mike> slave: tst-linux32-spot-105
10:57:10 <•fox2mike> that's from amazon again
10:57:18 <•fox2mike> tst-linux32-spot-105
10:57:23 <•fox2mike> that's a spot instance
10:57:34 <•pir> yep, master: http://buildbot-master01.bb.releng.use1.mozilla.com:8201/
10:57:40 <•fox2mike> us-east1
10:57:45 <•pir> so far the connection I see is use1 as fox2mike says
10:58:14 <•fox2mike> we've been through this before :)
10:58:17 <•fox2mike> is all I'm saying
10:59:02 <jlund> sure. let's make sure we can narrow it down to that. I'll see if I can track down more jobs that have hit the timeout where slaves are not in aws.
10:59:08 <jlund> thanks for your help so far.
10:59:49 <•fox2mike> jlund: aws is fine, anything that's a non use1 failure
10:59:55 <•fox2mike> before we go to non aws failure
11:00:06 <•fox2mike> but your case will narrow it down further
11:00:07 <•fox2mike> thanks!
11:00:11 <jlund> rgr
11:00:15 <RyanVM|sheriffduty> fox2mike: things have been quiet for a little while now
11:00:26 <RyanVM|sheriffduty> but we had a lull awhile ago too before another spike
11:00:36 <RyanVM|sheriffduty> so I'm not feeling overly inclined to say that things are resolved
11:00:55 jp-food → jp
11:01:00 <jlund> RyanVM|sheriffduty: have any mac or windows jobs hit this timeout?
11:01:08 <RyanVM|sheriffduty> yes
11:01:13 <RyanVM|sheriffduty> windows definitely
11:01:26 <jlund> k, fox2mike ^ we don't have any windows machines in the cloud
11:01:48 <RyanVM|sheriffduty> random example - https://treeherder.mozilla.org/logviewer.html#?job_id=6928327&repo=mozilla-inbound
11:01:54 <•ashish> are there logs from thoes machines?
11:01:55 <•ashish> ty
11:02:00 → KaiRo joined (robert@moz-dqe9u3.highway.telekom.at)
11:02:17 <RyanVM|sheriffduty> OSX - https://treeherder.mozilla.org/logviewer.html#?job_id=6924712&repo=mozilla-inbound
11:02:51 jlund → jlund|mtg
11:02:57 ⇐ agibson quit (agibson@moz-j04gi9.cable.virginm.net)
11:04:28 jlund|mtg → jlund
11:04:36 <KaiRo> who is the right contact for getting HTTP requests to a Mozilla-owned domain set up to redirect to a different website (another Mozilla-owned domain)?
11:04:50 <KaiRo> the case in question is bug 998793
11:05:27 → agibson joined (agibson@moz-j04gi9.cable.virginm.net)
11:06:48 <•ashish> KaiRo: looks like that IP is hosted/maintained by the community
11:07:05 <•pir> KaiRo: 173.5.47.78.in-addr.arpa domain name pointer static.173.5.47.78.clients.your-server.de.
11:07:09 <•pir> KaiRo: not ours
11:07:39 <jlund> so, it sounds like we have confirmed that this outside aws. for completeness, I'll see if I can find this happening on usw-2 instances too.
11:08:01 agibson → agibson|brb
11:08:23 <KaiRo> ashish: yes, the IP is right now not Mozilla-hosted (atopal, who does host it and actually is an employee nowadays, will be working on getting it moved to Mozilla in the next months) but the domains are both Mozilla-owned
11:09:02 <•pir> KaiRo: the server isn't, though, and you do redirects on server
11:09:54 <KaiRo> pir: well, what we want in that bug is to have mozilla.at point to the same IP as mozilla.de (or CNAME to it or whatever)
11:10:33 <•pir> KaiRo: ah, that's not the same question
11:11:17 <KaiRo> and the stuff hosted by atopal that I was referring to is actually the .de one - I have no idea what the .at one even points to
11:11:45 <•ashish> KaiRo: ok, file a bug with webops. they'll have to change nameservers, setup dns and then put up redirects as needed
11:12:06 <•pir> that
11:13:27 <KaiRo> ashish: OK, thanks!
11:13:56 <•pir> KaiRo: www.mozilla.de or www.mozilla.com/de/ ?
11:14:13 <•pir> KaiRo: the former is community, the latter is mozilla corp
11:17:04 agibson|brb → agibson
11:17:54 <KaiRo> pir: the former, we want both .at and .de point to the same community site
11:18:57 <•pir> KaiRo: then you need someone in corp to do the dns change and someone who runs the de community site to make sure their end is set up
11:20:08 <KaiRo> pir: sure
11:21:00 <KaiRo> pir: I was mostly concerned about who to contact for the crop piece, I know the community people, we just met this last weekend
11:21:35 <•pir> KaiRo: file a child bug into infra & ops :: moc: service requests
11:21:46 <•pir> KaiRo: if we can't do it directly then we can find someone who can
11:22:10 <KaiRo> pir: thanks, good to know
11:22:18 ⇐ agibson quit (agibson@moz-j04gi9.cable.virginm.net)
11:22:31 <•pir> KaiRo: I'd suggest asking for a CNAME from mozilla.at to mozilla.de so if the de site's IP changes it doesn't break
11:23:03 jlund → jlund|mtg
11:23:51 <KaiRo> pir: yes, that's what I would prefer as well, esp. given the plans to move that communitxy website from atopal's server to Mozilla Community IT
11:25:12 <•ashish> KaiRo: will mozilla.at always remain a direct? (in the near future, at least)
11:25:39 <KaiRo> ashish: in the near future for sure, yes
11:25:45 <•ashish> KaiRo: if so, we can have our static cluster handle the redirect
11:25:59 <•ashish> that migh save some resources for the community
11:26:18 <•pir> if it's ending up on the same server, how does that save resources?
11:27:03 <•ashish> if it's all the same server then yeah, not a huge benefit
11:27:20 Fallen|away → Fallen, hwine → hwine|mtg, catlee-lunch → catlee
11:37:07 <KaiRo> ashish, pir: thanks for your help, I filed bug 1136318 as a result, I hope that moves this forward :)
11:38:31 <•pir> np
11:38:46 <•ashish> KaiRo: yw
11:40:23 coop|lunch → coop|mtg
2015-02-20
reimaging a bunch of linux talos machines that have sat idle for 6 months
talos-linux32-ix-001
talos-linux64-ix-[003,004,008,092]
https://bugzil.la/1095300
working on slaveapi code for "is this slave currently running a job?"
pending is up over 5000 again
mostly try
Callek: What caused this, just large amounts of pushing? What OS's were pending? etc.
2015-02-19
another massive gps push to try, another poorly-terminated json prop
https://bugzil.la/1134767
rows excised from db by jlund
jobs canceled by jlund/nthomas/gps
master exception logs cleaned up with:
python manage_masters.py -f production-masters.json -R scheduler -R try -j16 update_exception_timestamp
saw more exceptions related to get_unallocated_slaves today
filed https://bugzil.la/1134958
symptom of pending jobs?
2015-02-18
filed https://bugzil.la/1134316 for tst-linux64-spot-341
been thinking about builder mappings since last night
simplest way may be to augment current allthethings.json output
need display names for slavepools
need list of regexps matched to language for each slavepool
this can be verified internally very easily: can run regexp against all builders in slavepool
external apps can pull down allthethings.json daily(?) and process file to strip out only what they need, e.g. slavepool -> builder regexp mapping
would be good to publish hash of allthethings.json so consumers can easily tell when it has updated
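sketch of the consumer side of that proposal (hash check plus slavepool -> builders grouping); the URL and the exact JSON keys are assumptions here, not a spec:

import hashlib
import json
from collections import defaultdict

import requests

URL = "https://secure.pub.build.mozilla.org/builddata/reports/allthethings.json"

raw = requests.get(URL).content
# publishing this hash alongside the file would let consumers tell when it changed
print("sha1:", hashlib.sha1(raw).hexdigest())

allthethings = json.loads(raw)
builders_by_pool = defaultdict(list)
for builder_name, info in allthethings.get("builders", {}).items():
    builders_by_pool[info.get("slavepool", "unknown")].append(builder_name)
# an external app would strip this down further, e.g. to a slavepool -> builder-regexp mapping refreshed daily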
2015-02-17
b2g_bumper process hung for a few hours
killed off python processes on bm66 per https://wiki.mozilla.org/ReleaseEngineering/Applications/Bumper#Troubleshooting
3 masters (bm71, bm77, bm94) hitting exceptions related to jacuzzis:
https://hg.mozilla.org/build/buildbotcustom/annotate/a89f8a5ccd59/misc.py#l352
unsure how serious this is
taking the quiet moment to hammer out buildduty report and buildduty dashboard
yesterday callek (while all the presidents frowned at him) added more linux masters: 120-124. They seemed to be trucking along fine.
reconfig happened (for releases?) at 10:31 PT and that caused a push to b2g-37 to get lost
related: https://bugzil.la/1086961
very high pending job count (again)
enabled 4 new masters yesterday: bm[120-123], added 400 new AWS test slaves later in the day, but pending still shot past 3000
graph of AWS capacity: http://cl.ly/image/2r1b0C1q0g3p
nthomas has ipython tools that indicated many AWS builders were being missed in watch_pending.cfg
Callek wrote a patch: https://github.com/mozilla/build-cloud-tools/commit/e2aba3500482f7b293455cf64bedfb1225bb3d7e
seems to have helped, now around 2000 pending (21:06 PT)
philor found a spot instance that hadn't taken work since dec 23:
https://secure.pub.build.mozilla.org/builddata/reports/slave_health/slave.html?class=test&type=tst-linux64-spot&name=tst-linux64-spot-341
no status from aws-manager, can't be started
needs more investigation tomorrow, may indicate a bigger problem if we aren't recycling nodes as we expect
2015-02-13
https://bugzil.la/1132792 - new tree closure
current state: https://bugzilla.mozilla.org/show_bug.cgi?id=1132792#c11
reverted db change from yesterday, buildbot apparently needs a beefy physical machine
going through buildduty report
2015-02-12
buildbot db failover by sheeri (planned)
https://bugzil.la/1131637
https://bugzil.la/1132469 - tree closure
lots of idle slaves connected to masters despite high pending counts
have rebooted some masters so far:
bm70, bm71, bm72, bm73, bm74, bm91, bm94
coop looking into windows builders
found 2 builders that hadn't run *any* jobs ever (since late sept at least)
2015-02-11
reconfig is needed. last one was on thurs. we were blocked from doing the planned reconfig on the 10th
will kick off a reconfig at 10am ET
bm118 ended up with 2 reconfig procs running
disabled in slavealloc, initiated clean shutdown. Will restart when jobs drain.
went through aws_sanity_checker backlog
lots of unnamed hosts up for multiple days
I'm assuming this is mostly for Windows AWS work based on the platform of the image, but we should really push people to tag instances more rigorously, or expect them to get killed randomly
recovering "broken" slaves in slave health list
Currently from jacuzzi report, 28 pending windows builds (for non-try) that are not in a jacuzzi
18 of them are disabled for varying reasons, should cull that list to see if any of them can/should be turned on.
2015-02-10
ghost patching fallout
signing servers rejecting new masters
masters enabled (bm69) but not running buildbot
dns issues
context:
https://bugzilla.mozilla.org/show_bug.cgi?id=1126428#c55
https://gist.github.com/djmitche/0c2c968fa1f6a5b5e0ca#file-masters-md
a patch ended up in manage_masters.py that blocked amy from continuing rollout and doing a reconfig
http://hg.mozilla.org/build/tools/rev/e7e3c7bf6efa
so much time spent on debugging :( callek ended up finding the rogue accidental patch
2015-02-09
STAT for jlund
2015-02-05
tree closures
[Bug 1130024] New: Extremely high Linux64 test backlog
chalking that one up to ~20% more pushes than we had previously
[Bug 1130207] Several tests failing with "command timed out: 1800 seconds without output running" while downloading from ftp-ssl.mozilla.org
again, likely load related but nothing too obvious. worked with netops, suspect we were hitting load balancer issues (ZLB) since hg and ftp share balancers and hg was under heavy load today
dcurado will follow up
and his follow up: bug 1130242
two reconfigs
dev-stage01 was running low on disk space
loan for sfink
2015-02-04
treeherder master db node is getting rebooted for ghost patching
I asked mpressman to do it tomorrow and confirm with #treeherder folks first as there were not many people around who were familiar with the system
puppet win 2008 slaves are ready for the big leagues (prod)!
I will be coordinating with markco the testing on that front
did a reconfig. lots landed.
investigated the 13 win builders that got upgraded RAM. 4 of them have been disabled for various issues
dustin ghost patched bm103 and signing5/6
2015-02-03
fallout from: Bug 1127482 - Make Windows B2G Desktop builds periodic
caused a ~dozen dead command items every 6 hours
patched: https://bugzilla.mozilla.org/show_bug.cgi?id=1127482#c15
moved current dead items to my own special dir in case I need to poke them again
more dead items will come every 6 hours till above patch lands
arr/dustin ghost slave work
pods 4 and 5 of the pandas were completed today
1 foopy failed to clone tools (/build/sut_tools/) on re-image
it was a timeout and puppet wasn't smart enough to re-clone it without a removal first
try linux ec2 instances completed
maybe after ami / cloud-tools fallout we should have nagios alerts for when aws spins up instances and kills them right away
pro-tip when looking at ec2 graphs:
zoom in or out to a time you care about and click on individual colour headings in legend below graph
last night I did not click on individual moz-types under the running graph and since there are so few bld-linux builders that run normally anyway, it was hard to notice any change
2015-02-02
https://bugzil.la/1088032 - Test slaves sometimes fail to start buildbot after a reboot
I thought the problem had solved itself, but philor has been rebooting windows slaves every day which is why we haven't run out of windows slaves yet
may require some attention next week
panda pod round 2 and 3 started today
turns out disabling pandas in slavealloc can kill their current job
Calling fabric's stop command (disable.flg) on foopies kills their current job
This was a misunderstanding of last week's plan, but it is how we did pods 1->3 and will be the continued plan for the next sets
we'll inform sheriffs at start and end of each pod's work
reconfig is failing as masters won't update local repos
vcs error: 500 ISE
fallout from vcs issues. gps/hwine kicked a webhead and all is merry again
added a new report link to slave health for runner's dashboard
late night Tree Closures
Bug 1128780
test pending skyrocketed, builds not running, builder graphs broken
tests were just linux test capacity (with ~1600 pending in <3 hours)
graphs relating to running.html were just a fallout from dead-code removal
the builds-not-running issue brought together mrrrgn, dustin and catlee; they determined it was fallout from dustin's centos 6.5 AMI work causing earlier AMIs to get shut off automatically on us
generic.scl3 got rebooted, causing mozpool to die out and restart, leaving many panda jobs dead
B2G nightlies busted, unknown cause
Bug 1128826
2015-01-30
loan for markco: https://bugzil.la/1127411
GHOST
dustin is upgrading 7 foopies + 1 image host to make more of our infra haunted with ghosts
https://bugzil.la/1126428
started reconfig 11:00 PT
fewer pandas decomm-ed than anticipated, will have final numbers today
https://bugzil.la/1109862 - re-assigned to relops for dll deployment
buildapi + new buildbot passwd: do we know what went wrong here?
catlee suspects he updated the wrong config
positive feedback from philor on Callek's jacuzzi changes
2015-01-29
https://bugzil.la/1126879 Slaveapi not filing unreachable/problem-tracking bugs
Theorize we might fix by https://github.com/bhearsum/bzrest/pull/1 or at least get better error reporting.
Did some intermittent bug triage by using jordan's tool for giggles
https://bugzilla.mozilla.org/page.cgi?id=user_activity.html&action=run&who=bugspam.Callek%40gmail.com&from=2015-01-28&to=2015-01-29&group=bug
GHOST
cyliang wants to patch and restart rabbitmq
https://bugzil.la/1127433
nameservers restarted this morning
no detectable fallout, modulo ntp syncing alerts for 30min
:dustin will be upgrading the foopies to CentOS 6.5
this will be done per VLAN, and will mean a small, rolling decrease in capacity
mothballing linux build hardware actually helped us here!
package-tests target is becoming increasingly unreliable. may have a parallelism bug in it
https://bugzil.la/1122746 - package-tests step occasionally fails on at least win64 with 'find: Filesystem loop detected'
coop is doing a panda audit
cleaning up devices.json for recently decomm-ed pandas
figuring out how much capacity we've lost since we disabled those racks back in the fall
will determine when we need to start backfill
https://bugzilla.mozilla.org/show_bug.cgi?id=1127699 (Tree Closure at ~10:30pm PT)
2015-01-28
https://bugzil.la/1109862 - ran some tests with new dll installed
working on slave loan for https://bugzil.la/1126547
jacuzzi changes this morning: Android split apk and win64 debug
planning to decomm a bunch more pandas over the next few days
may need to start a backfill process soon (we have lots waiting). may be able to hold out until Q2
hacked a script to scrape tbpl bot comments on intermittent bugs and apply metrics
https://hg.mozilla.org/build/braindump/file/8d723bd901f2/buildduty/diagnose_intermittent_bug.py
BeautifulSoup is not required, but BeautifulSoup4 is! (said here rather than editing the doc like I should ~ Callek)
applied here:
https://bugzilla.mozilla.org/show_bug.cgi?id=1060214#c51
https://bugzilla.mozilla.org/show_bug.cgi?id=1114541#c345
2015-01-27
https://bugzil.la/1126181 - slave health jacuzzi patch review for Callek
https://bugzil.la/1109862 - Distribute update dbghelp.dll to all Windows XP talos machines for more usable profiler pseudostacks
pinged in bug by Rail
some slave health display consistency fixes
https://bugzil.la/1126370
2015-01-26
audited windows pool for RAM: https://bugzilla.mozilla.org/show_bug.cgi?id=1122975#c6
tl;dr 13 slaves have 4gb RAM and they have been disabled and dep'd on 1125887
dcops bug: https://bugzil.la/1125887
'over the weekend': small hiccup with bgp router swap bug: killed all of scl3 for ~10min not on purpose.
tl;dr - everything came back magically and I only had to clear up ~20 command queue jobs
which component do nagios bugs go in these days? seems like mozilla.org::Infrastructure & Operations bounced https://bugzil.la/1125218 back to releng::other. do we (releng) play with nagios now?
"MOC: Service Requests" - refreshed assurance per chat with MOC manager.(linda)
terminated loan with slaveapi: https://bugzilla.mozilla.org/show_bug.cgi?id=1121319#c4
attempted reconfig but hit conflict in merge: https://bugzilla.mozilla.org/show_bug.cgi?id=1110286#c13
catlee is changing buildapi r/o sql pw now (11:35 PT)
I restarted buildapi
updated wiki to show how we can restart buildapi without bugging webops
https://wiki.mozilla.org/ReleaseEngineering/How_To/Restart_BuildAPI
ACTION: should we delete https://wiki.mozilla.org/ReleaseEngineering/How_To/Update_BuildAPI since it is basically a less verbose copy of https://wiki.mozilla.org/ReleaseEngineering/BuildAPI#Updating_code ?
updated trychooser to fix bustage
2015-01-23
deployed new bbot r/o pw to aws-manager and 'million other non puppetized tools'
do we have a list? We should puppetize them *or* replace them
filed: Bug 1125218 - disk space nagios alerts are too aggressive for signing4.srv.releng.scl3.mozilla.com
investigated: Bug 1124200 - Android 4 L10n Nightly Broken
report-4hr hung at 10:42 - coop killed the cron task
sheeri reported that mysql slave is overworked right now and she will add another node
should we try to get a more self-serve option here, or a quicker view into the db state?
for DB state we have https://rpm.newrelic.com/accounts/263620/dashboard/3101982 and similar
https://bugzil.la/1125269 - survey of r5s uncovered two machines running slower RAM
http://callek.pastebin.mozilla.org/8314860 <- jacuzzi patch (saved in pastebin for 1 day)
2015-01-22
reconfig
required backout of mozharness patch from https://bugzil.la/1123443 due to bustage
philor reported spidermonkey bustage: https://treeherder.mozilla.org/logviewer.html#?job_id=5771113&repo=mozilla-inbound
change by sfink - https://bugzilla.mozilla.org/show_bug.cgi?id=1106707#c11
https://bugzil.la/1124705 - tree closure due to builds-4hr not updating
queries and replication blocked in db
sheeri flushed some tables, builds-4hr recovered
re-opened after 20min
https://bugzil.la/1121516 - sheeri initiated buildbot db failover after reconfig (per email)
philor complaining about panda state:
"I lied about the panda state looking totally normal - 129 broken then, fine, exactly 129 broken for all time, not so normal"
filed Bug 1124863 - more than 100 pandas have not taken a job since 2015-01-20 around reconfig
status: fixed
filed Bug 1124850 - slaveapi get_console error handling causes an exception when log formatting
status: wontfix but pinged callek before closing
filed Bug 1124843 - slaveapi cltbld creds are out of date
status: fixed, also improved root pw list order
did a non merge reconfig for armen/bustage
b2g37 fix for bustage I (jlund) caused. reconfiged https://bugzil.la/1055919
2015-01-21
landed fix for https://bugzil.la/1123395 - Add ability to reboot slaves in batch on the slavetype page
many of our windows timeouts (2015-01-16) may be the result of not having enough RAM. Need to look into options like doubling page size: https://bugzilla.mozilla.org/show_bug.cgi?id=1110236#c20
2015-01-20
reconfig, mostly to test IRC notifications
master
grabbed 2 bugs:
https://bugzil.la/1122379 - Loan some slaves to :Fallen for his master
https://bugzil.la/1122859 - Slave loan request for a bld-linux64-ec2 vm to try installing gstreamer 1.0
use releng loan OU?
network flapping throughout the day
filed: https://bugzil.la/1123911
starting discussion in #netops
aws ticket opened
b2g bumper bustage
fix in https://bugzil.la/1122751
rebooted ~100 pandas that stopped taking jobs after reconfig
Bug 1124059 - create a buildduty dashboard that highlights current infra health
TODO: Bug for "make it painfully obvious when slave_health testing mode is enabled, thus is displaying stale data"
hurt philor in #releng this evening when an old patch with testing mode on was deployed.
i have a precommit hook for this now, shouldn't happen again
2015-01-19
Filed bugs for issues discussed on Friday:
https://bugzil.la/1123395 - Add ability to reboot slaves in batch on the slavetype page
https://bugzil.la/1123371 - provide access to more timely data in slave health
https://bugzil.la/1123390 - Synchronize the running/pending parsing algorithms between slave health and nthomas' reports
fixed slavealloc datacenter issue for some build/try linux instances - https://bugzilla.mozilla.org/show_bug.cgi?id=1122582#c7
re-imaged b-2008-ix-0006, b-2008-ix-0020, b-2008-ix-0172
deployed 'terminate' to slaveapi and then broke slaveapi for bonus points
re-patched 'other' aws end points for slaveapi - deploying that today (20th)
fixed nical's troublesome loan
2015-01-16 (rollup of below scratchpad)
JLUND sheriffs requested I investigate:
spike in win64 filesystem loops:
sheriffs noted they have pinged many times recently and will start disabling slaves if nuking the objdir is not the preferred fix
nuking the objdir of the related builder on b-2008-ix-0114
filed bug 1122746
Bug 916765 - Intermittent "command timed out: 600 seconds without output, attempting to kill" running expandlibs_exec.py in libgtest
diagnosis: https://bugzilla.mozilla.org/show_bug.cgi?id=916765#c193
follow up: I will post a patch but it is not buildduty actionable from here on out IMO
Bug 1111137 - Intermittent test_user_agent_overrides.html | Navigator UA not overridden at step 1 - got Mozilla/5.0 (Android; Mobile; rv:37.0) Gecko/37.0 Firefox/37.0, expected DummyUserAgent
diagnosis: https://bugzilla.mozilla.org/show_bug.cgi?id=1111137#c679
follow up: nothing for buildduty
Bug 1110236 - Intermittent "mozmake.exe[6]: *** [xul.dll] Error 1318" after "fatal error LNK1318: Unexpected PDB error"
diagnosis: https://bugzilla.mozilla.org/show_bug.cgi?id=1110236#c17
there was a common trend from the above 3 bugs with certain slaves
filed tracker and buildduty follow up bug: https://bugzil.la/1122975
loans:
fallen for setting up slaves on his master https://bugzil.la/1122379
nical tst ec2 https://bugzil.la/1121992
CALLEK Puppet Issues:
Had a db_cleanup puppet failure on bm81, catlee fixed with http://hg.mozilla.org/build/puppet/rev/d88423d7223f
There is a MIG puppet issue blocking our golden AMIs from completing. Ulfr pinged in #releng and I told him he has time to investigate (rather than asking for an immediate backout)
Tree Closure:
https://bugzilla.mozilla.org/show_bug.cgi?id=1122582
Linux jobs, test and build were pending far too long
I (Callek) got frustrated trying to find out what the problem was and trying to get other releng assistance to look at it
Boils down to capacity issues, but was darn hard to pinpoint
Action Items
Find some way to identify more easily when we're at capacity in AWS (my jacuzzi slave health work should help with that, at least a bit)
Get <someone> to increase our AWS capacity or find out if/why we're not using existing capacity. If increasing we'll need more masters.
2015-01-16
arr 12:08:38 any of the eu folks around? looks like someone broke puppet last night.
mgerva 12:20:01 arr: i'm here
arr 12:21:10 mgerva: looks like a problem with someone who's trying to upgrade mig
arr 12:21:32 mgerva: it's been sending out mail about failing hosts
arr 12:21:39 wasn't sure if it was also taking them offline eventually
arr 12:21:48 (so I think this is limited to linux)
12:32:43 mgerva is now known as mgerva|afk
pmoore 12:47:16 arr: mgerva|afk: since the sheriffs aren't complaining yet, we can probably leave this for build duty which should start in a couple of hours
arr 12:47:46 pmoore: okay!
pmoore 12:47:51 i don't think anyone is landing puppet changes at the moment, so hopefully it should affect anything… i hope!
pmoore 12:48:02 *shouldn't*
I see two different errors impacting different types of machines:
Issues with mig: Puppet (err): Could not update: Execution of '/usr/bin/apt-get -q -y -o DPkg::Options::=--force-confold --force-yes install mig-agent=20150109+a160729.prod' returned 100
Issues with a different config file: Puppet (err): Could not retrieve catalog from remote server: Error 400 on SERVER: Invalid parameter user on File[/builds/buildbot/db_maint/config.ini] at /etc/puppet/production/modules/buildmaster/manifests/db_maintenance.pp:48
2015-01-15
reviewed and deployed https://hg.mozilla.org/build/mozharness/rev/3a6062cbd177 (to fix vcs sync fix for gecko l10n)
enabled mochitest-chrome on B2G emulators on cedar (bug 1116187) as part of the merge default -> production for mozharness
2015-01-14
slave loan for tchou
started patch to reboot slaves that have not reported in X hours (slave health)
reconfig for catlee/ehsan
recovered 2 windows builders with circular directory structure
2015-01-13
reconfig for ehsan
https://bugzil.la/1121015 - dolphin non-eng nightlies busted after merge
bhearsum took it (fallout from retiring update.boot2gecko.org)
scheduler reconfig for fubar
https://bugzil.la/1117811 - continued master setup for Fallen
clearing buildduty report backlog
2015-01-12
recovering loaned slaves
setting up Tb test master for Fallen
already has one apparently, some commentary in bug 1117811
reconfig took almost 4hr (!)
some merge day fallout with split APK
2015-01-08
https://bugzilla.mozilla.org/show_bug.cgi?id=1119447 - All buildbot-masters failing to connect to MySQL: Too many connections
caused 3-hour tree closure
2015-01-07
wrote https://wiki.mozilla.org/ReleaseEngineering/Buildduty/Other_Duties#Marking_jobs_as_complete_in_the_buildbot_database
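for context, marking a job complete boils down to an UPDATE on the buildbot DB; a hedged sketch assuming the buildbot 0.8 buildrequests schema (complete/complete_at/results) - follow the wiki page above rather than this snippet:

import time
import MySQLdb

def mark_request_complete(conn, brid, results=2):  # 2 == FAILURE in buildbot's results enum
    cur = conn.cursor()
    cur.execute(
        "UPDATE buildrequests "
        "SET complete = 1, complete_at = %s, results = %s "
        "WHERE id = %s AND complete = 0",
        (int(time.time()), results, brid),
    )
    conn.commit()

# conn = MySQLdb.connect(host="<buildbot db host>", user="...", passwd="...", db="buildbot")
# mark_request_complete(conn, 12345678)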
adjusted retention time on signing servers to 4 hours (from 8) to deal with nagios disk space alerts
2015-01-06
https://bugzilla.mozilla.org/show_bug.cgi?id=1117395 - Set RETRY on "timed out waiting for emulator to start" and "We have not been able to establish a telnet connection with the emulator"
trees closed because emulator jobs won't start
https://bugzilla.mozilla.org/show_bug.cgi?id=1013634 - libGL changes <- not related according to rail
backed out Armen's mozharness patch: http://hg.mozilla.org/build/mozharness/rev/27e55b4b5c9a
reclaiming loaned machines based on responses to yesterday's notices
2015-01-05
sent reminder notices to people with loaned slaves
2014-12-30
2014-12-29
returning spot nodes disabled by philor
these terminate pretty quickly after being disabled (which is why he does it)
to re-enable en masse, run 'update slaves set enabled=1 where name like '%spot%' and enabled=0' in the slavealloc db
use the buildduty report, click on the 'View list in Bugzilla' button, and then close all the spot node bugs at once
started going through bugs in the dependencies resolved section based on age. Here is a rundown of state:
b-2008-ix-0010: kicked off a re-image, but I did this before fire-and-forget in early Dec and it doesn't seem to have taken. will check back in later
:markco using to debug Puppet on Windows issues
panda-0619: updated relay info, but unclear in bug whether there are further issues with panda or chassis
2014-12-14
2014-12-19
what did we accomplish?
vcssync, b2g bumper ready to hand off to dev services(?)
increased windows test capacity
moved staging slaves to production
disabled 10.8 on try
PoC for further actions of this type
investigations with jmaher re: turning off "useless" tests
opening up releng for contributions:
new public distribution list
moved tests over to travis
mentored bugs
improved reconfigs
set up CI for b2g bumper
what do we need to accomplish next quarter?
self-serve slave loan
turn off "useless" tests
have a way to do this easily and regularly
better ability to correlate tree state changes with releng code changes
better platform change pipeline
proper staging env
task cluster tackles most of the above, therefore migration of jobs to task cluster should enable these as a consequence
what tools do we need?
self-serve slave loan
terminate AWS instances from slave health (slaveapi)
ability to correlate releng changes with tree state changes
e.g. linux tests started timing out at Thursday at 8:00am: what changed in releng repos around that time?
armen's work on pinning mozharness tackles the mozharness part - migrating to task cluster puts build configs in-tree, so is also solved mostly with task cluster move
2014-11-27
https://bugzil.la/1105826
trees closed most of the day due to Armen's try jobs run amok
reporting couldn't handle the volume of retried jobs, affected buildapi and builds-4hr
disabled buildapi cronjobs until solution found
db sync between master->slave lost for 5 hours
filed https://bugzil.la/1105877 to fix db sync; paged sheeri to fix
fixed ~8pm ET
re-ran buildapi cronjobs incrementally by hand in order to warm the cache for builds-4hr
all buildapi cronjobs re-enabled
catlee picked up https://bugzil.la/733663 for the long-term fix
didn't get to deploy https://bugzil.la/961279 as planned :(
2014-11-26
https://bugzil.la/961279 - Mercurial upgrade - how to proceed?
yes, we should have time to deploy it Thursday/Friday this week
2014-11-25
https://bugzil.la/1104741 - Tree closed for Windows 8 Backlog
caused by nvidia auto-updates (probably not the first time)
Q found fix to disable
rebooted all w864 machines
https://bugzil.la/1101285 - slaveapi doesn't handle 400 status from bugzilla
needed to deploy this today so we could reboot the 80+ w8 slaves that didn't have problem tracking bugs yet
also deployed logging fix (https://bugzil.la/1073630) and component change for filing new bugs (https://bugzil.la/1104451)
https://bugzil.la/1101133 - Intermittent Jit tests fail with "No tests run or test summary not found"
too many jit_tests!
2014-11-24
kanban tool?
https://bugzil.la/1104113 - Intermittent mozprocess timed out after 330 seconds
2014-11-21
work on 10.10
running in staging
restarted bm84
reconfig for bhearsum/rail for pre-release changes for Fx34
setup foopy56 after returning from diagnostics
2014-11-20 a.k.a "BLACK THURSDAY"
https://bugzilla.mozilla.org/show_bug.cgi?id=1101133
https://bugzilla.mozilla.org/show_bug.cgi?id=1101285
2014-11-19
https://bugzil.la/1101786 - Mac fuzzer jobs failing to unzip tests.zip
bm85 - BAD REQUEST exceptions
gracefully shutdown and restarted to clear
https://bugzil.la/1092606
2014-11-18
bm82 - BAD REQUEST exceptions
gracefully shutdown and restarted to clear
updated tools on foopys to pick up Callek's patch to monitor for old pywebsocket processes
sent foopy56 for diagnostics
https://bugzil.la/1082852 - slaverebooter hangs
had been hung since Nov 14
threads aren't terminating, need to figure out why
have I mentioned how much i hate multi-threading?
https://bugzil.la/1094293 - 10.10 support
patches waiting for review
2014-11-17
meeting with A-team
reducing test load
http://alertmanager.allizom.org/seta.html
2014-11-14
???
2014-11-13
???
2014-11-12
???
2014-11-11
???
2014-11-10
release day
ftp is melting under load; trees closed
dev edition went unthrottled
catlee throttled background updates to 25%
dev edition not on CDN
https://bugzilla.mozilla.org/show_bug.cgi?id=1096367
2014-11-07
shared mozharness checkout
jlund hg landings
b2g_bumper travis tests working
buildbot-master52
hanging on every reconfig
builder limits, hitting PB limits
split masters: Try + Everything Else?
graceful not working -> nuke it from orbit
structured logging in mozharness has landed
coop to write docs:
moving slaves from production to staging
dealing with bad slaves
2014-11-06
b2g_bumper issues
https://bugzil.la/1094922 - Widespread hg.mozilla.org unresponsiveness
buildduty report queue
some jobs pending for more than 4 hours
aws tools needed to have the new cltbld password added to their json file, idle instances not being reaped
need some monitoring here
2014-11-05
sorry for the last few days, something important came up and i've barely been able to focus on buildduty
https://bugzil.la/foopy56
hitting load spikes
https://bugzil.la/990173 - Move b2g bumper to a dedicated host
bm66 hitting load spikes
what is best solution: beefier instance? multiple instances?
PT - best practices for buildduty?
keep "Current" column accurate
2014-11-04
t-snow-r4-0002 hit an hdiutil error and is now unreachable
t-w864-ix-026 destroying jobs, disabled
https://bugzilla.mozilla.org/show_bug.cgi?id=1093600
bugzilla api updates were failing, fixed now
affected reconfigs (script could not update bugzilla)
https://bugzilla.mozilla.org/show_bug.cgi?id=947462
tree outage when this landed
backed it out
probably it can be relanded, just needs a clobber
2014-11-03
2014-10-31
valgrind busted on Try
only build masters reconfig-ed last night by nthomas
reconfig-ed try masters this morning
https://bugzilla.mozilla.org/show_bug.cgi?id=1071281
mshal has metrics patch for landing
windows release repacks failing on b-2008-ix-0094
https://secure.pub.build.mozilla.org/builddata/reports/slave_health/slave.html?name=b-2008-ix-0094
failing to download rpms for centos
https://bugzilla.mozilla.org/show_bug.cgi?id=1085348
http://netops2.private.scl3.mozilla.com/smokeping/sm.cgi?target=Datacenters.RELENG-SCL3
mapper changes
docs in progress
WSME doesn't support non-JSON requests
b2g bumper change
in progress
kanban tool?
sent mail to group
we'll use pivotal tracker
reconfig for jlund
re-image b-2008-ix-0094
blog post:
http://coop.deadsquid.com/2014/10/10-8-testing-disabled-by-default-on-try/
2014-10-30
how best to handle broken manifests?
difference of opinion w/ catlee
catlee does see the human cost of not fixing this properly
mapper docs
b2g bumper: log rotation
https://bugzil.la/1091707
Frequent FTP/proxxy timeouts across all trees
network blip?
https://bugzil.la/1091696
swap on fwunit1.private.releng.scl3.mozilla.com is CRITICAL: SWAP CRITICAL - 100% free (0 MB out of 0 MB)
these are dustin's firewall unit tests: ping him when we get these alerts
reconfig
2014-10-29
b2g bumper
b2g manifests
no try for manifests
All new w864 boxes have wrong resolution
Q started to investigate, resurrected 3
slave bugs linked against https://bugzil.la/1088839
started thread about disabling try testing on mtnlion by default
https://bugzilla.mozilla.org/show_bug.cgi?id=1091368
2014-10-28
testing new hg 3.1.2 GPO
https://bugzilla.mozilla.org/show_bug.cgi?id=1056981
failing to find pem files
cleaned up loaner list from yesterday
closed 2 bugs that were unused
added 2 missing slavealloc notes
terminated 11 instances
removed many, many out-of-date names & hosts from ldapadmin
lots of bigger scope bugs getting filed under the buildduty category
most belong in general automation or tools IMO
I don't think buildduty bugs should have a scope bigger than what can be accomplished in a single day. thoughts?
reconfig to put new master (bm119) and new Windows test slaves into production
massive spike in pending jobs around 6pm ET
2000->5000
closed trees
waded through the buildduty report a bit
2014-10-27
19 *running* loan instances
dev-linux64-ec2-jlund2 dev-linux64-ec2-kmoir dev-linux64-ec2-pmoore dev-linux64-ec2-rchien dev-linux64-ec2-sbruno tst-linux32-ec2-evold tst-linux64-ec2-evanxd tst-linux64-ec2-gbrown tst-linux64-ec2-gweng tst-linux64-ec2-jesup tst-linux64-ec2-jesup2 tst-linux64-ec2-jgilbert tst-linux64-ec2-kchen tst-linux64-ec2-kmoir tst-linux64-ec2-mdas tst-linux64-ec2-nchen tst-linux64-ec2-rchien tst-linux64-ec2-sbruno tst-linux64-ec2-simone
27 open loan bugs: http://mzl.la/1nJGtTw
We should reconcile. Should also cleanup entries in ldapadmin.
2014-10-24
https://bugzil.la/1087013 - Move slaves from staging to production
posted patch to cleanup mozilla-tests
filed https://bugzil.la/1088839
get new slaves added to configs, dbs
2014-10-23
test slaves sometimes fail to start buildbot on reboot
https://bugzilla.mozilla.org/show_bug.cgi?id=1088032
re-imaging a bunch of w864 machines that were listed as only needing a re-image to be recovered:
t-w864-ix-0[04,33,51,76,77]
re-image didn't help any of these slaves
https://bugzilla.mozilla.org/show_bug.cgi?id=1067062
investigated # of windows test masters required for arr
500 windows test slaves, 4 existing windows test masters
filed https://bugzilla.mozilla.org/show_bug.cgi?id=1088146 to create a new master
OMG load
~7000 pending builds at 4pm ET
KWierso killed off lots of try load: stuff that had already landed, stuff with followup patches
developer hygiene is terrible here
created https://releng.etherpad.mozilla.org/platform-management-known-issues to track ongoing issues with various slave classes
2014-10-22
no loaners
no bustages
remove slave lists from configs entirely
pete to add see also to https://bugzil.la/1087013
merge all (most?) releng repos into a single repo
https://bugzilla.mozilla.org/show_bug.cgi?id=1087335
mac-v2-signing3 alerting in #buildduty <- not dealt with
git load spiked: https://bugzilla.mozilla.org/show_bug.cgi?id=1087640
caused by Rail's work with new AMIs
https://bugzil.la/1085520 ? <- confirm with Rail
https://graphite-scl3.mozilla.org/render/?width=586&height=308&_salt=1414027003.262&yAxisSide=right&title=git%201%20mem%20used%20%26%20load&from=-16hours&xFormat=%25a%20%25H%3A%25M&tz=UTC&target=secondYAxis%28hosts.git1_dmz_scl3_mozilla_com.load.load.shortterm%29&target=hosts.git1_dmz_scl3_mozilla_com.memory.memory.used.value&target=hosts.git1_dmz_scl3_mozilla_com.swap.swap.used.value
a similar problem occurred a while ago
it led to the creation of the golden master
many win8 machines "broken" in slave health
working theory is that 64-bit browser is causing them to hang somehow
https://bugzil.la/1080134
same for mtnlion
same for win7
we really need to find out why these slaves will simply fail to start buildbot and then sit waiting to be rebooted
2014-10-21
https://bugzilla.mozilla.org/show_bug.cgi?id=1086564 Trees closed
alerted Fubar - he is working on it
https://bugzilla.mozilla.org/show_bug.cgi?id=1084414 Windows loaner for ehsan
killing esr24 branch
https://bugzil.la/1066765 - disabling foopy64 for disk replacement
https://bugzil.la/1086620 - Migrate slave tools to bugzilla REST API
wrote patches and deployed to slavealloc, slave health
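For context, a minimal sketch of the kind of call the migrated tools make against the Bugzilla REST API (/rest/bug/<id>); the field selection and error handling are illustrative, not the actual slavealloc/slave health code:

  import json
  import urllib.request

  BUGZILLA_REST = "https://bugzilla.mozilla.org/rest/bug"

  def bug_summary(bug_id):
      # fetch only the fields the slave tools typically need
      url = "%s/%s?include_fields=id,summary,status,resolution" % (BUGZILLA_REST, bug_id)
      with urllib.request.urlopen(url, timeout=30) as resp:
          data = json.load(resp)
      return data["bugs"][0]

  print(bug_summary(1086620))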
trimmed Maintenance page to Q4 only, moved older to 2014 page
filed https://bugzil.la/1087013 - Move slaves from staging to production
take some of slave logic out of configs, increase capacity in production
helped mconley in #build with a build config issue
https://bugzil.la/973274 - Install GStreamer 1.x on linux build and test slaves
this may have webrtc implications, will send mail to laura to check
2014-10-20
reconfig (jetpack fixes, alder l10n, holly e10s)
several linux64 test masters hit the PB limit
put out a general call to disable branches, jobs
meanwhile, set masters to gracefully shutdown, and then restarted them. Took about 3 hours.
64-bit Windows testing
clarity achieved!
testing 64-bit browser on 64-bit Windows 8, no 32-bit testing on Windows 8 at all
this means we can divvy the incoming 100 machines between all three Windows test pools to improve capacity, rather than just beefing up the Win8 platform and splitting it in two
2014-10-17
blocklist changes for graphics (Sylvestre)
code for bug updating in reconfigs is done
review request coming today
new signing server is up
pete is testing, configuring masters to use it
some classes of slaves not reconnecting to masters after reboot
e.g. mtnlion
need to find a slave in this state and figure out why
puppet problem? runslave.py problem (connection to slavealloc)? runner issue (connection to hg)?
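A rough triage sketch for those three questions, runnable on an affected slave; the slavealloc hostname below is a placeholder, not necessarily the real one:

  import socket
  import subprocess

  def can_connect(host, port, timeout=10):
      # basic TCP reachability check from the slave
      try:
          socket.create_connection((host, port), timeout=timeout).close()
          return True
      except OSError:
          return False

  def buildbot_running():
      # is any buildbot process alive on this slave?
      return subprocess.run(["pgrep", "-f", "buildbot"], capture_output=True).returncode == 0

  print("buildbot running:", buildbot_running())
  # runslave.py needs slavealloc; runner needs hg
  print("slavealloc reachable:", can_connect("slavealloc.pub.build.mozilla.org", 443))  # placeholder host
  print("hg.mozilla.org reachable:", can_connect("hg.mozilla.org", 443))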
patch review for https://bugzilla.mozilla.org/show_bug.cgi?id=1004617
clobbering m-i for rillian
helping Tomcat cherry-pick patches for m-r
reconfig for Alder + mac-signing
2014-10-16
Updated all windows builders with new ffxbld_rsa key
Patched reconfig code to publish to bugzilla - will test on next reconfig
Working on set up of mac v2 signing server
Fixed sftp.py script
Set up meeting with J Lal, H Wine, et al. for vcs sync handover
lost my reconfig logs from yesterday, which I needed to validate against https://hg.mozilla.org/build/tools/file/a8eb2cdbe82e/buildfarm/maintenance/watch_twistd_log.py#l132 - will do so with the next reconfig
spoke to Amy about windows reimaging problem, and requested a single windows reimage to validate GPO setup
reconfig for alder and esr24 changes
rebooting mtnlion slaves that had been idle for 4 hours (9 of them)
this seems to be a common occurrence. If I can find a slave in this state today, I'll file a bug and dive in. Not sure why the machine is rebooting and not launching buildbot.
2014-10-15
re-started instance for evold - https://bugzilla.mozilla.org/show_bug.cgi?id=1071125
landed wiki formatting improvements - https://bugzilla.mozilla.org/show_bug.cgi?id=1079893
landed slaverebooter timeout fix + logging improvements - https://bugzilla.mozilla.org/show_bug.cgi?id=1082852
disabled instance for dburns
reconfig for jlund - https://bugzil.la/1055918
2014-10-14
https://bugzilla.mozilla.org/show_bug.cgi?id=1081825 b2gbumper outage / mirroring problem - backed out - new mirroring request in bug https://bugzilla.mozilla.org/show_bug.cgi?id=1082466
symptoms: b2g_bumper lock file is stale
should mirror new repos automatically rather than fail
bare minimum: report which repo is affected
https://bugzilla.mozilla.org/show_bug.cgi?id=962863 rolling out l10n gecko and l10n gaia vcs sync - still to do: wait for first run to complete, update wiki, enable cron
https://bugzilla.mozilla.org/show_bug.cgi?id=1061188 rolled out, and had to back out due to puppet changes not hitting spot instances yet and GPO changes not hitting all windows slaves yet - for spot instances, we just need to wait; for GPO, I have a needinfo on :markco
need method to generate new golden AMIs on demand, e.g. when puppet changes land
mac signing servers unhappy - likely related to the higher load from the tree closure - downtimed them in #buildduty for now
backlog of builds on Mac
related to slaverebooter hang?
many were hung for 5+ hours trying to run signtool.py on repacks
not sure whether this was a cause or a symptom of the signing server issues
could also be related to reconfigs + key changes (ffxbld_rsa)
rebooted idle&hung mac builders by hand
https://bugzilla.mozilla.org/show_bug.cgi?id=1082770 - getting another mac v2 signing machine into service
sprints for this week:
[pete] bug updates from reconfigs
[coop] password updates?
slaverebooter was hung but not alerting, *however* I did catch the failure mode: indefinitely looping waiting for an available worker thread
added a 30min timeout waiting for a worker, running locally on bm74
filed https://bugzilla.mozilla.org/show_bug.cgi?id=1082852
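A minimal sketch of that timeout, assuming the rebooter hands slaves to a fixed pool of worker threads through a bounded queue (names and structure are illustrative, not the actual slaverebooter code):

  import queue
  import threading

  WORKER_TIMEOUT = 30 * 60  # stop waiting for a free worker after 30 minutes

  def reboot_slave(name):
      print("rebooting", name)  # placeholder for the real reboot logic

  def worker(q):
      while True:
          name = q.get()
          try:
              reboot_slave(name)
          finally:
              q.task_done()

  def run(slaves, num_workers=8):
      # put() blocks when all workers are busy and the queue is full
      q = queue.Queue(maxsize=num_workers)
      for _ in range(num_workers):
          threading.Thread(target=worker, args=(q,), daemon=True).start()
      for name in slaves:
          try:
              q.put(name, timeout=WORKER_TIMEOUT)
          except queue.Full:
              raise RuntimeError("no worker thread became available within 30 minutes")
      q.join()

  run(["t-w864-ix-004", "talos-mtnlion-r5-005"])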
put foopy64 back into service - https://bugzilla.mozilla.org/show_bug.cgi?id=1066765
https://bugzil.la/1082818 - t-w864-ix loaner for Armen
https://bugzil.la/1082784 - tst-linux64-ec2 loaner for dburns
emptied buildduty bug queues
2014-10-13
https://bugzil.la/1061861 Merge day ongoing - keep in close contact with Rail
https://bugzil.la/1061589 Deployed ffxbld_rsa key to hiera and landed Ben's changes
Awaiting merge day activities before reconfiging outstanding changes:
https://bugzil.la/1077154
https://bugzil.la/1080134
https://bugzil.la/885331
mac-v2-signing1 complained a couple of times in #buildduty, but self-resolved
2014-10-10:
kgrandon reported an issue getting updates for flame-kk
some investigation
cc-ed him on https://bugzilla.mozilla.org/show_bug.cgi?id=1063237
work
2014-10-09
db issue this morning all day
sheeri ran an errant command on the slave db that inadvertently propagated to the master db
trees closed for about 2 hours until jobs started
however, after the outage while the trees were still closed, we did a live fail over between the master and the slave without incident
later in the day, we tried to fail back over to the master db from the slave db, but we ended up with inconsistent data between the two databases. This resulted in a bunch of jobs not starting because they were in the wrong db.
fixed with a hot copy
filed https://bugzil.la/1080855 for RFO
https://bugzil.la/1079396 - loaner win8 machine for :jrmuziel
https://bugzil.la/1075287 - loaner instance for :rchien
after some debugging over the course of the day, determined he needed a build instance after all
filed https://bugzil.la/1080951 - Add fabric action to reset the timestamp used by buildbot-master exception log reporting
2014-10-08
https://bugzil.la/1079778 - Disabled pandas taking jobs
https://secure.pub.build.mozilla.org/builddata/reports/slave_health/slave.html?name=panda-0425
2014-10-07
https://bugzilla.mozilla.org/show_bug.cgi?id=1079256 - B2G device image nightlies (non-eng only) constantly failing/retrying due to failure to upload to update.boot2gecko.org
cleaned up, now 5% free
fix is to stop creating/publishing/uploading b2g mars for all branches *except* 1.3 <- https://bugzilla.mozilla.org/show_bug.cgi?id=1000217
fallout from PHX outage?
golden images (AMIs) keep getting re-puppetized: arr and rail discussing
cert issue should be fixed now
slave loan command line tool: https://github.com/petemoore/build-tools/tree/slave_loan_command_line (scripts/slave_loan.sh)
cleared buildduty report module open dependencies
filed https://bugzil.la/1079468 - [tracking][10.10] Continuous integration testing on OS X 10.10 Yosemite
2014-10-06
https://bugzilla.mozilla.org/show_bug.cgi?id=1078300#c3
hg push showing on tbpl and treeherder with no associated builders generated
sprints for this week:
slaverebooter
[coop] determine why it sometimes hangs on exit
[pete] end_to_end_reconfig.sh
add bug updates
2014-10-03
slaverebooter hung
added some extra instrumentation locally to try to find out why, when removed the lockfile and restarted
hasn't failed again today, will see whether it fails around the same time tonight (~11pm PT)
https://bugzil.la/1077432 - Skip more unittests on capacity-starved platforms
now skipping opt tests for mtnlion/win8/win7/xp
reconfig
https://bugzil.la/1065677 - started rolling restarts of all masters
done
2014-10-02
mozmill CI not receiving pulse messages
some coming through now
no logging around pulse notifications
mtnlion machines
lost half of pool last night, not sure why
dbs shutdown last night <- related?
reboot most, leave 2 for diagnosis
snow affected? yes! rebooted
also windows
rebooted XP and W8
can't ssh to w7
rebooted via IPMI (-mgmt via web)
pulse service - nthomas restarted pulse on masters
multiple instances running, not checking PID file
https://bugzilla.mozilla.org/show_bug.cgi?id=1038006
bug 1074147 - increasing test load on Windows
coop wants to start using our skip-test functionality on windows (every 2) and mtnlion (every 3)
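A minimal sketch of the "every Nth push" idea; the divisors match the note above, but the structure and names are hypothetical, not the actual buildbot-configs keys:

  # run the full test suite only on every Nth push for capacity-starved platforms
  SKIP_EVERY = {"win8": 2, "win7": 2, "xp": 2, "mtnlion": 3}

  def should_run_tests(platform, push_count):
      n = SKIP_EVERY.get(platform, 1)  # unlisted platforms test every push
      return push_count % n == 0

  for push in range(6):
      print(push, sorted(p for p in SKIP_EVERY if should_run_tests(p, push)))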
uploads timing out
led to (cause of?) running out of space on upload1/stage
cleared out older uploads (>2 hrs)
find /tmp -type d -mmin +60 -exec rm -rf "{}" \;
think it might be related to network blips (BGP flapping) from AWS: upload gets interrupted, doesn't get cleaned up for 2+ hours. With the load we've had today, that wouldn't be surprising
smokeping is terrible: http://netops2.private.scl3.mozilla.com/smokeping/sm.cgi?target=Datacenters.RELENG-SCL3.nagios1-releng-use1
filed https://bugzilla.mozilla.org/show_bug.cgi?id=1077187
restarted slaveapi to clear any bad state from today
re-ran slaverebooter by hand (bm74)
https://bugzil.la/1069429 - Upload mozversion to internal pypi
https://bugzil.la/1076934 - Temporarily turn off OTA on FX OS Master branch
don't have a proper buildid to go on here, may have time to look up later
added report links to slave health: hgstats, smokepings
2014-10-01
https://bugzil.la/1074267 - Slave loan request for a bld-lion-r5 machine
https://bugzil.la/1075287 - Requesting a loaner machine b2g_ubuntu64_vm to diagnose Bug 942411
2014-09-30
https://bugzil.la/1074827 - buildbot-configs_tests failing on Jenkins due to problem with pip install of master-pip.txt
non-frozen version of OpenSSL being used - rail fixing and trying again
https://bugzil.la/943932 - T-W864-IX-025 having blue jobs
root cause not known - maybe faulty disk - removed old mozharness checkout, now has a green job
https://bugzil.la/1072434 - balrog submitter doesn't set previous build number properly
this caused bustage with locale repacks - nick and massimo sorted it out
https://bugzil.la/1050808 - several desktop repack failures today - I proposed we apply patch in this bug
https://bugzil.la/1072872 - last machines rebooted
https://bugzil.la/1074655 - Requesting a loaner machine b2g_ubuntu64_vm to diagnose bug 1053703
going through buildduty report
filed new panda-recovery bug, added pandas to it
t-snow-r4-0075: reimaged, returned to production
talos-linux64-ix-027: reimaged, returned to production
emptied 3 sections (stopped listing the individual bugs)
reconfig
https://bugzil.la/1062465 - returned foopy64 and attached pandas to production
disk is truly failing on foopy64, undid all that work
2014-09-29
https://bugzil.la/1073653 - bash on OS X
dustin landed fix, watched for fallout
complications with signing code bhearsum landed
all Macs required nudging (manual puppet runs + reboot). Mercifully dustin and bhearsum took care of this.
https://bugzil.la/1072405 - Investigate why backfilled pandas haven't taken any jobs
checking failure logs for patterns
looks like mozpool is still trying to reboot using old relay info: needs re-sync from inventory?
tools checkout on foopies hadn't been updated, despite a reconfig on Saturday
enabled 610-612
each passed 2 tests in a row, re-enabled the rest
cleaned up resolved loans for bld-lion, snow, and mtnlion machines
https://bugzil.la/1074358 - Please loan OS X 10.8 Builder to dminor
https://bugzil.la/1074267 - Slave loan request for a talos-r4-snow machine
https://bugzil.la/1073417 - Requesting a loaner machine b2g_ubuntu64_vm to diagnose
2014-09-26
cleared dead pulse queue items after pulse publisher issues in the morning
https://bugzil.la/1072405 - Investigate why backfilled pandas haven't taken any jobs
updated devices.json
created panda dirs on foopies
had to re-image panda-0619 by hand
all still failing, need to investigate on Monday
https://bugzil.la/1073040 - loaner for mdas
Besides the handover notes for last week, which I received from pete, there are the following issues:
https://bugzilla.mozilla.org/show_bug.cgi?id=1038063 - Running out of space on dev-stage01:/builds. The root cause of the alerts was the addition of the folder /builds/data/ftp/pub/firefox/releases/31.0b9 by Massimo in order to run some tests. Nick did some further cleanup; the bug has been reopened this morning by pete, proposing to automate some of the steps Nick did manually.
https://bugzilla.mozilla.org/show_bug.cgi?id=1036176 - Some spot instances in us-east-1 are failing to connect to hg.mozilla.org. Some troubleshooting has been done by Nick, and case 222113071 has been opened with AWS.
2014-07-07 to 2014-07-11
Hi Simone,
Open issues at end of week:
foopy117 is playing up (https://bugzilla.mozilla.org/show_bug.cgi?id=1037441); this is also affecting end_to_end_reconfig.sh (solution: comment out the manage_foopies.py lines from that file and run them manually). foopy117 seems to be back and working normally now.
Major problems with pending queues (https://bugzilla.mozilla.org/show_bug.cgi?id=1034055), most notably for the linux64 in-house ix machines - this should hopefully be fixed relatively soon. Not a lot you can do about this - just be aware of it if people ask. Hopefully this is solved after Kim's recent work.
Two changes currently in queue for next reconfig: https://bugzilla.mozilla.org/show_bug.cgi?id=1019962 (armenzg) and https://bugzilla.mozilla.org/show_bug.cgi?id=1025322 (jford)
Some changes to update_maintenance_wiki.sh from aki will be landing when the review passes (https://bugzilla.mozilla.org/show_bug.cgi?id=1036573) - these could impact the wiki update in end_to_end_reconfig.sh, since it has been refactored - be aware of this.
Currently no outstanding loaner requests at time of handover, but there are some that need to be checked or returned to the pool.
See the 18 open loan requests: https://bugzilla.mozilla.org/buglist.cgi?bug_id=989521%2C1036768%2C1035313%2C1035193%2C1036254%2C1006178%2C1035270%2C876013%2C818198%2C981095%2C977190%2C880893%2C1017303%2C1023856%2C1017046%2C1019135%2C974634%2C1015418&list_id=10700493
I've pinged all the people in this list (except for requests less than 5 days old) to ask for status.
Pete
2014-05-26 to 2014-05-30
- Monday
- Tuesday
- reconfig
- catlee's patch had bustage, and armenzg's had unintended consequences
- they each reconfiged again for their own problems
- buildduty report:
- tackled bugs without dependencies
- tackled bugs with all dependencies resolved
- reconfig
- Wednesday
- new nightly for Tarako
- was actually a b2g code issue: Bug 1016157 - updated the version of vold
- resurrecting tegras to deal with load
- new nightly for Tarako
- Thursday
- AWS slave loan for ianconnoly
- puppet patch for talos-linux64-ix-001 reclaim
- resurrecting tegras
- Friday
- tegras continue to fall behind; pinged Pete very late Thursday with symptoms. Filed https://bugzil.la/1018118
- reconfig
- chiefly to deploy https://bugzil.la/1017599 <- reduce # of tests on tegras
- fallout:
- non-unified mozharness builds are failing in post_upload.py <- causing queue issues on masters
- panda tests are retrying more than before
- hitting "Caught Exception: Remote Device Error: unable to connect to panda-0402 after 5 attempts", but it *should be non-fatal, i.e. test runs fine afterwards but still gets flagged for retry
- filed: https://bugzil.la/1018531
- reported by sheriffs (RyanVM)
- timeouts on OS X 10.8 tests - "Timed out while waiting for server startup."
- similar timeouts on android 2.3
2014-05-05 to 2014-05-09
- Monday
- new tarako nightly for nhirata
- reconfig
- Tuesday
- began rolling master restarts for https://bugzil.la/1005133
- Wednesday
- finished rolling master restarts for https://bugzil.la/1005133
- reconfig for fubar
- dealt with bugs with no dependencies from the buildduty report (bdr)
- Thursday
- loan to :jib for https://bugzil.la/1007194: talos-linux64-ix-004
- loan to jmaher, failed. Filed https://bugzil.la/1007967
- reconfig for bhearsum/jhford
- Friday
- tree closure(s) due to buildbot db slowdown
- was catlee's fault
- follow-up on https://bugzil.la/1007967: slave loaned to jmaher
- bugs with no dependencies (bdr)
- deployed tmp file removal fix for bld-lion in https://bugzil.la/880003
- tree closure(s) due to buildbot db slowdown
2014-04-21 to 2014-04-25
follow up:
buildduty report: Bug 999930 - put tegras that were on loan back onto a foopy and into production
action items:
- Bug 1001518 - bld-centos6-hp-* slaves are running out of disk space
- this pool had 4 machines run out of disk space all within the last week
- I scrubbed a ton of space (bandaid) but the core issue will need to be addressed
load (high - keep an eye on): Bug 999558 - high pending for ubuntu64-vm try test jobs on Apr 22 morning PT; also keep an eye on https://bugzilla.mozilla.org/show_bug.cgi?id=997702
- git.m.o high load, out of RAM, see recent emails from hwine with subj 'git.m.o'
- graph to watch: https://graphite-scl3.mozilla.org/render/?width=586&height=308&_salt=1397724372.602&yAxisSide=right&title=git%201%20mem%20used%20&%20load&from=-8hours&target=secondYAxis(hosts.git1_dmz_scl3_mozilla_com.load.load.shortterm)&target=hosts.git1_dmz_scl3_mozilla_com.memory.memory.used.value&target=hosts.git1_dmz_scl3_mozilla_com.swap.swap.used.value
(jlund) reboot all these xp stuck slaves- https://bugzil.la/977341 - XP machines out of action
- https://secure.pub.build.mozilla.org/builddata/reports/slave_health/slavetype.html?class=test&type=t-xp32-ix THERE ARE ONLY 4 MACHINES TODAY THAT ARE "BROKEN" AND ONLY HUNG TODAY.
pmoore: there is only 1 now
- (jlund) iterate through old disabled slaves in these platform lists - https://bugzil.la/984915 - Improve slave health for disabled slaves <- THIS WAS NOT DONE. I ASKED PMOORE TO HELP
- pmoore: i'm not entirely sure which platform lists this means, as the bug doesn't contain a list of platforms. So I am looking at all light blue numbers on https://secure.pub.build.mozilla.org/builddata/reports/slave_health/index.html (i.e. the disabled totals per platform). When jlund is online later i'll clarify with him.
- jlund: thanks pete. Sorry I meant all platform lists I suppose starting with whatever platform held our worst wait times. I have started going through the disabled hosts looking for 'forgotten ones'
2014-04-10 to 2014-04-11 (Thursday and Friday)
https://bugzilla.mozilla.org/show_bug.cgi?id=995060 - Nasty tree closure lasting several hours: b-c taking a very long time and log files too large for buildbot to handle. Timeout for MOCHITEST_BC_3 increased from 4200s to 12000s. When Joel's patch (https://bugzilla.mozilla.org/show_bug.cgi?id=984930) has landed, we should "undo" the changes from bug 995060 and put the timeout back down to 4200s (it was just a temporary workaround). Align with edmorley on this.
https://bugzilla.mozilla.org/show_bug.cgi?id=975006 / https://bugzilla.mozilla.org/show_bug.cgi?id=938872 - after much investigation, it turns out a monitor is attached to this slave - can you raise a bug with DC ops to get it removed?
Loaners returned:
https://bugzil.la/978054
https://bugzil.la/990722
https://bugzil.la/977711 (discovered this older one)
Loaners created: https://bugzilla.mozilla.org/show_bug.cgi?id=994283
https://bugzilla.mozilla.org/show_bug.cgi?id=994321#c7
Still problems with 7 slaves that can't be rebooted:
https://secure.pub.build.mozilla.org/builddata/reports/slave_health/slave.html?class=test&type=talos-mtnlion-r5&name=talos-mtnlion-r5-005
https://secure.pub.build.mozilla.org/builddata/reports/slave_health/slave.html?class=test&type=talos-mtnlion-r5&name=talos-mtnlion-r5-006
https://secure.pub.build.mozilla.org/builddata/reports/slave_health/slave.html?class=test&type=talos-mtnlion-r5&name=talos-mtnlion-r5-061
https://secure.pub.build.mozilla.org/builddata/reports/slave_health/slave.html?class=test&type=talos-mtnlion-r5&name=talos-mtnlion-r5-065
https://secure.pub.build.mozilla.org/builddata/reports/slave_health/slave.html?class=test&type=talos-mtnlion-r5&name=talos-mtnlion-r5-074
https://secure.pub.build.mozilla.org/builddata/reports/slave_health/slave.html?class=test&type=talos-mtnlion-r5&name=talos-mtnlion-r5-086
https://secure.pub.build.mozilla.org/builddata/reports/slave_health/slave.html?class=test&type=talos-mtnlion-r5&name=talos-mtnlion-r5-089
Slave API has open bugs on all of these.
https://bugzil.la/977341
Stuck Win XP slaves - only one was stuck (t-xp32-ix-073). Rebooted.
Thanks Callek!
Week of 2014-04-05 to 2014-04-09 (thurs-wed)
Hiya pete
- van has been working hard at troubleshooting winxp 085: https://bugzilla.mozilla.org/show_bug.cgi?id=975006#c21
- this needs to be put back into production along with 002, and findings reported back to van
- note this is a known failing machine. please try to catch it failing before the sheriffs do.
- loans that can be returned that I have not got to:
- we should reconfig either thurs or by fri at latest
- latest aws sanity check runthrough yielded better results than before. Very few long-running lazy instances. Very few unattended loans. This should be checked again on Friday
- there was a try push that broke a series of mtnlion machines this afternoon. Callek, nthomas, and Van worked hard at helping me diagnose and solve the issue.
- there are some slaves that failed to reboot via slaveapi. This is worth following up on especially since we barely have any 10.8 machines to begin with:
- https://bugzilla.mozilla.org/show_bug.cgi?id=994321#c7
- on tues we started having github/vcs-sync issues where sheriffs noticed that bumper bot wasn't keeping up with csets on github.
- looks like things have been worked on and possibly fixed but just a heads up
- https://bugzilla.mozilla.org/show_bug.cgi?id=993632
- I never got around to doing this:
- iterate through old disabled slaves in these platform lists - https://bugzil.la/984915 - Improve slave health for disabled slaves. This was discussed in our buildduty mtg. could you please look at it: https://etherpad.mozilla.org/releng-buildduty-meeting
- (jlund) reboot all these xp stuck slaves- https://bugzil.la/977341 - XP machines out of action
- https://secure.pub.build.mozilla.org/builddata/reports/slave_health/slavetype.html?class=test&type=t-xp32-ix
- broken machines rebooted. Looking at the list now, there are 4 machines that are 'broken' and stopped taking jobs today.
- as per https://bugzilla.mozilla.org/show_bug.cgi?id=991259#c1, I checked on these and the non-green ones should be followed up on
tegra-063.tegra.releng.scl3.mozilla.com is alive <- up and green
tegra-050.tegra.releng.scl3.mozilla.com is alive <- up and green
tegra-028.tegra.releng.scl3.mozilla.com is alive <- up but not running jobs
tegra-141.tegra.releng.scl3.mozilla.com is alive <- up and green
tegra-117.tegra.releng.scl3.mozilla.com is alive <- up but not running jobs
tegra-187.tegra.releng.scl3.mozilla.com is alive <- up and green
tegra-087.tegra.releng.scl3.mozilla.com is alive <- up and green
tegra-299.tegra.releng.scl3.mozilla.com is alive <- up and green
tegra-309.tegra.releng.scl3.mozilla.com is alive <- up and green
tegra-335.tegra.releng.scl3.mozilla.com is alive <- up and green
as per jhopkins last week and the reformatting: again, the non-green ones should be followed up on.
tegra-108 - bug 838425 - SD card reformat was successful <- can't write to sd
tegra-091 - bug 778886 - SD card reformat was successful <- sdcard issues again
tegra-073 - bug 771560 - SD card reformat was successful <- lockfile issues
tegra-210 - bug 890337 - SD card reformat was successful <- green in prod
tegra-129 - bug 838438 - SD card reformat was successful <- fail to connect to telnet
tegra-041 - bug 778813 - SD card reformat was successful <- sdcard issues again
tegra-035 - bug 772189 - SD card reformat was successful <- sdcard issues again
tegra-228 - bug 740440 - SD card reformat was successful <- fail to connect to telnet
tegra-133 - bug 778923 - SD card reformat was successful <- green in prod
tegra-223 - bug 740438 - SD card reformat was successful <- Unable to properly cleanup foopy processes
tegra-080 - bug 740426 - SD card reformat was successful <- green in prod
tegra-032 - bug 778899 - SD card reformat was successful <- sdcard issues again
tegra-047 - bug 778909 - SD card reformat was successful
(have not got past here)
tegra-038 - bug 873677 - SD card reformat was successful
tegra-264 - bug 778841 - SD card reformat was successful
tegra-092 - bug 750835 - SD card reformat was successful
tegra-293 - bug 819669 - SD card reformat was successful
Week of 2014-03-31 to 2014-04-04
Wednesday:
Someone will need to follow up on how these tegras did since I reformatted their SD cards:
tegra-108 - bug 838425 - SD card reformat was successful
tegra-091 - bug 778886 - SD card reformat was successful
tegra-073 - bug 771560 - SD card reformat was successful
tegra-210 - bug 890337 - SD card reformat was successful
tegra-129 - bug 838438 - SD card reformat was successful
tegra-041 - bug 778813 - SD card reformat was successful
tegra-035 - bug 772189 - SD card reformat was successful
tegra-228 - bug 740440 - SD card reformat was successful
tegra-133 - bug 778923 - SD card reformat was successful
tegra-223 - bug 740438 - SD card reformat was successful
tegra-080 - bug 740426 - SD card reformat was successful
tegra-032 - bug 778899 - SD card reformat was successful
tegra-047 - bug 778909 - SD card reformat was successful
tegra-038 - bug 873677 - SD card reformat was successful
tegra-264 - bug 778841 - SD card reformat was successful
tegra-092 - bug 750835 - SD card reformat was successful
tegra-293 - bug 819669 - SD card reformat was successful
Week of 2014-03-17 to 2014-03-21
buildduty: armenzg
Monday
- bugmail and deal with broken slaves
- mergeday
Tuesday
- reviewed aws sanity check
- cleaned up and assigned some buildduty bugs
- reconfig
TODO:
- bug 984944
- swipe through problem tracking bugs
Wednesday
Thursday
Friday
Week of 2014-01-20 to 2014-01-24 buildduty: armenzg
Monday
- deal with space warnings
- loan to dminor
- terminated returned loan machines
Tuesday
- loan win64 builder
- Callek helped with the tegras
TODO
- add more EC2 machines
Week of 2014-01-20 to 2014-01-24 buildduty: jhopkins
Bugs filed:
Bug 962269 (dupe) - DownloadFile step does not retry status 503 (server too busy)
Bug 962698 - Expose aws sanity report data via web interface in json format
Bug 963267 - aws_watch_pending.py should avoid region/instance combinations that lack capacity
Monday
Nightly updates are disabled (bug 908134 comment 51)
loan bug 961765
Tuesday
added new buildduty task: https://wiki.mozilla.org/ReleaseEngineering:Buildduty#Semi-Daily
Bug 934938 - Intermittent ftp.m.o "ERROR 503: Server Too Busy"
Wednesday
added AWS instance Tag "moz-used-by" to the nat-gateway instance to help with processing the long-running instances report
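For reference, a sketch of tagging an instance like that with boto; the instance id and region are placeholders:

  import boto.ec2

  # connect to the region the nat-gateway instance lives in (placeholder region)
  conn = boto.ec2.connect_to_region("us-east-1")

  # tag it so the long-running-instances report can attribute it to an owner
  conn.create_tags(["i-0123456789abcdef0"], {"moz-used-by": "releng/nat-gateway"})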
would be nice if we could get the aws sanity report data to be produced by slave api so it could be pulled by a web page and correlated with recent job history, for example
Bug 934938 - jakem switched to round-robin DNS (see https://bugzilla.mozilla.org/show_bug.cgi?id=934938#c1519 for technical details) to avoid "thundering herd" problem.
Thursday
AWS lacking capacity and slowing down instance startup. Filed 963267.
Friday
missed some loan requests b/c I thought they were being included in the buildduty report (2 previous ones seemed to be). Can we add loans to the buildduty report?
some automated slave recovery not happening due to Bug 963171 - please allow buildbot-master65 to talk to production slaveapi
Week of 2014-01-13 to 2014-01-17 buildduty: bhearsum
Bugs filed (not a complete list):
- Bug 960535 - Increase bouncerlatestchecks Nagios script timeout
Week of 2014-01-06 to 2014-01-10 buildduty: armenzg
Bugs filed:
- https://bugzil.la/956788 - Allow slaveapi to clobber the basedir to fix machines
- https://bugzil.la/957630 - Invalid tokens
- https://bugzil.la/930897 - mochitest-browser-chrome timeouts
Monday
- loan machines
- deal with some broken slaves
Tuesday
- loan machines
- deal with some broken slaves
- reconfig
- second reconfig for backout
Wednesday
- enable VNC for a Mac loaner
- check signing issues filed by Tomcat
- mozharness merge
- help RyanVM with some timeout
Thursday
- do reconfig with jlund
Friday
- restart redis
- loan 2 machines
- process problem tracking bugs
Week of 2013-12-16 to 2013-12-20 buildduty: jhopkins
Bugs filed:
- Bug 950746 - Log aws_watch_pending.py operations to a machine-parseable log or database
- Bug 950780 - Start AWS instances in parallel
- Bug 950789 - MozpoolException should be retried
- Bug 952129 - download_props step can hang indefinitely
- Bug 952517 - Run l10n repacks on a smaller EC2 instance type
Monday
- several talos-r3-fed* machines have a date of 2001
- adding hover-events to our slave health pages would be helpful to get quick access to recent job history
- other interesting possibilities:
- a page showing last 50-100 jobs for all slaves in a class
- ability to filter on a certain builder to spot patterns/anomalies. eg. "robocop tests always fail on this slave but not the other slaves"
Wednesday
- taking over 15 minutes for some changes to show as 'pending' in tbpl. Delay from scheduler master seeing the change in twistd.log
- could be Bug 948426 - Random failed transactions for http://hg.mozilla.org/
- Bug 951558 - buildapi-web2 RabbitMQ queue is high
Friday
- Bug 952448 - Integration Trees closed, high number of pending linux compile jobs
- AWS instance 'start' requests returning "Error starting instances - insufficient capacity"
- dustin has fixed Bug 951558 - buildapi-web2 RabbitMQ queue is high
Week of 2013-11-04 to 2013-11-08 buildduty: jhopkins
Tuesday
- rev2 migrations going fairly smoothly (biggest issue is some IPMI interfaces being down and requiring a power cycle by DCOPs)
Wednesday
- needs attention (per RyanVM): Bug 935246 - Graphserver doesn't know how to handle the talos results from non-PGO builds on the B2G release branches
- IT's monitoring rollout happening today
- request to build "B2G device image nightlies" non-obvious what the builders are or what masters they live on. No howto I could find. How do we automate this and keep it
Friday
- RyanVM reports that pushing mozilla-beta to Try is a fairly normal thing to do but fails on the win64-rev2 build slaves. He has been helping with backporting the fixes in https://wiki.mozilla.org/User:Jhopkins/win64rev2Uplift to get this addressed.
- catlee's buildbot checkconfig improvement went into production but we need a restart on all the masters to get the full benefit. no urgency, however.
Week of 2013-10-14 to 2013-10-18 buildduty: armenzg
Monday
- uploaded mozprocess
- landed a puppet change that made all Linux64 hosts get libvirt-bin installed and made them fail to sync with puppet
- I had to back out and land a patch to uninstall the package
- we don't know why it got installed
- redis issues
- build-4hr issues
- signing issues
Tuesday
- returned some slaves to the pool
- investigated some cronmail
- uploaded talos.zip
- reclaimed machines and requested reimages
Wednesday
- put machines back into production
- loan
- process delays email
Week of 2013-10-14 to 2013-10-18
buildduty: coop (callek on monday)
Monday
- Went through https://secure.pub.build.mozilla.org/builddata/reports/slave_health/buildduty_report.html to reduce bug lists
- Bunch of win build machines requested to be reimaged as rev2 machines
- Will need buildbot-configs changes for the slave list changes before they can be re-enabled.
- Two loaners
- one w64 rev2 with open question on if we need to manually remove secrets ourselves
Tuesday
- meetings! At least one of them was about buildduty.
Wednesday
- shutdown long-running AWS instance for hverschore: bug 910368
- investigating disabled mtnlion slaves
- many needed the next step taken: filing the IT bug for recovery
Thursday
- filed Bug 927951 - Request for smoketesting Windows builds from the cedar branch
- filed Bug 927941 - Disable IRC alerts for issues with individual slaves
- reconfig for new Windows in-house build master
Friday
Week of 2013-09-23 to 2013-09-27
buildduty: armenzg
Monday
- help marcia debug some b2g nightly questions
- meetings and meetings and distractions
- started patches to transfer 11 win64 hosts to become try ones
Tuesday
- run a reconfig
- did a backout for buildbotcustom and run another reconfig
- started work on moving win64 hosts from build pool to the try pool
- analyzed a bug filed by a sheriff wrt clobberer
- no need for buildduty to fix (moved to Platform Support)
- asked people's input for proper fix
- reviewed some patches for jmaher and kmoir
- assist edmorley with clobberer issue
- assist edmorley with git.m.o issue
- assist RyanVM with git.m.o issue
- put w64-ix-slave64 in the production pool
- updated buildduty wiki page
- updated wiki page to move machines from one pool to another
Wednesday
- messy
Thursday
- messy
Friday
- messy
Week of 2013-09-16 to 2013-09-20
buildduty: Callek
Monday:
- (pmoore) batch 1 of watcher update [Bug 914302]
- MERGE DAY
- We missed having a point person for merge day again, rectified (thanks armen/rail/aki)
- 3 reconfigs or so
- Merge day
- Attempted to fix talos-mozharness (broken by panda-mozharness landing)
- Backed out talos-mozharness change for continued bustage
- Also backed out emulator-ics for in-tree (cross-tree) bustage relating to the name change.
- Thanks to aki for helping while I had to slip out for a few minutes
- Loaner bug poking/assigning
- Did one high priority loaner needed for tree-closure which blocked MERGE DAY
Tuesday:
- (pmoore) batch 2 of watcher update [Bug 914302]
- Buildduty Meeting
- Bug queue churn-through
- Hgweb OOM: Bug 917668
Wednesday:
- (pmoore) batch 3 of watcher update [Bug 914302]
- Reconfig
- Hgweb OOM continues (IT downtimed it, bkero is on PTO today, no easy answer)
- Very Low visible tree impact at present
- Bug queue churn-through
- Discovered last-job-per-slave view of slave_health is out of date.
- Discovered reboot-history is either out of date or reboots not running for tegras
Thursday
- (pmoore) batch 4 [final] of watcher update [Bug 914302]
- Hgweb OOM continues
- Bug queue churn... focus on tegras today
- Coop fixed last-job-per-slave generation, and slaves_needing_reboots
- Downtime (and problems) for scl1 nameserver and scl3 zeus nodes
- Caused tree closure due to buildapi01 being in scl1 and long delay
Week of 2013-09-09 to 2013-09-13
buildduty: coop
Monday:
- meetings
- kittenherder hanging on bld-centos - why?
- multiple processes running, killed off (not sure if that's root cause)
Tuesday:
- buildduty meeting
- filed https://bugzilla.mozilla.org/show_bug.cgi?id=913606 to stop running cronjobs to populate mobile-dashboard
- wrote reboot_tegras.py quick-n-dirty script to kill buildbot processes and reboot tegras listed as hung
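A sketch of what such a quick-n-dirty script could look like, assuming SSH access to the foopies; the tegra-to-foopy mapping and the reboot step are placeholders, not the actual reboot_tegras.py:

  import subprocess

  # hypothetical mapping of hung tegras to the foopy that manages them
  HUNG = {"tegra-028": "foopy109", "tegra-117": "foopy110"}

  def ssh(host, cmd):
      # check=False: pkill exits non-zero when nothing matched
      return subprocess.run(["ssh", host, cmd], check=False)

  def reboot_tegra(tegra):
      print("TODO: power-cycle", tegra)  # the real script would go through the PDU/relay/SUT tooling

  for tegra, foopy in HUNG.items():
      # kill the buildbot/clientproxy processes tied to this tegra's builddir
      ssh(foopy, "pkill -f %s" % tegra)
      reboot_tegra(tegra)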
Wednesday:
- re-enabled kittenherder rebooting of tegras
- wrote bugzilla shared queries for releng-buildduty, releng-buildduty-tagged, and releng-buildduty-triage
- playing with bztools/bzrest to try to get query that considers dependent bugs
- meetings
- deploying registry change for bug 897768 (fuzzing dumps)
Thursday:
- broke up kittenherder rebooting of tegras into 4 batches to improve turnaround time
- got basic buildduty query working with bztools
- respun Android nightly
- resurrecting as many Mac testers as possible to deal with load
- filed bug 915766 to audit pdu2.r102-1.build.scl1
- resurrected a bunch of talos-r3-fed machines that were not running buildbot
Friday:
- AWS US-East-1 outage
- reconfig for nexus4 changes
- re-reconfig to backout kmoir's changes that closed the tree: https://bugzilla.mozilla.org/show_bug.cgi?id=829211
- Mac tester capacity
Week of 2013-09-02 to 2013-09-06 buildduty: bhearsum
Monday
US/Canada holiday
Tuesday
Bug 912225 - Intermittent B2G emulator image "command timed out: 14400 seconds elapsed, attempting to kill" or "command timed out: 3600 seconds without output, attempting to kill" during the upload step
Worked around by lowering the priority of use1. The acute issue is fixed; we suspect there are still symptoms from time to time.
Windows disconnects may be the early warning sign.
Week of 2013-08-26 to 2013-08-30
buildduty: jhopkins
Monday
many talos-r3-w7 slaves have a broken session which prevents a new SSH login session (you can authenticate but it kicks you out right away). Needed to RDP in, open a terminal as Administrator, and delete the files in c:\program files\kts\log\ip-ban\* and active-sessions\*.
many other talos-r3-w7 slaves had just ip-ban\* files (no active-sessions\* files) which prevented kittenherder from managing the slave, since there are no IPMI or PDU mechanisms to manage these build slaves.
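A small sketch of that cleanup, run from an Administrator terminal on the affected slave; the paths are the ones noted above:

  import glob
  import os

  KTS_LOG = r"C:\Program Files\kts\log"

  # clear the KTS ban/session state that blocks new SSH logins
  for pattern in (r"ip-ban\*", r"active-sessions\*"):
      for path in glob.glob(os.path.join(KTS_LOG, pattern)):
          if os.path.isfile(path):
              os.remove(path)
              print("removed", path)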
trying slaveapi
IPMI reboot failing (no netflow)
$curl -dwaittime=60 http://cruncher.srv.releng.scl3.mozilla.com:8000/slave/talos-r3-xp-076/action/reboot
{
"requestid": 46889168, "state": 0, "text": ""
}
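The same request driven from Python; the endpoint mirrors the curl above, and the comment on state 0 is an observation from the response shown here, not documented slaveapi semantics:

  import json
  import urllib.parse
  import urllib.request

  SLAVEAPI = "http://cruncher.srv.releng.scl3.mozilla.com:8000"

  def request_reboot(slave, waittime=60):
      data = urllib.parse.urlencode({"waittime": waittime}).encode()
      url = "%s/slave/%s/action/reboot" % (SLAVEAPI, slave)
      with urllib.request.urlopen(url, data=data, timeout=30) as resp:
          return json.load(resp)

  result = request_reboot("talos-r3-xp-076")
  # state 0 here appears to mean the action was accepted but not yet complete
  print("requestid", result["requestid"], "state", result["state"])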
slavealloc-managed devices are live (per Callek)
ec2 slave loan to eflores (909186)
Tuesday
osx 10.7 loan to jmaher (909510)
https://wiki.mozilla.org/ReleaseEngineering/Managing_Buildbot_with_Fabric takes a while to set up. Also, "suggestions" seems like the de facto way to do a reconfig.
Wednesday
many tegras down for weeks. documentation needs improvement
when to do a step
what to do if the step fails
linux64 slave loan to h4writer (bug 909986)
created https://intranet.mozilla.org/RelEngWiki/index.php/How_To/Disable_Updates (bug 910378)
Thursday
filed Bug 910818 - Please investigate cause of network disconnects 2013-08-29 10:22-10:24 Pacific
need to automate gathering of details for filing this type of bug
Bug 910662 - B2G Leo device image builds broken with "error: patch failed: include/hardware/hwcomposer.h:169" during application of B2G patches to android source
Friday
Fennec nightly updates disabled again due to startup crasher. Bug 911206
These tegras need attention (rebooting hasn't helped):
https://releng.etherpad.mozilla.org/191
Week of 2013-08-19 to 2013-08-23 buildduty: armenzg
Monday
40+ Windows builders had not been rebooted for several days
https://bugzilla.mozilla.org/show_bug.cgi?id=906660
rebooted a bunch with csshX
I gave a couple to jhopkins to look into
cruncher was banned from being able to ssh
ipmi said that it was successful
more investigation happening
edmorley requested that I look into fixing the TMPDIR removal issue
https://bugzilla.mozilla.org/show_bug.cgi?id=880003
ted to land a fix to disable the test that causes it
filed a bug for IT to clean up the TMPDIR through puppet
cleaned up tmpdir manually on 2 hosts and put them back in production to check
do a reconfig for mihneadb
rebooted talos-r4-lion-041 upon philor's request due to hdiutil
promote unagi
upload talos.zip for mobile
updated docs for talos-bundles https://wiki.mozilla.org/ReleaseEngineering:Buildduty:Other_Duties#How_to_update_the_talos_zips
Tuesday
I see a bunch of these:
nagios-releng: Tue 06:04:58 PDT [4193] buildbot-master92.srv.releng.use1.mozilla.com:ntp time is CRITICAL: CHECK_NRPE: Socket timeout after 15 seconds. (http://m.allizom.org/ntp+time)
https://bugzilla.mozilla.org/show_bug.cgi?id=907158
We disabled all use1 masters
We reduced use1's priority for host generation
Fixed Queue dirs for one of the masters by restarting the init service
had to kill the previous pid
rebooted remaining win64 machines
rebooted remaining bld-lion-r5 machines
buildapi issues
Callek took care of it and moved it to Tools
Queue dir issues
it seems that the Amazon issues caused this
I have been stopping and starting the pulse_publisher
/etc/init.d/pulse_publisher {stop|start}
the /dev/shm/queue/pulse/new will start decreasing
Re-enabled all aws-us-east-1 masters
Wednesday
one of the b2g repos has a 404 bundle and intermittent ISE 500
https://bugzilla.mozilla.org/show_bug.cgi?id=907693
bhearsum has moved it to IT
fubar is looking into it
done a reconfig
graceful restart for buildbot-master69
I had to do a less graceful restart
graceful restart for buildbot-master70
graceful restart for buildbot-master71
loan t-xp32-ix006 in https://bugzil.la/904219
promoted b2g build
deploy graphs change for jmaher
vcs major alert
hwine to look into it
Thursday
disable production-opsi
reconfig
trying to cleanup nagios
Friday
a bunch of win64 machines are not taking jobs
deployed the fix to all machines
filed bug for IT to add to task sequence
some win64 imaging bugs filed
reconfig for hwine
Callek deployed his slavealloc change
merged mozharness
investigated a bunch of hosts that were down
we might be having some DNS tree-closing issues
https://bugzilla.mozilla.org/show_bug.cgi?id=907981#c9
It got cleared within an hour or so
Week of 2013-08-12 to 2013-08-16
buildduty: coop
Monday:
ran kittenherder against w64-ix pool
56 of 87 build slaves were hung
suspect shutdown event tracker dialog
deployed shutdown event tracker fix to all w64-ix slaves
https://bugzilla.mozilla.org/show_bug.cgi?id=893888#c4
cleaned up https://wiki.mozilla.org/ReferencePlatforms
https://wiki.mozilla.org/ReferencePlatforms/Test/Lion
added https://wiki.mozilla.org/ReferencePlatforms/Win64#Disable_shutdown_event_tracker
tried to promote unagi build, hit problem caused by extra symlinks (aki) last week
hwine debugged, updated docs: https://intranet.mozilla.org/RelEngWiki/index.php?title=How_To/Perform_b2g_dogfood_tasks#Trouble_shooting
re-imaged (netboot) talos-r4-lion-0[30-60]: https://bugzilla.mozilla.org/show_bug.cgi?id=891880
Tuesday:
more cleanup in https://wiki.mozilla.org/ReferencePlatforms
https://wiki.mozilla.org/ReferencePlatforms/Test/MountainLion
moved lots of platforms to "Historical/Other"
re-imaged (netboot) talos-r4-lion-0[61-90]: https://bugzilla.mozilla.org/show_bug.cgi?id=891880
investigated lion slaves linked from https://bugzilla.mozilla.org/show_bug.cgi?id=903462
all back in service
buildduty meeting @ 1:30pm EDT
reconfig for Standard8: https://bugzilla.mozilla.org/show_bug.cgi?id=900549
Wednesday:
fixed buildfaster query to cover helix builds
set aside talos-r4-lion-001 for dustin: https://bugzilla.mozilla.org/show_bug.cgi?id=902903#c16
promoted unagi build for dogfood: 20130812041203
closed https://bugzilla.mozilla.org/show_bug.cgi?id=891880
https://wiki.mozilla.org/ReferencePlatforms updated to remove need for extra reboot for Mac platforms
investigated talos failures affecting (primarily) lion slaves: https://bugzilla.mozilla.org/show_bug.cgi?id=739089
Thursday:
fixed up wait times report
merged two Snow Leopard categories
added jetpack to Win8 match
added helix nightlies to buildfaster report
https://wiki.mozilla.org/ReleaseEngineering/Buildduty#Meeting_Notes
help with WinXP DST issues: https://bugzilla.mozilla.org/show_bug.cgi?id=878391#c32
compiling https://github.com/vvuk/winrm to test deploy on w64-ix-slave03
https://bugzilla.mozilla.org/show_bug.cgi?id=727551
Friday:
investigation into https://bugzilla.mozilla.org/show_bug.cgi?id=905350
basedir wrong in slavealloc
reconfigs for catlee, aki
Issues/Questions:
- many/most of the tree closure reasons on treestatus don't have bug#'s. should we encourage sheriffs to enter bug#'s so others can follow along more easily?
- uncertainty around the difference between mozpool-managed and non-mozpool-managed pandas. How do I take one offline - do they both use disabled.flg? A: yes, all pandas use disabled.flg on the foopy (a sketch of doing this follows after this list)
- when is it ok to use Lifeguard to force the state to "disabled"? per dustin: [it's] for testing, and working around issues like pandas that are still managed by old releng stuff. it's to save us loading up mysql and writing UPDATE queries
- what's "old releng stuff"?
- Windows test slaves
- another example of PDU reboot not working correctly: https://bugzilla.mozilla.org/show_bug.cgi?id=737408#c5 and https://bugzilla.mozilla.org/show_bug.cgi?id=885969#c5. We need to automate power off, pause, power on to increase reliability.
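As referenced above, a sketch of taking a panda out of rotation with disabled.flg; the /builds/<panda> path on the foopy is an assumption:

  import subprocess

  def disable_panda(foopy, panda, reason):
      # drop a disabled.flg in the panda's builddir on its foopy (path assumed)
      cmd = "echo '%s' > /builds/%s/disabled.flg" % (reason, panda)
      subprocess.run(["ssh", foopy, cmd], check=True)

  disable_panda("foopy109", "panda-0425", "bug 1079778 - disabled, not taking jobs")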
Bustages:
- bug 900273 landed and backed out