CIDuty/SVMeetings/Aug10-Aug14


2015-08-10

    • [alin]

1. at the moment, we still do not have permissions to grant VPN access

https://bugzilla.mozilla.org/show_bug.cgi?id=1192253 - Kim asked jabba for help with this bug, as arr recommended. Jabba says that this access is now in place - can you verify?

2. https://bugzilla.mozilla.org/show_bug.cgi?id=989237

    https://bugzilla.mozilla.org/show_bug.cgi?id=1188409 states that the machine was decommissioned
Q: can we mark the bug as RESOLVED?
We should remove the machine from slavealloc etc. if this hasn't already been done.
It still exists in slavealloc; will find you a doc on how to decomm it in our configs:
https://wiki.mozilla.org/ReleaseEngineering/Buildduty/Slave_Management#How_to_decommission_a_slave
https://bugzilla.mozilla.org/show_bug.cgi?id=1193304

3. https://bugzilla.mozilla.org/show_bug.cgi?id=1189003

   we finally managed to re-image the machine, disabled the runner and then connected to it via SSH and VNC.
   in the past, this machine had problems updating talos (https://bugzil.la/1141416).

Q: should we enable it in slavealloc? Did you disable runner after you re-imaged it?

4. https://bugzilla.mozilla.org/show_bug.cgi?id=1192525

   slave loan request for a t-w732-ix machine
   Armen states in his comment that he needs to find the path to git.exe; however, git is only installed on Windows build machines.

Q: should we ask him if he wants to loan a build machine?

5. https://bugzilla.mozilla.org/show_bug.cgi?id=1191967

   issue related to "panda-0345.p3.releng.scl3.mozilla.com" machine
   this machine is disabled
   Phil stated that it is "Close enough to nothing-but-RETRY"

Q: next steps in this case?

  • open DCops bug to decomm

6. https://bugzilla.mozilla.org/show_bug.cgi?id=947202

   (info) re-imaged and enabled this machine in slavealloc: https://secure.pub.build.mozilla.org/builddata/reports/slave_health/slave.html?name=bld-lion-r5-086

2015-08-11

[kmoir] Introduction to diff and patch. Here is an introduction: https://en.wikipedia.org/wiki/Patch_%28Unix%29. diff is a tool that compares files line by line; you can redirect the output of a diff command to a text file, and that text file can then be applied to another person's copy of the file using patch.
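A minimal round trip with the two tools, using two hypothetical copies of a file (the file names are made up for illustration):

# record the differences between the two copies in unified-diff format
diff -u original.txt modified.txt > my-change.patch
# apply those recorded changes to another copy of original.txt
patch original.txt < my-change.patch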

I think you run Windows on your machines, so I'm not sure what command-line tools you have available. I recall from before that you cloned buildbot-configs. To update your local copy, change directories to where you cloned hg.mozilla.org/build/buildbot-configs and run hg pull -u. As an aside, here is an hg tutorial: http://swcarpentry.github.io/hg-novice/
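The update sequence spelled out (the clone location is an assumption; use wherever you cloned the repo):

# pull new changesets and update the working copy in one step
cd ~/buildbot-configs
hg pull -u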

[alin] 1. Granting VPN access now works fine. Thanks for looking into this!

2. https://bugzilla.mozilla.org/show_bug.cgi?id=1193193

   as discussed yesterday, we submitted a bug to DCOps for decommissioning the machine (panda-0345)
   also marked it as "decomm" in slavealloc


3. https://bugzilla.mozilla.org/show_bug.cgi?id=1193188

   loan request for a t-snow-r4 machine
   checked the Slave Health dashboard and noticed that all of them are working and taking jobs, except one that is disabled because it needs further diagnostics (t-snow-r4-0094)

Q: what is the approach in these cases?

  • Pick a slave in slavealloc and mark it disabled. Wait for the jobs on it to finish and then start the slave loan process.

batch downtime

[vlad] 1. we received a lot of nagios alerts on the #buildduty channel with the following message: "[4839] buildbot-master111.bb.releng.scl3.mozilla.com:buildbot masters age is WARNING: ELAPSED WARNING: 1 warn out of 1 process with command name buildbot (http://m.mozilla.org/buildbot+masters+age)"

Q: Can you explain to us what the alerts mean and what the next steps are? https://bugzilla.mozilla.org/show_bug.cgi?id=1056348

This is just an alert that the buildbot process has been up for over a month. Probably should just downtime it, although all of the masters will soon have this alert. Amy changed the alert to fire every 720 minutes, so it should be less noisy in #buildduty now.

[vlad] I thought there was a problem with buildbot and that somebody needed to look over it
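A quick way to check the condition the alert is measuring, i.e. how long the buildbot process has been running on a master (a sketch; the bracket trick just keeps grep from matching itself):

# show the elapsed run time ([[dd-]hh:]mm:ss) of each buildbot process
ps -eo pid,etime,comm | grep '[b]uildbot'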

2. The re-image problem for bld-lion-r5 has been fixed and the re-image has completed. Updated the bug and marked it resolved: https://bugzilla.mozilla.org/show_bug.cgi?id=1189005


2015-08-12

[vlad] 1. Loaned the win7 slave to bhearsum: https://bugzilla.mozilla.org/show_bug.cgi?id=1193310

2. https://bugzilla.mozilla.org/show_bug.cgi?id=1026516

   RyanVM re-opened the bug, specifying that diagnostics need to be run
   Q: Do we need to open a ticket to DCops to run the diagnostics again on the slave? Can we clone this bug: https://bugzilla.mozilla.org/show_bug.cgi?id=1162121 ?

Update: created bug 1193734 for DCops to run the diagnostics again

[alin] 1. https://bugzilla.mozilla.org/show_bug.cgi?id=1193412

   slave has not taken any jobs since August 10
   looked over the logs from the buildbot master and noticed that the slave was detached and the connection was never re-established
   disabled the slave in slavealloc, restarted it and re-enabled it
   after that we checked the logs from both master and slave and noticed that the connection was up again (more info on the bug)
   waiting to see if it starts taking jobs again... 

UPDATE: it started taking jobs --> marked the bug as RESOLVED

2. https://bugzilla.mozilla.org/show_bug.cgi?id=1193413

   pretty much the same issue as above, so we followed the same steps
   after we re-enabled the slave in slavealloc, we noticed that it connected to another master
   at the moment we are waiting to see if it takes any jobs. 

UPDATE: t-w732-ix-055 appears as "broken" at the moment; we will need to see why.
UPDATE2: t-w732-ix-055 is taking jobs at the moment :)
Q: could it be that this "disable-reboot-enable" process does the trick? :) The machines had been rebooted several times before, but still didn't take any jobs.
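A hedged way to confirm an attach/detach cycle like this from the master side (the log path assumes the usual buildbot master layout on these hosts and may differ):

# on the buildbot master, show recent connection events for this slave
grep -i 't-w732-ix-055' /builds/buildbot/*/master/twistd.log | tail -20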

3. https://bugzilla.mozilla.org/show_bug.cgi?id=1191967

   the panda machine was decommissioned by DCOps
   noticed that we don't have any entry for these types of machines in the "production_config" script
   also looked through many examples, but didn't find any bug for decommissioning such a machine on the releng side

Q: are there any additional steps that must be done here? Kim will update the doc on how to decomm pandas

mysql access - Kim to sign slavealloc with vlad and alin's keys - still having problems with my signing access; will have to get someone else to do this.

2015-08-13

[coop]

# replace placeholder with creds from oob-password.txt.gpg
export IPMI_USERNAME=XXXXXXXX
export IPMI_PASSWORD=XXXXXXXX

[vlad] I tried to decrypt the oob-password but failed
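For reference, decrypting a GPG-encrypted credentials file is a one-liner; it only succeeds if the file was encrypted to your public key, which is the likely failure here:

# decrypt to stdout; prompts for your private-key passphrase
gpg --decrypt oob-password.txt.gpg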

[alin] 1. https://bugzilla.mozilla.org/show_bug.cgi?id=890317

   used the script provided by Chris and re-imaged the slave
   re-enabled in slavealloc and checked the logs - the machine is now connected to a master
   waiting to see if it starts taking jobs

UPDATE: at the moment, it has already completed 4 jobs -> marking the bug as RESOLVED.

2. https://bugzilla.mozilla.org/show_bug.cgi?id=1104571

   re-imaged the slave, enabled it in slavealloc and then restarted it
   noticed that it connected to a buildbot master machine
   waiting to see if it takes jobs

UPDATE: at the moment, it has already completed 4 jobs -> marking the bug as RESOLVED

3. we are not able to decrypt "slavealloc.txt.gpg", so we do not have mysql access. I guess the file has not been signed with our keys yet. I had problems with my gpg setup; I am trying to fix it, and if I can't, I will ask someone else to sign them.
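"Signing the file with our keys" here amounts to re-encrypting it so the new public keys can decrypt it; a minimal sketch (the recipient names are hypothetical placeholders for real key IDs or emails):

# re-encrypt the secrets file to additional recipients
gpg --decrypt slavealloc.txt.gpg | gpg --encrypt -r alin -r vlad -o slavealloc.txt.gpg.new
mv slavealloc.txt.gpg.new slavealloc.txt.gpg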

4. https://bugzilla.mozilla.org/show_bug.cgi?id=1191967 - decommission panda-0345

   cloned the /tools repo
   modified the json file, ran hg diff and obtained the patch (see the sketch below)
   logged in as root on foopy59 and removed /builds/panda-0345 folder
   as stated above, we don't have DB access
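A sketch of the patch-generation step referenced above (the patch file name is just an example; the json file's path within the repo is omitted):

# in the clone of hg.mozilla.org/build/tools, after editing the json file
hg diff > bug1191967-decomm-panda-0345.patch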


5. Should we start appending 'buildduty' to our IRC names? :)

[vlad] 1. https://bugzilla.mozilla.org/show_bug.cgi?id=1191901

   removed the fqdn from inventory
   terminated the aws instance
   revoked the VPN access


2. Started the re-image process for t-w732-ix-001 from my computer and ran the script successfully. Update: re-image completed; the slave is taking jobs.


2015-08-14

[alin] 1. https://bugzilla.mozilla.org/show_bug.cgi?id=795795 - bld-lion-r5-052

   tried to re-image the machine
   waited over one hour but ping did not work
   attempted to reboot the machine from the console --> failed

Attempting SSH reboot...Failed. Attempting PDU reboot...Failed. Filed IT bug for reboot (bug 1194615)

https://bugzilla.mozilla.org/show_bug.cgi?id=867136 - talos-linux64-ix-017

   re-imaged, could not connect to it after that
   attempted to reboot the machine from the console -> failed

Attempting SSH reboot...Failed. Attempting IPMI reboot...Failed. Filed IT bug for reboot (bug 1194673)
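For reference, the IPMI reboot can also be attempted by hand with ipmitool, using the credentials exported in the snippet above ($MGMT_HOST is a placeholder for the slave's out-of-band management hostname):

# power-cycle the machine via its out-of-band management interface
ipmitool -I lanplus -H "$MGMT_HOST" -U "$IPMI_USERNAME" -P "$IPMI_PASSWORD" chassis power cycle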

2. loaned machines:

   https://bugzilla.mozilla.org/show_bug.cgi?id=1194498 - EC2 machine to kfung
   https://bugzilla.mozilla.org/show_bug.cgi?id=1194623 - t-snow-r4 machine to cmanchester


3. https://bugzilla.mozilla.org/show_bug.cgi?id=1141416 (failed to update talos) - re-imaged the following slaves + restarted httpd (see the sketch after the lists below):

Talos-linux32-ix:

   talos-linux32-ix-008
   talos-linux32-ix-001 <- coop to investigate
   talos-linux32-ix-026
   talos-linux32-ix-022


Yesterday before leaving, I enabled the four slaves above. Ryan VanderMeulen noticed that they failed each job, so he disabled them. I also tried to re-image talos-linux32-ix-001 once again, this time with an extra reboot step after the re-imaging process finished, but with no luck - it still fails every single job. Disabled. From the logs: "command timed out: 3600 seconds without output running..."

Talos-linux64-ix:

   talos-linux64-ix-001 - enabled, 
   talos-linux64-ix-002 - enabled, OK
   talos-linux64-ix-008 - enabled, 
   talos-linux64-ix-027 - enabled, OK
   talos-linux64-ix-004 - enabled, 
   talos-linux64-ix-099 - enabled, OK
   talos-linux64-ix-055 - enabled, OK
   talos-linux64-ix-092 - enabled, 
   talos-linux64-ix-017 - non-responsive


Not re-imaged yet:

   talos-linux32-ix-003 - 
   talos-linux64-ix-027 - re-imaged, started taking jobs
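The "restarted httpd" step from item 3, for reference (assuming these slaves run SysV-style init scripts, which was typical for them at the time):

# on the re-imaged talos slave, restart the local web server
sudo service httpd restart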


4. https://bugzilla.mozilla.org/show_bug.cgi?id=1194211 - panda-0345 slave

   will wait until a reconfig occurs 
   thanks Kim for the info :) 


5. https://bugzilla.mozilla.org/show_bug.cgi?id=1098452 - bld-lion-r5-079

   noticed that it no longer takes jobs
   disabled it in slavealloc, rebooted and then re-enabled it --> no effect
   checked the logs and noticed that it doesn't connect to a master
   in slavealloc it still appears to be connected to buildbot-master86.bb.releng.scl3.mozilla.com, even though the logs show that it lost connection two days ago (2015-08-12 11:49:13-0700)

Q: should I try a re-image?

   filed bug to DCOps to re-image and run diagnostics on this slave