ReleaseEngineering/Buildduty/SVMeetings/Sept14-Sept18


Upcoming vacation/PTO:

  • vlad - oct16-oct20
  • coop - sep 27 - oct 2 in Cluj-Napoca (approved!)

Meetings every Tuesday and Thursday


2015-09-11 - 2015-09-15

1. https://bugzilla.mozilla.org/show_bug.cgi?id=1203673

   loaned a tst-linux32-ec2 slave to tbsaunde. He accidentally created 4 bugs; 3 of them have been marked as duplicates. I spoke with him on IRC and he confirmed that he only needs 1 slave.

2. https://bugzilla.mozilla.org/show_bug.cgi?id=1204153

   loaned tst-linux64-ec2-chmanchester

3. https://bugzilla.mozilla.org/show_bug.cgi?id=1204431

   created the patch
   deleted the slave from the MySQL DB

4. https://bugzilla.mozilla.org/show_bug.cgi?id=1204676

   uploaded the package to the internal mirror

5. https://bugzilla.mozilla.org/show_bug.cgi?id=1200180

   provided a list of XP slaves that need to be re-imaged
   the following slave has been re-imaged: t-w732-ix-117

6. https://bugzilla.mozilla.org/show_bug.cgi?id=1204756

   created the patch


1. re-imaged several machines, enabled them in slavealloc. Monitoring if they take jobs.

   b-2008-ix-0027 -> most of the jobs taken after re-image are green, closing
   b-2008-ix-0122 -> last 10 jobs completed without issues, closing bug
   t-w864-ix-158 -> most of the jobs are blue


2. https://bugzilla.mozilla.org/show_bug.cgi?id=853719 -> t-w864-ix-025

   removed the C:\slave\test-pgo\scripts\configs\b2g file, rebooted the slave and returned it to the pool
   monitoring if it takes jobs
   still hitting errors like: "rm: cannot remove directory `scripts/configs/b2g': Directory not empty", so it was disabled again
   re-imaged the slave

UPDATE: 1 green job followed by two blue ones (build successful but interrupted) --> still monitoring...

3. https://bugzilla.mozilla.org/show_bug.cgi?id=977306 - talos-linux32-ix-001

   noticed that this slave is locked to Coop's master
   @Coop: still making use of it?


4. https://bugzilla.mozilla.org/show_bug.cgi?id=1103082 - talos-linux32-ix-022

   this slave has been re-imaged multiple times but continued to burn jobs
   Vlad opened a ticket with DCOps for diagnostics, but no errors were found and the slave was re-imaged one more time
   enabled the slave in production, but it failed its first job, so it was disabled again
   Amy suggested not to decommission such slaves, as they are still under warranty
   tried to figure out a possible reason for this, but did not find anything conclusive
   this is one of the slaves affected by bug 1141416 (related to talos's inability to deploy an update)

Q: what further steps should we take on this case?

  • Kim will look at this machine

Maybe get someone in relops to look at it. It seems every job times out. It's not worth re-enabling it again; it will just burn more jobs.

5. today we received several alerts from Nagios:

<nagios-releng> Tue 02:38:35 PDT [4470] cruncher.srv.releng.scl3.mozilla.com:Pending builds is CRITICAL: CRITICAL Pending Builds: 8200 (http://m.mozilla.org/Pending+builds)

   checked /var/log/nrpe.log and noticed that the number of pending jobs started increasing around Sep 14 10:28:27 PDT (3029 -> 4151) and reached a maximum of 10772 on Sep 14 19:08:27 PDT (a small parsing sketch follows below)
   looked over https://wiki.mozilla.org/ReleaseEngineering/How_To/Dealing_with_high_pending_counts and tried to debug the issue
   most of the pending jobs are for Windows slaves and AWS spot instances (Android and Linux)
   talked to Carsten (Tomcat|sheriffduty); he said he closed the trees since OS X 10.7 build jobs were not starting --> that seemed to solve the OS X problem and the number of pending jobs also decreased --> he reopened the trees after that
   Vlad looked for possible problems on the AWS side, but had little luck
   also checked twistd.log on buildbot-master81.bb.releng.scl3.mozilla.com; it seems that it stopped skipping jobs at 2015-09-15 01:59:53-0700, but the number of pending jobs was high even before that

Q1: how does the closure of the OS X trees affect the number of pending Windows and Linux jobs?
Q2: if you have time, we would like to know more about closing/opening trees and their behaviour.
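For reference, here is a minimal sketch of how the pending-count trend could be pulled out of such a log. The exact nrpe.log line format is an assumption; the only thing taken from the alert quoted above is the "Pending Builds: N" fragment, and the 1000-job jump threshold is arbitrary:

   import re

   # Matches the "Pending Builds: 8200" fragment from alerts like the one quoted above.
   PENDING_RE = re.compile(r"Pending Builds:\s*(\d+)")

   def pending_counts(path="/var/log/nrpe.log"):
       """Yield (line, count) for every log line that reports a pending-builds figure."""
       with open(path) as log:
           for line in log:
               match = PENDING_RE.search(line)
               if match:
                   yield line.strip(), int(match.group(1))

   # Print the points where the count jumps by more than 1000 jobs between
   # consecutive samples, to spot spikes like the 3029 -> 4151 increase above.
   previous = None
   for line, count in pending_counts():
       if previous is not None and count - previous > 1000:
           print(line)
       previous = count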


Might be a good idea to ping people on the loaner bugs and ask if they still need them. I also did a needinfo to Coop on the bug.

Q: Do you need additional bugs to work on that are non-loaner related?
A: Yes, please :)

I updated this document last week; you could review it and see if there are any parts that are not clear: https://wiki.mozilla.org/ReleaseEngineering/How_To/Dealing_with_high_pending_counts

Amy is looking for a list of win32 machines that need to be reimaged. Q is looking to test his fix on some machines. https://bugzilla.mozilla.org/show_bug.cgi?id=1200180

New bugs to work on:

1) Add a runner task to check resolution on Windows testers before starting buildbot

https://bugzilla.mozilla.org/show_bug.cgi?id=1190868
https://github.com/mozilla/build-runner

2) Increase size of tst-emulator64-spot instance pool

https://bugzilla.mozilla.org/show_bug.cgi?id=1204756

   [Vlad] Created the patch

3) 10.10 work (upgrade 10.10.2 test machines (yosemite) to 10.10.5 on new hardware)

https://bugzilla.mozilla.org/show_bug.cgi?id=1203128

   machines are up here: https://bugzilla.mozilla.org/show_bug.cgi?id=1203157

   Steps to do this work:
   --> loan yourself one of the 64 new yosemite machines mentioned in bug 1203157
   --> don't delete puppet and runner when you loan yourself the machine
   --> set up a personal development test master on dev-master2.bb.releng.use1.mozilla.com in /builds/buildbot/<yourid>
       (https://wiki.mozilla.org/ReleaseEngineering/How_To/Setup_Personal_Development_Master)
   --> you can limit the master to macosx tests
   --> connect the mac loaner machine to your master
   --> invoke sendchanges to run tests

   We can discuss tomorrow if you want; it's a lot of new material.
   Here is also some info that may be useful: https://wiki.mozilla.org/ReferencePlatforms/HowToSetupNewPlatform
   You won't need to do all of these steps because it's not really an entirely new platform, just a 10.10 version upgrade.

4) Use ec2 instances with more memory when running linux gtests

https://bugzilla.mozilla.org/show_bug.cgi?id=1205785

5) modify check_pending_builds to report more granularly on pending builds/tests

https://bugzilla.mozilla.org/show_bug.cgi?id=1204970
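As a starting point for the more granular reporting, here is a minimal sketch of the kind of per-platform grouping the check could do. It assumes the pending data is already available as a mapping of builder name to pending count; how that data is fetched, the platform-guessing heuristic, the sample builder names and the 2000/5000 thresholds are all assumptions, not the actual check_pending_builds implementation:

   import sys
   from collections import defaultdict

   # Standard nagios exit codes.
   OK, WARNING, CRITICAL = 0, 1, 2

   def summarize(pending, warn=2000, crit=5000):
       """Group pending counts per platform and return (exit_code, status_message)."""
       per_platform = defaultdict(int)
       for builder, count in pending.items():
           # Crude platform guess from the builder name; a real check would use
           # whatever platform metadata the pending data actually carries.
           platform = builder.split()[0].lower()
           per_platform[platform] += count

       total = sum(per_platform.values())
       detail = ", ".join("%s: %d" % (p, n) for p, n in sorted(per_platform.items()))
       if total >= crit:
           return CRITICAL, "CRITICAL Pending Builds: %d (%s)" % (total, detail)
       if total >= warn:
           return WARNING, "WARNING Pending Builds: %d (%s)" % (total, detail)
       return OK, "OK Pending Builds: %d (%s)" % (total, detail)

   if __name__ == "__main__":
       # Hypothetical sample data; the real check would reuse whatever source
       # the current script already queries.
       code, message = summarize({"WINNT 6.1 mozilla-inbound talos": 4200,
                                  "Android 4.3 armv7 API 11+ try": 2600,
                                  "Linux x86-64 try": 1400})
       print(message)
       sys.exit(code)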


2015-09-16 - 2015-09-17

1. took care of several common tasks:

--> https://bugzilla.mozilla.org/show_bug.cgi?id=1103082 - talos-linux32-ix-022

   opened a bug for RelOps to take a look at this slave, as it continues to time out on every job
   this slave has been re-imaged multiple times during the past few months
   also, the memory and disk diagnostics did not seem to find any issues

--> https://bugzilla.mozilla.org/show_bug.cgi?id=1052484

   enabled t-snow-r4-0034 as it had its memory cards replaced
   monitoring the jobs..

--> https://bugzilla.mozilla.org/show_bug.cgi?id=1205033

   Phil noted frequent job failures due to suspicious GC crashes.
   opened a bug to DCOps for memory diagnostics


2. started working on https://bugzilla.mozilla.org/show_bug.cgi?id=1190868 - add a runner task to check resolution on Windows

   read documentation about runner and Python
   created a Python script that checks the screen resolution on a Windows machine (see the sketch below)
   the wiki states that we could use runslave.py or start-buildbot.bat to prevent a slave from starting if the resolution is too low
   at the moment, I am looking over runslave.py to better understand it
   I am not able to find the location of start-buildbot.bat on the Windows slaves

Q: which steps should the script take if the resolution on a certain machine is not the desired one?

  • will go look at it in more detail
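For reference, a minimal sketch of what the resolution check could look like, assuming the testers are expected to run at 1600x1200 (the required resolution, and how runner or start-buildbot.bat would consume the exit code, are assumptions, not the existing implementation):

   import ctypes
   import sys

   # Assumed required resolution for the Windows testers; adjust to the real value.
   REQUIRED = (1600, 1200)

   def current_resolution():
       """Return the primary screen resolution via the Win32 API."""
       user32 = ctypes.windll.user32
       # SM_CXSCREEN = 0, SM_CYSCREEN = 1
       return user32.GetSystemMetrics(0), user32.GetSystemMetrics(1)

   if __name__ == "__main__":
       width, height = current_resolution()
       if (width, height) != REQUIRED:
           print("Bad resolution %dx%d, expected %dx%d"
                 % (width, height, REQUIRED[0], REQUIRED[1]))
           # A non-zero exit code would let the calling task (runner, runslave.py
           # or start-buildbot.bat) refuse to start buildbot on this machine.
           sys.exit(1)
       sys.exit(0)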

3. https://bugzilla.mozilla.org/show_bug.cgi?id=1203128 - add new 10.10 (Yosemite) machines to releng configs

   followed the instructions from the wiki page that Kim provided to us and set up a test master
   limited the master to macosx64 slaves
   since the new slaves are not present in slavealloc:
   --> how can we loan such a slave?
   --> how can we connect it to our development test master?

Callek also mentioned bug 1191481 as a good reference for the next steps that need to be done in order to add the new machines to the following:

  --> buildbot-configs
  --> slavealloc
  --> graphserver
  --> slave health 
  --> treeherder    

Q: do the patches for all these need to be done before or after running tests on them?

4. https://bugzilla.mozilla.org/show_bug.cgi?id=1200180

   Q stated in comment 27 that XP installs are back online.
   should we go ahead and re-image the XP slaves listed in the bugs from the Buildduty Report?



[vlad]

1. https://bugzilla.mozilla.org/show_bug.cgi?id=1204756

   created the patch to increase the size of the instance pool
   created the patch to add the new slaves to slavealloc

2. https://bugzilla.mozilla.org/show_bug.cgi?id=1158729

   work in progress
   created the patches and attached them to the bug; tested the changes on bm69

3. https://bugzilla.mozilla.org/show_bug.cgi?id=1200824

   I was not able to find the instance in AWS
   removed the FQDN from inventory
   removed it from LDAP

4. My Windows machine has been re-imaged and I lost my GPG key. I generated a new one; the key ID is 0x8f34cb4f. Can you please re-sign the files in the private repository?

5. https://bugzilla.mozilla.org/show_bug.cgi?id=1203157

   didn't get to this today, will try tomorrow morning

Capacity issues are believed to be due to the additional e10s tests that were added on Windows, which increased machine compute time per push by 13% on that platform, i.e. roughly 2.5 more hours of compute time per push. Not sure what to do in that case; we will have to talk to the sheriffs, the A-team and releng about the next steps forward: https://etherpad.mozilla.org/high-pending-count
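A quick back-of-the-envelope check of those numbers (only the implied baseline, not data from the etherpad):

   # If 2.5 extra hours correspond to a 13% increase, the baseline Windows
   # compute time per push was roughly 2.5 / 0.13, i.e. about 19 hours.
   extra_hours = 2.5
   increase = 0.13
   print("implied baseline: ~%.1f hours of Windows compute per push" % (extra_hours / increase))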