- vlad - oct16-oct20
- coop - sep 27 - oct 2 in Cluj-Napoca (approved!)
Meetings every Tuesday and Thursday
- Main Meetings Page: https://wiki.mozilla.org/ReleaseEngineering/Buildduty/SVMeetings
2015-09-11 - 2015-09-15
loaned a tst-linux32-ec2 slave to tbsaunde. He accidentally created 4 bugs; 3 of them have been marked as duplicates. I spoke with him on IRC and he confirmed that he only needs 1 slave
created the patch
deleted the slave from the MySQL DB
uploaded the package to internal mirror
provided a list of XP slaves that need to be re-imaged
slave t-w732-ix-117 has been re-imaged
created the patch
1. re-imaged several machines, enabled them in slavealloc. Monitoring if they take jobs.
b-2008-ix-0027 -> most of the jobs taken after re-image are green, closing
b-2008-ix-0122 -> last 10 jobs completed without issues, closing bug
t-w864-ix-158 -> most of the jobs are blue
2. https://bugzilla.mozilla.org/show_bug.cgi?id=853719 -> t-w864-ix-025
removed the C:\slave\test-pgo\scripts\configs\b2g file, rebooted, and returned the slave to the pool
monitoring if it takes jobs
still hitting errors like: "rm: cannot remove directory `scripts/configs/b2g': Directory not empty", disabled.
re-imaged the slave
UPDATE: 1 green job followed by two blue ones (build interrupted) --> still monitoring
3. https://bugzilla.mozilla.org/show_bug.cgi?id=977306 - talos-linux32-ix-001
noticed that this slave is locked to Coop's master
@Coop: still making use of it?
4. https://bugzilla.mozilla.org/show_bug.cgi?id=1103082 - talos-linux32-ix-022
this slave has been re-imaged multiple times but continued to burn jobs
Vlad opened a ticket with DCOps for diagnostics, but no errors were found and the slave has been re-imaged one more time
enabled the slave in production but it failed the first job, disabled
when we raised this, Amy suggested not decommissioning such slaves, as they are still under warranty
tried to figure out a possible reason for this, but did not find anything conclusive
this is one of the slaves affected by bug 1141416 (related to talos's inability to deploy an update)
Q: what further steps should we take on this case?
- Kim will look at this machine
maybe get someone in relops to look at it. It seems every job times out. It's not worth re-enabling it; it will just burn more jobs.
5. today we received several alerts from Nagios:
<nagios-releng> Tue 02:38:35 PDT  cruncher.srv.releng.scl3.mozilla.com:Pending builds is CRITICAL: CRITICAL Pending Builds: 8200 (http://m.mozilla.org/Pending+builds)
checked /var/log/nrpe.log and noticed that the number of pending jobs started increasing around Sep 14 10:28:27 PDT (3029 -> 4151) and reached a maximum of 10772 at Sep 14 19:08:27 PDT
looked over https://wiki.mozilla.org/ReleaseEngineering/How_To/Dealing_with_high_pending_counts and tried to debug the issue
most of the pending jobs are for Windows slaves and AWS spot instances (Android and Linux)
talked to Carsten (Tomcat|sheriffduty); he said he had closed the trees since OS X 10.7 build jobs were not starting --> that seemed to solve the OS X problem, and the number of pending jobs decreased --> he reopened the trees after that
Vlad looked for possible problems on the AWS side, but had little luck
also checked twistd.log on buildbot-master81.bb.releng.scl3.mozilla.com; it seems the master stopped skipping jobs at 2015-09-15 01:59:53-0700, but the number of pending jobs was high even before that
Q1: how does the closure of the OS X trees affect the number of Windows and Linux pending jobs?
Q2: if you have time, we would like to know more about closing/reopening trees and their behaviour.
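The nrpe.log triage above could be sketched as a small script. This is a hedged illustration only: the exact log line format and the jump threshold are assumptions, not the actual buildduty tooling.

```python
import re

# Hypothetical sketch: scan nrpe.log-style lines for the reported
# pending-build count and flag the first large jump between samples.
# The log format and the 1000-job threshold are assumed for illustration.
PENDING_RE = re.compile(r"Pending Builds: (\d+)")

def find_spike(lines, threshold=1000):
    """Return (previous, current) for the first jump larger than `threshold`."""
    prev = None
    for line in lines:
        m = PENDING_RE.search(line)
        if not m:
            continue
        count = int(m.group(1))
        if prev is not None and count - prev > threshold:
            return prev, count
        prev = count
    return None

# Sample lines shaped like the counts reported on Sep 14.
sample = [
    "Sep 14 10:18:27 PDT ... Pending Builds: 3029",
    "Sep 14 10:28:27 PDT ... Pending Builds: 4151",
    "Sep 14 19:08:27 PDT ... Pending Builds: 10772",
]
print(find_spike(sample))  # -> (3029, 4151)
```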
Might be a good idea to ping people on the loaner bugs and ask if they still need them
I also did a needinfo to Coop on the bug
Do you need additional bugs to work on that are non-loaner related? Yes, please :)
I updated this document last week; could you review it and flag any parts that are not clear? https://wiki.mozilla.org/ReleaseEngineering/How_To/Dealing_with_high_pending_counts
Amy is looking for a list of win32 machines that need to be reimaged. Q is looking to test his fix on some machines. https://bugzilla.mozilla.org/show_bug.cgi?id=1200180
New bugs to work on
1) Add a runner task to check resolution on Windows testers before starting buildbot https://bugzilla.mozilla.org/show_bug.cgi?id=1190868 https://github.com/mozilla/build-runner
2) Increase size of tst-emulator64-spot instance pool https://bugzilla.mozilla.org/show_bug.cgi?id=1204756 [Vlad] Created the patch
3) 10.10 work (upgrade 10.10.2 test machines (Yosemite) to 10.10.5 on new hardware) https://bugzilla.mozilla.org/show_bug.cgi?id=1203128
machines are up here: https://bugzilla.mozilla.org/show_bug.cgi?id=1203157
Steps to do this work:
Loan yourself one of the 64 new Yosemite machines mentioned in bug 1203157
Don't delete puppet and runner when you loan yourself the machine
Set up a personal development test master on dev-master2.bb.releng.use1.mozilla.com in /builds/buildbot/<yourid> https://wiki.mozilla.org/ReleaseEngineering/How_To/Setup_Personal_Development_Master
You can limit the master to macosx tests
Connect the mac loaner machine to your master
Invoke sendchanges to run tests
We can discuss tomorrow if you want, it's a lot of new material
Here is also some info that may be useful: https://wiki.mozilla.org/ReferencePlatforms/HowToSetupNewPlatform
You won't need to do all of these steps because it's not really a new platform, just a 10.10 version upgrade
4) Use ec2 instances with more memory when running linux gtests
5) modify check_pending_builds to report more granularly on pending builds/tests (e.g. per platform rather than one global count)
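The granular reporting idea in item 5 could look roughly like the sketch below. This is hypothetical: the real check_pending_builds pulls its numbers from the buildbot databases, and the `{builder_name: pending_count}` data shape and platform keywords here are assumptions for illustration.

```python
from collections import defaultdict

# Assumed platform keywords to bucket builder names by; not the real mapping.
PLATFORMS = ("win", "linux", "macosx", "android")

def pending_by_platform(pending):
    """Aggregate {builder_name: pending_count} into per-platform totals."""
    totals = defaultdict(int)
    for builder, count in pending.items():
        name = builder.lower()
        # First matching keyword wins; anything unmatched goes to "other".
        platform = next((p for p in PLATFORMS if p in name), "other")
        totals[platform] += count
    return dict(totals)

# Illustrative builder names, not a real snapshot of the pending queue.
sample = {
    "WINNT 6.1 xp-ix opt test": 5200,
    "Ubuntu VM 12.04 linux64 mochitest": 1800,
    "Android 4.3 armv7 API 11+ emulator": 900,
    "Rev5 MacOSX Yosemite 10.10 talos": 300,
}
print(pending_by_platform(sample))
```

A per-platform breakdown like this would have shown immediately that the Sep 14 spike was dominated by Windows slaves and AWS spot instances.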
2015-09-16 - 2015-09-17
1. took care of several common tasks: --> https://bugzilla.mozilla.org/show_bug.cgi?id=1103082 - talos-linux32-ix-022
opened a bug for RelOps to take a look at this slave, as it continues to time out on every job
this slave has been re-imaged multiple times during the past few months
also the memory and disk diagnostics did not seem to find any issue.
enabled t-snow-r4-0034 as it had its memory cards replaced
monitoring the jobs..
Phil noted frequent job failures due to suspicious GC crashes.
opened a bug to DCOps for memory diagnostics
2. started working on https://bugzilla.mozilla.org/show_bug.cgi?id=1190868 - add a runner task to check resolution on Windows
read documentation about runner and Python
created a python script that checks the screen resolution on a Windows machine
the wiki states that we could use runslave.py or start-buildbot.bat to prevent a slave from starting if the resolution is too low.
at the moment, I am looking over runslave.py to better understand it
I am not able to find the location of start-buildbot.bat on the Windows slaves
Q: which steps should the script take if the resolution on a certain machine is not the desired one?
- will go look at it in more detail
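The resolution check described in item 2 could be sketched as below. This is a hedged example, not the actual patch: the 1600x1200 minimum and the exit-code behavior are assumptions, and only the Win32 `GetSystemMetrics` calls are specific to the Windows slaves.

```python
import sys

# Minimum resolution expected on Windows test slaves (assumed value,
# not necessarily what the final runner task will require).
MIN_RESOLUTION = (1600, 1200)

def resolution_ok(current, minimum=MIN_RESOLUTION):
    """Return True if the current (width, height) meets the required minimum."""
    return current[0] >= minimum[0] and current[1] >= minimum[1]

if __name__ == "__main__" and sys.platform == "win32":
    # On an actual Windows slave, read the resolution via the Win32 API.
    import ctypes
    user32 = ctypes.windll.user32
    current = (user32.GetSystemMetrics(0),   # SM_CXSCREEN
               user32.GetSystemMetrics(1))   # SM_CYSCREEN
    # A runner task can signal failure through its exit code, so buildbot
    # would not be started when the resolution is wrong (assumed behavior).
    sys.exit(0 if resolution_ok(current) else 1)
```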
3. https://bugzilla.mozilla.org/show_bug.cgi?id=1203128 - add new 10.10 (Yosemite) machines to releng configs
followed the instructions from the wiki page that Kim provided and set up a test master
limited the master to macosx64 slaves
since the new slaves are not present in slavealloc:
--> how can we loan such a slave? --> how can we connect it to our development test master?
Callek also mentioned bug 1191481 as a good reference for the next steps that need to be done in order to add the new machines to the following:
--> buildbot-configs --> slavealloc --> graphserver --> slave health --> treeherder
Q: do the patches for all these need to be done before or after running tests on them?
Q stated in comment 27 that XP installs are back online.
should we go on and re-image those XP slaves listed on the bugs from the Buildduty Report?
created the patch to increase the number
created the patch to add the slaves to slavealloc
work in progress
created the patches and uploaded them to the bug. Tested the changes on bm69
I was not able to find the instance in AWS
Removed the fqdn from inventory
Removed from ldap
4. My Windows machine has been re-imaged and I lost my GPG key. I generated a new one; the key ID is 0x8f34cb4f. Can you please re-sign the files in the private repository?
5. https://bugzilla.mozilla.org/show_bug.cgi?id=1203157 - didn't get to this today, will try tomorrow morning
Capacity issues are believed to be due to the additional e10s tests added on Windows, which increased machine compute time per push by 13% on that platform, or roughly 2.5 more hours of compute time. Not sure what to do in that case; will have to talk to the sheriffs, A-Team, and releng about the next steps forward: https://etherpad.mozilla.org/high-pending-count