CIDuty/SVMeetings/Aug24-Aug28
Upcoming vacation/PTO:
- alin - aug31-sep11
- coop - aug 3, aug 17-28
- kmoir - aug 3, aug 24
- otilia - aug10-21, half day on July31
- vlad - jul31 ; aug14-aug27
- Monday Aug 3 - Holiday in Canada
- https://wiki.mozilla.org/ReleaseEngineering/Buildduty/SVMeetings/July27-July31
- https://wiki.mozilla.org/ReleaseEngineering/Buildduty/SVMeetings/Aug3-Aug7
- https://wiki.mozilla.org/ReleaseEngineering/Buildduty/SVMeetings/Aug10-Aug14
- https://wiki.mozilla.org/ReleaseEngineering/Buildduty/SVMeetings/Aug17-Aug21
2015-08-24
https://bugzilla.mozilla.org/show_bug.cgi?id=1175291
@Callek: Kim reviewed the patch and said it was ok. If you have time, please land it and merge it to the production branch. - done --> thanks! -->backed out, see bug
2015-08-25
1. took care of several common tasks: --> https://bugzilla.mozilla.org/show_bug.cgi?id=1103082 - talos-linux32-ix-022
Van Le ran several memory and disk tests, all of them passed
the slave has been re-imaged, so I enabled it in slavealloc
waiting to see if it takes jobs/how they run
UPDATE: burned another job, disabled --> https://bugzilla.mozilla.org/show_bug.cgi?id=1103851 - t-xp32-ix-041
burned the latest several jobs, disabled
re-imaged, waiting to see if it works now
UPDATE: started taking jobs, looks fine
2. started looking over the additional bugs to work on
tried to run " manage_masters.py" script from my local machine
I am using Cygwin and Windows 8.1
copied the script in /home/alin.selagea, installed pip and then fabric
$ pip freeze | grep Fabric Fabric==1.10.2
spent a lot of time debugging, but I get:
File "manage_masters.py", line 13, in <module>
import util.fabric.actions
ImportError: No module named util.fabric.actions
downloaded this module from github and copied the folder "fabric" under /lib/python/util
Q: do you know of any additional settings that should be made? Did you copy the entire tools dir to /home/alin.selagea or just the one script? You need to copy the entire repo, not just the one script. \o/ NOW IT WORKS \o/ :)
3. https://bugzilla.mozilla.org/show_bug.cgi?id=1175291
patch to disable freshclam failed on t-snow and t-yosemite machines and has been rolled back
talked to Amy and it seems that the "unless" condition is not correct
currently working to change that line
onlyif => " /bin/test -f /etc/freshclam.conf && /bin/test "$(/usr/bin/grep 'Checks 0' /etc/freshclam.conf)" != "Checks 0"" ;
the second test condition needs to be adjusted, as it doesn't work fine yet
Amy also mentioned that "org.clamav.freshclam-init.plist has RunAtLoad true", so maybe we should add that to the disabled services in order to fix the bug
Kim will find pointers to change plists
http://hg.mozilla.org/build/puppet/file/39edb742127c/modules/disableservices/manifests/common.pp#l95
[root@bld-lion-r5-078.build.releng.scl3.mozilla.com ~]# launchctl list | grep freshclam
- 0 org.clamav.freshclam-init
4. in the morning I was asked by Nigel if there were some changes to the linux machines as noticed some failures: https://treeherder.mozilla.org/logviewer.html#?job_id=4364889&repo=fx-team
errors like: 'mach build' did not run successfully. Please check log for errors.
on the #developers channel, it was suggested that the issue was not related to infrastructure but to clobber.
I don't know exactly what "clobber" refers to in this case
--> in the case of clobberring a tree -> refers to the process of erasing the build directory from a machine Q1: where do we find the logs to see if a clobber has been run? Q2: (I know this was briefly explained during the first week) some tips&tricks for working with Treeherder https://api.pub.build.mozilla.org/docs/ https://api.pub.build.mozilla.org/docs/usage/clobberer/ https://wiki.mozilla.org/Auto-tools/Projects/Treeherder
You can see that each build clobbers the properties and scripts dirs - see clobber_scripts and clobber_properties here for a particular job and machine. http://buildbot-master86.bb.releng.scl3.mozilla.com:8001/builders/OS%20X%2010.7%2064-bit%20mozilla-inbound%20leak%20test%20build/builds/1967 During the run script stage, it checks to see if a clobber needs to be run, for example http://buildbot-master86.bb.releng.scl3.mozilla.com:8001/builders/OS%20X%2010.7%2064-bit%20mozilla-inbound%20leak%20test%20build/builds/1967/steps/run_script/logs/stdio You can see 01:46:04 INFO - ##### 01:46:04 INFO - ##### Running clobber step. 01:46:04 INFO - ##### 01:46:04 INFO - Running pre-action listener: influxdb_recording_pre_action 01:46:04 INFO - Running main action method: clobber 01:46:04 INFO - retry: Calling run_command with args: [['/builds/slave/m-in-m64-d-0000000000000000000/scripts/external_tools/clobberer.py', '-s', 'scripts', '-s', 'logs', '-s', 'buildprops.json', '-s', 'token', '-s', 'oauth.txt', '-t', '168', 'https://api.pub.build.mozilla.org/clobberer/lastclobber', 'mozilla-inbound', 'OS X 10.7 64-bit mozilla-inbound leak test build', 'm-in-m64-d-0000000000000000000', 'bld-lion-r5-083', 'http://buildbot-master86.bb.releng.scl3.mozilla.com:8001/']], kwargs: {'error_list': [{'substr': 'Error contacting server', 'explanation': 'Error contacting server for clobberer information.', 'level': 'error'}], 'cwd': '/builds/slave'}, attempt #1 01:46:04 INFO - Running command: ['/builds/slave/m-in-m64-d-0000000000000000000/scripts/external_tools/clobberer.py', '-s', 'scripts', '-s', 'logs', '-s', 'buildprops.json', '-s', 'token', '-s', 'oauth.txt', '-t', '168', 'https://api.pub.build.mozilla.org/clobberer/lastclobber', 'mozilla-inbound', 'OS X 10.7 64-bit mozilla-inbound leak test build', 'm-in-m64-d-0000000000000000000', 'bld-lion-r5-083', 'http://buildbot-master86.bb.releng.scl3.mozilla.com:8001/'] in /builds/slave 01:46:04 INFO - Copy/paste: /builds/slave/m-in-m64-d-0000000000000000000/scripts/external_tools/clobberer.py -s scripts -s logs -s buildprops.json -s token -s oauth.txt -t 168 https://api.pub.build.mozilla.org/clobberer/lastclobber mozilla-inbound "OS X 10.7 64-bit mozilla-inbound leak test build" m-in-m64-d-0000000000000000000 bld-lion-r5-083 http://buildbot-master86.bb.releng.scl3.mozilla.com:8001/ 01:46:05 INFO - Checking clobber URL: https://api.pub.build.mozilla.org/clobberer/lastclobber?master=http%3A%2F%2Fbuildbot-master86.bb.releng.scl3.mozilla.com%3A8001%2F&slave=bld-lion-r5-083&builddir=m-in-m64-d-0000000000000000000&branch=mozilla-inbound&buildername=OS+X+10.7+64-bit+mozilla-inbound+leak+test+build 01:46:05 INFO - m-in-m64-d-0000000000000000000:Our last clobber date: 2015-08-18 10:12:57 01:46:05 INFO - m-in-m64-d-0000000000000000000:Server clobber date: 2015-08-10 15:40:31 01:46:05 INFO - Return code: 0 01:46:05 INFO - Running command: ['/builds/slave/m-in-m64-d-0000000000000000000/scripts/external_tools/purge_builds.py', '-s', '10', '--max-age', '14', '--not', 'info', '--not', 'rel-*:10d', '--not', 'tb-rel-*:10d', '/builds/slave'] 01:46:05 INFO - Copy/paste: /builds/slave/m-in-m64-d-0000000000000000000/scripts/external_tools/purge_builds.py -s 10 --max-age 14 --not info --not rel-*:10d --not tb-rel-*:10d /builds/slave 01:46:05 INFO - Using env: {'HG_SHARE_BASE_DIR': '/builds/hg-shared', 01:46:05 INFO - 'PATH': '/tools/python/bin:/tools/buildbot/bin:/opt/local/bin:/usr/bin:/bin:/usr/sbin:/sbin:/usr/local/bin:/usr/X11/bin'} 01:46:05 INFO - Ignored directory 'rel-m-beta-m64_beta_u_v_3-0000' exceeds cutoff time 01:46:05 INFO - Ignored directory 'rel-m-beta-m64_rpk_1-000000000' exceeds cutoff time 01:46:05 INFO - Ignored directory 'rel-m-beta-m64_rpk_3-000000000' exceeds cutoff time 01:46:05 INFO - Ignored directory 'rel-m-rel-m64_rpk_5-0000000000' exceeds cutoff time 01:46:05 INFO - Ignored directory 'tb-rel-c-esr38-m64_bld-0000000' exceeds cutoff time 01:46:05 INFO - Ignored directory 'tb-rel-c-esr38-m64_rpk_1-00000' exceeds cutoff time 01:46:05 INFO - 519.60 GB of space available 01:46:05 INFO - Return code: 0
https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&revision=36e7f7148cf8
clobberer https://api.pub.build.mozilla.org/login_request?next=%2Fclobberer%2F
Will look for logs for clobber -- ^^ see above
2015-08-26
1. common tasks:
re-imaged the machines that returned from loan and enabled them in slavealloc
loaned one win 8 and one win xp machines to Kim :)
2. https://bugzilla.mozilla.org/show_bug.cgi?id=1175291 - disable freshclam
tried both with the example suggested by Kim and Amy's patch
received errors on t-snow and t-yosemite slaves, debugging
2015-08-27
1. common tasks:
loaned two AWS slaves to Ted Mielczarek:
--> tst-linux32-ec2-ted --> tst-linux64-ec2-ted
https://bugzilla.mozilla.org/show_bug.cgi?id=1037839 - t-snow-r4-0089
--> this slave has been disabled due to GC crashes --> re-imaged and returned it to the pool, waiting to see how it goes
UPDATE: already completed 10+ jobs without issues --> marked the bug as Resolved
2. received 5 consecutive alerts like: [sns alert] Thu 03:08:45 PDT t-snow-r4-0032 puppet-agent: /usr/local/bin/screenresolution set 1600x1200x32 returned 1 instead of one of [0]
connected to the slave --> "/usr/local/bin/screenresolution get"
"2015-08-27 03:42:38.042 screenresolution[8382:903] Error: failed to get list of active displays" tried the same command on t-snow-r4-0089: "2015-08-27 03:42:07.742 screenresolution[1109:903] Display 0: 1600x1200x32" restarted the slave, ran the command again: "2015-08-27 03:46:53.403 screenresolution[655:903] Display 0: 1600x1200x32" -- seems ok now Yes, this alert usually means that we need to reboot the machine. We should add a check in runner to check for this alert and reboot, perhaps open a bug to investigate.
3. also received several alerts from relengbot:
<relengbot> [sns alert] Thu 05:08:02 PDT buildbot-master79.bb.releng.usw2.mozilla.com watch_twistd_log.py: Count: 633 | First instance: 2015-08-27 04:00:03-0700 | Most recent instance: 2015-08-27 04:29:00-0700 | Twistd exception: twisted.cred.error.UnauthorizedLogin - unknown 10.132.67.84
investigated, I don't know where those IPs are coming from
asked in #releng channel for suggestions
This is an machine trying to connect to the buildbot master that has either 1) wrong password 2) wrong username 3) is not listed in the range that can connect ot the master
Looks like this machine 10.132.67.84 is try-linux64-spot-407 (I looked in the AWS console)
aselagea: http://buildbot-master79.bb.releng.usw2.mozilla.com:8101/buildslaves
kmoiraselagea: http://buildbot-master79.bb.releng.usw2.mozilla.com:8101/buildslaves/try-linux64-spot-407
[Broker,24354,10.132.67.182] BuildSlave.detached(try-linux64-spot-349)
2015-08-27 04:26:08-0700 [Broker,25204,10.132.67.48] Peer will receive following PB traceback:
2015-08-27 04:26:08-0700 [Broker,25204,10.132.67.48] Unhandled Error
4. https://bugzilla.mozilla.org/show_bug.cgi?id=1158729 - manage_masters.py retry_dead_queue should run periodically
I am now able to run the script from my local machine using Cygwinhttps://etherpad.mozilla.org/buildduty-notes
tried to find a way to implement a cronjob to run the command on all the masters
looked at bug https://bugzilla.mozilla.org/show_bug.cgi?id=1057888 - Automate monthly graceful restart of buildbot masters for a clue
my guess is that we can use the options -D, -M, -H to select the masters
Q: which machine should be used to set the cronjob? Do a needinfo on the bug for rail to ask about that
5. https://bugzilla.mozilla.org/show_bug.cgi?id=1175291 - disable freshclam on OSX builders
loaned two slaves: t-snow-r4-0156 and t-yosemite-r5-0090
modified slave.pp file to test Amy's patch
added t-snow-r4-0156 and t-yosemite-r5-0090 in moco-nodes.pp configuration file
working fine on bld-lion-r5-078
on t-snow and t-yosemite I get some ugly errors:
Could not set 'present' on ensure: No such file or directory - https://releng-puppet2.srv.releng.scl3.mozilla.com/repos/DMGs/10.10/git-1.7.9.4-1.dmg at 38:/etc/puppet/environments/aselagea/modules/packages/manifests/pkgdmg.pp
debugging...
kim will look at yosemite - didn't get a chance to do that today because we had a lot of issues today, will look at it first thing tomorrow
https://secure.pub.build.mozilla.org/buildapi/self-serve
https://secure.pub.build.mozilla.org/buildapi/self-serve/mozilla-central
https://bugzilla.mozilla.org/show_bug.cgi?id=1198900
Build changeset is here
https://secure.pub.build.mozilla.org/buildapi/self-serve/mozilla-central
f8086bd3c84f
search for build changeset on this page https://hg.mozilla.org/mozilla-central/summary f8086bd3c84f
changeset referred to in bug is https://hg.mozilla.org/mozilla-central/summary 8205877b3b30
2015-08-27
Yesterday we had some buildduty issues, thought you might want to read them for interest
In the morning SETA stopped working for 6 hrs on Aug 27, then started working again, why? https://bugzilla.mozilla.org/show_bug.cgi?id=1199347 buildbot-master81 in the twistd.log
which caused Very high test backlog https://bugzilla.mozilla.org/show_bug.cgi?id=1199226
In the afternoon, stage.mozilla.org:http - 301 /(.*) to archive.mozilla.org/$1 https://bugzilla.mozilla.org/show_bug.cgi?id=1198296 Summary: IT made a change to change stage.mozilla.org to archive.mozilla.org. The ip address changed and this meant we didn't have the ssh key to scp content there
ftp 2-8.dmz.scl3.mozilla.com:httpd max clients is WARNING: Using 2400 out of 2400 Clients https://bugzilla.mozilla.org/show_bug.cgi?id=1199363 ftp server in load balancing cluster failed, one had to be put in passive mode. Unclear if this is the result of extra traffic from bug 1198296 above
We can discuss tomorrow. Also, thanks for checking the state of the AWS masters starting instances, that was very helpful.
2015-08-28
[alin] 1. common tasks:
bugs 1198820, 1198821 - terminated instances, removed records from inventory, revoked VPN access
re-imaged several slaves, monitoring:
--> b-2008-ix-0025: 0 jobs passed | 0 failed --> b-2008-ix-0016: 0 jobs passed | 0 failed --> t-w732-ix-001: 4 jobs passed | 0 failed
2. https://bugzilla.mozilla.org/show_bug.cgi?id=1197071
t-snow-r4-0093 has been decommissioned by the DCOps team
opened bug to implement the required changes on our side: 1199586
attached patch to remove slave from configs --> please review it when you have time :)
also removed the slave from slavealloc
3. https://bugzilla.mozilla.org/show_bug.cgi?id=1158729 - manage_masters.py retry_dead_queue should run periodically
did a needinfo on the bug for Rail, waiting for reply
4. https://bugzilla.mozilla.org/show_bug.cgi?id=1175291 - disable freshclam on OSX builders
the issue does not seem to be related to the patch
when you have time, please take a look
loaned slaves: t-snow-r4-0156 and t-yosemite-r5-0090
[vlad]
1. Requested access for wiki page
2. Helped Alin to figure it why the puppet is not working on t-yosemite-r5 slaves and t-snow-r4 slaves
Additional bugs to work on
1)manage_masters.py retry_dead_queue should run periodically https://bugzilla.mozilla.org/show_bug.cgi?id=1158729 Way to stop all these alerts :-)
2) Add a runner task to check resolution on Windows testers before starting buildbot https://bugzilla.mozilla.org/show_bug.cgi?id=1190868https://github.com/mozilla/build-runner
3) Add T testing to the trychooser UI https://bugzilla.mozilla.org/show_bug.cgi?id=1141280 Code is in hg.mozilla.org/buildtools and trychooser dir
FYI Friday afternoon we had this problem https://bugzilla.mozilla.org/show_bug.cgi?id=1199524