CIDuty/SVMeetings/Aug24-Aug28

Upcoming vacation/PTO:

alin - aug31-sep11
coop - aug 3, aug 17-28
kmoir - aug 3, aug 24
otilia - aug10-21, half day on July31
vlad - jul31 ; aug14-aug27
Monday Aug 3 - Holiday in Canada

2015-08-24

https://bugzilla.mozilla.org/show_bug.cgi?id=1175291

   @Callek: Kim reviewed the patch and said it was ok. If you have time, please land it and merge it to the production branch. - done --> thanks! -->backed out, see bug

2015-08-25

1. took care of several common tasks: --> https://bugzilla.mozilla.org/show_bug.cgi?id=1103082 - talos-linux32-ix-022

   Van Le ran several memory and disk tests, all of them passed

   the slave has been re-imaged, so I enabled it in slavealloc

   waiting to see if it takes jobs/how they run

UPDATE: burned another job, disabled --> https://bugzilla.mozilla.org/show_bug.cgi?id=1103851 - t-xp32-ix-041

   burned the latest several jobs, disabled

   re-imaged, waiting to see if it works now

UPDATE: started taking jobs, looks fine

2. started looking over the additional bugs to work on

   tried to run " manage_masters.py" script from my local machine

   I am using Cygwin and Windows 8.1

   copied the script in /home/alin.selagea, installed pip and then fabric

$ pip freeze | grep Fabric Fabric==1.10.2

   spent a lot of time debugging, but I get:

File "manage_masters.py", line 13, in <module>

   import util.fabric.actions

ImportError: No module named util.fabric.actions

   downloaded this module from github and copied the folder "fabric" under /lib/python/util

Q: do you know of any additional settings that should be made? Did you copy the entire tools dir to /home/alin.selagea or just the one script? You need to copy the entire repo, not just the one script. \o/ NOW IT WORKS \o/ :)

3. https://bugzilla.mozilla.org/show_bug.cgi?id=1175291

   patch to disable freshclam failed on t-snow and t-yosemite machines and has been rolled back

   talked to Amy and it seems that the "unless" condition is not correct

   currently working to change that line

onlyif => " /bin/test -f /etc/freshclam.conf && /bin/test "$(/usr/bin/grep 'Checks 0' /etc/freshclam.conf)" != "Checks 0"" ;

   the second test condition needs to be adjusted, as it doesn't work fine yet

   Amy also mentioned that "org.clamav.freshclam-init.plist has RunAtLoad true", so maybe we should add that to the disabled services in order to fix the bug

   Kim will find pointers to change plists

   http://hg.mozilla.org/build/puppet/file/39edb742127c/modules/disableservices/manifests/common.pp#l95

   [root@bld-lion-r5-078.build.releng.scl3.mozilla.com ~]# launchctl list | grep freshclam

   -       0       org.clamav.freshclam-init

4. in the morning I was asked by Nigel if there were some changes to the linux machines as noticed some failures: https://treeherder.mozilla.org/logviewer.html#?job_id=4364889&repo=fx-team

   errors like: 'mach build' did not run successfully. Please check log for errors.

   on the #developers channel, it was suggested that the issue was not related to infrastructure but to clobber.

   I don't know exactly what "clobber" refers to in this case

--> in the case of clobberring a tree -> refers to the process of erasing the build directory from a machine Q1: where do we find the logs to see if a clobber has been run? Q2: (I know this was briefly explained during the first week) some tips&tricks for working with Treeherder https://api.pub.build.mozilla.org/docs/ https://api.pub.build.mozilla.org/docs/usage/clobberer/ https://wiki.mozilla.org/Auto-tools/Projects/Treeherder

You can see that each build clobbers the properties and scripts dirs - see clobber_scripts and clobber_properties here for a particular job and machine. http://buildbot-master86.bb.releng.scl3.mozilla.com:8001/builders/OS%20X%2010.7%2064-bit%20mozilla-inbound%20leak%20test%20build/builds/1967 During the run script stage, it checks to see if a clobber needs to be run, for example http://buildbot-master86.bb.releng.scl3.mozilla.com:8001/builders/OS%20X%2010.7%2064-bit%20mozilla-inbound%20leak%20test%20build/builds/1967/steps/run_script/logs/stdio You can see 01:46:04 INFO - ##### 01:46:04 INFO - ##### Running clobber step. 01:46:04 INFO - ##### 01:46:04 INFO - Running pre-action listener: influxdb_recording_pre_action 01:46:04 INFO - Running main action method: clobber 01:46:04 INFO - retry: Calling run_command with args: [['/builds/slave/m-in-m64-d-0000000000000000000/scripts/external_tools/clobberer.py', '-s', 'scripts', '-s', 'logs', '-s', 'buildprops.json', '-s', 'token', '-s', 'oauth.txt', '-t', '168', 'https://api.pub.build.mozilla.org/clobberer/lastclobber', 'mozilla-inbound', 'OS X 10.7 64-bit mozilla-inbound leak test build', 'm-in-m64-d-0000000000000000000', 'bld-lion-r5-083', 'http://buildbot-master86.bb.releng.scl3.mozilla.com:8001/']], kwargs: {'error_list': [{'substr': 'Error contacting server', 'explanation': 'Error contacting server for clobberer information.', 'level': 'error'}], 'cwd': '/builds/slave'}, attempt #1 01:46:04 INFO - Running command: ['/builds/slave/m-in-m64-d-0000000000000000000/scripts/external_tools/clobberer.py', '-s', 'scripts', '-s', 'logs', '-s', 'buildprops.json', '-s', 'token', '-s', 'oauth.txt', '-t', '168', 'https://api.pub.build.mozilla.org/clobberer/lastclobber', 'mozilla-inbound', 'OS X 10.7 64-bit mozilla-inbound leak test build', 'm-in-m64-d-0000000000000000000', 'bld-lion-r5-083', 'http://buildbot-master86.bb.releng.scl3.mozilla.com:8001/'] in /builds/slave 01:46:04 INFO - Copy/paste: /builds/slave/m-in-m64-d-0000000000000000000/scripts/external_tools/clobberer.py -s scripts -s logs -s buildprops.json -s token -s oauth.txt -t 168 https://api.pub.build.mozilla.org/clobberer/lastclobber mozilla-inbound "OS X 10.7 64-bit mozilla-inbound leak test build" m-in-m64-d-0000000000000000000 bld-lion-r5-083 http://buildbot-master86.bb.releng.scl3.mozilla.com:8001/ 01:46:05 INFO - Checking clobber URL: https://api.pub.build.mozilla.org/clobberer/lastclobber?master=http%3A%2F%2Fbuildbot-master86.bb.releng.scl3.mozilla.com%3A8001%2F&slave=bld-lion-r5-083&builddir=m-in-m64-d-0000000000000000000&branch=mozilla-inbound&buildername=OS+X+10.7+64-bit+mozilla-inbound+leak+test+build 01:46:05 INFO - m-in-m64-d-0000000000000000000:Our last clobber date: 2015-08-18 10:12:57 01:46:05 INFO - m-in-m64-d-0000000000000000000:Server clobber date: 2015-08-10 15:40:31 01:46:05 INFO - Return code: 0 01:46:05 INFO - Running command: ['/builds/slave/m-in-m64-d-0000000000000000000/scripts/external_tools/purge_builds.py', '-s', '10', '--max-age', '14', '--not', 'info', '--not', 'rel-*:10d', '--not', 'tb-rel-*:10d', '/builds/slave'] 01:46:05 INFO - Copy/paste: /builds/slave/m-in-m64-d-0000000000000000000/scripts/external_tools/purge_builds.py -s 10 --max-age 14 --not info --not rel-*:10d --not tb-rel-*:10d /builds/slave 01:46:05 INFO - Using env: {'HG_SHARE_BASE_DIR': '/builds/hg-shared', 01:46:05 INFO - 'PATH': '/tools/python/bin:/tools/buildbot/bin:/opt/local/bin:/usr/bin:/bin:/usr/sbin:/sbin:/usr/local/bin:/usr/X11/bin'} 01:46:05 INFO - Ignored directory 'rel-m-beta-m64_beta_u_v_3-0000' exceeds cutoff time 01:46:05 INFO - Ignored directory 'rel-m-beta-m64_rpk_1-000000000' exceeds cutoff time 01:46:05 INFO - Ignored directory 'rel-m-beta-m64_rpk_3-000000000' exceeds cutoff time 01:46:05 INFO - Ignored directory 'rel-m-rel-m64_rpk_5-0000000000' exceeds cutoff time 01:46:05 INFO - Ignored directory 'tb-rel-c-esr38-m64_bld-0000000' exceeds cutoff time 01:46:05 INFO - Ignored directory 'tb-rel-c-esr38-m64_rpk_1-00000' exceeds cutoff time 01:46:05 INFO - 519.60 GB of space available 01:46:05 INFO - Return code: 0

https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&revision=36e7f7148cf8

clobberer https://api.pub.build.mozilla.org/login_request?next=%2Fclobberer%2F

Will look for logs for clobber -- ^^ see above

2015-08-26

1. common tasks:

   re-imaged the machines that returned from loan and enabled them in slavealloc

   loaned one win 8 and one win xp machines to Kim :)

2. https://bugzilla.mozilla.org/show_bug.cgi?id=1175291 - disable freshclam

   tried both with the example suggested by Kim and Amy's patch

   received errors on t-snow and t-yosemite slaves, debugging

2015-08-27

1. common tasks:

   loaned two AWS slaves to Ted Mielczarek:

   --> tst-linux32-ec2-ted  
   --> tst-linux64-ec2-ted

   https://bugzilla.mozilla.org/show_bug.cgi?id=1037839 - t-snow-r4-0089

 --> this slave has been disabled due to GC crashes
 --> re-imaged and returned it to the pool, waiting to see how it goes

UPDATE: already completed 10+ jobs without issues --> marked the bug as Resolved

2. received 5 consecutive alerts like: [sns alert] Thu 03:08:45 PDT t-snow-r4-0032 puppet-agent: /usr/local/bin/screenresolution set 1600x1200x32 returned 1 instead of one of [0]

   connected to the slave --> "/usr/local/bin/screenresolution get"

"2015-08-27 03:42:38.042 screenresolution[8382:903] Error: failed to get list of active displays" tried the same command on t-snow-r4-0089: "2015-08-27 03:42:07.742 screenresolution[1109:903] Display 0: 1600x1200x32" restarted the slave, ran the command again: "2015-08-27 03:46:53.403 screenresolution[655:903] Display 0: 1600x1200x32" -- seems ok now Yes, this alert usually means that we need to reboot the machine. We should add a check in runner to check for this alert and reboot, perhaps open a bug to investigate.

3. also received several alerts from relengbot:

   <relengbot> [sns alert] Thu 05:08:02 PDT buildbot-master79.bb.releng.usw2.mozilla.com watch_twistd_log.py: Count: 633 | First instance: 2015-08-27 04:00:03-0700 | Most recent instance: 2015-08-27 04:29:00-0700 | Twistd exception: twisted.cred.error.UnauthorizedLogin - unknown 10.132.67.84

   investigated, I don't know where those IPs are coming from

   asked in #releng channel for suggestions

   This is an machine trying to connect to the buildbot master that has either 1) wrong password 2) wrong username 3) is not listed in the range that can connect ot the master

   Looks like this machine 10.132.67.84 is try-linux64-spot-407 (I looked in the AWS console)

   aselagea: http://buildbot-master79.bb.releng.usw2.mozilla.com:8101/buildslaves

   kmoiraselagea: http://buildbot-master79.bb.releng.usw2.mozilla.com:8101/buildslaves/try-linux64-spot-407

[Broker,24354,10.132.67.182] BuildSlave.detached(try-linux64-spot-349) 2015-08-27 04:26:08-0700 [Broker,25204,10.132.67.48] Peer will receive following PB traceback: 2015-08-27 04:26:08-0700 [Broker,25204,10.132.67.48] Unhandled Error

4. https://bugzilla.mozilla.org/show_bug.cgi?id=1158729 - manage_masters.py retry_dead_queue should run periodically

   I am now able to run the script from my local machine using Cygwinhttps://etherpad.mozilla.org/buildduty-notes

   tried to find a way to implement a cronjob to run the command on all the masters

   looked at bug https://bugzilla.mozilla.org/show_bug.cgi?id=1057888 - Automate monthly graceful restart of buildbot masters for a clue

   my guess is that we can use the options -D, -M, -H to select the masters

Q: which machine should be used to set the cronjob? Do a needinfo on the bug for rail to ask about that

5. https://bugzilla.mozilla.org/show_bug.cgi?id=1175291 - disable freshclam on OSX builders

   loaned two slaves: t-snow-r4-0156 and t-yosemite-r5-0090

   modified slave.pp file to test Amy's patch

   added t-snow-r4-0156 and t-yosemite-r5-0090 in moco-nodes.pp configuration file

   working fine on bld-lion-r5-078

   on t-snow and t-yosemite I get some ugly errors:

Could not set 'present' on ensure: No such file or directory - https://releng-puppet2.srv.releng.scl3.mozilla.com/repos/DMGs/10.10/git-1.7.9.4-1.dmg at 38:/etc/puppet/environments/aselagea/modules/packages/manifests/pkgdmg.pp

   debugging...

   kim will look at yosemite - didn't get a chance to do that today because we had a lot of issues today, will look at it first thing tomorrow

https://secure.pub.build.mozilla.org/buildapi/self-serve https://secure.pub.build.mozilla.org/buildapi/self-serve/mozilla-central https://bugzilla.mozilla.org/show_bug.cgi?id=1198900

Build changeset is here https://secure.pub.build.mozilla.org/buildapi/self-serve/mozilla-central f8086bd3c84f

search for build changeset on this page https://hg.mozilla.org/mozilla-central/summary f8086bd3c84f

changeset referred to in bug is https://hg.mozilla.org/mozilla-central/summary 8205877b3b30

2015-08-27

Yesterday we had some buildduty issues, thought you might want to read them for interest

In the morning SETA stopped working for 6 hrs on Aug 27, then started working again, why? https://bugzilla.mozilla.org/show_bug.cgi?id=1199347 buildbot-master81 in the twistd.log

which caused Very high test backlog https://bugzilla.mozilla.org/show_bug.cgi?id=1199226

In the afternoon, stage.mozilla.org:http - 301 /(.*) to archive.mozilla.org/$1 https://bugzilla.mozilla.org/show_bug.cgi?id=1198296 Summary: IT made a change to change stage.mozilla.org to archive.mozilla.org. The ip address changed and this meant we didn't have the ssh key to scp content there

ftp 2-8.dmz.scl3.mozilla.com:httpd max clients is WARNING: Using 2400 out of 2400 Clients https://bugzilla.mozilla.org/show_bug.cgi?id=1199363 ftp server in load balancing cluster failed, one had to be put in passive mode. Unclear if this is the result of extra traffic from bug 1198296 above

We can discuss tomorrow. Also, thanks for checking the state of the AWS masters starting instances, that was very helpful.

2015-08-28

[alin] 1. common tasks:

   bugs 1198820, 1198821 - terminated instances, removed records from inventory, revoked VPN access

   re-imaged several slaves, monitoring:

       --> b-2008-ix-0025: 0 jobs passed | 0 failed
       --> b-2008-ix-0016: 0 jobs passed | 0 failed
       --> t-w732-ix-001:   4 jobs passed | 0 failed

2. https://bugzilla.mozilla.org/show_bug.cgi?id=1197071

   t-snow-r4-0093 has been decommissioned by the DCOps team

   opened bug to implement the required changes on our side: 1199586

   attached patch to remove slave from configs --> please review it when you have time :)

   also removed the slave from slavealloc

3. https://bugzilla.mozilla.org/show_bug.cgi?id=1158729 - manage_masters.py retry_dead_queue should run periodically

   did a needinfo on the bug for Rail, waiting for reply

4. https://bugzilla.mozilla.org/show_bug.cgi?id=1175291 - disable freshclam on OSX builders

   the issue does not seem to be related to the patch

   when you have time, please take a look

   loaned slaves: t-snow-r4-0156 and t-yosemite-r5-0090

[vlad] 1. Requested access for wiki page 2. Helped Alin to figure it why the puppet is not working on t-yosemite-r5 slaves and t-snow-r4 slaves

Additional bugs to work on 1)manage_masters.py retry_dead_queue should run periodically https://bugzilla.mozilla.org/show_bug.cgi?id=1158729 Way to stop all these alerts :-) 2) Add a runner task to check resolution on Windows testers before starting buildbot https://bugzilla.mozilla.org/show_bug.cgi?id=1190868https://github.com/mozilla/build-runner 3) Add T testing to the trychooser UI https://bugzilla.mozilla.org/show_bug.cgi?id=1141280 Code is in hg.mozilla.org/buildtools and trychooser dir

FYI Friday afternoon we had this problem https://bugzilla.mozilla.org/show_bug.cgi?id=1199524

CIDuty/SVMeetings/Aug24-Aug28

Contents

2015-08-24

2015-08-25

2015-08-26

2015-08-27

2015-08-27

2015-08-28

Navigation menu

Personal tools

Namespaces

Variants

Views

More

Search

Navigation

How to Contribute

MozillaWiki

Around Mozilla

Tools