CIDuty/SVMeetings/Aug24-Aug28

From MozillaWiki

Upcoming vacation/PTO:

  • alin - Aug 31-Sep 11
  • coop - Aug 3, Aug 17-28
  • kmoir - Aug 3, Aug 24
  • otilia - Aug 10-21, half day on Jul 31
  • vlad - Jul 31; Aug 14-27
  • Monday Aug 3 - Holiday in Canada


2015-08-24

https://bugzilla.mozilla.org/show_bug.cgi?id=1175291

   @Callek: Kim reviewed the patch and said it was OK. If you have time, please land it and merge it to the production branch. - done --> thanks! --> backed out, see bug


2015-08-25

1. took care of several common tasks:

--> https://bugzilla.mozilla.org/show_bug.cgi?id=1103082 - talos-linux32-ix-022

   Van Le ran several memory and disk tests, all of them passed
   the slave has been re-imaged, so I enabled it in slavealloc
   waiting to see if it takes jobs/how they run
   UPDATE: burned another job, disabled

--> https://bugzilla.mozilla.org/show_bug.cgi?id=1103851 - t-xp32-ix-041

   burned the latest several jobs, disabled
   re-imaged, waiting to see if it works now
   UPDATE: started taking jobs, looks fine

2. started looking over the additional bugs to work on

   tried to run the "manage_masters.py" script from my local machine
   I am using Cygwin and Windows 8.1
   copied the script in /home/alin.selagea, installed pip and then fabric

$ pip freeze | grep Fabric
Fabric==1.10.2

   spent a lot of time debugging, but I still get:

File "manage_masters.py", line 13, in <module>
    import util.fabric.actions
ImportError: No module named util.fabric.actions

   downloaded this module from github and copied the folder "fabric" under /lib/python/util

Q: do you know of any additional settings that should be made?

A: Did you copy the entire tools dir to /home/alin.selagea or just the one script? You need to copy the entire repo, not just the one script.

UPDATE: \o/ NOW IT WORKS \o/ :)
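A quick way to confirm this kind of problem before running the script is to check whether the modules it imports can be found at all. The sketch below is only an illustration: the function name `missing_modules` is made up, and the idea that the helper packages live under the repo's lib/python directory is taken from the path mentioned above, not verified against the script.

```python
import importlib
import importlib.util
import sys


def missing_modules(names, extra_dirs=()):
    """Return the names from `names` that Python cannot locate.

    `extra_dirs` are searched first, e.g. the lib/python directory
    of a full checkout of the tools repo (assumed layout).
    """
    added = [str(d) for d in extra_dirs]
    sys.path[:0] = added  # put the extra directories at the front
    importlib.invalidate_caches()
    try:
        missing = []
        for name in names:
            try:
                found = importlib.util.find_spec(name) is not None
            except (ImportError, ValueError):
                # A missing parent package (e.g. "util") ends up here.
                found = False
            if not found:
                missing.append(name)
        return missing
    finally:
        # Leave sys.path the way we found it.
        for d in added:
            sys.path.remove(d)


if __name__ == "__main__":
    # Hypothetical checkout location; adjust to the real one.
    print(missing_modules(["util.fabric.actions"], ["tools/lib/python"]))
```

An empty list means the import should succeed; a non-empty one points at a partial copy of the repo.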

3. https://bugzilla.mozilla.org/show_bug.cgi?id=1175291

   patch to disable freshclam failed on t-snow and t-yosemite machines and has been rolled back
   talked to Amy and it seems that the "unless" condition is not correct
   currently working to change that line

onlyif => " /bin/test -f /etc/freshclam.conf && /bin/test "$(/usr/bin/grep 'Checks 0' /etc/freshclam.conf)" != "Checks 0"" ;

   the second test condition needs to be adjusted, as it does not work correctly yet
   Amy also mentioned that "org.clamav.freshclam-init.plist has RunAtLoad true", so maybe we should add that to the disabled services in order to fix the bug
   Kim will find pointers to change plists
   http://hg.mozilla.org/build/puppet/file/39edb742127c/modules/disableservices/manifests/common.pp#l95
   [root@bld-lion-r5-078.build.releng.scl3.mozilla.com ~]# launchctl list | grep freshclam
   -       0       org.clamav.freshclam-init
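For reference, here is one reading of what the condition above is meant to express, written out in Python: act only when /etc/freshclam.conf exists and does not already contain a "Checks 0" line. This is a sketch of the intent, not the Puppet fix itself. The function name is made up, and skipping commented-out lines is an extra assumption that the grep in the original line does not make.

```python
from pathlib import Path


def freshclam_needs_disabling(conf_path="/etc/freshclam.conf"):
    """True when the config exists and has no active 'Checks 0' line."""
    path = Path(conf_path)
    if not path.is_file():
        # Mirrors the first test: no config file, nothing to change.
        return False
    for raw in path.read_text().splitlines():
        line = raw.strip()
        if line.startswith("#"):
            continue  # assumption: commented-out settings do not count
        if line.split() == ["Checks", "0"]:
            return False  # already disabled
    return True


if __name__ == "__main__":
    print(freshclam_needs_disabling())
```

Matching on whole tokens avoids treating a line such as "Checks 024" as already disabled, which a plain substring search would do.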


4. in the morning I was asked by Nigel if there were some changes to the linux machines, as he noticed some failures: https://treeherder.mozilla.org/logviewer.html#?job_id=4364889&repo=fx-team

   errors like: 'mach build' did not run successfully. Please check log for errors.
   on the #developers channel, it was suggested that the issue was not related to infrastructure but to clobber. 
   I don't know exactly what "clobber" refers to in this case

--> "clobbering" a tree refers to the process of erasing the build directory from a machine

Q1: where do we find the logs to see if a clobber has been run?

Q2: (I know this was briefly explained during the first week) some tips & tricks for working with Treeherder

https://api.pub.build.mozilla.org/docs/
https://api.pub.build.mozilla.org/docs/usage/clobberer/
https://wiki.mozilla.org/Auto-tools/Projects/Treeherder

You can see that each build clobbers the properties and scripts dirs - see clobber_scripts and clobber_properties here for a particular job and machine:
http://buildbot-master86.bb.releng.scl3.mozilla.com:8001/builders/OS%20X%2010.7%2064-bit%20mozilla-inbound%20leak%20test%20build/builds/1967

During the run script stage, it checks to see if a clobber needs to be run, for example:
http://buildbot-master86.bb.releng.scl3.mozilla.com:8001/builders/OS%20X%2010.7%2064-bit%20mozilla-inbound%20leak%20test%20build/builds/1967/steps/run_script/logs/stdio

You can see:

01:46:04 INFO - #####
01:46:04 INFO - ##### Running clobber step.
01:46:04 INFO - #####
01:46:04 INFO - Running pre-action listener: influxdb_recording_pre_action
01:46:04 INFO - Running main action method: clobber
01:46:04 INFO - retry: Calling run_command with args: [['/builds/slave/m-in-m64-d-0000000000000000000/scripts/external_tools/clobberer.py', '-s', 'scripts', '-s', 'logs', '-s', 'buildprops.json', '-s', 'token', '-s', 'oauth.txt', '-t', '168', 'https://api.pub.build.mozilla.org/clobberer/lastclobber', 'mozilla-inbound', 'OS X 10.7 64-bit mozilla-inbound leak test build', 'm-in-m64-d-0000000000000000000', 'bld-lion-r5-083', 'http://buildbot-master86.bb.releng.scl3.mozilla.com:8001/']], kwargs: {'error_list': [{'substr': 'Error contacting server', 'explanation': 'Error contacting server for clobberer information.', 'level': 'error'}], 'cwd': '/builds/slave'}, attempt #1
01:46:04 INFO - Running command: ['/builds/slave/m-in-m64-d-0000000000000000000/scripts/external_tools/clobberer.py', '-s', 'scripts', '-s', 'logs', '-s', 'buildprops.json', '-s', 'token', '-s', 'oauth.txt', '-t', '168', 'https://api.pub.build.mozilla.org/clobberer/lastclobber', 'mozilla-inbound', 'OS X 10.7 64-bit mozilla-inbound leak test build', 'm-in-m64-d-0000000000000000000', 'bld-lion-r5-083', 'http://buildbot-master86.bb.releng.scl3.mozilla.com:8001/'] in /builds/slave
01:46:04 INFO - Copy/paste: /builds/slave/m-in-m64-d-0000000000000000000/scripts/external_tools/clobberer.py -s scripts -s logs -s buildprops.json -s token -s oauth.txt -t 168 https://api.pub.build.mozilla.org/clobberer/lastclobber mozilla-inbound "OS X 10.7 64-bit mozilla-inbound leak test build" m-in-m64-d-0000000000000000000 bld-lion-r5-083 http://buildbot-master86.bb.releng.scl3.mozilla.com:8001/
01:46:05 INFO - Checking clobber URL: https://api.pub.build.mozilla.org/clobberer/lastclobber?master=http%3A%2F%2Fbuildbot-master86.bb.releng.scl3.mozilla.com%3A8001%2F&slave=bld-lion-r5-083&builddir=m-in-m64-d-0000000000000000000&branch=mozilla-inbound&buildername=OS+X+10.7+64-bit+mozilla-inbound+leak+test+build
01:46:05 INFO - m-in-m64-d-0000000000000000000:Our last clobber date: 2015-08-18 10:12:57
01:46:05 INFO - m-in-m64-d-0000000000000000000:Server clobber date: 2015-08-10 15:40:31
01:46:05 INFO - Return code: 0
01:46:05 INFO - Running command: ['/builds/slave/m-in-m64-d-0000000000000000000/scripts/external_tools/purge_builds.py', '-s', '10', '--max-age', '14', '--not', 'info', '--not', 'rel-*:10d', '--not', 'tb-rel-*:10d', '/builds/slave']
01:46:05 INFO - Copy/paste: /builds/slave/m-in-m64-d-0000000000000000000/scripts/external_tools/purge_builds.py -s 10 --max-age 14 --not info --not rel-*:10d --not tb-rel-*:10d /builds/slave
01:46:05 INFO - Using env: {'HG_SHARE_BASE_DIR': '/builds/hg-shared',
01:46:05 INFO - 'PATH': '/tools/python/bin:/tools/buildbot/bin:/opt/local/bin:/usr/bin:/bin:/usr/sbin:/sbin:/usr/local/bin:/usr/X11/bin'}
01:46:05 INFO - Ignored directory 'rel-m-beta-m64_beta_u_v_3-0000' exceeds cutoff time
01:46:05 INFO - Ignored directory 'rel-m-beta-m64_rpk_1-000000000' exceeds cutoff time
01:46:05 INFO - Ignored directory 'rel-m-beta-m64_rpk_3-000000000' exceeds cutoff time
01:46:05 INFO - Ignored directory 'rel-m-rel-m64_rpk_5-0000000000' exceeds cutoff time
01:46:05 INFO - Ignored directory 'tb-rel-c-esr38-m64_bld-0000000' exceeds cutoff time
01:46:05 INFO - Ignored directory 'tb-rel-c-esr38-m64_rpk_1-00000' exceeds cutoff time
01:46:05 INFO - 519.60 GB of space available
01:46:05 INFO - Return code: 0
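To answer Q1 from a saved log, the two clobber dates in the log above are the lines to look for. Below is a minimal Python sketch that pulls them out and compares them. It assumes a clobber is requested only when the server's date is newer than the slave's last clobber, which is my reading of the log and not taken from clobberer.py itself. The function names are made up for this example.

```python
import re
from datetime import datetime

# Matches both "Our last clobber date: ..." and "Server clobber date: ...".
_DATE_RE = re.compile(
    r"(Our last|Server) clobber date:\s+"
    r"(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})"
)


def clobber_dates(log_text):
    """Return {'ours': datetime, 'server': datetime} for the dates found."""
    dates = {}
    for label, stamp in _DATE_RE.findall(log_text):
        key = "ours" if label == "Our last" else "server"
        dates[key] = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S")
    return dates


def clobber_requested(log_text):
    """Assumed rule: clobber when the server date is newer than ours."""
    dates = clobber_dates(log_text)
    if "server" not in dates:
        return False  # nothing requested on the server side
    if "ours" not in dates:
        return True  # never clobbered, but the server asks for one
    return dates["server"] > dates["ours"]
```

For the log above (our last clobber 2015-08-18, server date 2015-08-10) this reports that no clobber is requested, which matches the fact that the step goes straight on to purge_builds.py.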


https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&revision=36e7f7148cf8

clobberer https://api.pub.build.mozilla.org/login_request?next=%2Fclobberer%2F

Will look for logs for clobber -- ^^ see above

2015-08-26

1. common tasks:

   re-imaged the machines that returned from loan and enabled them in slavealloc
   loaned one Win 8 and one Win XP machine to Kim :)


2. https://bugzilla.mozilla.org/show_bug.cgi?id=1175291 - disable freshclam

   tried both with the example suggested by Kim and Amy's patch
   received errors on t-snow and t-yosemite slaves, debugging 


2015-08-27

1. common tasks:

   loaned two AWS slaves to Ted Mielczarek:
   --> tst-linux32-ec2-ted  
   --> tst-linux64-ec2-ted
   https://bugzilla.mozilla.org/show_bug.cgi?id=1037839 - t-snow-r4-0089
   --> this slave had been disabled due to GC crashes
   --> re-imaged and returned it to the pool, waiting to see how it goes

UPDATE: already completed 10+ jobs without issues --> marked the bug as Resolved

2. received 5 consecutive alerts like: [sns alert] Thu 03:08:45 PDT t-snow-r4-0032 puppet-agent: /usr/local/bin/screenresolution set 1600x1200x32 returned 1 instead of one of [0]

   connected to the slave --> "/usr/local/bin/screenresolution get"

"2015-08-27 03:42:38.042 screenresolution[8382:903] Error: failed to get list of active displays"

   tried the same command on t-snow-r4-0089:

"2015-08-27 03:42:07.742 screenresolution[1109:903] Display 0: 1600x1200x32"

   restarted the slave, ran the command again:

"2015-08-27 03:46:53.403 screenresolution[655:903] Display 0: 1600x1200x32" -- seems ok now

--> Yes, this alert usually means that we need to reboot the machine. We should add a check in runner to check for this alert and reboot, perhaps open a bug to investigate.

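The check suggested above could start from something like the sketch below, which only looks at the text printed by "screenresolution get". The two output shapes are the ones quoted in these notes. The function name and the rule "no display reported means reboot" are assumptions for illustration, and nothing here is actual runner code.

```python
import re

# Shape of a healthy line, e.g. "Display 0: 1600x1200x32".
_DISPLAY_RE = re.compile(r"Display \d+: \d+x\d+x\d+")

_ERROR_TEXT = "failed to get list of active displays"


def needs_reboot(output):
    """Decide from `screenresolution get` output whether to reboot.

    Assumed rule: reboot when the tool reports the display-list error
    or does not report any display at all.
    """
    if _ERROR_TEXT in output:
        return True
    return _DISPLAY_RE.search(output) is None
```

A runner task would feed this the captured output of the command and trigger the reboot itself; that part is left out because it depends on how runner tasks are wired up.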
3. also received several alerts from relengbot:

   <relengbot> [sns alert] Thu 05:08:02 PDT buildbot-master79.bb.releng.usw2.mozilla.com watch_twistd_log.py: Count: 633 | First instance: 2015-08-27 04:00:03-0700 | Most recent instance: 2015-08-27 04:29:00-0700 | Twistd exception: twisted.cred.error.UnauthorizedLogin - unknown 10.132.67.84
   investigated, but I don't know where those IPs are coming from
   asked in #releng channel for suggestions
   This is a machine trying to connect to the buildbot master that either 1) has the wrong password, 2) has the wrong username, or 3) is not listed in the range that can connect to the master
   Looks like this machine 10.132.67.84 is try-linux64-spot-407 (I looked in the AWS console)
   aselagea: http://buildbot-master79.bb.releng.usw2.mozilla.com:8101/buildslaves
   kmoir: aselagea: http://buildbot-master79.bb.releng.usw2.mozilla.com:8101/buildslaves/try-linux64-spot-407


[Broker,24354,10.132.67.182] BuildSlave.detached(try-linux64-spot-349)
2015-08-27 04:26:08-0700 [Broker,25204,10.132.67.48] Peer will receive following PB traceback:
2015-08-27 04:26:08-0700 [Broker,25204,10.132.67.48] Unhandled Error
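When several of these alerts arrive, it helps to pull out the count, the exception and the IP so the address can be looked up (here it was done by hand in the AWS console). A minimal sketch, assuming the alert text always has the shape quoted above. The function name and the IP-to-slave table are placeholders for this example.

```python
import re

_ALERT_RE = re.compile(
    r"Count: (?P<count>\d+) \|.*?"
    r"Twistd exception: (?P<exc>[\w.]+) - unknown "
    r"(?P<ip>\d{1,3}(?:\.\d{1,3}){3})"
)


def parse_unauthorized_alert(text):
    """Return count, exception name and IP from an sns alert, or None."""
    match = _ALERT_RE.search(text)
    if match is None:
        return None
    return {
        "count": int(match.group("count")),
        "exception": match.group("exc"),
        "ip": match.group("ip"),
    }


# Placeholder lookup table; the one entry comes from the notes above.
KNOWN_SLAVES = {"10.132.67.84": "try-linux64-spot-407"}

if __name__ == "__main__":
    alert = (
        "Count: 633 | First instance: 2015-08-27 04:00:03-0700 | "
        "Most recent instance: 2015-08-27 04:29:00-0700 | Twistd exception: "
        "twisted.cred.error.UnauthorizedLogin - unknown 10.132.67.84"
    )
    info = parse_unauthorized_alert(alert)
    print(info, KNOWN_SLAVES.get(info["ip"], "unknown slave"))
```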

4. https://bugzilla.mozilla.org/show_bug.cgi?id=1158729 - manage_masters.py retry_dead_queue should run periodically

   I am now able to run the script from my local machine using Cygwin
   https://etherpad.mozilla.org/buildduty-notes
   tried to find a way to implement a cronjob to run the command on all the masters
   looked at bug https://bugzilla.mozilla.org/show_bug.cgi?id=1057888 - Automate monthly graceful restart of buildbot masters for a clue
   my guess is that we can use the options -D, -M, -H to select the masters

Q: which machine should be used to set the cronjob?

A: Do a needinfo on the bug for Rail to ask about that

5. https://bugzilla.mozilla.org/show_bug.cgi?id=1175291 - disable freshclam on OSX builders

   loaned two slaves: t-snow-r4-0156 and t-yosemite-r5-0090
   modified slave.pp file to test Amy's patch
   added t-snow-r4-0156 and t-yosemite-r5-0090 in moco-nodes.pp configuration file
   working fine on bld-lion-r5-078
   on t-snow and t-yosemite I get some ugly errors:

Could not set 'present' on ensure: No such file or directory - https://releng-puppet2.srv.releng.scl3.mozilla.com/repos/DMGs/10.10/git-1.7.9.4-1.dmg at 38:/etc/puppet/environments/aselagea/modules/packages/manifests/pkgdmg.pp

   debugging...
   Kim will look at yosemite - didn't get a chance to do that today because we had a lot of other issues, will look at it first thing tomorrow


https://secure.pub.build.mozilla.org/buildapi/self-serve
https://secure.pub.build.mozilla.org/buildapi/self-serve/mozilla-central
https://bugzilla.mozilla.org/show_bug.cgi?id=1198900


Build changeset is here https://secure.pub.build.mozilla.org/buildapi/self-serve/mozilla-central f8086bd3c84f

search for build changeset on this page https://hg.mozilla.org/mozilla-central/summary f8086bd3c84f

changeset referred to in bug is https://hg.mozilla.org/mozilla-central/summary 8205877b3b30

2015-08-27

Yesterday we had some buildduty issues; I thought you might be interested in reading about them.

In the morning, SETA stopped working for 6 hrs on Aug 27, then started working again. Why?
https://bugzilla.mozilla.org/show_bug.cgi?id=1199347
(buildbot-master81, in the twistd.log)

This caused "Very high test backlog":
https://bugzilla.mozilla.org/show_bug.cgi?id=1199226

In the afternoon: "stage.mozilla.org:http - 301 /(.*) to archive.mozilla.org/$1"
https://bugzilla.mozilla.org/show_bug.cgi?id=1198296
Summary: IT made a change to redirect stage.mozilla.org to archive.mozilla.org. The IP address changed, which meant we didn't have the ssh key to scp content there.

"ftp 2-8.dmz.scl3.mozilla.com:httpd max clients is WARNING: Using 2400 out of 2400 Clients"
https://bugzilla.mozilla.org/show_bug.cgi?id=1199363
An ftp server in the load balancing cluster failed, and one had to be put in passive mode. It is unclear whether this is the result of extra traffic from bug 1198296 above.


We can discuss tomorrow. Also, thanks for checking the state of the AWS masters starting instances, that was very helpful.

2015-08-28

[alin] 1. common tasks:

   bugs 1198820, 1198821 - terminated instances, removed records from inventory, revoked VPN access
   re-imaged several slaves, monitoring:
       --> b-2008-ix-0025: 0 jobs passed | 0 failed
       --> b-2008-ix-0016: 0 jobs passed | 0 failed
       --> t-w732-ix-001:   4 jobs passed | 0 failed

2. https://bugzilla.mozilla.org/show_bug.cgi?id=1197071

   t-snow-r4-0093 has been decommissioned by the DCOps team
   opened bug to implement the required changes on our side: 1199586
   attached patch to remove slave from configs --> please review it when you have time :)
   also removed the slave from slavealloc


3. https://bugzilla.mozilla.org/show_bug.cgi?id=1158729 - manage_masters.py retry_dead_queue should run periodically

   did a needinfo on the bug for Rail, waiting for reply


4. https://bugzilla.mozilla.org/show_bug.cgi?id=1175291 - disable freshclam on OSX builders

   the issue does not seem to be related to the patch
   when you have time, please take a look
   loaned slaves: t-snow-r4-0156 and t-yosemite-r5-0090


[vlad] 1. Requested access for the wiki page

2. Helped Alin figure out why Puppet is not working on the t-yosemite-r5 and t-snow-r4 slaves


Additional bugs to work on

1) manage_masters.py retry_dead_queue should run periodically
   https://bugzilla.mozilla.org/show_bug.cgi?id=1158729
   Way to stop all these alerts :-)

2) Add a runner task to check resolution on Windows testers before starting buildbot
   https://bugzilla.mozilla.org/show_bug.cgi?id=1190868
   https://github.com/mozilla/build-runner

3) Add T testing to the trychooser UI
   https://bugzilla.mozilla.org/show_bug.cgi?id=1141280
   Code is in hg.mozilla.org/buildtools and trychooser dir

FYI Friday afternoon we had this problem https://bugzilla.mozilla.org/show_bug.cgi?id=1199524