CIDuty/SVMeetings/Aug17-Aug21

Upcoming vacation/PTO:

  • alin - Aug 31 - Sep 11
  • coop - Aug 3, Aug 17-28
  • kmoir - Aug 3, Aug 24
  • otilia - Aug 10-21, half day on Jul 31
  • vlad - Jul 31; Aug 14-27
  • Monday Aug 3 - Holiday in Canada


2015-08-17

1. Kim - please create another wiki page with the Etherpad notes from the last week :) Done. Opened https://bugzilla.mozilla.org/show_bug.cgi?id=1195301 to request wiki access for the two of you.

2. https://bugzilla.mozilla.org/show_bug.cgi?id=1194786 - investigated: re-imaged the slave and enabled it in slavealloc. The slave is now connected to a master; waiting to see if it takes jobs. UPDATE: taking jobs, marked as resolved.

3. https://bugzilla.mozilla.org/show_bug.cgi?id=1193054 - the loaned EC2 instance is no longer needed: removed the user records from inventory, terminated the instance, revoked VPN access, marked the problem tracking bug as resolved.

4. https://bugzilla.mozilla.org/show_bug.cgi?id=1063024

   https://bugzilla.mozilla.org/show_bug.cgi?id=1189049
   https://bugzilla.mozilla.org/show_bug.cgi?id=823235

re-imaged the machines and enabled them in slavealloc, waiting to see if they take jobs. UPDATES:

   --> b-2008-ix-0115 taking jobs
   

5. https://bugzilla.mozilla.org/show_bug.cgi?id=1194604 - t-snow-r4-0133 is no longer needed: revoked VPN access and re-imaged the slave. After the re-image, the slave is no longer accessible (only ping works, nothing else). Q: should I open a bug to DCOps to re-image the slave? --> submitted bug https://bugzilla.mozilla.org/show_bug.cgi?id=1195313

6. on 08/13/2015 (guess it's August 12 on your side) we received an e-mail from Q with the subject "[RelEng] Down time for WDS (windows imaging server) upgrade". Q: is there a specific time of day when re-imaging windows machines is available? https://bugzilla.mozilla.org/show_bug.cgi?id=936042 No, this will be intermittent as we shuffle VMs around. I'll try to get some definitive info from Q about possible breakage scenarios and when it should/shouldn't be safe to try re-images.


"if we have no code changes between nightly builds, we still build and release a new one the day after even without any changes?"

bug 991707

Kim will go look for buildduty bugs other than re-imaging.

Disable freshclam on OSX builders: https://bugzilla.mozilla.org/show_bug.cgi?id=1175291

Steps to do this:

1) Can you ssh as yourself to releng-puppet2.srv.releng.scl3.mozilla.com? If not, I'll ask for the correct rights.
2) After connecting: cd /etc/puppet/environments, mkdir aselagea, cd aselagea, then clone hg.mozilla.org/build/puppet as described here: https://wiki.mozilla.org/ReleaseEngineering/PuppetAgain/HowTo/Set_up_a_user_environment#Common
3) Look at the code in puppet/modules/disableservices/manifests/common.pp to see how to disable the service.
4) Once you have a patch to disable it, loan yourself a bld-lion-r5-* machine to test the patch.

References: https://wiki.mozilla.org/ReleaseEngineering/PuppetAgain/HowTo/Set_up_a_user_environment
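
A minimal sketch of steps 1) and 2) above, assuming shell access works and following the PuppetAgain page linked in the references (the environment name aselagea comes from these notes):

   # connect to the puppet master and create a user environment
   ssh releng-puppet2.srv.releng.scl3.mozilla.com
   cd /etc/puppet/environments
   mkdir aselagea && cd aselagea
   # clone the puppet repo into the new environment
   hg clone https://hg.mozilla.org/build/puppet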

Another possible bug: Add a runner task to check resolution on Windows testers before starting buildbot - https://bugzilla.mozilla.org/show_bug.cgi?id=1190868 https://github.com/mozilla/build-runner

2015-08-18

Fix the slaves broken by talos's inability to deploy an update: https://bugzilla.mozilla.org/show_bug.cgi?id=1141416 - these machines can be re-imaged now.

[alin] 1. https://bugzilla.mozilla.org/show_bug.cgi?id=1191071 - t-snow-r4-0147: re-imaged, revoked VPN access and returned it to production. It started taking jobs, marked the bug as resolved.

2. https://bugzilla.mozilla.org/show_bug.cgi?id=1194211 - panda-0345 decomm: noticed your comment stating that the change is in production. https://wiki.mozilla.org/ReleaseEngineering/Maintenance#Reconfigs_.2F_Deployments does not list the bug there, whereas when bld-lion-r5-055 was decommissioned we can see the bug listed. Q: is this due to the fact that the reconfig occurred only yesterday while the patch landed on 2015-08-13? This is due to the fact that my reconfig didn't update the maintenance page for some reason, I'll investigate. Fixed the page.

3. https://bugzilla.mozilla.org/show_bug.cgi?id=936042 - t-w864-ix-092: investigated both yesterday and today, could not re-image the machine. Ping does not work; attempted to reboot it but failed. Managed to connect via the KVM console and perform a system restore, logged in as root and noticed that the slave does not have any internet connection ("Network cable unplugged"). Also, the resolution is lower than it should be (1024x768). Q: my guess here is that we should open a bug to DCOps to run some diagnostics on this slave. Yes, good idea --> created bug 1195785 for DCOps.

4. re-imaged 5 32-bit slaves and one 64-bit machine:

   talos-linux32-ix-008 - OK
   talos-linux32-ix-001 - connected to Coop's master, it does not take jobs at the moment
   talos-linux32-ix-026 - OK
   talos-linux32-ix-022 - failed the first 2 jobs, Ryan restarted it
   talos-linux32-ix-003 - OK
   talos-linux64-ix-027 - OK

   --> marked most of the bugs as resolved. Great work!

5. Alert from relengbot: [sns alert] Tue 05:08:06 PDT buildbot-master87.bb.releng.scl3.mozilla.com maybe_reconfig.sh: ERROR - Reconfig lockfile is older than 120 minutes. Didn't manage to investigate, however it would be nice to know what it means. It means that the reconfig is somehow stuck and didn't finish. See https://wiki.mozilla.org/ReleaseEngineering/Buildduty/Reconfigs for ideas on how to fix it. I looked briefly at it, I don't know what's wrong with it, still looking.

6. started looking over the bug for disabling the freshclam service on OSX builders. Wonderful, let me know if you have questions. Look at /etc/freshclam.conf - it seems to have some parameters you can use to modify it. Test on the command line first and then implement it with puppet.
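
As a first command-line check, something like this should show the relevant parameter (a sketch; it assumes the config lives at the standard /etc/freshclam.conf path mentioned above):

   # show the update-check frequency setting in the freshclam config
   grep -n '^Checks' /etc/freshclam.conf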


2015-08-19

  • Increase SETA coalescing to every 7 pushes and every 60 min

https://bugzilla.mozilla.org/show_bug.cgi?id=1195803 Just as an FYI, this will be enabled tomorrow, which will reduce the number of tests run on every push and should reduce our high pending counts.

[alin] 1. https://bugzilla.mozilla.org/show_bug.cgi?id=1193734 - t-snow-r4-0094: this slave has been decommissioned; opened a bug to RelEng to implement the changes: https://bugzilla.mozilla.org/show_bug.cgi?id=1196217. Noticed that this type of slave is not listed in "buildbot-configs\mozilla\production_config.py"; searched for a configuration file, but had little luck --> I would need some suggestions here. I'll look and add some pointers to the bug: http://hg.mozilla.org/build/buildbot-configs/file/d5adde30c267/mozilla-tests/production_config.py#l31

2. https://secure.pub.build.mozilla.org/builddata/reports/slave_health/slave.html?name=talos-linux32-ix-022 - this slave is still burning jobs. Tried another re-image and enabled it in slavealloc; waiting to see if it takes jobs and how it runs them. Q: in case it still fails the jobs, would it be a good idea to open a bug to DCOps for some diagnostics? Sure, sounds good --> bug to DCOps: https://bugzilla.mozilla.org/show_bug.cgi?id=1196281

3. Disabling the freshclam service on OSX builders: it took me a while to figure out that ClamAV is the actual antivirus and freshclam is the automatic database update tool for ClamAV :) Looked over freshclam.conf and noticed a parameter that specifies the number of database checks per day; the default is 12 --> this should be switched to "Checks 0". Used a "sed" expression that looks for a string like "Checks 12" and changes it to "Checks 0" (a sketch of the command is shown after this item). I tested it locally and it worked, so I updated common.pp in my environment to do the same thing; also obtained diff.txt (the patch). When I ran "puppet agent --test" on the slave I got:

   Error: Could not request certificate: Error 400 on SERVER: this master is not a CA

The command used:

   puppet agent --test --server=releng-puppet2.srv.releng.scl3.mozilla.com --environment=aselagea --pluginsync --ssldir=/var/lib/puppet/ssl

I don't know why this is happening; I debugged for about an hour. One thing I would suggest is to re-image bld-lion-r5-078 and then NOT remove the files listed here https://wiki.mozilla.org/ReleaseEngineering/How_To/Loan_a_Slave#bld-lion-r5.2C_talos-mtnlion-r5.2C_t-yosemite-r5 in order to keep the ssh and puppet files. Also, I added a line to your manifests/moco-nodes.pp:

node "bld-lion-r5-078.build.releng.scl3.mozilla.com" {

manifests/nodes.pp:node "bld-lion-r5-078.build.releng.scl3.mozilla.com" { so it would be pinned to the master you are testing. Otherwise, it will run puppet against the production masters and remove your changes ok, thanks for looking into this
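
A minimal sketch of the sed change described in item 3 above, assuming the config literally contains the line "Checks 12"; in practice this would go into the puppet common.pp patch rather than being run by hand:

   # drop the number of freshclam database checks per day from 12 to 0
   sed -i.bak 's/^Checks 12$/Checks 0/' /etc/freshclam.conf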

4. https://bugzilla.mozilla.org/show_bug.cgi?id=1195803 - Increase SETA coalescing to every 7 pushes and every 60 min: as mentioned, the number of tests run on every push will be reduced: (5, 1800) <=> (10, 3600) will become (7, 3600). It would be useful to know more details about the process :) https://elvis314.wordpress.com/2015/02/06/seta-search-for-extraneous-test-automation/ http://relengofthenerds.blogspot.ca/2015/04/less-testing-same-great-firefox-taste.html


2015-08-20

1. received some alerts from nagios:

   <nagios-releng> Thu 01:16:23 PDT [4007] aws-manager2.srv.releng.scl3.mozilla.com:File Age - /builds/aws_manager/aws_stop_idle.log is WARNING: FILE_AGE WARNING: /builds/aws_manager/aws_stop_idle.log is 663 seconds old and 1173723 bytes (http://m.mozilla.org/File+Age+-+/builds/aws_manager/aws_stop_idle.log)

Connected to the aws manager and looked over the log files: aws_watch_pending.log --> spot requests for different instances (bid 0.07); aws_stop_idle.log --> various info on the state of the instances. Could not find a reason for the alert, things went back to normal soon thereafter.
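
A quick way to check the file age by hand on the aws manager host (a sketch; the path is taken from the alert text):

   # show the last modification time of the log and its most recent entries
   stat -c '%y %n' /builds/aws_manager/aws_stop_idle.log
   tail -n 20 /builds/aws_manager/aws_stop_idle.log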

2. https://bugzilla.mozilla.org/show_bug.cgi?id=1061321 - b-2008-ix-0149: talked to Pete [:pmoore] and Nigel [:nigelb] on IRC; it looks like this machine has run out of free space, most of it occupied by the "builds" folder. Disabled the slave as the jobs were failing, re-imaged it and enabled it in slavealloc; waiting to see if it takes jobs and whether they complete successfully. UPDATE: started taking jobs, working fine. Alert for free disk space? Look at the runner code.

3. took care of two loan requests: https://bugzilla.mozilla.org/show_bug.cgi?id=1196399 https://bugzilla.mozilla.org/show_bug.cgi?id=1196602 (in progress). Q: (just to make sure) do we need to create a problem tracking bug for an EC2 instance? From what I noticed, we don't need to do that. No, you don't need to do that. We are not loaning existing machines but rather creating new ones, so that's okay.

4. Disabling the freshclam service on OSX builders: re-imaged bld-lion-r5-078, but without deleting the mentioned files. If I run puppet agent without specifying the ssl directory:

   puppet agent --test --environment=aselagea --server=releng-puppet2.srv.releng.scl3.mozilla.com
   Info: Retrieving pluginfacts
   Info: Retrieving plugin
   Info: Loading facts
   Info: Caching catalog for bld-lion-r5-078.build.releng.scl3.mozilla.com
   Info: Applying configuration version 'unknown'
   Notice: /Stage[main]/Cleanslate/File[/var/tmp/cleanslate]/ensure: removed
   Notice: /Stage[main]/Jacuzzi_metadata/Exec[get_jacuzzi_metadata]/returns: executed successfully
   Notice: Finished catalog run in 45.79 seconds

If I also specify the ssl directory, it seems to do nothing. Checked the logs, tried to figure out why it doesn't work (in progress).

If you run with --debug or --verbose, you might have more information presented to you.
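
For example, a sketch based on the command above with debug output enabled:

   # re-run the agent against the user environment with debug output
   puppet agent --test --debug --environment=aselagea \
     --server=releng-puppet2.srv.releng.scl3.mozilla.com \
     --ssldir=/var/lib/puppet/ssl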

[sns alert] Thu 02:08:31 PDT mobile-imaging-001.p1.releng.scl3.mozilla.com mozpool_inventorysync: raise RuntimeError('got status code %s from inventory' % r.status_code) - this is from the puppet error that Dustin talked to you about.

2015-08-21

Question from Callek: What 1 or 2 things about the current slaveloan process do you feel are the "most painful"? What would be the biggest win if automated?

1. took care of some common tasks: added treeherder-client 1.7 to the internal pypi mirror; restarted and monitored t-w864-ix-123 as it became non-responsive (UPDATE: OK); re-imaged t-xp32-ix-033 and enabled it in slavealloc, waiting to see if it takes jobs (UPDATE: taking jobs, marked the bug as resolved).

2. https://bugzilla.mozilla.org/show_bug.cgi?id=1196808 - loan request from Armen: disabled t-w864-ix-158 and waited for the current job to end; granted VPN access, moved it to the loaner OU and restarted it. VNC and SSH connections work fine, although I am not able to connect via RDP. Noticed that the default profile for Remote Desktop is "private", so it should be "public" in order to work; in my case I must be logged in as administrator to make such changes, if needed. Q: do we need to grant access to a public IP for Remote Desktop? Why am I asking this?

   --> https://bugzilla.mozilla.org/show_bug.cgi?id=1192345
   --> loaned a Windows Server machine to Pete Moore (b-2008-ix-0080)
   --> Pete sent me an e-mail that he is not able to connect via RDP, but it worked for VNC and SSH
   --> I made the change mentioned above; however, on Windows Server there is no need to be logged in as administrator to grant such permissions.
What are they using to connect via RDP on their desktop? For instance, I have a Windows RDP client app on my mac and can connect without an issue.

--> I don't know the OS and the client that Armen (or anyone who requests a loaner) uses to connect via RDP. I have Windows 8 and tried to connect using the Remote Desktop client that comes with Windows (and, as mentioned, I cannot do so). --> debugging.. --> yeah, I still cannot connect via RDP. Even though my computer and t-w864-ix-158 belong to the same VPN, I cannot establish a connection using Remote Desktop. From what I read, I do NOT need to grant access to public IPs if the computers are on the same VPN, meaning that the default profile (private) should be fine.


3. when dealing with dead jobs, according to https://wiki.mozilla.org/ReleaseEngineering/Queue_directories we either delete the jobs from the /dead directory or we call "manage_masters.py" with the "retry_dead_queue" sub-command:

   find / -name "manage_masters.py"
   /builds/buildbot/try1/tools/buildfarm/maintenance/manage_masters.py
   /builds/buildbot/queue/tools/buildfarm/maintenance/manage_masters.py

I tried to run something like:

   python /builds/buildbot/queue/tools/buildfarm/maintenance/manage_masters.py -c 'retry_dead_queue'

==> ImportError: No module named fabric.api (line 6). Q1: I don't seem to find where the "retry_dead_queue" sub-command is defined. Q2: is the script still functional? Kim will look and see if it still works. Yes, it works:

   Kims-MacBook-Pro:maintenance kmoir$ python manage_masters.py -f production-masters.json -H bm01-tests1-linux32 retry_dead_queue
   [buildbot-master01.bb.releng.use1.mozilla.com] run: find /dev/shm/queue/commands/dead -type f
   [buildbot-master01.bb.releng.use1.mozilla.com] run: find /dev/shm/queue/pulse/dead -type f

Do you have fabric installed?

   Kims-MacBook-Pro:maintenance kmoir$ pip freeze | grep -i fabric
   Fabric==1.4.3

If not, run pip install fabric to get the package installed locally
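
For example, reusing the paths and master name from the commands above (a sketch):

   # install fabric locally, then retry the dead-queue sub-command from the maintenance dir
   pip install fabric
   cd /builds/buildbot/queue/tools/buildfarm/maintenance
   python manage_masters.py -f production-masters.json -H bm01-tests1-linux32 retry_dead_queue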

4. https://bugzilla.mozilla.org/show_bug.cgi?id=1196723 - revocations failing due to invalid inventory: first of all, I'm sorry for the confusion generated here. I wanted to debug the issue, to see why puppet agent failed; the error received was: Error: Could not request certificate: Error 400 on SERVER: this master is not a CA. I'll continue to dig more into puppet (including certificates :)) to get myself more familiar with it. Dustin would know more about how this was actually done.

5. Question from Callek - What 1 or 2 things about the current slaveloan process do you feel are the "most painful"? What would be the biggest win if automated? To be honest, I would try to develop a script that receives the name of a certain machine as input and, according to its type, performs the necessary steps for loaning it (I know it would be pretty difficult). As a particular case, there's a python script that launches an EC2 instance. Right after it launches, it will try to connect to it but will fail, as the instance is still booting. The immediate action for this is to wait for 1200 seconds (20 minutes) and then try again (a sketch of an alternative is below). Do we need to wait that much time? - pass on to Callek - you can look at the cloud tools repo https://github.com/mozilla/build-cloud-tools/ to see how you can change the wait.
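
A hypothetical sketch of replacing the fixed 1200-second wait with an ssh poll; the real wait lives in build-cloud-tools, and the host name, user and timings below are placeholders:

   # poll ssh every 20 seconds instead of sleeping 1200 seconds up front
   HOST="ec2-instance-hostname"   # placeholder
   for i in $(seq 1 60); do
       if ssh -o ConnectTimeout=5 -o BatchMode=yes "root@$HOST" true 2>/dev/null; then
           echo "instance is reachable"
           break
       fi
       sleep 20
   done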

6. https://bugzilla.mozilla.org/show_bug.cgi?id=1175291

Additional bugs to work on

1) Add emulator-x86-kk builds to trychooser: https://bugzilla.mozilla.org/show_bug.cgi?id=1197235 - code is in hg.mozilla.org/buildtools, in the trychooser dir

2) manage_masters.py retry_dead_queue should run periodically: https://bugzilla.mozilla.org/show_bug.cgi?id=1158729 - a way to stop all these alerts :-)

3) Add a runner task to check resolution on Windows testers before starting buildbot https://bugzilla.mozilla.org/show_bug.cgi?id=1190868 https://github.com/mozilla/build-runner


4) Add T testing to the trychooser UI: https://bugzilla.mozilla.org/show_bug.cgi?id=1141280 - code is in hg.mozilla.org/buildtools, in the trychooser dir