
Upcoming vacation/PTO:

  • alin - aug31-sep11
  • coop - aug 3, aug 17-28
  • kmoir - aug 3, aug 24
  • otilia - aug10-21, half day on July31
  • vlad - jul31 ; aug14-aug27
  • Monday Aug 3 - Holiday in Canada


1. Kim - please create another wiki page with the Etherpad notes from the last week :) Done. Opened a request for wiki access for you two.

2. - investigate: re-imaged and enabled in slavealloc; the slave is now connected to a master, waiting to see if it takes jobs. UPDATE: taking jobs, marked as resolved.

3. the loaned EC2 instance is no longer needed: removed the user records from inventory, terminated the instance, revoked VPN access, and marked the problem tracking bug as resolved.


4. re-imaged machines and enabled them in slavealloc; waiting to see if they take jobs. UPDATES:

   --> b-2008-ix-0115 taking jobs

5. t-snow-r4-0133 is no longer needed: revoked VPN access and re-imaged the slave. After the re-image, the slave is no longer accessible (only ping works, nothing else). Q: should I open a bug to DCOps to re-image the slave? --> submitted bug

6. on 08/13/2015 (guess it's August 12 on your side) we received an e-mail from Q with the subject "[RelEng] Down time for WDS (windows imaging server) upgrade". Q: is there a specific time of the day when re-imaging windows machines is available? No, this will be intermittent as we shuffle VMs around. I'll try to get some definitive info from Q about possible breakage scenarios and when it should/shouldn't be safe to try reimages.

"if we have no code changes between nightly builds, do we still build and release a new one the day after, even without any changes?"


Kim will go look for buildduty bugs other than reimaging.

Disable freshclam on OSX builders. Steps to do this:
1) Can you ssh as yourself to ? If not, I'll ask for the correct rights.
2) After connecting: cd /etc/puppet/environments, mkdir aselagea, cd aselagea, clone as described here.
3) Look at the code in puppet/modules/disableservices/manifests/common.pp to see how to disable the service. Once you have a patch to disable it:
4) Loan yourself a bld-lion-r5-* machine to test the patch.
References

Another possible bug: Add a runner task to check resolution on Windows testers before starting buildbot
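A runner task like that could be sketched roughly as follows. This is only an illustration: the expected 1600x1200 resolution and all function names here are assumptions, not the actual runner code.

```python
def check_resolution(got, expected=(1600, 1200)):
    """Return True when the reported resolution matches the expected one.

    A real runner task would exit nonzero on mismatch so buildbot
    never starts on a mis-configured tester.
    """
    return tuple(got) == tuple(expected)

# On a real Windows tester, `got` would be queried from the OS, e.g. via
# ctypes.windll.user32.GetSystemMetrics(0) and GetSystemMetrics(1).
```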


Fix the slaves broken by talos's inability to deploy an update. These machines can be re-imaged now.

[alin] 1. - t-snow-r4-0147: re-imaged, revoked VPN access, and returned to production. Started taking jobs; marked the bug as resolved.

2. - panda-0345 decomm: noticed your comment stating that the change is in production, but the maintenance page does not list the bug there; when bld-lion-r5-055 was decommissioned, we can see the bug listed there. Q: is this due to the fact that the reconfig occurred only yesterday while the patch landed on 2015-08-13? This is due to the fact that my reconfig didn't update the maintenance page for some reason; I'll investigate. Fixed page.

3. - t-w864-ix-092: investigated both yesterday and today, could not re-image the machine. Ping does not work; attempted to reboot it but failed. Managed to connect via KVM console and perform a system restore. Logged in as root and noticed that the slave does not have any internet connection ("Network cable unplugged"); also, the resolution is lower than it should be (1024x768). Q: my guess here is that we should open a bug to DCOps to run some diagnostics on this slave. Yes, good idea --> created bug 1195785 to DCOps.

4. re-imaged 5 32-bit slaves and one 64-bit machine:
   talos-linux32-ix-008 - OK
   talos-linux32-ix-001 - connected to Coop's master, it does not take jobs at the moment
   talos-linux32-ix-026 - OK
   talos-linux32-ix-022 - failed the first 2 jobs, Ryan restarted it
   talos-linux32-ix-003 - OK
   talos-linux64-ix-027 - OK
   --> marked most of the bugs as resolved. Great work!

5. Alert from relengbot: [sns alert] Tue 05:08:06 PDT ERROR - Reconfig lockfile is older than 120 minutes. Didn't manage to investigate, however it would be nice to know what it means. It means that the reconfig is somehow stuck and didn't finish. See for ideas on how to fix. I looked briefly at it; I don't know what's wrong with it, still looking.

6. started looking over the bug for disabling the freshclam service on OSX builders. Wonderful, let me know if you have questions. Look at /etc/freshclam.conf; it seems to have some parameters you can use to modify it. Test on the command line first and then implement with puppet.


  • Increase SETA coalescing to every 7 pushes and every 60 min. Just as a FYI, this will be enabled tomorrow; it will reduce the number of tests run on every push, which should reduce our high pending counts.

[alin] 1. - t-snow-r4-0094: this slave has been decommissioned. Opened a bug to RelEng to implement the changes. Noticed that this type of slave is not listed in "buildbot-configs\mozilla\"; searched for a configuration file, but had little luck --> I would need some suggestions here. I'll look and add some pointers to the bug.

2. this slave is still burning jobs. Tried another re-image and enabled it in slavealloc; waiting to see if it takes jobs/how it runs them. Q: in case it still fails the jobs, would it be a good idea to open a bug to DCOps for some diagnostics? Sure, sounds good --> bug to DCOps:

3. Disabling freshclam service on OSX builders: took me a while to figure out that ClamAV is the actual antivirus and Freshclam is the automatic database update tool for ClamAV :) Looked over freshclam.conf and noticed a parameter that specifies the number of database checks per day; the default is 12 --> this should be switched to "Checks 0". Used a "sed" expression that looks for a string like "Checks 12" and changes it to "Checks 0". I tested it locally and it worked, so I updated common.pp in my environment to do the same thing. Also obtained diff.txt (patch). When I ran "puppet agent --test --environment=aselagea --pluginsync --ssldir=/var/lib/puppet/ssl" on the slave I got: Error: Could not request certificate: Error 400 on SERVER: this master is not a CA. I don't know why this is happening; I debugged for about an hour. One thing I would suggest is to re-image bld-lion-r5-078 and then NOT remove the files listed here, to keep the ssh and puppet files. Also, I added a line to your manifests/moco-nodes.pp

node "" {

manifests/nodes.pp: node "" { -- so it would be pinned to the master you are testing. Otherwise, it will run puppet against the production masters and remove your changes. ok, thanks for looking into this
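For reference, the sed substitution described above (flipping "Checks 12" to "Checks 0") is equivalent to this minimal sketch; the sample conf text is illustrative, not the real /etc/freshclam.conf.

```python
import re

# Illustrative freshclam.conf snippet; only the "Checks" line matters here.
SAMPLE_CONF = """DatabaseMirror database.clamav.net
Checks 12
"""

def disable_freshclam_checks(conf_text):
    """Flip the 'Checks 12' line (database checks per day) to 'Checks 0',
    mirroring the sed expression from the notes."""
    return re.sub(r"^Checks 12$", "Checks 0", conf_text, flags=re.MULTILINE)

print(disable_freshclam_checks(SAMPLE_CONF))
```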

4. - Increase SETA coalescing to every 7 pushes and every 60 min. As mentioned, the number of tests run on every push will be reduced: (5, 1800) <=> (10, 3600) will become (7, 3600). It would be useful to know more details about the process :)
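The coalescing rule above can be pictured with a small sketch like this; the function name is made up and SETA's real scheduling logic lives in the buildbot configs, but the (pushes, seconds) pair works the same way:

```python
def should_run_low_value_tests(pushes_since_last, seconds_since_last,
                               every_n_pushes=7, every_n_seconds=3600):
    """Coalescing: run the low-value tests only when enough pushes have
    accumulated or enough time has passed, whichever comes first.
    Defaults reflect the new (7, 3600) setting from the notes."""
    return (pushes_since_last >= every_n_pushes
            or seconds_since_last >= every_n_seconds)
```

Raising the thresholds from (5, 1800) means the low-value tests are skipped on more intermediate pushes, which is where the pending-count reduction comes from.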


1. received some alerts from nagios: <nagios-releng> Thu 01:16:23 PDT [4007] Age - /builds/aws_manager/aws_stop_idle.log is WARNING: FILE_AGE WARNING: /builds/aws_manager/aws_stop_idle.log is 663 seconds old and 1173723 bytes. Connected to aws manager and looked over the log files: aws_watch_pending.log --> spot requests for different instances (bid 0.07); aws_stop_idle.log --> various info on the state of the instances. Could not find a reason for the alert; things went back to normal soon thereafter.
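The FILE_AGE alert fires when the log's mtime gets too old (here 663 seconds). A minimal sketch of that kind of check follows; the function names and the 600-second warning threshold are assumptions, not the actual nagios plugin:

```python
import os
import time

def file_age_seconds(path):
    """Seconds since the file was last modified."""
    return time.time() - os.path.getmtime(path)

def check_file_age(path, warn_after=600):
    """Nagios-style check: WARNING once the file is older than warn_after
    seconds, i.e. the process writing it has probably stalled."""
    return "WARNING" if file_age_seconds(path) > warn_after else "OK"
```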

2. - b-2008-ix-0149: talked to Pete [:pmoore] and Nigel [:nigelb] on IRC. It looks like this machine has run out of free space; most of the space is occupied by the "builds" folder. Disabled the slave as the jobs were failing, re-imaged it and enabled it in slavealloc; waiting to see if it takes jobs/they are completed successfully. UPDATE: started taking jobs, working fine. Alert for free disk space? Look at runner code.

3. took care of two loan requests: (in progress). Q: (just to make sure) do we need to create a problem tracking bug for an EC2 instance? From what I noticed, we don't need to do that. No, you don't need to do that. We are not loaning existing machines but rather creating new ones, so that's okay.

4. Disabling freshclam service on OSX builders: re-imaged bld-lion-r5-078, but without deleting the mentioned files. If I run puppet agent without specifying the ssl directory:

   puppet agent --test --environment=aselagea
   Info: Retrieving pluginfacts
   Info: Retrieving plugin
   Info: Loading facts
   Info: Caching catalog for
   Info: Applying configuration version 'unknown'
   Notice: /Stage[main]/Cleanslate/File[/var/tmp/cleanslate]/ensure: removed
   Notice: /Stage[main]/Jacuzzi_metadata/Exec[get_jacuzzi_metadata]/returns: executed successfully
   Notice: Finished catalog run in 45.79 seconds

If I also specify the ssl directory, it seems to do nothing. Checked the logs, tried to figure out why it doesn't work (in progress).

If you run with --debug or --verbose you might have more information presented to you

[sns alert] Thu 02:08:31 PDT mozpool_inventorysync: raise RuntimeError('got status code %s from inventory' % r.status_code) This is from the puppet error that Dustin talked to you about.


Question from Callek: what 1 or 2 things about the current slaveloan process do they feel are the "most painful" for them? What would be the biggest win if automated?

1. took care of some common tasks: added treeherder-client 1.7 to the internal pypi mirror; restarted and monitored t-w864-ix-123 as it became non-responsive (UPDATE: OK); re-imaged t-xp32-ix-033 and enabled it in slavealloc, waiting to see if it takes jobs (UPDATE: taking jobs, marked the bug as resolved).

2. - loan request from Armen: disabled t-w864-ix-158, waited for the current job to end, granted VPN access, moved to loaner OU and restarted. VNC and SSH connections work fine, although I am not able to connect via RDP. Noticed that the default profile for Remote Desktop is "private"; it should be "public" in order to work in my case. I must be logged in as administrator to make such changes, if needed. Q: do we need to grant access to a public IP for Remote Desktop? Why am I asking this?

   --> loaned a Windows Server machine to Pete Moore (b-2008-ix-0080)
   --> Pete sent me an e-mail that he is not able to connect via RDP, but it worked for VNC and SSH
   --> I did the change mentioned above, however on Windows Server there is no need to be logged in as administrator to grant such permissions.
  What are they using to connect via rdp on their desktop?  For instance, I have a windows rdp client app on my mac to connect, and can connect without an issue.

--> I don't know the OS and the client that Armen (or anyone that requests a loaner) uses to connect via RDP. I have Windows 8 and tried to connect using the Remote Desktop client that comes with Windows (and, as mentioned, I cannot do so). --> debugging... --> yeah, I still cannot connect via RDP. Even though my computer and t-w864-ix-158 belong to the same VPN, I cannot establish a connection using Remote Desktop. From what I read, I do NOT need to grant access to public IPs if the computers are on the same VPN, meaning that the default profile (private) should be fine.

3. when dealing with dead jobs, according to : we either delete the jobs from the /dead directory or we call "" with the "retry_dead_queue" sub-command.

   find / -name ""

I tried to run something like:

   python /builds/buildbot/queue/tools/buildfarm/maintenance/ -c 'retry_dead_queue'

==> ImportError: No module named fabric.api => (line 6)
Q1: I don't seem to find where the "retry_dead_queue" sub-command is defined
Q2: is the script still functional? Kim will look and see if it still works. Yes, it works:

   Kims-MacBook-Pro:maintenance kmoir$ python -f production-masters.json -H bm01-tests1-linux32 retry_dead_queue
   [] run: find /dev/shm/queue/commands/dead -type f
   [] run: find /dev/shm/queue/pulse/dead -type f

Do you have fabric installed?

   Kims-MacBook-Pro:maintenance kmoir$ pip freeze | grep -i fabric
   Fabric==1.4.3

If not, run "pip install fabric" to get the package installed locally.
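Judging from the find output above, retry_dead_queue operates on files under the queue's dead/ directories. A rough local sketch of that idea follows; the directory layout and the requeue-into-new/ behaviour are assumptions based on the notes, not the script's actual code:

```python
import os

def retry_dead_queue(queue_dir):
    """Move files out of <queue_dir>/dead back into <queue_dir>/new so the
    queue processor picks them up again. The dead/ location mirrors the
    find output in the notes (/dev/shm/queue/*/dead); new/ is assumed."""
    dead = os.path.join(queue_dir, "dead")
    new = os.path.join(queue_dir, "new")
    os.makedirs(new, exist_ok=True)
    retried = []
    for name in sorted(os.listdir(dead)):
        src = os.path.join(dead, name)
        if os.path.isfile(src):
            os.rename(src, os.path.join(new, name))
            retried.append(name)
    return retried
```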

4. - revocations failing due to invalid inventory: first of all, I'm sorry for the confusion generated here. I wanted to debug the issue, to see why puppet agent failed. The error received: Error: Could not request certificate: Error 400 on SERVER: this master is not a CA. I'll continue to dig more into puppet (including certificates :)) to get myself more familiar with it. Dustin would know more about how this was actually done.

5. Question from Callek - what 1 or 2 things about the current slaveloan process do they feel are the "most painful"? What would be the biggest win if automated? To be honest, I would try to develop a script that receives the name of a certain machine as input and, according to its type, performs the necessary steps for loaning it (I know it would be pretty difficult). As a particular case, there's a python script that launches an EC2 instance. Right after it launches, it will try to connect to it but will fail, as the instance is still booting. The immediate action for this is to wait 1200 seconds (20 minutes) and then try again. Do we need to wait that long? - pass on to Callek - you can look at the cloud-tools repo to see how you can change the wait.
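Instead of a flat 1200-second sleep, the launch script could poll until the instance's ssh port answers; this sketch is an illustration of that idea (the function name, 15-second interval, and 5-second connect timeout are assumptions), not the actual cloud-tools code:

```python
import socket
import time

def wait_for_ssh(host, port=22, timeout=1200, interval=15):
    """Poll the ssh port instead of sleeping a flat 20 minutes.

    Returns True as soon as a TCP connection succeeds, or False if the
    deadline passes without the port ever answering."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            with socket.create_connection((host, port), timeout=5):
                return True
        except OSError:
            time.sleep(interval)
    return False
```

This keeps the same worst-case wait but lets a fast-booting instance proceed as soon as sshd is up.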


Additional bugs to work on

1) Add emulator-x86-kk builds to trychooser Code is in and trychooser dir

2) retry_dead_queue should run periodically. A way to stop all these alerts :-)

3) Add a runner task to check resolution on Windows testers before starting buildbot

4) Add T testing to the trychooser UI Code is in and trychooser dir