Buildduty notes


Upcoming vacation/PTO:

  • alin - Aug 31 - Sep 11
  • coop - Aug 3, Aug 17-28
  • kmoir - Aug 3, Aug 24
  • otilia - Aug 10-21, half day on Jul 31
  • vlad - Jul 31; Aug 14-27
  • Monday Aug 3 - Holiday in Canada


== 2015-08-17 ==
1. Kim - please create another wiki page with the Etherpad notes from the last week :)
Done
Opened bug 1195301 to ask for wiki access for you two.

2. bug 1194786 - investigate

  • re-imaged and enabled in slavealloc
  • slave is now connected to a master, waiting to see if it takes jobs

UPDATE: taking jobs, marked as resolved

3. bug 1193054

  • the loaned EC2 instance is no longer needed
  • removed user records from inventory, terminated instance, revoked VPN access, marked the problem tracking bug as resolved.


4. bug 1063024
bug 1189049
bug 823235
  • re-imaged machines, enabled in slavealloc
  • waiting to see if they take jobs

UPDATES:

--> b-2008-ix-0115 is taking jobs


5. bug 1194604

  • t-snow-r4-0133 is no longer needed
  • revoked VPN access, re-imaged the slave
  • after re-image, the slave is no longer accessible (only ping works, nothing else)

Q: should I open a bug to DCOps to re-image the slave?
--> filed bug 1195313

6. on 08/13/2015 (guess it's August 12 on your side) we received an e-mail from Q with the subject: "[RelEng] Down time for WDS (windows imaging server) upgrade"
Q: is there a specific time of day when re-imaging windows machines is possible?
bug 936042

  • no, this will be intermittent as we shuffle VMs around. I'll try to get some definitive info from Q about possible breakage scenarios and when it should/shouldn't be safe to try re-images



"if we have no code changes between nightly builds, we still build and release a new one the day after even without any changes?"

991707

Kim will go look for buildduty bugs other than reimaging
disable freshclam on OSX builders
bug 1175291
Steps to do this:
1) Can you ssh as yourself to
releng-puppet2.srv.releng.scl3.mozilla.com?
If not, I'll ask for correct rights
2) After connecting,
cd /etc/puppet/environments
mkdir aselagea
cd aselagea
clone hg.mozilla.org/build/puppet as described here
ReleaseEngineering/PuppetAgain/HowTo/Set_up_a_user_environment#Common
3) Look at the code here puppet/modules/disableservices/manifests/common.pp to see how to disable the service
Once you have a patch to disable it:
4) Loan yourself a bld-lion-r5-* machine
to test the patch
References
ReleaseEngineering/PuppetAgain/HowTo/Set_up_a_user_environment
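
Roughly, steps 1-3 on the puppet master would look like this (a sketch; follow the HowTo above for the full environment setup):

# on releng-puppet2.srv.releng.scl3.mozilla.com
cd /etc/puppet/environments
mkdir aselagea
cd aselagea
hg clone https://hg.mozilla.org/build/puppet
# then edit puppet/modules/disableservices/manifests/common.pp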

Another possible bug
Add a runner task to check resolution on Windows testers before starting buildbot
bug 1190868
https://github.com/mozilla/build-runner
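A rough idea of what such a pre-flight task could do (hypothetical; the expected resolution value and how runner wires the task in are assumptions): query the current resolution, e.g.

wmic path Win32_VideoController get CurrentHorizontalResolution,CurrentVerticalResolution

then have the task exit non-zero on a mismatch, so buildbot never starts on a misconfigured tester.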

== 2015-08-18 ==
Fix the slaves broken by talos's inability to deploy an update
bug 1141416
These machines can be reimaged now

[alin]
1. bug 1191071 - t-snow-r4-0147

  • re-imaged, revoked VPN access and returned to production
  • started taking jobs, marked the bug as resolved


2. bug 1194211 - panda-0345 decomm

Q: is this due to the fact that the reconfig occurred only yesterday while the patch landed on 2015-08-13?
This is because my reconfig didn't update the maintenance page for some reason; I'll investigate. Fixed the page.

3. bug 936042 - t-w864-ix-092

  • investigated both yesterday and today, could not re-image the machine
  • ping does not work, attempted to reboot it but failed
  • managed to connect via KVM console and perform a system restore
  • logged in as root and noticed that the slave does not have any internet connection ("Network cable unplugged").
  • also, the resolution is lower than it should be (1024x768)

Q: my guess here is that we should open a bug to DCOps to run some diagnostics on this slave
Yes, good idea
--> created bug 1195785 for DCOps.

4. re-imaged 5 32-bit slaves and one 64-bit machine:

  • talos-linux32-ix-008 - OK
  • talos-linux32-ix-001 - connected to Coop's master, it does not take jobs at the moment
  • talos-linux32-ix-026 - OK
  • talos-linux32-ix-022 - failed the first 2 jobs, Ryan restarted it
  • talos-linux32-ix-003 - OK
  • talos-linux64-ix-027 - OK

--> marked most of the bugs as resolved.
Great work!

5. Alert from relengbot: [sns alert] Tue 05:08:06 PDT buildbot-master87.bb.releng.scl3.mozilla.com maybe_reconfig.sh: ERROR - Reconfig lockfile is older than 120 minutes

  • didn't manage to investigate, however it would be nice to know what it means

It means that the reconfig is somehow stuck and didn't finish. See ReleaseEngineering/Buildduty/Reconfigs for ideas on how to fix it. I looked briefly at it; I don't know what's wrong with it yet, still looking.

6. started looking over the bug for disabling the freshclam service on OSX builders
Wonderful, let me know if you have questions
Look at /etc/freshclam.conf - it seems to have some parameters you can use to modify it
Test on the command line first and then implement with puppet
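For example, something like this on a builder (a sketch, assuming the config lives at /etc/freshclam.conf there):

grep '^Checks' /etc/freshclam.conf
# "Checks 12" by default; 0 disables the scheduled database checks
sed -i.bak 's/^Checks 12/Checks 0/' /etc/freshclam.conf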


== 2015-08-19 ==

  • Increase SETA coalescing to every 7 pushes and every 60 min

bug 1195803
Just as an FYI, this will be enabled tomorrow; it will reduce the number of tests run on every push, which should reduce our high pending counts

[alin]
1. bug 1193734 - t-snow-r4-0094

  • this slave has been decommissioned
  • opened a bug to RelEng to implement the changes:

bug 1196217

  • noticed that this type of slave is not listed in "buildbot-configs/mozilla/production_config.py"
  • searched for a configuration file, but had little luck

--> I would need some suggestions here
I'll look and add some pointers to the bug
http://hg.mozilla.org/build/buildbot-configs/file/d5adde30c267/mozilla-tests/production_config.py#l31

2. https://secure.pub.build.mozilla.org/builddata/reports/slave_health/slave.html?name=talos-linux32-ix-022

  • this slave is still burning jobs
  • tried another re-image and enabled it in slavealloc
  • waiting to see if it takes jobs/how it runs them

Q: in case it still fails jobs, would it be a good idea to open a bug to DCOps for some diagnostics?
sure, sounds good
--> bug to DCOps: https://bugzilla.mozilla.org/show_bug.cgi?id=1196281

3. Disabling the freshclam service on OSX builders

  • took me a while to figure out that ClamAV is the actual antivirus and Freshclam is the automatic database update tool for ClamAV :)
  • looked over freshclam.conf and noticed a parameter that specifies the number of database checks per day
  • default is 12 --> this should be switched to "Checks 0"
  • used a "sed" expression that looks for a string like "Checks 12" and changes it to "Checks 0". I tested it locally and it worked, so I updated common.pp'from my environment'file to do the same thing. Also obtained diff.txt (patch).
  • when I ran "puppet agent --test" on the slave I got:

Error: Could not request certificate: Error 400 on SERVER: this master is not a CA
puppet agent --test --server=releng-puppet2.srv.releng.scl3.mozilla.com --environment=aselagea --pluginsync --ssldir=/var/lib/puppet/ssl
I don't know why this is happening. I debugged for about an hour. One thing I would suggest is to re-image bld-lion-r5-078 and then NOT remove the files listed here:
ReleaseEngineering/How_To/Loan_a_Slave#bld-lion-r5.2C_talos-mtnlion-r5.2C_t-yosemite-r5
so as to keep the ssh and puppet files
Also, I added a line to your manifests/moco-nodes.pp

node "bld-lion-r5-078.build.releng.scl3.mozilla.com" {

manifests/nodes.pp:node "bld-lion-r5-078.build.releng.scl3.mozilla.com" {
so it would be pinned to the master you are testing. Otherwise, it will run puppet against the production masters and remove your changes
ok, thanks for looking into this

4. bug 1195803 - Increase SETA coalescing to every 7 pushes and every 60 min

  • as mentioned, the number of tests run on every push will be reduced:

(5, 1800) <=> (10, 3600) will become (7, 3600), i.e. coalesced test jobs will run every 7 pushes or every 3600 seconds (60 min)

  • it would be useful to know more details about the process :)

https://elvis314.wordpress.com/2015/02/06/seta-search-for-extraneous-test-automation/
http://relengofthenerds.blogspot.ca/2015/04/less-testing-same-great-firefox-taste.html


== 2015-08-20 ==

1. received some alerts from nagios:
<nagios-releng> Thu 01:16:23 PDT [4007] aws-manager2.srv.releng.scl3.mozilla.com:File Age -/builds/aws_manager/aws_stop_idle.log is WARNING:FILE_AGE WARNING: /builds/aws_manager/aws_stop_idle.log is 663 seconds old and 1173723 bytes (http://m.mozilla.org/File+Age+-+/builds/aws_manager/aws_stop_idle.log)

  • connected to aws manager and looked over the log files
  • aws_watch_pending.log --> spot requests for different instances (bid 0.07)
  • aws_stop_idle.log --> various info on the state of the instances
  • could not find a reason for the alert, things went back to normal soon thereafter.
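
If it recurs, a quick manual check of the file's age on the host (a sketch):

# seconds since the log was last written; nagios warned at ~663s
echo $(( $(date +%s) - $(stat -c %Y /builds/aws_manager/aws_stop_idle.log) ))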


2. bug 1061321 - b-2008-ix-0149

  • talked to Pete [:pmoore] and Nigel [:nigelb] on IRC
  • it looks like this machine has run out of free space, most of the space is occupied by the "builds" folder
  • disabled the slave as the jobs were failing, re-imaged it and enabled it in slavealloc
  • waiting to see if it takes jobs and completes them successfully

UPDATE: started taking jobs, working fine
Should we alert on free disk space? Look at the runner code.

3. took care of two loan requests:
bug 1196399
bug 1196602 (in progress)
Q: (just to make sure) do we need to create a problem tracking bug for an EC2 instance? From what I noticed, we don't need to do that.
No, you don't need to do that. We are not loaning existing machines but rather creating new ones, so that's okay.

4. Disabling the freshclam service on OSX builders

  • re-imaged bld-lion-r5-078, but without deleting the mentioned files
  • if I run puppet agent without specifying the ssl directory:

puppet agent --test --environment=aselagea --server=releng-puppet2.srv.releng.scl3.mozilla.com
Info: Retrieving pluginfacts
Info: Retrieving plugin
Info: Loading facts
Info: Caching catalog for bld-lion-r5-078.build.releng.scl3.mozilla.com
Info: Applying configuration version 'unknown'
Notice: /Stage[main]/Cleanslate/File[/var/tmp/cleanslate]/ensure: removed
Notice: /Stage[main]/Jacuzzi_metadata/Exec[get_jacuzzi_metadata]/returns: executed successfully
Notice: Finished catalog run in 45.79 seconds

  • if I also specify the ssl directory, it seems to do nothing
  • checked the logs, tried to figure out why it doesn't work (in progress)


If you run with --debug or --verbose you might get more information presented to you.
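For example:

puppet agent --test --debug --environment=aselagea --server=releng-puppet2.srv.releng.scl3.mozilla.com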

[sns alert] Thu 02:08:31 PDT mobile-imaging-001.p1.releng.scl3.mozilla.com mozpool_inventorysync: raise RuntimeError('got status code %s from inventory' % r.status_code)
This is from the puppet error that Dustin talked to you about

== 2015-08-21 ==
Question from Callek:
What 1 or 2 things about the current slaveloan process do you feel are the most painful? What would be the biggest win if automated?

1. took care of some common tasks:

  • added treeherder-client 1.7 to the internal pypi mirror
  • restarted and monitored t-w864-ix-123 as it became non-responsive
  • UPDATE: OK
  • re-imaged t-xp32-ix-033, enabled in slavealloc, waiting to see if it takes jobs
  • UPDATE: taking jobs, marked the bug as resolved


2. bug 1196808 - loan request from Armen

  • disabled t-w864-ix-158, waited for the current job to end
  • granted VPN access, moved to loaner OU and restarted
  • VNC and SSH connection works fine, although I am not able to connect via RDP
  • noticed that the default profile for Remote Desktop is "private", so it should be "public" in order to work in my case
  • I must be logged in as administrator to make such changes, if needed

Q: do we need to grant access to a public IP for Remote Desktop?
Why am I asking this:

--> bug 1192345
--> loaned a Windows Server machine to Pete Moore (b-2008-ix-0080)
--> Pete sent me an e-mail saying that he was not able to connect via RDP, though VNC and SSH worked
--> I made the change mentioned above; however, on Windows Server there is no need to be logged in as administrator to grant such permissions.
What are they using to connect via RDP on their desktop? For instance, I have a Windows RDP client app on my Mac and can connect without an issue.

--> I don't know the OS and the client that Armen (or anyone who requests a loaner) uses to connect via RDP. I have Windows 8 and tried to connect using the Remote Desktop client that comes with Windows (and, as mentioned, I cannot do so).
--> debugging...
--> yeah, I still cannot connect via RDP. Even though my computer and t-w864-ix-158 belong to the same VPN, I cannot establish a connection using Remote Desktop. From what I read, I do NOT need to grant access to public IPs if the computers are on the same VPN, meaning that the default profile (private) should be fine.
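
One thing that might be worth trying as administrator on the loaner (a suggestion, not verified here): re-enable the built-in Remote Desktop firewall rule group, then check its profile scoping in the Windows Firewall with Advanced Security console if it is still limited to "private":

netsh advfirewall firewall set rule group="remote desktop" new enable=yes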


3. when dealing with dead jobs, according to ReleaseEngineering/Queue_directories:

  • we either delete the jobs from /dead directory
  • or we call "manage_masters.py" with the "retry_dead_queue" sub-command
find / -name "manage_masters.py"
/builds/buildbot/try1/tools/buildfarm/maintenance/manage_masters.py
/builds/buildbot/queue/tools/buildfarm/maintenance/manage_masters.py
  • I tried to run something like:
python /builds/buildbot/queue/tools/buildfarm/maintenance/manage_masters.py -c 'retry_dead_queue'

==> ImportError: No module named fabric.api (line 6)
Q1: I can't find where the "retry_dead_queue" sub-command is defined
Q2: is the script still functional?
Kim will look and see if it still works
Yes it works
Kims-MacBook-Pro:maintenance kmoir$ python manage_masters.py -f production-masters.json -H bm01-tests1-linux32 retry_dead_queue
[buildbot-master01.bb.releng.use1.mozilla.com] run: find /dev/shm/queue/commands/dead -type f
[buildbot-master01.bb.releng.use1.mozilla.com] run: find /dev/shm/queue/pulse/dead -type f

Do you have fabric installed?
Kims-MacBook-Pro:maintenance kmoir$ pip freeze | grep -i fabric
Fabric==1.4.3

If not, run
pip install fabric
to get the package installed locally

4. bug 1196723 - revocations failing due to invalid inventory

  • first of all, I'm sorry for the confusion generated here
  • I wanted to debug the issue, to see why puppet agent failed
  • the error received: Error: Could not request certificate: Error 400 on SERVER: this master is not a CA
  • I'll continue to dig more on puppet (including certificates :)) to get myself more familiar with it
  • Dustin would know more about how this was actually done


5. Question from Callek - what 1 or 2 things about the current slaveloan process do you feel are the most painful? What would be the biggest win if automated?

  • to be honest, I would try to develop a script that receives the name of a certain machine as input and, according to its type, performs the necessary steps for loaning it (I know it would be pretty difficult)
  • as a particularity, there's a python script that launches an EC2 instance. Right after launch, it tries to connect but fails, as the instance is still booting. The immediate action is to wait 1200 seconds (20 minutes) and then try again. Do we need to wait that long?

- pass on to Callek
- you can look at the cloud-tools repo https://github.com/mozilla/build-cloud-tools/ to see how you can change the wait
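
For the 20-minute wait, the fixed sleep could presumably become a poll; sketched as a shell loop (illustrative only - the real change would go in the cloud-tools Python code, and $instance is a placeholder):

# try SSH every 30s for up to 20 min instead of one fixed 1200s sleep
for i in $(seq 1 40); do
  ssh -o ConnectTimeout=5 "$instance" true && break
  sleep 30
done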

6. bug 1175291

Additional bugs to work on

1) Add emulator-x86-kk builds to trychooser
bug 1197235
Code is in hg.mozilla.org/build/tools, in the trychooser dir

2) manage_masters.py retry_dead_queue should run periodically
bug 1158729
Way to stop all these alerts :-)

3) Add a runner task to check resolution on Windows testers before starting buildbot
bug 1190868
https://github.com/mozilla/build-runner


4) Add T testing to the trychooser UI
bug 1141280
Code is in hg.mozilla.org/build/tools, in the trychooser dir