Buildduty notes

Upcoming vacation/PTO:

alin - aug31-sep11
coop - aug 3, aug 17-28
kmoir - aug 3, aug 24
otilia - aug10-21, half day on July31
vlad - jul31 ; aug14-aug27
Monday Aug 3 - Holiday in Canada

== 2015-08-17 ==
1. Kim - please create another wiki page with the Etherpad notes from the last week :)
Done
Opened bug 1195301 to ask for wiki access for you two.

2. bug 1194786 - investigate

re-imaged and enabled in slavealloc
slave is now connected to a master, waiting to see if it takes jobs

UPDATE: taking jobs, marked as resolved

3. bug 1193054

the loaned EC2 instance is no longer needed
removed user records from inventory, terminated instance, revoked VPN access, marked the problem tracking bug as resolved.

4. bug 1063024

bug 1189049

bug 823235

re-imaged machines, enabled in slavealloc
waiting to see if it takes jobs

UPDATES:

--> b-2008-ix-0115taking jobs

5. bug 1194604

t-snow-r4-0133is no longer needed
revoked VPN access, re-imaged the slave
after re-image, the slave is no longer accessible (only ping works, nothing else)

Q: should I open a bug to DCOps to re-image the slave?
--> submitted bug bug 1195313

6. on 08/13/2015 (guess it's August 12 on your side) we received an -email from Q: with the subject: "[RelEng] Down time for WDS (windows imaging server) upgrade"
Q: is there a specific time of the day when re-imaging windows machines is available?
bug 936042

no, this will be intermittent as we shuffle VMs around. I'll try to get some definitive info form Q about possible breakage scenarios and when it should/shouldn't be safe to try reimages

"if we have no code changes between nightly builds, we still build and release a new one the day after even without any changes?"

991707

Kim will go look for buildduty bugs other than reimaging
disable freshclam on OSX builders
bug 1175291
Steps to do this
1) Can you ssh as yourself to
releng-puppet2.srv.releng.scl3.mozilla.com?
If not, I'll ask for correct rights
2) After connecting,
cd /etc/puppet/environments
mkdir aselagea
cd aselagea
clone hg.mozilla.org/build/puppet as described here
ReleaseEngineering/PuppetAgain/HowTo/Set_up_a_user_environment#Common
3) Look at the code here puppet/modules/disableservices/manifests/common.pp to see how to disable the service
Once you have a patch to disable
4) Loan yourself a bld-lion-r5-* machine
to test the patch
References
ReleaseEngineering/PuppetAgain/HowTo/Set_up_a_user_environment

Another possible bug
Add a runner task to check resolution on Windows testers before starting buildbot
bug 1190868
https://github.com/mozilla/build-runner

== 2015-08-18 ==
Fix the slaves broken by talos's inability to deploy an update
bug 1141416
These machines can be reimaged now

[alin]
1. bug 1191071 - t-snow-r4-0147

re-imaged, revoked VPN access and returned to production
started taking jobs, marked the bug as resolved

2. bug 1194211 - panda-0345 decomm

noticed your comment stating that the change is in production
ReleaseEngineering/Maintenance#Reconfigs_.2F_Deployments does not list the bug there
when bld-lion-r5-055 was decomm, we can see the bug listed there

Q: is this due to the fact that the reconfig occurred only yesterday and the patch was landed on 2015-08-13? This is due to the fact that my reconfig didn't update the maintenance page for some reason, I'll investigate. Fixed page.

3. bug 936042 - t-w864-ix-092

investigated both yesterday and today, could not re-image the machine
ping does not work, attempted to reboot it but failed
managed to connect via KVM console and perform a system restore
logged in as root and noticed that the slave does not have any internet connection ("Network cable unplugged").
also, the resolution is lower than it should be (1024x768)

Q: my guess here is that we should open a bug to DCOps to run some diagnostics on this slave Yes, good idea
--> created bug 1195785 to DCOps.

4. re-imaged 5 32-bit slaves and one 64-bit machine:

talos-linux32-ix-008 - OK
talos-linux32-ix-001 -connected to Coop's master, it does not take jobs at the moment
talos-linux32-ix-026 - OK
talos-linux32-ix-022 - failed the first 2 jobs, Ryan restarted it
talos-linux32-ix-003 - OK
talos-linux64-ix-027 -OK

--> marked most of the bugs as resolved.
Great work!

5. Alert from relengbot: [sns alert] Tue 05:08:06 PDT buildbot-master87.bb.releng.scl3.mozilla.com maybe_reconfig.sh: ERROR - Reconfig lockfile is older than 120 minutes

didn't manage to investigate, however it would be nice to know what does it mean

It means that the reconfig is somehow stuck and didn't finish. See ReleaseEngineering/Buildduty/Reconfigs for ideas on how to fix. I looked briefly at it, I don't know what's wrong with it, still looking.

6. started looking over the bug for disabling freshclamservice on OSX builders
Wonderful, let me know if you have questions
Look at /etc/freshclam.conf seems to have some parameters you can use to modify it
Test on command line first and them implement with puppet

== 2015-08-19 ==

Increase SETA coalescing to every 7 pushes and every 60 min

bug 1195803
Just as a fyi, this will be enabled tomorrow which will reduce the number of tests run on every push which should reduce our high pending counts

[alin]
1. bug 1193734 -t-snow-r4-0094

this slave has been decommissioned
opened a bug to RelEng to implement the changes:

bug 1196217

noticed that this type of slaves is not listed in "buildbot-configs\mozilla\production_config.py"
searched for a configuration file, but had little luck

--> I would need some suggestions here
I'll look and add some pointers to the bug
http://hg.mozilla.org/build/buildbot-configs/file/d5adde30c267/mozilla-tests/production_config.py#l31

2. https://secure.pub.build.mozilla.org/builddata/reports/slave_health/slave.html?name=talos-linux32-ix-022

this slave is still burning jobs
tried another re-image and enabled and slavealloc
waiting to see if it takes jobs/how it runs them

Q: in the case when it still fails the jobs, would it be a good idea to open a bug to DCOps for some diagnostics? sure, sounds good
--> bug to DCOps: https://bugzilla.mozilla.org/show_bug.cgi?id=1196281

3. Disabling freshclamservice on OSX builders

took me a while to figure out that ClamAV is the actual antivirus and Freshclam is the automatic database update tool for ClamAV :)
looked over freshclam.conf and noticed a parameter that specifies the number of database checks per day
default is 12 --> this should be switched to "Checks 0"
used a "sed" expression that looks for a string like "Checks 12" and changes it to "Checks 0". I tested it locally and it worked, so I updated common.pp'from my environment'file to do the same thing. Also obtained diff.txt (patch).
when I ran "puppet agent --test" on the slave I got:

Error: Could not request certificate: Error 400 on SERVER: this master is not a CA
puppet agent --test --server=releng-puppet2.srv.releng.scl3.mozilla.com --environment=aselagea --pluginsync --ssldir=/var/lib/puppet/ssl
I don't know what why this is happening. I debugged for about an hour. One thing I would suggest is to reimage bld-lion-r5-078 and then NOT remove the files listed here
ReleaseEngineering/How_To/Loan_a_Slave#bld-lion-r5.2C_talos-mtnlion-r5.2C_t-yosemite-r5
to keep the ssh and puppet files
Also, I added a line to your manifests/moco-nodes.pp

node "bld-lion-r5-078.build.releng.scl3.mozilla.com" {

manifests/nodes.pp:node "bld-lion-r5-078.build.releng.scl3.mozilla.com" {
so it would be pinned to the master you are testing. Otherwise, it will run puppet against the production masters and remove your changes
ok, thanks for looking into this

4. bug 1195803 - Increase SETA coalescing to every 7 pushes and every 60 min

as mentioned, the number of tests run on every push will be reduced:

(5, 1800) <=> (10, 3600) will become (7, 3600)

it would be useful to know more details about the process :)

https://elvis314.wordpress.com/2015/02/06/seta-search-for-extraneous-test-automation/
http://relengofthenerds.blogspot.ca/2015/04/less-testing-same-great-firefox-taste.html

== 2015-08-20 ==

1. received some alerts from nagios:
<nagios-releng> Thu 01:16:23 PDT [4007] aws-manager2.srv.releng.scl3.mozilla.com:File Age -/builds/aws_manager/aws_stop_idle.log is WARNING:FILE_AGE WARNING: /builds/aws_manager/aws_stop_idle.log is 663 seconds old and 1173723 bytes (http://m.mozilla.org/File+Age+-+/builds/aws_manager/aws_stop_idle.log)

connected to aws manager and looked over the log files
aws_watch_pending.log --> spot requests for different instances (bid 0.07)
aws_stop_idle.log --> various info on the state of the instances
could not find a reason for the alert, things went back to normal soon thereafter.

2. bug 1061321 - b-2008-ix-0149

talked to Pete [:pmoore] and Nigel [:nigelb] on IRC
it looks like this machine has run out of free space, most of the space is occupied by the "builds" folder
disabled the slave as the jobs were failing, re-imaged it and enabled it in slavealloc
waiting to see if it takes jobs/they are completed successfully

UPDATE:started taking jobs, working fine
alert for free disk space? look at runner code

3. took care of two loan requests:
bug 1196399
bug 1196602 (in progress)
Q: (just to make sure) do we need to create a problem tracking bug for an EC2 instance? From what I noticed, we don't need to do that.
No you don't need to do that. We are not loaned existing machines but rather creating new ones so that's okay.

4. Disabling freshclamservice on OSX builders

re-imaged bld-lion-r5-078,but without deleting the mentioned files
if I run puppet agent without specifying the ssl directory:

puppet agent --test --environment=aselagea --server=releng-puppet2.srv.releng.scl3.mozilla.com
Info: Retrieving pluginfacts
Info: Retrieving plugin
Info: Loading facts
Info: Caching catalog for bld-lion-r5-078.build.releng.scl3.mozilla.com
Info: Applying configuration version 'unknown'
Notice: /Stage[main]/Cleanslate/File[/var/tmp/cleanslate]/ensure: removed
Notice: /Stage[main]/Jacuzzi_metadata/Exec[get_jacuzzi_metadata]/returns: executed successfully
Notice: Finished catalog run in 45.79 seconds

if I also specify the ssl directory, it seems to do nothing
checked the logs, tried to figure out why it doesn't work (in progress)

If you run with --debug or --verbose you might have more information presented to you

[sns alert] Thu 02:08:31 PDT mobile-imaging-001.p1.releng.scl3.mozilla.com mozpool_inventorysync: raise RuntimeError('got status code %s from inventory' % r.status_code)
This is from puppet error that Dustin talked to you about

== 2015-08-20 ==
Question from Callek
What 1 or 2 things about current slaveloan process they feel is the "most painful" for them? What would be the most win if automated?

1. took care of some common tasks:

added treeherder-client 1.7 to the internal pypi mirror
restarted and monitored t-w864-ix-123'as it became non-responsive'
UPDATE: OK
re-imaged t-xp32-ix-033, enabled in slavealloc, waiting to see if it takes jobs
UPDATE: taking jobs, marked the bugt as solved

2. bug 1196808 - loan request from Armen

disabled t-w864-ix-158,waited for the current job to end
granted VPN access, moved to loaner OU and restarted
VNC and SSH connection works fine, although I am not able to connect via RDP
noticed that the default profile for Remote Desktop is "private", so it should be "public" in order to work in my case
I must be logged in as administrator to make such changes, if needed

Q: do we need to grant access to a public IP for Remote Desktop?
Why I am asking this?

--> bug 1192345

--> loaned a Windows Server machine to Pete Moore (b-2008-ix-0080)

--> Pete sent me an e-mail that he is not able to connect via RDP, but it worked for VNC and SSH

--> I did the change mentioned above, however on Winows Server there is no need to be logged in as administrator to grant such permissions.

What are they using to connect via rdp on their desktop? For instance, I have a windows rdp client app on my mac to connect, and can connect without an issue.

--> I don't know the OS and the client that Armen (or anyone that requests a loaner) uses to connect via RDP..I have Windows 8 and tried to connect using the Remote Desktop service that comes with Windows OS (and, as mentioned, I cannot do so).
--> debugging..
--> yeah, I still cannot connect via RDP..even though my computer and t-w864-ix-158belong to the same VPN, I cannot establish a connection using Remote Desktop..from what I read, I do NOT need to grant access to public IPs if the computers are on the same VPN, meaning that the default profile (private) should be fine.

3. when dealing with dead jobs: according to ReleaseEngineering/Queue_directories :

we either delete the jobs from /dead directory
or we call "manage_masters.py" with the "retry_dead_queue" sub-command

find / -name "manage_masters.py"

/builds/buildbot/try1/tools/buildfarm/maintenance/manage_masters.py

/builds/buildbot/queue/tools/buildfarm/maintenance/manage_masters.py

I tried to run somenthing like:

python /builds/buildbot/queue/tools/buildfarm/maintenance/manage_masters.py -c 'retry_dead_queue'

==> ImportError: No module named fabric.api ='>'(line 6)
Q1: I don't seem to find where the "retry_dead_queue" sub-command is defined
Q2: is the script still functional?
Kim will look and see if it still works
Yes it works
Kims-MacBook-Pro:maintenance kmoir$ python manage_masters.py -f production-masters.json -H bm01-tests1-linux32 retry_dead_queue
[buildbot-master01.bb.releng.use1.mozilla.com] run: find /dev/shm/queue/commands/dead -type f
[buildbot-master01.bb.releng.use1.mozilla.com] run: find /dev/shm/queue/pulse/dead -type f

Do you have fabric installed?
Kims-MacBook-Pro:maintenance kmoir$ pip freeze | grep -i fabric
Fabric==1.4.3

If not, run
pip install fabric
to get the package installed locally

4. bug 1196723 - revocations failing due to invalid inventory

first of all, I'm sorry for the confusion generated here
I wanted to debug the issue, to see why puppet agent failed
the error received: Error: Could not request certificate: Error 400 on SERVER: this master is not a CA
I'll continue to dig more on puppet (including certificates :)) to get myself more familiar with it
Dustin would know more about how this was actually done

5. Question from Callek - What 1 or 2 things about current slaveloan process they feel is the "most painful" for them? What would be the most win if automated?

to be honest, I would try to develop a script that receives the name of a certain machine as input and, according to its type, it would make the necessary steps for loaning it.(I know it would be pretty difficult)
as a particularity, there's a python script that launches an EC2 instance. Right after it launches, it will try to connect to it but will fail, as the instance is still in the booting process. The immediate action for this is to wait for 1200 seconds (20 minutes) and then try again. Do we need to wait for that much time?

-pass on to Callek
-you can look at cloud tools repo https://github.com/mozilla/build-cloud-tools/ to see how you can change the wait

6. bug 1175291

Additional bugs to work on

1) Add emulator-x86-kk builds to trychooser
bug 1197235
Code is in hg.mozilla.org/buildtools and trychooser dir

2)manage_masters.py retry_dead_queue should run periodically
bug 1158729
Way to stop all these alerts :-)

3) Add a runner task to check resolution on Windows testers before starting buildbot
bug 1190868
https://github.com/mozilla/build-runner

4) Add T testing to the trychooser UI
bug 1141280
Code is in hg.mozilla.org/buildtools and trychooser dir

Buildduty notes

Navigation menu

Personal tools

Namespaces

Variants

Views

More

Search

Navigation

How to Contribute

MozillaWiki

Around Mozilla

Tools