CIDuty/SVMeetings/July27-July31
Upcoming vacation/PTO:
- alin - aug31-sep11
- coop - aug 3, aug 17-28
- kmoir - aug 3
- otilia - aug10-21, half day on July31
- vlad - jul31 ; aug14-aug27
- Monday Aug 3 - Holiday in Canada
2015-07-26
- Bugzilla accounts:
- alin.selagea@softvision.ro
- vlad.ciobancai@softvision.ro
- IRC nicknames:
- aselagea
- vladC
- Otilia
2015-07-27
- Questions:
- [otilia] LDAP accounts - are these going to be @mozilla.com or @softvision.ro?
- [vlad] how many types of environments are there?
- do we need to do maintenance on the cluster servers? possibly restarting load balancer servers
- if something goes wrong in the DC, who should we contact? #moc - ping and ask if there is an outage going on; #moc is a 24/7 service
- escalation path: Chris, Chris Atlee, Hal Wine
- access to: ldap, vpn, nagios, ganglia and buildduty; irc channel is the main point
- daily or weekly meeting? ; or both at the beginning
- daily 15 min or half an hour, 9 AM EST (4 PM RO)
- sheriffs - roles and the interaction with them; they are responsible for the health of the trees; be responsive with the sheriffs
- how do we make the handover at the end of the day?
- ssh keys have been uploaded on bug 1187063 for both of us (Vlad and Alin)
- is it possible to clone the entire repository (https://hg.mozilla.org/build) automatically ?
- puppet - config for POSIX machines; braindump - useful tools; buildbot-configs and buildbotcustom - configs for buildbot servers that are being moved to taskcluster; tools - config for buildbot servers and other tools; slave_health - for slave health tools; cloud-tools - AWS configuration
- we cannot connect to the #mozbuild channel; it requires a channel key
- we will configure our irc client (nettalk) to work over an SSL connection
- Tomorrow: can start with bugs
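On the question of cloning everything under https://hg.mozilla.org/build automatically: a minimal sketch that only prints one `hg clone` command per repository named in the notes above, so the list can be reviewed before anything is run. The URL layout (base URL plus repository name) and the single-word spelling `buildbotcustom` are assumptions, and cloud-tools is left out of the list.

```python
# Sketch only: builds "hg clone" commands for the repositories listed in the notes.
# Assumption: each repository lives at <BASE_URL>/<name>.
BASE_URL = "https://hg.mozilla.org/build"

# Names taken from the notes; "buildbotcustom" is an assumed spelling of
# what the notes call "buildbot custom".
REPOS = [
    "puppet",
    "braindump",
    "buildbot-configs",
    "buildbotcustom",
    "tools",
    "slave_health",
]


def clone_commands(base_url=BASE_URL, repos=REPOS):
    """Return one argument list per repository, suitable for subprocess.run()."""
    return [["hg", "clone", f"{base_url}/{name}", name] for name in repos]


if __name__ == "__main__":
    # Print the commands instead of executing them.
    for cmd in clone_commands():
        print(" ".join(cmd))
```

Each returned list could be passed to `subprocess.run(cmd, check=True)` once the names and URLs have been confirmed.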
2015-07-28
- LDAP access is ready. Did you receive passwords? I'm not sure how this works for contractors
- [vlad] I haven't received the password
- [alin] I also haven't received the password yet.
- VPN access is next: https://login.mozilla.com/
- [vlad] the following repository: "cloud-tools" has been moved to github https://github.com/mozilla/build-cloud-tools
- Bugs
- [alin] Bugs:
- 1. https://bugzilla.mozilla.org/show_bug.cgi?id=1151591 - there are a bunch of bugs that depend on this one
as we understand it, 30 Linux test machines from production were re-imaged in order to be used as Windows test machines.
once all the machines have finished installing, they should be verified to be up and running
there were some issues with the win7 and win8 machines and you provided a Python script there (presumably for a health check).
QUESTION: how could we run that script/how could we check the mentioned machines?
Kim stated that "tst-linux64-ec2-kmoir" machine was going to be used for testing another bug
we assume that when the job is done, the machine will also be available for testing this bug
Josh asked to loan a slave of slavetype tst-linux32-ec2
Justin assigned the bug to Josh in order for him to keep track of the loaning
Josh would need to mark the bug as RESOLVED when he doesn't need the machine any longer
pretty much the same issues as in the previous cases
Gary wants to loan a slave machine to test moving two repositories from Github. He should mark the ticket as RESOLVED when he no longer needs to use the machine.
when we look at the bug in the Buildduty Report, we notice the two dependencies: 1183380, 1018212
opening the bug will list the two dependencies, but the second one is shown struck through because its state is RESOLVED FIXED
QUESTION: shouldn't this be displayed when looking over the bug from the Buildduty Report page?
2015-07-29
- [vlad]
- Started a chat with ryanc on the #moc channel to help us log in to ldap; he provided us with the credentials for ldap access.
- The credentials have been sent via email to me and Alin.
- Tested logging in to ldap, but without any success.
- ryanc updated the ticket with the details and proposed getting a Moz account in order to log in to ldap
- [vlad]
update: the ldap access has been resolved
this machine had been disabled in slavealloc, then re-imaged and returned to production
- the status of the bug switched from RESOLVED to REOPENED as Ryan noted that the machine was running at a 1280x1024 resolution (was 1600x1200 before)
- from the logs it follows that the resolution and the mouse position were adjusted when running a certain script: "/scripts/external_tools/mouse_and_screen_resolution.py", so it should be the expected behaviour
- noticed that one of the unit tests failed because it took too long, but it was then labeled as OK:
07:52:13 INFO - 666 INFO TEST-START | browser/devtools/webconsole/test/browser_webconsole_chrome.js
07:53:18 INFO - 668 INFO TEST-UNEXPECTED-FAIL | browser/devtools/webconsole/test/browser_webconsole_chrome.js | Test timed out - expected PASS
07:53:18 INFO - 669 INFO TEST-OK | browser/devtools/webconsole/test/browser_webconsole_chrome.js | took 48024ms
-> several warnings were then thrown related to this failed test:
07:57:53 WARNING - 788 INFO Failed: 1
07:57:53 WARNING - One or more unittests failed.
07:57:54 ERROR - Return code: 1
Q: next steps?
- [vlad] Can you please explain https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound to us? We saw this service in tickets, for example: https://bugzilla.mozilla.org/show_bug.cgi?id=886195#c7
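For the timed-out test noted above, a rough sketch of how the TEST-UNEXPECTED-FAIL lines could be pulled out of a test log. The sample text is the excerpt quoted in these notes; the regular expression is an assumption based only on those three lines and may not cover other log layouts.

```python
import re

# Excerpt copied from the notes above.
SAMPLE_LOG = """\
07:52:13 INFO - 666 INFO TEST-START | browser/devtools/webconsole/test/browser_webconsole_chrome.js
07:53:18 INFO - 668 INFO TEST-UNEXPECTED-FAIL | browser/devtools/webconsole/test/browser_webconsole_chrome.js | Test timed out - expected PASS
07:53:18 INFO - 669 INFO TEST-OK | browser/devtools/webconsole/test/browser_webconsole_chrome.js | took 48024ms
"""

# Assumed layout: "TEST-UNEXPECTED-FAIL | <test path> | <reason>".
FAIL_RE = re.compile(r"TEST-UNEXPECTED-FAIL \| (?P<test>\S+) \| (?P<reason>.*)$")


def unexpected_failures(log_text):
    """Return (test path, reason) pairs for every unexpected failure line."""
    failures = []
    for line in log_text.splitlines():
        match = FAIL_RE.search(line)
        if match:
            failures.append((match.group("test"), match.group("reason").strip()))
    return failures


if __name__ == "__main__":
    for test, reason in unexpected_failures(SAMPLE_LOG):
        print(f"{test}: {reason}")
```

Scanning for the failure line directly avoids being misled by the TEST-OK line that follows it for the same test.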
2015-07-30
- [vlad]
- 1. The following bug https://bugzilla.mozilla.org/show_bug.cgi?id=959635 has multiple dependency bugs:
- https://bugzilla.mozilla.org/show_bug.cgi?id=1095041 -> :markco's comment says that the disk has no errors and the check was successful
- The rest of the dependency bugs have been assigned to "Infrastructure & Operations" and have status "Resolved"
- [vlad]
My question: if all dependencies are resolved, shouldn't the bug be resolved as well?
- practice slave loan requests for kmoir
- [vlad] https://bugzilla.mozilla.org/show_bug.cgi?id=1189005
- [alin] currently working on bug https://bugzilla.mozilla.org/show_bug.cgi?id=1189003
updated the loaner bug and created a new bug for problem tracking
was unable to ssh to slave "talos-linux64-ix-003" as it was down
talked to :arr and he power cycled several slaves (the above included), but I was still unable to ssh to it
Q: any suggestions on this?
Q: do we need access to the mana docs for managing the hardware?
e.g. the machine re-imaging instructions can be found on mana
- coop to look into mana access
2015-07-31
- https://bugil.la/1189321 - bld-lion-r5 - kmoir problem tracking
- only AWS machines (ones with "ec2" or "spot" in their name) have the devs name attached, e.g. dev-linux64-ec2-coop
- HW machines have fixed names, e.g. bld-lion-r5-001
- many HW machines will already have problem tracking bugs filed; these will be linked from slavealloc and slave health if they exist
- if no bug, file a new one: both slavealloc and slave health provide links to bug templates for doing this
- make sure to set the bug alias to be the same as the short hostname <- this is important for the aforementioned link generation
- assigning the loaner bug back to the dev requesting the loan should happen as the last step once any clean-up, etc has been done and the machine is ready, otherwise the dev may try to access the machine before it's ready. The loan bug also drops out of the buildduty report queue when this happens and may get lost in a half-loaned state.
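The naming rules above can be sketched in a few lines. This is an illustration only: the substring check for "ec2" or "spot" is a naive reading of the rule as written, the function names are hypothetical, and the fully qualified hostname in the usage example is assumed for illustration.

```python
def short_hostname(hostname):
    """Return the part of the hostname before the first dot."""
    return hostname.split(".", 1)[0]


def is_aws_machine(hostname):
    """AWS machines are the ones with "ec2" or "spot" in their name."""
    name = short_hostname(hostname)
    return "ec2" in name or "spot" in name


def problem_tracking_alias(hostname):
    """The bug alias should be the same as the short hostname."""
    return short_hostname(hostname)


if __name__ == "__main__":
    # Hostnames from the notes; the domain suffix on the last one is an assumption.
    for host in ("dev-linux64-ec2-coop", "bld-lion-r5-001",
                 "talos-linux64-ix-003.build.mozilla.org"):
        kind = "AWS" if is_aws_machine(host) else "hardware"
        print(f"{host}: {kind}, bug alias {problem_tracking_alias(host)}")
```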
- [alin] Thanks for the notes above!
1. talked to Rob Thijssen on IRC
he created two accounts for me and Vlad for winadmin host access
I received a temporary password -> connected to the machine and changed it -> all seems to be working just fine now
NOTE: when connecting via RDP: username: releng\<username> password: <password>
use the CTRL-ALT-END key combination to change the password
Vlad will need to ping Rob on Monday to receive his password
find winadmin vidyo presentation
2. there are several mana pages for which it seems that we don't have the required permissions to see their content
Otilia appears to have such permissions
e.g. https://mana.mozilla.org/wiki/display/DC/How+To+Reimage+Releng+iX+and+HP+Linux+Machines
talked to :arr and he added me to all the wiki groups that she is in
just an fyi: arr is a she (Amy Rich)
3. attempted to re-image slave talos-linux64-ix-003
wiki: "remotely connect to the management web interface and start the java KVM console"
tried over and over to figure out how to do this
also asked in IRC
https://wiki.mozilla.org/ReleaseEngineering/How_To/Connect_To_IPMI
will sign oob-password.txt.gpg for alin and vlad
http://talos-linux64-ix-003-mgmt.build.mozilla.org/ - username/password
https://www.gnupg.org/gph/en/manual/c14.html
http://gpg.mozilla.org/ - upload public part of locally generated gpg key
4. received "Host 221.57.134.10.in-addr.arpa. not found: 3(NXDOMAIN)" when running test.sh
connected to aws-manager1.srv.releng.scl3.mozilla.com
environment setup
updated the script with my own data and tried to run it
aws-manager2 is the production host
verify that alin/vlad have permissions to update DNS
@Kim/Coop: could you please take a look at the script?
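A note on reading the NXDOMAIN message in item 4: the name 221.57.134.10.in-addr.arpa. is the reverse-DNS (PTR) name for an IPv4 address with its octets reversed, so the lookup that failed was for 10.134.57.221. The sketch below does the conversion in both directions using only the Python standard library; the helper name `ip_from_ptr_name` is hypothetical, and nothing here explains why the record was missing.

```python
import ipaddress


def ip_from_ptr_name(ptr_name):
    """Convert an IPv4 reverse-DNS name (d.c.b.a.in-addr.arpa) back to a.b.c.d."""
    labels = ptr_name.rstrip(".").split(".")
    if len(labels) != 6 or labels[-2:] != ["in-addr", "arpa"]:
        raise ValueError(f"not an IPv4 in-addr.arpa name: {ptr_name!r}")
    # The first four labels are the address octets in reverse order.
    return ipaddress.IPv4Address(".".join(reversed(labels[:4])))


if __name__ == "__main__":
    address = ip_from_ptr_name("221.57.134.10.in-addr.arpa.")
    print(address)
    # Going the other way: the PTR name for a given address.
    print(address.reverse_pointer)
```

If the address that comes out is the one test.sh was expected to use, a missing PTR record would fit the open item about verifying permissions to update DNS.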