CIDuty/SVMeetings/July27-July31

From MozillaWiki

Upcoming vacation/PTO:

  • alin - aug31-sep11
  • coop - aug 3, aug 17-28
  • kmoir - aug 3
  • otilia - aug10-21, half day on July31
  • vlad - jul31 ; aug14-aug27
  • Monday Aug 3 - Holiday in Canada


2015-07-26

  • Bugzilla accounts:
    • alin.selagea@softvision.ro
    • vlad.ciobancai@softvision.ro
  • IRC nicknames:
    • aselagea
    • vladC
    • Otilia

2015-07-27

  • Questions:
    • [otilia] LDAP accounts - are these going to be @mozilla.com or @softvision.ro?
    • [vlad] how many types of environments are there?
    • do we need to maintain the cluster servers? possibly restart the load balancer servers
    • if something goes wrong in DC, who should we contact ? #moc - ping and ask if there is an outage going on; #moc is a 24/7 service
      • escalation path Chris, Chris Atlee, Hal Wine
    • access to: ldap, vpn, nagios, ganglia and buildduty; the irc channel is the main point of contact
    • daily or weekly meeting? ; or both at the beginning
      • daily 15 min or half an hour, 9 AM EST (4 PM RO)
    • sheriffs - their role and how we interact with them; they are responsible for the health of the trees; be responsive to the sheriffs
    • how do we make the handover at the end of the day
    • ssh keys have been uploaded on bug 1187063 for both of us (Vlad and Alin)
    • is it possible to clone the entire repository (https://hg.mozilla.org/build) automatically ?
      • puppet - config for POSIX machines
      • braindump - useful tools
      • buildbot-configs and buildbot custom - configs for buildbot servers that are being moved to taskcluster
      • tools - config for buildbot servers and other tools
      • slave_health - for slave health tools
      • cloud-tools - AWS configuration
    • we cannot connect to the #mozbuild channel; it requires a channel key
    • we will configure our irc client (nettalk) to work with an SSL connection
  • Tomorrow: can start with bugs
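
On the question about cloning everything under https://hg.mozilla.org/build automatically: a minimal sketch, assuming Mercurial (`hg`) is installed and that the repository names below match the ones listed in these notes. The spelling "buildbotcustom" is an assumption (the notes write "buildbot custom"), and `clone_commands` and `clone_all` are hypothetical helper names.

```python
import subprocess

BASE_URL = "https://hg.mozilla.org/build"

# Repository names as listed in the notes; "buildbotcustom" is an assumed spelling.
REPOS = [
    "puppet",
    "braindump",
    "buildbot-configs",
    "buildbotcustom",
    "tools",
    "slave_health",
    "cloud-tools",
]


def clone_commands(repos=REPOS, base_url=BASE_URL):
    """Build one 'hg clone <url> <dir>' command per repository."""
    return [["hg", "clone", f"{base_url}/{name}", name] for name in repos]


def clone_all(dest_dir="."):
    """Run the clone commands one by one inside dest_dir."""
    for command in clone_commands():
        # check=True stops at the first clone that fails.
        subprocess.run(command, cwd=dest_dir, check=True)
```

Printing the result of `clone_commands()` first gives a dry run of what `clone_all()` would execute.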

2015-07-28

  • LDAP access is ready. Did you receive passwords? I'm not sure how this works for contractors
    • [vlad] I haven't received the password
    • [alin] I also haven't received the password yet.
   from our understanding, 30 Linux test machines from production were re-imaged in order to be used as Windows test machines.
   once all the machines have finished installing, they should be checked to confirm that they are up and running
   there were some issues with the win7 and win8 machines and you provided a Python script there (presumably for a health check).

QUESTION: how could we run that script/how could we check the mentioned machines?

   Kim stated that "tst-linux64-ec2-kmoir" machine was going to be used for testing another bug
   we assume that when the job is done, the machine will be available for also testing this bug


   Josh asked for the loan of a slave of slavetype tst-linux32-ec2
   Justin assigned the bug to Josh so that he can keep track of the loan
   Josh would need to mark the bug as RESOLVED when he doesn't need the machine any longer


   pretty much the same issues as in the previous cases
   Gary wants to loan a slave machine to test moving two repositories from Github. He should mark the ticket as RESOLVED when he no longer needs to use the machine.


   when we look at the bug in the Buildduty Report, we notice the two dependencies: 1183380, 1018212
   opening the bug lists the two dependencies, but the second one is shown struck through because its state is RESOLVED FIXED

QUESTION: shouldn't this be displayed when looking over the bug from the Buildduty Report page?

2015-07-29

    • [vlad]
      • Started a chat with ryanc in the #moc channel to get help logging in to LDAP; he provided us with the credentials for LDAP access.
      • The credentials have been sent via email to me and Alin.
      • Tested logging in to LDAP, but without success.
      • ryanc updated the ticket with the details and proposed getting a Moz account in order to log in to LDAP

update: the ldap access has been resolved

this machine had been disabled in slavealloc, then re-imaged and returned to production

        • the status of the bug switched from RESOLVED to REOPENED as Ryan noted that the machine was running at a 1280x1024 resolution (was 1600x1200 before)
        • from the logs it follows that the resolution and the mouse position were adjusted when running a certain script: "/scripts/external_tools/mouse_and_screen_resolution.py", so it should be the expected behaviour
        • noticed that one of the unit tests failed because it took too long, but it was then labeled as OK:

               07:52:13 INFO - 666 INFO TEST-START | browser/devtools/webconsole/test/browser_webconsole_chrome.js
               07:53:18 INFO - 668 INFO TEST-UNEXPECTED-FAIL | browser/devtools/webconsole/test/browser_webconsole_chrome.js | Test timed out - expected PASS
               07:53:18 INFO - 669 INFO TEST-OK | browser/devtools/webconsole/test/browser_webconsole_chrome.js | took 48024ms

       -> several warnings were then thrown related to this failed test:
               07:57:53 WARNING - 788 INFO Failed: 1
               07:57:53 WARNING - One or more unittests failed.
               07:57:54 ERROR - Return code: 1
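
A minimal sketch of how unexpected results like the one above could be picked out of a test log, so that a later TEST-OK line for the same test does not hide them. The regular expression is an assumption based only on the three log lines quoted in these notes, and `unexpected_results` and `SAMPLE` are illustrative names, not part of any existing tool.

```python
import re

# Pattern inferred from the log excerpt in these notes (an assumption, not an
# official log format definition): "TEST-UNEXPECTED-<KIND> | <test> | <message>"
UNEXPECTED = re.compile(
    r"TEST-UNEXPECTED-(?P<kind>[A-Z-]+) \| (?P<test>[^|]+?) \| (?P<message>.*)"
)


def unexpected_results(lines):
    """Return (kind, test path, message) for every unexpected result line."""
    results = []
    for line in lines:
        match = UNEXPECTED.search(line)
        if match:
            results.append(
                (
                    match.group("kind"),
                    match.group("test").strip(),
                    match.group("message").strip(),
                )
            )
    return results


# The three log lines quoted in the notes.
SAMPLE = [
    "07:52:13 INFO - 666 INFO TEST-START | browser/devtools/webconsole/test/browser_webconsole_chrome.js",
    "07:53:18 INFO - 668 INFO TEST-UNEXPECTED-FAIL | browser/devtools/webconsole/test/browser_webconsole_chrome.js | Test timed out - expected PASS",
    "07:53:18 INFO - 669 INFO TEST-OK | browser/devtools/webconsole/test/browser_webconsole_chrome.js | took 48024ms",
]
```

Calling `unexpected_results(SAMPLE)` returns a single entry for browser_webconsole_chrome.js, which matches the "Failed: 1" warning in the log.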

Q: next steps?

2015-07-30

My question is: if all dependencies are resolved, shouldn't the bug be resolved as well?


   updated the loaner bug and created a new bug for problem tracking
   was unable to ssh to slave "talos-linux64-ix-003" as it was down
   talked to :arr and he power cycled several slaves (the above included), but I was still unable to ssh to it
   Q: any suggestions on this?
   Q: do we need access to the mana docs for managing the hardware?

e.g. the machine re-imaging instructions can be found on mana

    • coop to look into mana access

2015-07-31

  • https://bugil.la/1189321 - bld-lion-r5 - kmoir problem tracking
    • only AWS machines (ones with "ec2" or "spot" in their name) have the dev's name attached, e.g. dev-linux64-ec2-coop
    • HW machines have fixed names, e.g. bld-lion-r5-001
    • many HW machines will already have problem tracking bugs filed; these will be linked from slavealloc and slave health if they do
    • if no bug, file a new one: both slavealloc and slave health provide links to bug templates for doing this
      • make sure to set the bug alias to be the same as the short hostname <- this is important for the aforementioned link generation
    • assigning the loaner bug back to the dev requesting the loan should happen as the last step once any clean-up, etc has been done and the machine is ready, otherwise the dev may try to access the machine before it's ready. The loan bug also drops out of the buildduty report queue when this happens and may get lost in a half-loaned state.
    • [alin] Thanks for the notes above!
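
The naming rules in the notes above can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and splitting the short hostname on "-" to look for "ec2" or "spot" is an assumption about how the names are structured, based on the examples given here.

```python
def short_hostname(hostname):
    """Strip any domain part, e.g. 'host.example.com' -> 'host'."""
    return hostname.split(".", 1)[0].lower()


def is_aws_machine(hostname):
    """AWS machines are the ones with 'ec2' or 'spot' in their name."""
    tokens = short_hostname(hostname).split("-")
    return "ec2" in tokens or "spot" in tokens


def suggested_bug_alias(hostname):
    """The problem tracking bug alias should equal the short hostname."""
    return short_hostname(hostname)
```

For example, `is_aws_machine("dev-linux64-ec2-coop")` is true, while `is_aws_machine("bld-lion-r5-001")` is false and `suggested_bug_alias("bld-lion-r5-001")` gives back the same name to use as the bug alias.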

1. talked to Rob Thijssen on IRC

   he created two accounts for me and Vlad for winadmin host access
   I received a temporary password -> connected to the machine and changed it -> all seems to be working just fine now

NOTE: when connecting via RDP: username: releng\<username> password: <password>

          use CTRL-ALT-END key combination to change the password
   Vlad will need to ping Rob on Monday to receive his password
   find winadmin vidyo presentation

2. there are several mana pages for which it seems that we don't have the required permissions to see their content

   Otilia appears to have such permissions    

e.g. https://mana.mozilla.org/wiki/display/DC/How+To+Reimage+Releng+iX+and+HP+Linux+Machines

   talked to :arr and he added me to all the wiki groups that she is in
   just an FYI: arr is a she (Amy Rich)

3. attempted to re-image slave talos-linux64-ix-003

   wiki: "remotely connect to the management web interface and start the java KVM console"
   tried over and over to figure out how to do this
   also asked in IRC
   https://wiki.mozilla.org/ReleaseEngineering/How_To/Connect_To_IPMI
   will sign oob-password.txt.gpg for alin and vlad
   http://talos-linux64-ix-003-mgmt.build.mozilla.org/ - username/password
   https://www.gnupg.org/gph/en/manual/c14.html
   http://gpg.mozilla.org/ - upload public part of locally generated gpg key


4. received "Host 221.57.134.10.in-addr.arpa. not found: 3(NXDOMAIN)" when running test.sh

   connected to aws-manager1.srv.releng.scl3.mozilla.com
   environment setup
   updated the script with my own data and tried to run it
   aws-manager2 is the production host
   verify that alin/vlad have permissions to update DNS


@Kim/Coop: could you please take a look at the script?
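
For the NXDOMAIN message in item 4: the name 221.57.134.10.in-addr.arpa is the reverse (PTR) name for the address 10.134.57.221, so the error means that no reverse DNS record was found for that IP. Below is a small sketch for checking this by hand, using only the Python standard library. The helper names are illustrative and this is not what test.sh itself does.

```python
import ipaddress
import socket


def ptr_name_to_ip(ptr_name):
    """Convert 'd.c.b.a.in-addr.arpa.' into the IPv4 address 'a.b.c.d'."""
    labels = ptr_name.rstrip(".").lower().split(".")
    if len(labels) != 6 or labels[-2:] != ["in-addr", "arpa"]:
        raise ValueError(f"not an IPv4 reverse name: {ptr_name!r}")
    # IPv4Address validates that each octet is in range.
    return str(ipaddress.IPv4Address(".".join(reversed(labels[:4]))))


def reverse_lookup(ip):
    """Return the hostname from reverse DNS, or None if there is no record."""
    try:
        return socket.gethostbyaddr(ip)[0]
    except (socket.herror, socket.gaierror):
        return None
```

`ptr_name_to_ip("221.57.134.10.in-addr.arpa.")` gives the address to check, and `reverse_lookup` on that address returns `None` for as long as the PTR record is missing.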