CIDuty/SVMeetings/Aug3-Aug7


Upcoming vacation/PTO:

  • alin - aug31-sep11
  • coop - aug 3, aug 17-28
  • kmoir - aug 3
  • otilia - aug10-21, half day on July31
  • vlad - jul31 ; aug14-aug27
  • Monday Aug 3 - Holiday in Canada

2015-08-03

[alin]

1. https://bugzilla.mozilla.org/show_bug.cgi?id=1189787

We still do not have access to update DNS in order to create an EC2 instance. From what we understand, it will take a bit longer to resolve this.

Callek will ping corey about this access

Corey has some reservations; he sent a summary e-mail to coop, who will review/discuss it tomorrow (Tuesday), so we are waiting at least another day.

2. https://bugzilla.mozilla.org/show_bug.cgi?id=1189801

we are now able to clone the private repo, but cannot decrypt the files using our private gpg keys. I suspect that Chris hasn't managed to encrypt those files to our public gpg keys yet.

Callek confirmed otilia has access to deploypass.txt.gpg and password.txt.gpg

Callek noticed loanerou.txt.gpg should be encrypted for otilia and vlad (not re-encrypted as of this writing, since it is of no use until the DNS issue above is solved)
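
A quick way to check which keys a file is encrypted to, and to re-encrypt it for additional recipients once that is needed (a sketch using standard gpg commands; the key IDs are placeholders):

# show which key IDs the file is encrypted to
gpg --list-packets loanerou.txt.gpg | grep keyid

# re-encrypt for everyone who needs access (key IDs are placeholders)
gpg --decrypt loanerou.txt.gpg | gpg --encrypt -r <alin-key> -r <vlad-key> -r <otilia-key> -o loanerou.txt.gpg.new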

2015-08-04

[alin]

1. Callek talked to Corey Shields about the rights to update DNS. Corey replied that they do not have granular access to grant such rights, but proposed two solutions:

replace DNS pointers with IP addresses for puppet certificates

delegate a block of mozilla.com pointers to Route53, where SoftVision could then have control over making changes

Corey also noted that the second option would be more viable. arr and coop to discuss how to move forward. (Not possible to allocate a block of addresses.) In the meantime you can ask in #moc to add DNS entries, or Kim can do it for you.

2. https://bugzilla.mozilla.org/show_bug.cgi?id=1189720

uploaded all the packages mentioned by Julien to the internal pypi on the relengwebadm machine

3. tried several times to re-image machine talos-linux64-ix-003, but failed

noticed that it hangs for a long time during the first boot

managed to connect to the machine and opened puppetize.log:

"Running puppet agent against server 'puppet' Puppet run failed; re-trying after 10m"

when rebooting the instance, it will ask for kickstart credentials

NOTES:

  • we used the password from deploypass.txt as the puppet deployment password
  • we currently do not have access to gpg-encrypted files such as "kickstart.txt.gpg" (useful for debugging during the re-imaging process) and "loanerou.txt.gpg" (for loaning Windows machines).

So it looks like talos-linux64-ix-003 is imaged as a 14.04 machine, not 12.04 as it should be. This is why puppet is failing (I actually did this a while ago myself). So in this document, you need to ensure you pick the OS listed in step 6c

https://mana.mozilla.org/wiki/display/DC/How+To+Reimage+Releng+iX+and+HP+Linux+Machines

Kim tried to re-sign the gpg keys but ran into problems and could not update them (got invalid keys when trying to sign). Done by hwine. Thanks hwine! Alin and Vlad: you should be able to update your local private git repo and decrypt deploypass.txt.gpg, kickstart.txt.gpg, loanerou.txt.gpg and slave-passwords.txt
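
A minimal check from a local clone (the repo path is a placeholder):

cd /path/to/private-repo && git pull       # local clone of the private repo
gpg --decrypt deploypass.txt.gpg           # should now print the plaintext instead of erroring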

4. https://bugzilla.mozilla.org/show_bug.cgi?id=1186016

(we may be repeating ourselves here) the status of the bug switched from RESOLVED to REOPENED because Ryan noticed that the machine was running at 1280x1024 (it had been 1600x1200)

checked the raw logs:

04:46:17     INFO - Running pre test command run mouse & screen adjustment script with 'c:\mozilla-build\python27\python.exe ../scripts/external_tools/mouse_and_screen_resolution.py
04:46:21     INFO -  Screen resolution (current): (1600, 1200)
04:46:21     INFO -  Changing the screen resolution...
04:46:21     INFO -  Screen resolution (new): (1280, 1024)
04:46:21     INFO -  Mouse position (current): (640, 512)
04:46:21     INFO -  Mouse position (new): (1010, 10)

it seems there's a python script that changes the resolution in this case

Q: we are not sure about the steps required to fix the problem here. Also, we are not able to access the machine via RDP.

  • I can connect via VNC, not RDP (I use a Mac); see the sketch below. You connect and are prompted for a VNC password, which is in slave-passwords.txt.
    • [alin] cloned the repo again but was unable to decrypt slave-passwords.txt.gpg. It seems that the file was not signed with our keys.
    • it should be signed now
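
For reference, the VNC connection from a Mac can be made with the built-in Screen Sharing client (a sketch; the hostname is a placeholder):

# opens Screen Sharing and prompts for the VNC password from slave-passwords.txt
open "vnc://<slave-hostname>"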

[vladC]

1. discussed with :grenade and asked him for the temporary password to log in to the winadmin server. Changed the password.

2. sent Kim an email with the details of the new slave that I loaned for her.

2015-08-05

[alin]

1. it appears that we still don't have a DNS entry for "tst-linux64-ec2-kmoir2.test.releng.use1.mozilla.com"

received the following when running "aws_create_instance.py":

2015-08-05 01:41:33,332 - INFO - Checking name conflicts for tst-linux64-ec2-kmoir2
2015-08-05 01:41:54,645 - ERROR - tst-linux64-ec2-kmoir2.test.releng.use1.mozilla.com has no DNS entry

asked in the #moc channel to check this and they suggested filing a bug, which I did: https://bugzilla.mozilla.org/show_bug.cgi?id=1191231

Richard Soderberg [:atoll] said that the IP address was assigned to 'dev-linux64-ec2-kmoir2.dev.releng.use1.mozilla.com' and it should be 'tst' instead

this usually happens (to me) when I accidentally follow the wrong sequence of steps on the Loan wiki page, i.e. for build instead of test.
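
A quick way to confirm what the names currently resolve to before re-running aws_create_instance.py (a sketch using standard DNS tools; the IP is whichever address atoll mentioned):

# forward lookups: the tst name should resolve, the dev name should not
host tst-linux64-ec2-kmoir2.test.releng.use1.mozilla.com
host dev-linux64-ec2-kmoir2.dev.releng.use1.mozilla.com

# reverse lookup for the assigned IP (placeholder)
dig +short -x <assigned-ip>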

2. retried re-imaging talos-linux64-ix-003 using the 64-bit version of Ubuntu 12.04

checked puppetize.log and it seems that the process terminated successfully

also, I am able to ssh to that machine

still, I only see a black screen when looking at the console; if rebooted, it asks for a username and password for a short time, then switches to a black screen again.

I don't think that's unexpected. However, talos-linuxXX machines that are re-imaged will invariably hit https://bugzil.la/1141416 - in fact, this machine was in this state before it was loaned this time. Still need to find someone to dig into that bug.
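
For future reference, a quick way to double-check the installed OS after a re-image (a sketch; ssh user and domain omitted):

ssh talos-linux64-ix-003 'lsb_release -ds; uname -mr'
# expected output along the lines of: Ubuntu 12.04.x LTS / 3.x.x x86_64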

3. managed to connect to several windows slaves via VNC.

noticed that for all of them, the resolution was 1600x1200.

there is a bug open for this "wrong resolution" issue: https://bugzilla.mozilla.org/show_bug.cgi?id=1190868

https://wiki.mozilla.org/ReleaseEngineering/Buildduty/Slave_Management

[vlad]

1. Re-imaged bld-lion-r5-004 using the following command: /usr/sbin/bless --netboot --server bsdp://10.26.52.17; reboot. After 4 hours I tried to log in, but without any success. Is there a way to extract the re-image logs for a Mac? Filed a bug for this problem: https://bugzilla.mozilla.org/show_bug.cgi?id=1191273

2. Checked the following bug: https://bugzilla.mozilla.org/show_bug.cgi?id=1103497 and tried to figure out why the server is still idle and no jobs are running on this slave.

3. Checked the following bug: https://bugzilla.mozilla.org/show_bug.cgi?id=1031869, where RyanVM wrote that a re-image is needed, but I have not started it until I figure out why the first re-image did not work.

2015-08-06

[alin]

1. we connected to aws-manager and used the following commands:

host=tst-linux64-ec2-kmoir2
terminate ${host}
  • ping to "tst-linux64-ec2-kmoir2.test.releng.use1.mozilla.com" still worked
  • we were able to connect to "tst-linux64-ec2-kmoir2.test.releng.use1.mozilla.com"
  • several hours later, we were again not able to connect
  • terminated the host and attempted to launch a new instance
  • instance seems to be launched correctly, ping works intermittently but we are again asked for a password when connecting as root (the same thing as yesterday)
    • ping to "tst-linux64-ec2-kmoir2.test.releng.use1.mozilla.com" does not work at the moment
    • ping to 10.134.59.83 works but some packets are lost (timed-out requests)

there's a loan request bug for an AWS instance: https://bugzilla.mozilla.org/show_bug.cgi?id=1191533, but we are not able to loan such machines at the moment due to not having permissions to update DNS

2015-08-06 05:34:06,001 - INFO - Sanity checking DNS entries...
2015-08-06 05:34:06,005 - INFO - Checking name conflicts for tst-linux64-ec2-kmoir2
2015-08-06 05:34:30,581 - INFO - waiting for workers
2015-08-06 05:34:30,874 - INFO - Using IP 10.134.59.83
2015-08-06 05:34:31,087 - INFO - subnet subnet-ff3542d7
2015-08-06 05:34:43,733 - INFO - instance Instance:i-638f67c8 created, waiting to come up
2015-08-06 05:35:18,262 - INFO - assimilating Instance:i-638f67c8
2015-08-06 05:35:18,529 - INFO - Using private IP
2015-08-06 05:35:28,577 - WARNING - cannot connect; instance may still be starting  tst-linux64-ec2-kmoir2.test.releng.use1.mozilla.com (i-638f67c8, 10.134.59.83) - Timed out trying to connect to 10.134.59.83 (tried 1 time),retrying in 1200 sec ...
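
While waiting on retries like the one in the log above, a small loop on aws-manager can poll ping/ssh until the instance comes up (a sketch; assumes key-based ssh as root):

host=tst-linux64-ec2-kmoir2.test.releng.use1.mozilla.com
while true; do
  date
  ping -c 1 -W 2 "$host" > /dev/null 2>&1 && echo "ping ok" || echo "ping failed"
  ssh -o ConnectTimeout=5 -o BatchMode=yes root@"$host" uptime || echo "ssh failed"
  sleep 60
done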

2. https://bugzilla.mozilla.org/show_bug.cgi?id=1191394

"bld-lion-r5-083" still inaccessible

noticed that Amy assigned the bug to DCOps to netboot the machine

2015-08-07

[vlad]

1. We still have problems activating MFA on the AWS accounts. We receive the following error:

arn:aws:iam::314336048151:user/vlad.ciobancai@softvision.ro is not authorized to perform: iam:CreateVirtualMFADevice on resource: arn:aws:iam::314336048151:mfa/vlad.ciobancai@softvision.ro

    • coop to investigate MFA access with script

[alin]

1. https://bugzilla.mozilla.org/show_bug.cgi?id=1186583

checked the AWS console and noticed that the instance no longer exists. Deleted the DNS entries and marked the bug as resolved.

2. https://bugzilla.mozilla.org/show_bug.cgi?id=1191533 - loan request for an AWS instance

we do not have permissions to grant/revoke VPN access:

https://ldapadmin1.private.phx1.mozilla.com/manage/ only shows info about our own user

https://wiki.mozilla.org/ReleaseEngineering/How_To/Loan_a_Slave says that VPN access is not needed when the requester is in IT/RelOps.

we searched in IRC and noticed that Kannan Vijayan [:djvj] is not present on #it or #releng channels, just on #developers, so it is likely that he needs VPN access.

asked on IRC and Callek said that we can assume that everyone needs the access, as adding the VPN group will not hurt.

Q: is there a way to find out the groups/permissions for a requester? Kim will ping arr and ask about VPN access, then open bug https://bugzilla.mozilla.org/show_bug.cgi?id=1192253. In the interim, I can update your access.
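
Re the question above about finding a requester's groups: with read access to LDAP, group membership can typically be queried with ldapsearch; a rough sketch (server, base DN and the requester DN are assumptions, not verified against Mozilla's directory):

# list groups that have the requester as a member (all names are placeholders)
ldapsearch -x -H ldaps://<ldap-server> -b "dc=mozilla" "(member=<requester-dn>)" cn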

3. still got some issues with the EC2 instance

the launch process goes fine and then the puppet script will run (takes some time)

when it finishes, we are able to connect as root, but the instance shuts itself down after several minutes

even if we start it again from the AWS console, it will still shut down pretty soon

tried to launch the instance using another subnet, but the behaviour is the same

if we look at the logs from the AWS console:

collectdmon[880]: Info: collectd terminated with exit status 0
collectdmon[880]: Info: shutting down collectdmon [ OK ]
* Asking all remaining processes to terminate... [ OK ]
* All processes ended within 4 seconds.... [ OK ]
* Deconfiguring network interfaces... [ OK ]
* Deactivating swap... [ OK ]
umount: /run/lock: not mounted
mount: / is busy
* Will now halt
[71672119.320942] System halted.
  • runner needs to be disabled (see the sketch below)

https://wiki.mozilla.org/ReleaseEngineering/How_To/Loan_a_Slave#talos-linux32-ix.2C_talos-linux64-ix.2C_tst-linux32-ec2.2C_tst-linux64-ec2

    • coop to find bug to disable runner
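
Until that bug / the documented loaner step is confirmed, a rough sketch of what disabling runner on the loaner might look like, assuming it is managed as an upstart job named "runner" (an assumption, not verified):

# stop runner so it does not halt the loaner when it runs out of tasks (service name is an assumption)
sudo service runner stop
# keep it from starting again on reboot (standard upstart override on Ubuntu 12.04)
echo manual | sudo tee /etc/init/runner.override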

4. https://bugzilla.mozilla.org/show_bug.cgi?id=1018213

Kannan no longer needs the machine

should we re-image it and then enable it in slavealloc?