ReleaseEngineering/How To/Manage AWS slaves: Difference between revisions

Updated instructions on how to manage AWS slaves
(aws-manager1 -> aws-manager2)
(Updated instructions on how to manage AWS slaves)
 
Line 74: Line 74:
*# ''Check machine current status (is it actually running right now) by either''
*# ''Check machine current status (is it actually running right now) by either''
*#* Logging into [https://mozilla-releng.signin.aws.amazon.com/console AWS web console], look up instance, and see if it is still running
*#* Logging into [https://mozilla-releng.signin.aws.amazon.com/console AWS web console], look up instance, and see if it is still running
*#** *note: if you don't know the credentials for this, they probably have to be generated for you. Ask Armen, as he has done this
*#** ['''Note''']: if you don't know the credentials for this, they probably have to be generated for you. Ask :catlee, as he has done this
*#* Using releng cloud-tools from aws-manager2.srv.releng.scl3.mozilla.com
*#* Using releng cloud-tools from aws-manager2.srv.releng.scl3.mozilla.com
*#** see [https://wiki.mozilla.org/index.php?title=ReleaseEngineering/How_To/Manage_AWS_slaves#Usage usage] for 'status' command above
*#** see [https://wiki.mozilla.org/index.php?title=ReleaseEngineering/How_To/Manage_AWS_slaves#Usage usage] for 'status' command above
Line 81: Line 81:
*#* If loaners/releng-dev machines:
*#* If loaners/releng-dev machines:
*#** ssh as root into that machine, and run `last`
*#** ssh as root into that machine, and run `last`
*#** find the bug that is assoctiated with the instance and check latest comments. See [https://bugzilla.mozilla.org/buglist.cgi?cmdtype=dorem&remaction=run&namedcmd=releng-loan-requests&sharer_id=30066&list_id=9324515 loaned machines]
*#** find the bug that is associated with the instance and check latest comments.
*#*** the bug number can also be found by looking at the instance tags in AWS console
*#* If it's one of our Buildbot CI machines
*#* If it's one of our Buildbot CI machines
*#** use [https://secure.pub.build.mozilla.org/builddata/reports/slave_health/ Slave Health] or ssh into machine and tail twistd.log
*#** use [https://secure.pub.build.mozilla.org/builddata/reports/slave_health/ Slave Health] or ssh into machine and tail twistd.log
*#** *note: these machines should not be running long. An instance is put on the long running process list if it's up for more than 2h. So if it's been idle for a while, further action will be required.
*#** ['''Note''']: these machines should not be running long. An instance is put on the long running process list if it's up for more than 2h. So if it's been idle for a while, further action will be required.
*# ''For instances that have not had any recent builds/activity and you are sure they are not currently doing a build''
*# ''For instances that have not had any recent builds/activity and you are sure they are not currently doing a build''
*#* If loaners/releng-dev machines:
*#* If loaners/releng-dev machines:
*#** Poke the owner of the instance via the associated bug, checking if they still need the machine. See [https://bugzilla.mozilla.org/buglist.cgi?cmdtype=dorem&remaction=run&namedcmd=releng-loan-requests&sharer_id=30066&list_id=9324515 loaned machines]
*#** Poke the owner of the instance via the associated bug, checking if they still need the machine.
*#** use judgement for what's fair. ''eg: if it's been up for 24-48hrs, probably not cause for further action.''
*#** use judgement for what's fair. ''eg: if it's been up for 24-48hrs, probably not cause for further action.''
*#** Store owner/usage detail in the moz-used-by instance Tag (if not already updated)
*#** Store owner/usage detail in the moz-used-by instance Tag (if not already updated)
Line 94: Line 95:
*#*** 'stop' the instance if owner wants to use it again soon but won't be working on it for a day or two
*#*** 'stop' the instance if owner wants to use it again soon but won't be working on it for a day or two
*#**** see [https://wiki.mozilla.org/index.php?title=ReleaseEngineering/How_To/Manage_AWS_slaves#Usage usage] for 'stop' command above
*#**** see [https://wiki.mozilla.org/index.php?title=ReleaseEngineering/How_To/Manage_AWS_slaves#Usage usage] for 'stop' command above
*#**** *note: this should be made appealing to the owner as turning it back on is *easy* and fast!
*#**** ['''Note''']: this should be made appealing to the owner as turning it back on is *easy* and fast!
*#*** 'terminate' the instance if owner has stated to be finished forever or if bug is resolved
*#*** 'terminate' the instance if owner has stated to be finished forever or if bug is resolved
*#**** see [https://wiki.mozilla.org/ReleaseEngineering/How_To/Loan_a_Slave#Reclaiming Reclaiming Loaners]
*#**** see [https://wiki.mozilla.org/ReleaseEngineering/How_To/Loan_a_Slave#Reclaiming Reclaiming Loaners]
*#**** *note: don't forget to revert vpn access bug and delete A/PTR records.
*#**** ['''Note''']: don't forget to revert vpn access bug and delete A/PTR records.
*#* If it's one of our Buildbot CI machines:
*#* If it's one of our Buildbot CI machines:
*#** Decide whether to stop or terminate instance
*#** Decide whether to stop or terminate instance
*#*** if this is a spot instance:
*#*** if this is a spot instance:
*#**** [https://wiki.mozilla.org/ReleaseEngineering/How_To/Loan_a_Slave#AWS_machines_2 terminate it] (don't delete A/PTR records)
*#**** [https://wiki.mozilla.org/ReleaseEngineering/How_To/Loan_a_Slave#AWS_machines_2 terminate it] (don't delete A/PTR records)
*#**** *note: they need to be terminated because spot instances don't really have a 'stopped' state.
*#**** ['''Note''']: they need to be terminated because spot instances don't really have a 'stopped' state.
*#*** if this is not a spot instance:
*#*** if this is not a spot instance:
*#**** shut it down by:
*#**** shut it down by:
Line 108: Line 109:
*#***** logging into [https://mozilla-releng.signin.aws.amazon.com/console AWS web console] and choose 'stop' in dropdown
*#***** logging into [https://mozilla-releng.signin.aws.amazon.com/console AWS web console] and choose 'stop' in dropdown
*#***** ssh in to machine and: $ shutdown -h now
*#***** ssh in to machine and: $ shutdown -h now
*#**** *note stopping will allow aws_watch_pending to deal with deciding when it needs to be started up again
*#**** ['''Note''']: stopping will allow aws_watch_pending to deal with deciding when it needs to be started up again
* '''For repeatedly problematic instances''', further action will be required. Ask in #releng and possibly escalate to catlee/rail
* '''For repeatedly problematic instances''', further action will be required. Ask in #releng and possibly escalate to catlee/rail
* '''Future Plans:'''
** first step to make this less of a manual process is {{bug|962698}}
*** eg: changing the long instance report to JSON and feeding it into slave health
** modify aws_manage_instances.py to add the ability to fill in the moz-used-by instance Tag from above.
*** look at how the 'disable' command works in aws_manage_instances.py for an idea on how to do this
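
The stop/terminate decision described in the steps above can be sketched as a small function. This is only an illustration of the decision tree, not part of cloud-tools or aws_manage_instances.py: the Instance fields, the kind labels ("loaner", "ci") and the returned action strings are all hypothetical names made up for this example. It assumes you have already confirmed that the instance is idle.

```python
from dataclasses import dataclass

# From the note above: CI machines up for more than 2h land on the long running list.
LONG_RUNNING_HOURS = 2


@dataclass
class Instance:
    """Hypothetical summary of one instance; not a cloud-tools type."""
    name: str
    kind: str                  # "loaner" (loaner/releng-dev) or "ci" (Buildbot CI)
    is_spot: bool = False
    uptime_hours: float = 0.0
    owner_done: bool = False   # owner is finished forever, or the bug is resolved
    owner_away: bool = False   # owner wants it again soon, but not for a day or two


def recommend_action(inst: Instance) -> str:
    """Return a suggested action for an instance already confirmed to be idle."""
    if inst.kind == "loaner":
        if inst.owner_done:
            return "terminate"   # then revert the vpn access bug, delete A/PTR records
        if inst.owner_away:
            return "stop"        # turning it back on is easy and fast
        return "ask-owner"       # poke the owner via the associated bug
    if inst.kind == "ci":
        if inst.uptime_hours <= LONG_RUNNING_HOURS:
            return "leave"       # not long running yet
        # Spot instances have no real 'stopped' state, so they are terminated.
        return "terminate" if inst.is_spot else "stop"
    raise ValueError(f"unknown instance kind: {inst.kind!r}")
```

For example, recommend_action(Instance("example-spot", "ci", is_spot=True, uptime_hours=5)) suggests "terminate", while the same uptime on a non-spot CI machine suggests "stop" so that aws_watch_pending can start it again when needed. The loaner branch deliberately falls back to "ask-owner", since the steps above call for judgement rather than a fixed rule.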


== Unknown Type Or State Instances ==
== Unknown Type Or State Instances ==