ReleaseEngineering/How To/Manage AWS slaves

From MozillaWiki


To simplify management of AWS slaves, you can use the aws_manage_instances.py script. It can stop, start, restart, and terminate instances, report their status, and enable or disable automatic reboot and automatic shutdown.

Usage

ssh into aws-manager2.srv.releng.scl3.mozilla.com as buildduty

ssh buildduty@aws-manager2.srv.releng.scl3.mozilla.com

set up the AWS environment (note: this now runs automatically when you log in as the buildduty user)

 source /builds/aws_manager/bin/activate
 cd /builds/aws_manager/cloud-tools/scripts

from /builds/aws_manager/cloud-tools/scripts, you can run aws_manage_instances.py with the following options (see the Examples section below):

usage: aws_manage_instances.py [-h] [-k SECRETS] [-r REGIONS] [-m COMMENTS]
                               [-n] [-q]
                               {stop,start,restart,enable,disable,terminate,status}
                               host [host ...]

positional arguments:
  {stop,start,restart,enable,disable,terminate,status}
                        action to be performed
  host                  hosts to be processed

optional arguments:
  -h, --help            show this help message and exit
  -k SECRETS, --secrets SECRETS
                        optional file where secrets can be found
  -r REGIONS, --region REGIONS
                        optional list of regions
  -m COMMENTS, --comments COMMENTS
                        reason to disable
  -n, --dry-run         Dry run mode
  -q, --quiet           Supress logging messages
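
When the same action has to be applied to several hosts, it can help to assemble the command line programmatically and try it with --dry-run first. The following is a minimal sketch, not part of cloud-tools: `build_command` is a hypothetical helper, it only uses flags that appear in the usage text above, and it leaves out -r because the expected format of the region list is not shown here.

```python
ACTIONS = ("stop", "start", "restart", "enable", "disable", "terminate", "status")


def build_command(action, hosts, secrets=None, comments=None,
                  dry_run=False, quiet=False):
    """Return an argv list for aws_manage_instances.py (hypothetical helper).

    Options are placed before the action, following the usage synopsis.
    """
    if action not in ACTIONS:
        raise ValueError("unknown action: %r" % (action,))
    if not hosts:
        raise ValueError("at least one host is required")

    argv = ["python", "aws_manage_instances.py"]
    if secrets:
        argv += ["-k", secrets]      # optional file where secrets can be found
    if comments:
        argv += ["-m", comments]     # reason to disable
    if dry_run:
        argv.append("-n")            # dry run mode
    if quiet:
        argv.append("-q")            # suppress logging messages
    argv.append(action)
    argv.extend(hosts)
    return argv


if __name__ == "__main__":
    # Print the command instead of running it, so it can be reviewed first.
    print(" ".join(build_command("stop", ["bld-linux64-ec2-001"], dry_run=True)))
```

The resulting list can be passed to `subprocess.call` from /builds/aws_manager/cloud-tools/scripts once the dry run looks right.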

Examples

Disable automatic reboots and start bld-linux64-ec2-001 (you may also want to disable it in slavealloc):

python aws_manage_instances.py disable -m "rail: need to debug XXX" bld-linux64-ec2-001
python aws_manage_instances.py start bld-linux64-ec2-001

Reboot it:

python aws_manage_instances.py restart bld-linux64-ec2-001

Terminate an instance (dev-linux64-ec2-001 in this example):

python aws_manage_instances.py terminate dev-linux64-ec2-001

Secrets (credentials)

There are 2 ways to pass AWS credentials so that you can authenticate yourself.

AWS_CREDENTIAL_FILE

The underlying library (boto) reads the AWS_CREDENTIAL_FILE environment variable, which holds the path to a file containing your credentials in the following format:

AWSAccessKeyId=xxx
AWSSecretKey=xxx

To use it, add the following command to your shell profile:

export AWS_CREDENTIAL_FILE=~/.ec2/aws-credential-file.txt
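
Before relying on the credential file, it can be worth checking that it has the two expected keys. The sketch below is an illustration only: `parse_credential_file` and `missing_keys` are hypothetical helpers, and the parsing assumes nothing beyond the KEY=value layout shown above.

```python
REQUIRED_KEYS = ("AWSAccessKeyId", "AWSSecretKey")


def parse_credential_file(text):
    """Parse KEY=value lines into a dict; lines without '=' are ignored."""
    creds = {}
    for line in text.splitlines():
        key, sep, value = line.strip().partition("=")
        if sep:
            creds[key.strip()] = value.strip()
    return creds


def missing_keys(creds):
    """Return the required keys that are absent or empty."""
    return [key for key in REQUIRED_KEYS if not creds.get(key)]


if __name__ == "__main__":
    import os

    # Assumes AWS_CREDENTIAL_FILE is exported as described above.
    path = os.path.expanduser(os.environ["AWS_CREDENTIAL_FILE"])
    with open(path) as handle:
        problems = missing_keys(parse_credential_file(handle.read()))
    print("missing keys: %s" % (", ".join(problems) or "none"))
```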

-k secrets.json

Create a JSON file with your credentials and pass it via the -k parameter. Example file:

{
    "aws_access_key_id": "xxx",
    "aws_secret_access_key": "xxx"
}
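
One way to generate this file without hand-editing JSON is sketched below. `render_secrets` and `write_secrets` are hypothetical helpers; the only structure taken from this page is the two key names. Creating the file with mode 0600 is an extra precaution assumed here, not something the script is known to require.

```python
import json
import os


def render_secrets(access_key_id, secret_access_key):
    """Return the JSON text for a secrets file with the two keys shown above."""
    payload = {
        "aws_access_key_id": access_key_id,
        "aws_secret_access_key": secret_access_key,
    }
    return json.dumps(payload, indent=4) + "\n"


def write_secrets(path, access_key_id, secret_access_key):
    """Write the secrets file; mode 0600 applies only if the file is created."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    with os.fdopen(fd, "w") as handle:
        handle.write(render_secrets(access_key_id, secret_access_key))
```

The written file can then be passed as `-k secrets.json`.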

AWS Sanity Check

Long Running Instances

  • Dealing With A Long Running Instance
    1. Check the machine's current status (is it actually running right now?) in one of the following ways:
      • Log into the AWS web console, look up the instance, and see if it is still running
        • [Note]: if you don't know the credentials for this, they probably have to be generated for you. Ask :catlee, as he has done this
      • Use the releng cloud-tools from aws-manager2.srv.releng.scl3.mozilla.com
        • see usage for 'status' command above
      • ssh into the machine
    2. Check when the latest build/activity was
      • If loaners/releng-dev machines:
        • ssh as root into that machine, and run `last`
        • find the bug that is associated with the instance and check latest comments.
          • the bug number can also be found by looking at the instance tags in AWS console
      • If it's one of our Buildbot CI machines
        • use Slave Health or ssh into machine and tail twistd.log
        • [Note]: these machines should not be running for long. A machine is put on the long running list if it has been up for more than 2h. So if it has been idle for a while, further action will be required.
    3. For instances that have not had any recent builds/activity and you are sure they are not currently doing a build
      • If loaners/releng-dev machines:
        • Poke the owner of the instance via the associated bug to check whether they still need the machine.
        • use judgement for what's fair, e.g. if it's been up for 24-48hrs, that is probably not cause for further action.
        • Store owner/usage detail in the moz-used-by instance Tag (if not already updated)
          • this is done by going into the AWS web console and filling in the section called "tags"
        • Decide whether to stop instance for a period of time or reclaim + terminate instance
          • 'stop' the instance if owner wants to use it again soon but won't be working on it for a day or two
            • see usage for 'stop' command above
            • [Note]: this should be made appealing to the owner as turning it back on is *easy* and fast!
          • 'terminate' the instance if the owner has said they are finished with it for good or if the bug is resolved
            • see Reclaiming Loaners
            • [Note]: don't forget to revert the VPN access bug and delete the A/PTR records.
      • If it's one of our Buildbot CI machines:
        • Decide whether to stop or terminate instance
          • if this is a spot instance:
            • terminate it (don't delete A/PTR records)
            • [Note]: they need to be terminated because spot instances don't really have a 'stopped' state.
          • if this is not a spot instance:
            • shut it down in one of the following ways:
              • use the 'stop' command (see usage above)
              • log into the AWS web console and choose 'stop' in the dropdown
              • ssh into the machine and run: $ shutdown -h now
            • [Note]: stopping lets aws_watch_pending decide when the instance needs to be started up again
  • For instances that are repeatedly problematic, further action will be required. Ask in #releng and possibly escalate to catlee/rail
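
The decision steps above can be summarised in a short sketch. This is only an illustration of the checklist, not a tool that exists in cloud-tools: `recommend_action`, its parameters, and the returned strings are all hypothetical names, and the final call still needs human judgement.

```python
def recommend_action(kind, recently_active, spot=False, owner_intent=None):
    """Suggest what to do with a long running instance.

    kind            -- "loaner" (loaner/releng-dev) or "ci" (Buildbot CI)
    recently_active -- True if there was a recent build or activity
    spot            -- True for spot instances (CI machines only)
    owner_intent    -- for loaners: "finished", "paused", or None if unknown
    """
    if recently_active:
        return "leave running"

    if kind == "loaner":
        if owner_intent == "finished":
            # Reclaim: also revert the VPN access bug, delete A/PTR records.
            return "terminate"
        if owner_intent == "paused":
            # Owner wants it back soon; starting it again is easy and fast.
            return "stop"
        # Intent unknown: poke the owner via the associated bug first.
        return "ask owner"

    if kind == "ci":
        # Spot instances have no real 'stopped' state, so terminate them
        # (keeping their A/PTR records); otherwise stop the instance.
        return "terminate" if spot else "stop"

    raise ValueError("unknown instance kind: %r" % (kind,))
```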

Unknown Type Or State Instances

  • Most of these are created by us. Track down who made the instance and ask them to either:
    • tag the instance properly by going into the AWS web console and filling in the section called "tags"
    • fix the reporting

Stopped For A While Instances

  • If loaners/releng-dev machines:
    • when it has been more than ~2 weeks (say >300 hrs), poke the owner in the associated bug and ask whether it is OK to terminate the instance.
  • If it's one of our Buildbot CI machines:
    • when it has been more than ~1 month (say >700 hrs) we should terminate the instance
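
As a rough sketch, the two thresholds above can be written as a small check. `stopped_instance_action`, the constant names, and the returned strings are hypothetical; only the 300 and 700 hour figures come from this page.

```python
LOANER_LIMIT_HOURS = 300   # roughly 2 weeks
CI_LIMIT_HOURS = 700       # roughly 1 month


def stopped_instance_action(kind, hours_stopped):
    """Suggest a follow-up for an instance that has been stopped for a while.

    kind is "loaner" (loaner/releng-dev) or "ci" (Buildbot CI machine).
    """
    if kind == "loaner":
        # Never terminate a loaner without checking with the owner first.
        return "ask owner" if hours_stopped > LOANER_LIMIT_HOURS else "wait"
    if kind == "ci":
        return "terminate" if hours_stopped > CI_LIMIT_HOURS else "wait"
    raise ValueError("unknown instance kind: %r" % (kind,))
```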