ReleaseEngineering/How To/Manage spot AMIs

Problem

Something is wrong with the spot AMIs: for example, jobs on spot instances are failing or not starting, or the golden AMI generation is stuck (see Background).

Background

Spot AMIs are generated (almost) from scratch daily by aws_create_instance.py

The script uses base AMIs generated by aws_create_ami.py. These are generated manually and contain only the base system. The particular base AMIs used to generate the final spot AMIs are listed in the instance configs (e.g. bld-linux64).

The script takes a base AMI, puppetizes it, and cleans up some files. Spot AMIs use cloud-init to bootstrap their hostnames, which are specified in instance-specific user-data.
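
To illustrate the cloud-init step: below is a minimal boto3 sketch of launching an instance with user-data that sets the hostname on first boot. This is not the actual releng tooling (aws_create_instance.py works differently), and the AMI id, instance type, and hostnames are placeholders:

# Hypothetical sketch: launch an instance with cloud-init user-data
# that sets the hostname, roughly analogous to what the spot AMI
# bootstrap relies on. All concrete values below are placeholders.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# cloud-init "cloud-config" user-data; cloud-init applies hostname/fqdn
# on first boot.
user_data = """#cloud-config
hostname: bld-linux64-spot-001
fqdn: bld-linux64-spot-001.example.releng.mozilla.com
"""

ec2.run_instances(
    ImageId="ami-42cf4d38",    # placeholder AMI id
    InstanceType="c3.xlarge",  # placeholder instance type
    MinCount=1,
    MaxCount=1,
    UserData=user_data,        # boto3 base64-encodes this for us
)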

Sometimes things can go wrong during this process.

  • the most recent puppet configs could result in some bad AMIs, and the jobs that run on the corresponding instances may fail or not start. If that's the case, we should unregister those bad AMIs, delete the associated snapshots, and terminate the faulty instances (a sketch of this cleanup follows below). That way, automation falls back to the previous AMIs while we figure out what caused the issues.
  • some golden AMIs may get stuck in the creation process, and we'll get a notification like the following in #buildduty:
<nagios-releng> Fri 14:36:24 UTC [7351] [] aws-manager2.srv.releng.scl3.mozilla.com:procs age - golden AMI is CRITICAL: ELAPSED CRITICAL: 5 crit, 0 warn out of 5 processes with args 'ec2-golden' (http://m.mozilla.org/procs+age+-+golden+AMI)

We should check the logs both on aws-manager2 (/var/log/messages) and on the instance itself to see why it got stuck. If the issue is an isolated one, we generally terminate the golden AMI instance in the AWS console, kill the associated processes, and then force the generation of a new golden AMI. On the other hand, if the golden AMI generation process gets stuck frequently, we should file a bug against Releng::Buildduty and work on a fix.
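
For the first case above (unregistering bad AMIs and deleting their snapshots), here is a minimal boto3 sketch. It is illustrative only, not the official cleanup path, and the AMI id is a placeholder:

# Hypothetical sketch: deregister an AMI and delete the EBS snapshots
# it references. The real cleanup may be done via the AWS web console.
import boto3

def deregister_ami_and_snapshots(region, ami_id):
    ec2 = boto3.client("ec2", region_name=region)

    # Look up the AMI to find the snapshots backing its block devices.
    image = ec2.describe_images(ImageIds=[ami_id])["Images"][0]
    snapshot_ids = [
        bdm["Ebs"]["SnapshotId"]
        for bdm in image.get("BlockDeviceMappings", [])
        if "Ebs" in bdm
    ]

    # Deregister first: snapshots of a registered AMI can't be deleted.
    ec2.deregister_image(ImageId=ami_id)
    for snap_id in snapshot_ids:
        ec2.delete_snapshot(SnapshotId=snap_id)

deregister_ami_and_snapshots("us-east-1", "ami-42cf4d38")  # placeholder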

Troubleshooting

To find out which AMI a particular instance uses, run the following:

ssh cltbld@instance_ip curl http://169.254.169.254/latest/meta-data/ami-id
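
If SSH to the instance isn't an option, the same mapping can be read from the EC2 API. A minimal boto3 sketch, assuming configured AWS credentials (the private IP is a placeholder):

# Hypothetical sketch: find which AMI an instance was launched from,
# looked up by its private IP instead of SSHing into it.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")
resp = ec2.describe_instances(
    Filters=[{"Name": "private-ip-address", "Values": ["10.134.48.17"]}]
)
for reservation in resp["Reservations"]:
    for instance in reservation["Instances"]:
        print(instance["InstanceId"], instance["ImageId"])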

Verify that it matches the latest AMIs in the get_spot_amis.py output:

$ python scripts/get_spot_amis.py
bld-linux64, us-east-1: ami-42cf4d38 (spot-bld-linux64-2017-11-17-09-43, ebs)
try-linux64, us-east-1: ami-38cc4e42 (spot-try-linux64-2017-11-17-09-36, ebs)
tst-linux64, us-east-1: ami-2fdb5955 (spot-tst-linux64-2017-11-17-10-11, ebs)
tst-linux32, us-east-1: ami-eed45694 (spot-tst-linux32-2017-11-17-10-13, ebs)
tst-emulator64, us-east-1: ami-90d95bea (spot-tst-emulator64-2017-11-17-10-33, ebs)
av-linux64, us-east-1: ami-4fd25035 (spot-av-linux64-2017-11-17-09-57, ebs)
bld-linux64, us-west-2: ami-c7d201bf (spot-bld-linux64-2017-11-17-09-43, ebs)
try-linux64, us-west-2: ami-f5a6758d (spot-try-linux64-2017-11-17-09-36, ebs)
tst-linux64, us-west-2: ami-c3a97abb (spot-tst-linux64-2017-11-17-10-11, ebs)
tst-linux32, us-west-2: ami-a5a774dd (spot-tst-linux32-2017-11-16-10-13, ebs)
tst-emulator64, us-west-2: ami-3bd20143 (spot-tst-emulator64-2017-11-17-10-33, ebs)
av-linux64, us-west-2: ami-9dae7de5 (spot-av-linux64-2017-11-17-09-57, ebs)

Delete the AMIs and the corresponding snapshots using the AWS web console (us-east-1, us-west-2). To terminate the instances launched from those AMIs, use the following script:

e.g. python scripts/aws_terminate_by_ami_id.py -v <ami_id>
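
For reference, a minimal boto3 sketch of what terminating by AMI id involves. This is an illustration, not the actual aws_terminate_by_ami_id.py, and the region list is an assumption:

# Hypothetical sketch: terminate all instances launched from a given
# AMI, across the two regions used above.
import boto3

def terminate_by_ami_id(ami_id, regions=("us-east-1", "us-west-2")):
    for region in regions:
        ec2 = boto3.client("ec2", region_name=region)
        resp = ec2.describe_instances(
            Filters=[{"Name": "image-id", "Values": [ami_id]}]
        )
        instance_ids = [
            i["InstanceId"]
            for r in resp["Reservations"]
            for i in r["Instances"]
        ]
        if instance_ids:
            ec2.terminate_instances(InstanceIds=instance_ids)
            print(region, "terminated:", ", ".join(instance_ids))

terminate_by_ami_id("ami-42cf4d38")  # placeholder AMI id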