ReleaseEngineering/How To/Manage spot AMIs
{{Release Engineering How To|Manage_spot_AMIs}}
= Problem =
Something wrong with spot AMIs.
= Background =
Spot AMIs are generated (almost) daily from scratch by [https://github.com/mozilla-releng/build-cloud-tools/blob/master/cloudtools/scripts/aws_create_instance.py aws_create_instance.py].
The script uses base AMIs generated by [https://github.com/mozilla-releng/build-cloud-tools/blob/master/cloudtools/scripts/aws_create_ami.py aws_create_ami.py]. These are generated manually and contain only the base system. The particular base AMIs used to generate the final spot AMIs are listed in the instance configs (e.g. [https://github.com/mozilla-releng/build-cloud-tools/blob/master/configs/bld-linux64#L6 bld-linux64]).
The script takes a base AMI, puppetizes it, and cleans up some files. Spot AMIs use cloud-init to bootstrap their hostnames, which are specified in instance-specific user-data.
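If you want to see exactly what cloud-init receives on a given instance, the user-data is exposed by the instance metadata service. A minimal check, assuming the same cltbld access used in the Troubleshooting section below (the actual user-data contents are generated by aws_create_instance.py and differ per instance):
<pre>
# Inspect the instance-specific user-data that cloud-init uses to set the hostname
ssh cltbld@instance_ip curl http://169.254.169.254/latest/user-data
</pre>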
Sometimes things can go wrong during this process:
* The most recent puppet configs could result in bad AMIs, and the jobs running on the corresponding instances may fail or not start. If that's the case, we should unregister those bad AMIs, delete the associated snapshots, and terminate the faulty instances (see the Troubleshooting section below). That way automation will keep using the previous AMIs while we figure out what caused the issues.
* Some golden AMIs may get stuck in the creation process, and we'll get a notification for that in #buildduty:
<pre>
<nagios-releng> Fri 14:36:24 UTC [7351] [] aws-manager2.srv.releng.scl3.mozilla.com:procs age - golden AMI is CRITICAL: ELAPSED CRITICAL: 5 crit, 0 warn out of 5 processes with args 'ec2-golden' (http://m.mozilla.org/procs+age+-+golden+AMI)
</pre>
We should check the logs on both aws-manager2 (/var/log/messages) and on the instance itself to see why it got stuck. Depending on the situation, if the issue is an isolated one, we generally terminate the golden AMI instance in the [https://console.aws.amazon.com/ec2/v2/home?region=us-east-1#Instances:sort=desc:launchTime AWS console] and kill the associated processes. We can then force [https://wiki.mozilla.org/ReleaseEngineering/How_To/Work_with_Golden_AMIs the generation of a new golden AMI]. On the other hand, if the golden AMI generation process gets stuck frequently, we should file a bug against Releng::Buildduty and work on a fix.
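A rough sketch of that investigation on aws-manager2 (the log path comes from the paragraph above and the 'ec2-golden' process pattern from the alert; double-check the process list before killing anything):
<pre>
# Look for recent golden AMI activity in the system log
grep -i golden /var/log/messages | tail -n 50
# List the stuck processes whose args match 'ec2-golden'
pgrep -fl ec2-golden
# After terminating the golden AMI instance in the console, kill the stuck processes
pkill -f ec2-golden
</pre>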
= Troubleshooting =
To find which AMI a particular instance uses, you can run the following:
<pre>
ssh cltbld@instance_ip curl http://169.254.169.254/latest/meta-data/ami-id
</pre>
Verify that it matches the latest AMIs in [https://github.com/mozilla-releng/build-cloud-tools/blob/master/cloudtools/scripts/get_spot_amis.py get_spot_amis.py] output:
<pre>
$ python scripts/get_spot_amis.py
bld-linux64, us-east-1: ami-42cf4d38 (spot-bld-linux64-2017-11-17-09-43, ebs)
try-linux64, us-east-1: ami-38cc4e42 (spot-try-linux64-2017-11-17-09-36, ebs)
tst-linux64, us-east-1: ami-2fdb5955 (spot-tst-linux64-2017-11-17-10-11, ebs)
tst-linux32, us-east-1: ami-eed45694 (spot-tst-linux32-2017-11-17-10-13, ebs)
tst-emulator64, us-east-1: ami-90d95bea (spot-tst-emulator64-2017-11-17-10-33, ebs)
av-linux64, us-east-1: ami-4fd25035 (spot-av-linux64-2017-11-17-09-57, ebs)
bld-linux64, us-west-2: ami-c7d201bf (spot-bld-linux64-2017-11-17-09-43, ebs)
try-linux64, us-west-2: ami-f5a6758d (spot-try-linux64-2017-11-17-09-36, ebs)
tst-linux64, us-west-2: ami-c3a97abb (spot-tst-linux64-2017-11-17-10-11, ebs)
tst-linux32, us-west-2: ami-a5a774dd (spot-tst-linux32-2017-11-16-10-13, ebs)
tst-emulator64, us-west-2: ami-3bd20143 (spot-tst-emulator64-2017-11-17-10-33, ebs)
av-linux64, us-west-2: ami-9dae7de5 (spot-av-linux64-2017-11-17-09-57, ebs)
</pre>
Delete the AMIs and the corresponding snapshots using the AWS Web Console: [https://console.aws.amazon.com/ec2/v2/home?region=us-east-1#Images:sort=name us-east-1], [https://console.aws.amazon.com/ec2/v2/home?region=us-west-2#Images:sort=name us-west-2]. To terminate the instances generated by those AMIs, use [https://github.com/mozilla-releng/build-cloud-tools/blob/master/cloudtools/scripts/aws_terminate_by_ami_id.py aws_terminate_by_ami_id.py], e.g.:
<pre>
python scripts/aws_terminate_by_ami_id.py -v <ami_id>
</pre>