ReleaseEngineering/How To/Manage spot AMIs

{{Release Engineering How To|Manage_spot_AMIs}}
= Problem =


Something is wrong with the spot AMIs.


= Background =


Spot AMIs are generated (almost) daily from scratch by [https://github.com/mozilla-releng/build-cloud-tools/blob/master/cloudtools/scripts/aws_create_instance.py aws_create_instance.py].
   
   
The script uses base AMIs generated by [https://github.com/mozilla-releng/build-cloud-tools/blob/master/cloudtools/scripts/aws_create_ami.py aws_create_ami.py]. These are generated manually and contain only the base system. The particular base AMIs used to generate the final spot AMIs are listed in the instance configs (e.g. [https://github.com/mozilla-releng/build-cloud-tools/blob/master/configs/bld-linux64#L6 bld-linux64]).


The script takes a base AMI, puppetizes it, and cleans up some files. Spot AMIs use cloud-init to bootstrap their hostnames, which are specified in instance-specific user-data.
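
If you need to see the user-data a given instance was launched with (this is where its hostname is specified), you can query the EC2 instance metadata service, similar to the ami-id lookup in the Troubleshooting section below:
<pre>
# run on (or via ssh to) the instance; prints the cloud-init user-data it booted with
curl http://169.254.169.254/latest/user-data
</pre>
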
Sometimes things can go wrong during this process:
* The most recent puppet configs could result in bad AMIs, and the jobs run on the corresponding instances may fail or not start. If that's the case, we should unregister those bad AMIs, delete the associated snapshots, and terminate the faulty instances. That way automation falls back to the previous AMIs while we figure out what caused the issues.
* Some golden AMIs may get stuck in the creation process, and we'll get a notification in #buildduty for that:
<pre>
<nagios-releng> Fri 14:36:24 UTC [7351] [] aws-manager2.srv.releng.scl3.mozilla.com:procs age - golden AMI is CRITICAL: ELAPSED CRITICAL: 5 crit, 0 warn out of 5 processes with args 'ec2-golden' (http://m.mozilla.org/procs+age+-+golden+AMI)
</pre>
We should check the logs on both aws-manager2 (/var/log/messages) and on the instance itself to see why it got stuck (see the sketch below for the kind of checks to run). If the issue is an isolated one, we generally terminate the golden AMI instance in the [https://console.aws.amazon.com/ec2/v2/home?region=us-east-1#Instances:sort=desc:launchTime AWS console] and kill the associated processes. We can then force [https://wiki.mozilla.org/ReleaseEngineering/How_To/Work_with_Golden_AMIs the generation of a new golden AMI]. On the other hand, if the golden AMI generation process gets stuck frequently, we should file a bug against Releng::Buildduty and work on a fix.
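
A minimal sketch of the kind of checks to run on aws-manager2 when this alert fires; the exact process name and log wording are assumptions based on the alert above rather than a documented interface:
<pre>
# on aws-manager2: inspect the stuck processes flagged by nagios (args 'ec2-golden')
ps -ef | grep ec2-golden

# look for recent golden-AMI related entries in the system log
grep -i golden /var/log/messages | tail -n 50
</pre>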


= Troubleshooting =
To find which AMI a particular instance uses, you can run the following:
<pre>
ssh cltbld@instance_ip curl http://169.254.169.254/latest/meta-data/ami-id
</pre>


 
Verify that it matches the latest AMIs in [https://github.com/mozilla-releng/build-cloud-tools/blob/master/cloudtools/scripts/get_spot_amis.py get_spot_amis.py] output:
<pre>
# example output
$ python scripts/get_spot_amis.py
bld-linux64, us-east-1: ami-42cf4d38 (spot-bld-linux64-2017-11-17-09-43, ebs)
try-linux64, us-east-1: ami-38cc4e42 (spot-try-linux64-2017-11-17-09-36, ebs)
tst-linux64, us-east-1: ami-2fdb5955 (spot-tst-linux64-2017-11-17-10-11, ebs)
tst-linux32, us-east-1: ami-eed45694 (spot-tst-linux32-2017-11-17-10-13, ebs)
tst-emulator64, us-east-1: ami-90d95bea (spot-tst-emulator64-2017-11-17-10-33, ebs)
av-linux64, us-east-1: ami-4fd25035 (spot-av-linux64-2017-11-17-09-57, ebs)
bld-linux64, us-west-2: ami-c7d201bf (spot-bld-linux64-2017-11-17-09-43, ebs)
try-linux64, us-west-2: ami-f5a6758d (spot-try-linux64-2017-11-17-09-36, ebs)
tst-linux64, us-west-2: ami-c3a97abb (spot-tst-linux64-2017-11-17-10-11, ebs)
tst-linux32, us-west-2: ami-a5a774dd (spot-tst-linux32-2017-11-16-10-13, ebs)
tst-emulator64, us-west-2: ami-3bd20143 (spot-tst-emulator64-2017-11-17-10-33, ebs)
av-linux64, us-west-2: ami-9dae7de5 (spot-av-linux64-2017-11-17-09-57, ebs)
</pre>


Delete the AMIs and the corresponding snapshots using the AWS Web Console: [https://console.aws.amazon.com/ec2/v2/home?region=us-east-1#Images:sort=name us-east-1], [https://console.aws.amazon.com/ec2/v2/home?region=us-west-2#Images:sort=name us-west-2]. To delete instances generated by those AMIs, use the following [https://github.com/mozilla-releng/build-cloud-tools/blob/master/cloudtools/scripts/aws_terminate_by_ami_id.py script]:
<pre>
e.g. python scripts/aws_terminate_by_ami_id.py -v <ami_id>
</pre>
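
If you prefer to script the console steps, here is a minimal sketch using the AWS CLI (this is not part of cloud-tools; <ami_id> and <snapshot_id> are placeholders, and us-east-1 is just an example region):
<pre>
# list the snapshots backing the bad AMI before deregistering it
aws ec2 describe-images --region us-east-1 --image-ids <ami_id> \
    --query 'Images[].BlockDeviceMappings[].Ebs.SnapshotId'

# unregister the bad AMI, then delete each snapshot reported above
aws ec2 deregister-image --region us-east-1 --image-id <ami_id>
aws ec2 delete-snapshot --region us-east-1 --snapshot-id <snapshot_id>

# list instances launched from that AMI (aws_terminate_by_ami_id.py terminates these for you)
aws ec2 describe-instances --region us-east-1 --filters Name=image-id,Values=<ami_id> \
    --query 'Reservations[].Instances[].InstanceId'
</pre>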
