CIDuty/How To/Troubleshoot AWS

From MozillaWiki

Sometimes AWS spins up bad instances. Usually sheriffs notify ciduty about these, but if you see one, escalate to ciduty in #ci. A job may appear as failed if the instance it was running on disappears; spot instances can disappear when they are outbid.


Figuring out if the instance is bad or not

When jobs fail, the first thing to determine is whether the failures are isolated incidents or whether they affect a large portion of the worker pool (usually over 30% of its instances).

1. When it is an isolated incident, check the other jobs the affected machine has run. If the rest of the jobs are green, the instance isn't faulty, and the failure was probably caused by a bad build, a bad config, or a network issue. If all the jobs failed or completed as exception, that is a sign the instance is in a bad state and should be terminated.

2. If a large portion of the worker pool is affected, start by looking into the logs of the failed tests. Your best bet is finding a common issue, which is usually caused by infrastructure or network problems, or by tasks that weren't properly configured.
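The triage above can be sketched as two small checks, assuming we already have each worker's recent job results in hand (the 30% threshold and the "failed"/"exception" states come from this page; the function names and data shapes are illustrative):

```python
# Sketch of the triage logic above. Job results are simplified to the
# strings "completed", "failed", and "exception"; names are illustrative.

POOL_FAILURE_THRESHOLD = 0.30  # "usually over 30%" of the pool affected

def instance_is_bad(job_results):
    """Step 1: an instance looks bad when every one of its jobs
    failed or completed as exception."""
    return bool(job_results) and all(
        r in ("failed", "exception") for r in job_results
    )

def pool_is_affected(instances_with_failures, pool_size):
    """Step 2: a large portion of the pool failing points at infra,
    network, or task-configuration problems, not one bad instance."""
    return instances_with_failures / pool_size > POOL_FAILURE_THRESHOLD

# A worker whose other jobs are green is probably fine:
print(instance_is_bad(["completed", "failed", "completed"]))  # False
# A worker failing everything is in a bad state:
print(instance_is_bad(["failed", "exception"]))               # True
```

The point of the second check is to avoid terminating healthy workers when the real culprit is shared infrastructure.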


Bad Instances

To understand whether a job failure is caused by a spot instance, it's best to first understand the various ways a task can be resolved. See this page for more information.

When AWS spins up a bad instance (usually identified by the fact that it fails every job), find it in the worker explorer of the AWS Provisioner and terminate it; AWS will spin up a new one. You can do this even while a task is running, thanks to the built-in mechanism for retrying jobs. To further understand the interaction between the queue and a worker, check out the official docs.
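Terminating a worker mid-task is safe because the queue re-runs the task elsewhere. A simplified sketch of that retry behavior, under the assumption that a task carries a retry budget (the class and field names here are illustrative, not the real queue API — see the official docs for the actual interaction):

```python
# Simplified model of what happens when a worker disappears mid-task.
# Names are illustrative; the real queue/worker protocol is documented
# in the official Taskcluster docs.

class Task:
    def __init__(self, retries_left=5):
        self.retries_left = retries_left
        self.state = "running"

def worker_disappeared(task):
    """When an instance is terminated (or a spot instance is outbid)
    mid-task, the queue ends that run and, while retries remain,
    re-queues the task for another worker instead of failing it."""
    if task.retries_left > 0:
        task.retries_left -= 1
        task.state = "pending"    # re-queued for a fresh worker
    else:
        task.state = "exception"  # out of retries
    return task.state

task = Task(retries_left=2)
print(worker_disappeared(task))  # pending
print(worker_disappeared(task))  # pending
print(worker_disappeared(task))  # exception
```

This is why killing a bad instance is cheap: at worst the running task loses one retry, and at best the job lands on a healthy worker and goes green.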