Currently the oldest pending job is nearly 6h old. Initial investigation found that runner is not starting when instances come up, possibly because Puppet did not run when the golden image was generated. grenade/q/markco are investigating; trees are closed.
Highlights from IRC (Pacific timestamps):
12:36 <grenade> trying a mass reboot of running spot instances to see if they start runner on restart
12:49 <grenade> reboot did nothing, amis after march 05 deregistered. now mass terminating instances
12:50 <grenade> new instances from march 05 ami should start shortly
13:02 <grenade> I believe it will take up to an hour for new instances to spawn (from previous experience)
A bad AMI. The Puppet run did not complete successfully during the b-2008 golden AMI creation: https://foreman.pub.build.mozilla.org/reports/18989815 In testing outside of cloudtools the error did not recur. Given that, and that the y-2008 golden AMI Puppet run completed successfully, we may want to spin up another b-2008 golden AMI this afternoon and see what the result is. Grenade, thoughts?
From the watch pending log:
2016-03-10 13:53:39,088 - b-2008 - started 80 spot instances; need 0
2016-03-10 13:57:39,920 - b-2008 - started 56 spot instances; need 23
2016-03-10 13:57:42,182 - b-2008 - started 0 instances; need 23
Pending is down to 33.
Pending is down to 0 and trees are reopened. Leaving this open for investigation but lowering the severity. Could we please make sure that Puppet failures block the rest of the golden-AMI publishing process?
Created attachment 8729250 [details] [review] https://github.com/mozilla/build-cloud-tools/pull/196 Our Puppet agent success/failure check parsed the puppet-agent-summary log (https://groups.google.com/a/mozilla.com/forum/?hl=en#!topic/releng-puppet-mail/NMXCqTpLM5k) but not the puppet-agent-run log (https://groups.google.com/a/mozilla.com/forum/?hl=en#!topic/releng-puppet-mail/mOroIr9m-MQ), which is where today's failures were recorded. This PR addresses that by also parsing the agent run log for a specific failure message, as an additional mechanism for detecting a failed Puppet agent run.
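To illustrate the shape of the fix: a minimal sketch of a check that consults both logs, where the log formats, file names, and failure strings are assumptions for illustration (the actual pattern matched in the PR may differ).

```python
import re

# Hypothetical failure marker for the agent run log; the real PR may
# match a different, more specific message.
AGENT_RUN_FAILURE_RE = re.compile(r"Could not retrieve catalog|^Error:",
                                  re.IGNORECASE | re.MULTILINE)

def puppet_run_failed(summary_log_text, agent_run_log_text):
    """Return True if either the puppet-agent-summary or the
    puppet-agent-run log indicates a failed Puppet run."""
    # Pre-existing check: the summary log reports a non-zero failure
    # count, e.g. a "failed: N" line (illustrative summary format).
    summary_failed = bool(re.search(r"failed:\s*[1-9]", summary_log_text))
    # Additional check (the fix): scan the agent run log for failure
    # messages that never make it into the summary.
    run_failed = bool(AGENT_RUN_FAILURE_RE.search(agent_run_log_text))
    return summary_failed or run_failed
```

The point is that a run can fail hard enough that the summary is never written or reports nothing, so the run log has to be inspected as well.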
grenade, is the failure/hang of ec2-golden for the last 35 hours related to the fix here?
nthomas, yes! It's actually the prescribed behaviour: if the userdata Puppet cert or agent jobs fail to finish successfully, we prevent the instance from shutting down and loop with ever-increasing waits, relying on the Nagios alerts to get someone to intervene. Occasionally the fault has even rectified itself without intervention. The fix here was to recognise a type of failure we weren't checking for before and include it as a reason to loop.
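The loop described above can be sketched as follows. This is not the cloud-tools code; the callable name, initial wait, and cap are all assumptions, and `sleep` is injectable only to make the sketch testable.

```python
import time

def wait_for_puppet_success(run_puppet, sleep=time.sleep, max_wait=3600):
    """Retry the Puppet run with ever-increasing waits instead of shutting
    the instance down, so the Nagios pending alerts summon a human (or the
    fault clears on its own).  `run_puppet` is a hypothetical callable
    returning True when the cert/agent jobs finish successfully."""
    wait = 60  # initial back-off in seconds (illustrative)
    while not run_puppet():
        sleep(wait)                      # instance stays up; no shutdown
        wait = min(wait * 2, max_wait)   # grow the interval, capped
    # Only once Puppet succeeds do we fall through and allow shutdown.
    return wait
```

With this structure, an unrecognised failure mode (like today's) would have slipped past `run_puppet` and let the instance shut down with a broken image, which is exactly why the extra log check was added as a reason to keep looping.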