
No windows builds on non-try trees for several hours

Status

Product: Release Engineering
Component: Buildduty
Status: RESOLVED FIXED
Reported: 2 years ago
Modified: 2 years ago

People

(Reporter: nthomas, Assigned: grenade)

Tracking

Firefox Tracking Flags

(Not tracked)

Details

Attachments

(1 attachment)

(Reporter)

Description

2 years ago
The oldest pending job is currently nearly 6h old. Initial investigation found that runner is not starting when instances come up, possibly because puppet did not run when the golden image was generated.

grenade/q/markco are investigating, trees are closed.
(Reporter)

Comment 1

2 years ago
Highlights from IRC, Pacific timestamps

12:36 <grenade> trying a mass reboot of running spot instances to see if they start runner on restart
12:49 <grenade> reboot did nothing, amis after march 05 deregistered. now mass terminating instances
12:50 <grenade> new instances from march 05 ami should start shortly
13:02 <grenade> I believe it will take up to an hour for new instances to spawn (from previous experience)
Comment 2

2 years ago
A bad AMI. The Puppet run did not complete successfully during the b-2008 golden AMI creation: https://foreman.pub.build.mozilla.org/reports/18989815

In testing outside of cloudtools the error did not recur. Given that, and that the Puppet run for the y-2008 golden AMI completed successfully, we may want to spin up another b-2008 golden AMI this afternoon and see what the result is.

Grenade, thoughts?
Flags: needinfo?(rthijssen)
(Reporter)

Comment 3

2 years ago
From the watch pending log:
2016-03-10 13:53:39,088 - b-2008 - started 80 spot instances; need 0
2016-03-10 13:57:39,920 - b-2008 - started 56 spot instances; need 23
2016-03-10 13:57:42,182 - b-2008 - started 0 instances; need 23

Pending is down to 33.
(Reporter)

Comment 4

2 years ago
Pending is down to 0, trees are reopened. Leaving open for investigation but lowering severity.

Could we please make sure that puppet failures block the rest of the golden-AMI publishing process?
Severity: blocker → normal
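(The following is a minimal sketch of the kind of gate being asked for here, not build-cloud-tools code; gate_ami_publish and the publish callable are hypothetical names.)

```python
import logging
import sys

log = logging.getLogger(__name__)

def gate_ami_publish(puppet_succeeded, publish):
    """Call publish() only when the puppet run succeeded; otherwise abort
    so a bad golden AMI never reaches the spot pools."""
    if not puppet_succeeded:
        log.error("puppet run failed; refusing to publish golden AMI")
        sys.exit(1)
    publish()
```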
(Assignee)

Comment 5

2 years ago
Created attachment 8729250 [details] [review]
https://github.com/mozilla/build-cloud-tools/pull/196

Our puppet agent success/failure check parsed the puppet-agent-summary log (https://groups.google.com/a/mozilla.com/forum/?hl=en#!topic/releng-puppet-mail/NMXCqTpLM5k) but not the puppet-agent-run log (https://groups.google.com/a/mozilla.com/forum/?hl=en#!topic/releng-puppet-mail/mOroIr9m-MQ), where today's failures were recorded.

This PR addresses that by also parsing the agent run log for a specific failure message, as an additional mechanism for detecting a failed puppet agent run.
Flags: needinfo?(rthijssen)
Attachment #8729250 - Flags: review?(mcornmesser)
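(A rough sketch of the approach described in comment 5, under assumed file paths and a hypothetical failure string; the actual pattern and plumbing live in the PR.)

```python
import re

# Hypothetical failure marker; the exact string the PR checks for may differ.
FAILURE_PATTERN = re.compile(r"Error:|Could not retrieve catalog")

def puppet_run_failed(summary_path, run_log_path):
    """Flag failure if either the summary or the agent run log shows a problem."""
    # Existing check (simplified): the puppet-agent-summary mentions a failure.
    with open(summary_path) as summary:
        summary_failed = "failed" in summary.read().lower()

    # Additional check from the PR: scan the puppet-agent-run log for a
    # specific failure message that the summary does not capture.
    with open(run_log_path) as run_log:
        run_log_failed = any(FAILURE_PATTERN.search(line) for line in run_log)

    return summary_failed or run_log_failed
```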
(Assignee)

Updated

2 years ago
Assignee: nobody → rthijssen
Attachment #8729250 - Flags: review?(mcornmesser) → review+
(Assignee)

Updated

2 years ago
Status: NEW → RESOLVED
Last Resolved: 2 years ago
Resolution: --- → FIXED
(Reporter)

Comment 6

2 years ago
grenade, is the failure/hang of ec2-golden for the last 35 hours related to the fix here?
Flags: needinfo?(rthijssen)
(Assignee)

Comment 7

2 years ago
nthomas, Yes! It's actually the prescribed behaviour. If the userdata puppet cert or agent jobs fail to finish successfully, we prevent the instance from shutting down, and loop in ever-increasing timespans while we rely on the nagios alerts to get someone to intervene. Occasionally the fault has even rectified itself without intervention. The fix here was to recognise a type of failure we weren't checking for before and include it as a reason to loop.
Flags: needinfo?(rthijssen)
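(An illustrative sketch of the loop described in comment 7; the real behaviour runs in the instance userdata, not Python, and run_puppet is a hypothetical callable.)

```python
import time

def wait_for_puppet(run_puppet, max_delay=3600):
    """Retry the puppet run with ever-increasing waits instead of shutting
    the instance down, so nagios alerts can summon a human if it never passes."""
    delay = 60
    while not run_puppet():
        time.sleep(delay)                  # instance stays up rather than shutting down
        delay = min(delay * 2, max_delay)  # loop in ever-increasing timespans
```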