Trees closed, Windows builds not starting on non-Try branches

RESOLVED FIXED

Status

Priority: --
Severity: major
Opened: 3 years ago
Updated: 7 months ago

People

(Reporter: philor, Assigned: grenade)

(Reporter)

Description

3 years ago
Don't have enough pushes to give a clear start time, but going by the pending fuzzing and l10n nightlies, it started sometime around 04:30.

Wouldn't that be New AMI Time?

All non-try non-b2g trees are closed.
There are ~200 instances running using the newest AMIs (spot-b-2008-2016-03-20-10-01, us-east-1: ami-28959742, us-west-2: ami-2eec044e), but the few I checked don't have a cmd.exe open, so runner and buildbot aren't running.

There were some nagios alerts about building the new golden image:
Sun 06:57:09 PDT [4147] aws-manager2.srv.releng.scl3.mozilla.com:procs age - golden AMI is CRITICAL: ELAPSED CRITICAL: 10 crit, 0 warn out of 10 processes with args ec2-golden
Sun 07:57:09 PDT [4170] aws-manager2.srv.releng.scl3.mozilla.com:procs age - golden AMI is CRITICAL: ELAPSED CRITICAL: 10 crit, 0 warn out of 10 processes with args ec2-golden
Sun 09:57:09 PDT [4172] aws-manager2.srv.releng.scl3.mozilla.com:procs age - golden AMI is OK: ELAPSED OK: 0 processes with args ec2-golden

and also a whole bunch of issues with buildapi between 05:06 and 07:09 Pacific. Those were probably from the weekly buildbot-db maintenance, which ran the orphan cleanup steps between 04:54 and 05:08. Unknown if that could have broken something in the image generation.

I've tried rebooting a spot instance and it still doesn't start buildbot, so lacking enough knowledge about the moving parts I'm going to deregister the broken golden images, save an example instance for debugging, and terminate the other instances.
* Deregistered but didn't delete snapshots for
** ami-28959742 - spot-b-2008-2016-03-20-10-01
** ami-2eec044e - spot-b-2008-2016-03-20-10-01
* waited for publish_amis cron to run
* modified moz-ready tag on b-2008-spot-070 (10.134.54.43) to for-debugging-bug1258202 to prevent cleanup (hopefully)
* deleted five instances to test fix, hoping that AWS would clean up faster than a mass terminate
* got impatient and terminated the remaining instances on the bad AMIs
* waited ....
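
The triage in the steps above (keep one instance for debugging, terminate everything else launched from the bad AMIs) can be sketched as a small planning function. This is only an illustration and makes no AWS calls: the `Instance` type, the `plan_cleanup` helper, the instance IDs, the extra hostnames and `ami-00000000` are all made up for the example; only the two bad AMI IDs, the `b-2008-spot-070` hostname and the tag value come from this bug.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instance:
    # Hypothetical minimal record; real instance metadata has many more fields.
    instance_id: str
    image_id: str
    name: str


def plan_cleanup(instances, bad_amis, keep_name, debug_tag):
    """Split instances launched from bad AMIs into 'retag' and 'terminate'.

    Returns (retag, terminate), where retag maps instance_id to the new
    moz-ready tag value and terminate is a list of instance IDs.
    Instances on other AMIs are left alone.
    """
    retag = {}
    terminate = []
    for inst in instances:
        if inst.image_id not in bad_amis:
            continue  # healthy image, nothing to do
        if inst.name == keep_name:
            retag[inst.instance_id] = debug_tag  # preserved for debugging
        else:
            terminate.append(inst.instance_id)
    return retag, terminate


# AMI IDs as listed in this bug; everything else below is invented sample data.
BAD_AMIS = {"ami-28959742", "ami-2eec044e"}
instances = [
    Instance("i-00000001", "ami-28959742", "b-2008-spot-070"),
    Instance("i-00000002", "ami-28959742", "b-2008-spot-071"),
    Instance("i-00000003", "ami-00000000", "b-2008-spot-072"),
]
retag, terminate = plan_cleanup(
    instances, BAD_AMIS, "b-2008-spot-070", "for-debugging-bug1258202"
)
```

Separating the plan from the actions makes it possible to review which instances would be retagged or terminated before anything is touched.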
Builds are running now, no more pending. Assigning to grenade for debugging/cause analysis. Please redirect as needed.
Assignee: nobody → rthijssen
Severity: blocker → major
Bug 1258714 seems to be this situation again - puppet fails to get a cert when it runs, so doesn't set up runner on the image, but the image is still published. Then instances don't do anything when they start up.
See Also: → bug 1258714
Flags: needinfo?(rthijssen)
(Assignee)

Comment 5

3 years ago
patched here: https://github.com/mozilla/build-cloud-tools/commit/d58cb3675cbc85d31138e0b60ba1e8eb6e397a6f
problem was caused by a flawed puppet success/fail check in userdata. patched and tested (corrected and working) today.
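
The linked commit is not reproduced here. As a rough illustration of what a stricter success/fail check can look like, the sketch below assumes puppet is run with `--detailed-exitcodes`, where 0 means no changes were needed, 2 means changes were applied, and 1, 4 and 6 indicate failures. The function names and the `runner_installed` flag are hypothetical and are not taken from the actual userdata.

```python
def puppet_run_succeeded(exit_code: int) -> bool:
    """Interpret a `puppet agent --detailed-exitcodes` exit status.

    0 = run succeeded, no changes; 2 = run succeeded, changes applied.
    1 = run failed; 4 = failures during the transaction;
    6 = changes applied and failures. Anything else is treated as failure.
    """
    return exit_code in (0, 2)


def should_publish_image(puppet_exit_code: int, runner_installed: bool) -> bool:
    """Hypothetical publish gate: require a clean puppet run and runner present.

    `runner_installed` stands in for whatever check confirms that runner was
    actually set up on the image; it is an assumption of this sketch.
    """
    return puppet_run_succeeded(puppet_exit_code) and runner_installed
```

With a gate like this, a puppet run that fails to get a cert (a non-success exit code) would stop the image from being published, rather than producing instances that start up and do nothing.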
Status: NEW → RESOLVED
Last Resolved: 3 years ago
Flags: needinfo?(rthijssen)
Resolution: --- → FIXED

Updated

7 months ago
Product: Release Engineering → Infrastructure & Operations