Closed
Bug 1258202
Opened 9 years ago
Closed 9 years ago
Trees closed, Windows builds not starting on non-Try branches
Categories
(Infrastructure & Operations Graveyard :: CIDuty, task)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: philor, Assigned: grenade)
Don't have enough pushes to give a clear start time, but judging by the pending fuzzing and l10n nightlies, it started sometime around 04:30.
Wouldn't that be New AMI Time?
All non-try non-b2g trees are closed.
Comment 1•9 years ago
There are ~200 instances running on the newest AMIs (spot-b-2008-2016-03-20-10-01, us-east-1: ami-28959742, us-west-2: ami-2eec044e), but the few I checked don't have a cmd.exe open, so neither runner nor buildbot is running on them.
There were some nagios alerts about building the new golden image:
Sun 06:57:09 PDT [4147] aws-manager2.srv.releng.scl3.mozilla.com:procs age - golden AMI is CRITICAL: ELAPSED CRITICAL: 10 crit, 0 warn out of 10 processes with args ec2-golden
Sun 07:57:09 PDT [4170] aws-manager2.srv.releng.scl3.mozilla.com:procs age - golden AMI is CRITICAL: ELAPSED CRITICAL: 10 crit, 0 warn out of 10 processes with args ec2-golden
Sun 09:57:09 PDT [4172] aws-manager2.srv.releng.scl3.mozilla.com:procs age - golden AMI is OK: ELAPSED OK: 0 processes with args ec2-golden
and also a whole bunch of issues with buildapi between 05:06 and 07:09 Pacific. Those were probably from the weekly buildbot-db maintenance, which ran the orphan cleanup steps between 04:54 and 05:08. Unknown if that could have broken something in the image generation.
I've tried rebooting a spot instance and it still doesn't start buildbot, so lacking enough knowledge about the moving parts I'm going to deregister the broken golden images, save an example instance for debugging, and terminate the other instances.
Comment 2•9 years ago
* Deregistered but didn't delete snapshots for
** ami-28959742 - spot-b-2008-2016-03-20-10-01
** ami-2eec044e - spot-b-2008-2016-03-20-10-01
* waited for publish_amis cron to run
* modified moz-ready tag on b-2008-spot-070 (10.134.54.43) to for-debugging-bug1258202 to prevent cleanup (hopefully)
* deleted five instances to test fix, hoping that AWS would clean up faster than a mass terminate
* got impatient and terminated the remaining instances on the bad AMIs
* waited ....
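The steps above could be scripted roughly as follows. This is only a sketch, not what was actually run: it assumes an EC2 client with boto3-style methods (`deregister_image`, `describe_instances`, `create_tags`, `terminate_instances`), and the function name and parameters are made up for illustration. It ignores pagination and does not wait for the publish_amis cron.

```python
def quarantine_bad_ami(ec2, image_id, keep_instance_id, debug_tag_value):
    """Deregister a bad AMI, protect one instance for debugging,
    and terminate every other instance launched from that AMI.

    `ec2` is assumed to be a boto3-style EC2 client for one region.
    Returns the list of instance IDs that were asked to terminate.
    """
    # Deregistering the image does not delete its backing snapshots,
    # so they stay available for later debugging.
    ec2.deregister_image(ImageId=image_id)

    # Find live instances that were launched from the bad image.
    response = ec2.describe_instances(
        Filters=[
            {"Name": "image-id", "Values": [image_id]},
            {"Name": "instance-state-name", "Values": ["pending", "running"]},
        ]
    )
    instance_ids = [
        instance["InstanceId"]
        for reservation in response["Reservations"]
        for instance in reservation["Instances"]
    ]

    # Retag the example instance first, so cleanup jobs keyed on the
    # moz-ready tag (hopefully) leave it alone.
    ec2.create_tags(
        Resources=[keep_instance_id],
        Tags=[{"Key": "moz-ready", "Value": debug_tag_value}],
    )

    # Terminate everything else on the bad AMI.
    doomed = [i for i in instance_ids if i != keep_instance_id]
    if doomed:
        ec2.terminate_instances(InstanceIds=doomed)
    return doomed
```

With boto3 the client would come from `boto3.client("ec2", region_name="us-east-1")`, and the function would be called once per region, since each region has its own AMI ID.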
Comment 3•9 years ago
Builds are running now, no more pending. Assigning to grenade for debugging/cause analysis. Please redirect as needed.
Assignee: nobody → rthijssen
Severity: blocker → major
Comment 4•9 years ago
Bug 1258714 seems to be this situation again - puppet fails to get a cert when it runs, so doesn't set up runner on the image, but the image is still published. Then instances don't do anything when they start up.
See Also: → 1258714
Updated•9 years ago
Flags: needinfo?(rthijssen)
Comment 5•9 years ago (Assignee)
patched here: https://github.com/mozilla/build-cloud-tools/commit/d58cb3675cbc85d31138e0b60ba1e8eb6e397a6f
problem was caused by a flawed puppet success/fail check in userdata. patched and tested (corrected and working) today.
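For anyone hitting something similar, here is a minimal sketch of a stricter success check. It does not reproduce the linked patch. It assumes puppet is invoked with detailed exit codes (as `puppet agent --test` does), where 0 means no changes, 2 means changes applied, and 1, 4 and 6 indicate a failed run or failed resources. The `runner_installed` flag and the function names are hypothetical.

```python
# Exit codes that mean the puppet run finished without failures
# when detailed exit codes are enabled:
#   0 = succeeded, nothing changed
#   2 = succeeded, some resources changed
# Anything else (1, 4, 6, ...) means the run or some resources failed.
PUPPET_OK_CODES = frozenset({0, 2})


def puppet_run_succeeded(exit_code):
    """Return True only for exit codes that signal a clean puppet run."""
    return exit_code in PUPPET_OK_CODES


def next_step(exit_code, runner_installed):
    """Decide whether a freshly built image may be published.

    `runner_installed` is a hypothetical extra check (for example, that
    the runner files exist on disk) so that a run which reports success
    but did not set up runner still blocks publishing.
    """
    if puppet_run_succeeded(exit_code) and runner_installed:
        return "publish"
    return "abort"
```

The point of the second condition is to fail closed: an image is only published when puppet reported success and the thing puppet was supposed to install is actually there.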
Status: NEW → RESOLVED
Closed: 9 years ago
Flags: needinfo?(rthijssen)
Resolution: --- → FIXED
Updated•7 years ago
Product: Release Engineering → Infrastructure & Operations
Updated•5 years ago
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard