Closed
Bug 1399524
Opened 8 years ago
Closed 8 years ago
Building pending backlog for gecko-1-b-win2012
Categories
(Taskcluster :: Operations and Service Requests, task)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: pmoore, Unassigned)
Details
See https://queue.taskcluster.net/v1/pending/aws-provisioner-v1/gecko-1-b-win2012
Cause seems to be bad AMIs:
https://papertrailapp.com/systems/1145185921/events?selected=844640716985651203&focus=844640716985651203
> The term 'Clear-Disk' is not recognized as the name of a cmdlet, function, script file, or operable program.
I've rolled back AMIs to known good ones, and bumped maxCapacity from 256 to 2560 for now.
Backlog actually is only 200 at the moment, but it might take a while for workers to come online (up to 30 mins to notice, and then additional time for instances to be created and available).
We should put this back down once the pending count starts dropping.
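As a rough illustration of the "put this back down once the pending count starts dropping" step, the sketch below parses a response body from the pending endpoint linked above and decides whether the temporary cap can be restored. The `pendingTasks` field name, the helper names, and the `min_drop` threshold are assumptions made for illustration; none of them come from this bug.

```python
from __future__ import annotations

import json

NORMAL_MAX_CAPACITY = 256      # maxCapacity before the bump, per this bug
TEMPORARY_MAX_CAPACITY = 2560  # maxCapacity after the bump, per this bug


def pending_count(response_body: str) -> int:
    """Extract the pending task count from a queue response body.

    Assumption: the body is JSON with an integer `pendingTasks` field.
    """
    return int(json.loads(response_body)["pendingTasks"])


def should_restore_capacity(samples: list[int], min_drop: int = 50) -> bool:
    """Return True once the latest sample is at least `min_drop` below the peak.

    `samples` is the series of pending counts observed so far, oldest first.
    The threshold is a hypothetical choice, not a value from this bug.
    """
    if len(samples) < 2:
        return False
    return max(samples) - samples[-1] >= min_drop


def target_max_capacity(samples: list[int]) -> int:
    """Pick which maxCapacity value the worker type should carry right now."""
    if should_restore_capacity(samples):
        return NORMAL_MAX_CAPACITY
    return TEMPORARY_MAX_CAPACITY
```

Fetching the endpoint is left out on purpose so the decision logic stays testable offline; the caller would pass in whatever bodies it has collected.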
Comment 1•8 years ago (Reporter)
14:52 RyanVM are there known issues with Windows workers on Try?
14:53 RyanVM my push to Try from 50min ago still has pending Windows builds
14:53 dustin pmoore|mtg: ^^?
15:00 pmoore|mtg dustin: RyanVM: hmmmm, i'll take a look :/
15:01 Aryx thank you, see treeherder.mozilla.org/#/jobs?repo…3a98993ea95abd31bf4505b8405c2f7d16
15:03 aki-away wcosta: correct, python3. but the script doesn't have to be scriptworker.readthedocs.io/en/latest/new_instance_types.html
15:03 wcosta thanks
15:06 pmoore|mtg RyanVM: oh boy, it looks like we might have a problem indeed
15:08 pmoore|mtg grenade: do you know what this might be caused by? papertrailapp.com/systems/11451859…03&selected=844640716985651203
15:13 pmoore|mtg Clear-Disk seems to come from github.com/mozilla-releng/OpenClou…14209f61f2cd36dc28be6da1801942bf54 but that commit landed a long time ago and was working so i don't think it is at fault
15:17 bstack jhford: I don't think it's possible with the interface it exposes at the moment, but it should be possible to hack in I think?
15:18 jhford so for the provisioner work coming up, it'd be great
15:19 jhford right now, we're doing the iterations in question once per hour, so if we look at the 5m graph, we should only get a single iteration per dataset... i think? does that sound right
15:19 bstack pmoore|mtg: there were still some tc things using the scheduler when I checked last week, but I think garndt took care of them? I'll check again today when I'm actually on a computer.
15:20 pmoore|mtg bstack: no worries! if you find anything, feel free to dump it to the list of TODOs in bug 1399437
15:20 firebot bugzil.la/1399437 — NEW, nobody%mozilla.org — Sunset the scheduler
15:20 bstack Yeah, I think that's right.
15:21 bstack You can set things to interpolate differently in signalfx to get the 5-min stats to look a bit better.
15:21 garndt bstack: I can check the audit logs, I have to double check to see if docker-worker and mozilla-taskcluster are doing the right things now
15:21 garndt those were the only two things I believe
15:25 bstack I believe so, yeah.
15:25 grenade pmoore. i think (hope) it might be a fluke. the error is something we see when the ami creation instance is a dud.
15:25 grenade did you see it on more than gecko-1-b-win2012 ?
15:25 * pmoore checks
15:26 pmoore i only see it on that worker type, yes
15:27 pmoore papertrailapp.com/groups/853883/events?q=failed to clear partition table on disk 1
15:27 pmoore lol
15:27 * armenzg goes through node/npm/yarn and Heroku dance
15:27 pmoore (lol because the url didn't paste properly)
15:29 grenade i've seen the error when the os has failed to init properly. many of the built in ps functions fail with the same "no such function errors"
15:29 pmoore ah ok
15:29 pmoore i'll roll back the ami ids manually
15:29 pmoore thanks grenade!
15:29 pmoore RyanVM: rolling back amis, see above
15:29 grenade trying a manual redeploy of gecko-1-b-win2012 now
15:30 grenade and tailing the ami log to verify
15:36 pmoore grenade: RyanVM: i've reset the AMIs in the worker type definition to use the previous ones, workers should start coming online shortly (could take a while to clear the backlog though)
15:36 RyanVM ok, thanks for the update
15:37 pmoore i'll bump the maxCapacity for the next hour to something enormous
15:39 pmoore !t-rex: win2012 backlog should start clearing shortly
15:39 dustin wcosta: ^^ is that working for you now?
15:39 pmoore (gecko-1-b-win2012)
15:41 pmoore queue.taskcluster.net/v1/pending/aws-provisioner-v1/gecko-1-b-win2012
15:42 wcosta dustin: yes!!!
15:46 pmoore garndt: grenade: RyanVM: i've created bug 1399524
Comment 2•8 years ago (Reporter)
(In reply to Pete Moore [:pmoore][:pete] from comment #0)
> We should put this back down once the pending count starts dropping.
I've put it back to 256.
Pending count has dropped from 200 to 85.
Comment 3•8 years ago
problem was caused by a dud ami run. no problems with the changes which triggered the run. this happens every once in a while and i've not determined why. resolution is to simply redeploy the workertype through occ with an empty commit and deploy syntax. i've also created a papertrail alert (https://papertrailapp.com/searches/25864561/edit) so that we can preempt failure of the workertype before the problem is propagated to spot instances.
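A minimal sketch of the kind of check such an alert could perform, assuming it keys on the PowerShell "not recognized" error quoted in the description. The actual Papertrail search query is not shown in this bug, so the pattern and the function name below are hypothetical.

```python
import re

# Hypothetical pattern, modelled on the error text quoted in the description.
UNRECOGNIZED_CMDLET = re.compile(
    r"The term '(?P<name>[^']+)' is not recognized as the name of a cmdlet"
)


def find_missing_cmdlets(log_lines):
    """Return cmdlet names reported as unrecognized, in first-seen order.

    A non-empty result during AMI creation would suggest a dud run, which
    is the condition the alert is meant to catch before the AMI is used.
    """
    seen = []
    for line in log_lines:
        match = UNRECOGNIZED_CMDLET.search(line)
        if match and match.group("name") not in seen:
            seen.append(match.group("name"))
    return seen
```

Returning the names rather than a boolean makes it easier to tell a one-off missing cmdlet from a wholesale failure of the built-in functions.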
Comment 4•8 years ago
redeploy without changes was successful:
https://github.com/mozilla-releng/OpenCloudConfig/commit/ed4e0c70e09c5cfe2a6a1dba7b031cb8d21ead0b
gecko-1-b-win2012 is taking jobs normally with new amis.
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Comment 5•8 years ago (Reporter)
(In reply to Rob Thijssen (:grenade - UTC+3) from comment #3)
> i've also created a papertrail alert
> (https://papertrailapp.com/searches/25864561/edit) so that we can preempt
> failure of the workertype before the problem is propagated to spot instances.
Thanks Rob, great idea! ++++
Updated•6 years ago (Assignee)
Component: Operations → Operations and Service Requests