Closed Bug 1399524 Opened 8 years ago Closed 8 years ago

Building pending backlog for gecko-1-b-win2012

Categories

(Taskcluster :: Operations and Service Requests, task)

task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: pmoore, Unassigned)

Details

See https://queue.taskcluster.net/v1/pending/aws-provisioner-v1/gecko-1-b-win2012

Cause seems to be bad AMIs: https://papertrailapp.com/systems/1145185921/events?selected=844640716985651203&focus=844640716985651203

> The term 'Clear-Disk' is not recognized as the name of a cmdlet, function, script file, or operable program.

I've rolled back AMIs to known good ones, and bumped maxCapacity from 256 to 2560 for now. Backlog actually is only 200 at the moment, but it might take a while for workers to come online (up to 30 mins to notice, and then additional time for instances to be created and available). We should put this back down once the pending count starts dropping.
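A rough sketch of how "put this back down once the pending count starts dropping" could be checked from a script. It assumes the pending endpoint returns JSON with an integer `pendingTasks` field. The helper names (`pending_count`, `fetch_pending`, `is_dropping`) and the 25% threshold are hypothetical and not part of any Taskcluster tooling.

```python
import json
import urllib.request

# URL taken from the comment above; the endpoint may no longer be served.
PENDING_URL = (
    "https://queue.taskcluster.net/v1/pending/"
    "aws-provisioner-v1/gecko-1-b-win2012"
)


def pending_count(payload: str) -> int:
    """Parse a queue response body and return the pending task count.

    Assumption: the body is JSON with an integer "pendingTasks" field.
    """
    return int(json.loads(payload)["pendingTasks"])


def fetch_pending(url: str = PENDING_URL) -> int:
    """Fetch the current pending count (hypothetical helper, needs network)."""
    with urllib.request.urlopen(url, timeout=30) as response:
        return pending_count(response.read().decode("utf-8"))


def is_dropping(samples, min_drop=0.25):
    """Return True once the latest sample is well below the observed peak.

    samples: pending counts in chronological order.
    min_drop: required fractional drop from the peak (arbitrary default).
    """
    if len(samples) < 2:
        return False  # not enough history to call it a trend
    peak = max(samples)
    if peak == 0:
        return True  # nothing was ever pending
    return samples[-1] <= peak * (1 - min_drop)
```

With the numbers reported in this bug, `is_dropping([200, 85])` is true, which is the point at which maxCapacity could go back to 256.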
14:52 RyanVM are there known issues with Windows workers on Try?
14:53 RyanVM my push to Try from 50min ago still has pending Windows builds
14:53 dustin pmoore|mtg: ^^?
15:00 pmoore|mtg dustin: RyanVM: hmmmm, i'll take a look :/
15:01 Aryx thank you, see treeherder.mozilla.org/#/jobs?repo…3a98993ea95abd31bf4505b8405c2f7d16
15:03 aki-away wcosta: correct, python3. but the script doesn't have to be scriptworker.readthedocs.io/en/latest/new_instance_types.html
15:03 wcosta thanks
15:06 pmoore|mtg RyanVM: oh boy, it looks like we might have a problem indeed
15:08 pmoore|mtg grenade: do you know what this might be caused by? papertrailapp.com/systems/11451859…03&selected=844640716985651203
15:13 pmoore|mtg Clear-Disk seems to come from github.com/mozilla-releng/OpenClou…14209f61f2cd36dc28be6da1801942bf54 but that commit landed a long time ago and was working so i don't think it is at fault
15:17 bstack jhford: I don't think it's possible with the interface it exposes at the moment, but it should be possible to hack in I think?
15:18 jhford so for the provisioner work coming up, it'd be great
15:19 jhford right now, we're doing the iterations in question once per hour, so if we look at the 5m graph, we should only get a single iteration per dataset... i think? does that sound right
15:19 bstack pmoore|mtg: there were still some tc things using the scheduler when I checked last week, but I think garndt took care of them? I'll check again today when I'm actually on a computer.
15:20 pmoore|mtg bstack: no worries! if you find anything, feel free to dump it to the list of TODOs in bug 1399437
15:20 firebot bugzil.la/1399437 — NEW, nobody%mozilla.org — Sunset the scheduler
15:20 bstack Yeah, I think that's right.
15:21 bstack You can set things to interpolate differently in signalfx to get the 5-min stats to look a bit better.
15:21 garndt bstack: I can check the audit logs, I have to double check to see if docker-worker and mozilla-taskcluster are doing the right things now
15:21 garndt those were the only two things I believe
15:25 bstack I believe so, yeah.
15:25 grenade pmoore. i think (hope) it might be a fluke. the error is something we see when the ami creation instance is a dud.
15:25 grenade did you see it on more than gecko-1-b-win2012 ?
15:25 * pmoore checks
15:26 pmoore i only see it on that worker type, yes
15:27 pmoore papertrailapp.com/groups/853883/events?q=failed to clear partition table on disk 1
15:27 pmoore lol
15:27 * armenzg goes through node/npm/yarn and Heroku dance
15:27 pmoore (lol because the url didn't paste properly)
15:29 grenade i've seen the error when the os has failed to init properly. many of the built in ps functions fail with the same "no such function errors"
15:29 pmoore ah ok
15:29 pmoore i'll roll back the ami ids manually
15:29 pmoore thanks grenade!
15:29 pmoore RyanVM: rolling back amis, see above
15:29 grenade trying a manual redeploy of gecko-1-b-win2012 now
15:30 grenade and tailing the ami log to verify
15:36 pmoore grenade: RyanVM: i've reset the AMIs in the worker type definition to use the previous ones, workers should start coming online shortly (could take a while to clear the backlog though)
15:36 RyanVM ok, thanks for the update
15:37 pmoore i'll bump the maxCapacity for the next hour to something enormous
15:39 pmoore !t-rex: win2012 backlog should start clearing shortly
15:39 dustin wcosta: ^^ is that working for you now?
15:39 pmoore (gecko-1-b-win2012)
15:41 pmoore queue.taskcluster.net/v1/pending/aws-provisioner-v1/gecko-1-b-win2012
15:42 wcosta dustin: yes!!!
15:46 pmoore garndt: grenade: RyanVM: i've created bug 1399524
(In reply to Pete Moore [:pmoore][:pete] from comment #0)
> We should put this back down once the pending count starts dropping.

I've put it back to 256. Pending count has dropped from 200 to 85.
problem was caused by a dud ami run. no problems with the changes which triggered the run. this happens every once in a while and i've not determined why. resolution is to simply redeploy the workertype through occ with an empty commit and deploy syntax. i've also created a papertrail alert (https://papertrailapp.com/searches/25864561/edit) so that we can preempt failure of the workertype before the problem is propagated to spot instances.
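The query behind the papertrail alert isn't given in this bug. As an illustration only, a check of the same kind could scan AMI creation log lines for the PowerShell "not recognized as the name of a cmdlet" error quoted in comment 0. The function name `missing_cmdlets` is hypothetical, and the pattern is derived from that one quoted message.

```python
import re

# Pattern based on the error text quoted in comment 0. The real alert's
# search query is not shown in this bug, so this is only an approximation.
CMDLET_ERROR = re.compile(
    r"The term '(?P<name>[^']+)' is not recognized as the name of a cmdlet"
)


def missing_cmdlets(lines):
    """Return the distinct cmdlet names reported as unrecognized, in order seen."""
    found = []
    for line in lines:
        match = CMDLET_ERROR.search(line)
        if match and match.group("name") not in found:
            found.append(match.group("name"))
    return found
```

A non-empty result during an AMI build would be a hint that the build instance was a dud and that the worker type should be redeployed before the AMI reaches spot instances.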
redeploy without changes was successful: https://github.com/mozilla-releng/OpenCloudConfig/commit/ed4e0c70e09c5cfe2a6a1dba7b031cb8d21ead0b

gecko-1-b-win2012 is taking jobs normally with new amis.
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
(In reply to Rob Thijssen (:grenade - UTC+3) from comment #3)
> i've also created a papertrail alert
> (https://papertrailapp.com/searches/25864561/edit) so that we can preempt
> failure of the workertype before the problem is propagated to spot instances.

Thanks Rob, great idea! ++++
Component: Operations → Operations and Service Requests