Closed
Bug 1177315
Opened 9 years ago
Closed 9 years ago
gecko-decision tasks not running
Categories
(Taskcluster :: General, defect)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: philor, Unassigned)
References
Details
Attachments
(1 file)
https://treeherder.mozilla.org/#/jobs?repo=b2g-inbound&revision=d0822eec251e&exclusion_profile=false and https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&revision=a195a8127e8d&exclusion_profile=false show pending gecko-decision tasks; the pushes above them don't even show a pending one, and thus of course none of the resulting builds or tests. All trunk trees are closed as a result.
Comment 1•9 years ago
It appears that the instances were not getting provisioned. As of 11:01 PDT instances were being started and the pending backlog was being completed. NI? jhford on this bug so we can take a look at what might have been the cause.
Flags: needinfo?(jhford)
Comment 2•9 years ago
Trees reopened at 0:18am Pacific.
Comment 3•9 years ago
I can't find anything in the logs that points to a cause. The timeline we discussed on IRC doesn't match up with anything happening in the provisioner, so I'm not entirely sure what happened.
Flags: needinfo?(jhford)
Comment 4•9 years ago
Marking as resolved -> fixed since we're back in action.
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Comment 5•9 years ago
Reopening this because the same issue appeared again.

Provisioning iteration [1] started at 18:39:09 GMT and never logged a success. This was the last provisioning iteration logged until the provisioner was restarted at 21:18:08 GMT.

I'm not sure if this matters at all; it seems to happen often, so probably not an issue:

Jul 01 13:40:05 taskcluster-aws-provisioner2 app/provisioner.1: Wed, 01 Jul 2015 18:40:04 GMT azure-agent Destroying socket after idle timeout

I noticed a log of:

[alert-operator] this userData (undefined) is garbage

but was told that is not an issue (although we should fix these "undefined" pieces to reduce the noise).

I haven't found anything definitive in the logs yet. Perhaps there is some exception state that is getting eaten up and thus not logged?

[1] https://papertrailapp.com/systems/taskcluster-aws-provisioner2/events?r=552971382267621389-552971703861686287
Status: RESOLVED → REOPENED
Flags: needinfo?(jhford)
Resolution: FIXED → ---
Comment 6•9 years ago
Adding this PR that John opened against the provisioner so we have visibility into the fix.
Attachment #8628924 - Flags: review?(jopsen)
Comment 7•9 years ago
Comment on attachment 8628924 [details] [review]
PR 30

I strongly advise that this is cleaned up... If there is an argument for crashing on an absolute error count, this has an r+. Otherwise, I suggest we only crash on errors happening in a row.
Attachment #8628924 - Flags: review?(jopsen) → review+
Comment 8•9 years ago
My patches went live today.

(In reply to Greg Arndt [:garndt] from comment #5)
> I'm not sure if this matters at all, seems to happen often so probably not
> an issue:
> Jul 01 13:40:05 taskcluster-aws-provisioner2 app/provisioner.1: Wed, 01 Jul
> 2015 18:40:04 GMT azure-agent Destroying socket after idle timeout

Nope, not important that I know of.

> I noticed a log of:
> [alert-operator] this userData (undefined) is garbage
>
> But was told that is not an issue (although we should fix these "undefined"
> pieces. to reduce the noise.

Actually, I just changed the logging slightly. The garbage message was in case we had something there that wasn't parsable as base64(json()). I wish the migration part of base.Entity had an option to force updating the migration in the stored format, so that I could scan the table and stop seeing the migration stuff run all the time.

> I haven't found anything definitive in the logs yet. Perhaps there is some
> exception state that is getting eaten up and thus not logged?

That's most certainly the case here. Async/await works like synchronous code in a lot of ways; the most important way it doesn't is that unhandled rejected promises are dropped on the floor.
Flags: needinfo?(jhford)
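John's point about async/await is the crux of the silent failure: in Node.js, a promise that rejects with no handler attached simply disappears, so a failing loop can stop without any log entry. A minimal sketch of the difference (hypothetical names, a simplified stand-in for the provisioner's loop, not its actual code):

```javascript
// Stand-in for one provisioning iteration that fails, e.g. on an AWS API error.
async function provisionIteration() {
  throw new Error('AWS API call failed');
}

// Without the .catch(), this rejection would be dropped on the floor: no log,
// no crash, and nothing ever observes that the iteration died.
async function runOnce() {
  let outcome = 'never-ran';
  await provisionIteration()
    .then(() => { outcome = 'ok'; })
    .catch(err => { outcome = 'failed: ' + err.message; });
  return outcome;
}

runOnce().then(outcome => console.log(outcome));
// prints "failed: AWS API call failed" -- the error is observable instead of vanishing
```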
Reporter
Comment 9•9 years ago
Pending but not being run for the last hour or so, trunk trees closed.
Comment 10•9 years ago
@jhford,
Did you deploy your fix above?

If so I don't think it worked; I just had to restart the provisioner manually again. According to heroku it had been running for 12 hours, and was down for approximately 3 hours.

Logs from before death: https://gist.github.com/jonasfj/82dbc8a6baeb827e9e8f
I think it's safe to share, as security tokens should expire -- I hope they already have :)
Comment 11•9 years ago
(In reply to Jonas Finnemann Jensen (:jonasfj) from comment #10)
> @jhford,
> Did you deploy your fix above?
>
> If so I don't think it worked, I just had to restart the provisioner
> manually again.
> According to heroku it had been running for 12 hours, as down for
> approximately 3 hours.
> Logs from before death: https://gist.github.com/jonasfj/82dbc8a6baeb827e9e8f
> I think it's safe to share, as security tokens should expire -- I hope they
> already have :)

I did... I tested that if a provisioning iteration lasted too long, or the gap between iterations was too long, the provisioner would restart. I'm completely confused by what's happening here. I've been going over the code looking for any possible source for this.
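The restart-on-stall logic described here can be sketched as a watchdog timer (hypothetical helper names, not the provisioner's actual implementation): each iteration re-arms the timer, and if either the iteration or the gap between iterations exceeds the limit, the expiry callback fires so the process can exit and be restarted by Heroku.

```javascript
// Watchdog sketch: arm() starts (or restarts) a countdown; if it is not
// re-armed or disarmed before `ms` elapses, onExpired fires.
function makeWatchdog(onExpired) {
  let timer = null;
  return {
    arm(ms) {
      clearTimeout(timer);
      timer = setTimeout(onExpired, ms);
    },
    disarm() {
      clearTimeout(timer);
    },
  };
}

// Usage sketch: in the real service the callback would process.exit(1) so the
// dyno restarts with a clean slate; here it only logs.
const watchdog = makeWatchdog(() => console.log('iteration stalled; restarting'));
watchdog.arm(50);                          // an iteration started
setTimeout(() => watchdog.disarm(), 10);   // it finished in time: no restart
```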
Comment 12•9 years ago
(In reply to Phil Ringnalda (:philor) from comment #9)
> Pending but not being run for the last hour or so, trunk trees closed.

We still aren't seeing any decision tasks since before midnight PT. All TC-running trees closed.
Flags: needinfo?(jhford)
Flags: needinfo?(garndt)
Comment 13•9 years ago
Also, we *really* need a better way of being notified of such things instead of just waiting for somebody to happen to notice the backlog when looking at Treeherder. Is nagios monitoring a possibility here?
Comment 14•9 years ago
Expanding needinfos on the hope that someone's around to look into this tree-closing issue soonish.
Flags: needinfo?(pmoore)
Flags: needinfo?(jopsen)
Comment 15•9 years ago
(In reply to Ryan VanderMeulen [:RyanVM UTC-4] from comment #14)
> Expanding needinfos on the hope that someone's around to look into this
> tree-closing issue soonish.

Yes, working with the guys in #taskcluster:

05:17 <&jhford> Tomcat|sheriffduty: Wed, 08 Jul 2015 12:16:17 GMT aws-provisioner:provision gecko-decision: 10 running capacity, 0 pending capacity and 0 pending tasks
05:17 <&jhford> Tomcat|sheriffduty: this is not the provisioner dying this time
Comment 16•9 years ago
I don't know what's causing this, but I am certain that it's not a lack of nodes caused by the provisioner. There are 10 gecko-decision workers booted right now.
Comment 17•9 years ago
Oh, I think this is a Queue issue... let me kick the Queue.
Flags: needinfo?(jhford)
Comment 18•9 years ago
(In reply to John Ford [:jhford] -- please use 'needinfo?' instead of a CC from comment #17)
> Oh, i think this is a Queue issue... let me kick the Queue

Nope, this isn't a queue issue anymore. I just submitted a 'cli' workertype job using the tools.taskcluster.net interface and found that:

> q.pendingTasks('aws-provisioner-v1', 'cli').then(console.log, console.log);
taskcluster-client Calling: pendingTasks, retry: 0 +4m
{ then: [Function] }
> taskcluster-client Success calling: pendingTasks, (0 retries) +632ms
{ provisionerId: 'aws-provisioner-v1',
  workerType: 'cli',
  pendingTasks: 1 }

And the provisioner is doing things correctly:

2015-07-08T12:48:25.500684+00:00 app[provisioner.1]: Wed, 08 Jul 2015 12:48:25 GMT aws-provisioner:provision cli: 0 running capacity, 1 pending capacity and 1 pending tasks

Which tells me that what happened here is likely in mozilla-taskcluster.
Comment 19•9 years ago
My CLI job just completed successfully. Either the gecko-decision workerType is busted or mozilla-taskcluster is not submitting tasks/graphs.
Comment 20•9 years ago
We are in the TaskClusterPlatform vidyo room going through this. Seems to be a problem with mozilla-taskcluster.
Flags: needinfo?(pmoore)
Comment 21•9 years ago
We've established that mozilla-taskcluster is having socket failures when talking to redis cloud, e.g.:

Jul 08 15:27:15 mozilla-taskcluster app/worker.1: Wed, 08 Jul 2015 13:27:15 GMT config Loading additional configs [ 'production-treeherder-proxy.yml' ]
Jul 08 15:27:15 mozilla-taskcluster app/worker.1: Wed, 08 Jul 2015 13:27:15 GMT config skip config /app/production-treeherder-proxy.yml
Jul 08 15:27:15 mozilla-taskcluster app/worker.1: Wed, 08 Jul 2015 13:27:15 GMT taskcluster-proxy:db creating index id { w: 'majority', unique: true }
Jul 08 15:27:15 mozilla-taskcluster app/worker.1: Wed, 08 Jul 2015 13:27:15 GMT taskcluster-proxy:db Initialize collection config
Jul 08 15:27:15 mozilla-taskcluster app/worker.1: Wed, 08 Jul 2015 13:27:15 GMT config fetching document production
Jul 08 15:27:15 mozilla-taskcluster app/worker.1: Wed, 08 Jul 2015 13:27:15 GMT taskcluster-proxy:db Initialize collection repositories
Jul 08 15:27:15 mozilla-taskcluster app/worker.1: Wed, 08 Jul 2015 13:27:15 GMT taskcluster-proxy:db creating index id { w: 'majority', unique: true }
Jul 08 15:27:15 mozilla-taskcluster app/worker.1: Wed, 08 Jul 2015 13:27:15 GMT taskcluster-proxy:db creating index url { w: 'majority', unique: true }
Jul 08 15:27:16 mozilla-taskcluster app/worker.1: events.js:85
Jul 08 15:27:16 mozilla-taskcluster app/worker.1:       throw er; // Unhandled 'error' event
Jul 08 15:27:16 mozilla-taskcluster app/worker.1:             ^
Jul 08 15:27:16 mozilla-taskcluster app/worker.1: Error: Redis connection to pub-redis-19466.us-east-1-2.1.ec2.garantiadata.com:19466 failed - write EPIPE
Jul 08 15:27:16 mozilla-taskcluster app/worker.1:     at RedisClient.on_error (/app/node_modules/kue/node_modules/redis/index.js:196:24)
Jul 08 15:27:16 mozilla-taskcluster app/worker.1:     at Socket.<anonymous> (/app/node_modules/kue/node_modules/redis/index.js:106:14)
Jul 08 15:27:16 mozilla-taskcluster app/worker.1:     at Socket.emit (events.js:107:17)
Jul 08 15:27:16 mozilla-taskcluster app/worker.1:     at onwriteError (_stream_writable.js:317:10)
Jul 08 15:27:16 mozilla-taskcluster app/worker.1:     at onwrite (_stream_writable.js:335:5)
Jul 08 15:27:16 mozilla-taskcluster app/worker.1:     at WritableState.onwrite (_stream_writable.js:105:5)
Jul 08 15:27:16 mozilla-taskcluster app/worker.1: <starting>
Jul 08 15:27:16 mozilla-taskcluster app/worker.1:     at fireErrorCallbacks (net.js:456:13)
Jul 08 15:27:16 mozilla-taskcluster app/worker.1:     at Socket._destroy (net.js:495:3)
Jul 08 15:27:16 mozilla-taskcluster app/worker.1:     at WriteWrap.afterWrite (net.js:777:10)

https://github.com/taskcluster/mozilla-taskcluster/ is a heroku app written in node.js, using https://www.npmjs.com/package/kue#redis-connection-settings to store data in redis, for which we are using https://redislabs.com/redis-cloud as a provider.

It seems that for some reason writing to redis is causing problems. Greg (:garndt) has been able to write to the redis cloud account from his laptop, and redis cloud reports that there is ample storage available, so we suspected it might be an issue related to the redis client being used. Greg has upgraded the version of kue being used. We've also restarted services.

At the moment we are not sure what is causing the problem, and do not have an ETA on resolution.
Flags: needinfo?(garndt)
Comment 22•9 years ago
We continue to discuss this topic in TaskClusterPlatform vidyo room, if anyone wishes to join.
Comment 23•9 years ago
It appears that both decision tasks and retriggers are working now.

So far this is my understanding of our setup: there are two instances of 'mozilla-taskcluster', which is our bridge between pushes, treeherder, and tasks. It monitors the push log, and when a new push is detected a job is created on a job queue backed by redis. There are workers that pick up this work and create task graphs.
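The flow Greg describes -- a pushlog poller enqueues a job, a worker consumes it and creates a task graph -- can be sketched with a tiny in-memory queue. The real bridge uses kue backed by Redis; every name below is illustrative only:

```javascript
// Minimal in-memory job queue standing in for the Redis-backed kue queue.
function makeQueue() {
  const jobs = [];
  const workers = [];
  function drain() {
    // Hand each queued job to a worker as soon as both exist.
    while (jobs.length && workers.length) {
      workers[0](jobs.shift());
    }
  }
  return {
    enqueue(job) { jobs.push(job); drain(); },   // pushlog-poller side
    process(fn)  { workers.push(fn); drain(); }, // worker side
  };
}

const queue = makeQueue();
const graphs = [];

// Worker: turn each push job into a (pretend) task-graph submission.
queue.process(push => graphs.push('graph-for-' + push.revision));

// Poller: a new push is detected and queued.
queue.enqueue({ repo: 'b2g-inbound', revision: 'd0822eec251e' });

console.log(graphs); // prints [ 'graph-for-d0822eec251e' ]
```

The design point this models is the decoupling: if the queue's backing store dies (as Redis did here), pushes stop turning into task graphs even though both the poller and the workers are healthy.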
Comment 24•9 years ago
Whoops, hit enter too quickly.

Earlier in the week the worker on the production mozilla-taskcluster stopped working, which is believed to be the cause of retriggers no longer working. This morning the redis store used by mozilla-taskcluster-staging stopped with the same error, and that is when all decision tasks stopped being created as well.

A support request was opened with Rediscloud (the redis addon we use in heroku), but they have not found the cause of the issues yet. A new heroku-redis addon was deployed as a temporary solution and appears to have solved both the retrigger issue and gecko decision tasks not being scheduled.

More investigation is needed to understand why mozilla-taskcluster-staging is the app responsible for scheduling task graphs, but for now the issue is fixed. Also, the heroku-redis addon is just a temporary solution: either a higher-specced version needs to be deployed during a low-use period (all existing jobs will be lost) or we move back to rediscloud (which still results in job loss during that time). Hopefully the heroku-redis addon will be sufficient until the weekend, when it can either be upgraded or the redis cloud issue is resolved.
Comment 25•9 years ago
@garndt, there is no reason why mozilla-taskcluster-staging is the app for creating task-graphs rather than mozilla-taskcluster; it's just how it is...

IMO, mozilla-taskcluster suffers from buzzword-overload and should be rearchitected, preferably into 2-3 different components handling different concerns. And none of them should use mongodb or redis for storage or queuing -- these things will eventually run out of memory :)

----

Note, this issue was unrelated to the previous discussion in this bug. I suspect what we can take from this is that we need monitoring.
Flags: needinfo?(jopsen)
Comment 26•9 years ago
All trees reopened.
Comment 27•9 years ago
So far the memory consumption for the heroku-redis instances used by mozilla-taskcluster and mozilla-taskcluster-staging is around 1MB each. Now that trees are opened and heroku redis seems to be working, we can close this bug.

Also, bug 1181742 has been opened to debate various ways we could implement monitoring of the taskcluster infrastructure.
Updated•9 years ago
Status: REOPENED → RESOLVED
Closed: 9 years ago → 9 years ago
Resolution: --- → FIXED
Updated•9 years ago
status-firefox41: affected → ---
Component: TaskCluster → General
Product: Testing → Taskcluster
Target Milestone: --- → mozilla41
Comment 28•9 years ago
Resetting Version and Target Milestone that accidentally got changed...
Target Milestone: mozilla41 → ---
Version: Trunk → unspecified