Closed Bug 1177315 Opened 9 years ago Closed 9 years ago

gecko-decision tasks not running

Categories

(Taskcluster :: General, defect)

Type: defect
Priority: Not set
Severity: blocker

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: philor, Unassigned)

References

Details

Attachments

(1 file)

54 bytes, text/x-github-pull-request
jonasfj: review+
https://treeherder.mozilla.org/#/jobs?repo=b2g-inbound&revision=d0822eec251e&exclusion_profile=false and https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&revision=a195a8127e8d&exclusion_profile=false show pending gecko-decision tasks; pushes above them don't even show a pending one, and thus of course none of the resulting builds or tests run either.

All trunk trees are closed as a result.
It appears that the instances were not getting provisioned.  As of 11:01 PDT, instances were being started and the pending backlog was being completed.  NI? jhford on this bug so we can take a look at what the cause might have been.
Flags: needinfo?(jhford)
Trees reopened at 0:18am Pacific.
I can't find anything in the logs that points to a cause.  The timeline we discussed on IRC doesn't match up with anything happening in the provisioner, so I'm not entirely sure what happened.
Flags: needinfo?(jhford)
Marking as resolved -> fixed since we're back in action.
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Reopening this because the same issue appeared again.  

A provisioning iteration [1] started at 18:39:09 GMT and never logged a success.  It was the last provisioning iteration logged until the provisioner was restarted at 21:18:08 GMT.

I'm not sure if this matters at all; it seems to happen often, so probably not an issue:
 Jul 01 13:40:05 taskcluster-aws-provisioner2 app/provisioner.1: Wed, 01 Jul 2015 18:40:04 GMT azure-agent Destroying socket after idle timeout 

I noticed a log of:
[alert-operator] this userData (undefined) is garbage  

But was told that is not an issue (although we should fix these "undefined" pieces to reduce the noise).

I haven't found anything definitive in the logs yet.  Perhaps there is some exception state that is getting eaten up and thus not logged?


[1] https://papertrailapp.com/systems/taskcluster-aws-provisioner2/events?r=552971382267621389-552971703861686287
Status: RESOLVED → REOPENED
Flags: needinfo?(jhford)
Resolution: FIXED → ---
Attached file PR 30
Adding this PR that John opened against the provisioner so we have visibility into the fix.
Attachment #8628924 - Flags: review?(jopsen)
Comment on attachment 8628924 [details] [review]
PR 30

I strongly advise that this be cleaned up...

If there is an argument for crashing on an absolute error count, this has an r+.
Otherwise, I suggest we only crash on errors that happen in a row.
Attachment #8628924 - Flags: review?(jopsen) → review+
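For context, a minimal sketch of the "only crash on errors happening in a row" idea suggested in the review; runIteration(), the threshold, and the loop structure are all hypothetical stand-ins, not the contents of PR 30:

// Hypothetical sketch: restart only after several failed iterations in a row.
// runIteration() and MAX_CONSECUTIVE_ERRORS are illustrative names.
async function runIteration() {
  // stand-in for one provisioning pass
}

const MAX_CONSECUTIVE_ERRORS = 5;

async function provisionLoop() {
  let consecutiveErrors = 0;
  while (true) {
    try {
      await runIteration();
      consecutiveErrors = 0;                 // any success resets the count
    } catch (err) {
      consecutiveErrors += 1;
      console.error('iteration failed (%d in a row)', consecutiveErrors, err);
      if (consecutiveErrors >= MAX_CONSECUTIVE_ERRORS) {
        process.exit(1);                     // let heroku restart the dyno
      }
    }
    await new Promise(resolve => setTimeout(resolve, 60 * 1000));
  }
}

provisionLoop().catch(err => {
  console.error('provision loop died:', err);
  process.exit(1);
});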
My patches went live today.


(In reply to Greg Arndt [:garndt] from comment #5) 
> I'm not sure if this matters at all; it seems to happen often, so probably
> not an issue:
>  Jul 01 13:40:05 taskcluster-aws-provisioner2 app/provisioner.1: Wed, 01 Jul
> 2015 18:40:04 GMT azure-agent Destroying socket after idle timeout 

nope, not important that I know of

> I noticed a log of:
> [alert-operator] this userData (undefined) is garbage  
> 
> But was told that is not an issue (although we should fix these "undefined"
> pieces to reduce the noise).

Actually, I just changed the logging slightly.  The garbage message was in case we had something there that wasn't parsable as base64(json()).
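For illustration only, a hypothetical sketch of the kind of defensive base64(json()) decode that would emit such a "garbage" log; the function name and exact log wording are invented here, not the provisioner's actual code:

// Hypothetical sketch of parsing userData stored as base64-encoded JSON.
function parseUserData(raw) {
  try {
    return JSON.parse(new Buffer(raw, 'base64').toString('utf8'));
  } catch (err) {
    // Anything that isn't valid base64(json()) -- including undefined --
    // lands here and gets flagged as garbage so an operator notices.
    console.log('[alert-operator] this userData (%s) is garbage', raw);
    return null;
  }
}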

I wish the migration part of base.Entity had an option to force updating the migration in the stored format, so that I could scan the table and stop seeing the migration stuff run all the time.

> I haven't found anything definitive in the logs yet.  Perhaps there is some
> exception state that is getting eaten up and thus not logged?

That's most certainly the case here.  Async/await works like synchronous code in a lot of ways; the most important way it doesn't is that unhandled rejected promises are dropped on the floor.
Flags: needinfo?(jhford)
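To illustrate the failure mode described above, a minimal, hypothetical sketch (doProvisioning is an invented name): an async function's rejection is silently discarded unless the returned promise is awaited or given a .catch(), so the error never reaches the logs.

// Hypothetical illustration of a rejection being dropped on the floor.
async function doProvisioning() {
  throw new Error('something went wrong inside the iteration');
}

// Fire-and-forget: nothing handles the rejection, so (in the Node of that
// era) nothing is logged and the caller carries on as if it had succeeded.
doProvisioning();

// Handled: the failure is at least visible in the logs.
doProvisioning().catch(function (err) {
  console.error('iteration failed:', err.stack);
});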
Pending but not being run for the last hour or so, trunk trees closed.
@jhford,
Did you deploy your fix above?

If so, I don't think it worked; I just had to restart the provisioner manually again.
According to heroku it had been running for 12 hours, and had been down for approximately 3 hours.
Logs from before death: https://gist.github.com/jonasfj/82dbc8a6baeb827e9e8f
I think it's safe to share, as security tokens should expire -- I hope they already have :)
(In reply to Jonas Finnemann Jensen (:jonasfj) from comment #10)
> @jhford,
> Did you deploy your fix above?
> 
> If so, I don't think it worked; I just had to restart the provisioner
> manually again.
> According to heroku it had been running for 12 hours, and had been down for
> approximately 3 hours.
> Logs from before death: https://gist.github.com/jonasfj/82dbc8a6baeb827e9e8f
> I think it's safe to share, as security tokens should expire -- I hope they
> already have :)

I did...  I tested that the provisioner would restart if a provisioning iteration lasted too long or if the gap between iterations was too long.

I'm completely confused by what's happening here.  I've been going over the code looking for any possible source for this.
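For context, the rough shape of such a liveness check; the names, thresholds, and use of process.exit(1) below are illustrative assumptions, not the actual patch:

// Rough sketch of a watchdog for the two conditions described above:
// an iteration that runs too long, or too large a gap between iterations.
const MAX_ITERATION_MS = 10 * 60 * 1000;  // an iteration longer than this is stuck
const MAX_GAP_MS       = 15 * 60 * 1000;  // too long since the last one finished

let iterationStartedAt = null;            // set when an iteration begins
let lastFinishedAt     = Date.now();      // updated when an iteration ends

setInterval(function () {
  const now = Date.now();
  if (iterationStartedAt && now - iterationStartedAt > MAX_ITERATION_MS) {
    console.error('provisioning iteration stuck, exiting so the dyno restarts');
    process.exit(1);
  }
  if (!iterationStartedAt && now - lastFinishedAt > MAX_GAP_MS) {
    console.error('no provisioning iteration for too long, exiting');
    process.exit(1);
  }
}, 30 * 1000);

// The provisioning loop would set iterationStartedAt/lastFinishedAt around
// each call to its iteration function.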
(In reply to Phil Ringnalda (:philor) from comment #9)
> Pending but not being run for the last hour or so, trunk trees closed.

We haven't seen any decision tasks since before midnight PT. All TC-running trees closed.
Flags: needinfo?(jhford)
Flags: needinfo?(garndt)
Also, we *really* need a better way of being notified of such things instead of just waiting for somebody to happen to notice the backlog when looking at Treeherder. Is nagios monitoring a possibility here?
Expanding needinfos in the hope that someone's around to look into this tree-closing issue soonish.
Flags: needinfo?(pmoore)
Flags: needinfo?(jopsen)
(In reply to Ryan VanderMeulen [:RyanVM UTC-4] from comment #14)
> Expanding needinfos in the hope that someone's around to look into this
> tree-closing issue soonish.

Yes, working with the guys in #taskcluster.

05:17 <&jhford> Tomcat|sheriffduty: Wed, 08 Jul 2015 12:16:17 GMT aws-provisioner:provision gecko-decision: 10 running capacity, 0 pending capacity and 0 pending tasks
05:17 <&jhford> Tomcat|sheriffduty: this is not the provisioner dieing this time
I don't know what's causing this, but I am certain that it's not a lack of nodes caused by the provisioner.  There are 10 gecko-decision workers booted right now.
Oh, I think this is a Queue issue... let me kick the Queue
Flags: needinfo?(jhford)
(In reply to John Ford [:jhford] -- please use 'needinfo?' instead of a CC from comment #17)
> Oh, I think this is a Queue issue... let me kick the Queue

Nope, this isn't a queue issue anymore.  I just submitted a 'cli' workertype job using the tools.taskcluster.net interface and found that:


> q.pendingTasks('aws-provisioner-v1', 'cli').then(console.log, console.log);
  taskcluster-client Calling: pendingTasks, retry: 0 +4m
{ then: [Function] }
>   taskcluster-client Success calling: pendingTasks, (0 retries) +632ms
{ provisionerId: 'aws-provisioner-v1',
  workerType: 'cli',
  pendingTasks: 1 }


And the provisioner is doing things correctly:

2015-07-08T12:48:25.500684+00:00 app[provisioner.1]: Wed, 08 Jul 2015 12:48:25 GMT aws-provisioner:provision cli: 0 running capacity, 1 pending capacity and 1 pending tasks


Which tells me that the problem here likely lies in mozilla-taskcluster.
My CLI job just completed successfully.  Either the gecko-decision workerType is busted or mozilla-taskcluster is not submitting tasks/graphs.
We are in the TaskClusterPlatform vidyo room going through this. Seems to be a problem with mozilla-taskcluster.
Flags: needinfo?(pmoore)
We've established that mozilla-taskcluster is having socket failures when talking to redis cloud.

e.g.

 Jul 08 15:27:15 mozilla-taskcluster app/worker.1: Wed, 08 Jul 2015 13:27:15 GMT config Loading additional configs [ 'production-treeherder-proxy.yml' ] 
Jul 08 15:27:15 mozilla-taskcluster app/worker.1: Wed, 08 Jul 2015 13:27:15 GMT config skip config /app/production-treeherder-proxy.yml 
Jul 08 15:27:15 mozilla-taskcluster app/worker.1: Wed, 08 Jul 2015 13:27:15 GMT taskcluster-proxy:db creating index id { w: 'majority', unique: true } 
Jul 08 15:27:15 mozilla-taskcluster app/worker.1: Wed, 08 Jul 2015 13:27:15 GMT taskcluster-proxy:db Initialize collection config 
Jul 08 15:27:15 mozilla-taskcluster app/worker.1: Wed, 08 Jul 2015 13:27:15 GMT config fetching document production 
Jul 08 15:27:15 mozilla-taskcluster app/worker.1: Wed, 08 Jul 2015 13:27:15 GMT taskcluster-proxy:db Initialize collection repositories 
Jul 08 15:27:15 mozilla-taskcluster app/worker.1: Wed, 08 Jul 2015 13:27:15 GMT taskcluster-proxy:db creating index id { w: 'majority', unique: true } 
Jul 08 15:27:15 mozilla-taskcluster app/worker.1: Wed, 08 Jul 2015 13:27:15 GMT taskcluster-proxy:db creating index url { w: 'majority', unique: true } 
Jul 08 15:27:16 mozilla-taskcluster app/worker.1: events.js:85 
Jul 08 15:27:16 mozilla-taskcluster app/worker.1:       throw er; // Unhandled 'error' event 
Jul 08 15:27:16 mozilla-taskcluster app/worker.1:             ^ 
Jul 08 15:27:16 mozilla-taskcluster app/worker.1: Error: Redis connection to pub-redis-19466.us-east-1-2.1.ec2.garantiadata.com:19466 failed - write EPIPE 
Jul 08 15:27:16 mozilla-taskcluster app/worker.1:     at RedisClient.on_error (/app/node_modules/kue/node_modules/redis/index.js:196:24) 
Jul 08 15:27:16 mozilla-taskcluster app/worker.1:     at Socket.<anonymous> (/app/node_modules/kue/node_modules/redis/index.js:106:14) 
Jul 08 15:27:16 mozilla-taskcluster app/worker.1:     at Socket.emit (events.js:107:17) 
Jul 08 15:27:16 mozilla-taskcluster app/worker.1:     at onwriteError (_stream_writable.js:317:10) 
Jul 08 15:27:16 mozilla-taskcluster app/worker.1:     at onwrite (_stream_writable.js:335:5) 
Jul 08 15:27:16 mozilla-taskcluster app/worker.1:     at WritableState.onwrite (_stream_writable.js:105:5) 
Jul 08 15:27:16 mozilla-taskcluster app/worker.1: <starting> 
Jul 08 15:27:16 mozilla-taskcluster app/worker.1:     at fireErrorCallbacks (net.js:456:13) 
Jul 08 15:27:16 mozilla-taskcluster app/worker.1:     at Socket._destroy (net.js:495:3) 
Jul 08 15:27:16 mozilla-taskcluster app/worker.1:     at WriteWrap.afterWrite (net.js:777:10) 

https://github.com/taskcluster/mozilla-taskcluster/ is a heroku app written in node.js, using https://www.npmjs.com/package/kue#redis-connection-settings to store data in redis, for which we are using https://redislabs.com/redis-cloud as a provider.

It seems for some reason writing to redis is causing some problems.

Greg (:garndt) has been able to write to the redis cloud account from his laptop, and redis cloud reports that there is ample storage available, so we suspected it might be an issue related to the redis client being used. Greg has upgraded the version of kue being used. We've also restarted services.

At the moment we are not sure what is causing the problem, and do not have an ETA on resolution.
Flags: needinfo?(garndt)
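As a hedged sketch of the mitigation side of this: kue's documented redis-connection-settings let the queue be pointed at a hosted redis, and listening for the queue's 'error' event keeps a transient write failure like the EPIPE above from becoming an unhandled 'error' event that kills the process. The connection values below are placeholders, not mozilla-taskcluster's actual config.

const kue = require('kue');

// Placeholder connection values; the real app reads these from its config.
const jobs = kue.createQueue({
  redis: {
    port: 19466,
    host: 'pub-redis-19466.us-east-1-2.1.ec2.garantiadata.com',
    auth: process.env.REDIS_PASSWORD
  }
});

// Without this listener, a redis write failure (e.g. EPIPE) surfaces as an
// unhandled 'error' event and throws, as in the log above.
jobs.on('error', function (err) {
  console.error('kue/redis error:', err);
});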
We continue to discuss this topic in TaskClusterPlatform vidyo room, if anyone wishes to join.
It appears that both decision tasks and retriggers are working now.
So far this is my understanding of our setup.  There are two instances of 'mozilla-taskcluster', which is our bridge between pushes, Treeherder, and tasks.  It monitors the pushlog, and when a new push is detected a job is created on a redis-backed job queue.  Workers then pick up this work and create task graphs.
Whoops, hit enter too quickly..

Earlier in the week the worker on the production mozilla-taskcluster stopped working, which is believed to be the cause of retriggers no longer working.

This morning the redis store used by mozilla-taskcluster-staging failed with the same error, and that is when all decision tasks stopped being created as well.  A support request was opened with Rediscloud (the redis addon we use in heroku) but they have not found the cause of the issues yet.  A new heroku-redis addon was deployed as a temporary solution and appears to have fixed both the retrigger issue and the scheduling of gecko decision tasks.

There is more investigation needed to understand why mozilla-taskcluster-staging is the app responsible for scheduling task graphs, but for now the issue is fixed.  Also, the heroku-redis addon is just a temporary solution.  Either a higher-spec version needs to be deployed during a low-use period (all existing jobs will be lost) or we move back to rediscloud (which still results in job loss during that time).

Hopefully the heroku-redis addon will be sufficient until the weekend when it can either be upgraded or the redis cloud issue is resolved.
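A minimal kue-shaped sketch of the flow described above (pushlog monitor → redis-backed job queue → workers creating task graphs); the job type, payload fields, and createTaskGraphFor helper are hypothetical, not mozilla-taskcluster's actual code.

const kue = require('kue');
const jobs = kue.createQueue();   // connects to redis with default settings

// Hypothetical stand-in for mozilla-taskcluster's task-graph creation.
function createTaskGraphFor(push) {
  return Promise.resolve(console.log('would create a task graph for', push));
}

// Producer side: the pushlog monitor enqueues one job per detected push.
jobs.create('push', { repo: 'mozilla-inbound', rev: 'abcdef0123456789' })
    .attempts(3)
    .save(function (err) {
      if (err) console.error('failed to enqueue push job:', err);
    });

// Worker side: picks jobs off the redis-backed queue and builds the graph.
jobs.process('push', function (job, done) {
  createTaskGraphFor(job.data).then(function () { done(); }, done);
});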
@garndt, there is no reason why mozilla-taskcluster-staging, rather than mozilla-taskcluster, is the app creating task-graphs; it's just how it is...
IMO, mozilla-taskcluster suffers from buzzword-overload and should be rearchitected,
preferably into 2-3 different components, handling different concerns. And none of them should use
mongodb or redis for storage or queuing.
 - These things will eventually run out of memory :)

----
Note, this issue was unrelated to the previous discussion in this bug.
I suspect maybe what we can take from this is that we need monitoring.
Flags: needinfo?(jopsen)
All trees reopened.
So far the memory consumption for the heroku-redis instances used by mozilla-taskcluster and mozilla-taskcluster-staging is around 1 MB each.  Now that trees are open and heroku redis seems to be working, we can close this bug.

Also bug 1181742 has been opened to debate various ways we could implement monitoring of the taskcluster infrastructure.
Status: REOPENED → RESOLVED
Closed: 9 years ago9 years ago
Resolution: --- → FIXED
See Also: → 1181979
Component: TaskCluster → General
Product: Testing → Taskcluster
Target Milestone: --- → mozilla41
Resetting Version and Target Milestone that accidentally got changed...
Target Milestone: mozilla41 → ---
Version: Trunk → unspecified