Closed Bug 1177315 Opened 9 years ago Closed 9 years ago

gecko-decision tasks not running

Categories

(Taskcluster :: General, defect)

Type: defect
Priority: Not set
Severity: blocker

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: philor, Unassigned)

References

Details

Attachments

(1 file)

54 bytes, text/x-github-pull-request
jonasfj: review+
https://treeherder.mozilla.org/#/jobs?repo=b2g-inbound&revision=d0822eec251e&exclusion_profile=false and https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&revision=a195a8127e8d&exclusion_profile=false show pending gecko-decision tasks; pushes above them don't even show a pending one, and thus of course none of the resulting builds or tests run either.

All trunk trees are closed as a result.
It appears that the instances were not getting provisioned.  As of 11:01 PDT, instances were being started and the pending backlog was being completed.  NI? jhford on this bug so we can take a look at what the cause might have been.
Flags: needinfo?(jhford)
Trees reopened at 0:18am Pacific.
I can't find anything in the logs that points to a cause.  The timeline we discussed on IRC doesn't match up with anything happening in the provisioner, so I'm not entirely sure what happened.
Flags: needinfo?(jhford)
Marking as resolved -> fixed since we're back in action.
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Reopening this because the same issue appeared again.  

A provisioning iteration [1] started at 18:39:09 GMT and never logged a success.  It was the last provisioning iteration logged until the provisioner was restarted at 21:18:08 GMT.

I'm not sure if this matters at all; it seems to happen often, so probably not an issue:
 Jul 01 13:40:05 taskcluster-aws-provisioner2 app/provisioner.1: Wed, 01 Jul 2015 18:40:04 GMT azure-agent Destroying socket after idle timeout 

I noticed a log of:
[alert-operator] this userData (undefined) is garbage  

But was told that is not an issue (although we should fix these "undefined" pieces to reduce the noise).

I haven't found anything definitive in the logs yet.  Perhaps there is some exception state that is getting eaten up and thus not logged?


[1] https://papertrailapp.com/systems/taskcluster-aws-provisioner2/events?r=552971382267621389-552971703861686287
Status: RESOLVED → REOPENED
Flags: needinfo?(jhford)
Resolution: FIXED → ---
Attached file PR 30
Adding this PR that John opened against the provisioner so we have visibility into the fix.
Attachment #8628924 - Flags: review?(jopsen)
Comment on attachment 8628924 [details] [review]
PR 30

I strongly advise that this be cleaned up...

If there is an argument for crashing on an absolute error count, this has an r+.
Otherwise, I suggest we only crash on errors that happen in a row.
Attachment #8628924 - Flags: review?(jopsen) → review+
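For context, a minimal sketch of the "only crash on errors happening in a row" idea suggested in the review; runIteration(), the threshold, and the loop structure are all hypothetical stand-ins, not the contents of PR 30:

// Hypothetical sketch: restart only after several failed iterations in a row.
// runIteration() and MAX_CONSECUTIVE_ERRORS are illustrative names.
async function runIteration() {
  // stand-in for one provisioning pass
}

const MAX_CONSECUTIVE_ERRORS = 5;

async function provisionLoop() {
  let consecutiveErrors = 0;
  while (true) {
    try {
      await runIteration();
      consecutiveErrors = 0;                 // any success resets the count
    } catch (err) {
      consecutiveErrors += 1;
      console.error('iteration failed (%d in a row)', consecutiveErrors, err);
      if (consecutiveErrors >= MAX_CONSECUTIVE_ERRORS) {
        process.exit(1);                     // let heroku restart the dyno
      }
    }
    await new Promise(resolve => setTimeout(resolve, 60 * 1000));
  }
}

provisionLoop().catch(err => {
  console.error('provision loop died:', err);
  process.exit(1);
});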
My patches went live today.


(In reply to Greg Arndt [:garndt] from comment #5) 
> I'm not sure if this matters at all; it seems to happen often, so probably
> not an issue:
>  Jul 01 13:40:05 taskcluster-aws-provisioner2 app/provisioner.1: Wed, 01 Jul
> 2015 18:40:04 GMT azure-agent Destroying socket after idle timeout 

nope, not important that I know of

> I noticed a log of:
> [alert-operator] this userData (undefined) is garbage  
> 
> But was told that is not an issue (although we should fix these "undefined"
> pieces to reduce the noise).

Actually, I just changed the logging slightly.  The garbage message was in case we had something there that wasn't parsable as base64(json()).
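For illustration only, a hypothetical sketch of the kind of defensive base64(json()) decode that would emit such a "garbage" log; the function name and exact log wording are invented here, not the provisioner's actual code:

// Hypothetical sketch of parsing userData stored as base64-encoded JSON.
function parseUserData(raw) {
  try {
    return JSON.parse(new Buffer(raw, 'base64').toString('utf8'));
  } catch (err) {
    // Anything that isn't valid base64(json()) -- including undefined --
    // lands here and gets flagged as garbage so an operator notices.
    console.log('[alert-operator] this userData (%s) is garbage', raw);
    return null;
  }
}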

I wish the migration part of base.Entity had an option to force updating the migration in the stored format, so that I could scan the table and stop seeing the migration stuff run all the time.

> I haven't found anything definitive in the logs yet.  Perhaps there is some
> exception state that is getting eaten up and thus not logged?

That's most certainly the case here.  Async/await works like synchronous code in a lot of ways; the most important way it doesn't is that unhandled rejected promises are dropped on the floor.
Flags: needinfo?(jhford)
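To illustrate the failure mode described above, a minimal, hypothetical sketch (doProvisioning is an invented name): an async function's rejection is silently discarded unless the returned promise is awaited or given a .catch(), so the error never reaches the logs.

// Hypothetical illustration of a rejection being dropped on the floor.
async function doProvisioning() {
  throw new Error('something went wrong inside the iteration');
}

// Fire-and-forget: nothing handles the rejection, so (in the Node of that
// era) nothing is logged and the caller carries on as if it had succeeded.
doProvisioning();

// Handled: the failure is at least visible in the logs.
doProvisioning().catch(function (err) {
  console.error('iteration failed:', err.stack);
});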
Pending but not being run for the last hour or so, trunk trees closed.
@jhford,
Did you deploy your fix above?

If so, I don't think it worked; I just had to restart the provisioner manually again.
According to heroku it had been running for 12 hours, and had been down for approximately 3 hours.
Logs from before death: https://gist.github.com/jonasfj/82dbc8a6baeb827e9e8f
I think it's safe to share, as security tokens should expire -- I hope they already have :)
(In reply to Jonas Finnemann Jensen (:jonasfj) from comment #10)
> @jhford,
> Did you deploy your fix above?
> 
> If so, I don't think it worked; I just had to restart the provisioner
> manually again.
> According to heroku it had been running for 12 hours, and had been down for
> approximately 3 hours.
> Logs from before death: https://gist.github.com/jonasfj/82dbc8a6baeb827e9e8f
> I think it's safe to share, as security tokens should expire -- I hope they
> already have :)

I did...  I tested that the provisioner would restart if a provisioning iteration lasted too long or if the gap between iterations was too long.

I'm completely confused by what's happening here.  I've been going over the code looking for any possible source for this.
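For context, the rough shape of such a liveness check; the names, thresholds, and use of process.exit(1) below are illustrative assumptions, not the actual patch:

// Rough sketch of a watchdog for the two conditions described above:
// an iteration that runs too long, or too large a gap between iterations.
const MAX_ITERATION_MS = 10 * 60 * 1000;  // an iteration longer than this is stuck
const MAX_GAP_MS       = 15 * 60 * 1000;  // too long since the last one finished

let iterationStartedAt = null;            // set when an iteration begins
let lastFinishedAt     = Date.now();      // updated when an iteration ends

setInterval(function () {
  const now = Date.now();
  if (iterationStartedAt && now - iterationStartedAt > MAX_ITERATION_MS) {
    console.error('provisioning iteration stuck, exiting so the dyno restarts');
    process.exit(1);
  }
  if (!iterationStartedAt && now - lastFinishedAt > MAX_GAP_MS) {
    console.error('no provisioning iteration for too long, exiting');
    process.exit(1);
  }
}, 30 * 1000);

// The provisioning loop would set iterationStartedAt/lastFinishedAt around
// each call to its iteration function.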
(In reply to Phil Ringnalda (:philor) from comment #9)
> Pending but not being run for the last hour or so, trunk trees closed.

We haven't seen any decision tasks since before midnight PT. All TC-running trees closed.
Flags: needinfo?(jhford)
Flags: needinfo?(garndt)
Also, we *really* need a better way of being notified of such things instead of just waiting for somebody to happen to notice the backlog when looking at Treeherder. Is nagios monitoring a possibility here?
Expanding needinfos in the hope that someone's around to look into this tree-closing issue soonish.
Flags: needinfo?(pmoore)
Flags: needinfo?(jopsen)
(In reply to Ryan VanderMeulen [:RyanVM UTC-4] from comment #14)
> Expanding needinfos in the hope that someone's around to look into this
> tree-closing issue soonish.

Yes, working with the guys in #taskcluster.

05:17 <&jhford> Tomcat|sheriffduty: Wed, 08 Jul 2015 12:16:17 GMT aws-provisioner:provision gecko-decision: 10 running capacity, 0 pending capacity and 0 pending tasks
05:17 <&jhford> Tomcat|sheriffduty: this is not the provisioner dieing this time
I don't know what's causing this, but I am certain that it's not a lack of nodes caused by the provisioner.  There are 10 gecko-decision workers booted right now.
Oh, I think this is a Queue issue... let me kick the Queue
Flags: needinfo?(jhford)
(In reply to John Ford [:jhford] -- please use 'needinfo?' instead of a CC from comment #17)
> Oh, I think this is a Queue issue... let me kick the Queue

Nope, this isn't a queue issue anymore.  I just submitted a 'cli' workertype job using the tools.taskcluster.net interface and found that:


> q.pendingTasks('aws-provisioner-v1', 'cli').then(console.log, console.log);
  taskcluster-client Calling: pendingTasks, retry: 0 +4m
{ then: [Function] }
>   taskcluster-client Success calling: pendingTasks, (0 retries) +632ms
{ provisionerId: 'aws-provisioner-v1',
  workerType: 'cli',
  pendingTasks: 1 }


And the provisioner is doing things correctly:

2015-07-08T12:48:25.500684+00:00 app[provisioner.1]: Wed, 08 Jul 2015 12:48:25 GMT aws-provisioner:provision cli: 0 running capacity, 1 pending capacity and 1 pending tasks


Which tells me that the problem here likely lies in mozilla-taskcluster.
My CLI job just completed successfully.  Either the gecko-decision workerType is busted or mozilla-taskcluster is not submitting tasks/graphs.
We are in the TaskClusterPlatform vidyo room going through this. Seems to be a problem with mozilla-taskcluster.
Flags: needinfo?(pmoore)
We've established that mozilla-taskcluster is having socket failures when talking to redis cloud.

e.g.

 Jul 08 15:27:15 mozilla-taskcluster app/worker.1: Wed, 08 Jul 2015 13:27:15 GMT config Loading additional configs [ 'production-treeherder-proxy.yml' ] 
Jul 08 15:27:15 mozilla-taskcluster app/worker.1: Wed, 08 Jul 2015 13:27:15 GMT config skip config /app/production-treeherder-proxy.yml 
Jul 08 15:27:15 mozilla-taskcluster app/worker.1: Wed, 08 Jul 2015 13:27:15 GMT taskcluster-proxy:db creating index id { w: 'majority', unique: true } 
Jul 08 15:27:15 mozilla-taskcluster app/worker.1: Wed, 08 Jul 2015 13:27:15 GMT taskcluster-proxy:db Initialize collection config 
Jul 08 15:27:15 mozilla-taskcluster app/worker.1: Wed, 08 Jul 2015 13:27:15 GMT config fetching document production 
Jul 08 15:27:15 mozilla-taskcluster app/worker.1: Wed, 08 Jul 2015 13:27:15 GMT taskcluster-proxy:db Initialize collection repositories 
Jul 08 15:27:15 mozilla-taskcluster app/worker.1: Wed, 08 Jul 2015 13:27:15 GMT taskcluster-proxy:db creating index id { w: 'majority', unique: true } 
Jul 08 15:27:15 mozilla-taskcluster app/worker.1: Wed, 08 Jul 2015 13:27:15 GMT taskcluster-proxy:db creating index url { w: 'majority', unique: true } 
Jul 08 15:27:16 mozilla-taskcluster app/worker.1: events.js:85 
Jul 08 15:27:16 mozilla-taskcluster app/worker.1:       throw er; // Unhandled 'error' event 
Jul 08 15:27:16 mozilla-taskcluster app/worker.1:             ^ 
Jul 08 15:27:16 mozilla-taskcluster app/worker.1: Error: Redis connection to pub-redis-19466.us-east-1-2.1.ec2.garantiadata.com:19466 failed - write EPIPE 
Jul 08 15:27:16 mozilla-taskcluster app/worker.1:     at RedisClient.on_error (/app/node_modules/kue/node_modules/redis/index.js:196:24) 
Jul 08 15:27:16 mozilla-taskcluster app/worker.1:     at Socket.<anonymous> (/app/node_modules/kue/node_modules/redis/index.js:106:14) 
Jul 08 15:27:16 mozilla-taskcluster app/worker.1:     at Socket.emit (events.js:107:17) 
Jul 08 15:27:16 mozilla-taskcluster app/worker.1:     at onwriteError (_stream_writable.js:317:10) 
Jul 08 15:27:16 mozilla-taskcluster app/worker.1:     at onwrite (_stream_writable.js:335:5) 
Jul 08 15:27:16 mozilla-taskcluster app/worker.1:     at WritableState.onwrite (_stream_writable.js:105:5) 
Jul 08 15:27:16 mozilla-taskcluster app/worker.1: <starting> 
Jul 08 15:27:16 mozilla-taskcluster app/worker.1:     at fireErrorCallbacks (net.js:456:13) 
Jul 08 15:27:16 mozilla-taskcluster app/worker.1:     at Socket._destroy (net.js:495:3) 
Jul 08 15:27:16 mozilla-taskcluster app/worker.1:     at WriteWrap.afterWrite (net.js:777:10) 

https://github.com/taskcluster/mozilla-taskcluster/ is a heroku app written in node.js, using https://www.npmjs.com/package/kue#redis-connection-settings to store data in redis, for which we are using https://redislabs.com/redis-cloud as a provider.

It seems for some reason writing to redis is causing some problems.

Greg (:garndt) has been able to write to the redis cloud account from his laptop, and redis cloud reports that there is ample storage available, so we suspected it might be an issue related to the redis client being used. Greg has upgraded the version of kue being used. We've also restarted services.

At the moment we are not sure what is causing the problem, and do not have an ETA on resolution.
Flags: needinfo?(garndt)
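As a hedged sketch of the mitigation side of this: kue's documented redis-connection-settings let the queue be pointed at a hosted redis, and listening for the queue's 'error' event keeps a transient write failure like the EPIPE above from becoming an unhandled 'error' event that kills the process. The connection values below are placeholders, not mozilla-taskcluster's actual config.

const kue = require('kue');

// Placeholder connection values; the real app reads these from its config.
const jobs = kue.createQueue({
  redis: {
    port: 19466,
    host: 'pub-redis-19466.us-east-1-2.1.ec2.garantiadata.com',
    auth: process.env.REDIS_PASSWORD
  }
});

// Without this listener, a redis write failure (e.g. EPIPE) surfaces as an
// unhandled 'error' event and throws, as in the log above.
jobs.on('error', function (err) {
  console.error('kue/redis error:', err);
});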
We continue to discuss this topic in TaskClusterPlatform vidyo room, if anyone wishes to join.
It appears that both decision tasks and retriggers are working now.
So far this is my understanding of our setup.  There are two instances of 'mozilla-taskcluster', which is our bridge between pushes, Treeherder, and tasks.  It monitors the pushlog, and when a new push is detected a job is created on a redis-backed job queue.  Workers then pick up this work and create task graphs.
Whoops, hit enter too quickly..

Earlier in the week the worker on the production mozilla-taskcluster stopped working, which is believed to be the cause of retriggers no longer working.

This morning the redis store used by mozilla-taskcluster-staging failed with the same error, and that is when all decision tasks stopped being created as well.  A support request was opened with Rediscloud (the redis addon we use in heroku) but they have not found the cause of the issues yet.  A new heroku-redis addon was deployed as a temporary solution and appears to have fixed both the retrigger issue and the scheduling of gecko decision tasks.

There is more investigation needed to understand why mozilla-taskcluster-staging is the app responsible for scheduling task graphs, but for now the issue is fixed.  Also, the heroku-redis addon is just a temporary solution.  Either a higher-spec version needs to be deployed during a low-use period (all existing jobs will be lost) or we move back to rediscloud (which still results in job loss during that time).

Hopefully the heroku-redis addon will be sufficient until the weekend when it can either be upgraded or the redis cloud issue is resolved.
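A minimal kue-shaped sketch of the flow described above (pushlog monitor → redis-backed job queue → workers creating task graphs); the job type, payload fields, and createTaskGraphFor helper are hypothetical, not mozilla-taskcluster's actual code.

const kue = require('kue');
const jobs = kue.createQueue();   // connects to redis with default settings

// Hypothetical stand-in for mozilla-taskcluster's task-graph creation.
function createTaskGraphFor(push) {
  return Promise.resolve(console.log('would create a task graph for', push));
}

// Producer side: the pushlog monitor enqueues one job per detected push.
jobs.create('push', { repo: 'mozilla-inbound', rev: 'abcdef0123456789' })
    .attempts(3)
    .save(function (err) {
      if (err) console.error('failed to enqueue push job:', err);
    });

// Worker side: picks jobs off the redis-backed queue and builds the graph.
jobs.process('push', function (job, done) {
  createTaskGraphFor(job.data).then(function () { done(); }, done);
});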
@garndt, there is no reason why mozilla-taskcluster-staging, rather than mozilla-taskcluster, is the app creating task-graphs; it's just how it is...
IMO, mozilla-taskcluster suffers from buzzword-overload and should be rearchitected,
preferably into 2-3 different components, handling different concerns. And none of them should use
mongodb or redis for storage or queuing.
 - These things will eventually run out of memory :)

----
Note, this issue was unrelated to the previous discussion in this bug.
I suspect maybe what we can take from this is that we need monitoring.
Flags: needinfo?(jopsen)
All trees reopened.
So far the memory consumption for the heroku-redis instances used by mozilla-taskcluster and mozilla-taskcluster-staging is around 1 MB each.  Now that trees are open and heroku redis seems to be working, we can close this bug.

Also bug 1181742 has been opened to debate various ways we could implement monitoring of the taskcluster infrastructure.
Status: REOPENED → RESOLVED
Closed: 9 years ago9 years ago
Resolution: --- → FIXED
See Also: → 1181979
Component: TaskCluster → General
Product: Testing → Taskcluster
Target Milestone: --- → mozilla41
Resetting Version and Target Milestone that accidentally got changed...
Target Milestone: mozilla41 → ---
Version: Trunk → unspecified