Jobs are not being triggered
Categories
(Taskcluster :: Services, defect)
Tracking
(Not tracked)
People
(Reporter: ebalazs_, Assigned: dustin)
References
Details
Attachments
(1 file)
289.62 KB, image/png
After the Gecko Decision Task is green, the rest of the builds don't start and the push looks like it is complete.
E.g.: https://treeherder.mozilla.org/#/jobs?repo=autoland&revision=176554242eb31ffb066e48c84004eba1824293d5
Reporter
Comment 1•6 years ago
Hi, could you have a look?
Reporter
Comment 2•6 years ago
Trees are closed for this.
Looks like the decision tasks are still generating target tasks like https://taskcluster-artifacts.net/QgCz5LYcSLexZJCC5bGDDw/0/public/target-tasks.json ...
Looks like the tasks are being generated, but they're all listed as "unscheduled"?
Comment 6•6 years ago
Indeed, it looks like the tasks are not being scheduled, which is strange. Some time ago we introduced the ability to call queue.defineTask to create a task without scheduling it, and then schedule it later. I don't think anything uses this any more, but it looks like tasks are now getting created without being scheduled, presumably because of a recently rolled-out change that introduced a bug.
Here is an example:
https://tools.taskcluster.net/groups/GIqwXIiJQ7eN5dA5QLDvkg/tasks/VKsVRRG6SzynEhV8NO8D5Q/details
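For reference, this is roughly what the define-then-schedule flow looks like with the JavaScript taskcluster-client. This is a sketch only: the credentials, taskId, and task definition below are made up, not taken from the affected decision tasks.

    // Sketch only: define a task without scheduling it, then schedule it later.
    // Credentials, worker type, and metadata here are all hypothetical.
    const taskcluster = require('taskcluster-client');

    async function main() {
      const queue = new taskcluster.Queue({
        credentials: {clientId: '...', accessToken: '...'},
      });

      const taskId = taskcluster.slugid();
      const task = {
        provisionerId: 'aws-provisioner-v1',
        workerType: 'gecko-t-linux-xlarge',
        created: taskcluster.fromNowJSON(),
        deadline: taskcluster.fromNowJSON('1 day'),
        payload: {},
        metadata: {
          name: 'example task',
          description: 'illustration only',
          owner: 'example@mozilla.com',
          source: 'https://bugzilla.mozilla.org/',
        },
      };

      // defineTask creates the task in the "unscheduled" state...
      await queue.defineTask(taskId, task);
      // ...and it stays there until something calls scheduleTask (with
      // createTask, the queue schedules it once its dependencies resolve).
      await queue.scheduleTask(taskId);
    }

    main().catch(err => { console.error(err); process.exit(1); });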
Comment 7•6 years ago
The last Queue deployment was Thursday evening, EST:
https://dashboard.heroku.com/apps/queue-taskcluster-net/activity
Not currently sure if this is related. Presumably things ran OK on Friday.
Comment 8•6 years ago
wcosta spotted that the dependency resolver went down (we had an email alert) and restarted the Queue service. Things seem to be working now, and tasks seem to be getting scheduled.
We're currently not sure why it went down or why it didn't self-recover, but we're investigating.
The trees can probably be reopened now.
I did spot this in papertrail; not sure if it was the cause. We do have 4 dynos running for the dependencyResolver, so perhaps all four of them failed for the same reason?
But like I say, I have no idea why this didn't self-recover with new dynos starting up, or, if that did happen, why they then failed for some other reason.
Comment 9•6 years ago
The problem is with the dependency resolver; there was this SignalFx alert right before the trees were closed:
https://app.signalfx.com/#/detector/CjefBMEAYAQ/edit?incidentId=DxNxVLuAcMY&is=anomalous
:pmoore also spotted this on papertrail https://papertrailapp.com/groups/853883/events?focus=1023893871089836047&selected=1023893871089836047
(Copying here because papertrail logs will evaporate after two days)
Jan 21 04:35:44 taskcluster-queue app/web.25: Authorized project/taskcluster/aws-provisioner/production for POST access to /v1/claim-work/aws-provisioner-v1/gecko-3-b-win2012
Jan 21 04:35:44 taskcluster-queue app/web.25: Authorized project/releng/generic-worker/os-x/production for POST access to /v1/claim-work/releng-hardware/gecko-t-osx-1010
Jan 21 04:35:44 taskcluster-queue app/web.25: Authorized project/taskcluster/aws-provisioner/production for POST access to /v1/claim-work/aws-provisioner-v1/gecko-t-linux-xlarge
Jan 21 04:35:44 taskcluster-queue app/web.25: 2019-01-21T06:35:44.338Z base:entity TIMING: getEntity on QueueWorker took 137.749057 milliseconds.
Jan 21 04:35:44 taskcluster-queue app/web.17: Time:2019-01-21T06:35:44.2544037Z
Jan 21 04:35:44 taskcluster-queue app/web.17: at /app/node_modules/fast-azure-storage/lib/queue.js:321:27
Jan 21 04:35:44 taskcluster-queue app/web.17: at tryCallOne (/app/node_modules/promise/lib/core.js:37:12)
Jan 21 04:35:44 taskcluster-queue app/web.17: at /app/node_modules/promise/lib/core.js:123:15
Jan 21 04:35:44 taskcluster-queue app/web.17: at flush (/app/node_modules/asap/raw.js:50:29)
Jan 21 04:35:44 taskcluster-queue app/web.17: at process._tickCallback (internal/process/next_tick.js:61:11)
Jan 21 04:35:44 taskcluster-queue app/web.17: 2019-01-21T06:35:44.347Z base:entity TIMING: getEntity on QueueWorker took 140.537029 milliseconds.
Jan 21 04:35:44 taskcluster-queue heroku/router: at=info method=POST path="/v1/claim-work/releng-hardware/gecko-t-osx-1010" host=queue.taskcluster.net request_id=7b48e959-e2c6-49b4-ad30-bdbd72fbe853 fwd="63.245.210.129" dyno=web.6 connect=1ms service=15800ms status=500 bytes=1269 protocol=https
Jan 21 04:35:44 taskcluster-queue app/dependencyResolver.1: raven@2.6.4 alert: failed to send exception to sentry: HTTP Error (429): undefined
Jan 21 04:35:44 taskcluster-queue app/web.19: at HTTPParser.parserOnIncomingClient [as onIncoming] (_http_client.js:555:21)
Jan 21 04:35:44 taskcluster-queue app/web.19: at HTTPParser.parserOnHeadersComplete (_http_common.js:109:17)
Jan 21 04:35:44 taskcluster-queue app/web.19: at TLSSocket.socketOnData (_http_client.js:441:20)
Jan 21 04:35:44 taskcluster-queue app/web.19: at TLSSocket.emit (events.js:182:13)
Jan 21 04:35:44 taskcluster-queue app/web.19: at TLSSocket.EventEmitter.emit (domain.js:442:20)
Jan 21 04:35:44 taskcluster-queue app/web.19: at addChunk (_stream_readable.js:283:12)
Jan 21 04:35:44 taskcluster-queue app/web.19: at readableAddChunk (_stream_readable.js:264:11)
Jan 21 04:35:44 taskcluster-queue app/web.19: at TLSSocket.Readable.push (_stream_readable.js:219:10)
Jan 21 04:35:44 taskcluster-queue app/dependencyResolver.1: While trying to report OperationTimedOutError: Operation could not be completed within the specified time.
Jan 21 04:35:44 taskcluster-queue app/dependencyResolver.1: RequestId:4a565a89-3003-0062-2553-b188f3000000
Jan 21 04:35:44 taskcluster-queue app/dependencyResolver.1: Time:2019-01-21T06:35:43.2768985Z
Jan 21 04:35:44 taskcluster-queue app/dependencyResolver.1: at /app/node_modules/fast-azure-storage/lib/queue.js:321:27
Jan 21 04:35:44 taskcluster-queue app/dependencyResolver.1: at tryCallOne (/app/node_modules/promise/lib/core.js:37:12)
Jan 21 04:35:44 taskcluster-queue app/dependencyResolver.1: at /app/node_modules/promise/lib/core.js:123:15
Jan 21 04:35:44 taskcluster-queue app/dependencyResolver.1: at flush (/app/node_modules/asap/raw.js:50:29)
Jan 21 04:35:44 taskcluster-queue app/dependencyResolver.1: at process._tickCallback (internal/process/next_tick.js:61:11)
Jan 21 04:35:44 taskcluster-queue app/dependencyResolver.1: ...reporting to sentry failed: Error: HTTP Error (429): undefined
Jan 21 04:35:44 taskcluster-queue app/dependencyResolver.1: at ClientRequest.<anonymous> (/app/node_modules/raven/lib/transports.js:56:15)
Jan 21 04:35:44 taskcluster-queue app/dependencyResolver.1: at Object.onceWrapper (events.js:273:13)
Jan 21 04:35:44 taskcluster-queue app/dependencyResolver.1: at ClientRequest.emit (events.js:187:15)
Jan 21 04:35:44 taskcluster-queue app/dependencyResolver.1: at ClientRequest.EventEmitter.emit (domain.js:442:20)
Jan 21 04:35:44 taskcluster-queue app/dependencyResolver.1: at HTTPParser.parserOnIncomingClient [as onIncoming] (_http_client.js:555:21)
Jan 21 04:35:44 taskcluster-queue app/dependencyResolver.1: at HTTPParser.parserOnHeadersComplete (_http_common.js:109:17)
Jan 21 04:35:44 taskcluster-queue app/dependencyResolver.1: at TLSSocket.socketOnData (_http_client.js:441:20)
(time is in my local timezone)
:dustin, could this be related to the deployment on Friday?
Comment 10•6 years ago
I suspect the root cause is that Azure rate-limited us with HTTP 429, which caused the dependencyResolver to crash. We don't seem to have an auto-restart, so I believe all 4 dynos went down without getting restarted.
See the ~4.5 hour logging gap here:
The fixes required here are probably:
- handle HTTP 429 responses from Azure (see the sketch after this list)
- make sure the dependencyResolver has restart-on-crash semantics
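To make the first point concrete, here is a minimal sketch of retrying on 429 with exponential backoff. This is not the actual fix, and pollAzureQueue is a hypothetical stand-in for whatever fast-azure-storage call the dependency resolver makes:

    // Sketch only, not the actual fix: retry a call when Azure answers
    // HTTP 429 (TooManyRequests), backing off exponentially between tries.
    const sleep = ms => new Promise(resolve => setTimeout(resolve, ms));

    async function withRetryOn429(fn, maxAttempts = 5) {
      for (let attempt = 1; ; attempt++) {
        try {
          return await fn();
        } catch (err) {
          const throttled = err.statusCode === 429 || err.code === 'TooManyRequests';
          if (!throttled || attempt >= maxAttempts) {
            throw err;  // give up and let the caller / restart logic handle it
          }
          await sleep(Math.pow(2, attempt) * 100);  // 200ms, 400ms, 800ms, ...
        }
      }
    }

    // usage (pollAzureQueue is hypothetical):
    //   const messages = await withRetryOn429(() => pollAzureQueue());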
Comment 11•6 years ago
Dustin, any idea if this looks correct? I wasn't sure if some of these services should have dynos running that currently have zero dynos (such as expireTask). It may be ok, just wanted to check!
Assignee
Comment 12•6 years ago
It's fine - those are all started by the Heroku Scheduler on a schedule.
A task will be unscheduled until its dependencies are resolved.
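For illustration, the dependency-related fragment of a task definition looks roughly like this (the taskId and values below are hypothetical). With requires set to all-completed, the Queue's dependency resolver is what moves the task from unscheduled to pending once every listed dependency has completed:

    // Illustrative fragment of a task definition; values are hypothetical.
    const task = {
      // ...provisionerId, workerType, payload, metadata as usual...
      dependencies: ['someUpstreamTaskId'],  // e.g. the decision task's taskId
      requires: 'all-completed',             // stay unscheduled until all deps complete
    };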
We do handle the timeout (by reporting it). The weird thing is that we couldn't report to sentry due to a 429. That typically happens when Sentry rate-limits us due to lots of exceptions -- as was the case here when Azure went down and caused all manner of exceptions.
Jan 21 06:35:44 taskcluster-queue app/dependencyResolver.1: raven@2.6.4 alert: failed to send exception to sentry: HTTP Error (429): undefined
Heroku automatically restarts failed dynos, but it doesn't appear to have done so here. I wonder if these actually crashed, or just hung.
https://devcenter.heroku.com/changelog-items/344
> In the normal case of a long-running web or worker process getting an occasional crash, the dyno will be restarted without any intervention on your part
Anyway, we won't be using sentry for very much longer, so I'm not going to worry about that issue. However, I think it makes sense to look at what happened here to cause the process to hang.
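If the dynos hung rather than crashed, Heroku would have had nothing to restart. One hedged sketch of turning hangs and unhandled errors into hard exits, so the platform's restart-on-crash behaviour can kick in (the watchdog interval and names are made up, not taken from the queue codebase):

    // Sketch only, not from the queue codebase: exit on unhandled errors
    // and on stalls so that Heroku restarts the dyno.
    process.on('unhandledRejection', err => {
      console.error('unhandled rejection, exiting:', err);
      process.exit(1);
    });

    // Hypothetical watchdog: the resolver's main loop calls heartbeat()
    // each iteration; if it stalls (e.g. stuck on an Azure call), we exit.
    let lastBeat = Date.now();
    const heartbeat = () => { lastBeat = Date.now(); };

    setInterval(() => {
      if (Date.now() - lastBeat > 10 * 60 * 1000) {  // 10 minutes of no progress
        console.error('watchdog: no progress for 10 minutes, exiting');
        process.exit(1);
      }
    }, 60 * 1000).unref();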
Comment 14•6 years ago
(In reply to Dustin J. Mitchell [:dustin] pronoun: he from comment #12)
> Anyway, we won't be using sentry for very much longer, so I'm not going to worry about that issue. However, I think it makes sense to look at what happened here to cause the process to hang.
I wonder if failing to send a report to sentry causes the process to hang...
That would kind of be ironic. :)