Closed
Bug 1436620
Opened 7 years ago
Closed 7 years ago
Provisioner crashes when pulse connection terminates
Categories
(Taskcluster :: Services, defect)
Taskcluster
Services
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: bstack, Assigned: dustin)
References
Details
Attachments
(4 files)
Today at 17:49 PST a sheriff reported trouble using the provisioner dashboard in the tools site. It was having a "network error".
Upon looking it seemed like the whole thing had crashed and was 500ing every request. I looked for a couple minutes but just ended up restarting it. Not sure what exactly happened. I'm not assuming that we will magically figure this out either, but just wanted to open a tracking bug. This same sort of thing happened late in the day on Friday as well.
Reporter | ||
Comment 1•7 years ago
|
||
Reporter | ||
Comment 2•7 years ago
|
||
Reporter | ||
Comment 3•7 years ago
|
||
Reporter | ||
Comment 4•7 years ago
|
||
Assignee | ||
Comment 5•7 years ago
|
||
The 503's are H10's:
Feb 08 01:48:55 taskcluster-aws-provisioner2 heroku/router: at=error code=H10 desc="App crashed" method=GET path="/v1/secret/RP9XsWgITi-QjH2K-lVf7Q" host=aws-provisioner.taskcluster.net request_id=0354cbff-19c5-4ccf-920a-a551306313d1 fwd="54.201.194.119" dyno= connect= service= status=503 bytes= protocol=https
The web process crashed, while the provisioner kept running. Scrolling back to the beginning of the mayhem:
Feb 08 01:42:03 taskcluster-aws-provisioner2 app/web.1: Thu, 08 Feb 2018 01:42:03 GMT base:exchanges Channel closed unexpectedly
Feb 08 01:42:03 taskcluster-aws-provisioner2 app/web.1: Thu, 08 Feb 2018 01:42:03 GMT base:exchanges Connection error in Publisher: Error: channel closed unexpectedly
Feb 08 01:42:03 taskcluster-aws-provisioner2 app/web.1: at ConfirmChannel.<anonymous> (/app/node_modules/pulse-publisher/src/exchanges.js:51:24)
Feb 08 01:42:03 taskcluster-aws-provisioner2 app/web.1: at emitNone (events.js:86:13)
Feb 08 01:42:03 taskcluster-aws-provisioner2 app/web.1: at ConfirmChannel.emit (events.js:188:7)
Feb 08 01:42:03 taskcluster-aws-provisioner2 app/web.1: at ConfirmChannel.C.toClosed (/app/node_modules/amqplib/lib/channel.js:164:8)
Feb 08 01:42:03 taskcluster-aws-provisioner2 app/web.1: at Connection.C._closeChannels (/app/node_modules/amqplib/lib/connection.js:392:18)
Feb 08 01:42:03 taskcluster-aws-provisioner2 app/web.1: at Connection.C.toClosed (/app/node_modules/amqplib/lib/connection.js:399:8)
Feb 08 01:42:03 taskcluster-aws-provisioner2 app/web.1: at Object.accept (/app/node_modules/amqplib/lib/connection.js:95:18)
Feb 08 01:42:03 taskcluster-aws-provisioner2 app/web.1: at Connection.mainAccept [as accept] (/app/node_modules/amqplib/lib/connection.js:63:33)
Feb 08 01:42:03 taskcluster-aws-provisioner2 app/web.1: at TLSSocket.go (/app/node_modules/amqplib/lib/connection.js:476:48)
Feb 08 01:42:03 taskcluster-aws-provisioner2 app/web.1: at emitNone (events.js:86:13)
Feb 08 01:42:03 taskcluster-aws-provisioner2 app/web.1: at TLSSocket.emit (events.js:188:7)
Feb 08 01:42:03 taskcluster-aws-provisioner2 app/web.1: at emitReadable_ (_stream_readable.js:434:10)
Feb 08 01:42:03 taskcluster-aws-provisioner2 app/web.1: at emitReadable (_stream_readable.js:428:7)
Feb 08 01:42:03 taskcluster-aws-provisioner2 app/web.1: at readableAddChunk (_stream_readable.js:189:13)
Feb 08 01:42:03 taskcluster-aws-provisioner2 app/web.1: at TLSSocket.Readable.push (_stream_readable.js:136:10)
Feb 08 01:42:03 taskcluster-aws-provisioner2 app/web.1: at TLSWrap.onread (net.js:561:20)
Feb 08 01:42:03 taskcluster-aws-provisioner2 app/web.1: Uncaught Exception! Attempting to report to Sentry and crash.
Feb 08 01:42:03 taskcluster-aws-provisioner2 app/web.1: Error: channel closed unexpectedly
Feb 08 01:42:03 taskcluster-aws-provisioner2 app/web.1: at ConfirmChannel.<anonymous> (/app/node_modules/pulse-publisher/src/exchanges.js:51:24)
Feb 08 01:42:03 taskcluster-aws-provisioner2 app/web.1: at emitNone (events.js:86:13)
Feb 08 01:42:03 taskcluster-aws-provisioner2 app/web.1: at ConfirmChannel.emit (events.js:188:7)
Feb 08 01:42:03 taskcluster-aws-provisioner2 app/web.1: at ConfirmChannel.C.toClosed (/app/node_modules/amqplib/lib/channel.js:164:8)
Feb 08 01:42:03 taskcluster-aws-provisioner2 app/web.1: at Connection.C._closeChannels (/app/node_modules/amqplib/lib/connection.js:392:18)
Feb 08 01:42:03 taskcluster-aws-provisioner2 app/web.1: at Connection.C.toClosed (/app/node_modules/amqplib/lib/connection.js:399:8)
Feb 08 01:42:03 taskcluster-aws-provisioner2 app/web.1: at Object.accept (/app/node_modules/amqplib/lib/connection.js:95:18)
Feb 08 01:42:03 taskcluster-aws-provisioner2 app/web.1: at Connection.mainAccept [as accept] (/app/node_modules/amqplib/lib/connection.js:63:33)
Feb 08 01:42:03 taskcluster-aws-provisioner2 app/web.1: at TLSSocket.go (/app/node_modules/amqplib/lib/connection.js:476:48)
Feb 08 01:42:03 taskcluster-aws-provisioner2 app/web.1: at emitNone (events.js:86:13)
Feb 08 01:42:03 taskcluster-aws-provisioner2 app/web.1: at TLSSocket.emit (events.js:188:7)
Feb 08 01:42:03 taskcluster-aws-provisioner2 app/web.1: at emitReadable_ (_stream_readable.js:434:10)
Feb 08 01:42:03 taskcluster-aws-provisioner2 app/web.1: at emitReadable (_stream_readable.js:428:7)
Feb 08 01:42:03 taskcluster-aws-provisioner2 app/web.1: at readableAddChunk (_stream_readable.js:189:13)
Feb 08 01:42:03 taskcluster-aws-provisioner2 app/web.1: at TLSSocket.Readable.push (_stream_readable.js:136:10)
Feb 08 01:42:03 taskcluster-aws-provisioner2 app/web.1: at TLSWrap.onread (net.js:561:20)
Feb 08 01:42:04 taskcluster-aws-provisioner2 heroku/router: at=info method=GET path="/v1/secret/N1uECmfZQqe84PozqhUFsg" host=aws-provisioner.taskcluster.net request_id=af39e89f-3d03-4544-a716-47f9e54907ad fwd="34.217.114.72" dyno=web.1 connect=3ms service=122ms status=404 bytes=401 protocol=https
Feb 08 01:42:04 taskcluster-aws-provisioner2 heroku/router: at=info method=DELETE path="/v1/secret/N1uECmfZQqe84PozqhUFsg" host=aws-provisioner.taskcluster.net request_id=da1ece69-54c2-4914-9e7f-f3524764303f fwd="34.217.114.72" dyno=web.1 connect=1ms service=81ms status=204 bytes=383 protocol=https
Feb 08 01:42:04 taskcluster-aws-provisioner2 heroku/router: at=info method=GET path="/v1/secret/zVqO74CHSqimshSCn7Ke8A" host=aws-provisioner.taskcluster.net request_id=6274de10-9f4c-4f2f-8302-314704a4df29 fwd="18.144.50.182" dyno=web.1 connect=2ms service=164ms status=404 bytes=401 protocol=https
Feb 08 01:42:04 taskcluster-aws-provisioner2 heroku/router: at=info method=DELETE path="/v1/secret/zVqO74CHSqimshSCn7Ke8A" host=aws-provisioner.taskcluster.net request_id=757c04bb-ba3d-4f6d-8b41-cef12423126d fwd="18.144.50.182" dyno=web.1 connect=5ms service=85ms status=204 bytes=383 protocol=https
Feb 08 01:42:04 taskcluster-aws-provisioner2 heroku/router: at=info method=GET path="/v1/worker-type/gecko-t-win7-32" host=aws-provisioner.taskcluster.net request_id=28e66513-0a0c-477b-affd-c1547c8ff1a7 fwd="18.144.50.182" dyno=web.1 connect=3ms service=195ms status=200 bytes=11445 protocol=https
Feb 08 01:42:04 taskcluster-aws-provisioner2 app/web.1: Authorized project/taskcluster/aws-provisioner/production for GET access to /v1/worker-type/gecko-t-win7-32
Feb 08 01:42:04 taskcluster-aws-provisioner2 app/web.1: Succesfully reported error to Sentry.
Feb 08 01:42:05 taskcluster-aws-provisioner2 heroku/web.1: Process exited with status 0
Feb 08 01:42:05 taskcluster-aws-provisioner2 heroku/web.1: State changed from up to crashed
So it looks like this is another issue with the pulse listener's tendency to crash the process when things go squirrely.
Component: AWS-Provisioner → Platform Libraries
Comment 6•7 years ago
|
||
So is this actually an issue with the provisioner? If not, we should change the summary.
If we've implicated pulse, that raises a couple of issues:
1) Should we focus some stability effort here?
2) Should we be running our own pulse service rather than piggybacking on the Mozilla generic one?
My thoughts:
For #1, yes, I think we should be making these pulse interactions more robust.
For #2, yes, but not right away. We'll need to have this configurable for redeployability, but whether we actually need a separate service remains to be seen. I do like the idea of having more control over both sides of the service though.
Assignee | ||
Comment 7•7 years ago
|
||
Worth noting, too, that the web service is listening to pulse.
Summary: What happened to provisioner today → Provisioner crashes when pulse connection terminates
Assignee | ||
Comment 8•7 years ago
|
||
Haha, we midair'd but had the same thoughts. Yes, the pulse libraries should be better.
#2. No, CloudAMQP is quite well-run. It basically never goes down, which is why we get surprised by bugs in our stuff when it does! Anyway, we would probably just get our own CloudAMQP account, so no change in reliability. :garndt used to be a co-owner of the Pulse service, and :mcote would probably appreciate sharing the load with you. We are the majority producer of pulse messages by several orders of magnitude.
Aside from being well-run, there are big advantages to sharing the message bus with the rest of the organization. There are a surprising number of integrations built on pulse, and many of them consume both Taskcluster messages and some other service's messages.
You're correct that we need to be flexible for redeployability, and maybe even run our own AMQP account for staging just to demonstrate the possibility. But that's sort of a side issue.
Assignee | ||
Comment 9•7 years ago
|
||
I can work on this. I think there are some higher-level bugs which I will dupe this one into.
Assignee: nobody → dustin
Updated•7 years ago
|
Blocks: tc-stability
Assignee | ||
Updated•7 years ago
|
No longer blocks: tc-stability
Assignee | ||
Comment 10•7 years ago
|
||
> Worth noting, too, that the web service is listening to pulse.
It's sending to pulse -- that's pretty normal :)
Provisioner uses a pretty ancient version of pulse-publisher. In particular,
https://github.com/taskcluster/pulse-publisher/pull/21/files
would have helped here. I'll push an upgrade.
Assignee | ||
Comment 11•7 years ago
|
||
Assignee | ||
Comment 12•7 years ago
|
||
John, this has been waiting for review for a while..
Flags: needinfo?(jhford)
Assignee | ||
Updated•7 years ago
|
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Updated•6 years ago
|
Component: Platform Libraries → Services
You need to log in
before you can comment on or make changes to this bug.
Description
•