Bug 1436620 (Closed) - Opened 7 years ago, Closed 7 years ago

Provisioner crashes when pulse connection terminates

Categories: Taskcluster :: Services (defect)
Priority: Not set
Severity: normal

Tracking: (Not tracked)
Status: RESOLVED FIXED
People

(Reporter: bstack, Assigned: dustin)

References

Details

Attachments

(4 files)

Today at 17:49 PST a sheriff reported trouble using the provisioner dashboard on the tools site: it was showing a "network error". On inspection it looked like the whole service had crashed and was returning 5xx for every request. I investigated for a couple of minutes but ended up just restarting it, so I'm not sure exactly what happened. I don't expect we'll magically figure this out, but I wanted to open a tracking bug. The same sort of thing happened late in the day on Friday as well.
The 503's are H10's:

Feb 08 01:48:55 taskcluster-aws-provisioner2 heroku/router: at=error code=H10 desc="App crashed" method=GET path="/v1/secret/RP9XsWgITi-QjH2K-lVf7Q" host=aws-provisioner.taskcluster.net request_id=0354cbff-19c5-4ccf-920a-a551306313d1 fwd="54.201.194.119" dyno= connect= service= status=503 bytes= protocol=https

The web process crashed, while the provisioner kept running. Scrolling back to the beginning of the mayhem:

Feb 08 01:42:03 taskcluster-aws-provisioner2 app/web.1: Thu, 08 Feb 2018 01:42:03 GMT base:exchanges Channel closed unexpectedly
Feb 08 01:42:03 taskcluster-aws-provisioner2 app/web.1: Thu, 08 Feb 2018 01:42:03 GMT base:exchanges Connection error in Publisher: Error: channel closed unexpectedly
Feb 08 01:42:03 taskcluster-aws-provisioner2 app/web.1:     at ConfirmChannel.<anonymous> (/app/node_modules/pulse-publisher/src/exchanges.js:51:24)
Feb 08 01:42:03 taskcluster-aws-provisioner2 app/web.1:     at emitNone (events.js:86:13)
Feb 08 01:42:03 taskcluster-aws-provisioner2 app/web.1:     at ConfirmChannel.emit (events.js:188:7)
Feb 08 01:42:03 taskcluster-aws-provisioner2 app/web.1:     at ConfirmChannel.C.toClosed (/app/node_modules/amqplib/lib/channel.js:164:8)
Feb 08 01:42:03 taskcluster-aws-provisioner2 app/web.1:     at Connection.C._closeChannels (/app/node_modules/amqplib/lib/connection.js:392:18)
Feb 08 01:42:03 taskcluster-aws-provisioner2 app/web.1:     at Connection.C.toClosed (/app/node_modules/amqplib/lib/connection.js:399:8)
Feb 08 01:42:03 taskcluster-aws-provisioner2 app/web.1:     at Object.accept (/app/node_modules/amqplib/lib/connection.js:95:18)
Feb 08 01:42:03 taskcluster-aws-provisioner2 app/web.1:     at Connection.mainAccept [as accept] (/app/node_modules/amqplib/lib/connection.js:63:33)
Feb 08 01:42:03 taskcluster-aws-provisioner2 app/web.1:     at TLSSocket.go (/app/node_modules/amqplib/lib/connection.js:476:48)
Feb 08 01:42:03 taskcluster-aws-provisioner2 app/web.1:     at emitNone (events.js:86:13)
Feb 08 01:42:03 taskcluster-aws-provisioner2 app/web.1:     at TLSSocket.emit (events.js:188:7)
Feb 08 01:42:03 taskcluster-aws-provisioner2 app/web.1:     at emitReadable_ (_stream_readable.js:434:10)
Feb 08 01:42:03 taskcluster-aws-provisioner2 app/web.1:     at emitReadable (_stream_readable.js:428:7)
Feb 08 01:42:03 taskcluster-aws-provisioner2 app/web.1:     at readableAddChunk (_stream_readable.js:189:13)
Feb 08 01:42:03 taskcluster-aws-provisioner2 app/web.1:     at TLSSocket.Readable.push (_stream_readable.js:136:10)
Feb 08 01:42:03 taskcluster-aws-provisioner2 app/web.1:     at TLSWrap.onread (net.js:561:20)
Feb 08 01:42:03 taskcluster-aws-provisioner2 app/web.1: Uncaught Exception! Attempting to report to Sentry and crash.
Feb 08 01:42:03 taskcluster-aws-provisioner2 app/web.1: Error: channel closed unexpectedly
Feb 08 01:42:03 taskcluster-aws-provisioner2 app/web.1:     at ConfirmChannel.<anonymous> (/app/node_modules/pulse-publisher/src/exchanges.js:51:24)
Feb 08 01:42:03 taskcluster-aws-provisioner2 app/web.1:     at emitNone (events.js:86:13)
Feb 08 01:42:03 taskcluster-aws-provisioner2 app/web.1:     at ConfirmChannel.emit (events.js:188:7)
Feb 08 01:42:03 taskcluster-aws-provisioner2 app/web.1:     at ConfirmChannel.C.toClosed (/app/node_modules/amqplib/lib/channel.js:164:8)
Feb 08 01:42:03 taskcluster-aws-provisioner2 app/web.1:     at Connection.C._closeChannels (/app/node_modules/amqplib/lib/connection.js:392:18)
Feb 08 01:42:03 taskcluster-aws-provisioner2 app/web.1:     at Connection.C.toClosed (/app/node_modules/amqplib/lib/connection.js:399:8)
Feb 08 01:42:03 taskcluster-aws-provisioner2 app/web.1:     at Object.accept (/app/node_modules/amqplib/lib/connection.js:95:18)
Feb 08 01:42:03 taskcluster-aws-provisioner2 app/web.1:     at Connection.mainAccept [as accept] (/app/node_modules/amqplib/lib/connection.js:63:33)
Feb 08 01:42:03 taskcluster-aws-provisioner2 app/web.1:     at TLSSocket.go (/app/node_modules/amqplib/lib/connection.js:476:48)
Feb 08 01:42:03 taskcluster-aws-provisioner2 app/web.1:     at emitNone (events.js:86:13)
Feb 08 01:42:03 taskcluster-aws-provisioner2 app/web.1:     at TLSSocket.emit (events.js:188:7)
Feb 08 01:42:03 taskcluster-aws-provisioner2 app/web.1:     at emitReadable_ (_stream_readable.js:434:10)
Feb 08 01:42:03 taskcluster-aws-provisioner2 app/web.1:     at emitReadable (_stream_readable.js:428:7)
Feb 08 01:42:03 taskcluster-aws-provisioner2 app/web.1:     at readableAddChunk (_stream_readable.js:189:13)
Feb 08 01:42:03 taskcluster-aws-provisioner2 app/web.1:     at TLSSocket.Readable.push (_stream_readable.js:136:10)
Feb 08 01:42:03 taskcluster-aws-provisioner2 app/web.1:     at TLSWrap.onread (net.js:561:20)
Feb 08 01:42:04 taskcluster-aws-provisioner2 heroku/router: at=info method=GET path="/v1/secret/N1uECmfZQqe84PozqhUFsg" host=aws-provisioner.taskcluster.net request_id=af39e89f-3d03-4544-a716-47f9e54907ad fwd="34.217.114.72" dyno=web.1 connect=3ms service=122ms status=404 bytes=401 protocol=https
Feb 08 01:42:04 taskcluster-aws-provisioner2 heroku/router: at=info method=DELETE path="/v1/secret/N1uECmfZQqe84PozqhUFsg" host=aws-provisioner.taskcluster.net request_id=da1ece69-54c2-4914-9e7f-f3524764303f fwd="34.217.114.72" dyno=web.1 connect=1ms service=81ms status=204 bytes=383 protocol=https
Feb 08 01:42:04 taskcluster-aws-provisioner2 heroku/router: at=info method=GET path="/v1/secret/zVqO74CHSqimshSCn7Ke8A" host=aws-provisioner.taskcluster.net request_id=6274de10-9f4c-4f2f-8302-314704a4df29 fwd="18.144.50.182" dyno=web.1 connect=2ms service=164ms status=404 bytes=401 protocol=https
Feb 08 01:42:04 taskcluster-aws-provisioner2 heroku/router: at=info method=DELETE path="/v1/secret/zVqO74CHSqimshSCn7Ke8A" host=aws-provisioner.taskcluster.net request_id=757c04bb-ba3d-4f6d-8b41-cef12423126d fwd="18.144.50.182" dyno=web.1 connect=5ms service=85ms status=204 bytes=383 protocol=https
Feb 08 01:42:04 taskcluster-aws-provisioner2 heroku/router: at=info method=GET path="/v1/worker-type/gecko-t-win7-32" host=aws-provisioner.taskcluster.net request_id=28e66513-0a0c-477b-affd-c1547c8ff1a7 fwd="18.144.50.182" dyno=web.1 connect=3ms service=195ms status=200 bytes=11445 protocol=https
Feb 08 01:42:04 taskcluster-aws-provisioner2 app/web.1: Authorized project/taskcluster/aws-provisioner/production for GET access to /v1/worker-type/gecko-t-win7-32
Feb 08 01:42:04 taskcluster-aws-provisioner2 app/web.1: Succesfully reported error to Sentry.
Feb 08 01:42:05 taskcluster-aws-provisioner2 heroku/web.1: Process exited with status 0
Feb 08 01:42:05 taskcluster-aws-provisioner2 heroku/web.1: State changed from up to crashed

So it looks like this is another issue with the pulse libraries' tendency to crash the process when things go squirrely.
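Roughly what happened, as far as I can tell from the trace (this is a sketch, not the provisioner's or pulse-publisher's actual code): the broker closed the channel, pulse-publisher's handler turned that into an Error, nothing caught it, and it reached Node's uncaughtException path, where our standard handler reports to Sentry and exits. A minimal illustration of how a dropped AMQP channel can escalate to a process exit when nothing recovers it; the broker URL and handler bodies below are placeholders:

// Illustrative only: amqplib connections and channels are EventEmitters,
// and an 'error' event with no recovery path ends up crashing the process.
const amqp = require('amqplib');

async function connectPublisher() {
  // Hypothetical broker URL, for illustration.
  const conn = await amqp.connect('amqps://user:pass@pulse.example.com');
  const chan = await conn.createConfirmChannel();

  // If these events are turned into thrown/unhandled errors instead of a
  // reconnect, the process dies exactly the way the log above shows.
  conn.on('error', err => console.error('connection error:', err.message));
  conn.on('close', () => console.warn('connection closed; reconnect needed'));
  chan.on('error', err => console.error('channel error:', err.message));
  chan.on('close', () => console.warn('channel closed unexpectedly'));

  return { conn, chan };
}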
Component: AWS-Provisioner → Platform Libraries
So is this actually an issue with the provisioner? If not, we should change the summary.

If we've implicated pulse, that raises a couple of issues:
1) Should we focus some stability effort here?
2) Should we be running our own pulse service rather than piggybacking on the Mozilla generic one?

My thoughts: For #1, yes, I think we should be making these pulse interactions more robust. For #2, yes, but not right away. We'll need to have this configurable for redeployability, but whether we actually need a separate service remains to be seen. I do like the idea of having more control over both sides of the service, though.
Worth noting, too, that the web service is listening to pulse.
Summary: What happened to provisioner today → Provisioner crashes when pulse connection terminates
Haha, we midair'd but had the same thoughts.

For #1: yes, the pulse libraries should be better.

For #2: no, CloudAMQP is quite well-run. It basically never goes down, which is why we get surprised by bugs in our stuff when it does! Anyway, we would probably just get our own CloudAMQP account, so there would be no change in reliability. :garndt used to be a co-owner of the Pulse service, and :mcote would probably appreciate sharing the load with you -- we are the majority producer of pulse messages by several orders of magnitude. Aside from being well-run, there are big advantages to sharing the message bus with the rest of the organization: there are a surprising number of integrations built on pulse, and many of them consume both Taskcluster messages and some other service's messages. You're correct that we need to be flexible for redeployability, and maybe even run our own AMQP account for staging just to demonstrate the possibility. But that's sort of a side issue.
I can work on this. I think there are some higher-level bugs which I will dupe this one into.
Assignee: nobody → dustin
Blocks: 1436735
No longer blocks: tc-stability
> Worth noting, too, that the web service is listening to pulse.

It's sending to pulse -- that's pretty normal :)

Provisioner uses a pretty ancient version of pulse-publisher. In particular, https://github.com/taskcluster/pulse-publisher/pull/21/files would have helped here. I'll push an upgrade.
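For reference, the general shape of that kind of fix is to catch publish/channel failures and rebuild the connection instead of letting the error escape to the uncaughtException handler. This is my own sketch under that assumption, not the actual pulse-publisher patch or API; names like publishWithRetry and makeChannel are made up:

// Sketch of a reconnect-on-failure publisher using amqplib directly.
const amqp = require('amqplib');

async function makeChannel(url) {
  const conn = await amqp.connect(url);
  const chan = await conn.createConfirmChannel();
  conn.on('error', err => console.error('connection error:', err.message));
  chan.on('error', err => console.error('channel error:', err.message));
  return { conn, chan };
}

async function publishWithRetry(state, url, exchange, routingKey, message) {
  for (let attempt = 0; attempt < 3; attempt++) {
    try {
      if (!state.chan) {
        Object.assign(state, await makeChannel(url));
      }
      state.chan.publish(exchange, routingKey,
                         Buffer.from(JSON.stringify(message)),
                         { persistent: true });
      await state.chan.waitForConfirms();
      return;
    } catch (err) {
      // Tear down the broken channel and rebuild it on the next attempt,
      // rather than letting the error crash the process.
      console.warn('publish failed, reconnecting:', err.message);
      try { if (state.conn) await state.conn.close(); } catch (e) { /* already gone */ }
      state.conn = state.chan = null;
    }
  }
  throw new Error('publish failed after retries');
}

The important part is that a failed publish discards the dead channel and reconnects, so a broker-side disconnect costs a retry rather than the whole dyno.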
John, this has been waiting for review for a while.
Flags: needinfo?(jhford)
R+'d just now, didn't see the request.
Flags: needinfo?(jhford)
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Component: Platform Libraries → Services