Closed Bug 1538961 Opened 6 years ago Closed 6 years ago

Use of subscriptions in web-server during development can leak queues and kill pulse

Categories

(Taskcluster :: Services, defect)


Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: dustin, Assigned: dustin)

References

Details

Attachments

(1 file)

A development rabbitmq user (taskcluster-user-haali) was configured into tc-web-server, and GraphQL subscriptions made to subscribe to "active" taskgroups.

The result was 3000+ queues like the one pictured in the attachment, many with about 1000 messages in them. This killed the RabbitMQ cluster, and it is likely the same thing that happened last week.

Note that the queues are not exclusive -- I'm not sure why that's the case.

The 3000+ queues is more than could be explained by individual control-R reloads of the UI, but hot module replacement (HMR) might have produced that many reloads.

So:

  • Why are the queues not exclusive? Git blames me, but I asked myself and I don't remember. (A sketch of what an exclusive declaration would look like follows this list.)
  • Why are the queues not being deleted directly by tc-web-server when the subscription completes?
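
For comparison, an exclusive queue declaration with amqplib looks roughly like the following. This is a minimal sketch for illustration only, not the actual tc-web-server code; the function and variable names are made up.

// Illustrative sketch only -- not the actual tc-web-server code.
const amqp = require('amqplib');

async function subscribeExclusive(amqpUrl, exchange, pattern, onMessage) {
  const conn = await amqp.connect(amqpUrl);
  const channel = await conn.createChannel();
  // exclusive: true ties the queue to this connection; RabbitMQ deletes it
  // automatically when the connection closes, so nothing is left behind.
  const { queue } = await channel.assertQueue('', { exclusive: true });
  await channel.bindQueue(queue, exchange, pattern);
  await channel.consume(queue, onMessage, { noAck: false });
  return { conn, channel, queue };
}

With exclusive: false, as the queues are apparently declared today, they survive the connection and have to be deleted explicitly.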

> Why are the queues not being deleted directly by tc-web-server when the subscription completes?

I wonder if bringing back the unsubscribeAll[1] method would fix this issue. Dustin, do you recall why we removed it?

[1] https://github.com/taskcluster/taskcluster-web-server/commit/db7f05b7cbae4a04853d89501f38494194bff293#diff-e15d433f1f3a83432bd4f9507c7125a9L83

Flags: needinfo?(dustin)

That diff is a bit misleading -- the file was basically rewritten from scratch, and the old version did something strange with multiple channels listening to the same queue, or something like that. Now the reconciliation process
https://github.com/taskcluster/taskcluster/blob/master/services/web-server/src/PulseEngine/index.js#L38
handles subscribing to or unsubscribing from individual queues as necessary. But I was never clear on when the GraphQL libraries call unsubscribe -- I assume that if an HTTP connection drops, any associated subscriptions are automatically unsubscribed?
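
For what it's worth, the shape of that reconciliation is roughly the following -- a simplified sketch of the idea only, not the real PulseEngine code, with made-up argument shapes:

// Simplified sketch of per-queue reconciliation; the real PulseEngine code differs.
// `desired` and `current` are Maps of queueName -> {exchange, pattern}.
async function reconcile(channel, desired, current) {
  for (const [queueName, { exchange, pattern }] of desired) {
    if (!current.has(queueName)) {
      await channel.assertQueue(queueName, { exclusive: false });
      await channel.bindQueue(queueName, exchange, pattern);
    }
  }
  for (const queueName of current.keys()) {
    if (!desired.has(queueName)) {
      await channel.deleteQueue(queueName);
    }
  }
}

The open question is whether the "desired" side ever shrinks if the GraphQL layer never calls unsubscribe after a dropped connection.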

Since tc-web-server doesn't run for me (bug 1538996), and since you can reproduce the issue, can you take a look at this? Note that tc-web-server logs when it deletes a queue.

I set up a rabbitmq user in our staging cluster:
https://hip-macaw.rmq.cloudamqp.com/#/users/bug1538961

It's probably a good idea to experiment with that cluster (you can reset the password).

pulse:
  namespace: bug1538961
  username: bug1538961
  password: ..
  hostname: hip-macaw.rmq.cloudamqp.com
  vhost: /

admin user (don't use this one!) for that cluster has its username in
https://github.com/taskcluster/taskcluster-mozilla-terraform/blob/master/deployments/hassan-dev/main.sh
and its password in passwordstore.

Assignee: dustin → helfi92
Flags: needinfo?(dustin)

OK, having fixed bug 1538996 I can take a look at this. I can reproduce the issue, too. I also see that messages remain un-acked.

OK, it seems that the queues remain around when an error occurs, at least. Making them exclusive should help.

The downside is that amqplib kills the connection whenever anything goes wrong, where "anything goes wrong" includes checkExchange(..) failing because the exchange doesn't exist. That means that when any user tries to listen to an exchange that doesn't exist, all users' subscriptions are aborted and their queues destroyed; even if they reconnect, they may have missed messages. If amqplib could determine that some operations only poison the channel, not the connection, we might be OK. I think the options are:

  • use a distinct connection (PulseClient) for each subscription
  • patch, fork, or replace amqplib with something that is smarter (I'll file an issue..)

Likely the first of those is the best short-term option. The risk is that it will create a lot of consumers on the RabbitMQ server.
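
A rough sketch of that first option, using a dedicated amqplib connection per GraphQL subscription so that a failure only affects the one subscriber. This is illustrative only; the real service would go through its PulseClient wrapper rather than raw amqplib, and the names here are made up.

// Rough sketch of option 1: one AMQP connection per subscription, so an error
// on one subscription cannot take down everyone else's.
const amqp = require('amqplib');

async function createSubscription(amqpUrl, exchange, pattern, onMessage) {
  const conn = await amqp.connect(amqpUrl);
  try {
    const channel = await conn.createChannel();
    const { queue } = await channel.assertQueue('', { exclusive: true });
    await channel.bindQueue(queue, exchange, pattern);
    await channel.consume(queue, onMessage);
    // closing the connection also tears down the exclusive queue
    return { close: () => conn.close() };
  } catch (err) {
    await conn.close();
    throw err;
  }
}

The cost, as noted above, is one connection (and one consumer) per active subscription on the RabbitMQ side.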

Assignee: helfi92 → dustin

It looks like https://github.com/squaremo/amqp.node/issues is largely unmaintained, so getting a patched version that supports keeping connections alive is unlikely. But I filed an issue anyway: https://github.com/squaremo/amqp.node/issues/501.

Owlish suggested using queue TTLs, too. I think those might be a useful way to simulate exclusive queues without deleting those queues on accidental disconnect.

> TTL can also be set on queues, not just queue contents. Queues will expire after a period of time only when they are not used (e.g. do not have consumers). This feature can be used together with the auto-delete queue property.

So a 30-second TTL, for example, would allow us to reconnect and consume from the same queue, but if for some reason that reconnection didn't occur, the queue would be dropped before it could accumulate too many messages.
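
With amqplib (assuming `channel` is an open channel), such a queue TTL can be set at declaration time via the expires option, which maps to RabbitMQ's x-expires queue argument. A sketch, with an illustrative queue name:

// Illustrative: a non-exclusive queue that RabbitMQ deletes on its own after
// 30 seconds without use, so leaked queues eventually clean themselves up.
const { queue } = await channel.assertQueue('queue/bug1538961/example', {
  exclusive: false,   // per the plan below, keep the queues non-exclusive
  expires: 30000,     // x-expires: drop the queue after 30s of disuse
});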

So, to-do list:

  • ack messages once they are sent to the browser
  • use one connection per subscription
  • set queue TTLs but keep using exclusive: false

Test query (since the playground keeps losing history..)

# Write your query or mutation here
subscription Sample {
  pulseMessages(subscriptions: [{exchange: "exchange/bug1538961/test", pattern: "#"}]) {
    payload
  }
}

work-in-progress: https://github.com/taskcluster/taskcluster/compare/master...djmitche:bug1538961

It turns out that, yes, if you attach an error event handler to a channel, the error is not bubbled up to the connection. Error handling in amqplib is really unusual, but that at least seems like a sensible feature. Too bad it's not documented.
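
A minimal demonstration of that behavior, assuming nothing beyond stock amqplib (the URL and exchange name are placeholders):

const amqp = require('amqplib');

async function demoChannelError(amqpUrl) {
  const conn = await amqp.connect(amqpUrl);
  const channel = await conn.createChannel();

  // With an 'error' listener attached, a channel-level failure is reported here
  // instead of being re-emitted on (and killing) the connection.
  channel.on('error', err => {
    console.log('channel error, connection still alive:', err.message);
  });

  try {
    // checkExchange on a nonexistent exchange is a channel-level error: the
    // promise rejects and this channel closes, but `conn` and its other
    // channels stay up.
    await channel.checkExchange('exchange/bug1538961/does-not-exist');
  } catch (err) {
    // report back to just the one subscriber that asked for this exchange
  }

  return conn;
}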

So what's implemented so far:

  • use a channel per subscription
  • provide a way to feed errors back to the browser
  • feed errors from binding, or anything else, back to the browser, without killing the connection

Still to do:

  • ack messages once they are sent to the browser (see the sketch after this list)
  • set queue TTLs so that any "escaped" queues will eventually go away
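
For the ack item, the intent is roughly the following -- a sketch only, where sendToBrowser stands in for whatever actually pushes the message over the GraphQL subscription:

// Sketch: ack only after the message has been handed to the browser, so that
// unsent messages are redelivered instead of piling up unacked on the queue.
async function consumeAndForward(channel, queue, sendToBrowser) {
  await channel.consume(queue, async message => {
    if (message === null) {
      return; // the consumer was cancelled by the server
    }
    try {
      await sendToBrowser(JSON.parse(message.content.toString()));
      channel.ack(message);
    } catch (err) {
      // requeue on failure rather than leaving the message unacked forever
      channel.nack(message);
    }
  }, { noAck: false });
}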

I played with the pulse inspector a bit, and this seems to be behaving itself. As a bonus, you now get a nice error message if you enter an exchange that doesn't exist.

Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
Blocks: 1542805