Closed Bug 1538961 Opened 6 years ago Closed 6 years ago

Use of subscriptions in web-server during development can leak queues and kill pulse

Categories

(Taskcluster :: Services, defect)


Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: dustin, Assigned: dustin)

References

Details

Attachments

(1 file)

A development rabbitmq user (taskcluster-user-haali) was configured into tc-web-server, and GraphQL subscriptions made to subscribe to "active" taskgroups.

The result was 3000+ queues like the one pictured in the attachment, many with about 1000 messages in them. This killed the RabbitMQ cluster, and it is likely the same thing that happened last week.

Note that the queues are not exclusive -- I'm not sure why that's the case.

The 3000+ queues is more than could be explained by individual control-R reloads of the UI, but hot module replacement (HMR) might have produced that many reloads.

So:

  • Why are the queues not exclusive? Git blames me, but I asked myself and I don't remember. (A sketch of what an exclusive declaration would look like follows this list.)
  • Why are the queues not being deleted directly by tc-web-server when the subscription completes?
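
For comparison, an exclusive queue declaration with amqplib looks roughly like the following. This is a minimal sketch for illustration only, not the actual tc-web-server code; the function and variable names are made up.

// Illustrative sketch only -- not the actual tc-web-server code.
const amqp = require('amqplib');

async function subscribeExclusive(amqpUrl, exchange, pattern, onMessage) {
  const conn = await amqp.connect(amqpUrl);
  const channel = await conn.createChannel();
  // exclusive: true ties the queue to this connection; RabbitMQ deletes it
  // automatically when the connection closes, so nothing is left behind.
  const { queue } = await channel.assertQueue('', { exclusive: true });
  await channel.bindQueue(queue, exchange, pattern);
  await channel.consume(queue, onMessage, { noAck: false });
  return { conn, channel, queue };
}

With exclusive: false, as the queues are apparently declared today, they survive the connection and have to be deleted explicitly.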

> Why are the queues not being deleted directly by tc-web-server when the subscription completes?

I wonder if bringing back the unsubscribeAll[1] method would fix this issue. Dustin, do you recall why we removed it?

[1] https://github.com/taskcluster/taskcluster-web-server/commit/db7f05b7cbae4a04853d89501f38494194bff293#diff-e15d433f1f3a83432bd4f9507c7125a9L83

Flags: needinfo?(dustin)

That diff is a bit misleading -- the file was basically rewritten from scratch, and the old version did something strange with multiple channels listening to the same queue, or something like that. Now the reconciliation process
https://github.com/taskcluster/taskcluster/blob/master/services/web-server/src/PulseEngine/index.js#L38
handles subscribing to or unsubscribing from individual queues as necessary. But I was never clear on when the GraphQL libraries call unsubscribe -- I assume that if an HTTP connection drops, any associated subscriptions are automatically unsubscribed?
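
For what it's worth, the shape of that reconciliation is roughly the following -- a simplified sketch of the idea only, not the real PulseEngine code, with made-up argument shapes:

// Simplified sketch of per-queue reconciliation; the real PulseEngine code differs.
// `desired` and `current` are Maps of queueName -> {exchange, pattern}.
async function reconcile(channel, desired, current) {
  for (const [queueName, { exchange, pattern }] of desired) {
    if (!current.has(queueName)) {
      await channel.assertQueue(queueName, { exclusive: false });
      await channel.bindQueue(queueName, exchange, pattern);
    }
  }
  for (const queueName of current.keys()) {
    if (!desired.has(queueName)) {
      await channel.deleteQueue(queueName);
    }
  }
}

The open question is whether the "desired" side ever shrinks if the GraphQL layer never calls unsubscribe after a dropped connection.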

Since tc-web-server doesn't run for me (bug 1538996), and since you can reproduce the issue, can you take a look at this? Note that tc-web-server logs when it deletes a queue.

I set up a rabbitmq user in our staging cluster:
https://hip-macaw.rmq.cloudamqp.com/#/users/bug1538961

It's probably a good idea to experiment with that cluster (you can reset the password).

pulse:
  namespace: bug1538961
  username: bug1538961
  password: ..
  hostname: hip-macaw.rmq.cloudamqp.com
  vhost: /

admin user (don't use this one!) for that cluster has its username in
https://github.com/taskcluster/taskcluster-mozilla-terraform/blob/master/deployments/hassan-dev/main.sh
and its password in passwordstore.

Assignee: dustin → helfi92
Flags: needinfo?(dustin)

OK, having fixed bug 1538996 I can take a look at this. I can reproduce the issue, too. I also see that messages remain un-acked.

OK, it seems that the queues remain around when an error occurs, at least. Making them exclusive should help.

The downside is that amqplib kills the connection whenever anything goes wrong, where "anything goes wrong" includes checkExchange(..) failing because the exchange doesn't exist. That means that when any user tries to listen to an exchange that doesn't exist, all users' subscriptions are aborted and their queues destroyed; even if they reconnect, they may have missed messages. If amqplib could determine that some operations only poison the channel, not the connection, we might be OK. I think the options are:

  • use a distinct connection (PulseClient) for each subscription
  • patch, fork, or replace amqplib with something that is smarter (I'll file an issue..)

Likely the first of those is the best short-term option. The risk is that it will create a lot of consumers on the RabbitMQ server.
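
A rough sketch of that first option, using a dedicated amqplib connection per GraphQL subscription so that a failure only affects the one subscriber. This is illustrative only; the real service would go through its PulseClient wrapper rather than raw amqplib, and the names here are made up.

// Rough sketch of option 1: one AMQP connection per subscription, so an error
// on one subscription cannot take down everyone else's.
const amqp = require('amqplib');

async function createSubscription(amqpUrl, exchange, pattern, onMessage) {
  const conn = await amqp.connect(amqpUrl);
  try {
    const channel = await conn.createChannel();
    const { queue } = await channel.assertQueue('', { exclusive: true });
    await channel.bindQueue(queue, exchange, pattern);
    await channel.consume(queue, onMessage);
    // closing the connection also tears down the exclusive queue
    return { close: () => conn.close() };
  } catch (err) {
    await conn.close();
    throw err;
  }
}

The cost, as noted above, is one connection (and one consumer) per active subscription on the RabbitMQ side.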

Assignee: helfi92 → dustin

It looks like https://github.com/squaremo/amqp.node/issues is largely unmaintained, so getting a patched version that supports keeping connections alive is unlikely. But I filed an issue anyway: https://github.com/squaremo/amqp.node/issues/501.

Owlish suggested using queue TTLs, too. I think those might be a useful way to simulate exclusive queues without deleting those queues on accidental disconnect.

> TTL can also be set on queues, not just queue contents. Queues will expire after a period of time only when they are not used (e.g. do not have consumers). This feature can be used together with the auto-delete queue property.

So a 30-second TTL, for example, would allow us to reconnect and consume from the same queue, but if for some reason that reconnection didn't occur, the queue would be dropped before it could accumulate too many messages.
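
With amqplib (assuming `channel` is an open channel), such a queue TTL can be set at declaration time via the expires option, which maps to RabbitMQ's x-expires queue argument. A sketch, with an illustrative queue name:

// Illustrative: a non-exclusive queue that RabbitMQ deletes on its own after
// 30 seconds without use, so leaked queues eventually clean themselves up.
const { queue } = await channel.assertQueue('queue/bug1538961/example', {
  exclusive: false,   // per the plan below, keep the queues non-exclusive
  expires: 30000,     // x-expires: drop the queue after 30s of disuse
});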

So, to-do list:

  • ack messages once they are sent to the browser
  • use one connection per subscription
  • set queue TTLs but keep using exclusive: false

Test query (since the playground keeps losing history..)

# Write your query or mutation here
subscription Sample {
  pulseMessages(subscriptions: [{exchange: "exchange/bug1538961/test", pattern: "#"}]) {
    payload
  }
}

work-in-progress: https://github.com/taskcluster/taskcluster/compare/master...djmitche:bug1538961

It turns out that, yes, if you attach an error event handler to a channel, the error is not bubbled up to the connection. Error handling in amqplib is really unusual, but that at least seems like a sensible feature. Too bad it's not documented.
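
A minimal demonstration of that behavior, assuming nothing beyond stock amqplib (the URL and exchange name are placeholders):

const amqp = require('amqplib');

async function demoChannelError(amqpUrl) {
  const conn = await amqp.connect(amqpUrl);
  const channel = await conn.createChannel();

  // With an 'error' listener attached, a channel-level failure is reported here
  // instead of being re-emitted on (and killing) the connection.
  channel.on('error', err => {
    console.log('channel error, connection still alive:', err.message);
  });

  try {
    // checkExchange on a nonexistent exchange is a channel-level error: the
    // promise rejects and this channel closes, but `conn` and its other
    // channels stay up.
    await channel.checkExchange('exchange/bug1538961/does-not-exist');
  } catch (err) {
    // report back to just the one subscriber that asked for this exchange
  }

  return conn;
}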

So what's implemented so far:

  • use a channel per subscription
  • provide a way to feed errors back to the browser
  • feed errors from binding, or anything else, back to the browser, without killing the connection

Still to do:

  • ack messages once they are sent to the browser (see the sketch after this list)
  • set queue TTLs so that any "escaped" queues will eventually go away
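
For the ack item, the intent is roughly the following -- a sketch only, where sendToBrowser stands in for whatever actually pushes the message over the GraphQL subscription:

// Sketch: ack only after the message has been handed to the browser, so that
// unsent messages are redelivered instead of piling up unacked on the queue.
async function consumeAndForward(channel, queue, sendToBrowser) {
  await channel.consume(queue, async message => {
    if (message === null) {
      return; // the consumer was cancelled by the server
    }
    try {
      await sendToBrowser(JSON.parse(message.content.toString()));
      channel.ack(message);
    } catch (err) {
      // requeue on failure rather than leaving the message unacked forever
      channel.nack(message);
    }
  }, { noAck: false });
}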

I played with the pulse inspector a bit, and this seems to be behaving itself. As a bonus, you now get a nice error message if you enter an exchange that doesn't exist.

Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
Blocks: 1542805