Open Bug 1903235 Opened 4 months ago Updated 3 months ago

Investigate isolation of taskcluster exchanges/queues from the rest by vhost

Categories

(Webtools :: Pulse, task)

Tracking

(Not tracked)

People

(Reporter: yarik, Unassigned)

References

(Blocks 1 open bug)

Details

Because anyone can create a Pulse account and queues that listen to all messages, core publishers risk being affected by a large number of uncontrolled queues.

When Taskcluster publishes a message to one of its exchanges, RabbitMQ (RMQ) has to route that message to every queue with a matching binding. Only after the message has been delivered to those queues does RMQ send a confirmation to the publisher.
If one of the nodes in the cluster is under load, or some queue cannot accept a new incoming message, the whole process is delayed, and the producer might fail with a deadline-exceeded timeout (12s).
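
The effect described above can be illustrated with a toy model. This is only a sketch of the reasoning, not a measurement of RMQ: the queue names and per-queue latencies are made up, and the model simply assumes the confirm is as slow as the slowest bound queue. Only the 12s deadline comes from this report.

```python
from dataclasses import dataclass

PUBLISH_DEADLINE_S = 12.0  # producer deadline quoted in this bug


@dataclass
class BoundQueue:
    name: str                 # hypothetical queue name
    accept_latency_s: float   # hypothetical time for the queue to accept a message


def confirm_latency(queues):
    """Model: the confirm is sent only after every bound queue has the message,
    so the slowest queue determines how long the publisher waits."""
    return max((q.accept_latency_s for q in queues), default=0.0)


def publish_succeeds(queues, deadline_s=PUBLISH_DEADLINE_S):
    """True if the modelled confirm arrives within the producer deadline."""
    return confirm_latency(queues) <= deadline_s


# Made-up numbers: two healthy core queues and one stalled external queue.
core = [BoundQueue("core-a", 0.01), BoundQueue("core-b", 0.02)]
external = [BoundQueue("external-stalled", 15.0)]

print(publish_succeeds(core))             # core queues only
print(publish_succeeds(core + external))  # one stalled external queue added
```

In this model a single stalled external queue is enough to push the publisher past its deadline, which is the risk the vhost separation is meant to remove.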

To minimize the risk of publishers waiting for messages to propagate to external (non-core) queues, we can experiment with separating core and non-core queues by vhost.

Idea to test:

  1. FxCI gets a dedicated fxci vhost in which only it is allowed to publish and create queues.
  2. The federation plugin is set up to forward all messages from the fxci vhost to the existing / vhost (or a new one). Federation should be asynchronous, so it wouldn't block the publisher.
  3. All external integrations and queues listen on the mirrored queues.
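
A rough sketch of what step 2 might look like, assuming exchange federation configured on the downstream (/) vhost with `rabbitmqctl set_parameter` and `rabbitmqctl set_policy`. The host name, the upstream and policy names, and the exchange pattern are placeholders, and credentials are left out. The script only builds and prints the commands; it does not talk to a broker.

```python
import json
import shlex
from urllib.parse import quote


def upstream_uri(host, vhost):
    """AMQP URI for an upstream vhost; the vhost is percent-encoded in the path."""
    return f"amqp://{host}/{quote(vhost, safe='')}"


def federation_commands(host, core_vhost, public_vhost, exchange_pattern):
    """Build rabbitmqctl commands that federate matching exchanges from
    core_vhost (upstream) into public_vhost (downstream)."""
    upstream = {
        "uri": upstream_uri(host, core_vhost),
        "ack-mode": "on-confirm",  # federation link acks upstream after downstream confirms
    }
    policy = {"federation-upstream-set": "all"}
    return [
        # "fxci-upstream" is a placeholder upstream name
        ["rabbitmqctl", "set_parameter", "-p", public_vhost,
         "federation-upstream", "fxci-upstream", json.dumps(upstream)],
        # "federate-fxci" is a placeholder policy name
        ["rabbitmqctl", "set_policy", "-p", public_vhost,
         "--apply-to", "exchanges", "federate-fxci",
         exchange_pattern, json.dumps(policy)],
    ]


# Placeholder host and exchange pattern, for illustration only.
commands = federation_commands("rmq.example", "fxci", "/", "^exchange/taskcluster-")
for cmd in commands:
    print(shlex.join(cmd))
```

Whether the federation link really decouples publisher confirms from slow external queues is the thing to validate; the sketch only shows where the link would be declared.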

Potential benefits here (needs validation and testing):

  1. Publishing to a dedicated vhost should not depend on external queues, since federation would be asynchronous.
  2. Clean separation of core and non-core queues.

However, external integrations might still cause CPU spikes, so it's important to test whether having a much lower number of direct consumer queues actually keeps publishing fast.
In the case of the exclusive pernosco queue, which had a high number of unacked messages, that backlog could have delayed the publishing of new messages.

Blocks: 1875132