Closed Bug 1903229 Opened 20 days ago Closed 12 days ago

Apply policies to non-core queues on Pulse

Categories

(Webtools :: Pulse, task)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: yarik, Unassigned)

References

(Blocks 1 open bug)

Details

We run Firefox-CI and Community-TC on a single rabbitmq cluster (Pulse) and allow any external integration to connect to it and listen to the events.

Sometimes those queues can grow and cause periods of downtime for FXCI, as publishers fail to publish messages on time due to the increased load on the RMQ instances.

We also have PulseGuardian, which should take care of such queues, but it can be configured to ignore some of them, so it doesn't help in all cases.

Queues with a large number of "Ready" messages are usually not an issue, since RabbitMQ stores them on disk by default.
The problem starts when clients fetch many messages at once before ACKing/rejecting them, which forces RMQ to hold them in memory and consume extra CPU.

To avoid such scenarios, we can apply a few policies to exchanges and queues.

Those policies can include:

  • message-ttl to keep messages for X minutes/hours only
  • expires to delete the queue itself after some period of inactivity
  • max-length to allow only a certain number of messages in the queue (dropping the oldest ones on overflow)

Policies are attached using a regexp, so we can either apply the same policy to all queues or exclude the core ones (usually ^queue/taskcluster-.*, and a few others TBD).
Applying it to all queues probably wouldn't be an issue, because core queues always have consumers and rarely accumulate stale messages or grow.

Proposed values:

  • message-ttl - 12h
  • expires - 5d
  • max-length - 100,000
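The TTL-style arguments in RabbitMQ policy definitions are expressed in milliseconds, so as a sanity check, here is a minimal sketch converting the proposed values (the dict keys mirror the policy argument names; the variable names are my own):

```python
# Sketch: convert the proposed human-readable values into the millisecond
# units that RabbitMQ policy definitions expect.
HOUR_MS = 60 * 60 * 1000
DAY_MS = 24 * HOUR_MS

proposed_policy = {
    "message-ttl": 12 * HOUR_MS,   # 12h -> 43,200,000 ms
    "expires": 5 * DAY_MS,         # 5d  -> 432,000,000 ms
    "max-length": 100_000,         # plain message count, not ms
}

print(proposed_policy)
```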

My plan/proposal is to apply the following policy to all "non-core" queues.

Core queues are the ones essential for Taskcluster to function properly: queue/taskcluster-.*

Every other queue belongs to some other integration that would be subject to a stricter policy:

vhost: / and communitytc
queue name pattern: ^(?!queue\/taskcluster-).*

expires: 604,800,000 ms (7 days)
max-length: 100,000 (100k messages max)
message-ttl: 86,400,000 ms (24 hours)
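As a sanity check on the queue name pattern, here is a minimal sketch using Python's re module, which supports the same negative-lookahead syntax as RabbitMQ's PCRE-style pattern matching (the queue names below are illustrative, except the core prefix taken from this bug):

```python
import re

# The proposed pattern: match every queue EXCEPT core taskcluster-* queues,
# using a negative lookahead anchored at the start of the name.
non_core = re.compile(r"^(?!queue/taskcluster-).*")

assert non_core.match("queue/pernosco/12345") is not None    # non-core: policy applies
assert non_core.match("some-other-integration") is not None  # non-core: policy applies
assert non_core.match("queue/taskcluster-queue") is None     # core: exempt
print("pattern behaves as expected")
```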

This way, queues are guaranteed to stay alive for up to 7 days after the last consumer disconnects and activity stops.
At most 100k messages can stay in the queue, with the oldest ones dropped on overflow.
Messages stay in the queue for at most 24 hours and then expire.

This should leave plenty of time for integrations and clients that run jobs once a day or once a week, and ensure we don't let RabbitMQ fill up with unused/unconsumed messages.

I think this is a good start as a general policy for non-core queues.

However, I'm not sure it is enough to help prevent the queue/pernosco/\d+ queues from blocking firefox-ci again. It took less than 24 hours and less than 100K messages in one of those queues to block firefox-ci last time. Perhaps for the pernosco queues we can consider using a separate policy with a more aggressive TTL, such as 2 hours.
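If a separate pernosco policy is added, note that RabbitMQ applies only one policy per queue (the highest-priority match), so it would need a higher priority than the general non-core policy. A hypothetical sketch in the management API's definitions format, with an assumed policy name and priority and the 2-hour TTL suggested above:

```json
{
  "vhost": "/",
  "name": "pernosco-strict",
  "pattern": "^queue/pernosco/\\d+",
  "apply-to": "queues",
  "priority": 10,
  "definition": {
    "message-ttl": 7200000,
    "max-length": 100000
  }
}
```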

Yeah, this isn't gonna solve that particular issue, but afaik :ahal was planning to rewrite the routing to send only a few messages to that queue

https://github.com/mozilla-services/cloudops-infra/pull/5746

(In reply to Yarik Kurmyza [:yarik] (he/him) (UTC+1) from comment #3)
> Yeah, this isn't gonna solve that particular issue, but afaik :ahal was planning to rewrite the routing to send only a few messages to that queue

Do we have a ticket yet for this effort?

Yes, this is linked in the parent ticket: https://bugzilla.mozilla.org/show_bug.cgi?id=1903320
It is waiting for the next deployment.

:yarik, can https://github.com/mozilla-services/cloudops-infra/pull/5746 be applied to community and firefoxci now, or does it need to wait for other tickets, such as the one you mentioned above, to complete first?

Thanks.

(In reply to :wezhou from comment #6)
> :yarik, can https://github.com/mozilla-services/cloudops-infra/pull/5746 be applied to community and firefoxci now, or does it need to wait for other tickets, such as the one you mentioned above, to complete first?

I haven't heard any objections so far, and those policies are quite generous, so no harm should be done. I think it is safe to apply to both already.
Thanks

Thanks. The PR has been applied to all 3 envs and merged.

Status: NEW → RESOLVED
Closed: 12 days ago
Resolution: --- → FIXED