Apply policies to non-core queues on Pulse
Categories
(Webtools :: Pulse, task)
Tracking
(Not tracked)
People
(Reporter: yarik, Unassigned)
References
(Blocks 1 open bug)
Details
We run Firefox-CI and Community-TC on a single RabbitMQ cluster (Pulse) and allow any external integration to connect to it and listen to the events.
Sometimes those queues grow and cause periods of downtime for FXCI, as publishers fail to publish messages on time due to the increased load on the RMQ instances.
We also have PulseGuardian, which should take care of such queues, but it can be configured to ignore some of them, so it doesn't help in all cases.
Queues with a large number of "Ready" messages are usually not an issue, since RabbitMQ stores them on disk by default.
The problem starts when clients fetch many messages at once before ACKing/rejecting them, which forces RMQ to hold them in memory and consume extra CPU.
To avoid such scenarios we can apply a few policies to exchanges and queues.
Those policies can include:
- message-ttl: to keep messages for only X minutes/hours
- expires: to delete the queue itself after some inactivity time
- max-length: to allow only a certain number of messages in the queue (dropping the oldest ones on overflow)
Policies are attached using a regexp, so we can either apply the same policy to all queues or only to the non-core ones (core queues usually match ^queue/taskcluster-.*, and a few others TBD).
It probably wouldn't be an issue if applied to all, because core queues always have consumers and rarely accumulate stale messages or grow.
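Since the policy pattern is just a regexp over queue names, it can be sanity-checked offline before applying it to the cluster. A minimal sketch (queue names are hypothetical examples) using a negative lookahead to select everything except the core taskcluster prefix:

```python
import re

# Hypothetical exclusion pattern: match every queue EXCEPT the core
# ones, via a negative lookahead on the queue/taskcluster- prefix.
NON_CORE = re.compile(r"^(?!queue/taskcluster-).*")

assert NON_CORE.match("queue/some-integration/events")      # non-core: matched
assert NON_CORE.match("queue/pernosco/12345")               # non-core: matched
assert not NON_CORE.match("queue/taskcluster-queue/tasks")  # core: excluded
```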
Proposed values:
- message-ttl: 12h
- expires: 5d
- max-length: 100,000
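For reference, RabbitMQ takes `message-ttl` and `expires` in milliseconds, while `max-length` is a plain message count. A quick sanity check of the proposed values:

```python
# Convert the proposed human-readable durations into the millisecond
# values a RabbitMQ policy definition would actually carry.
HOUR_MS = 60 * 60 * 1000
DAY_MS = 24 * HOUR_MS

message_ttl = 12 * HOUR_MS  # 12h
expires     = 5 * DAY_MS    # 5d
max_length  = 100_000       # message count, no unit conversion

assert message_ttl == 43_200_000
assert expires == 432_000_000
```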
Comment 1 • Reporter • 14 days ago
My plan/proposal is to apply the following policy to all "non-core" queues.
Core queues are the ones essential for Taskcluster to function properly: queue/taskcluster-.*
Every other queue belongs to some other integration and would be subject to a stricter policy:
- vhost: / and communitytc
- queue name pattern: ^(?!queue\/taskcluster-).*
- expires: 604,800,000 (7 days)
- max-length: 100,000 (100k messages max)
- message-ttl: 86,400,000 (24 hours)
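The values above can be bundled into the JSON body that the RabbitMQ management HTTP API expects for a policy (`PUT /api/policies/<vhost>/<name>`); this is only a sketch, and the policy name and exact deployment mechanism are assumptions:

```python
import json

# Sketch of a policy body for the RabbitMQ management HTTP API.
# Pattern and values are taken from the proposal above.
policy = {
    "pattern": r"^(?!queue\/taskcluster-).*",
    "apply-to": "queues",
    "definition": {
        "expires": 604_800_000,     # 7 days, in ms
        "max-length": 100_000,      # at most 100k messages
        "message-ttl": 86_400_000,  # 24 hours, in ms
    },
}

body = json.dumps(policy)
assert json.loads(body)["definition"]["expires"] == 7 * 24 * 60 * 60 * 1000
```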
This way queues are guaranteed to stay alive for up to 7 days (after the last consumer disconnected and there was no activity).
Only up to 100k messages can stay in the queue, with the oldest being dropped on overflow.
Messages stay in the queue for at most 24h and then expire.
This should leave plenty of time for integrations and clients that run jobs once a day or once a week, and guarantee we don't let RabbitMQ overflow with unused/unconsumed messages.
I think this is a good start as a general policy for non-core queues.
However, I'm not sure it is enough to prevent the queue/pernosco/\d+ queues from blocking firefox-ci again. It took less than 24 hours and fewer than 100K messages in one of those queues to block firefox-ci last time. Perhaps for the pernosco queues we can consider a separate policy with a more aggressive TTL, such as 2 hours.
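RabbitMQ applies at most one policy per queue: among all policies whose pattern matches the queue name, the one with the highest priority wins. So a stricter pernosco policy with a higher priority would take precedence over the general non-core policy. A hypothetical illustration of that selection rule (policy names and the 2h TTL are assumptions from the discussion above):

```python
import re

# Among matching policies, RabbitMQ applies only the highest-priority one.
policies = [
    {"name": "non-core", "pattern": r"^(?!queue/taskcluster-).*",
     "priority": 0, "definition": {"message-ttl": 86_400_000}},   # 24h
    {"name": "pernosco", "pattern": r"^queue/pernosco/\d+",
     "priority": 10, "definition": {"message-ttl": 7_200_000}},   # 2h
]

def effective_policy(queue):
    matching = [p for p in policies if re.match(p["pattern"], queue)]
    return max(matching, key=lambda p: p["priority"]) if matching else None

assert effective_policy("queue/pernosco/42")["name"] == "pernosco"
assert effective_policy("queue/other/thing")["name"] == "non-core"
```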
Comment 3 • Reporter • 13 days ago
Yeah, this isn't gonna solve that particular issue, but afaik :ahal was planning to rewrite the routing to send only a few messages to that queue:
https://github.com/mozilla-services/cloudops-infra/pull/5746
(In reply to Yarik Kurmyza [:yarik] (he/him) (UTC+1) from comment #3)
> Yeah, this isn't gonna solve that particular issue, but afaik :ahal was planning to rewrite routing to send only a few messages to that queue

Do we have a ticket yet for this effort?
Comment 5 • Reporter • 13 days ago
Yes, this is linked in the parent ticket (https://bugzilla.mozilla.org/show_bug.cgi?id=1903320) and is waiting for the next deployment.
:yarik, can https://github.com/mozilla-services/cloudops-infra/pull/5746 be applied to community and firefoxci now, or does it need to wait for other tickets, such as the one you mentioned above, to complete first?
Thanks.
Comment 7 • Reporter • 12 days ago
(In reply to :wezhou from comment #6)
> :yarik, can https://github.com/mozilla-services/cloudops-infra/pull/5746 be applied to community and firefoxci now, or it needs to wait for other tickets such as the one you mentioned above to complete first?
I didn't hear any objections so far, and those policies are quite generous, so no harm should be done. I think it is safe to apply to both already.
Thanks
Thanks. The PR has been applied to all 3 envs and merged.