Closed Bug 1903229 Opened 20 days ago Closed 12 days ago

Apply policies to non-core queues on Pulse

Categories

(Webtools :: Pulse, task)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: yarik, Unassigned)

References

(Blocks 1 open bug)

Details

We run Firefox-CI and Community-TC on a single rabbitmq cluster (Pulse) and allow any external integration to connect to it and listen to the events.

Sometimes those queues can grow and cause periods of downtime for FXCI, as publishers fail to publish messages on time due to the increased load on the RMQ instances.

We also have PulseGuardian, which should take care of such queues, but it can be configured to ignore some of them, so it doesn't help in all cases.

Queues with a large number of "Ready" messages are usually not an issue, since RabbitMQ stores them on disk by default.
The problem starts when clients fetch many messages at once before ACKing/rejecting them, which forces RMQ to hold them in memory and consume extra CPU.

To avoid such scenarios, we can apply a few policies to exchanges and queues.

Those policies can include:

  • message-ttl to keep messages for X minutes/hours only
  • expires to delete the queue itself after some period of inactivity
  • max-length to allow only a certain number of messages in the queue (dropping the oldest ones on overflow)

Policies are attached using a regexp, so we can either apply the same policy to all queues or exclude the core ones (usually ^queue/taskcluster-.*, and a few others TBD).
Applying it to all queues probably wouldn't be an issue, because core queues always have consumers and rarely accumulate stale messages or grow.

Proposed values:

  • message-ttl - 12h
  • expires - 5d
  • max-length - 100,000
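The TTL-style arguments in RabbitMQ policy definitions are expressed in milliseconds, so as a sanity check, here is a minimal sketch converting the proposed values (the dict keys mirror the policy argument names; the variable names are my own):

```python
# Sketch: convert the proposed human-readable values into the millisecond
# units that RabbitMQ policy definitions expect.
HOUR_MS = 60 * 60 * 1000
DAY_MS = 24 * HOUR_MS

proposed_policy = {
    "message-ttl": 12 * HOUR_MS,   # 12h -> 43,200,000 ms
    "expires": 5 * DAY_MS,         # 5d  -> 432,000,000 ms
    "max-length": 100_000,         # plain message count, not ms
}

print(proposed_policy)
```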

My plan/proposal is to apply the following policy to all "non-core" queues.

Core queues are the ones essential for Taskcluster to function properly: queue/taskcluster-.*

Every other queue belongs to some other integration that would be subject to a stricter policy:

vhost: / and communitytc
queue name pattern: ^(?!queue\/taskcluster-).*

expires: 604,800,000 ms (7 days)
max-length: 100,000 (100k messages max)
message-ttl: 86,400,000 ms (24 hours)
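As a sanity check on the queue name pattern, here is a minimal sketch using Python's re module, which supports the same negative-lookahead syntax as RabbitMQ's PCRE-style pattern matching (the queue names below are illustrative, except the core prefix taken from this bug):

```python
import re

# The proposed pattern: match every queue EXCEPT core taskcluster-* queues,
# using a negative lookahead anchored at the start of the name.
non_core = re.compile(r"^(?!queue/taskcluster-).*")

assert non_core.match("queue/pernosco/12345") is not None    # non-core: policy applies
assert non_core.match("some-other-integration") is not None  # non-core: policy applies
assert non_core.match("queue/taskcluster-queue") is None     # core: exempt
print("pattern behaves as expected")
```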

This way, queues are guaranteed to stay alive for up to 7 days after the last consumer disconnects and activity stops.
At most 100k messages can stay in the queue, with the oldest ones dropped on overflow.
Messages stay in the queue for at most 24 hours and then expire.

This should leave plenty of time for integrations and clients that run jobs once a day or once a week, and ensure we don't let RabbitMQ fill up with unused/unconsumed messages.

I think this is a good start as a general policy for non-core queues.

However, I'm not sure it is enough to help prevent the queue/pernosco/\d+ queues from blocking firefox-ci again. It took less than 24 hours and less than 100K messages in one of those queues to block firefox-ci last time. Perhaps for the pernosco queues we can consider using a separate policy with a more aggressive TTL, such as 2 hours.
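If a separate pernosco policy is added, note that RabbitMQ applies only one policy per queue (the highest-priority match), so it would need a higher priority than the general non-core policy. A hypothetical sketch in the management API's definitions format, with an assumed policy name and priority and the 2-hour TTL suggested above:

```json
{
  "vhost": "/",
  "name": "pernosco-strict",
  "pattern": "^queue/pernosco/\\d+",
  "apply-to": "queues",
  "priority": 10,
  "definition": {
    "message-ttl": 7200000,
    "max-length": 100000
  }
}
```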

Yeah, this isn't gonna solve that particular issue, but afaik :ahal was planning to rewrite the routing to send only a few messages to that queue

https://github.com/mozilla-services/cloudops-infra/pull/5746

(In reply to Yarik Kurmyza [:yarik] (he/him) (UTC+1) from comment #3)
> Yeah, this isn't gonna solve that particular issue, but afaik :ahal was planning to rewrite the routing to send only a few messages to that queue

Do we have a ticket yet for this effort?

Yes, this is linked in the parent ticket: https://bugzilla.mozilla.org/show_bug.cgi?id=1903320
It is waiting for the next deployment.

:yarik, can https://github.com/mozilla-services/cloudops-infra/pull/5746 be applied to community and firefoxci now, or does it need to wait for other tickets, such as the one you mentioned above, to complete first?

Thanks.

(In reply to :wezhou from comment #6)
> :yarik, can https://github.com/mozilla-services/cloudops-infra/pull/5746 be applied to community and firefoxci now, or does it need to wait for other tickets, such as the one you mentioned above, to complete first?

I haven't heard any objections so far, and those policies are quite generous, so no harm should be done. I think it is safe to apply to both already.
Thanks

Thanks. The PR has been applied to all 3 envs and merged.

Status: NEW → RESOLVED
Closed: 12 days ago
Resolution: --- → FIXED