[PulseGuardian] Smarter queue monitoring to facilitate bursty loads



4 years ago


(Reporter: jonasfj, Unassigned)



Firefox Tracking Flags

(firefox40 affected)


Over the past couple of days I've been receiving a lot of warning emails
about the queue for the task-graph-scheduler, which receives messages from
all tasks in all task-graphs.

Basically it goes above 2k messages, and then back down to zero a few minutes
later. There are two reasons for this:
 A) We have bursty load: all of a sudden a lot of tasks finish at the same time.
 B) My background workers are slow: Heroku restarts them or I temporarily
    experience timeouts.

In both cases we want PulseGuardian to be smarter, and accept that I'm a
little behind on consumption for a limited period of time.
Most of the time the queue is close to empty; it's never really empty because
there is always something coming in and going out, but it doesn't grow much.
It only spikes occasionally.

### Option A) Require fewer than 100 messages in the queue at some point within a 10-minute window
Instead of relying on an absolute message count, we should do something smarter,
preferably based on consumption rate. IMO a good measure would be whether the
queue has gone a full 10-minute window without ever dropping below 100 messages.

I suspect we poll RabbitMQ management APIs for stats once a minute or so.
Hence, we poll every minute and if we don't see the count below 100 messages
for 10 consecutive minutes/polling-cycles we send a warning.

Maybe 20 min and 500 messages are better numbers; that is tunable.
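
A minimal sketch of the check described above, assuming one sample per polling
cycle. The class name `QueueWindowMonitor` and its parameters are hypothetical
and are not part of PulseGuardian.

```python
from collections import deque


class QueueWindowMonitor:
    """Warn only if the queue never dropped below `threshold` during the
    last `window` polls. Hypothetical helper, not existing PulseGuardian code."""

    def __init__(self, threshold=100, window=10):
        self.threshold = threshold
        self.window = window
        # The deque keeps only the most recent `window` samples.
        self.samples = deque(maxlen=window)

    def record(self, message_count):
        """Record one polled queue length; return True if a warning is due."""
        self.samples.append(message_count)
        if len(self.samples) < self.window:
            return False  # not enough history yet to judge a full window
        # Warn only when every sample in the window is at or above the threshold.
        return min(self.samples) >= self.threshold
```

With these defaults, a spike to 2k messages that drains within a few minutes
never triggers a warning, because at least one sample in every 10-poll window
is below 100.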

Reasoning: Pulse is not a work queue, it's not a place we should store messages
waiting to be processed. Hence, it's fair to require that queues are emptied
regularly. Checking that queue length is less than 100 is just a robust way of
checking for emptiness with polling.

### Option B) Something based on consumption rates
RabbitMQ does maintain rates of message consumption and publication.
Off the top of my head I can't immediately see how to use this.
But I'm sure there are other options that use some of the stats maintained
by RabbitMQ.
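
One possible way to use those rates, as a sketch only: estimate how long the
current backlog would take to drain at the net consumption rate, and warn when
it is not draining at all or would take too long. The function name is
hypothetical. The field names (`messages`, `message_stats`, `publish_details`,
`ack_details`, `rate`) are what I understand the management API's queue object
to contain, and should be verified against the RabbitMQ version in use. This
also assumes consumers acknowledge messages.

```python
def estimated_drain_seconds(queue_stats):
    """Estimate seconds until the queue empties, or None if it is not draining.

    `queue_stats` is assumed to be the parsed JSON for one queue from the
    RabbitMQ management API. Hypothetical helper, not PulseGuardian code.
    """
    messages = queue_stats.get("messages", 0)
    if messages == 0:
        return 0.0
    stats = queue_stats.get("message_stats", {})
    publish_rate = stats.get("publish_details", {}).get("rate", 0.0)
    ack_rate = stats.get("ack_details", {}).get("rate", 0.0)
    # Net rate at which the backlog shrinks (messages per second).
    net_rate = ack_rate - publish_rate
    if net_rate <= 0:
        return None  # backlog is steady or growing
    return messages / net_rate
```

A monitor could then warn when the result is `None`, or exceeds some tunable
limit, for several consecutive polls.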

I'm not sure an absolute limit on the number of messages in the queue is
good. If we do keep it, it should be a very high limit.

Note: Consumers can already define a max queue length client side
      with RabbitMQ extensions. The PulseListener I've implemented already
      supports this.

This is not critical yet, but going forward this is going to be important...
We can mitigate the issue temporarily by scaling up the number of consumers,
but that's not a good long-term solution.
Priority: -- → P2

Comment 1

4 years ago
I suspect option (A) is the easiest, and better than (B).

I want to be warned if consumers aren't keeping up; another measure is the age of the oldest message in the queue (I don't know if we can get that in an efficient manner).

Also, we might want to warn about durable queues that haven't had a consumer for 48 hours or so.
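
A rough sketch of the idle-queue warning, assuming each poll yields a queue
object with `name`, `durable` and `consumers` fields (as I believe the RabbitMQ
management API reports them). `IdleQueueTracker` is a hypothetical name, and
the state is kept in memory only for illustration.

```python
import time


class IdleQueueTracker:
    """Flag durable queues that have had no consumer for `max_idle_seconds`.
    Hypothetical helper, not existing PulseGuardian code."""

    def __init__(self, max_idle_seconds=48 * 3600):
        self.max_idle_seconds = max_idle_seconds
        # Queue name -> time a consumer was last seen (or first observation).
        self.last_seen = {}

    def check(self, queue, now=None):
        """Return True if a warning is due for this polled queue object."""
        now = time.time() if now is None else now
        if not queue.get("durable", False):
            return False  # only durable queues are of interest here
        name = queue["name"]
        if queue.get("consumers", 0) > 0 or name not in self.last_seen:
            # Reset the clock when a consumer is attached or the queue is new.
            self.last_seen[name] = now
            return False
        return now - self.last_seen[name] >= self.max_idle_seconds
```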