For the past couple of days I've been receiving a lot of email warnings about the queue for the task-graph-scheduler, which receives messages from all tasks in all task-graphs. Basically it goes above 2k messages, and then back down to zero a few minutes later.

There are two reasons for this:
A) We have bursty load: all of a sudden a lot of tasks finish at the same time.
B) My background workers are slow: Heroku restarts them, or I temporarily experience timeouts.

In both cases we want pulse guardian to be smarter, and accept that I'm a little behind on consumption for a limited period of time.

Most of the time the queue is close to empty. It's never really empty, because there is always something coming in and going out, but it doesn't grow much. It's only occasionally that it spikes.

### Option A) Require < 100 messages in queue within a window of 10 minutes

Instead of relying on an absolute message count, we should do something smarter, preferably based on consumption rate. IMO a good measure would be whether the queue has dropped below 100 messages at any point within a 10-minute window. I suspect we poll the RabbitMQ management APIs for stats once a minute or so. Hence, we poll every minute, and if we don't see the count below 100 messages for 10 consecutive minutes/polling cycles, we send a warning. Maybe 20 min and 500 messages are better numbers; that is tunable.

Reasoning: Pulse is not a work queue; it's not a place where we should store messages waiting to be processed. Hence, it's fair to require that queues are emptied regularly. Checking that the queue length is less than 100 is just a robust way of checking for emptiness with polling.

### Option B) Something based on consumption rates

RabbitMQ does maintain rates of message consumption and publication. Off the top of my head I can't immediately see how to use this, but I'm sure there are other options that use some of the stats maintained by RabbitMQ.
--------------------

I'm not sure an absolute limit on the number of messages in the queue is good. If we do keep it, it should be a very high limit.

Note: Consumers can already define a max queue length client side with RabbitMQ extensions. The PulseListener I've implemented already supports this.

This is not critical yet, but going forward this is going to be important. We can mitigate the issue temporarily by scaling up the number of consumers, but that's not a good long-term solution.
I suspect option (A) is the easiest, and better than (B). I want to be warned if consumers aren't keeping up. Another measure is the age of the oldest message in the queue (I don't know if we can get that in an efficient manner). Also, we might want to warn about durable queues that haven't had a consumer for 48 hours or so.
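
To make Option A concrete, here is a rough sketch of the check in Python. It is not pulse guardian's actual code: the class name `BacklogMonitor`, the `observe` method, and the defaults are made up for illustration, and it assumes something else polls the queue length once a minute and feeds each count in.

```python
class BacklogMonitor:
    """Hypothetical helper: warn when a queue has not dropped below
    `threshold` messages for `window` consecutive polling cycles."""

    def __init__(self, threshold=100, window=10):
        self.threshold = threshold  # "empty enough" cutoff (assumed tunable)
        self.window = window        # number of consecutive polls to tolerate
        self.polls_since_empty = 0  # polls since the count was last below threshold

    def observe(self, message_count):
        """Record one polled queue length; return True if a warning is due."""
        if message_count < self.threshold:
            # The consumer caught up, so the window starts over.
            self.polls_since_empty = 0
        else:
            self.polls_since_empty += 1
        return self.polls_since_empty >= self.window


# A short spike that drains again never triggers a warning.
spiky = BacklogMonitor(threshold=100, window=10)
spike_warnings = [spiky.observe(n) for n in [5, 2500, 1800, 40, 12]]

# A backlog that stays at or above the threshold for 10 polls does.
stuck = BacklogMonitor(threshold=100, window=10)
stuck_warnings = [stuck.observe(150) for _ in range(10)]
```

With these numbers a burst like the 2k spikes described above is ignored as long as the queue drains within the window, while a consumer that stays behind for 10 polls in a row gets flagged. Switching to 20 minutes and 500 messages is just a change of the two constructor arguments.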