Closed
Bug 1124945
Opened 10 years ago
Closed 10 years ago
Improve Pulse Nagios alerts
Categories
(Infrastructure & Operations Graveyard :: WebOps: Other, task)
Infrastructure & Operations Graveyard
WebOps: Other
Tracking
(Not tracked)
RESOLVED FIXED
People
(Reporter: mcote, Assigned: cliang)
Details
(Whiteboard: [kanban:https://webops.kanbanize.com/ctrl_board/2/335] )
We raised the Pulse nagios alert threshold in bug 1079534 to 10000 messages. This isn't great, though, since this is an overall number, and PulseGuardian only deletes queues when they hit 8000 messages. So it's easy for there to be several queues with a few thousand messages, which triggers nagios alerts despite this being a nominal condition to PulseGuardian.
Pulse is hardly affected by having 10000 queued messages. We could easily increase that limit; however, the number of messages is really not a good indicator of consumption of system resources, since the messages could be of any size.
I think we should probably alert when consumption of system resources gets too high--when disk space, memory usage, or CPU usage becomes problematic (or just before).
Assignee
Comment 1•10 years ago
I'm wondering if it'd be better to do something like an alert for any rabbitMQ queue that hits <threshold> and stays there for longer than it takes for PulseGuardian to kick in. Let me see if I can figure out if there is a sane way to do this in the context of a Nagios check.
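Roughly, I'm picturing something like the sketch below (just an illustration, not a finished plugin: the management API URL, credentials, threshold, and grace period are all placeholders, and the state file is only there so the check can remember how long each queue has been over the limit between Nagios runs):

#!/usr/bin/env python3
# Hypothetical sketch: alert on any queue that stays over a message threshold
# for longer than PulseGuardian normally takes to react.
import base64, json, time, urllib.request

MGMT_URL = "http://localhost:15672/api/queues"  # placeholder management API endpoint
THRESHOLD = 8000        # messages; same limit PulseGuardian uses
GRACE = 15 * 60         # placeholder: seconds a queue may stay over the limit
STATE_FILE = "/var/tmp/pulse_queue_check.json"  # remembers first-seen times between runs

req = urllib.request.Request(
    MGMT_URL,
    headers={"Authorization": "Basic " + base64.b64encode(b"guest:guest").decode()})
queues = json.load(urllib.request.urlopen(req))

try:
    with open(STATE_FILE) as f:
        state = json.load(f)
except (IOError, ValueError):
    state = {}

now = time.time()
stuck = []
for q in queues:
    name, depth = q["name"], q.get("messages", 0)
    if depth >= THRESHOLD:
        state.setdefault(name, now)          # record when the queue first went over
        if now - state[name] > GRACE:
            stuck.append("%s (%d msgs)" % (name, depth))
    else:
        state.pop(name, None)                # queue recovered; forget it

with open(STATE_FILE, "w") as f:
    json.dump(state, f)

if stuck:
    print("CRITICAL: queues over %d msgs for more than %ds: %s"
          % (THRESHOLD, GRACE, ", ".join(stuck)))
    raise SystemExit(2)
print("OK: no queues stuck over %d messages" % THRESHOLD)
raise SystemExit(0)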
Reporter
Comment 2•10 years ago
I think nagios alerts are still useful for the (probably rare, but possible) situation in which there are a very large number of queues that have many messages but are still under the 8k limit. The question is how to have a meaningful threshold for nagios--right now our choices of message limit (for both PulseGuardian and nagios) are essentially arbitrary.
Assignee
Comment 3•10 years ago
So, stepping back a little:
1) There are checks for overall use of disk space, memory, etc. There are currently no specific checks for hitting the rabbitmq high water marks for each of these; I can add those checks (since triggering any of those means rabbitmq won't accept any more messages).
2) I'm wondering if a check for x% of queues being within y% of the message limit (8k) would be worthwhile; a rough sketch of what I have in mind is below. If that makes sense, I can tackle that check as well.
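Something along these lines, perhaps (only a sketch: the endpoint, credentials, and the x/y values are placeholders picked for illustration):

#!/usr/bin/env python3
# Hypothetical sketch: warn when too many queues are close to the 8k limit,
# even though no single queue has crossed it yet.
import base64, json, urllib.request

MGMT_URL = "http://pulse.example.com:15672/api/queues"  # placeholder endpoint
LIMIT = 8000      # PulseGuardian's per-queue message limit
Y_PCT = 0.75      # "close to the limit" = 75% of LIMIT (placeholder)
X_PCT = 0.25      # alert if 25% of all queues are that full (placeholder)

req = urllib.request.Request(
    MGMT_URL,
    headers={"Authorization": "Basic " + base64.b64encode(b"guest:guest").decode()})
queues = json.load(urllib.request.urlopen(req))

near_limit = [q["name"] for q in queues if q.get("messages", 0) >= Y_PCT * LIMIT]

if queues and len(near_limit) / len(queues) >= X_PCT:
    print("WARNING: %d of %d queues are above %d%% of the %d-message limit"
          % (len(near_limit), len(queues), Y_PCT * 100, LIMIT))
    raise SystemExit(1)
print("OK: %d of %d queues near the limit" % (len(near_limit), len(queues)))
raise SystemExit(0)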
Assignee
Comment 4•10 years ago
$ svn commit -m "Remove pulse rabbitmq overview check (BZ 1124945)"
Sending nagios/manifests/mozilla/services.pp
Transmitting file data .
Committed revision 99573.
Assignee
Updated•10 years ago
Assignee: server-ops-webops → cliang
Assignee
Comment 5•10 years ago
RabbitMQ watermark checks put into place for production pulse. Right now:
WARNING: at 75% of the RabbitMQ file descriptor, socket descriptor, or memory watermark threshold
-OR- have only 10x the disk threshold left (480 MB)
CRITICAL: at 90% of the RabbitMQ file descriptor, socket descriptor, or memory watermark threshold
-OR- have only 4x the disk threshold left (192 MB)
Check documented at:
https://mana.mozilla.org/wiki/display/NAGIOS/Pulse+-+RabbitMQ+Watermarks
https://mana.mozilla.org/wiki/display/NAGIOS/RabbitMQ+Watermarks
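For reference, the general idea is to compare each node's current usage against the limits RabbitMQ itself reports. A rough sketch of that comparison is below (not the deployed check: the endpoint and credentials are placeholders, the field names are the ones the management API's /api/nodes endpoint is expected to expose, and the thresholds mirror the percentages above):

#!/usr/bin/env python3
# Hypothetical sketch of the watermark comparison: WARNING at 75% and CRITICAL
# at 90% of the fd/socket/memory limits, plus the disk-free headroom margins.
import base64, json, urllib.request

NODES_URL = "http://localhost:15672/api/nodes"  # placeholder endpoint
WARN, CRIT = 0.75, 0.90

req = urllib.request.Request(
    NODES_URL,
    headers={"Authorization": "Basic " + base64.b64encode(b"guest:guest").decode()})
problems, status = [], 0
for node in json.load(urllib.request.urlopen(req)):
    ratios = {
        "fd": node["fd_used"] / node["fd_total"],
        "sockets": node["sockets_used"] / node["sockets_total"],
        "memory": node["mem_used"] / node["mem_limit"],
    }
    for what, ratio in ratios.items():
        if ratio >= CRIT:
            status = 2
            problems.append("%s %s at %d%% of limit" % (node["name"], what, ratio * 100))
        elif ratio >= WARN:
            status = max(status, 1)
            problems.append("%s %s at %d%% of limit" % (node["name"], what, ratio * 100))
    # Disk: alert on how little headroom is left above disk_free_limit
    # (below 10x the limit -> WARNING, below 4x -> CRITICAL, per the thresholds above).
    if node["disk_free"] <= 4 * node["disk_free_limit"]:
        status = 2
        problems.append("%s disk_free below 4x limit" % node["name"])
    elif node["disk_free"] <= 10 * node["disk_free_limit"]:
        status = max(status, 1)
        problems.append("%s disk_free below 10x limit" % node["name"])

print(["OK", "WARNING", "CRITICAL"][status]
      + (": " + "; ".join(problems) if problems else ""))
raise SystemExit(status)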
Assignee
Comment 6•10 years ago
There don't appear to have been any side effects from introducing this check. Closing out this RabbitMQ monitoring bug.
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Updated•6 years ago
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard