Closed Bug 632300 Opened 15 years ago Closed 14 years ago

Need alerts when zeus has exhausted a pool

Categories

(mozilla.org Graveyard :: Server Operations, task)

Hardware: x86
OS: Other
Type: task
Priority: Not set
Severity: major

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: cshields, Assigned: rtucker)

We need alerts from our zeus clusters whenever all nodes in a pool have failed their health checks. (Ideally we will already have seen outage notices for the individual nodes, but this monitors the potential problem from a different angle.) For instance, tonight all 3 of SUMO's nodes in PHX failed to respond, leaving zeus with no node in the pool. I know zeus throws a different type of error when this occurs; we need to catch it and alert oncall accordingly.
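The condition described above (every node in a pool failing its health check) can be sketched as follows. This is only an illustration of the logic, not how zeus reports pool state: the `pools` mapping, the node names, and the function name are hypothetical.

```python
def exhausted_pools(pools):
    """Return the names of pools in which no node is passing its health check.

    `pools` is a hypothetical mapping of pool name -> {node name: healthy?}.
    A pool with no nodes at all is also reported, since it cannot serve traffic.
    """
    return sorted(
        name for name, nodes in pools.items()
        if not any(nodes.values())
    )


if __name__ == "__main__":
    # Hypothetical snapshot: every sumo node is down, one amo node is still up.
    snapshot = {
        "sumo": {"node1": False, "node2": False, "node3": False},
        "amo": {"node1": True, "node2": False},
    }
    for pool in exhausted_pools(snapshot):
        print(f"CRITICAL: pool {pool} has no healthy nodes")
```

Individual node outages are still alerted on separately; this check only fires when the whole pool is gone.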
Which zeus node sits in front of the 3 that failed?
This is in sjc, but to close this bug out I'd like monitoring turned on for all pools in all clusters.
From conversations with cshields today, it sounds as though the best solution is to add trap collection to nagios. Do we have any trap collecting in place? The only way I know of to do this is to use snmptt and funnel the traps into nagios, but with our setup the way it is, I can't see this being an easy task.
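For reference, a trap handler usually hands a trap to nagios by writing a passive check result to the nagios external command file. A minimal sketch of that step, assuming a handler script invoked by snmptt; the service description, the command file path, and the function names are assumptions, not our actual configuration:

```python
import time

# Assumed path; the real location depends on how nagios was installed.
COMMAND_FILE = "/usr/local/nagios/var/rw/nagios.cmd"

CRITICAL = 2  # standard nagios return code for a CRITICAL state


def passive_check_line(host, service, return_code, output, now=None):
    """Format a PROCESS_SERVICE_CHECK_RESULT external command for nagios."""
    timestamp = int(time.time()) if now is None else int(now)
    # Each external command must fit on one line, so flatten newlines.
    clean_output = output.replace("\n", " ")
    return (
        f"[{timestamp}] PROCESS_SERVICE_CHECK_RESULT;"
        f"{host};{service};{return_code};{clean_output}\n"
    )


def submit(line, command_file=COMMAND_FILE):
    """Append one command line to the nagios external command file."""
    with open(command_file, "a") as fh:
        fh.write(line)


if __name__ == "__main__":
    # Hypothetical trap: a zeus node reports that a pool has no nodes left.
    print(passive_check_line(
        "zlb01.nms.mozilla.org", "zeus-pool-failed", CRITICAL,
        "all nodes in pool have failed",
    ), end="")
```

The host and service named in the command must already be defined in nagios with passive checks enabled, otherwise the result is dropped.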
I now have the zlbXX.nms cluster and pp-zlbXX clusters trapping to dm-nagios01 and dp-nagios01 respectively. I think that I've got the chatter down to where we want it. Are there other zlb clusters that should be throwing out traps as well that I don't know about? I can configure them to trap as well.
oremj, whenever you have a free minute, would you mind adding a comment on which other zlb clusters I should enable trapping for?
zlb01.nms.mozilla.org pm-zlb-generic01.nms.mozilla.org pm-zlb-amo01.nms.mozilla.org
I added these additional ones. Are there any more or can I close this out?
Closing this out. We have more that are just doing caching, so I don't want to get warnings for network blips right now. Thanks!
Status: NEW → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED
Product: mozilla.org → mozilla.org Graveyard