Closed Bug 603343 (releng-nagios) Opened 14 years ago Closed 13 years ago

[Tracking bug] cleanup nagios configs, so monitoring RelEng systems is less noisy

Categories: Infrastructure & Operations :: RelOps: General, task, P3
Hardware/OS: x86 / All
Tracking: (Not tracked)
Status: RESOLVED FIXED
People: Reporter: joduinn; Assigned: arich

Between Friday night and first thing Monday morning, I got 1094 nagios messages.

It's impossible to see whether there are any *real* problems in the midst of all that noise.

This bug is to track fixing configs to correctly handle how our RelEng systems behave routinely, so whenever we *do* get a nagios alert, it is something we notice!

One example (bug#575472, already fixed) was configuring nagios to treat mobile phones as devices that behave differently from desktop machines, with different time thresholds for reporting errors on reboot - phones run so much slower!
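
(For reference, the kind of change that fix involved is a separate host profile with more forgiving timing. A minimal sketch, assuming Nagios 3 style object configs; the template name, host name, address and numbers below are made up for illustration, not the actual bug 575472 change:)

  # Hypothetical host template for phones: wait much longer before declaring
  # a reboot failed, since phones come back far more slowly than desktops.
  define host {
      name                    releng-mobile-device   ; template only
      use                     generic-host
      register                0
      max_check_attempts      20     ; tolerate many failed pings during a reboot
      check_interval          5      ; minutes between normal checks
      retry_interval          2      ; minutes between rechecks while down
      notification_interval   120    ; re-notify at most every 2 hours
  }

  define host {
      use        releng-mobile-device
      host_name  n900-001.build        ; made-up host name
      address    10.250.48.1           ; made-up address
  }
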
For what it's worth, monitoring using the web page or IRC is much easier than mail, because of all the frequent status changes. For example, this query shows all things that are *currently* FAILing or WARNing:
https://nagios.mozilla.org/nagios/cgi-bin/status.cgi?host=all&servicestatustypes=28&hoststatustypes=15

This one shows things currently FAILing or WARNing that have not been acknowledged. Generally, these are things that currently require attention:
https://nagios.mozilla.org/nagios/cgi-bin/status.cgi?host=all&servicestatustypes=28&hoststatustypes=15&serviceprops=270346


A couple of ideas on how to make things better:
- Sort things on the nagios display. Having sections for all of the platform/slave type combinations would help us find patterns quicker, I think. Having the less-redundant things, such as masters, in their own section would make it easier to find critical problems with them.
(In reply to comment #1)
> For what it's worth, monitoring using the web page or IRC is much easier than
> mail, because of all the frequent status changes. For example, this query shows
> all things that are *currently* FAILing or WARNing:
> https://nagios.mozilla.org/nagios/cgi-bin/status.cgi?host=all&servicestatustypes=28&hoststatustypes=15
> 
> This one shows things currently FAILing or WARNing that have not been
> acknowledged. Generally, this are things that currently require attention:
> https://nagios.mozilla.org/nagios/cgi-bin/status.cgi?host=all&servicestatustypes=28&hoststatustypes=15&serviceprops=270346

Actually, that misses the point. 

Yes, you can skip some of the nagios noise in email/irc. Yes, looking at the nagios master shows what is *currently* failing, but those errors still include a bunch of noisy/flapping alerts. A human then has to weed through them to figure out which alerts are real and which are noise.

Having noisy/flapping nagios is the problem. The real fix is to debug the nagios thresholds and configs so that all (or at least the vast majority) of nagios alerts are actually real valid problems. This bug is to track identifying and fixing those noisy/flapping nagios configs.
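
(One concrete knob worth auditing while doing that is Nagios's own flap detection, which holds notifications while a check's state is oscillating. A minimal sketch, assuming it isn't already enabled for these checks; the template name and thresholds are illustrative, not taken from the real config:)

  # nagios.cfg (global switch)
  enable_flap_detection=1

  # Shared service template used by the noisy checks (hypothetical name)
  define service {
      name                    releng-service
      register                0
      flap_detection_enabled  1
      low_flap_threshold      10.0   ; % state change below which flapping ends
      high_flap_threshold     25.0   ; % state change above which flapping starts
      flap_detection_options  o,w,c,u
  }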



> A couple ideas on how to make things better
> - Sort things on the nagios display. Having sections for all of the
> platform/slave type combinations would help us find patterns quicker, I think.
> Having the less redundant things such as masters in their own section would
> make it easier to find critical problems with them.

I guess that might be helpful, but it feels like a separate issue.
OS: Mac OS X → All
See Also: → 589006
(In reply to comment #2)
> (In reply to comment #1)
> > For what it's worth, monitoring using the web page or IRC is much easier than
> > mail, because of all the frequent status changes. For example, this query shows
> > all things that are *currently* FAILing or WARNing:
> > https://nagios.mozilla.org/nagios/cgi-bin/status.cgi?host=all&servicestatustypes=28&hoststatustypes=15
> > 
> > This one shows things currently FAILing or WARNing that have not been
> > acknowledged. Generally, this are things that currently require attention:
> > https://nagios.mozilla.org/nagios/cgi-bin/status.cgi?host=all&servicestatustypes=28&hoststatustypes=15&serviceprops=270346
> 
> Actually, that misses the point. 

I don't think it entirely misses the point -- it's still a ton easier than matching up errors and recoveries in e-mail.

> Yes, you can skip some of the nagios noise in email/irc. Yes, looking at the
> nagios master shows what is *currently* failing, but those errors still
> includes a bunch of noisy/flapping alerts.

In my experience, there are very few flapping alerts. Which ones have you noticed to be flapping?

> > A couple ideas on how to make things better
> > - Sort things on the nagios display. Having sections for all of the
> > platform/slave type combinations would help us find patterns quicker, I think.
> > Having the less redundant things such as masters in their own section would
> > make it easier to find critical problems with them.
> 
> I guess that might be helpful, but it feels like a separate issue.

OK, I'll file it separately; I didn't realize this bug was limited to your original idea.
(In reply to comment #3)
> (In reply to comment #2)
> > (In reply to comment #1)
> > > For what it's worth, monitoring using the web page or IRC is much easier than
> > > mail, because of all the frequent status changes. For example, this query shows
> > > all things that are *currently* FAILing or WARNing:
> > > https://nagios.mozilla.org/nagios/cgi-bin/status.cgi?host=all&servicestatustypes=28&hoststatustypes=15
> > > 
> > > This one shows things currently FAILing or WARNing that have not been
> > > acknowledged. Generally, this are things that currently require attention:
> > > https://nagios.mozilla.org/nagios/cgi-bin/status.cgi?host=all&servicestatustypes=28&hoststatustypes=15&serviceprops=270346
> > 
> > Actually, that misses the point. 
> I don't think it entirely misses the point -- it's still a ton easier than
> matching up errors and recoveries in e-mail.
Organizing the errors is helpful, of course, but the topic here is the root problem: eliminating the nagios noise.


> > Yes, you can skip some of the nagios noise in email/irc. Yes, looking at the
> > nagios master shows what is *currently* failing, but those errors still
> > includes a bunch of noisy/flapping alerts.
> 
> In my experience, there's very few flapping alerts. Which ones have you noticed
> to be flapping?

Not a complete list, but from a quick glance in #build just now:
12:01:22 < nagios> [67] mw32-ix-slave11.build:buildbot is CRITICAL: CRITICAL: 
                   python.exe: stopped (critical)
12:06:28 < nagios> mw32-ix-slave11.build:buildbot is OK: OK: python.exe: 1
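
(As an aside, for process checks like this one the standard knob is to require several consecutive failures before a notification goes out, so a short restart never reaches a HARD state. A minimal sketch, assuming a Nagios 3 service definition; the check command name is a guess, not the real config:)

  define service {
      use                   generic-service
      host_name             mw32-ix-slave11.build
      service_description   buildbot
      check_command         check_nrpe!check_buildbot   ; assumed command name
      check_interval        5     ; minutes between normal checks
      retry_interval        2     ; recheck quickly while in a SOFT state
      max_check_attempts    4     ; ~6 minutes of failures before alerting
  }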




> > > A couple ideas on how to make things better
> > > - Sort things on the nagios display. Having sections for all of the
> > > platform/slave type combinations would help us find patterns quicker, I think.
> > > Having the less redundant things such as masters in their own section would
> > > make it easier to find critical problems with them.
> > 
> > I guess that might be helpful, but it feels like a separate issue.
> 
> OK, I'll file it separately, I didn't realize this bug was limited to your
> original idea.
I've linked your bug, thanks.
Depends on: 603684
(In reply to comment #4)
> (In reply to comment #3)
> > (In reply to comment #2)
> > > (In reply to comment #1)
> > In my experience, there's very few flapping alerts. Which ones have you noticed
> > to be flapping?
> 
> Not complete list, but from quick glance in #build just now:
> 12:01:22 < nagios> [67] mw32-ix-slave11.build:buildbot is CRITICAL: CRITICAL: 
>                    python.exe: stopped (critical)
> 12:06:28 < nagios> mw32-ix-slave11.build:buildbot is OK: OK: python.exe: 1

jhford tells me this random example I picked happened to be from when he was working on this slave.



Here's another random example of flapping, from this morning's set of nagios alerts:

Host: try-mac-slave40.build
State: CRITICAL
Date/Time: 10-13-2010 19:43:03
FILE_AGE CRITICAL: /builds/slave/twistd.log is 432149 seconds old and 835444 bytes

Host: try-linux-slave04.build
State: OK
Date/Time: 10-13-2010 19:43:11
Additional Info:
FILE_AGE OK: /builds/slave/twistd.log is 205 seconds old and 346783 bytes
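
(For what it's worth, the FILE_AGE checks above look like the stock check_file_age plugin run on the slave, and the thresholds are exactly what would need auditing per slave class. A minimal sketch of the slave-side NRPE command, with an assumed command name, plugin path and thresholds:)

  # nrpe.cfg on the slave: warn if twistd.log hasn't been written to in 6 hours,
  # go critical after 24 hours
  command[check_twistd_log]=/usr/lib/nagios/plugins/check_file_age -w 21600 -c 86400 -f /builds/slave/twistd.log
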
(In reply to comment #5)
> (In reply to comment #4)
> > (In reply to comment #3)
> > > (In reply to comment #2)
> > > > (In reply to comment #1)
> > > In my experience, there's very few flapping alerts. Which ones have you noticed
> > > to be flapping?
> > 
> > Not complete list, but from quick glance in #build just now:
> > 12:01:22 < nagios> [67] mw32-ix-slave11.build:buildbot is CRITICAL: CRITICAL: 
> >                    python.exe: stopped (critical)
> > 12:06:28 < nagios> mw32-ix-slave11.build:buildbot is OK: OK: python.exe: 1
> 
> jhford tells me this random example I picked happened to be when he was working
> on this slave. 
> 
> 
> 
> Here's another different random example of flapping from this morning's set of
> nagios alerts:
> 
> Host: try-mac-slave40.build
> State: CRITICAL
> Date/Time: 10-13-2010 19:43:03
> FILE_AGE CRITICAL: /builds/slave/twistd.log is 432149 seconds old and 835444
> bytes
> Host: try-linux-slave04.build
> State: OK
> Date/Time: 10-13-2010 19:43:11
> Additional Info:
> FILE_AGE OK: /builds/slave/twistd.log is 205 seconds old and 346783 bytes

(Sorry for distracting from the point of this bug again, but I want to clarify these.) I suspect they are both legitimate. We have very low load on the 10.5 mac build machines these days, since the universal build change. Linux VMs are also less loaded because of all the ix build machines.
> (Sorry for distracting from the point of this bug again, but I want to clarify
> these). I suspect they are both legitimate. We have very low load on 10.5 mac
> build machines these days, since the universal build change. Linux VMs are also
> less loaded beacuse of all the ix build machines.

Time to think about what to do with these underused machines?
Assignee: nobody → joduinn
Priority: -- → P3
Another example of flapping nagios alerts: 

04:16 < nagios> [82] moz2-win32-slave12.build:disk - C is WARNING: WARNING: C:: Total: 9.75G - Used: 9.29G (95%) - Free: 477M (5%)  warning
04:56 < nagios> moz2-win32-slave12.build:disk - C is OK: OK: All drives within bounds.
09:09 < nagios> [28] moz2-win32-slave12.build:disk - C is WARNING: WARNING: C:: Total: 9.75G - Used: 9.32G (95%) - Free: 444M (5%)  warning
10:19 < nagios> moz2-win32-slave12.build:disk - C is OK: OK: All drives within bounds.
10:41 < nagios> [78] moz2-win32-slave12.build:disk - C is WARNING: WARNING: C:: Total: 9.75G - Used: 9.34G (95%) - Free: 420M (5%)  warning
12:41 < nagios> [27] moz2-win32-slave12.build:disk - C is WARNING: WARNING: C:: Total: 9.75G - Used: 9.29G (95%) - Free: 477M (5%)  warning
14:41 < nagios> [67] moz2-win32-slave12.build:disk - C is WARNING: WARNING: C:: Total: 9.75G - Used: 9.29G (95%) - Free: 477M (5%)  warning
16:41 < nagios> [90] moz2-win32-slave12.build:disk - C is WARNING: WARNING: C:: Total: 9.75G - Used: 9.29G (95%) - Free: 476M (5%)  warning
18:41 < nagios> [18] moz2-win32-slave12.build:disk - C is WARNING: WARNING: C:: Total: 9.75G - Used: 9.29G (95%) - Free: 476M (5%)  warning
I suspect this will boil down to the TEMP dir on C:\ gradually filling up with test files; it keeps going over the warning threshold as files are created and deleted. So it's flapping, but still valid.
(In reply to comment #9)
> I suspect this will boil down to TEMP dir on C:\ gradually being filled up with
> tests, and it keeps going over the threshold to warn as files are created and
> deleted. So it's flapping but still valid.

...or we could change the threshold so it no longer flaps?
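
(If we did go the threshold route, and assuming the C: check is the NSClient++ check_nt USEDDISKSPACE check with the stock check_nt command definition - a guess, the real check command may differ - the change is just a couple of percentage points. A minimal sketch with made-up numbers:)

  define service {
      use                  generic-service
      host_name            moz2-win32-slave12.build
      service_description  disk - C
      ; warn at 97% used and go critical at 99%, instead of warning at 95%
      check_command        check_nt!USEDDISKSPACE!-l c -w 97 -c 99
  }
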
(In reply to comment #10)
> (In reply to comment #9)
> > I suspect this will boil down to TEMP dir on C:\ gradually being filled up with
> > tests, and it keeps going over the threshold to warn as files are created and
> > deleted. So it's flapping but still valid.
> 
> ...or we could change the threshold so it no longer flaps?

It would be good to change the threshold until we run unit tests on the minis, but it's worth noting that bug 596852 will keep slaves with more free space most of the time.
(In reply to comment #10)
> ...or we could change the threshold so it no longer flaps?

We should fix the underlying issue rather than defer it, since that also helps with overall system performance. There was 1.1GB of cruft in C:\Documents and Settings\Local Settings\Temp from tests and builds. Filed bug 605379 for the long-term fix.
(In reply to comment #12)
> (In reply to comment #10)
> > ...or we could change the threshold so it no longer flaps?
> 
> We should fix the underlying issue rather than deferring it until later, since
> that also helps with overall system performance. There was 1.1GB of cruft in
> C:\Documents and Settings\Local Settings\Temp from tests and builds. Filed bug
> 605379 for long term fix.

Agreed, and thanks for that, Nick.
Depends on: 605379
Another example of flapping nagios alerts: 

00:10 < nagios> [34] moz2-linux64-slave03.build:root partition is WARNING: DISK WARNING - free space: / 492 MB (6% inode=88%):
01:20 < nagios> moz2-linux64-slave03.build:root partition is OK: DISK OK - free space: / 509 MB (6% inode=88%):
01:42 < nagios> [48] moz2-linux64-slave03.build:root partition is WARNING: DISK WARNING - free space: / 472 MB (6% inode=88%):
02:12 < nagios> moz2-linux64-slave03.build:root partition is OK: DISK OK - free space: / 508 MB (6% inode=88%):
02:24 < nagios> [58] moz2-linux64-slave03.build:root partition is WARNING: DISK WARNING - free space: / 488 MB (6% inode=88%):
02:34 < nagios> moz2-linux64-slave03.build:root partition is OK: DISK OK - free space: / 508 MB (6% inode=88%):
02:46 < nagios> [61] moz2-linux64-slave03.build:root partition is WARNING: DISK WARNING - free space: / 464 MB (6% inode=88%):
02:56 < nagios> moz2-linux64-slave03.build:root partition is OK: DISK OK - free space: / 508 MB (6% inode=88%):
03:18 < nagios> [69] moz2-linux64-slave03.build:root partition is WARNING: DISK WARNING - free space: / 487 MB (6% inode=88%):
03:48 < nagios> moz2-linux64-slave03.build:root partition is OK: DISK OK - free space: / 506 MB (6% inode=88%):
07:20 < nagios> [50] moz2-linux64-slave03.build:root partition is WARNING: DISK WARNING - free space: / 500 MB (6% inode=88%):
07:30 < nagios> moz2-linux64-slave03.build:root partition is OK: DISK OK - free space: / 519 MB (6% inode=88%):
07:42 < nagios> [67] moz2-linux64-slave03.build:root partition is WARNING: DISK WARNING - free space: / 499 MB (6% inode=88%):
07:52 < nagios> moz2-linux64-slave03.build:root partition is OK: DISK OK - free space: / 518 MB (6% inode=88%):
That's from a large file at ~/.mozilla/firefox/console.log, bug 603238.
Between Friday night and now (Sunday lunchtime), I got 619 nagios messages. Too many to manually parse through and figure out if there is a real problem anywhere.
From IRC discussions, it's possible that nagios will be replaced by ganglia at some point. If that's true, great, but the new ganglia configs need to not be noisy.

If ganglia is not happening soon (for some definition of soon!), then the nagios configs should be audited and fixed.

We've had some weeks with bad wait times purely because a lot of slaves were down and we couldn't see the valid nagios alerts in all the noise from the false ones.

Either way, this is IT's territory, so I'm kicking it over to zandr after talking with him on irc.
Assignee: joduinn → zandr
Component: Release Engineering → Server Operations
QA Contact: release → mrz
ganglia replaces munin, not nagios.

We might entertain the idea of using something other than nagios for alerting.

Also bringing shyam on, since he volunteered for this project.
(In reply to comment #18)
> ganglia replaces munin, not nagios.
duh. of course. 

> We might entertain the idea of using something other than nagios for alerting.
ok, if you feel that's best, I'll follow your lead. I care less about what tool we use and more about the accuracy of the alerts from that tool. As far as I can see, most (all?) of the noise here comes from incorrectly written/designed alerts set up in nagios, not from bugs in nagios itself.

Before we go down the path of switching tools, would an audit of nagios alerts as-currently-written be a useful starting point? 



> Also bringing shyam on, since he volunteered for this project.
Nice! :-)
(In reply to comment #19)

> Before we go down the path of switching tools, would an audit of nagios alerts
> as-currently-written be a useful starting point? 

I think it's the only thing to be done. I spent a couple of hours last night walking through every build-related alert on this page:

https://nagios.mozilla.org/nagios/cgi-bin/status.cgi?host=all&servicestatustypes=28&hoststatustypes=15&serviceprops=270346

Every one of them is legit.

The 'flapping' disk alerts you call out in this bug are over a period of hours, and come from systems running close to the edge. I don't think you want to slow down the alerting to the point that things have to be broken for days before you get an alert.

The root cause here is not that nagios is broken/flapping/noisy. The root cause is that nagios has a lot to say because there are a lot of broken systems.
AIUI, nagios is doing more-or-less the right thing.  If there's flapping, maybe we need to tweak some thresholds, but let's look at particular cases of that in their own bugs.

I think that we should redirect this effort to a way to synthesize everything we know about releng systems into a single diagnosis, along with current status and maybe even some automated interventions?

That could take data from nagios, munin, slave alloc, puppet, buildmasters, inventory, and maybe even bugzilla (all via pulse of course), present it in one place, and perform some basic correlation analysis to diagnose common problems (hung slave, etc.).

I realize this is very vague right now, but it's the direction I think we should be heading.  Nagios does not do this sort of correlation, nor does it have all of the information required to draw accurate conclusions.
Now that I've been sitting on release@ for a while and watching the alerts go by, it's obvious what's going on.

Nagios is designed around the notion that things are expected to be running. If something goes CRITICAL, then it's CRITICAL and should be fixed.

If it's neither fixed nor acknowledged, then it will alert every two hours until one of those two things happens.

So either a) ack stuff, or b) we need to separate nagios and not have the build instance/domain continue to alert while stuff lingers around broken.
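
(For reference, the two-hour re-alert is just notification_interval on the service, and acknowledging a problem suppresses further notifications until the next state change. A minimal sketch of the relevant directive, assuming a shared service template with a made-up name:)

  define service {
      name                   releng-service   ; hypothetical shared template
      register               0
      notification_interval  120   ; re-notify every 2 hours while unresolved
      ; setting this to 0 would notify only once per problem, if we prefer (b)
  }
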
Is the re-alerting configuration per-nagios-master? I haven't set up Nagios in almost 5 years.
I'm told that yes, it is per-master.
I'm wrong, it can be configured per-host.

But we should really talk about what you expect nagios to be doing for you. If the solution to "I can't tell when stuff is down" is "Don't tell me when stuff is down", I think we're doing it wrong.
Yes, this should definitely be a conversation.  I'll try to summarize my feelings on the matter as follows: RelEng would like to approach most (that is, including slaves, excluding masters) infrastructure management on a "polling" basis, rather than "event-driven".  Interrupting everyone for every event - particularly twice (#build and release@) - and particularly when multiple events tend to be highly correlated - is overkill.  Even interrupting the buildduty person with that much information is probably too much.

Alerts for critical meta-events, e.g., >20% of a particular slave silo down, would be good -- and good for the whole team to hear, not just buildduty.

Nagios's web interface is not the worst thing ever, and maybe that's a good place to start, but ideally we'd have a system that integrates lots of sources of information about slaves and has big fat buttons to take care of the 80% tasks.
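
(The >20% of a slave silo idea could probably be prototyped with the stock check_cluster plugin, if it's available on the master. Everything below - command name, host names, silo membership and thresholds - is made up for illustration:)

  define command {
      command_name  check_silo
      command_line  $USER1$/check_cluster -h -l $ARG1$ -w $ARG2$ -c $ARG3$ -d $ARG4$
  }

  # Warn when more than 2 silo members are down, go critical above 4
  # (standard range syntax); the -d list would enumerate every slave in the
  # silo via on-demand macros.
  define service {
      use                  generic-service
      host_name            bm-admin01             ; assumed monitoring host
      service_description  moz2-linux64 silo
      check_command        check_silo!linux64!2!4!$HOSTSTATEID:moz2-linux64-slave03.build$,$HOSTSTATEID:moz2-linux64-slave04.build$
  }
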
Depends on: 623619
Depends on: 603238
Depends on: 623748
Depends on: 623761
I have spent most of the day being a sysadmin.
That is to say, looking at alerts, and either fixing stuff or filing bugs and acking alerts.

At present, there are three unacked alerts, and you should assume that anything you hear from nagios is real and current.
Depends on: 620948
Depends on: 623821
Depends on: 623828
Alias: releng-nagios
I'm running into a lot of confusing stuff with nagios that's making it hard to figure out what's going on.

1. There are a lot of old comments and enabled/disabled checks and notifications in nagios. I should probably run through host by host and just clean that out.

2. The IRC bot misses a lot of "OK" alerts that would counterbalance e.g., "PING CRITICAL" alerts, so it's hard to tell from IRC when systems come back up.

3. Forcing a check from the web interface generates a NRPE request from a different host than the original - one that's not in our allowed_hosts variable.

I'm not unfamiliar with Nagios, but I may need a quick tour of how best to use it at Mozilla.
(In reply to comment #28)
> I'm running into a lot of confusing stuff with nagios that's making it hard to
> figure out what's going on.
> 
> 1. There's a lot of old comments and enabled/disabled checks and notifications
> in nagios.  I should probably run through host by host and just clean that out.
> 

Yes, please clean stuff out and file bugs to have checks removed that aren't needed. No point in having stuff permanently ack'd.

> 2. The IRC bot misses a lot of "OK" alerts that would counterbalance e.g.,
> "PING CRITICAL" alerts, so it's hard to tell from IRC when systems come back
> up.
> 

This can happen when a host is flapping. I'm not sure of an easy way to fix it. Perhaps setting the notification options for those checks to include flapping alerts would help, but the bot and/or contactgroups would likely need that notification option added as well. Justdave might know more on this. CC'ing him for comment.
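
(A minimal sketch of what that might look like, assuming the IRC bot is a regular nagios contact; the template and contact names are made up, and other required contact directives are omitted:)

  define service {
      name                  releng-service   ; hypothetical shared template
      register              0
      ; send warning, unknown, critical, recovery and flapping notifications
      notification_options  w,u,c,r,f
  }

  define contact {
      contact_name                   irc-bot    ; made-up contact name
      service_notification_options   w,u,c,r,f
      host_notification_options      d,u,r,f
  }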

> 3. Forcing a check from the web interface generates a NRPE request from a
> different host than the original - one that's not in our allowed_hosts
> variable.
> 
> I'm not unfamiliar with Nagios, but I may need a quick tour of how best to use
> it at Mozilla.

I haven't been able to figure out how to make this work either. The web interface runs on dm-nagios01; the service checks happen on bm-admin01. I'm not sure if there is a web interface on that box, but that would likely be where to do it. dm-nagios01 is likely firewalled from those hosts anyway. Perhaps a feature request for the bot to do rechecks could be filed; it would require a script to be written and called remotely from dm-nagios01. Also, maybe justdave can help here.
(In reply to comment #29)
 
> I haven't been able to figure out how to make this work either. The web
> interface runs on dm-nagios01, the service checks happen on bm-admin01.

Oh, right, this makes sense. dm-nagios01 is all passive; the checks are just being reported up. This makes manual triggering hard. Another point for tearing out RelEng nagios?
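
(For reference, allowed_hosts lives in nrpe.cfg on each slave and is just a comma-separated address list, so if forced checks ever did need to come from a second active master, the fix would be a one-line edit like the sketch below, with made-up addresses. Given that dm-nagios01 is passive-only, it probably doesn't apply here.)

  # nrpe.cfg on the slave: bm-admin01 plus a second master (addresses made up)
  allowed_hosts=10.12.48.14,10.12.48.15
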
Depends on: 625474
Depends on: 625867
Depends on: 629511
Assignee: zandr → arich
Component: Server Operations → Server Operations: RelEng
QA Contact: mrz → zandr
I think I've made all the progress I can on the bugs that this bug is tracking (barring more information). I'll leave it open to keep tracking the other bugs, but I'm looking for verification that nagios is now at the appropriate level of noisiness.
Status: NEW → ASSIGNED
No longer depends on: 625978
Removing bug 627039 from deps since it's a new monitoring task that we will take care of as the win64 builds become monitorable (it's not ready for the check yet).

Removing bug 626879 and bug 627126 as they are not part of this cleanup, but are part of the important project of detecting hung slaves (especially in talos).
No longer depends on: 626879, 627039, 627126
No longer depends on: 625474
Removed bug 625474 as a dependency since that adds new functionality. If necessary, we can revisit it and reopen this bug.
Status: ASSIGNED → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
Component: Server Operations: RelEng → RelOps
Product: mozilla.org → Infrastructure & Operations