Closed Bug 603343 (releng-nagios) Opened 14 years ago Closed 13 years ago

[Tracking bug] cleanup nagios configs, so monitoring RelEng systems is less noisy

Categories: Infrastructure & Operations :: RelOps: General, task, P3
Hardware/OS: x86 / All
Tracking: (Not tracked)
Status: RESOLVED FIXED
People: Reporter: joduinn; Assigned: arich

Between Friday night and first thing Monday morning, I got 1094 nagios messages.

It's impossible to see whether there are any *real* problems in the midst of all that noise.

This bug is to track fixing configs to correctly handle how our RelEng systems behave routinely, so whenever we *do* get a nagios alert, it is something we notice!

One example (bug#575472, already fixed) was configuring nagios to treat mobile phones as devices that behave differently from desktop machines, with different time thresholds for reporting errors on reboot - phones run so much slower!
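
(For reference, the kind of change that fix involved is a separate host profile with more forgiving timing. A minimal sketch, assuming Nagios 3 style object configs; the template name, host name, address and numbers below are made up for illustration, not the actual bug 575472 change:)

  # Hypothetical host template for phones: wait much longer before declaring
  # a reboot failed, since phones come back far more slowly than desktops.
  define host {
      name                    releng-mobile-device   ; template only
      use                     generic-host
      register                0
      max_check_attempts      20     ; tolerate many failed pings during a reboot
      check_interval          5      ; minutes between normal checks
      retry_interval          2      ; minutes between rechecks while down
      notification_interval   120    ; re-notify at most every 2 hours
  }

  define host {
      use        releng-mobile-device
      host_name  n900-001.build        ; made-up host name
      address    10.250.48.1           ; made-up address
  }
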
For what it's worth, monitoring using the web page or IRC is much easier than mail, because of all the frequent status changes. For example, this query shows all things that are *currently* FAILing or WARNing:
https://nagios.mozilla.org/nagios/cgi-bin/status.cgi?host=all&servicestatustypes=28&hoststatustypes=15

This one shows things currently FAILing or WARNing that have not been acknowledged. Generally, these are things that currently require attention:
https://nagios.mozilla.org/nagios/cgi-bin/status.cgi?host=all&servicestatustypes=28&hoststatustypes=15&serviceprops=270346


A couple of ideas on how to make things better:
- Sort things on the nagios display. Having sections for all of the platform/slave type combinations would help us find patterns quicker, I think. Having the less-redundant things, such as masters, in their own section would make it easier to find critical problems with them.
(In reply to comment #1)
> For what it's worth, monitoring using the web page or IRC is much easier than
> mail, because of all the frequent status changes. For example, this query shows
> all things that are *currently* FAILing or WARNing:
> https://nagios.mozilla.org/nagios/cgi-bin/status.cgi?host=all&servicestatustypes=28&hoststatustypes=15
> 
> This one shows things currently FAILing or WARNing that have not been
> acknowledged. Generally, this are things that currently require attention:
> https://nagios.mozilla.org/nagios/cgi-bin/status.cgi?host=all&servicestatustypes=28&hoststatustypes=15&serviceprops=270346

Actually, that misses the point. 

Yes, you can skip some of the nagios noise in email/irc. Yes, looking at the nagios master shows what is *currently* failing, but those errors still include a bunch of noisy/flapping alerts. A human then has to weed through them to figure out which alerts are real and which are noise.

Having noisy/flapping nagios is the problem. The real fix is to debug the nagios thresholds and configs so that all (or at least the vast majority) of nagios alerts are actually real valid problems. This bug is to track identifying and fixing those noisy/flapping nagios configs.
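
(One concrete knob worth auditing while doing that is Nagios's own flap detection, which holds notifications while a check's state is oscillating. A minimal sketch, assuming it isn't already enabled for these checks; the template name and thresholds are illustrative, not taken from the real config:)

  # nagios.cfg (global switch)
  enable_flap_detection=1

  # Shared service template used by the noisy checks (hypothetical name)
  define service {
      name                    releng-service
      register                0
      flap_detection_enabled  1
      low_flap_threshold      10.0   ; % state change below which flapping ends
      high_flap_threshold     25.0   ; % state change above which flapping starts
      flap_detection_options  o,w,c,u
  }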



> A couple ideas on how to make things better
> - Sort things on the nagios display. Having sections for all of the
> platform/slave type combinations would help us find patterns quicker, I think.
> Having the less redundant things such as masters in their own section would
> make it easier to find critical problems with them.

I guess that might be helpful, but it feels like a separate issue.
OS: Mac OS X → All
See Also: → 589006
(In reply to comment #2)
> (In reply to comment #1)
> > For what it's worth, monitoring using the web page or IRC is much easier than
> > mail, because of all the frequent status changes. For example, this query shows
> > all things that are *currently* FAILing or WARNing:
> > https://nagios.mozilla.org/nagios/cgi-bin/status.cgi?host=all&servicestatustypes=28&hoststatustypes=15
> > 
> > This one shows things currently FAILing or WARNing that have not been
> > acknowledged. Generally, this are things that currently require attention:
> > https://nagios.mozilla.org/nagios/cgi-bin/status.cgi?host=all&servicestatustypes=28&hoststatustypes=15&serviceprops=270346
> 
> Actually, that misses the point. 

I don't think it entirely misses the point -- it's still a ton easier than matching up errors and recoveries in e-mail.

> Yes, you can skip some of the nagios noise in email/irc. Yes, looking at the
> nagios master shows what is *currently* failing, but those errors still
> includes a bunch of noisy/flapping alerts.

In my experience, there are very few flapping alerts. Which ones have you noticed to be flapping?

> > A couple ideas on how to make things better
> > - Sort things on the nagios display. Having sections for all of the
> > platform/slave type combinations would help us find patterns quicker, I think.
> > Having the less redundant things such as masters in their own section would
> > make it easier to find critical problems with them.
> 
> I guess that might be helpful, but it feels like a separate issue.

OK, I'll file it separately; I didn't realize this bug was limited to your original idea.
(In reply to comment #3)
> (In reply to comment #2)
> > (In reply to comment #1)
> > > For what it's worth, monitoring using the web page or IRC is much easier than
> > > mail, because of all the frequent status changes. For example, this query shows
> > > all things that are *currently* FAILing or WARNing:
> > > https://nagios.mozilla.org/nagios/cgi-bin/status.cgi?host=all&servicestatustypes=28&hoststatustypes=15
> > > 
> > > This one shows things currently FAILing or WARNing that have not been
> > > acknowledged. Generally, this are things that currently require attention:
> > > https://nagios.mozilla.org/nagios/cgi-bin/status.cgi?host=all&servicestatustypes=28&hoststatustypes=15&serviceprops=270346
> > 
> > Actually, that misses the point. 
> I don't think it entirely misses the point -- it's still a ton easier than
> matching up errors and recoveries in e-mail.
Organizing the errors is helpful, of course, but the topic here is the root problem: eliminating the nagios noise.


> > Yes, you can skip some of the nagios noise in email/irc. Yes, looking at the
> > nagios master shows what is *currently* failing, but those errors still
> > includes a bunch of noisy/flapping alerts.
> 
> In my experience, there's very few flapping alerts. Which ones have you noticed
> to be flapping?

Not a complete list, but from a quick glance in #build just now:
12:01:22 < nagios> [67] mw32-ix-slave11.build:buildbot is CRITICAL: CRITICAL: 
                   python.exe: stopped (critical)
12:06:28 < nagios> mw32-ix-slave11.build:buildbot is OK: OK: python.exe: 1
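
(As an aside, for process checks like this one the standard knob is to require several consecutive failures before a notification goes out, so a short restart never reaches a HARD state. A minimal sketch, assuming a Nagios 3 service definition; the check command name is a guess, not the real config:)

  define service {
      use                   generic-service
      host_name             mw32-ix-slave11.build
      service_description   buildbot
      check_command         check_nrpe!check_buildbot   ; assumed command name
      check_interval        5     ; minutes between normal checks
      retry_interval        2     ; recheck quickly while in a SOFT state
      max_check_attempts    4     ; ~6 minutes of failures before alerting
  }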




> > > A couple ideas on how to make things better
> > > - Sort things on the nagios display. Having sections for all of the
> > > platform/slave type combinations would help us find patterns quicker, I think.
> > > Having the less redundant things such as masters in their own section would
> > > make it easier to find critical problems with them.
> > 
> > I guess that might be helpful, but it feels like a separate issue.
> 
> OK, I'll file it separately, I didn't realize this bug was limited to your
> original idea.
I've linked your bug, thanks.
Depends on: 603684
(In reply to comment #4)
> (In reply to comment #3)
> > (In reply to comment #2)
> > > (In reply to comment #1)
> > In my experience, there's very few flapping alerts. Which ones have you noticed
> > to be flapping?
> 
> Not complete list, but from quick glance in #build just now:
> 12:01:22 < nagios> [67] mw32-ix-slave11.build:buildbot is CRITICAL: CRITICAL: 
>                    python.exe: stopped (critical)
> 12:06:28 < nagios> mw32-ix-slave11.build:buildbot is OK: OK: python.exe: 1

jhford tells me this random example I picked happened to be from when he was working on this slave.



Here's another random example of flapping, from this morning's set of nagios alerts:

Host: try-mac-slave40.build
State: CRITICAL
Date/Time: 10-13-2010 19:43:03
FILE_AGE CRITICAL: /builds/slave/twistd.log is 432149 seconds old and 835444 bytes

Host: try-linux-slave04.build
State: OK
Date/Time: 10-13-2010 19:43:11
Additional Info:
FILE_AGE OK: /builds/slave/twistd.log is 205 seconds old and 346783 bytes
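
(For what it's worth, the FILE_AGE checks above look like the stock check_file_age plugin run on the slave, and the thresholds are exactly what would need auditing per slave class. A minimal sketch of the slave-side NRPE command, with an assumed command name, plugin path and thresholds:)

  # nrpe.cfg on the slave: warn if twistd.log hasn't been written to in 6 hours,
  # go critical after 24 hours
  command[check_twistd_log]=/usr/lib/nagios/plugins/check_file_age -w 21600 -c 86400 -f /builds/slave/twistd.log
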
(In reply to comment #5)
> (In reply to comment #4)
> > (In reply to comment #3)
> > > (In reply to comment #2)
> > > > (In reply to comment #1)
> > > In my experience, there's very few flapping alerts. Which ones have you noticed
> > > to be flapping?
> > 
> > Not complete list, but from quick glance in #build just now:
> > 12:01:22 < nagios> [67] mw32-ix-slave11.build:buildbot is CRITICAL: CRITICAL: 
> >                    python.exe: stopped (critical)
> > 12:06:28 < nagios> mw32-ix-slave11.build:buildbot is OK: OK: python.exe: 1
> 
> jhford tells me this random example I picked happened to be when he was working
> on this slave. 
> 
> 
> 
> Here's another different random example of flapping from this morning's set of
> nagios alerts:
> 
> Host: try-mac-slave40.build
> State: CRITICAL
> Date/Time: 10-13-2010 19:43:03
> FILE_AGE CRITICAL: /builds/slave/twistd.log is 432149 seconds old and 835444
> bytes
> Host: try-linux-slave04.build
> State: OK
> Date/Time: 10-13-2010 19:43:11
> Additional Info:
> FILE_AGE OK: /builds/slave/twistd.log is 205 seconds old and 346783 bytes

(Sorry for distracting from the point of this bug again, but I want to clarify these.) I suspect they are both legitimate. We have very low load on the 10.5 mac build machines these days, since the universal build change. Linux VMs are also less loaded because of all the ix build machines.
> (Sorry for distracting from the point of this bug again, but I want to clarify
> these). I suspect they are both legitimate. We have very low load on 10.5 mac
> build machines these days, since the universal build change. Linux VMs are also
> less loaded beacuse of all the ix build machines.

Time to think about what to do with these underused machines?
Assignee: nobody → joduinn
Priority: -- → P3
Another example of flapping nagios alerts: 

04:16 < nagios> [82] moz2-win32-slave12.build:disk - C is WARNING: WARNING: C:: Total: 9.75G - Used: 9.29G (95%) - Free: 477M (5%)  warning
04:56 < nagios> moz2-win32-slave12.build:disk - C is OK: OK: All drives within bounds.
09:09 < nagios> [28] moz2-win32-slave12.build:disk - C is WARNING: WARNING: C:: Total: 9.75G - Used: 9.32G (95%) - Free: 444M (5%)  warning
10:19 < nagios> moz2-win32-slave12.build:disk - C is OK: OK: All drives within bounds.
10:41 < nagios> [78] moz2-win32-slave12.build:disk - C is WARNING: WARNING: C:: Total: 9.75G - Used: 9.34G (95%) - Free: 420M (5%)  warning
12:41 < nagios> [27] moz2-win32-slave12.build:disk - C is WARNING: WARNING: C:: Total: 9.75G - Used: 9.29G (95%) - Free: 477M (5%)  warning
14:41 < nagios> [67] moz2-win32-slave12.build:disk - C is WARNING: WARNING: C:: Total: 9.75G - Used: 9.29G (95%) - Free: 477M (5%)  warning
16:41 < nagios> [90] moz2-win32-slave12.build:disk - C is WARNING: WARNING: C:: Total: 9.75G - Used: 9.29G (95%) - Free: 476M (5%)  warning
18:41 < nagios> [18] moz2-win32-slave12.build:disk - C is WARNING: WARNING: C:: Total: 9.75G - Used: 9.29G (95%) - Free: 476M (5%)  warning
I suspect this will boil down to the TEMP dir on C:\ gradually filling up with test files; it keeps going over the warning threshold as files are created and deleted. So it's flapping, but still valid.
(In reply to comment #9)
> I suspect this will boil down to TEMP dir on C:\ gradually being filled up with
> tests, and it keeps going over the threshold to warn as files are created and
> deleted. So it's flapping but still valid.

...or we could change the threshold so it no longer flaps?
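
(If we did go the threshold route, and assuming the C: check is the NSClient++ check_nt USEDDISKSPACE check with the stock check_nt command definition - a guess, the real check command may differ - the change is just a couple of percentage points. A minimal sketch with made-up numbers:)

  define service {
      use                  generic-service
      host_name            moz2-win32-slave12.build
      service_description  disk - C
      ; warn at 97% used and go critical at 99%, instead of warning at 95%
      check_command        check_nt!USEDDISKSPACE!-l c -w 97 -c 99
  }
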
(In reply to comment #10)
> (In reply to comment #9)
> > I suspect this will boil down to TEMP dir on C:\ gradually being filled up with
> > tests, and it keeps going over the threshold to warn as files are created and
> > deleted. So it's flapping but still valid.
> 
> ...or we could change the threshold so it no longer flaps?

It would be good to change the threshold until we run unit tests on the minis, but it's worth noting that bug 596852 will keep slaves with more free space most of the time.
(In reply to comment #10)
> ...or we could change the threshold so it no longer flaps?

We should fix the underlying issue rather than defer it, since that also helps with overall system performance. There was 1.1GB of cruft in C:\Documents and Settings\Local Settings\Temp from tests and builds. Filed bug 605379 for the long-term fix.
(In reply to comment #12)
> (In reply to comment #10)
> > ...or we could change the threshold so it no longer flaps?
> 
> We should fix the underlying issue rather than deferring it until later, since
> that also helps with overall system performance. There was 1.1GB of cruft in
> C:\Documents and Settings\Local Settings\Temp from tests and builds. Filed bug
> 605379 for long term fix.

Agreed, and thanks for that, Nick.
Depends on: 605379
Another example of flapping nagios alerts: 

00:10 < nagios> [34] moz2-linux64-slave03.build:root partition is WARNING: DISK WARNING - free space: / 492 MB (6% inode=88%):
01:20 < nagios> moz2-linux64-slave03.build:root partition is OK: DISK OK - free space: / 509 MB (6% inode=88%):
01:42 < nagios> [48] moz2-linux64-slave03.build:root partition is WARNING: DISK WARNING - free space: / 472 MB (6% inode=88%):
02:12 < nagios> moz2-linux64-slave03.build:root partition is OK: DISK OK - free space: / 508 MB (6% inode=88%):
02:24 < nagios> [58] moz2-linux64-slave03.build:root partition is WARNING: DISK WARNING - free space: / 488 MB (6% inode=88%):
02:34 < nagios> moz2-linux64-slave03.build:root partition is OK: DISK OK - free space: / 508 MB (6% inode=88%):
02:46 < nagios> [61] moz2-linux64-slave03.build:root partition is WARNING: DISK WARNING - free space: / 464 MB (6% inode=88%):
02:56 < nagios> moz2-linux64-slave03.build:root partition is OK: DISK OK - free space: / 508 MB (6% inode=88%):
03:18 < nagios> [69] moz2-linux64-slave03.build:root partition is WARNING: DISK WARNING - free space: / 487 MB (6% inode=88%):
03:48 < nagios> moz2-linux64-slave03.build:root partition is OK: DISK OK - free space: / 506 MB (6% inode=88%):
07:20 < nagios> [50] moz2-linux64-slave03.build:root partition is WARNING: DISK WARNING - free space: / 500 MB (6% inode=88%):
07:30 < nagios> moz2-linux64-slave03.build:root partition is OK: DISK OK - free space: / 519 MB (6% inode=88%):
07:42 < nagios> [67] moz2-linux64-slave03.build:root partition is WARNING: DISK WARNING - free space: / 499 MB (6% inode=88%):
07:52 < nagios> moz2-linux64-slave03.build:root partition is OK: DISK OK - free space: / 518 MB (6% inode=88%):
That's from a large file at ~/.mozilla/firefox/console.log, bug 603238.
Between Friday night and now (Sunday lunchtime), I got 619 nagios messages. Too many to manually parse through and figure out if there is a real problem anywhere.
From IRC discussions, it's possible that nagios will be replaced by ganglia at some point. If that's true, great, but the new ganglia configs need to not be noisy.

If ganglia is not happening soon (for some definition of soon!), then the nagios configs should be audited and fixed.

We've had some weeks with bad wait times purely because a lot of slaves were down and we couldn't see the valid nagios alerts in all the noise from the false ones.

Either way, this is IT's territory, so I'm kicking it over to zandr after talking with him on irc.
Assignee: joduinn → zandr
Component: Release Engineering → Server Operations
QA Contact: release → mrz
ganglia replaces munin, not nagios.

We might entertain the idea of using something other than nagios for alerting.

Also bringing shyam on, since he volunteered for this project.
(In reply to comment #18)
> ganglia replaces munin, not nagios.
duh. of course. 

> We might entertain the idea of using something other than nagios for alerting.
ok, if you feel that's best, I'll follow your lead. I care less about what tool we use and more about the accuracy of the alerts from that tool. As far as I can see, most (all?) of the noise here comes from incorrectly written/designed alerts set up in nagios, not from bugs in nagios itself.

Before we go down the path of switching tools, would an audit of nagios alerts as-currently-written be a useful starting point? 



> Also bringing shyam on, since he volunteered for this project.
Nice! :-)
(In reply to comment #19)

> Before we go down the path of switching tools, would an audit of nagios alerts
> as-currently-written be a useful starting point? 

I think it's the only thing to be done. I spent a couple of hours last night walking through every build-related alert on this page:

https://nagios.mozilla.org/nagios/cgi-bin/status.cgi?host=all&servicestatustypes=28&hoststatustypes=15&serviceprops=270346

Every one of them is legit.

The 'flapping' disk alerts you call out in this bug are over a period of hours, and come from systems running close to the edge. I don't think you want to slow down the alerting to the point that things have to be broken for days before you get an alert.

The root cause here is not that nagios is broken/flapping/noisy. The root cause is that nagios has a lot to say because there are a lot of broken systems.
AIUI, nagios is doing more-or-less the right thing.  If there's flapping, maybe we need to tweak some thresholds, but let's look at particular cases of that in their own bugs.

I think that we should redirect this effort to a way to synthesize everything we know about releng systems into a single diagnosis, along with current status and maybe even some automated interventions?

That could take data from nagios, munin, slave alloc, puppet, buildmasters, inventory, and maybe even bugzilla (all via pulse of course), present it in one place, and perform some basic correlation analysis to diagnose common problems (hung slave, etc.).

I realize this is very vague right now, but it's the direction I think we should be heading.  Nagios does not do this sort of correlation, nor does it have all of the information required to draw accurate conclusions.
Now that I've been sitting on release@ for a while and watching the alerts go by, it's obvious what's going on.

Nagios is designed around the notion that things are expected to be running. If something goes CRITICAL, then it's CRITICAL and should be fixed.

If it's neither fixed nor acknowledged, then it will alert every two hours until one of those two things happens.

So either a) ack stuff, or b) we need to separate nagios and not have the build instance/domain continue to alert while stuff lingers around broken.
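
(For reference, the two-hour re-alert is just notification_interval on the service, and acknowledging a problem suppresses further notifications until the next state change. A minimal sketch of the relevant directive, assuming a shared service template with a made-up name:)

  define service {
      name                   releng-service   ; hypothetical shared template
      register               0
      notification_interval  120   ; re-notify every 2 hours while unresolved
      ; setting this to 0 would notify only once per problem, if we prefer (b)
  }
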
Is the re-alerting configuration per-nagios-master? I haven't set up Nagios in almost 5 years.
I'm told that yes, it is per-master.
I'm wrong, it can be configured per-host.

But we should really talk about what you expect nagios to be doing for you. If the solution to "I can't tell when stuff is down" is "Don't tell me when stuff is down", I think we're doing it wrong.
Yes, this should definitely be a conversation.  I'll try to summarize my feelings on the matter as follows: RelEng would like to approach most (that is, including slaves, excluding masters) infrastructure management on a "polling" basis, rather than "event-driven".  Interrupting everyone for every event - particularly twice (#build and release@) - and particularly when multiple events tend to be highly correlated - is overkill.  Even interrupting the buildduty person with that much information is probably too much.

Alerts for critical meta-events, e.g., >20% of a particular slave silo down, would be good -- and good for the whole team to hear, not just buildduty.

Nagios's web interface is not the worst thing ever, and maybe that's a good place to start, but ideally we'd have a system that integrates lots of sources of information about slaves and has big fat buttons to take care of the 80% tasks.
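
(The >20% of a slave silo idea could probably be prototyped with the stock check_cluster plugin, if it's available on the master. Everything below - command name, host names, silo membership and thresholds - is made up for illustration:)

  define command {
      command_name  check_silo
      command_line  $USER1$/check_cluster -h -l $ARG1$ -w $ARG2$ -c $ARG3$ -d $ARG4$
  }

  # Warn when more than 2 silo members are down, go critical above 4
  # (standard range syntax); the -d list would enumerate every slave in the
  # silo via on-demand macros.
  define service {
      use                  generic-service
      host_name            bm-admin01             ; assumed monitoring host
      service_description  moz2-linux64 silo
      check_command        check_silo!linux64!2!4!$HOSTSTATEID:moz2-linux64-slave03.build$,$HOSTSTATEID:moz2-linux64-slave04.build$
  }
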
Depends on: 623619
Depends on: 603238
Depends on: 623748
Depends on: 623761
I have spent most of the day being a sysadmin.
That is to say, looking at alerts, and either fixing stuff or filing bugs and acking alerts.

At present, there are three unacked alerts, and you should assume that anything you hear from nagios is real and current.
Depends on: 620948
Depends on: 623821
Depends on: 623828
Alias: releng-nagios
I'm running into a lot of confusing stuff with nagios that's making it hard to figure out what's going on.

1. There are a lot of old comments and enabled/disabled checks and notifications in nagios. I should probably run through host by host and just clean that out.

2. The IRC bot misses a lot of "OK" alerts that would counterbalance e.g., "PING CRITICAL" alerts, so it's hard to tell from IRC when systems come back up.

3. Forcing a check from the web interface generates a NRPE request from a different host than the original - one that's not in our allowed_hosts variable.

I'm not unfamiliar with Nagios, but I may need a quick tour of how best to use it at Mozilla.
(In reply to comment #28)
> I'm running into a lot of confusing stuff with nagios that's making it hard to
> figure out what's going on.
> 
> 1. There's a lot of old comments and enabled/disabled checks and notifications
> in nagios.  I should probably run through host by host and just clean that out.
> 

Yes, please clean stuff out and file bugs to have checks removed that aren't needed. No point in having stuff permanently ack'd.

> 2. The IRC bot misses a lot of "OK" alerts that would counterbalance e.g.,
> "PING CRITICAL" alerts, so it's hard to tell from IRC when systems come back
> up.
> 

This can happen when a host is flapping. I'm not sure of an easy way to fix it. Perhaps setting the notification options for those checks to include flapping alerts would help, but the bot and/or contactgroups would likely need that notification option added as well. Justdave might know more on this. CC'ing him for comment.
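
(A minimal sketch of what that might look like, assuming the IRC bot is a regular nagios contact; the template and contact names are made up, and other required contact directives are omitted:)

  define service {
      name                  releng-service   ; hypothetical shared template
      register              0
      ; send warning, unknown, critical, recovery and flapping notifications
      notification_options  w,u,c,r,f
  }

  define contact {
      contact_name                   irc-bot    ; made-up contact name
      service_notification_options   w,u,c,r,f
      host_notification_options      d,u,r,f
  }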

> 3. Forcing a check from the web interface generates a NRPE request from a
> different host than the original - one that's not in our allowed_hosts
> variable.
> 
> I'm not unfamiliar with Nagios, but I may need a quick tour of how best to use
> it at Mozilla.

I haven't been able to figure out how to make this work either. The web interface runs on dm-nagios01; the service checks happen on bm-admin01. I'm not sure if there is a web interface on that box, but that would likely be where to do it. dm-nagios01 is likely firewalled from those hosts anyway. Perhaps a feature request for the bot to do rechecks could be filed; it would require a script to be written and called remotely from dm-nagios01. Also, maybe justdave can help here.
(In reply to comment #29)
 
> I haven't been able to figure out how to make this work either. The web
> interface runs on dm-nagios01, the service checks happen on bm-admin01.

Oh, right, this makes sense. dm-nagios01 is all passive; the checks are just being reported up. This makes manual triggering hard. Another point for tearing out RelEng nagios?
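
(For reference, allowed_hosts lives in nrpe.cfg on each slave and is just a comma-separated address list, so if forced checks ever did need to come from a second active master, the fix would be a one-line edit like the sketch below, with made-up addresses. Given that dm-nagios01 is passive-only, it probably doesn't apply here.)

  # nrpe.cfg on the slave: bm-admin01 plus a second master (addresses made up)
  allowed_hosts=10.12.48.14,10.12.48.15
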
Depends on: 625474
Depends on: 625867
Depends on: 629511
Assignee: zandr → arich
Component: Server Operations → Server Operations: RelEng
QA Contact: mrz → zandr
I think I've made all the progress I can on the bugs that this bug is tracking (barring more information). I'll leave it open to keep tracking the other bugs, but I'm looking for verification that nagios is now at the appropriate level of noisiness.
Status: NEW → ASSIGNED
No longer depends on: 625978
Removing bug 627039 from deps since it's a new monitoring task that we will take care of as the win64 builds become monitorable (it's not ready for the check yet).

Removing bug 626879 and bug 627126 as they are not part of this cleanup, but are part of the important project of detecting hung slaves (especially in talos).
No longer depends on: 626879, 627039, 627126
No longer depends on: 625474
Removed bug 625474 as a dependency since that adds new functionality. If necessary, we can revisit it and reopen this bug.
Status: ASSIGNED → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
Component: Server Operations: RelEng → RelOps
Product: mozilla.org → Infrastructure & Operations