Closed
Bug 624795
Opened 14 years ago
Closed 14 years ago
don't alert for n900's
Categories
(mozilla.org Graveyard :: Server Operations, task)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: dustin, Assigned: arich)
References
Details
We just saw an alert for an n900:
12:50 < nagios> [66] n900-082.build.mtv1:PING is CRITICAL: CRITICAL - Host Unreachable (10.250.51.41)
Rather than being handled on a per-alert basis, these are handled "occasionally" whenever a trip to Haxxor comes up. At that point, whoever's making the trip checks the nagios host display to get the list of machines needing tending.
Is it possible to turn off alerts for these systems?
Reporter
Updated•14 years ago
Assignee: dustin → server-ops
Component: Release Engineering → Server Operations
QA Contact: release → mrz
Comment 1•14 years ago
It is very important that these hosts still show up in the list of critical hosts to figure out which machines need to be reimaged.
Comment 2•14 years ago
Just to clarify, you want checks done with no notification, and you will have to proactively check nagios?
Comment 3•14 years ago
(In reply to comment #2)
> Just to clarify, you want checks done with no notification, and you will have
> to proactively check nagios?
That sounds right.
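For reference, something like the following in the nagios object config would probably do it: keep active checks on so down devices still show red on the host display, but never send notifications. The template and name values below are placeholders, not the real config:

  define host {
      name                    n900-no-notify    ; placeholder template name
      use                     n900-device       ; assumed existing n900 host template
      active_checks_enabled   1                 ; keep pinging the devices
      notifications_enabled   0                 ; never page/IRC/email about them
      register                0                 ; template only, not a real host
  }

Down n900s would then still appear in the critical-hosts list (comment 1), just without alerts.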
Comment 4•14 years ago
I really want to WONTFIX this. Per my previous comments, "Don't tell me what's broken" is not a solution to "I don't know what's broken."
In the absence of a more holistic tool like BorderCollie, I'm very hesitant to disable alerts for broken stuff.
(In reply to comment #0)
> Rather than being handled on a per-alert basis, these are handled
> "occasionally" whenever a trip to Haxxor comes up. At that point, whoever's
> making the trip checks the nagios host display to get the list of machines
> needing tending.
This is no different than any other slave, except that you guys aren't (usually) the ones making the trip.
For all other slaves, we've been filing in the 'reboots' bug, and we handle the reboots 'occasionally' when they reach critical mass or a colo trip comes up.
I don't particularly like that either, but acking the alert with a reboot bug sounds better than not alerting.
Until we have a better 'dashboard', though, I think the alerts are important.
Comment 5•14 years ago
Personally, I'm ok tracking these reboots in a bug, like the others. This bug was filed after finding out that the current action is to ack with "will get to it", which is no better (yet more annoying) than not alerting at all.
Comment 6•14 years ago
(In reply to comment #4)
> I really want to WONTFIX this. Per my previous comments, "Don't tell me what's
> broken" is not a solution to "I don't know what's broken."
>
> In the absence of a more holistic tool like BorderCollie, I'm very hesitant to
> disable alerts for broken stuff.
>
Agreed, we need n900 alerts. These alerts are important to know if we have enough n900s in production to handle checkin load.
We already tweaked nagios to treat these devices as a different class of machines, to allow for slower reboot times on these devices, so maybe these configs need tweaking again... Regardless, we do need to know and react to n900 failures in a timely manner.
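To be concrete, the knobs we'd be loosening are roughly the ones below; the template name and numbers are placeholders, not what's actually deployed:

  define host {
      name                      n900-device   ; placeholder for whatever template the n900s use now
      max_check_attempts        10            ; re-check this many times before the host goes hard DOWN
      retry_interval            5             ; minutes between those re-checks
      first_notification_delay  30            ; minutes a host can sit DOWN before the first notification
      register                  0
  }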
Comment 7•14 years ago
(In reply to comment #6)
> (In reply to comment #4)
> > I really want to WONTFIX this. Per my previous comments, "Don't tell me what's
> > broken" is not a solution to "I don't know what's broken."
> >
> > In the absence of a more holistic tool like BorderCollie, I'm very hesitant to
> > disable alerts for broken stuff.
> >
>
> Agreed, we need n900 alerts. These alerts are important to know if we have
> enough n900s in production to handle checkin load.
>
> We already tweaked nagios to treat these devices as a different class of
> machines, to allow for slower reboot times on these devices, so maybe these
> configs need tweaking again... Regardless, we do need to know and react to n900
> failures in a timely manner.
We're not suggesting we remove those. We're suggesting removing alerts to which we always reply with an ack that doesn't reference a bug. If we're going to ack alerts in such a way, we should not alert in the first place. If we're going to ack and reference something useful, like a bug, the alerts seem slightly more useful.
Regardless, n900 status will be on the nagios web page.
Comment 8•14 years ago
(In reply to comment #7)
> (In reply to comment #6)
> > (In reply to comment #4)
> > > I really want to WONTFIX this. Per my previous comments, "Don't tell me what's
> > > broken" is not a solution to "I don't know what's broken."
> > >
> > > In the absence of a more holistic tool like BorderCollie, I'm very hesitant to
> > > disable alerts for broken stuff.
> > >
> >
> > Agreed, we need n900 alerts. These alerts are important to know if we have
> > enough n900s in production to handle checkin load.
> >
> > We already tweaked nagios to treat these devices as a different class of
> > machines, to allow for slower reboot times on these devices, so maybe these
> > configs need tweaking again... Regardless, we do need to know and react to n900
> > failures in a timely manner.
>
> We're not suggesting we remove those. We're suggesting removing alerts to which
> we always reply with an ack that doesn't reference a bug. If we're going to ack
> alerts in such a way, we should not alert in the first place. If we're going to
> ack and reference something useful, like a bug, they seem slightly more useful.
ok, I think we are saying the same thing here: having an alert that is just mindlessly "ack"ed is no use to anyone. If so, then we are 100% agreed.
Instead of only using the nagios dashboard, for future mobile device alerts like this I'd prefer to treat them like the usual mini/ix reboot alerts, filing a bug to track when we ack an alert. That gives us some trending data when we look back to see how many machines need touching, and how often. In which case, I agree with zandr that we should WONTFIX this.
>
> Regardless, n900 status will be on the nagios web page.
yep, that too.
Comment 9•14 years ago
So, in an idealistic sense, the mobile reboot/reimage bug makes sense.
In practice, having n900 alerts that go to a bug:
a) spam #build
b) spam #build with notifications when acking (can be disabled with extra effort)
c) spam email with notifications when adding to bug
d) waste buildduty time and headspace which is already short
e) end up with an unsorted (except chronologically, which doesn't really help) list in a bug. You can take this list and put it in a file and sort, or you can go to nagios and look at all the down devices and ignore the bug completely.
I think a-d are unhelpful at best and harmful to our responsiveness towards real emergencies at worst.
If we want an alert about having too few n900s live, we should have a meta check that looks at how many n900s are in production and working. Being alerted about individual devices is more noise than signal, especially since there is no IT support for these atm.
If we can re-hand-off mobile imaging, my viewpoint will be largely changed.
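To sketch the meta check: nagios's stock check_cluster plugin can roll per-host states up into a single capacity alert. The host names, thresholds, and placement below are made up, and the exact flag/threshold syntax should be checked against the installed plugin:

  define command {
      command_name    check_n900_pool
      command_line    $USER1$/check_cluster --host -l "n900 pool" -w $ARG1$ -c $ARG2$ -d $ARG3$
  }

  define service {
      use                 generic-service         ; placeholder service template
      host_name           nagios1.build.mtv1      ; placeholder host to hang the rollup check on
      service_description n900 pool capacity
      ; warn/critical on the number of non-UP n900s; pass one $HOSTSTATEID:...$ macro per device
      check_command       check_n900_pool!5!10!$HOSTSTATEID:n900-081.build.mtv1$,$HOSTSTATEID:n900-082.build.mtv1$
  }

That would give one alert for "not enough n900s" instead of one per device.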
Updated•14 years ago
Assignee: server-ops → zandr
Comment 10•14 years ago
Aki:
I understand the complaints, but every one of these things is also true of every other slave in the releng infrastructure. How are N900s different? (other than alerting far *less* than the other slaves, it seems?)
I agree that the meta-alerts are more useful, but until those are built I'm not willing to turn off all alerting.
Comment 11•14 years ago
With everything else, we're informing IT of what down/alerting machines actually need action to be taken. Buildbot might not be running on a staging slave because someone's working on it, and they forgot to set a downtime. The list of alerts on nagios is not a list of actionable items; it's a superset.
With the N900s, a bug would be informing MV Releng (John Ford in practice, though the rest of us know how, now, and should take turns) which n900s are down and need action to be taken. This list should be the exact same list that is in Nagios when someone does take action. Therefore, the bug + alerts are extra noise, since one can look at the Nagios page and get the same information.
Reporter
Comment 12•14 years ago
A little bit of recent history: we tried a reboots bug before - bug 614857 - and it was closed over the holiday break with "I'll fix it soon".
Also, looking through nagios, most of the 15 down'd n900's have been in such a state for 60 days or more, and are acknowledged with various temporary-sounding bugs (rather than "I'll fix this when I get there"). I suspect that some of these are more than just needing a reboot when someone gets around to it -- but are *all* of them in that state? Do we have anything tracking the longer-term down'd phones to a resolution?
I want to propose an alternative way of handling this: add the relevant n900's to the slave-tracking spreadsheet when they are ack'd. If the phone needs more than a simple reboot, we can track that progress in the spreadsheet, or make a bug out of it if it's sufficiently complex. If the list in the spreadsheet gets long, and buildduty is remote that week, then buildduty should ask someone local to poke buttons in Haxxor.
The spreadsheet's at https://spreadsheets.google.com/ccc?key=0AqefQEn4Wp2ydFVjSkMwM1ZlS28xdVRaVDNHUEpLaEE&hl=en&authkey=CLP1msQG and also linked in the channel topic.
Comment 13•14 years ago
Aki/jhford/anyone else that deals with these -- any comment on Dustin's idea?
Comment 14•14 years ago
I've got some ideas on workflow that I'll bring up at the meeting today. Short version: use NagiosWeb to replace the spreadsheet and go back to bug-per-slave, since the reboots bugs are ungainly and error-prone, IMO.
There's some work to be done in Nagios, and possibly some Nagios<>Bugzilla integration here, but let's chat about it this afternoon.
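If it helps the discussion: the simplest form of the bugzilla side would probably be a notification command that appends the alert to a tracking bug instead of (or alongside) IRC/email. The helper script below is hypothetical, just to show the shape of it:

  define command {
      command_name    notify-host-by-bugzilla
      ; hypothetical glue script that would post a comment on a given bug via the Bugzilla API
      command_line    $USER1$/nagios2bugzilla.py --bug $ARG1$ --host "$HOSTNAME$" --state "$HOSTSTATE$" --output "$HOSTOUTPUT$"
  }

That command would then be wired into the n900 contacts' host_notification_commands.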
Assignee
Updated•14 years ago
Assignee: zandr → arich
Reporter
Comment 15•14 years ago
We've begun using the suggestion in comment 12 now, with reboots recorded in bug 638922 (alias n900-reboots). So I think that this is closed.
Zandr, we'll talk about better overall ideas next week, and can record concrete plans in new bugs.
Status: NEW → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED
Updated•10 years ago
Product: mozilla.org → mozilla.org Graveyard