Closed
Bug 624795
Opened 14 years ago
Closed 14 years ago
don't alert for n900's
Categories
(mozilla.org Graveyard :: Server Operations, task)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: dustin, Assigned: arich)
References
Details
We just saw an alert for an n900:
12:50 < nagios> [66] n900-082.build.mtv1:PING is CRITICAL: CRITICAL - Host Unreachable (10.250.51.41)
Rather than being handled on a per-alert basis, these are handled "occasionally" whenever a trip to Haxxor comes up. At that point, whoever's making the trip checks the nagios host display to get the list of machines needing tending.
Is it possible to turn off alerts for these systems?
Reporter
Updated•14 years ago
Assignee: dustin → server-ops
Component: Release Engineering → Server Operations
QA Contact: release → mrz
Comment 1•14 years ago
It is very important that these hosts still show up in the list of critical hosts to figure out which machines need to be reimaged.
Comment 2•14 years ago
Just to clarify, you want checks done with no notification, and you will have to proactively check nagios?
Comment 3•14 years ago
(In reply to comment #2)
> Just to clarify, you want checks done with no notification, and you will have
> to proactively check nagios?
That sounds right.
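For reference, something like the following in the nagios object config would probably do it: keep active checks on so down devices still show red on the host display, but never send notifications. The template and name values below are placeholders, not the real config:

  define host {
      name                    n900-no-notify    ; placeholder template name
      use                     n900-device       ; assumed existing n900 host template
      active_checks_enabled   1                 ; keep pinging the devices
      notifications_enabled   0                 ; never page/IRC/email about them
      register                0                 ; template only, not a real host
  }

Down n900s would then still appear in the critical-hosts list (comment 1), just without alerts.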
Comment 4•14 years ago
I really want to WONTFIX this. Per my previous comments, "Don't tell me what's broken" is not a solution to "I don't know what's broken."
In the absence of a more holistic tool like BorderCollie, I'm very hesitant to disable alerts for broken stuff.
(In reply to comment #0)
> Rather than being handled on a per-alert basis, these are handled
> "occasionally" whenever a trip to Haxxor comes up. At that point, whoever's
> making the trip checks the nagios host display to get the list of machines
> needing tending.
This is no different than any other slave, except that you guys aren't (usually) the ones making the trip.
For all other slaves, we've been filing in the 'reboots' bug, and we handle the reboots 'occasionally' when they reach critical mass or a colo trip comes up.
I don't particularly like that either, but acking the alert with a reboot bug sounds better than not alerting.
Until we have a better 'dashboard', though, I think the alerts are important.
Comment 5•14 years ago
Personally, I'm ok tracking these reboots in a bug, like the others. This bug was filed after finding out that the current action is to ack with "will get to it", which is no better (yet more annoying) than not alerting at all.
Comment 6•14 years ago
(In reply to comment #4)
> I really want to WONTFIX this. Per my previous comments, "Don't tell me what's
> broken" is not a solution to "I don't know what's broken."
>
> In the absence of a more holistic tool like BorderCollie, I'm very hesitant to
> disable alerts for broken stuff.
>
Agreed, we need n900 alerts. These alerts are important to know if we have enough n900s in production to handle checkin load.
We already tweaked nagios to treat these devices as a different class of machines, to allow for slower reboot times on these devices, so maybe these configs need tweaking again... Regardless, we do need to know and react to n900 failures in a timely manner.
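To be concrete, the knobs we'd be loosening are roughly the ones below; the template name and numbers are placeholders, not what's actually deployed:

  define host {
      name                      n900-device   ; placeholder for whatever template the n900s use now
      max_check_attempts        10            ; re-check this many times before the host goes hard DOWN
      retry_interval            5             ; minutes between those re-checks
      first_notification_delay  30            ; minutes a host can sit DOWN before the first notification
      register                  0
  }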
Comment 7•14 years ago
(In reply to comment #6)
> (In reply to comment #4)
> > I really want to WONTFIX this. Per my previous comments, "Don't tell me what's
> > broken" is not a solution to "I don't know what's broken."
> >
> > In the absence of a more holistic tool like BorderCollie, I'm very hesitant to
> > disable alerts for broken stuff.
> >
>
> Agreed, we need n900 alerts. These alerts are important to know if we have
> enough n900s in production to handle checkin load.
>
> We already tweaked nagios to treat these devices as a different class of
> machines, to allow for slower reboot times on these devices, so maybe these
> configs need tweaking again... Regardless, we do need to know and react to n900
> failures in a timely manner.
We're not suggesting we remove those. We're suggesting removing alerts to which we always reply with an ack that doesn't reference a bug. If we're going to ack alerts in such a way, we should not alert in the first place. If we're going to ack and reference something useful, like a bug, the alerts seem slightly more useful.
Regardless, n900 status will be on the nagios web page.
Comment 8•14 years ago
(In reply to comment #7)
> (In reply to comment #6)
> > (In reply to comment #4)
> > > I really want to WONTFIX this. Per my previous comments, "Don't tell me what's
> > > broken" is not a solution to "I don't know what's broken."
> > >
> > > In the absence of a more holistic tool like BorderCollie, I'm very hesitant to
> > > disable alerts for broken stuff.
> > >
> >
> > Agreed, we need n900 alerts. These alerts are important to know if we have
> > enough n900s in production to handle checkin load.
> >
> > We already tweaked nagios to treat these devices as a different class of
> > machines, to allow for slower reboot times on these devices, so maybe these
> > configs need tweaking again... Regardless, we do need to know and react to n900
> > failures in a timely manner.
>
> We're not suggesting we remove those. We're suggesting removing alerts to which
> we always reply with an ack that doesn't reference a bug. If we're going to ack
> alerts in such a way, we should not alert in the first place. If we're going to
> ack and reference something useful, like a bug, they seem slightly more useful.
ok, I think we are saying the same thing here: having an alert that is just mindlessly "ack"ed is no use to anyone. If so, then we are 100% agreed.
Instead of only using the nagios dashboard, for future mobile device alerts like this I'd prefer to treat them like the usual mini/ix reboot alerts, filing a bug to track when we ack an alert. That gives us some trending data when we look back to see how many machines need touching, and how often. In which case, I agree with zandr that we should WONTFIX this.
>
> Regardless, n900 status will be on the nagios web page.
yep, that too.
Comment 9•14 years ago
So, in an idealistic sense, the mobile reboot/reimage bug makes sense.
In practice, having n900 alerts that go to a bug:
a) spam #build
b) spam #build with notifications when acking (can be disabled with extra effort)
c) spam email with notifications when adding to bug
d) waste buildduty time and headspace which is already short
e) end up with an unsorted (except chronologically, which doesn't really help) list in a bug. You can take this list and put it in a file and sort, or you can go to nagios and look at all the down devices and ignore the bug completely.
I think a-d are unhelpful at best and harmful to our responsiveness towards real emergencies at worst.
If we want an alert about having too few n900s live, we should have a meta check that looks at how many n900s are in production and working. Being alerted about individual devices is more noise than signal, especially since there is no IT support for these atm.
If we can re-hand-off mobile imaging, my viewpoint will be largely changed.
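To sketch the meta check: nagios's stock check_cluster plugin can roll per-host states up into a single capacity alert. The host names, thresholds, and placement below are made up, and the exact flag/threshold syntax should be checked against the installed plugin:

  define command {
      command_name    check_n900_pool
      command_line    $USER1$/check_cluster --host -l "n900 pool" -w $ARG1$ -c $ARG2$ -d $ARG3$
  }

  define service {
      use                 generic-service         ; placeholder service template
      host_name           nagios1.build.mtv1      ; placeholder host to hang the rollup check on
      service_description n900 pool capacity
      ; warn/critical on the number of non-UP n900s; pass one $HOSTSTATEID:...$ macro per device
      check_command       check_n900_pool!5!10!$HOSTSTATEID:n900-081.build.mtv1$,$HOSTSTATEID:n900-082.build.mtv1$
  }

That would give one alert for "not enough n900s" instead of one per device.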
Updated•14 years ago
Assignee: server-ops → zandr
Comment 10•14 years ago
Aki:
I understand the complaints, but every one of these things is also true of every other slave in the releng infrastructure. How are N900s different? (other than alerting far *less* than the other slaves, it seems?)
I agree that the meta-alerts are more useful, but until those are built I'm not willing to turn off all alerting.
Comment 11•14 years ago
With everything else, we're informing IT of what down/alerting machines actually need action to be taken. Buildbot might not be running on a staging slave because someone's working on it, and they forgot to set a downtime. The list of alerts on nagios is not a list of actionable items; it's a superset.
With the N900s, a bug would be informing MV Releng (John Ford in practice, though the rest of us know how, now, and should take turns) which n900s are down and need action to be taken. This list should be the exact same list that is in Nagios when someone does take action. Therefore, the bug + alerts are extra noise, since one can look at the Nagios page and get the same information.
Reporter
Comment 12•14 years ago
A little bit of recent history: we tried a reboots bug before - bug 614857 - and it was closed over the holiday break with "I'll fix it soon".
Also, looking through nagios, most of the 15 down'd n900's have been in such a state for 60 days or more, and are acknowledged with various temporary-sounding bugs (rather than "I'll fix this when I get there"). I suspect that some of these are more than just needing a reboot when someone gets around to it -- but are *all* of them in that state? Do we have anything tracking the longer-term down'd phones to a resolution?
I want to propose an alternative way of handling this: add the relevant n900's to the slave-tracking spreadsheet when they are ack'd. If the phone needs more than a simple reboot, we can track that progress in the spreadsheet, or make a bug out of it if it's sufficiently complex. If the list in the spreadsheet gets long, and buildduty is remote that week, then buildduty should ask someone local to poke buttons in Haxxor.
The spreadsheet's at https://spreadsheets.google.com/ccc?key=0AqefQEn4Wp2ydFVjSkMwM1ZlS28xdVRaVDNHUEpLaEE&hl=en&authkey=CLP1msQG and also linked in the channel topic.
Comment 13•14 years ago
Aki/jhford/anyone else that deals with these -- any comment on Dustin's idea?
Comment 14•14 years ago
I've got some ideas on workflow that I'll bring up at the meeting today. Short version: use NagiosWeb to replace the spreadsheet and go back to bug-per-slave, since the reboots bugs are ungainly and error-prone, IMO.
There's some work to be done in Nagios, and possibly some Nagios<>Bugzilla integration here, but let's chat about it this afternoon.
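If it helps the discussion: the simplest form of the bugzilla side would probably be a notification command that appends the alert to a tracking bug instead of (or alongside) IRC/email. The helper script below is hypothetical, just to show the shape of it:

  define command {
      command_name    notify-host-by-bugzilla
      ; hypothetical glue script that would post a comment on a given bug via the Bugzilla API
      command_line    $USER1$/nagios2bugzilla.py --bug $ARG1$ --host "$HOSTNAME$" --state "$HOSTSTATE$" --output "$HOSTOUTPUT$"
  }

That command would then be wired into the n900 contacts' host_notification_commands.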
Assignee
Updated•14 years ago
Assignee: zandr → arich
Reporter
Comment 15•14 years ago
We've begun using the suggestion in comment 12 now, with reboots recorded in bug 638922 (alias n900-reboots). So I think that this is closed.
Zandr, we'll talk about better overall ideas next week, and can record concrete plans in new bugs.
Status: NEW → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED
Updated•10 years ago
Product: mozilla.org → mozilla.org Graveyard