Closed Bug 548760 Opened 16 years ago Closed 15 years ago

get nagios for mobile devices

Categories

(mozilla.org Graveyard :: Server Operations, task, P3)

x86
macOS

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: jhford, Assigned: arzhel)

References

Details

It would be great if we could get the same ping checks for the mobile phones (n900, n810) that we have for the talos slaves. As I understand it, we don't have anything running on talos slaves for this monitoring. These devices are on the Mountain View build network (10.250.48.0 - 10.250.50.???), if that has any impact on this being an option.
Assignee: nobody → jhford
Priority: -- → P3
Status: NEW → ASSIGNED
Blocks: 550945
Assignee: jhford → server-ops
Status: ASSIGNED → NEW
Component: Release Engineering → Server Operations
QA Contact: release → mrz
Arzhel, TIA for your Nagios <3 :)
Assignee: server-ops → ayounsi
For now, let's focus on n900s: n900-001.build.mozilla.org - 10.250.50.1 .... n900-020.build.mozilla.org - 10.250.50.20
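For anyone scripting the host list, here is a minimal Python sketch that expands the range above into hostname/address pairs. It assumes the devices in between follow the same pattern as the two endpoints given (n900-00N maps to 10.250.50.N); the function name `n900_hosts` is made up for illustration.

```python
import ipaddress


def n900_hosts(first=1, last=20):
    """Map n900 hostnames to addresses for devices first..last (inclusive).

    Assumption: device number N lives at 10.250.50.N, as the two
    endpoints in the comment above suggest.
    """
    base = ipaddress.IPv4Address("10.250.50.0")
    return {
        f"n900-{n:03d}.build.mozilla.org": str(base + n)
        for n in range(first, last + 1)
    }


if __name__ == "__main__":
    for name, addr in n900_hosts().items():
        print(name, addr)
```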
Checks added. Need more?
Yes -- maemo-n810-01 through 80. Also, are these checks set so we can reboot these devices and only get notified if they're down for longer than a set time? I don't remember what that set time is for the Talos minis, but we may want to up that to 30-60 minutes for the n810s+n900s.
I'd say 60 minutes would be good for the n900s. I would hope that it is the same for the n810s.
Checks added for maemo-n810-01 through 80. Should notify after 60min for both n900s and n810s.
Status: NEW → RESOLVED
Closed: 16 years ago
Resolution: --- → FIXED
I am noticing that some of the n900s are showing up as flapping. It is doing 60 checks, which I assume means that we are checking every minute. Is that correct? It is entirely likely that during 60 minutes the device is able to do a full test cycle + reboot at the end and come back up.

If that is the case, can we have checks 15 minutes apart from each other? A normal reboot will take at least 5 minutes and it is expected to be failing the ping check during this entire process. (We format the filesystem before the network comes up.)

I also see a message "CHECK_NRPE: Socket timeout after 10 seconds." instead of "PING CRITICAL - Packet loss = 100%", which is what we get on the desktop machines. Is this something to be concerned about?
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
(In reply to comment #7)
> I am noticing that some of the n900s are showing up as flapping. It is doing 60 checks, which I assume means that we are checking every minute. Is that correct? It is entirely likely that during 60 minutes the device is able to do a full test cycle + reboot at the end and come back up.
>
> If that is the case, can we have checks 15 minutes apart from each other? A normal reboot will take at least 5 minutes and it is expected to be failing the ping check during this entire process. (we format the filesystem before the network comes up)

I can adjust 3 variables here:
- the interval between 2 checks if the previous succeeded (currently 5min)
- the interval between 2 checks if the previous failed (currently 1min)
- the number of checks to run when a service is down before sending an alert (currently 60)
What are the best values for you?

> I also see a message "CHECK_NRPE: Socket timeout after 10 seconds. " instead of "PING CRITICAL - Packet loss = 100% " which is what we get on the desktop machines. Is this something to be concerned about?

Fixed
(In reply to comment #8)
> I can adjusts 3 variables here:
> - the interval between 2 checks if the previous succeeded (currently 5min)
> - the interval between 2 checks if the previous failed (currently 1min)
> - the number of checks to run when a service is down before sending alert (currently 60)
> What are the best values for you?

Can the first value be 30 minutes, the second be 10 minutes and the retry count be 12 please?

> > I also see a message "CHECK_NRPE: Socket timeout after 10 seconds. " instead of "PING CRITICAL - Packet loss = 100% " which is what we get on the desktop machines. Is this something to be concerned about?
>
> Fixed

Thanks!
(In reply to comment #9)
> (In reply to comment #8)
> > I can adjusts 3 variables here:
> > - the interval between 2 checks if the previous succeeded (currently 5min)
> > - the interval between 2 checks if the previous failed (currently 1min)
> > - the number of checks to run when a service is down before sending alert (currently 60)
> > What are the best values for you?
>
> can the first value be 30 minutes, the second be 10 minutes and the retry count be 12 please?

Updated
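As a rough check on what the three settings mean for alert timing, the arithmetic can be sketched in Python. This assumes the usual Nagios behaviour where the first failed check counts as attempt 1 and each further attempt follows one retry interval later; the function names are made up for illustration and nothing here is read from the actual server configuration.

```python
def minutes_until_alert(retry_interval_min, max_check_attempts):
    """Minutes from the first failed check to the alert.

    Assumption: the first failure is attempt 1, so only
    (max_check_attempts - 1) retry intervals have to elapse.
    """
    return (max_check_attempts - 1) * retry_interval_min


def worst_case_minutes(check_interval_min, retry_interval_min, max_check_attempts):
    """Upper bound on outage time before the alert.

    Assumption: the device can go down right after a successful check,
    so one full normal interval may pass before the first failure is seen.
    """
    return check_interval_min + minutes_until_alert(
        retry_interval_min, max_check_attempts
    )


if __name__ == "__main__":
    # (normal interval, retry interval, attempts) as quoted in this bug
    for label, settings in (("previous", (5, 1, 60)), ("requested", (30, 10, 12))):
        print(label, minutes_until_alert(*settings[1:]), worst_case_minutes(*settings))
```

Under these assumptions the previous values (5min / 1min / 60) alert about 59 minutes after the first failed check, and the requested values (30min / 10min / 12) alert about 110 minutes after it, or up to roughly 140 minutes after the device actually went down. A 5 minute reboot is then seen by at most one check.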
Status: REOPENED → RESOLVED
Closed: 16 years ago → 15 years ago
Resolution: --- → FIXED
Product: mozilla.org → mozilla.org Graveyard