Closed
Bug 548760
Opened 16 years ago
Closed 15 years ago
get nagios for mobile devices
Categories
(mozilla.org Graveyard :: Server Operations, task, P3)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: jhford, Assigned: arzhel)
It would be great if we could get the same ping checks for the mobile phones (n900, n810) that we have for the talos slaves. As I understand it, we don't have anything running on talos slaves for this monitoring. These devices are on the Mountain View build network (10.250.48.0 - 10.250.50.???), if that has any impact on this being an option.
| Reporter | ||
Updated•16 years ago
|
Assignee: nobody → jhford
Priority: -- → P3
| Reporter | ||
Updated•16 years ago
|
Status: NEW → ASSIGNED
| Reporter | ||
Updated•16 years ago
|
Assignee: jhford → server-ops
Status: ASSIGNED → NEW
Component: Release Engineering → Server Operations
QA Contact: release → mrz
| Reporter | ||
Comment 2•16 years ago
|
||
For now, let's focus on n900s:
n900-001.build.mozilla.org - 10.250.50.1
....
n900-020.build.mozilla.org - 10.250.50.20
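A minimal sketch of how the host list above could be enumerated, assuming the elided entries follow the same numbering pattern as the two endpoints shown (hostname suffix and last IP octet counting up together). The function name `n900_hosts` is illustrative only.

```python
def n900_hosts(first=1, last=20):
    """Return (hostname, ip) pairs for the n900 devices.

    Assumption: device N is named n900-NNN.build.mozilla.org and
    sits at 10.250.50.N, as the first and last entries suggest.
    """
    return [
        (f"n900-{n:03d}.build.mozilla.org", f"10.250.50.{n}")
        for n in range(first, last + 1)
    ]


if __name__ == "__main__":
    for hostname, ip in n900_hosts():
        print(f"{hostname} - {ip}")
```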
| Assignee | ||
Comment 3•16 years ago
|
||
Checks added. Need more?
Comment 4•16 years ago
|
||
Yes -- maemo-n810-01 through 80.
Also, are these checks set so we can reboot these devices and only get notified if they're down for > a set time?
I don't remember what that set time is for the Talos minis, but we may want to up that to 30-60 minutes for the n810s+n900s.
| Reporter | ||
Comment 5•16 years ago
|
||
I'd say 60 minutes would be good for the n900s. I would hope that it is the same for the n810s.
| Assignee | ||
Comment 6•16 years ago
|
||
Checks added for maemo-n810-01 through 80.
Should notify after 60min for both n900s and n810s.
Status: NEW → RESOLVED
Closed: 16 years ago
Resolution: --- → FIXED
| Reporter | ||
Comment 7•16 years ago
|
||
I am noticing that some of the n900s are showing up as flapping. It is doing 60 checks, which I assume means that we are checking every minute. Is that correct? It is entirely likely that during 60 minutes the device is able to do a full test cycle + reboot at the end and come back up.
If that is the case, can we have checks 15 minutes apart from each other? A normal reboot will take at least 5 minutes and it is expected to be failing the ping check during this entire process. (we format the filesystem before the network comes up)
I also see a message "CHECK_NRPE: Socket timeout after 10 seconds. " instead of "PING CRITICAL - Packet loss = 100% " which is what we get on the desktop machines. Is this something to be concerned about?
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
| Assignee | ||
Comment 8•15 years ago
|
||
(In reply to comment #7)
> I am noticing that some of the n900s are showing up as flapping. It is doing
> 60 checks, which I assume means that we are checking every minute. Is that
> correct? It is entirely likely that during 60 minutes the device is able to do
> a full test cycle + reboot at the end and come back up.
>
> If that is the case, can we have checks 15 minutes apart from each other? A
> normal reboot will take at least 5 minutes and it is expected to be failing the
> ping check during this entire process. (we format the filesystem before the
> network comes up)
>
I can adjust 3 variables here:
- the interval between 2 checks if the previous succeeded (currently 5min)
- the interval between 2 checks if the previous failed (currently 1min)
- the number of checks to run when a service is down before sending alert (currently 60)
What are the best values for you?
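To make the interaction between these three values concrete, here is a rough sketch of how long a device would have to stay down before an alert goes out. The class and field names are hypothetical, and it assumes the first failed check counts as attempt 1, so only the remaining attempts are spaced by the retry interval.

```python
from dataclasses import dataclass


@dataclass
class CheckTiming:
    check_interval_min: int   # interval between checks after a success
    retry_interval_min: int   # interval between checks after a failure
    max_check_attempts: int   # failed checks required before alerting

    def minutes_to_alert(self) -> int:
        # Assumption: the first failure is attempt 1, so there are
        # (max_check_attempts - 1) retries after it.
        return (self.max_check_attempts - 1) * self.retry_interval_min


# Values currently in place, as listed above.
current = CheckTiming(check_interval_min=5,
                      retry_interval_min=1,
                      max_check_attempts=60)

if __name__ == "__main__":
    print(current.minutes_to_alert())
```

Under that assumption, the current values give roughly an hour between the first failed check and the alert.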
>
> I also see a message "CHECK_NRPE: Socket timeout after 10 seconds. " instead of
> "PING CRITICAL - Packet loss = 100% " which is what we get on the desktop
> machines. Is this something to be concerned about?
Fixed
| Reporter | ||
Comment 9•15 years ago
|
||
(In reply to comment #8)
> I can adjust 3 variables here:
> - the interval between 2 checks if the previous succeeded (currently 5min)
> - the interval between 2 checks if the previous failed (currently 1min)
> - the number of checks to run when a service is down before sending alert
> (currently 60)
> What are the best values for you?
Can the first value be 30 minutes, the second be 10 minutes and the retry count be 12, please?
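As a rough illustration of what those values would mean for a rebooting device, the sketch below lists when each retry would run and whether an outage of a given length would still be failing at every attempt. It is a simplification: it assumes the outage starts right at the first failed check and ignores scheduling jitter and check run time. The function names are hypothetical.

```python
def failed_check_times(retry_interval_min, max_check_attempts):
    """Minutes after the first failed check at which each attempt runs."""
    return [i * retry_interval_min for i in range(max_check_attempts)]


def would_alert(outage_min, retry_interval_min, max_check_attempts):
    """True if every attempt lands while the device is still down.

    Assumption: the outage begins at the first failed check (t = 0).
    """
    times = failed_check_times(retry_interval_min, max_check_attempts)
    return all(t < outage_min for t in times)


if __name__ == "__main__":
    # Requested values: retry every 10 minutes, 12 attempts.
    print(failed_check_times(10, 12))
    print(would_alert(5, 10, 12))    # a 5-minute reboot
    print(would_alert(120, 10, 12))  # a device down for 2 hours
```

With these inputs a 5-minute reboot has recovered by the second attempt, so it would not alert, while a device that stays down for two hours would.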
> >
> > I also see a message "CHECK_NRPE: Socket timeout after 10 seconds. " instead of
> > "PING CRITICAL - Packet loss = 100% " which is what we get on the desktop
> > machines. Is this something to be concerned about?
>
> Fixed
Thanks!
| Assignee | ||
Comment 10•15 years ago
|
||
(In reply to comment #9)
> (In reply to comment #8)
>
> > I can adjust 3 variables here:
> > - the interval between 2 checks if the previous succeeded (currently 5min)
> > - the interval between 2 checks if the previous failed (currently 1min)
> > - the number of checks to run when a service is down before sending alert
> > (currently 60)
> > What are the best values for you?
>
> can the first value be 30 minutes, the second be 10 minutes and the retry count
> be 12 please?
Updated
Status: REOPENED → RESOLVED
Closed: 16 years ago → 15 years ago
Resolution: --- → FIXED
Updated•11 years ago
|
Product: mozilla.org → mozilla.org Graveyard