Get Nagios for mobile devices

RESOLVED FIXED

Status

P3
normal
9 years ago
4 years ago

People

(Reporter: jhford, Assigned: arzhel)

Details

It would be great if we could get the same ping checks for the mobile phones (n900, n810) that we have for the Talos slaves. As I understand it, we don't run anything on the Talos slaves themselves for this monitoring. These devices are on the Mountain View build network (10.250.48.0 - 10.250.50.???), in case that affects whether this is an option.
Assignee: nobody → jhford
Priority: -- → P3
Status: NEW → ASSIGNED
Blocks: 550945
Assignee: jhford → server-ops
Status: ASSIGNED → NEW
Component: Release Engineering → Server Operations
QA Contact: release → mrz
Arzhel, TIA for your Nagios <3 :)
Assignee: server-ops → ayounsi
For now, let's focus on the n900s:

n900-001.build.mozilla.org - 10.250.50.1
....
n900-020.build.mozilla.org - 10.250.50.20
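
A minimal sketch of how the host list above could be generated for a monitoring config. It assumes the elided entries follow the same numbering pattern as the two shown (n900-NNN mapped to 10.250.50.N), which this bug does not confirm; `n900_hosts` is a hypothetical helper name.

```python
import ipaddress


def n900_hosts(count=20, base_ip="10.250.50.0"):
    """Return (hostname, ip) pairs for n900-001 .. n900-<count>.

    Assumption: host number N maps to base_ip + N, as the first and
    last entries in the list suggest.
    """
    base = ipaddress.IPv4Address(base_ip)
    return [
        (f"n900-{i:03d}.build.mozilla.org", str(base + i))
        for i in range(1, count + 1)
    ]


for name, ip in n900_hosts():
    print(f"{name} - {ip}")
```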
(Assignee)

Comment 3

9 years ago
Checks added. Need more?
Yes -- maemo-n810-01 through 80.

Also, are these checks set so we can reboot these devices and only get notified if they're down for longer than a set time?

I don't remember what that set time is for the Talos minis, but we may want to up that to 30-60 minutes for the n810s+n900s.
I'd say 60 minutes would be good for the n900s. I would hope that it is the same for the n810s.
(Assignee)

Comment 6

9 years ago
Checks added for maemo-n810-01 through 80.
Should notify after 60min for both n900s and n810s.
Status: NEW → RESOLVED
Last Resolved: 9 years ago
Resolution: --- → FIXED
I am noticing that some of the n900s are showing up as flapping.  It is doing 60 checks, which I assume means that we are checking every minute.  Is that correct?  It is entirely likely that during 60 minutes the device is able to do a full test cycle + reboot at the end and come back up.  

If that is the case, can we have checks 15 minutes apart from each other?  A normal reboot will take at least 5 minutes and it is expected to be failing the ping check during this entire process. (we format the filesystem before the network comes up)

I also see a message "CHECK_NRPE: Socket timeout after 10 seconds. " instead of "PING CRITICAL - Packet loss = 100% " which is what we get on the desktop machines.  Is this something to be concerned about?
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
(Assignee)

Comment 8

9 years ago
(In reply to comment #7)
> I am noticing that some of the n900s are showing up as flapping.  It is doing
> 60 checks, which I assume means that we are checking every minute.  Is that
> correct?  It is entirely likely that during 60 minutes the device is able to do
> a full test cycle + reboot at the end and come back up.  
> 
> If that is the case, can we have checks 15 minutes apart from each other?  A
> normal reboot will take at least 5 minutes and it is expected to be failing the
> ping check during this entire process. (we format the filesystem before the
> network comes up)
> 

I can adjust 3 variables here: 
 - the interval between 2 checks if the previous succeeded (currently 5min)
 - the interval between 2 checks if the previous failed (currently 1min)
 - the number of checks to run when a service is down before sending alert (currently 60)
What are the best values for you?
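
A rough sketch of how these three values combine into a time-to-alert. The mapping to the Nagios service directives `check_interval`, `retry_interval`, and `max_check_attempts` is an assumption, since the actual config is not shown in this bug, and `minutes_until_alert` is a hypothetical helper. The model assumes the first failed check counts as attempt 1.

```python
def minutes_until_alert(check_interval, retry_interval, max_check_attempts):
    """Return (best_case, worst_case) minutes from outage start to notification.

    Assumed model: the first failed check counts as attempt 1, and the
    remaining attempts are spaced retry_interval minutes apart.
    """
    confirm = (max_check_attempts - 1) * retry_interval
    # Best case: the outage starts right before a scheduled check.
    # Worst case: it starts right after one, adding a full check_interval.
    return confirm, check_interval + confirm


# Values quoted above: 5 min after a good check, 1 min after a failure, 60 attempts.
best, worst = minutes_until_alert(5, 1, 60)
print(f"alert after {best} to {worst} minutes of downtime")
```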

>
> I also see a message "CHECK_NRPE: Socket timeout after 10 seconds. " instead of
> "PING CRITICAL - Packet loss = 100% " which is what we get on the desktop
> machines.  Is this something to be concerned about?

Fixed
(In reply to comment #8)

> I can adjust 3 variables here: 
>  - the interval between 2 checks if the previous succeeded (currently 5min)
>  - the interval between 2 checks if the previous failed (currently 1min)
>  - the number of checks to run when a service is down before sending alert
> (currently 60)
> What are the best values for you?

Can the first value be 30 minutes, the second 10 minutes, and the retry count 12, please?
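
To sanity-check the requested values (30 minutes, 10 minutes, 12 retries) against a reboot, here is a small simulation sketch. `would_alert` is a hypothetical helper, and the model is an assumption: a failed check is retried every `retry_interval` minutes, and a notification is sent only if all `max_attempts` consecutive checks fail.

```python
def would_alert(outage_minutes, first_fail_delay, retry_interval=10, max_attempts=12):
    """Return True if an outage of outage_minutes would reach a notification.

    first_fail_delay: minutes from the start of the outage to the first
    check that runs during it (between 0 and the normal check interval).
    """
    t = first_fail_delay  # time of the check currently being run
    for _ in range(max_attempts):
        if t >= outage_minutes:
            return False  # device is back up, so this check passes and the count resets
        t += retry_interval
    return True  # every attempt failed, so a notification goes out


for outage in (5, 60, 120):
    print(outage, would_alert(outage, first_fail_delay=0))
```

Under this model a 5-minute reboot never alerts, and an outage has to last more than 110 minutes after the first failed check before a notification is sent, so these values may be more lenient than the 60 minutes discussed earlier.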



> >
> > I also see a message "CHECK_NRPE: Socket timeout after 10 seconds. " instead of
> > "PING CRITICAL - Packet loss = 100% " which is what we get on the desktop
> > machines.  Is this something to be concerned about?
> 
> Fixed

Thanks!
(Assignee)

Comment 10

9 years ago
(In reply to comment #9)
> (In reply to comment #8)
> 
> > I can adjust 3 variables here: 
> >  - the interval between 2 checks if the previous succeeded (currently 5min)
> >  - the interval between 2 checks if the previous failed (currently 1min)
> >  - the number of checks to run when a service is down before sending alert
> > (currently 60)
> > What are the best values for you?
> 
> can the first value be 30 minutes, the second be 10 minutes and the retry count
> be 12 please?

Updated
Status: REOPENED → RESOLVED
Last Resolved: 9 years ago
Resolution: --- → FIXED
Product: mozilla.org → mozilla.org Graveyard