get nagios for mobile devices

RESOLVED FIXED

Status

Product: mozilla.org Graveyard
Component: Server Operations
Priority: P3
Severity: normal
Status: RESOLVED FIXED
Opened: 8 years ago
Modified: 3 years ago

People

(Reporter: jhford, Assigned: XioNoX)

It would be great if we could get the same ping checks for the mobile phones (n900, n810) that we have for the talos slaves.  As I understand it, we don't have anything running on the talos slaves themselves for this monitoring.  These devices are on the Mountain View build network (10.250.48.0 - 10.250.50.???), if that has any impact on this being an option.
Assignee: nobody → jhford
Priority: -- → P3
Status: NEW → ASSIGNED
Blocks: 550945
Assignee: jhford → server-ops
Status: ASSIGNED → NEW
Component: Release Engineering → Server Operations
QA Contact: release → mrz
Arzhel, TIA for your Nagios <3 :)
Assignee: server-ops → ayounsi
For now, let's focus on n900s:

n900-001.build.mozilla.org - 10.250.50.1
....
n900-020.build.mozilla.org - 10.250.50.20
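
For reference, the host list above could be expanded into Nagios host definitions with a short script. This is only a sketch: it assumes the names and addresses run contiguously from n900-001 (10.250.50.1) to n900-020 (10.250.50.20), and the `generic-host` template name is an assumption, not something taken from the actual server-ops configuration.

```python
# Sketch only. Assumes the elided entries follow the same pattern as the
# first and last ones listed above (n900-NNN -> 10.250.50.N).
def n900_hosts(first=1, last=20):
    """Return (host_name, address) pairs for the n900 devices."""
    return [
        (f"n900-{i:03d}.build.mozilla.org", f"10.250.50.{i}")
        for i in range(first, last + 1)
    ]


def render_host(host_name, address, template="generic-host"):
    """Render one Nagios host object definition as text.

    The template name is hypothetical; use whatever template the
    monitoring server already defines.
    """
    return (
        "define host {\n"
        f"    use        {template}\n"
        f"    host_name  {host_name}\n"
        f"    address    {address}\n"
        "}"
    )


hosts = n900_hosts()
config = "\n\n".join(render_host(name, addr) for name, addr in hosts)
print(config)
```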
(Assignee)

Comment 3

8 years ago
Checks added. Need more?

Comment 4

8 years ago
Yes -- maemo-n810-01 through 80.

Also, are these checks set so we can reboot these devices and only get notified if they're down for longer than a set time?

I don't remember what that set time is for the Talos minis, but we may want to up that to 30-60 minutes for the n810s+n900s.
I'd say 60 minutes would be good for the n900s.  I would hope that it is the same for the n810s.
(Assignee)

Comment 6

8 years ago
Checks added for maemo-n810-01 through 80.
Should notify after 60min for both n900s and n810s.
Status: NEW → RESOLVED
Last Resolved: 8 years ago
Resolution: --- → FIXED
I am noticing that some of the n900s are showing up as flapping.  It is doing 60 checks, which I assume means that we are checking every minute.  Is that correct?  It is entirely likely that during 60 minutes the device is able to do a full test cycle + reboot at the end and come back up.  

If that is the case, can we have checks 15 minutes apart from each other?  A normal reboot will take at least 5 minutes and it is expected to be failing the ping check during this entire process. (we format the filesystem before the network comes up)

I also see a message "CHECK_NRPE: Socket timeout after 10 seconds. " instead of "PING CRITICAL - Packet loss = 100% " which is what we get on the desktop machines.  Is this something to be concerned about?
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
(Assignee)

Comment 8

8 years ago
(In reply to comment #7)
> I am noticing that some of the n900s are showing up as flapping.  It is doing
> 60 checks, which I assume means that we are checking every minute.  Is that
> correct?  It is entirely likely that during 60 minutes the device is able to do
> a full test cycle + reboot at the end and come back up.  
> 
> If that is the case, can we have checks 15 minutes apart from each other?  A
> normal reboot will take at least 5 minutes and it is expected to be failing the
> ping check during this entire process. (we format the filesystem before the
> network comes up)
> 

I can adjust 3 variables here: 
 - the interval between 2 checks if the previous succeeded (currently 5min)
 - the interval between 2 checks if the previous failed (currently 1min)
 - the number of checks to run when a service is down before sending alert (currently 60)
What are the best values for you?

>
> I also see a message "CHECK_NRPE: Socket timeout after 10 seconds. " instead of
> "PING CRITICAL - Packet loss = 100% " which is what we get on the desktop
> machines.  Is this something to be concerned about?

Fixed
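
For anyone mapping the three variables from this comment onto a Nagios service definition: they most likely correspond to the `check_interval`, `retry_interval` and `max_check_attempts` directives, with intervals counted in minutes under the default `interval_length` of 60 seconds. The sketch below renders such a definition with the values quoted as current (5, 1, 60). The template name, service description and `check_ping` thresholds are assumptions for illustration; the real configuration is not shown in this bug.

```python
# Sketch only: renders a Nagios service definition for a ping check.
def render_service(host_name, check_interval, retry_interval, max_check_attempts):
    """Return a Nagios service object definition as text."""
    directives = [
        ("use", "generic-service"),            # assumed template name
        ("host_name", host_name),
        ("service_description", "PING"),       # assumed description
        ("check_command", "check_ping!100.0,20%!500.0,60%"),  # assumed thresholds
        ("check_interval", check_interval),    # minutes between checks while OK
        ("retry_interval", retry_interval),    # minutes between checks after a failure
        ("max_check_attempts", max_check_attempts),  # failed checks before alerting
    ]
    body = "\n".join(f"    {key:<24}{value}" for key, value in directives)
    return "define service {\n" + body + "\n}"


current = render_service("n900-001", 5, 1, 60)
print(current)
```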
(In reply to comment #8)

> I can adjust 3 variables here: 
>  - the interval between 2 checks if the previous succeeded (currently 5min)
>  - the interval between 2 checks if the previous failed (currently 1min)
>  - the number of checks to run when a service is down before sending alert
> (currently 60)
> What are the best values for you?

Can the first value be 30 minutes, the second be 10 minutes, and the retry count be 12, please?



> >
> > I also see a message "CHECK_NRPE: Socket timeout after 10 seconds. " instead of
> > "PING CRITICAL - Packet loss = 100% " which is what we get on the desktop
> > machines.  Is this something to be concerned about?
> 
> Fixed

Thanks!
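
As a rough check on what these values mean in practice, the time from a device going down to the alert can be estimated from the three settings. This is a simplified model, not a description of the actual server: it assumes the first failed check counts as attempt 1, each further attempt happens one retry interval later, the alert is sent when the last attempt fails, and check latency is ignored. The function name is made up for this example.

```python
# Sketch only: simplified model of Nagios alert timing, all values in minutes.
def minutes_until_alert(check_interval, retry_interval, max_check_attempts):
    """Return (best_case, worst_case) minutes between a device going down
    and the alert, assuming it stays down the whole time.

    best_case:  the device fails right before a scheduled check.
    worst_case: the device fails right after a successful check, so the
                first failure is only seen one full check interval later.
    """
    confirm = (max_check_attempts - 1) * retry_interval
    return confirm, check_interval + confirm


original = minutes_until_alert(5, 1, 60)      # values quoted as current
requested = minutes_until_alert(30, 10, 12)   # values requested above

print(original)   # (59, 64)
print(requested)  # (110, 140)
```

Under this model the original settings alert after about an hour, while the requested ones alert after roughly two hours, so a 5 minute reboot falls well inside either window.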
(Assignee)

Comment 10

8 years ago
(In reply to comment #9)
> (In reply to comment #8)
> 
> > I can adjust 3 variables here: 
> >  - the interval between 2 checks if the previous succeeded (currently 5min)
> >  - the interval between 2 checks if the previous failed (currently 1min)
> >  - the number of checks to run when a service is down before sending alert
> > (currently 60)
> > What are the best values for you?
> 
> can the first value be 30 minutes, the second be 10 minutes and the retry count
> be 12 please?

Updated
Status: REOPENED → RESOLVED
Last Resolved: 8 years ago
Resolution: --- → FIXED
Product: mozilla.org → mozilla.org Graveyard