Closed Bug 608889 Opened 15 years ago Closed 14 years ago

add tegras to nagios

Categories

(Infrastructure & Operations :: RelOps: General, task)

x86
Android
task
Not set
normal

Tracking

(Not tracked)

VERIFIED FIXED

People

(Reporter: joduinn, Assigned: arich)

References

Details

(Whiteboard: [added to DNS/inventory])

These tegra machines need to be added to nagios as we move them to production.

tegra-001.build.m.o
...
tegra-013.build.m.o

These should be treated in the same way as the n810s and n900 - slower reboot times, etc.
(In reply to comment #1)
> These tegra machines need to be added to nagios as we move them to production.
>
> tegra-001.build.m.o
> ...
> tegra-013.build.m.o
>
> These should be treated in the same way as the n810s and n900 - slower reboot
> times, etc.

Aki just reminded me of a donated machine from Joel, so this should be tegra-001...tegra-014.
Blocks: 608747
Assignee: server-ops → jdow
These haven't been added to DNS, DHCP or Inventory yet. I'll need a list of MAC addresses and serial numbers before I can proceed.
These still seem to be missing from reverse DNS, which nagios needs in order to work.
Assignee: jdow → jlazaro
Whiteboard: [needs to be added to inventory/DNS first]
Whiteboard: [needs to be added to inventory/DNS first] → [added to DNS/inventory]
I think arr is doing the nagios work for releng these days. I'm not sure what the current status of this is, but I don't think jlaz will be getting to it, due to his shift to services ops. Amy, can you get these tegras added to nagios? I imagine just a ping check with some kind of lazy_host or very_lazy_host directive.
Assignee: jlazaro → arich
Component: Server Operations → Server Operations: RelEng
QA Contact: mrz → zandr
This is now tegra-001 through tegra-093.
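For anyone scripting the host list, here is a minimal Python sketch that expands the range above into short hostnames. The function name `tegra_hostnames` is made up for illustration, and the zero-padded three-digit format is assumed from the names used in this bug; append whatever domain suffix applies in your environment.

```python
def tegra_hostnames(first=1, last=93):
    """Return short hostnames tegra-001 .. tegra-093 (inclusive range).

    The "tegra-NNN" pattern is an assumption based on the names in this bug.
    """
    return ["tegra-{:03d}".format(n) for n in range(first, last + 1)]
```

Calling `tegra_hostnames()` with the defaults yields 93 names, which can then be fed into whatever generates the nagios host definitions.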
Aki: In addition to a ping check, zandr mentioned that port 20701 might be useful to monitor. Dustin was under the impression that this might not be stable enough to check yet, though. Do you want anything other than a basic slow ping check?
Status: NEW → ASSIGNED
you can do a simple socket open check on 20701 - the agent will respond if it's alive
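As a rough illustration of the "simple socket open check" mentioned above, here is a Python sketch that only tests whether a TCP connection to the port succeeds. The function names and the 10-second timeout are assumptions, not anything configured in this bug; the return codes follow the usual nagios plugin convention (0 = OK, 2 = CRITICAL).

```python
import socket


def port_is_open(host, port=20701, timeout=10.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        # create_connection resolves the name and connects; the context
        # manager closes the socket again straight away.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS failures.
        return False


def nagios_result(host, port=20701, timeout=10.0):
    """Map the connection test onto a (exit_code, message) pair."""
    if port_is_open(host, port, timeout):
        return 0, "TCP OK - {}:{} accepted the connection".format(host, port)
    return 2, "TCP CRITICAL - {}:{} did not accept the connection".format(host, port)
```

This does not look at anything the agent sends back, so it only shows that something is listening on the port.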
I've added the ping check and matched it to the n900s. If someone could please write up the information for the 20701 check and add it to this ticket, I'll implement that as well.

For reference: http://nagiosplugins.org/man/check_tcp

And the check timing numbers I'd need:
normal_check_interval
retry_check_interval
max_check_attempts
first_notification_delay

Thanks!
Assignee: arich → bear
The port 20701 check should respond with

$ telnet tegra-001 20701
Trying 10.250.48.251...
Connected to tegra-001.build.mtv1.mozilla.com.
Escape character is '^]'.
$>

(I imagine the $> is the response, and the rest is telnet output)

The frequency should probably match the ping check.

(Bear can add any further information or clarification)
(In reply to comment #9)
> The port 20701 check should respond with
>
> $ telnet tegra-001 20701
> Trying 10.250.48.251...
> Connected to tegra-001.build.mtv1.mozilla.com.
> Escape character is '^]'.
> $>
>
> (I imagine the $> is the response, and the rest is telnet output)

aki is correct, you will only see "$>" with no crlf. Any response should be "quit\n", but IMO that is not required, as closing the socket works as well.

> The frequency should probably match the ping check.
>
> (Bear can add any further information or clarification)
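If a check ever needs to confirm the prompt rather than just the open port, something along these lines might work. This is only a sketch under the behaviour described above (the agent sends "$>" with no trailing CRLF, and "quit\n" is an optional polite goodbye); the function name, buffer size, and timeout are assumptions.

```python
import socket

PROMPT = b"$>"  # prompt as described in this bug, sent without CRLF


def agent_prompt_seen(host, port=20701, timeout=10.0):
    """Return True if the agent's prompt arrives after connecting."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            data = b""
            # Read until the prompt shows up; there is no line terminator,
            # so a line-based read would block until the timeout.
            while PROMPT not in data:
                chunk = sock.recv(64)
                if not chunk:
                    return False  # peer closed before sending the prompt
                data += chunk
            # Optional: closing the socket alone is reported to work too.
            sock.sendall(b"quit\n")
            return True
    except OSError:
        # Connection refused, timeout while waiting, or DNS failure.
        return False
```

Matching on raw bytes in code avoids having to escape "$" and ">" in a config file, which is the practical difficulty with expressing this as a plain expect string.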
Assignee: bear → arich
Since the prompt on the tegras is two metacharacters and does not play well with the nagios config file, I'm doing a simple tcp connection to the port. If we change the prompt to be something easier to check in the future, we can open a new bug and I can go back and rework the check. For now, I've rolled the simple tcp socket check out to all of the tegras. I would expect several of the tegras that are down to send out notifications later today once they finally accumulate enough failed tries.
Status: ASSIGNED → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED
notifications were seen for the tegras that were offline - thanks!
Status: RESOLVED → VERIFIED
Component: Server Operations: RelEng → RelOps
Product: mozilla.org → Infrastructure & Operations