Add panda temp monitoring

RESOLVED WONTFIX

Status

Infrastructure & Operations
RelOps
RESOLVED WONTFIX
6 years ago
4 years ago

People

(Reporter: dividehex, Assigned: dividehex)

Tracking

Details

(Whiteboard: [2013Q3] [tracker] kanbanzilla[Ready to work on])

(Assignee)

Description

6 years ago
We would like to monitor panda board temperature using the built-in on-board sensors.

The idea is to have a fail_overheating state in mozpool and have a panda board report that event if the board reaches a predetermined temperature threshold.
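
A minimal sketch of the threshold check described above, in Python. Only the fail_overheating state name comes from this bug; the threshold value, the function name, and the assumption that a reading in degrees Celsius is already available are hypothetical.

```python
# Hypothetical threshold; this bug does not specify a value.
OVERHEAT_THRESHOLD_C = 85.0


def next_state(current_state, temp_c, threshold_c=OVERHEAT_THRESHOLD_C):
    """Return 'fail_overheating' once the reading reaches the threshold.

    Otherwise the board stays in whatever state it was already in.
    """
    if temp_c >= threshold_c:
        return "fail_overheating"
    return current_state
```

How the reading is obtained from the on-board sensor is left out here, since the bug does not describe it.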
Related: bug 817057 suggests polling temperature among other things in the "free" and maybe "ready" states.  That will catch problems with a chassis's fans failing.

Other temperature-monitoring options:

 - short-term, polling temperature using an ad-hoc script and stuffing it into graphite to get some data and help us decide if this is really an issue

 - monitoring board temp with nagios (contacting SUTAgent directly from a nagios check script)
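
For the short-term graphite option, a rough sketch of what the ad-hoc script could send. It uses Graphite's plaintext protocol (one "path value timestamp" line per data point, by default on TCP port 2003). The metric path and host name are made-up examples, and reading the temperature from the board is not shown.

```python
import socket
import time


def format_metric(path, value, timestamp):
    """Build one line of Graphite's plaintext protocol."""
    return "%s %s %d\n" % (path, value, int(timestamp))


def send_metric(host, path, value, port=2003, timestamp=None):
    """Send a single data point to a carbon listener over TCP."""
    if timestamp is None:
        timestamp = time.time()
    line = format_metric(path, value, timestamp)
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall(line.encode("ascii"))


# Example call (hypothetical host and metric path):
# send_metric("graphite.example.com", "pandas.panda-0001.temp_c", 41.5)
```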
(Assignee)

Updated

6 years ago
Assignee: server-ops-releng → jwatkins
(In reply to Dustin J. Mitchell [:dustin] from comment #1)
>  - monitoring board temp with nagios (contacting SUTAgent directly from a
> nagios check script)

FWIW, imho we shouldn't do anything SUT-port direct, like checking board temp from nagios. We should instead have mozpool/lifeguard/whatever do the check and have nagios query the mozpool status on it. That keeps nagios from flapping when it can't connect (board is imaging, etc.) while still giving timely alerts when the temp spikes too high (rather than requiring it to be down for an hour or more before alerting).

Mozpool/whatever could reasonably know when a board is expected to be properly imaged, but free, and check it then. That also keeps us from spinning up extra CPU load and network connections while a job is in flight, where the extra load could trip up a job or cause volatile Talos perf numbers.
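
To illustrate the split proposed here, a sketch of the nagios side only: the check maps a state already recorded by mozpool to a plugin result instead of talking to the SUT port. The exit codes follow the standard nagios plugin convention (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN). How the state is fetched from mozpool is deliberately not shown, and check_state is a hypothetical name.

```python
# Standard nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def check_state(state):
    """Map a device state string to a (exit_code, message) pair."""
    if state is None:
        return UNKNOWN, "UNKNOWN: no state reported"
    if state == "fail_overheating":
        return CRITICAL, "CRITICAL: board reported fail_overheating"
    return OK, "OK: state is %s" % state
```

A real check script would print the message and exit with the code; a board that is busy imaging simply reports its current state, so nagios has nothing to flap on.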
I'm not sure what direct benefit we'd see from monitoring them only when they're idle.  Jake's mentioned that this would catch over-temp due to chassis fan failure, but that's not the immediate problem we're trying to solve.
(In reply to Dustin J. Mitchell [:dustin] from comment #3)
> I'm not sure what direct benefit we'd see from monitoring them only when
> they're idle.  Jake's mentioned that this would catch over-temp due to
> chassis fan failure, but that's not the immediate problem we're trying to
> solve.

Well, we could monitor at a higher cadence when idle and a lower cadence when not, with different failure/retry modes when not idle too.
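
Something like the following could express that, as a sketch only. The "free" and "ready" state names are the ones mentioned earlier in this bug; the interval values and the poll_interval name are invented for illustration.

```python
# Hypothetical polling intervals, in seconds.
IDLE_INTERVALS = {"free": 60, "ready": 60}
BUSY_INTERVAL = 600


def poll_interval(state):
    """Poll idle boards often, and anything else (e.g. mid-job) rarely."""
    return IDLE_INTERVALS.get(state, BUSY_INTERVAL)
```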

Basically, my main point was that trying to coerce nagios to do the right thing with SUT port scripts, beyond a simple "does it listen on 20701, can we connect on 20701", is too much for nagios imho.
Whiteboard: [2013Q2] [tracker]
(Assignee)

Updated

5 years ago
Depends on: 870853
Whiteboard: [2013Q2] [tracker] → [2013Q3] [tracker]
Component: Server Operations: RelEng → RelOps
Product: mozilla.org → Infrastructure & Operations
Whiteboard: [2013Q3] [tracker] → [2013Q3] [tracker] kanbanzilla[Ready to work on]
We determined that temperature was not the problem, and this check is not needed.
Status: NEW → RESOLVED
Last Resolved: 4 years ago
Resolution: --- → WONTFIX