Closed Bug 817047 Opened 12 years ago Closed 10 years ago

Add panda temp monitoring

Tracking

(Not tracked)

Status:

RESOLVED WONTFIX

People

(Reporter: dividehex, Assigned: dividehex)

References

Details

(Whiteboard: [2013Q3] [tracker] kanbanzilla[Ready to work on])

Jake Watkins [:dividehex]

Assignee

Description

•

12 years ago

We would like to monitor panda board temperature using the built-in onboard sensors.

The idea is to add a fail_overheating state to mozpool and have a panda board report that event when it reaches a predetermined temperature threshold.
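
As a rough illustration of the proposed state, the sketch below moves a board into fail_overheating once a reading crosses a threshold. Only the state name comes from this bug; the function name, the threshold value, and the idea that a reading arrives as a float in degrees Celsius are assumptions made for the example, not mozpool's actual state machine.

```python
# Hypothetical threshold; the bug does not specify a value.
OVERHEAT_THRESHOLD_C = 85.0


def next_state(current_state: str, temp_c: float,
               threshold_c: float = OVERHEAT_THRESHOLD_C) -> str:
    """Return 'fail_overheating' at or above the threshold, else keep the state."""
    if temp_c >= threshold_c:
        return "fail_overheating"
    return current_state
```

A reading below the threshold leaves the board in whatever state it was already in, so the check can run alongside normal state handling.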

Dustin J. Mitchell [:dustin] (he/him)

Comment 1

•

12 years ago

Related: bug 817057 suggests polling temperature among other things in the "free" and maybe "ready" states.  That will catch problems with a chassis's fans failing.

Other temperature-monitoring options:

 - short-term, polling temperature using an ad-hoc script and stuffing it into graphite to get some data and help us decide if this is really an issue

 - monitoring board temp with nagios (contacting SUTAgent directly from a nagios check script)
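
The short-term graphite option could look roughly like the sketch below. It formats readings in Carbon's plaintext protocol ("metric value timestamp", one per line, usually sent to port 2003). The metric path and helper names are made up for illustration, and how the temperature is actually read from a board is left out because this bug does not specify it.

```python
import socket


def graphite_line(metric: str, value: float, timestamp: int) -> str:
    """Format one sample in Carbon's plaintext protocol."""
    return f"{metric} {value} {timestamp}\n"


def send_to_graphite(host: str, port: int, lines: list) -> None:
    """Send already-formatted lines to a Carbon plaintext listener."""
    payload = "".join(lines).encode("ascii")
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall(payload)


# Example (hypothetical metric path and reading):
# line = graphite_line("pandas.panda-0001.temp_c", 51.5, 1354000000)
# send_to_graphite("graphite.example.com", 2003, [line])
```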

Jake Watkins [:dividehex]

Assignee

Updated

•

12 years ago

Assignee: server-ops-releng → jwatkins

Justin Wood (:Callek)

Comment 2

•

12 years ago

(In reply to Dustin J. Mitchell [:dustin] from comment #1)
>  - monitoring board temp with nagios (contacting SUTAgent directly from a
> nagios check script)

FWIW, IMHO we shouldn't do anything that talks to the SUT port directly from nagios, such as checking board temp. Instead, mozpool/lifeguard/whatever should do the check, and nagios should query mozpool for the status. That keeps nagios from flapping when it can't connect (board is imaging, etc.) while still giving timely alerts when the temperature spikes too high (rather than requiring the board to be down for an hour or more before alerting).

Mozpool/whatever could reasonably know when a board is expected to be properly imaged but free, and check it then. This also keeps us from spending extra CPU and network connections while a job is in flight, where that extra load could trip up a job or cause volatile Talos perf numbers.
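
A minimal sketch of the nagios side of this arrangement, assuming the check script has already obtained the device's state string from mozpool by some means (the lookup itself is not shown, since this bug does not describe the interface). The exit codes follow the standard Nagios plugin convention: 0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN. The function name and the device name in the usage note are hypothetical.

```python
# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def check_device_state(device: str, state):
    """Map a mozpool-reported state to a (exit_code, message) pair.

    `state` is None when no state could be obtained. Any state other than
    fail_overheating is treated as OK, so a board that is busy imaging
    does not make the check flap.
    """
    if state is None:
        return UNKNOWN, f"UNKNOWN - no state available for {device}"
    if state == "fail_overheating":
        return CRITICAL, f"CRITICAL - {device} is in state {state}"
    return OK, f"OK - {device} is in state {state}"
```

A wrapper script would print the message and exit with the code, for example `code, msg = check_device_state("panda-0001", state); print(msg); sys.exit(code)`.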

Dustin J. Mitchell [:dustin] (he/him)

Comment 3

•

12 years ago

I'm not sure what direct benefit we'd see from monitoring them only when they're idle.  Jake's mentioned that this would catch over-temp due to chassis fan failure, but that's not the immediate problem we're trying to solve.

Justin Wood (:Callek)

Comment 4

•

12 years ago

(In reply to Dustin J. Mitchell [:dustin] from comment #3)
> I'm not sure what direct benefit we'd see from monitoring them only when
> they're idle.  Jake's mentioned that this would catch over-temp due to
> chassis fan failure, but that's not the immediate problem we're trying to
> solve.

Well, we could monitor at a higher cadence when idle and a lower cadence when not, with different failure/retry modes when not idle too.
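
The two-cadence idea can be sketched as a small lookup. The "free" and "ready" state names come from comment 1; the interval values and the function name are placeholders chosen for the example.

```python
# States in which the board is idle, per comment 1.
IDLE_STATES = {"free", "ready"}

# Hypothetical intervals, in seconds.
IDLE_INTERVAL_S = 60
BUSY_INTERVAL_S = 600


def poll_interval(state: str) -> int:
    """Poll often when the board is idle, rarely when a job may be running."""
    return IDLE_INTERVAL_S if state in IDLE_STATES else BUSY_INTERVAL_S
```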

Basically, my main point was that trying to coerce nagios to do the right thing with SUT port scripts, beyond a simple "does it listen on 20701, can we connect on 20701", is too much for nagios IMHO.
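
For reference, the simple "can we connect on 20701" check amounts to a plain TCP connect, roughly as below. The function name and timeout are assumptions; nothing here speaks the SUT protocol.

```python
import socket

SUT_PORT = 20701  # port number taken from the comment above


def port_is_listening(host: str, port: int = SUT_PORT,
                      timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Refused, unreachable, or timed out.
        return False
```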

Amy Rich [:arr] [:arich]

Updated

•

11 years ago

Whiteboard: [2013Q2] [tracker]

Jake Watkins [:dividehex]

Assignee

Updated

•

11 years ago

Depends on: 870853

Melissa O'Connor [:melissa]

Updated

•

11 years ago

Whiteboard: [2013Q2] [tracker] → [2013Q3] [tracker]

Nobody; OK to take it and work on it

Updated

•

11 years ago

Component: Server Operations: RelEng → RelOps

Product: mozilla.org → Infrastructure & Operations

Amy Rich [:arr] [:arich]

Updated

•

11 years ago

Whiteboard: [2013Q3] [tracker] → [2013Q3] [tracker] kanbanzilla[Ready to work on]

Amy Rich [:arr] [:arich]

Comment 5

•

10 years ago

We determined that temperature was not the problem, and this check is not needed.

Status: NEW → RESOLVED

Closed: 10 years ago

Resolution: --- → WONTFIX


Categories

(Infrastructure & Operations :: RelOps: General, task)
