Closed Bug 627825 Opened 12 years ago Closed 8 years ago

review nagios alerts for builds-running, builds-pending

Categories

(Release Engineering :: General, defect, P5)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: joduinn, Assigned: ashish)

References

Details

(Whiteboard: [monitoring][nagios])

To reduce nagios spam, these alerts were intentionally over-relaxed until bug#625978 is fixed and we have some idea of how quickly we can expect the files to be posted.

Once bug#625978 is fixed, we should update this bug with the new nagios threshold we want and kick this bug over to ServerOps. For now, filing to track and leaving in RelEng.
Component: Release Engineering → Release Engineering: Automation (General)
QA Contact: release → catlee
Hardware: x86 → All
Whiteboard: [monitoring][nagios]
So bug 627821 (relaxing the checks) was WONTFIX so we may never have eased them.

Amy, what are the age thresholds for these nagios checks on dm-wwwbuild01:
 http_age - build-4hr
 http_age - builds-pending
 http_age - builds-running
Product: mozilla.org → Release Engineering
Blocks: 926246
(In reply to Nick Thomas [:nthomas] from comment #1)
> So bug 627821 (relaxing the checks) was WONTFIX so we may never have eased
> them.
> 
> Amy, what are the age thresholds for these nagios checks on
> dm-wwwbuild01:
>  http_age - build-4hr
>  http_age - builds-pending
>  http_age - builds-running

arr: ping? what is the current threshold on these alerts?


ed: also, as sheriffs closed the trees because of these build json files being stale over the weekend (bug#926245), any opinions on what threshold you'd be looking for?
Flags: needinfo?(emorley)
Flags: needinfo?(arich)
(In reply to John O'Duinn [:joduinn] from comment #2)
> ed: also, as sheriffs closed the trees because of these build json files
> being stale over the weekend (bug#926245), any opinions on what threshold
> you'd be looking for?

I requested "http_age - build-4hr" be adjusted in bug 914686 to...
* check_interval: 300s
* file age threshold: 300s

...I'm presuming this affected both our email alert and the #releng IRC alert (arr: would be good to confirm they are linked?).

Something similar for builds-running and builds-pending would be ideal :-)
Flags: needinfo?(emorley)
The SRE team takes care of nagios now, so tagging ashish to answer comment 1, since he'll probably be up soon.
Flags: needinfo?(arich) → needinfo?(ashish)
Given the age of the original request, these checks don't exist anymore, likely gone with dm-wwwbuild01:

> http_age - builds-pending
> http_age - builds-running

From what I gather, "http_age - builds-4hr" is now "http file age - /buildjson/builds-4hr.js.gz". I'm positive the other two checks were not lost in migration (to the current Nagios infrastructure).

I'll be glad to add these back as a priority. Please let me know which server to check these on and the other parameters: check_interval, thresholds, contact groups.
Flags: needinfo?(ashish)
Okay, in the general interest of keeping things monitored, I've gone ahead and added checks for builds-pending.js and builds-running.js. Same thresholds and config as for builds-4hr.js.gz:

Check interval: 300s
Check failures before alert: 3
File age threshold: 300s
URL: https://nagios.mozilla.org/releng-scl3/cgi-bin/status.cgi?navbarsearch=1&host=builddata.pub.build.mozilla.org
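For readers unfamiliar with this kind of check, here is a rough Python sketch of how a file-age check with the parameters above might behave. It is not the actual Nagios plugin used here. It assumes the file's age is taken from the HTTP `Last-Modified` header and that an alert fires only after three consecutive stale results. The function names and the strict `>` comparison are illustrative assumptions.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from urllib.request import Request, urlopen

# Values taken from the comment above.
FILE_AGE_THRESHOLD_S = 300
FAILURES_BEFORE_ALERT = 3


def file_age_seconds(last_modified: str, now: datetime) -> float:
    """Return the age in seconds of a file, given its Last-Modified header."""
    modified = parsedate_to_datetime(last_modified)
    return (now - modified).total_seconds()


def fetch_last_modified(url: str, timeout: float = 10.0) -> str:
    """Hypothetical helper: issue a HEAD request and return Last-Modified."""
    request = Request(url, method="HEAD")
    with urlopen(request, timeout=timeout) as response:
        return response.headers["Last-Modified"]


def should_alert(ages, threshold=FILE_AGE_THRESHOLD_S,
                 failures=FAILURES_BEFORE_ALERT) -> bool:
    """True once `failures` consecutive observed ages exceed `threshold`.

    Assumption: a result is stale when age > threshold, and one fresh
    result resets the failure count.
    """
    consecutive = 0
    for age in ages:
        consecutive = consecutive + 1 if age > threshold else 0
        if consecutive >= failures:
            return True
    return False
```

Under these assumptions, a single fresh result between stale ones resets the count, so a file that is occasionally refreshed late would not alert.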

Please reopen this bug for any changes.
Assignee: nobody → ashish
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Component: General Automation → General