Closed Bug 1304158 Opened 8 years ago Closed 8 years ago

Create alerts for t-w732 and g-w732 pending

Categories

(Infrastructure & Operations Graveyard :: CIDuty, task)

task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: catlee, Assigned: aobreja)

Attachments

(2 files)

nagios didn't alert us when we started having a lot of pending for g-w732 jobs last week. I'm not sure if that means we have no check right now, or it's not working?
By checking the script that creates the alerts for pending jobs, we found that the threshold is set to 1200 pending jobs for WARNING and 2000 for CRITICAL: https://dxr.mozilla.org/build-central/source/braindump/nagios-related/check_pending_builds.py#111 Last week the WARNING threshold was not reached, as both t-w732 and g-w732 had fewer than 1000 pending jobs. I also tested check_pending_builds.py to confirm that these platforms are included in the pending-jobs check; both are listed in slavepools, so alerts are enabled for both.
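To make the threshold behaviour concrete, here is a minimal sketch of an absolute-count check using the 1200/2000 values quoted above. It is not the actual check_pending_builds.py code: the function name, the constant names, and the choice of `>=` for the comparison are assumptions made for illustration. Only the exit codes (0 OK, 1 WARNING, 2 CRITICAL) follow the standard Nagios plugin convention.

```python
# Hypothetical sketch, NOT the real check_pending_builds.py.
# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL = 0, 1, 2

# Thresholds as quoted in the comment above.
WARN_THRESHOLD = 1200
CRIT_THRESHOLD = 2000


def classify_pending(pending, warn=WARN_THRESHOLD, crit=CRIT_THRESHOLD):
    """Map an absolute pending-job count to a Nagios status code.

    Assumption: a count equal to a threshold already triggers it.
    """
    if pending >= crit:
        return CRITICAL
    if pending >= warn:
        return WARNING
    return OK
```

With these values, a pool sitting just under 1000 pending jobs returns OK, which matches the observation that no alert fired last week.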
Attached image g-w732-spot.PNG
g-w732-spot
Attached image t-w732-spot.PNG
t-w732-spot
Assignee: nobody → aobreja
The reasons I'm able to alert on them when nagios fails are:

* I don't alert on total pending, I alert on the ratio of pending to the size of the pool
* for pools with any pending at all, I alert on the ratio of the size of the pool to the number of slaves that have done a job in the last 4 hours
* not my strongest alert, but I do alert on a single pool having a wildly different pending::pool ratio than the rest
* by far my most useful, I have separate backlog age alerts for Try+fuzzer and non-Try
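The first three ratio-based checks described in this comment could be sketched roughly as follows. This is only an illustration of the idea, not philor's actual tooling: every function name, the `factor` and `min_baseline` parameters, and the use of the median as the "rest of the pools" baseline are assumptions.

```python
from statistics import median


def pending_ratio(pending, pool_size):
    """Pending jobs per machine in the pool."""
    if pool_size <= 0:
        # An empty pool with work queued is the worst case.
        return float("inf") if pending > 0 else 0.0
    return pending / pool_size


def pool_to_active_ratio(pool_size, active_recently):
    """Pool size divided by machines that ran a job recently (e.g. last 4 hours).

    A high value means most of the pool is not taking jobs.
    """
    if active_recently <= 0:
        return float("inf")
    return pool_size / active_recently


def outlier_pools(ratios, factor=5.0, min_baseline=0.1):
    """Return names of pools whose pending ratio is far above the median.

    `factor` and `min_baseline` are hypothetical tuning knobs;
    `min_baseline` keeps a zero median from flagging every pool.
    """
    if not ratios:
        return []
    baseline = max(median(ratios.values()), min_baseline)
    return sorted(name for name, r in ratios.items() if r > factor * baseline)
```

Because each check is relative to pool size, a small pool can be flagged long before its absolute pending count gets anywhere near a fixed threshold such as 1200.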
(In reply to Phil Ringnalda (:philor) from comment #4)
> The reasons I'm able to alert on them when nagios fails are:
>
> * I don't alert on total pending, I alert on the ratio of pending to the size of the pool
>
> * for pools with any pending at all, I alert on the ratio of the size of the pool to the number of slaves that have done a job in the last 4 hours
>
> * not my strongest alert, but I do alert on a single pool having a wildly different pending::pool ratio than the rest
>
> * by far my most useful, I have separate backlog age alerts for Try+fuzzer and non-Try

These all sound like alerts we should automate.
(In reply to Chris Cooper [:coop] from comment #5)
> These all sound like alerts we should automate.

Filed bug 1315766 for this.
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Product: Release Engineering → Infrastructure & Operations
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard