Closed
Bug 1304158
Opened 8 years ago
Closed 8 years ago
Create alerts for t-w732 and g-w732 pending
Categories
(Infrastructure & Operations Graveyard :: CIDuty, task)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: catlee, Assigned: aobreja)
Attachments
(2 files)
Nagios didn't alert us when a lot of g-w732 jobs started piling up as pending last week. I'm not sure whether that means we have no check right now or the check isn't working.
Comment 1•8 years ago (Assignee)
By checking the script that creates the alerts for pending jobs, we found that the thresholds are set to 1200 pending jobs for WARNING and 2000 for CRITICAL:
https://dxr.mozilla.org/build-central/source/braindump/nagios-related/check_pending_builds.py#111
Last week the WARNING threshold was not reached, as both t-w732 and g-w732 had fewer than 1000 pending jobs.
I also tested check_pending_builds.py to confirm that these platforms are included in the pending-job checks: both are listed in slavepools, so alerts are enabled for both.
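To illustrate why no alert fired, here is a minimal sketch of a fixed-threshold check using the two values quoted above. It is not the code in check_pending_builds.py: the function name `classify_pending` is hypothetical, and the choice of `>=` for the comparison is an assumption. The return values follow the standard Nagios plugin exit codes (0 OK, 1 WARNING, 2 CRITICAL).

```python
# Hypothetical sketch, not the actual check_pending_builds.py logic.
WARNING_THRESHOLD = 1200   # pending jobs, value quoted in this comment
CRITICAL_THRESHOLD = 2000  # pending jobs, value quoted in this comment

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL = 0, 1, 2


def classify_pending(pending_count,
                     warning=WARNING_THRESHOLD,
                     critical=CRITICAL_THRESHOLD):
    """Map a raw pending-job count to a Nagios status code.

    Assumption: a count equal to a threshold already triggers it.
    """
    if pending_count >= critical:
        return CRITICAL
    if pending_count >= warning:
        return WARNING
    return OK


if __name__ == "__main__":
    # A pool with fewer than 1000 pending jobs stays at OK.
    print(classify_pending(999))
```

With an absolute threshold like this, a pool with just under 1000 pending jobs never leaves the OK state, regardless of how small the pool is.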
Comment 2•8 years ago (Assignee)
g-w732-spot
Comment 3•8 years ago (Assignee)
t-w732-spot
Updated•8 years ago (Assignee)
Assignee: nobody → aobreja
Comment 4•8 years ago
The reasons I'm able to alert on them when nagios fails are:
* I don't alert on total pending, I alert on the ratio of pending to the size of the pool
* for pools with any pending at all, I alert on the ratio of the size of the pool to the number of slaves that have done a job in the last 4 hours
* not my strongest alert, but I do alert on a single pool having a wildly different pending::pool ratio than the rest
* by far my most useful, I have separate backlog age alerts for Try+fuzzer and non-Try
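The four checks listed above could be sketched roughly as follows. This is only an illustration of the idea, not the tooling actually used: the `Pool` fields, the function names, and every limit (`MAX_PENDING_RATIO`, `MAX_IDLE_RATIO`, `OUTLIER_FACTOR`, and the backlog age limits) are hypothetical placeholders.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class Pool:
    name: str
    size: int            # machines in the pool
    pending: int         # jobs currently pending
    active_last_4h: int  # machines that did a job in the last 4 hours


# All limits are illustrative assumptions, not values from this bug.
MAX_PENDING_RATIO = 2.0  # pending jobs per machine
MAX_IDLE_RATIO = 1.5     # pool size per recently active machine
OUTLIER_FACTOR = 5.0     # how far above the other pools counts as "wildly different"


def pending_ratio(pool):
    """Pending jobs relative to pool size."""
    if pool.size == 0:
        return float("inf") if pool.pending else 0.0
    return pool.pending / pool.size


def check_pools(pools):
    """Return (pool name, reason) tuples for the first three checks."""
    alerts = []
    ratios = {p.name: pending_ratio(p) for p in pools}
    for p in pools:
        ratio = ratios[p.name]
        # 1. Ratio of pending to pool size, not total pending.
        if ratio > MAX_PENDING_RATIO:
            alerts.append((p.name, "pending/pool ratio high"))
        # 2. Only for pools with any pending: pool size vs. working machines.
        if p.pending > 0 and (
            p.active_last_4h == 0
            or p.size / p.active_last_4h > MAX_IDLE_RATIO
        ):
            alerts.append((p.name, "few machines taking jobs"))
        # 3. One pool far above the median ratio of the other pools.
        others = [r for name, r in ratios.items() if name != p.name]
        if others and ratio > OUTLIER_FACTOR * max(median(others), 0.1):
            alerts.append((p.name, "ratio differs from other pools"))
    return alerts


def backlog_too_old(oldest_pending_minutes, is_try,
                    try_limit=240, non_try_limit=60):
    """4. Separate backlog age limits for Try+fuzzer and non-Try jobs."""
    limit = try_limit if is_try else non_try_limit
    return oldest_pending_minutes > limit


if __name__ == "__main__":
    pools = [
        Pool("g-w732", size=100, pending=900, active_last_4h=20),
        Pool("t-w732", size=100, pending=50, active_last_4h=90),
        Pool("other", size=100, pending=40, active_last_4h=95),
    ]
    for name, reason in check_pools(pools):
        print(name, reason)
```

In the made-up example data, the g-w732 pool has 900 pending jobs, which is below a 1200-job absolute threshold, yet it trips all three pool checks because the count is large relative to the pool and few machines are taking work.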
Comment 5•8 years ago
(In reply to Phil Ringnalda (:philor) from comment #4)
> The reasons I'm able to alert on them when nagios fails are:
>
> * I don't alert on total pending, I alert on the ratio of pending to the
> size of the pool
>
> * for pools with any pending at all, I alert on the ratio of the size of the
> pool to the number of slaves that have done a job in the last 4 hours
>
> * not my strongest alert, but I do alert on a single pool having a wildly
> different pending::pool ratio than the rest
>
> * by far my most useful, I have separate backlog age alerts for Try+fuzzer
> and non-Try
These all sound like alerts we should automate.
Comment 6•8 years ago
(In reply to Chris Cooper [:coop] from comment #5)
> These all sound like alerts we should automate.
Filed bug 1315766 for this.
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Updated•7 years ago
Product: Release Engineering → Infrastructure & Operations
Updated•5 years ago
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard