Closed Bug 1304158 Opened 8 years ago Closed 8 years ago

Create alerts for t-w732 and g-w732 pending

Tracking

(Not tracked)

Status:

RESOLVED FIXED

People

(Reporter: catlee, Assigned: aobreja)

References

Details

Attachments

(2 files)

g-w732-spot.PNG 8 years ago Andrei Obreja [:aobreja NOT AVAILABLE][:buildduty] 55.28 KB, image/png		Details
t-w732-spot.PNG 8 years ago Andrei Obreja [:aobreja NOT AVAILABLE][:buildduty] 80.86 KB, image/png		Details

Chris AtLee [:catlee]

Reporter

Description

8 years ago

Nagios didn't alert us when we started having a lot of pending g-w732 jobs last week. I'm not sure whether that means we have no check right now or the check isn't working.

Andrei Obreja [:aobreja NOT AVAILABLE][:buildduty]

Assignee

Comment 1

8 years ago

By checking the script that creates the alerts for pending jobs, we found that the threshold is set to 1200 pending jobs for WARNING and 2000 for CRITICAL: https://dxr.mozilla.org/build-central/source/braindump/nagios-related/check_pending_builds.py#111

Last week the WARNING threshold was not reached, as both t-w732 and g-w732 had fewer than 1000 pending jobs. I also tested check_pending_builds.py to confirm that these platforms are included in the pending-jobs check: both are listed in slavepools, so the alerts are enabled for both.
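For illustration, a fixed-threshold check of the kind described in comment 1 might look roughly like the sketch below. This is not the code from check_pending_builds.py: the function name `classify_pending` and the inclusive comparison (`>=`) are assumptions, and only the 1200/2000 values come from the comment. The exit codes follow the usual Nagios plugin convention (0 OK, 1 WARNING, 2 CRITICAL).

```python
# Hypothetical sketch of a flat pending-count threshold check.
# Not taken from check_pending_builds.py; names are illustrative.

OK, WARNING, CRITICAL = 0, 1, 2  # conventional Nagios plugin exit codes

WARNING_THRESHOLD = 1200   # value quoted in comment 1
CRITICAL_THRESHOLD = 2000  # value quoted in comment 1


def classify_pending(pending, warning=WARNING_THRESHOLD,
                     critical=CRITICAL_THRESHOLD):
    """Map a pending-job count to a Nagios status code.

    Assumes the thresholds are inclusive; the real script may differ.
    """
    if pending >= critical:
        return CRITICAL
    if pending >= warning:
        return WARNING
    return OK


if __name__ == "__main__":
    # A count below 1000, as reported for t-w732 and g-w732,
    # stays under the WARNING threshold, so no alert fires.
    for count in (950, 1200, 2500):
        print(count, classify_pending(count))
```

Under these assumptions, a pool with fewer than 1000 pending jobs never leaves the OK state, which would match what was observed last week.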

Andrei Obreja [:aobreja NOT AVAILABLE][:buildduty]

Assignee

Comment 2

8 years ago

Attached image g-w732-spot.PNG — Details

g-w732-spot

Andrei Obreja [:aobreja NOT AVAILABLE][:buildduty]

Assignee

Comment 3

8 years ago

Attached image t-w732-spot.PNG — Details

t-w732-spot

Andrei Obreja [:aobreja NOT AVAILABLE][:buildduty]

Assignee

Updated

8 years ago

Assignee: nobody → aobreja

Phil Ringnalda (:philor)

Comment 4

8 years ago

The reasons I'm able to alert on them when nagios fails are:

* I don't alert on total pending, I alert on the ratio of pending to the size of the pool
* for pools with any pending at all, I alert on the ratio of the size of the pool to the number of slaves that have done a job in the last 4 hours
* not my strongest alert, but I do alert on a single pool having a wildly different pending::pool ratio than the rest
* by far my most useful, I have separate backlog age alerts for Try+fuzzer and non-Try
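A rough sketch of how the ratio-based checks listed in comment 4 could be expressed is given below. It is not philor's actual tooling: the `PoolStats` fields, the function names, and every limit value are hypothetical placeholders chosen only to make the example runnable, and comparing against the median is just one possible reading of "wildly different".

```python
# Hypothetical sketch of ratio-based pending alerts.
# All names and limits are assumptions, not values from this bug.
from dataclasses import dataclass
from statistics import median


@dataclass
class PoolStats:
    name: str
    pending: int          # jobs currently pending for the pool
    pool_size: int        # machines configured in the pool
    recently_active: int  # machines that ran a job in the last 4 hours


MAX_PENDING_PER_MACHINE = 5.0  # placeholder limit
MAX_POOL_TO_ACTIVE = 2.0       # placeholder limit
OUTLIER_FACTOR = 4.0           # placeholder: multiple of the median ratio
MAX_BACKLOG_AGE_MIN = {"try+fuzzer": 240, "non-try": 60}  # placeholders


def pending_ratio(pool):
    """Pending jobs per machine in the pool."""
    if pool.pool_size == 0:
        return float("inf") if pool.pending else 0.0
    return pool.pending / pool.pool_size


def pool_alerts(pools):
    """Return (pool name, reason) pairs for pools that look unhealthy."""
    ratios = {p.name: pending_ratio(p) for p in pools}
    typical = median(ratios.values()) if ratios else 0.0
    alerts = []
    for p in pools:
        if p.pending == 0:
            continue  # the checks below only apply to pools with pending jobs
        ratio = ratios[p.name]
        # 1. pending relative to pool size, not total pending
        if ratio > MAX_PENDING_PER_MACHINE:
            alerts.append((p.name, "high pending-to-pool ratio"))
        # 2. pool size relative to machines that recently did work
        if (p.recently_active == 0
                or p.pool_size / p.recently_active > MAX_POOL_TO_ACTIVE):
            alerts.append((p.name, "few machines recently active"))
        # 3. ratio far from what the other pools show
        if typical > 0 and ratio > OUTLIER_FACTOR * typical:
            alerts.append((p.name, "ratio differs from other pools"))
    return alerts


def backlog_alerts(oldest_pending_age_min):
    """4. separate backlog-age limits per category (Try+fuzzer, non-Try)."""
    return [category for category, age in oldest_pending_age_min.items()
            if age > MAX_BACKLOG_AGE_MIN.get(category, float("inf"))]
```

With made-up numbers such as `PoolStats("g-w732", pending=900, pool_size=100, recently_active=20)` alongside healthier pools, these checks would flag the pool even though 900 pending jobs is below a flat 1200 WARNING threshold.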

Chris Cooper [:coop] (he/him)

Comment 5

8 years ago

(In reply to Phil Ringnalda (:philor) from comment #4)
> The reasons I'm able to alert on them when nagios fails are:
>
> * I don't alert on total pending, I alert on the ratio of pending to the size of the pool
>
> * for pools with any pending at all, I alert on the ratio of the size of the pool to the number of slaves that have done a job in the last 4 hours
>
> * not my strongest alert, but I do alert on a single pool having a wildly different pending::pool ratio than the rest
>
> * by far my most useful, I have separate backlog age alerts for Try+fuzzer and non-Try

These all sound like alerts we should automate.

Chris Cooper [:coop] (he/him)

Comment 6

8 years ago

(In reply to Chris Cooper [:coop] from comment #5)
> These all sound like alerts we should automate.

Filed bug 1315766 for this.

Status: NEW → RESOLVED

Closed: 8 years ago

Resolution: --- → FIXED

BMO Automation

Updated

7 years ago

Product: Release Engineering → Infrastructure & Operations

BMO Automation

Updated

5 years ago

Product: Infrastructure & Operations → Infrastructure & Operations Graveyard

Categories

(Infrastructure & Operations Graveyard :: CIDuty, task)
