Closed
Bug 978956
Opened 11 years ago
Closed 9 years ago
Improve nagios alerts for pending job backlog
Categories
(Infrastructure & Operations Graveyard :: CIDuty, task)
Tracking
(Not tracked)
RESOLVED
DUPLICATE
of bug 1220191
People
(Reporter: RyanVM, Unassigned)
References
Details
(Keywords: sheriffing-P1, Whiteboard: [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/2595] )
Right now, we only receive alerts when the number of pending jobs is too high. This misses a critical time component: sometimes the issue isn't how many builds are pending but how long they've been pending. Note that this is about pending builds, not tests.
Right now, the only process we have in place for reporting builds falling behind is "sheriffs happen to notice when not running in only unstarred mode on TBPL that AWS builds have been pending for an hour, close the trees, and proceed to scream in #releng about it" (this just happened). We've seen other instances in the last week of AWS builds falling 15-20min behind across the board, even during low load times of day (in one case, it was legitimate bustage from a RelEng change).
Builds falling behind by 30+ minutes, especially during high-load parts of the day, is a tree-closing event, because leaving the trees open lets bustage pile up and makes the consequences worse. Therefore, we need better ways of monitoring for these pending build delays so they can be discovered in a more orderly fashion and investigated before things get out of control.
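To make the request concrete, here is a minimal sketch of an age-based check, as opposed to the existing count-based one. It is not an existing RelEng script. The function name `check_pending_age`, the `submit_times` input (UNIX timestamps of still-pending build requests, from whatever source holds them), and the 15/30 minute thresholds are all assumptions for illustration; the thresholds are loosely taken from the delays described above. Only the exit codes follow the standard Nagios plugin convention (0 = OK, 1 = WARNING, 2 = CRITICAL).

```python
# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL = 0, 1, 2


def check_pending_age(submit_times, now, warn_secs=15 * 60, crit_secs=30 * 60):
    """Return (exit_code, message) based on how long the oldest job has waited.

    submit_times: UNIX timestamps of jobs that are still pending
                  (hypothetical input; the real data source is not specified).
    now:          current UNIX timestamp.
    warn_secs / crit_secs: assumed thresholds, not agreed values.
    """
    ages = [now - t for t in submit_times]
    if not ages:
        return OK, "PENDING OK - no pending jobs"

    oldest = max(ages)
    detail = "oldest of %d pending jobs has waited %d min" % (len(ages), oldest // 60)

    # Alert on wait time, regardless of how many jobs are in the queue.
    if oldest >= crit_secs:
        return CRITICAL, "PENDING CRITICAL - " + detail
    if oldest >= warn_secs:
        return WARNING, "PENDING WARNING - " + detail
    return OK, "PENDING OK - " + detail
```

A wrapper would print the message and exit with the returned code so Nagios can pick it up. Keying on the oldest pending job means a short queue that has been stuck for an hour still alerts, which is the case the count-based alert misses.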
Updated•10 years ago
Whiteboard: [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/2595]
Comment 1•10 years ago (Reporter)
Seriously, I can recount multiple recent instances where having this alert would have prevented issues from blowing up into much worse ones.
Keywords: sheriffing-P1
Comment 2•10 years ago (Reporter)
Adjusting the summary to make it clear that this applies to builds and tests both.
Summary: Improve nagios alerts for pending build backlog → Improve nagios alerts for pending job backlog
Updated•9 years ago
Component: General Automation → Buildduty
QA Contact: catlee → bugspam.Callek
Updated•9 years ago
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → DUPLICATE
Updated•7 years ago
Product: Release Engineering → Infrastructure & Operations
Updated•5 years ago
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard