Closed Bug 978956 Opened 11 years ago Closed 9 years ago

Improve nagios alerts for pending job backlog

Categories

(Infrastructure & Operations Graveyard :: CIDuty, task)

task
Not set
normal

Tracking

(Not tracked)

RESOLVED DUPLICATE of bug 1220191

People

(Reporter: RyanVM, Unassigned)

Details

(Keywords: sheriffing-P1, Whiteboard: [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/2595] )

Right now, we only receive alerts when the number of pending jobs gets too high. That misses a critical time component: sometimes the issue isn't how many builds are pending but how long they've been pending. Note that this is about pending builds, not tests.

Right now, the only process we have in place for reporting builds falling behind is "sheriffs happen to notice, when not running in only-unstarred mode on TBPL, that AWS builds have been pending for an hour, close the trees, and proceed to scream in #releng about it" (this just happened). We've seen other instances in the last week of AWS builds falling 15-20 minutes behind across the board, even during low-load times of day (in one case, it was legitimate bustage from a RelEng change).

Builds falling behind by 30+ minutes, especially during high-load parts of the day, is a tree-closing event, because leaving the trees open lets bustage pile up and makes the consequences worse. Therefore, we need better ways of monitoring for these pending build delays so they can be discovered in a more orderly fashion and investigated before things get out of control.
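To make the request concrete, here is a minimal sketch of an age-based check in the style of a Nagios plugin. It is only an illustration, not an existing RelEng script: the function name `check_pending_age` and the 15-minute and 30-minute thresholds are assumptions taken loosely from the delays described above, and how the list of pending jobs is obtained is left out entirely. The exit codes (0 OK, 1 WARNING, 2 CRITICAL) are the standard Nagios plugin return codes.

```python
import time

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL = 0, 1, 2

# Hypothetical thresholds, loosely based on the delays described in this bug.
WARN_SECONDS = 15 * 60
CRIT_SECONDS = 30 * 60


def check_pending_age(submitted_times, now=None,
                      warn=WARN_SECONDS, crit=CRIT_SECONDS):
    """Return (exit_code, message) based on the oldest pending job's wait.

    submitted_times: iterable of Unix timestamps, one per pending job.
    The alert depends on how long jobs have waited, not on how many there are.
    """
    now = time.time() if now is None else now
    ages = [now - t for t in submitted_times]
    oldest = max(ages, default=0)  # no pending jobs means nothing is waiting
    minutes = int(oldest // 60)
    if oldest >= crit:
        return CRITICAL, f"CRITICAL: oldest pending job has waited {minutes} min"
    if oldest >= warn:
        return WARNING, f"WARNING: oldest pending job has waited {minutes} min"
    return OK, f"OK: {len(ages)} pending, oldest has waited {minutes} min"
```

A real plugin would print the message and exit with the returned code (for example via `sys.exit(code)`), so that a single build stuck for an hour raises CRITICAL even when the total pending count is low.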
Blocks: re-nagios
Whiteboard: [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/2595]
Seriously, I can recount multiple recent instances where having this alert would have prevented issues from blowing up into much worse ones.
Keywords: sheriffing-P1
Adjusting the summary to make it clear that this applies to builds and tests both.
Summary: Improve nagios alerts for pending build backlog → Improve nagios alerts for pending job backlog
Component: General Automation → Buildduty
QA Contact: catlee → bugspam.Callek
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → DUPLICATE
Product: Release Engineering → Infrastructure & Operations
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard