Closed Bug 681111 Opened 13 years ago Closed 13 years ago

change threshold on buildbot-start alerts

Categories

(Infrastructure & Operations :: RelOps: General, task)

x86
All
task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: joduinn, Assigned: arich)

Details

We recently had bad win32 wait times for builds because a number of win32 machines were not accepting jobs; details are in bug#680494 and bug#680457. It turns out the buildbot-start alerts were flagged as "in downtime until 2012" because they were spamming, but in this case the alert would have caught the problem on the win32 builders.

1) Can you change the check so that it starts alerting only after 7 days, and then alerts every X minutes? Another suggestion was to track twistd.log instead of buildbot-start. Of course, these are just suggestions; if there is a better way to do this in nagios, let us know.

2) Can you then clear the current "downtime until 2012" alert in nagios?

3) Short term, RelEng buildduty will manually go through any machines alerting as idle after 7 days and reboot them, which will clear the alert. Of course, the longer-term solution is to fix bug#637347 and bug#627126.



With these in place, we can reduce the risk of bad wait times caused by this.
In addition to the checks being downtimed, they are also currently set to not notify. Did you want this to apply to *just* the w32 machines, or would you like this change made across the board so that you catch any machine that hasn't been modified in 7 days? I can split the w32 machines out into their own check and turn on notifications for them, or I can change this globally.
Assignee: server-ops-releng → arich
per meeting with IT:

I originally wrote this bug thinking about win32, because only win32 bit us in bug#680494 and bug#680457.

However, it is valid to check for this on all OSes and, per :arr, it's easier to implement by running the same check on all of them. Therefore, please do this check on all OSes.
Summary: change threshold on win32 buildbot-start alerts → change threshold on buildbot-start alerts
Okay, I've re-enabled the check for all servers and set the freshness check to 7 days. In order to enable the check and remove all the downtimes at once (because there are multiple downtimes per check), I simply removed and re-added the check. So the 7 day period starts now for all hosts.
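For readers unfamiliar with this kind of check, here is a minimal Python sketch of a 7-day file-age test in the style of a Nagios plugin. It is an illustration only, not the check that was actually deployed: the function name `check_file_age` and the idea of checking a file's modification time are assumptions, and only the exit codes (0 OK, 2 CRITICAL, 3 UNKNOWN) follow the standard Nagios plugin convention.

```python
import os
import time

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

# Threshold discussed in this bug, expressed in seconds.
SEVEN_DAYS = 7 * 24 * 60 * 60


def check_file_age(path, max_age=SEVEN_DAYS, now=None):
    """Return (exit_code, message) based on how long ago `path` was modified.

    Hypothetical helper for illustration; `now` can be injected for testing.
    """
    now = time.time() if now is None else now
    try:
        mtime = os.path.getmtime(path)
    except OSError as exc:
        # A missing or unreadable file is reported as UNKNOWN, not CRITICAL.
        return UNKNOWN, f"UNKNOWN: cannot stat {path}: {exc}"

    age_days = (now - mtime) / 86400
    if now - mtime > max_age:
        return CRITICAL, f"CRITICAL: {path} not modified in {age_days:.1f} days"
    return OK, f"OK: {path} modified {age_days:.1f} days ago"
```

A wrapper script would print the message and exit with the returned code. How often the alert repeats once the threshold is crossed (the "every X minutes" part of the request) would be set on the Nagios side rather than in the plugin.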
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
Component: Server Operations: RelEng → RelOps
Product: mozilla.org → Infrastructure & Operations