Closed Bug 713281 Opened 13 years ago Closed 10 years ago

setup automatic monitoring of ~/update.log on buildbot masters for exceptions

Categories

(Infrastructure & Operations Graveyard :: CIDuty, task, P2)

Tracking

(Not tracked)

RESOLVED WONTFIX

People

(Reporter: bhearsum, Unassigned)

References

Details

(Whiteboard: [buildmasters])

Bug 712988 tracks an issue where some masters had trouble submitting data to the status DB. As a result, many results were missing from TBPL. We should monitor ~/update.log on the masters the same way we monitor twistd.log, to make this easier to catch.
Given the series of tree-closing outages we've had in the last few weeks, this is something we should get to soon. Marking for triage-followup, but it is available if someone wants to grab it in the meantime.

Ideally, I'd prefer to do this via Nagios, since we already have Nagios on these masters and it gets us email and IRC notifications like all other alerts, which makes buildduty life easier. Any gotchas or objections to that approach?


Also, once an exception shows up in the logs and has been manually resolved, we'll need a way to stop monitoring from flagging that "fixed" exception. Per IRC with catlee, moving or renaming those specific log files should clear the alert.
Summary: watch ~/update.log on masters for exceptions → setup automatic monitoring of ~/update.log on buildbot masters for exceptions
Whiteboard: [triage-followup]
This doesn't fit the model Mozilla has for Nagios (it's all passive, from-the-outside checks).

This is perfect for a tool I was learning about and also for the work that Catlee and I are doing with new messaging tools.  The tool would either grep or tail the log (heck, could even be realtime) and just send a #buildduty alert when the regex is matched.
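
A minimal sketch of the grep-and-alert idea, not an existing tool. The regex, the function names, and the alert text are all assumptions made for illustration; the actual format of ~/update.log is not shown in this bug, so the pattern simply looks for Python-style traceback headers and names ending in Error or Exception.

```python
import re

# Assumed pattern (hypothetical): a Python traceback header, or any
# identifier ending in "Error" or "Exception". Adjust to the real log format.
EXCEPTION_RE = re.compile(
    r"Traceback \(most recent call last\)|\w*(?:Error|Exception)\b"
)


def find_exceptions(lines, pattern=EXCEPTION_RE):
    """Return (line_number, text) for every line that matches the pattern."""
    hits = []
    for number, line in enumerate(lines, start=1):
        if pattern.search(line):
            hits.append((number, line.rstrip("\n")))
    return hits


def format_alert(hostname, hits):
    """Build a one-line message suitable for a channel such as #buildduty."""
    first_number, first_text = hits[0]
    return "%s: %d suspicious line(s) in ~/update.log, first at line %d: %s" % (
        hostname, len(hits), first_number, first_text
    )
```

Feeding it the lines of a log and sending `format_alert(...)` only when `find_exceptions(...)` returns something would be one way to wire this up; how the message actually reaches IRC is left out on purpose.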

The log itself shouldn't be the worry for re-alerting, IMO; it should be whatever data is being stored about the alert, i.e. how many of each alert have happened in a given time frame.  Yes, I'm talking about the releng dashboard here.
I'm downgrading this from critical to major (it doesn't impact the current running of the buildmasters) and assigning it P2 priority.
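
To illustrate "how many of each alert have happened in a given time frame", here is a rough sketch of a sliding-window counter. The class name, the window length, and the alert names are hypothetical; nothing here reflects how the releng dashboard stores its data.

```python
from collections import Counter, deque


class AlertWindow:
    """Count alerts per name over the last `window_seconds` seconds.

    Assumes events are recorded in non-decreasing timestamp order.
    """

    def __init__(self, window_seconds=3600):
        self.window_seconds = window_seconds
        self._events = deque()  # (timestamp, alert_name), oldest first

    def record(self, alert_name, timestamp):
        """Remember one occurrence of an alert."""
        self._events.append((timestamp, alert_name))

    def counts(self, now):
        """Drop events older than the window and count the rest by name."""
        cutoff = now - self.window_seconds
        while self._events and self._events[0][0] < cutoff:
            self._events.popleft()
        return Counter(name for _, name in self._events)
```

With counts like these, re-alerting could be decided from the stored data (for example, alert again only when a count rises) rather than from the state of the log file.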
Severity: critical → major
Priority: -- → P2
Whiteboard: [triage-followup] → [buildmasters][triage-followup]
(In reply to Mike Taylor [:bear] from comment #2)
> This is perfect for a tool I was learning about and also for the work that
> Catlee and I are doing with new messaging tools.  The tool would either grep
> or tail the log (heck, could even be realtime) and just send a #buildduty
> alert when the regex is matched.
> 
> The log itself shouldn't be the worry for re-alerting, IMO; it should be
> whatever data is being stored about the alert, i.e. how many of each alert
> have happened in a given time frame.  Yes, I'm talking about the releng
> dashboard here.

Seems like a lot of work, and for a tool that doesn't exist yet.

Why don't we just modify the existing twistd.log script to handle the update.log, and start logrotating the update.log so we don't have to worry about re-alerting?
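
For reference, a sketch of how a checker could avoid re-alerting on old lines once the log is rotated. This is not the existing twistd.log script (which is not shown in this bug); the function name and the "file got smaller means it was rotated" heuristic are assumptions.

```python
import os


def read_new_lines(path, last_offset):
    """Return (new_lines, new_offset) for content appended since last_offset.

    If the file is now smaller than last_offset, assume it was rotated or
    truncated and start again from the beginning.
    """
    size = os.path.getsize(path)
    if size < last_offset:
        last_offset = 0  # assumed rotation: the old content is gone
    with open(path, "rb") as f:
        f.seek(last_offset)
        data = f.read()
    lines = data.decode("utf-8", errors="replace").splitlines()
    return lines, last_offset + len(data)
```

The caller would persist the returned offset between runs. Note the heuristic misses a rotation if the new file has already grown past the old offset; comparing inode numbers would be a sturdier check.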
Whiteboard: [buildmasters][triage-followup] → [buildmasters]
(In reply to Chris Cooper [:coop] from comment #4)
> Seems like a lot of work, and for a tool that doesn't exist yet.
> 
> Why don't we just modify the existing twistd.log script to handle the
> update.log, and start logrotating the update.log so we don't have to worry
> about re-alerting?

Sure, that would be a reasonable v1.0 way of dealing with this.
Product: mozilla.org → Release Engineering
Found in triage.
Blocks: re-nagios
Component: Other → Platform Support
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → WONTFIX
Component: Platform Support → Buildduty
Product: Release Engineering → Infrastructure & Operations
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard