setup automatic monitoring of ~/update.log on buildbot masters for exceptions

RESOLVED WONTFIX

Status

Product: Infrastructure & Operations
Component: CIDuty
Priority: P2
Severity: major
Status: RESOLVED WONTFIX
Opened: 7 years ago
Last updated: 2 months ago

People

(Reporter: bhearsum, Unassigned)

Tracking

(Blocks: 1 bug)

Details

(Whiteboard: [buildmasters])

(Reporter)

Description

7 years ago
bug 712988 tracks an issue where some masters were having trouble submitting data to the status DB. The end result was that tons of results went missing from TBPL. We should monitor ~/update.log on the masters the way we monitor twistd.log, to make this easier to catch.
Given the series of tree-closing outages we've had in the last few weeks, this is something we should get to soon. Marking for triage-followup, but it's available if someone wants to grab it in the meantime.

Ideally, I'd prefer to do this via nagios, as we already have nagios on these masters, and it gets us email+irc notifications like all other alerts - making buildduty life easier. Any gotchas/objections to that approach?


Also, once an exception is hit in the logs and manually resolved, we'll need to do something so that the "fixed" exception is no longer flagged by monitoring. From IRC w/catlee: moving/renaming those specific log files should do the trick for clearing the alert.
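To make the proposal concrete, here is a minimal sketch of what a Nagios-style check for this could look like. Everything in it is an assumption for illustration: the log path, the exception regex, and the "missing file means rotated, so the alert clears" behavior described above — this is not the actual RelEng tooling.

```python
#!/usr/bin/env python
"""Hypothetical Nagios-style plugin: scan a buildbot master log for
exceptions and report via standard Nagios exit codes."""
import os
import re
import sys

LOG_PATH = os.path.expanduser("~/update.log")  # assumed location
# Assumed markers; the real logs may use different strings.
PATTERN = re.compile(r"Traceback \(most recent call last\)|Unhandled [Ee]rror")

# Standard Nagios plugin exit codes
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

def check(path):
    if not os.path.exists(path):
        # Per the comment above: moving/renaming the log clears the alert.
        print("OK: %s not present (rotated or renamed)" % path)
        return OK
    with open(path) as f:
        hits = [line.rstrip() for line in f if PATTERN.search(line)]
    if hits:
        print("CRITICAL: %d exception(s) in %s; first: %s"
              % (len(hits), path, hits[0][:80]))
        return CRITICAL
    print("OK: no exceptions in %s" % path)
    return OK

if __name__ == "__main__":
    sys.exit(check(LOG_PATH))
```

Run from NRPE or cron, the exit code drives the alert state, and the clearing workflow is exactly the rename described above.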
Summary: watch ~/update.log on masters for exceptions → setup automatic monitoring of ~/update.log on buildbot masters for exceptions
Whiteboard: [triage-followup]

Comment 2

7 years ago
This is something that doesn't fit the model Mozilla has for Nagios (it's all passive, from-the-outside checks).

This is perfect for a tool I was learning about, and also for the work that Catlee and I are doing with new messaging tools.  The tool would either grep or tail the log (heck, could even be realtime) and just send a #buildduty alert when the regex is matched.

The log itself shouldn't be the worry for re-alerting, IMO; it should be whatever data is being stored about the alert, i.e. how many of each alert have happened in a given time frame.  Yes, I'm talking about the releng dashboard here.
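The grep-or-tail idea above can be sketched in a few lines. This is an illustration only: the regex, the `#buildduty` message format, and the `send_alert` hook are all stand-ins for whatever messaging tool would actually be used.

```python
"""Sketch of the tail-the-log-and-alert-on-regex approach: follow a log
file and emit a #buildduty-style alert whenever a pattern matches."""
import re
import time

# Assumed exception marker; real logs may differ.
EXC_RE = re.compile(r"Traceback \(most recent call last\)")

def alerts(lines, pattern=EXC_RE):
    """Yield an alert string for each line matching pattern."""
    for line in lines:
        if pattern.search(line):
            yield "#buildduty alert: %s" % line.strip()

def follow(path, interval=1.0):
    """Tail a file like `tail -f` (simplified: no rotation handling)."""
    with open(path) as f:
        f.seek(0, 2)  # start at end of file
        while True:
            line = f.readline()
            if line:
                yield line
            else:
                time.sleep(interval)

# Realtime usage (runs forever; send_alert is a hypothetical hook):
#   for msg in alerts(follow("/builds/buildbot/update.log")):
#       send_alert(msg)
```

Separating the matcher (`alerts`) from the tailer (`follow`) keeps the alert logic testable without a live log, and the counts-per-timeframe aggregation mentioned above could sit downstream of the same stream.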

Comment 3

7 years ago
I'm changing the severity of this from critical to major (as it doesn't impact the current running of the buildmasters) and assigning it a P2 priority.
Severity: critical → major
Priority: -- → P2

Updated

7 years ago
Whiteboard: [triage-followup] → [buildmasters][triage-followup]

Comment 4

7 years ago
(In reply to Mike Taylor [:bear] from comment #2)
> This is perfect for a tool I was learning about and also for the work that I
> and Catlee are doing with new messaging tools.  The tool would either grep
> or tail the log (heck, could even be realtime) and just send a #buildduty
> alert when the regex is matched.
> 
> The log itself shouldn't be the worry for realerting, IMO, it should be
> whatever data is being stored about the alert, i.e. how many of each alert
> have happened in a given time frame.  Yes, i'm talking about the releng
> dashboard here.

Seems like a lot of work, and for a tool that doesn't exist yet.

Why don't we just modify the existing twistd.log script to handle the upload.log, and start logrotating the upload.log so we don't have to worry about re-alerting?
Whiteboard: [buildmasters][triage-followup] → [buildmasters]

Comment 5

7 years ago
(In reply to Chris Cooper [:coop] from comment #4)
> Seems like a lot of work, and for a tool that doesn't exist yet.
> 
> Why don't we just modify the existing twistd.log script to handle the
> upload.log, and start logrotating the upload.log so we don't have to worry
> about re-alerting?

Sure, that would be a reasonable v1.0 way of dealing with this.
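The logrotate half of the v1.0 plan above might look like the fragment below. The file path, rotation schedule, and use of copytruncate are assumptions for illustration, not the actual RelEng configuration.

```
# Hypothetical /etc/logrotate.d/buildbot-upload entry
/builds/buildbot/upload.log {
    daily
    rotate 7
    compress
    missingok
    notifempty
    copytruncate   # rotate in place without disturbing the writer's open fd
}
```

With daily rotation, an exception only stays in the scanned log until the next rotation, so the monitoring stops re-alerting without any manual file renaming.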

Updated

5 years ago
Product: mozilla.org → Release Engineering
Found in triage.
Blocks: 885560
Component: Other → Platform Support

Updated

4 years ago
Status: NEW → RESOLVED
Last Resolved: 4 years ago
Resolution: --- → WONTFIX

Updated

2 months ago
Component: Platform Support → Buildduty
Product: Release Engineering → Infrastructure & Operations