Closed
Bug 526149
Opened 16 years ago
Closed 15 years ago
Exceptions in the buildbot master and slave logs should be mailed to releng
Categories
(Release Engineering :: General, defect, P3)
Release Engineering
General
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: coop, Assigned: rail)
References
()
Details
Attachments
(4 files)
|
4.79 KB,
patch
|
bhearsum
:
review+
coop
:
review+
nthomas
:
review+
|
Details | Diff | Splinter Review |
|
4.62 KB,
text/plain
|
catlee
:
feedback+
coop
:
checked-in+
|
Details |
|
317 bytes,
text/plain
|
Details | |
|
1.35 KB,
patch
|
Details | Diff | Splinter Review |
As the summary indicates, we need to make exceptions encountered by the buildbot masters more visible.
We burn through twisted logs at an incredible clip on the production masters, and I know that I'm completely ignorant of how often we hit exceptions or how serious they are. I'm also not tempted to go looking because it would mean grep-ing through (sometimes) 10,000 log files that are being constantly rotated. We need to come up with some lightweight way to capture any exceptions that do happen and get them in front of our eyes automatically.
My naive suggestion involves grep-ing logs and mailing exceptions. Whatever we do needs to be pretty lightweight so as not to further bog down already overloaded servers. Could we be catching exceptions in some sort of global try block and mailing exceptions out before continuing?
Comment 1•16 years ago
|
||
I would assume that there is already a global try block as the uncaught exceptions are put in a log file instead of aborting the application. I bet we we could echo exceptions in there to a temporary file to ensure that not much in-python processing time is used and have a cronjob email all the exceptions.
I am going to move to future until someone has time to take this on
Component: Release Engineering → Release Engineering: Future
Comment 2•16 years ago
|
||
(In reply to comment #1)
> I would assume that there is already a global try block as the uncaught
> exceptions are put in a log file instead of aborting the application. I bet we
> we could echo exceptions in there to a temporary file to ensure that not much
> in-python processing time is used and have a cronjob email all the exceptions.
>
> I am going to move to future until someone has time to take this on
You've identified a core Buildbot bug here, IMHO. I just filed it upstream, http://buildbot.net/trac/ticket/635.
Comment 3•16 years ago
|
||
I'd suspect that to be a twisted RFE rather than a buildbot one.
Updated•16 years ago
|
Assignee: nobody → catlee
Component: Release Engineering: Future → Release Engineering
Summary: Exceptions in the buildbot master logs should be mailed to releng → Exceptions in the buildbot master and slave logs should be mailed to releng
Comment 4•16 years ago
|
||
So this is run thusly:
python watch_twistd_log.py -e release@mozilla.com -t /builds/buildbot/twistd_log_time.txt /builds/buildbot/moz2-master
It will keep track of the most recent timestamp of the files it examines in the twistd_log_time.txt file. It scans through all files more recent than that, and emails all exceptions found to the specified e-mail address(es).
It should be runnable on the masters and slaves via a cronjob.
Attachment #410862 -
Flags: review?(nthomas)
Attachment #410862 -
Flags: review?(ccooper)
Attachment #410862 -
Flags: review?(bhearsum)
Comment 5•16 years ago
|
||
So I just determined that this could cause problems on windows, where the log rotation will possibly fail if this script has a file open at the same time.
| Reporter | ||
Comment 6•16 years ago
|
||
Comment on attachment 410862 [details] [diff] [review]
First attempt at twisted log watcher
Looks fine to me.
Would there be any point to making the search string configurable from the command line in case we want to use this as a general purpose twisted log parsing tool?
Attachment #410862 -
Flags: review?(ccooper) → review+
Updated•16 years ago
|
Attachment #410862 -
Flags: review?(bhearsum) → review+
Comment 7•16 years ago
|
||
Comment on attachment 410862 [details] [diff] [review]
First attempt at twisted log watcher
Looks good to me, too.
Updated•16 years ago
|
Attachment #410862 -
Flags: review?(nthomas) → review+
Updated•15 years ago
|
Assignee: catlee → nobody
| Assignee | ||
Updated•15 years ago
|
Assignee: nobody → raliiev
| Assignee | ||
Comment 8•15 years ago
|
||
I tested the script on staging master 01 yesterday and today (to simulate daily reports).
First run took ~38min with 70-80% CPU and 2-3 LA. The message size was ~700K.
Today's run took 3 sec (fast enough) and the resulting message size is 133K.
I've changed a bit the message format to make it more readable:
- msg = "The following exceptions were detected on %s:\n" % hostname + "\n".join(excs)
+ msg = "The following %i exceptions were detected on %s:\n\n" % (len(excs), hostname)
+ msg += ("\n" + "-"*80 +"\n").join(excs)
If there is no objection, I would set up a cronjob for staging masters and see how it works in the real world situation, so we can get more feedback on the message format.
Status: NEW → ASSIGNED
Comment 9•15 years ago
|
||
My only problem with having it run on staging masters is that they are restarted far more frequently than the production masters, thus generating a lot more exceptions.
Also, I think it should run more frequently than every day. Maybe once an hour would be better? Allows us to respond to problems more quickly.
| Assignee | ||
Comment 10•15 years ago
|
||
(In reply to comment #9)
> Also, I think it should run more frequently than every day. Maybe once an hour
> would be better? Allows us to respond to problems more quickly.
Agree.
Additionally, it would be better to have a description besides the hostname (like sm01) in the subject.
| Assignee | ||
Comment 11•15 years ago
|
||
Set up in /home/cltbld/bin/watch_twistd_log for moz2-master, moz2-master2 and moz2-master3, hourly.
| Assignee | ||
Comment 12•15 years ago
|
||
Updated version with slightly changed formatting.
It's been running in staging-master for ~5 days and works as expected.
Is there any stopper to install it on production masters?
Attachment #433035 -
Flags: feedback?(catlee)
| Assignee | ||
Comment 13•15 years ago
|
||
| Assignee | ||
Comment 14•15 years ago
|
||
Updated•15 years ago
|
Attachment #433035 -
Flags: feedback?(catlee) → feedback+
| Reporter | ||
Comment 15•15 years ago
|
||
Comment on attachment 433035 [details]
Updated version
http://hg.mozilla.org/build/tools/rev/825341afe342
Attachment #433035 -
Flags: checked-in+
| Assignee | ||
Comment 16•15 years ago
|
||
Deployed on:
production-master.build (/build/moz2-master)
production-master02.build (/build/moz2-master)
talos-master (/build/{talos-pool,talos-r3,talos-try})
Status: ASSIGNED → RESOLVED
Closed: 15 years ago
Resolution: --- → FIXED
Updated•12 years ago
|
Product: mozilla.org → Release Engineering
You need to log in
before you can comment on or make changes to this bug.
Description
•