Bug 526149 (Closed): Opened 11 years ago, Closed 10 years ago

Exceptions in the buildbot master and slave logs should be mailed to releng

Categories

(Release Engineering :: General, defect, P3)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: coop, Assigned: rail)

Details

Attachments

(4 files)

As the summary indicates, we need to make exceptions encountered by the buildbot masters more visible. 

We burn through twisted logs at an incredible clip on the production masters, and I know that I'm completely ignorant of how often we hit exceptions or how serious they are. I'm also not tempted to go looking, because it would mean grepping through (sometimes) 10,000 log files that are constantly being rotated. We need to come up with some lightweight way to capture any exceptions that do happen and get them in front of our eyes automatically.

My naive suggestion involves grepping logs and mailing the exceptions. Whatever we do needs to be pretty lightweight so as not to further bog down already-overloaded servers. Could we catch exceptions in some sort of global try block and mail them out before continuing?
I would assume that there is already a global try block, since uncaught exceptions are put in a log file instead of aborting the application. I bet we could echo exceptions from there to a temporary file, to ensure that not much in-Python processing time is used, and have a cronjob email all the exceptions.

I am going to move this to Future until someone has time to take it on.
Component: Release Engineering → Release Engineering: Future
(In reply to comment #1)
> I would assume that there is already a global try block, since uncaught
> exceptions are put in a log file instead of aborting the application. I bet
> we could echo exceptions from there to a temporary file, to ensure that not
> much in-Python processing time is used, and have a cronjob email all the
> exceptions.
> 
> I am going to move this to Future until someone has time to take it on.

You've identified a core Buildbot bug here, IMHO. I just filed it upstream, http://buildbot.net/trac/ticket/635.
I'd suspect that to be a Twisted RFE rather than a Buildbot one.
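For reference, Twisted's log API already supports custom observers, so a hook roughly like this sketch might be all that's needed (the file path and function name here are made up for illustration):

from twisted.python import log

def _exception_observer(event):
    # Twisted marks uncaught exceptions with isError; render and append
    # them to a side file that a cron job can mail out later.
    if not event.get('isError'):
        return
    text = log.textFromEventDict(event)
    if text is not None:
        f = open('/tmp/buildbot-exceptions.log', 'a')
        try:
            f.write(text + '\n')
        finally:
            f.close()

log.addObserver(_exception_observer)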
Assignee: nobody → catlee
Component: Release Engineering: Future → Release Engineering
Summary: Exceptions in the buildbot master logs should be mailed to releng → Exceptions in the buildbot master and slave logs should be mailed to releng
So this is run thusly:

python watch_twistd_log.py -e release@mozilla.com -t /builds/buildbot/twistd_log_time.txt /builds/buildbot/moz2-master

It keeps track of the most recent timestamp among the files it examines, storing it in twistd_log_time.txt. On each run it scans all files more recent than that, and emails any exceptions found to the specified e-mail address(es).

It should be runnable on the masters and slaves via a cronjob.
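For anyone reading along, the core of the approach is roughly the following sketch (the real code is in the attachment; the function name and search string here are illustrative, though "Unhandled Error" is what twistd prints for uncaught exceptions):

import os

def scan_logs(logdir, last_time, search='Unhandled Error'):
    """Collect matching lines from log files modified since last_time."""
    exceptions = []
    newest = last_time
    for name in os.listdir(logdir):
        path = os.path.join(logdir, name)
        if not os.path.isfile(path):
            continue
        mtime = os.path.getmtime(path)
        if mtime <= last_time:
            continue  # unchanged since the previous run, skip it
        newest = max(newest, mtime)
        for line in open(path):
            if search in line:
                exceptions.append(line.rstrip())
    # Caller writes 'newest' back to the time file for the next run.
    return exceptions, newest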
Attachment #410862 - Flags: review?(nthomas)
Attachment #410862 - Flags: review?(ccooper)
Attachment #410862 - Flags: review?(bhearsum)
So I just determined that this could cause problems on Windows, where log rotation may fail if this script has a file open at the same time.
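One way to shrink that window, sketched below (untested on Windows), is to slurp each file and close the handle before doing any scanning:

f = open(path)
try:
    data = f.read()
finally:
    f.close()
# The file handle is released here, before any (possibly slow) scanning,
# so a concurrent log rotation has a much smaller window to collide.
for line in data.splitlines():
    if 'Unhandled Error' in line:
        print line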
Comment on attachment 410862 [details] [diff] [review]
First attempt at twisted log watcher

Looks fine to me.

Would there be any point in making the search string configurable from the command line, in case we want to use this as a general-purpose twisted log parsing tool?
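If we did that, an optparse option along these lines would probably do (a sketch only; the flag name and default are invented):

from optparse import OptionParser

parser = OptionParser()
parser.add_option('-e', '--email', action='append', dest='emails',
                  help='address(es) to mail exceptions to')
parser.add_option('-t', '--time-file', dest='time_file',
                  help='file used to remember the last-seen timestamp')
# Hypothetical new flag: override the default search string.
parser.add_option('-s', '--search', dest='search',
                  default='Unhandled Error',
                  help='string to look for in the twistd logs')
options, args = parser.parse_args()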
Attachment #410862 - Flags: review?(ccooper) → review+
Attachment #410862 - Flags: review?(bhearsum) → review+
Comment on attachment 410862 [details] [diff] [review]
First attempt at twisted log watcher

Looks good to me, too.
Attachment #410862 - Flags: review?(nthomas) → review+
Assignee: catlee → nobody
Assignee: nobody → raliiev
I tested the script on staging master 01 yesterday and today (to simulate daily reports).

First run took ~38 min at 70-80% CPU with a load average of 2-3. The message size was ~700K.

Today's run took 3 sec (fast enough), and the resulting message size was 133K.

I've changed the message format a bit to make it more readable:

-    msg = "The following exceptions were detected on %s:\n" % hostname + "\n".join(excs)
+    msg = "The following %i exceptions were detected on %s:\n\n" % (len(excs), hostname)
+    msg += ("\n" + "-"*80 +"\n").join(excs)


If there are no objections, I'll set up a cronjob for the staging masters and see how it works in a real-world situation, so we can get more feedback on the message format.
Status: NEW → ASSIGNED
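With that change, a report looks roughly like this (hostname, timestamps, and traceback bodies invented for illustration):

The following 2 exceptions were detected on staging-master01:

2010-03-20 10:15:02+0100 [-] Unhandled Error
	Traceback (most recent call last):
	...
--------------------------------------------------------------------------------
2010-03-20 10:17:44+0100 [-] Unhandled Error
	Traceback (most recent call last):
	...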
My only problem with having it run on staging masters is that they are restarted far more frequently than the production masters, thus generating a lot more exceptions.

Also, I think it should run more frequently than every day. Maybe once an hour would be better? That would let us respond to problems more quickly.
(In reply to comment #9) 
> Also, I think it should run more frequently than every day. Maybe once an
> hour would be better? That would let us respond to problems more quickly.

Agree.

Additionally, it would be better to have a short description alongside the hostname (like sm01) in the subject.
Set up in /home/cltbld/bin/watch_twistd_log for moz2-master, moz2-master2 and moz2-master3, hourly.
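For the record, the cron entries are along these lines (hourly per the discussion above; recipient and exact paths assumed from the earlier comments):

0 * * * * python /home/cltbld/bin/watch_twistd_log -e release@mozilla.com -t /builds/buildbot/twistd_log_time.txt /builds/buildbot/moz2-master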
Attached file Updated version
Updated version with slightly changed formatting.

It's been running on staging-master for ~5 days and works as expected.

Is there anything stopping us from installing it on the production masters?
Attachment #433035 - Flags: feedback?(catlee)
Attachment #433035 - Flags: feedback?(catlee) → feedback+
Deployed on:
production-master.build (/build/moz2-master)
production-master02.build (/build/moz2-master)
talos-master (/build/{talos-pool,talos-r3,talos-try})
Status: ASSIGNED → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Blocks: 688403
Product: mozilla.org → Release Engineering