Bug 505512 - Make infrastructure related problems turn the tree a color other than red
Status: RESOLVED FIXED
Product: Release Engineering
Classification: Other
Component: Other
Version: other
Platform: All All
Importance: P5 normal
Target Milestone: ---
Assigned To: Ben Hearsum (:bhearsum)
Duplicates: 503580
Depends on: 476656 590383 595027 601694
Blocks: 591055 releng-downtime
Reported: 2009-07-21 12:11 PDT by Brad Lassey [:blassey] (use needinfo?)
Modified: 2013-08-12 21:54 PDT


Attachments
simple checking for hg errors (2.73 KB, patch)
2010-08-16 15:05 PDT, Ben Hearsum (:bhearsum)
catlee: feedback+
turn builds with hg errors purple (and retry them) (26.24 KB, patch)
2010-08-24 10:57 PDT, Ben Hearsum (:bhearsum)
catlee: review+
bhearsum: checked‑in-
updated buildbotcustom patch (26.87 KB, patch)
2010-08-26 10:59 PDT, Ben Hearsum (:bhearsum)
catlee: review+
bhearsum: checked‑in+
round up of upstream changesets we need (6.72 KB, patch)
2010-08-26 12:29 PDT, Ben Hearsum (:bhearsum)
catlee: review+
bhearsum: checked‑in+
catch out of disk space globally, purge errors, fix hg errors (41.26 KB, patch)
2010-09-13 12:26 PDT, Ben Hearsum (:bhearsum)
no flags
get rid of EvaluatingMercurial, fix bug in DisconnectStep (49.60 KB, patch)
2010-09-14 09:11 PDT, Ben Hearsum (:bhearsum)
catlee: review+
fix a bunch more potential reconfig issues (59.75 KB, patch)
2010-09-17 11:32 PDT, Ben Hearsum (:bhearsum)
catlee: review+
bhearsum: checked‑in+
fix missing import (748 bytes, patch)
2010-10-12 06:32 PDT, Ben Hearsum (:bhearsum)
rail: review+
aki: checked‑in+

Description Brad Lassey [:blassey] (use needinfo?) 2009-07-21 12:11:10 PDT
We started this discussion at the all hands a while back.  It seems that we've had a lot of infrastructure related problems turning the tree red lately, and I'm worried that this is numbing developers to seeing the tree red (the same way random oranges numb us to seeing orange).  Also, fundamentally the issues need to be addressed by different people.  When the build breaks, the developer needs to either fix or back out his or her patch.  When the infrastructure fails, IT or RelEng need to figure out what the issue is and fix it.  There is also a different lead time in getting the fix landed (10 seconds to back out versus a week or more for a maintenance window).

It was suggested at the all hands that purple would be the most appropriate color since it is used somewhere else to identify infrastructure issues.
Comment 1 cls 2009-07-29 13:31:49 PDT

*** This bug has been marked as a duplicate of bug 476656 ***
Comment 2 Brad Lassey [:blassey] (use needinfo?) 2010-05-25 09:39:48 PDT
I'm going to reopen this bug, since bug 476656 is resolved fixed and we still turn the tree red for infrastructure exceptions. It sounds like it's a dependency, not a dupe.
Comment 3 Ben Hearsum (:bhearsum) 2010-07-26 06:20:03 PDT
Yeah, agreed. I haven't seen anyone stepping up to grab this, so I'm bumping the priority.
Comment 4 Ben Hearsum (:bhearsum) 2010-08-04 08:06:55 PDT
Going to try to look at this this quarter.
Comment 5 Ben Hearsum (:bhearsum) 2010-08-16 15:05:11 PDT
Created attachment 466443
simple checking for hg errors

This patch depends on the upstream Buildbot patch here (or something like it): http://github.com/bhearsum/buildbot/commit/ddd6cf1dc2436efcb0b3e70161c24fdafc4dcaf4

I haven't tested this patch, but it should catch most of the HG errors that we hit. I was hoping to avoid creating a bunch of custom BuildSteps, but after writing this patch I realized that we're going to have to copy/paste all of the calls to regex_log_evaluator around unless we do so. Regardless of which way we go, the upstream patch would be good to have.
Comment 6 Chris AtLee [:catlee] 2010-08-17 18:58:28 PDT
Comment on attachment 466443
simple checking for hg errors

Looks sane.  hg_errors needs to be a list of tuples though I think.
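
For readers following along, the "list of tuples" shape can be sketched roughly as below. This is a minimal standalone illustration, not the patch itself and not Buildbot's regex_log_evaluator: the status constants, their ordering, the HG_ERRORS patterns and the evaluate_log helper are all assumptions made up for this example.

```python
import re

# Assumed status constants for this sketch only; a higher number is
# treated as "worse". These are not taken from the patch.
SUCCESS, WARNINGS, FAILURE, EXCEPTION, RETRY = range(5)

# Hypothetical hg error patterns, expressed as a list of
# (compiled regex, status to report) tuples.
HG_ERRORS = [
    (re.compile(r"abort: HTTP Error 5\d\d"), RETRY),
    (re.compile(r"abort: error: Connection refused"), RETRY),
]


def evaluate_log(current_status, log_text, patterns):
    """Return the worst status among current_status and any matching pattern."""
    worst = current_status
    for regex, status in patterns:
        # Only search the log if a match could actually make things worse.
        if status > worst and regex.search(log_text):
            worst = status
    return worst


failed_pull = "pulling from http://hg.mozilla.org/\nabort: HTTP Error 502: Bad Gateway"
print(evaluate_log(FAILURE, failed_pull, HG_ERRORS) == RETRY)
print(evaluate_log(SUCCESS, "1 files updated", HG_ERRORS) == SUCCESS)
```

Keeping the patterns in one list means a new infrastructure error can be added as a single tuple instead of another copy of the evaluation logic.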
Comment 7 Ben Hearsum (:bhearsum) 2010-08-24 10:57:04 PDT
Created attachment 468727
turn builds with hg errors purple (and retry them)

Here's a more polished version of the previous patch. I tested this locally by changing the repo_path of a build to "mozilla-central2", to cause a 404 error. This caused the build to turn purple (http://tinderbox.mozilla.org/MozillaTest/?noignore=1, in the "OS X 10.5.2 mozilla-central build" column), and be retried.

Builds without errors had no change in behaviour.

I can test this on all platforms in staging if desired, but I think it's safe enough to just land. It depends on this upstream commit: http://github.com/buildbot/buildbot/commit/87fbf3d84711a1d3471ddb36fa16e5eae8bc9464.
Comment 8 Chris AtLee [:catlee] 2010-08-25 08:54:28 PDT
Comment on attachment 468727
turn builds with hg errors purple (and retry them)

Everything looks ok except for this:

>-class MozillaTryServerHgClone(Mercurial):
>+class MozillaTryServerHgClone(EvaluatingMercurial):
>     haltOnFailure = True
>     flunkOnFailure = True
>     
>     def __init__(self, baseURL="http://hg.mozilla.org/", mode='clobber',
>                  defaultBranch='mozilla-central', timeout=3600, **kwargs):
>         # repourl overridden in startVC
>         Mercurial.__init__(self, baseURL=baseURL, mode=mode,
>                            defaultBranch=defaultBranch, timeout=timeout,

You need to update the call to Mercurial.__init__ here I think.
Comment 9 Ben Hearsum (:bhearsum) 2010-08-26 06:23:27 PDT
We're upgrading the masters to the new Buildbot today, this patch is going to land along with that.
Comment 10 Ben Hearsum (:bhearsum) 2010-08-26 06:27:59 PDT
Comment on attachment 468727
turn builds with hg errors purple (and retry them)

changeset:   917:0ba8a3c89102
Comment 11 Ben Hearsum (:bhearsum) 2010-08-26 07:17:13 PDT
Comment on attachment 468727
turn builds with hg errors purple (and retry them)

Got backed out due to errors in the upstream patch.
Comment 12 Ben Hearsum (:bhearsum) 2010-08-26 10:59:12 PDT
Created attachment 469523
updated buildbotcustom patch

Turns out I forgot to commit the fixes to the upcall you suggested. This patch fixes that.

I'll be attaching the upstream diff that we need as well.
Comment 13 Ben Hearsum (:bhearsum) 2010-08-26 12:29:02 PDT
Created attachment 469565
round up of upstream changesets we need

Here are all the upstream changesets we need to make this work bug-free:
http://github.com/buildbot/buildbot/commit/5764bd6edf7b639fb91bfe0e5732aefbf0bb6c5e
http://github.com/buildbot/buildbot/commit/87fbf3d84711a1d3471ddb36fa16e5eae8bc9464
http://github.com/buildbot/buildbot/commit/548d1ace6115c070b4659917536ea7e37e7aa31d
http://github.com/buildbot/buildbot/commit/9b5af09f7fa776ef2fad91c2e58dd2e6a4dde4d5

I've tested this + the buildbotcustom patch in staging. You can see the results on MozillaTest under "OS X 10.6.2 mozilla-central build". twistd.log was clear of relevant exceptions (there were some db ones and a bunch of HTTP 403s from trying to poll shadow-central).
Comment 14 Chris AtLee [:catlee] 2010-08-26 19:45:33 PDT
Comment on attachment 469565
round up of upstream changesets we need

Looks good.  Can you write some tests upstream for this?
Comment 15 Ben Hearsum (:bhearsum) 2010-09-02 06:32:16 PDT
Comment on attachment 469523
updated buildbotcustom patch

changeset:   950:c5881ee2525a
Comment 16 Ben Hearsum (:bhearsum) 2010-09-02 06:32:34 PDT
Comment on attachment 469565
round up of upstream changesets we need

Landed across:
changeset:   91:da98221aa3bb
and
changeset:   92:5e4ed40eafd2
Comment 17 Ben Hearsum (:bhearsum) 2010-09-02 06:47:12 PDT
Updated Buildbot on the masters with:
cd ~cltbld/buildbot
hg pull
hg up
unset PYTHONHOME
cd master
/tools/buildbot-0.8.0/bin/python setup.py install
Comment 18 Ben Hearsum (:bhearsum) 2010-09-02 08:04:54 PDT
This looks like it's going to stick. I successfully made a build retry after an hg error. (I faked it by causing a 404...which made me realize we shouldn't be retrying on 404)
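
The point about not retrying on a 404 could be handled along these lines. This is only a sketch of the idea: the should_retry helper and both pattern lists are hypothetical, and the real patterns would need to come from the hg output actually seen on the slaves.

```python
import re

# Hypothetical patterns for errors that look transient (worth a retry).
TRANSIENT_PATTERNS = [
    re.compile(r"HTTP Error 5\d\d"),
    re.compile(r"Connection (refused|reset|timed out)"),
]

# Hypothetical patterns for errors a retry cannot fix, such as a bad repo path.
PERMANENT_PATTERNS = [
    re.compile(r"HTTP Error 404"),
]


def should_retry(log_text):
    """Retry only when the log shows a transient error and no permanent one."""
    if any(p.search(log_text) for p in PERMANENT_PATTERNS):
        return False
    return any(p.search(log_text) for p in TRANSIENT_PATTERNS)


print(should_retry("abort: HTTP Error 503: Service Unavailable"))
print(should_retry("abort: HTTP Error 404: Not Found"))
```

Checking the permanent patterns first means a log containing both kinds of error is not retried, which seems like the safer default.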
Comment 19 Lukas Blakk [:lsblakk] use ?needinfo 2010-09-02 08:07:11 PDT
*** Bug 503580 has been marked as a duplicate of this bug. ***
Comment 20 Tony Mechelynck [:tonymec] 2010-09-03 02:16:42 PDT
The L? link is almost invisible against that dark purple background. Adding the following to userContent.css made it much more distinct:

/*
 * make text legible on purple boxes of tinderbox.mozilla.org
 */
@-moz-document domain(tinderbox.mozilla.org) {
 td[bgcolor="770088"]
  { background-color: #F4F    !important
  }
}
Comment 21 Ben Hearsum (:bhearsum) 2010-09-03 06:02:12 PDT
(In reply to comment #20)
> The L? link is almost invisible against that dark purple background. Adding the
> following to userContent.css made it much more distinct:
> 
> /*
>  * make text legible on purple boxes of tinderbox.mozilla.org
>  */
> @-moz-document domain(tinderbox.mozilla.org) {
>  td[bgcolor="770088"]
>   { background-color: #F4F    !important
>   }
> }

This bug is tracking the Buildbot integration bit, but I filed bug 593341 for this.
Comment 22 Ben Hearsum (:bhearsum) 2010-09-13 12:26:05 PDT
Created attachment 474768
catch out of disk space globally, purge errors, fix hg errors

There's quite a bit going on here:
- Replacing ShellCommand/SetProperty/Trigger/Mercurial steps with those of our own creation
- Get rid of now-superfluous EvaluatingMercurial
- Delete unused l10n.py
- Fix some classes to use super_class
- Fix a bunch of subclasses' evaluateCommand to upcall properly.

I'm testing this along with the upstream changeset noted in bug 595027 on staging build & test masters now. So far, so good.
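
To illustrate what "upcall properly" in evaluateCommand means here, a rough standalone sketch follows. None of these classes are the real buildbotcustom or Buildbot ones: BaseStep, LeakCheckStep, the status constants and the dict-shaped cmd are all stand-ins invented for the example.

```python
# Assumed status constants for this sketch only.
SUCCESS, WARNINGS, FAILURE, EXCEPTION = range(4)


class BaseStep(object):
    """Stand-in for a shared base step that holds the global infra checks."""

    def evaluateCommand(self, cmd):
        # Global check: treat an out-of-disk-space log as an infra problem.
        if "No space left on device" in cmd["log"]:
            return EXCEPTION
        return SUCCESS if cmd["rc"] == 0 else FAILURE


class LeakCheckStep(BaseStep):
    """Hypothetical subclass with its own extra check."""

    def evaluateCommand(self, cmd):
        # Upcall first so the global checks still apply to this step.
        parent_result = BaseStep.evaluateCommand(self, cmd)
        if parent_result != SUCCESS:
            return parent_result
        # Only then apply the step-specific logic.
        if "LEAK" in cmd["log"]:
            return WARNINGS
        return SUCCESS


step = LeakCheckStep()
print(step.evaluateCommand({"rc": 1, "log": "write failed: No space left on device"}))
print(step.evaluateCommand({"rc": 0, "log": "LEAK: 12 bytes"}))
```

Without the upcall, the subclass would report the disk-space failure as an ordinary failure (or miss it entirely), which is the kind of thing the global checks are meant to prevent.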
Comment 23 Chris AtLee [:catlee] 2010-09-14 08:13:50 PDT
Comment on attachment 474768
catch out of disk space globally, purge errors, fix hg errors

As per IRC, the patch needs to be updated with the new Mercurial stuff.
Comment 24 Ben Hearsum (:bhearsum) 2010-09-14 09:11:53 PDT
Created attachment 475100
get rid of EvaluatingMercurial, fix bug in DisconnectStep

This is the same as the last patch modulo the Mercurial stuff and fixing up DisconnectStep to use super_class -- I had a reconfig issue with it in staging.
Comment 25 Ben Hearsum (:bhearsum) 2010-09-17 11:32:51 PDT
Created attachment 476319
fix a bunch more potential reconfig issues

Sorry to throw up yet another version of this, Chris, but I hit some reconfig issues today and wanted to make sure we avoid them in production. Specifically, I had issues with CompareBloatLogs, but I applied the super_class workaround to pretty much everything. Nothing else changed in this patch.
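
For anyone wondering why a super_class attribute would help with reconfigs, here is one plausible mechanism, sketched in plain Python. This is an assumption about the cause, not something confirmed in this bug: if a reconfig re-executes the module, the class name gets rebound to a new class object, and instances created before the reload can no longer use super(ClassName, self). The class names below are invented and the "reload" is simulated by simply redefining the classes.

```python
class Base(object):
    def evaluate(self):
        return "ok"


class FragileStep(Base):
    def evaluate(self):
        # Looks up whatever the name FragileStep refers to at call time.
        return super(FragileStep, self).evaluate()


class RobustStep(Base):
    def __init__(self):
        # Remember the parent class on the instance instead of by name.
        self.super_class = Base

    def evaluate(self):
        return self.super_class.evaluate(self)


# Instances created before the simulated reload.
old_fragile = FragileStep()
old_robust = RobustStep()


# Simulated reload: the same names now point at brand new class objects.
class FragileStep(Base):
    def evaluate(self):
        return super(FragileStep, self).evaluate()


class RobustStep(Base):
    def __init__(self):
        self.super_class = Base

    def evaluate(self):
        return self.super_class.evaluate(self)


try:
    old_fragile.evaluate()
    fragile_failed = False
except TypeError:
    # super() rejects old_fragile: it is not an instance of the new class.
    fragile_failed = True

robust_result = old_robust.evaluate()
print(fragile_failed, robust_result)
```

The trade-off is that self.super_class has to be kept in sync by hand if the parent class ever changes.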
Comment 26 Ben Hearsum (:bhearsum) 2010-09-17 13:17:20 PDT
This patch is ready to go, I've run it in staging for a long time without issue. Will land in the next RelEng downtime.
Comment 27 Ben Hearsum (:bhearsum) 2010-10-04 08:00:18 PDT
Comment on attachment 476319
fix a bunch more potential reconfig issues

Landed in 1fd614e8c662 and 17a88ee7a7aa.

I'm going to consider this bug fixed now; we don't catch all infrastructure errors yet but the framework is there to easily add more. We'll do those in follow-up bugs.
Comment 28 Rail Aliiev [:rail] 2010-10-11 01:26:05 PDT
Attachment #476319 tries to execute self.finished(EXCEPTION), but EXCEPTION is not imported:

2010-10-10 15:32:14-0700 [HTTPPageGetter,client] Unhandled Error
	Traceback (most recent call last):
	  File "/builds/buildbot/builder-master/sandbox/lib/python2.5/site-packages/Twisted-10.1.0-py2.5-linux-i686.egg/twisted/internet/defer.py", line 441, in _runCallbacks
	    self.result = callback(self.result, *args, **kw)
	  File "/builds/buildbot/builder-master/sandbox/lib/python2.5/site-packages/Twisted-10.1.0-py2.5-linux-i686.egg/twisted/internet/defer.py", line 664, in _cbDeferred
	    self.callback(self.resultList)
	  File "/builds/buildbot/builder-master/sandbox/lib/python2.5/site-packages/Twisted-10.1.0-py2.5-linux-i686.egg/twisted/internet/defer.py", line 318, in callback
	    self._startRunCallbacks(result)
	  File "/builds/buildbot/builder-master/sandbox/lib/python2.5/site-packages/Twisted-10.1.0-py2.5-linux-i686.egg/twisted/internet/defer.py", line 424, in _startRunCallbacks
	    self._runCallbacks()
	--- <exception caught here> ---
	  File "/builds/buildbot/builder-master/sandbox/lib/python2.5/site-packages/Twisted-10.1.0-py2.5-linux-i686.egg/twisted/internet/defer.py", line 441, in _runCallbacks
	    self.result = callback(self.result, *args, **kw)
	  File "/builds/buildbot/buildbotcustom/buildbotcustom/steps/test.py", line 489, in postFinished
	    self.finished(EXCEPTION)
	exceptions.NameError: global name 'EXCEPTION' is not defined
Comment 29 Ben Hearsum (:bhearsum) 2010-10-12 06:32:55 PDT
Created attachment 482531
fix missing import

This patch fixes the missing import. I couldn't find any other files that were also missing imports.
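
A check for this kind of missing import can be automated with something like the sketch below. It is not the method used for this patch: the missing_result_names helper and the RESULT_NAMES set are made up for illustration, and the check is deliberately crude (it ignores scoping and cannot see through "from x import *").

```python
import ast

# Names of the result constants this sketch looks for (an assumed list).
RESULT_NAMES = {"SUCCESS", "WARNINGS", "FAILURE", "SKIPPED", "EXCEPTION", "RETRY"}


def missing_result_names(source):
    """Return result names that are used in source but never imported or assigned."""
    tree = ast.parse(source)
    bound, used = set(), set()
    for node in ast.walk(tree):
        if isinstance(node, (ast.Import, ast.ImportFrom)):
            for alias in node.names:
                bound.add((alias.asname or alias.name).split(".")[0])
        elif isinstance(node, ast.Name):
            if isinstance(node.ctx, ast.Store):
                bound.add(node.id)
            elif node.id in RESULT_NAMES:
                used.add(node.id)
    return sorted(used - bound)


# Illustrative module text only; it is parsed, never imported or executed.
example = (
    "from buildbot.status.builder import SUCCESS\n"
    "\n"
    "def postFinished(self):\n"
    "    self.finished(EXCEPTION)\n"
)
print(missing_result_names(example))
```

Running it over each file in a steps directory would flag any module that uses one of these names without binding it.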
Comment 30 Aki Sasaki [:aki] 2010-10-12 07:23:40 PDT
Comment on attachment 482531
fix missing import

http://hg.mozilla.org/build/buildbotcustom/rev/31a22bd3816e
Comment 31 Ben Hearsum (:bhearsum) 2010-10-12 08:13:05 PDT
Masters have been updated with the last patch. Let's track further issues in follow-up bugs.
