Closed Bug 505512 Opened 15 years ago Closed 14 years ago

Make infrastructure related problems turn the tree a color other than red

Categories

(Release Engineering :: General, defect, P5)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: blassey, Assigned: bhearsum)

Attachments

(4 files, 4 obsolete files)

We started this discussion at the all hands a while back.  We've had a lot of infrastructure-related problems turning the tree red lately, and I'm worried that this is numbing developers to seeing the tree red (the same way random oranges numb us to seeing orange).  More fundamentally, the two kinds of failure need to be addressed by different people.  When the build breaks, the developer needs to either fix or back out his or her patch.  When the infrastructure fails, IT or RelEng need to figure out what the issue is and fix it.  The lead time for landing a fix also differs (10 seconds to back out versus a week or more for a maintenance window).

It was suggested at the all hands that purple would be the most appropriate color since it is used somewhere else to identify infrastructure issues.
Status: NEW → RESOLVED
Closed: 15 years ago
Resolution: --- → DUPLICATE
I'm going to reopen this bug, since bug 476656 is resolved fixed and we still turn the tree red for infrastructure exceptions. It sounds like it's a dependency, not a dupe.
Status: RESOLVED → REOPENED
Resolution: DUPLICATE → ---
Depends on: 476656
Component: Tinderbox → Release Engineering
Product: Webtools → mozilla.org
QA Contact: tinderbox → release
Version: Trunk → other
Yeah, agreed. I haven't seen anyone stepping up to grab this, so I'm bumping the priority.
Priority: -- → P5
Going to try to look at this this quarter.
Assignee: nobody → bhearsum
Attached patch simple checking for hg errors (obsolete) — Splinter Review
This patch depends on the upstream Buildbot patch here (or something like it): http://github.com/bhearsum/buildbot/commit/ddd6cf1dc2436efcb0b3e70161c24fdafc4dcaf4

I haven't tested this patch, but it should catch most of the HG errors that we hit. I was hoping to avoid creating a bunch of custom BuildSteps, but after writing this patch I realized that we're going to have to copy/paste around all of the calls to regex_log_evaluator unless we do so. Regardless of which way we go, the upstream patch would be good to have.
Attachment #466443 - Flags: feedback?(catlee)
Comment on attachment 466443 [details] [diff] [review]
simple checking for hg errors

Looks sane.  hg_errors needs to be a list of tuples though I think.
Attachment #466443 - Flags: feedback?(catlee) → feedback+
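For readers following along, here is a minimal sketch of the idea discussed above: scan a step's log against a list of (regex, status) tuples and override the step's result when one matches. Everything in it is an assumption for illustration. The status constants are plain strings rather than Buildbot's own result codes, `evaluate_log` is a hypothetical helper and not the real regex_log_evaluator, and the patterns are guesses, not the ones in the attached patch.

```python
import re

# Hypothetical status values; real Buildbot defines its own result codes.
FAILURE = "failure"
RETRY = "retry"

# A list of (compiled regex, status) tuples, the shape suggested in review.
# These patterns are illustrative guesses only.
hg_errors = [
    (re.compile(r"abort: HTTP Error 5\d\d"), RETRY),
    (re.compile(r"abort: error: Connection (refused|timed out)"), RETRY),
]


def evaluate_log(log_text, default_status, patterns):
    """Return the status of the first matching pattern, else default_status."""
    for regex, status in patterns:
        if regex.search(log_text):
            return status
    return default_status


infra_log = "requesting all changes\nabort: HTTP Error 503: Service Unavailable\n"
dev_log = "foo.cpp:12: error: expected ';' before '}' token\n"

infra_status = evaluate_log(infra_log, FAILURE, hg_errors)
dev_status = evaluate_log(dev_log, FAILURE, hg_errors)
```

Keeping the patterns in one shared list is what would let several steps reuse the same checks without copying them around.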
Here's a more polished version of the previous patch. I tested this locally by changing the repo_path of a build to "mozilla-central2", to cause a 404 error. This caused the build to turn purple (http://tinderbox.mozilla.org/MozillaTest/?noignore=1, in the "OS X 10.5.2 mozilla-central build" column) and be retried.

Builds without errors had no change in behaviour.

I can test this on all platforms in staging if desired, but I think it's safe enough to just land. It depends on this upstream commit: http://github.com/buildbot/buildbot/commit/87fbf3d84711a1d3471ddb36fa16e5eae8bc9464.
Attachment #466443 - Attachment is obsolete: true
Attachment #468727 - Flags: review?(catlee)
Comment on attachment 468727 [details] [diff] [review]
turn builds with hg errors purple (and retry them)

Everything looks ok except for this:

>-class MozillaTryServerHgClone(Mercurial):
>+class MozillaTryServerHgClone(EvaluatingMercurial):
>     haltOnFailure = True
>     flunkOnFailure = True
>     
>     def __init__(self, baseURL="http://hg.mozilla.org/", mode='clobber',
>                  defaultBranch='mozilla-central', timeout=3600, **kwargs):
>         # repourl overridden in startVC
>         Mercurial.__init__(self, baseURL=baseURL, mode=mode,
>                            defaultBranch=defaultBranch, timeout=timeout,

You need to update the call to Mercurial.__init__ here I think.
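A small sketch of why the stale upcall matters, using stand-in classes rather than the real Buildbot ones. It assumes, purely for illustration, that EvaluatingMercurial sets up some state in its own `__init__`; the thread does not say whether it does, and the `log_checks_enabled` attribute is hypothetical.

```python
class Mercurial:
    """Stand-in for the Buildbot Mercurial step (not the real class)."""

    def __init__(self, **kwargs):
        self.kwargs = kwargs


class EvaluatingMercurial(Mercurial):
    """Stand-in subclass that adds its own setup (hypothetical attribute)."""

    def __init__(self, **kwargs):
        Mercurial.__init__(self, **kwargs)
        self.log_checks_enabled = True


class StaleUpcall(EvaluatingMercurial):
    def __init__(self, **kwargs):
        # Still names the old base class, so EvaluatingMercurial.__init__
        # never runs.
        Mercurial.__init__(self, **kwargs)


class FixedUpcall(EvaluatingMercurial):
    def __init__(self, **kwargs):
        # Upcall to the direct parent so its setup is not skipped.
        EvaluatingMercurial.__init__(self, **kwargs)


stale = StaleUpcall(mode="clobber")
fixed = FixedUpcall(mode="clobber")
```

In this sketch `stale` has no `log_checks_enabled` attribute at all, while `fixed` does.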
Attachment #468727 - Flags: review?(catlee) → review+
We're upgrading the masters to the new Buildbot today, this patch is going to land along with that.
Blocks: 590208
Depends on: 590383
Comment on attachment 468727 [details] [diff] [review]
turn builds with hg errors purple (and retry them)

changeset:   917:0ba8a3c89102
Attachment #468727 - Flags: checked-in+
Comment on attachment 468727 [details] [diff] [review]
turn builds with hg errors purple (and retry them)

Got backed out due to errors in the upstream patch.
Attachment #468727 - Flags: checked-in+ → checked-in-
Turns out I forgot to commit the fixes to the upcall you suggested. This patch fixes that.

I'll be attaching the upstream diff that we need as well.
Attachment #468727 - Attachment is obsolete: true
Attachment #469523 - Flags: review?(catlee)
Here are all the upstream changesets we need to make this work bug-free:
http://github.com/buildbot/buildbot/commit/5764bd6edf7b639fb91bfe0e5732aefbf0bb6c5e
http://github.com/buildbot/buildbot/commit/87fbf3d84711a1d3471ddb36fa16e5eae8bc9464
http://github.com/buildbot/buildbot/commit/548d1ace6115c070b4659917536ea7e37e7aa31d
http://github.com/buildbot/buildbot/commit/9b5af09f7fa776ef2fad91c2e58dd2e6a4dde4d5

I've tested this + the buildbotcustom patch in staging. You can see the results on MozillaTest under "OS X 10.6.2 mozilla-central build". twistd.log was clear of relevant exceptions (there were some db ones and a bunch of HTTP 403s from trying to poll shadow-central).
Attachment #469565 - Flags: review?(catlee)
Blocks: 591055
No longer blocks: 590208
Attachment #469523 - Flags: review?(catlee) → review+
Comment on attachment 469565 [details] [diff] [review]
round up of upstream changesets we need

Looks good.  Can you write some tests upstream for this?
Attachment #469565 - Flags: review?(catlee) → review+
Priority: P5 → P3
Comment on attachment 469523 [details] [diff] [review]
updated buildbotcustom patch

changeset:   950:c5881ee2525a
Attachment #469523 - Flags: checked-in+
Comment on attachment 469565 [details] [diff] [review]
round up of upstream changesets we need

Landed across:
changeset:   91:da98221aa3bb
and
changeset:   92:5e4ed40eafd2
Attachment #469565 - Flags: checked-in+
Updated Buildbot on the masters with:
cd ~cltbld/buildbot
hg pull
hg up
unset PYTHONHOME
cd master
/tools/buildbot-0.8.0/bin/python setup.py install
This looks like it's going to stick. I successfully made a build retry after an hg error. (I faked it by causing a 404...which made me realize we shouldn't be retrying on 404)
Priority: P3 → P5
The L? link is almost invisible against that dark purple background. Adding the following to userContent.css made it much more distinct:

/*
 * make text legible on purple boxes of tinderbox.mozilla.org
 */
@-moz-document domain(tinderbox.mozilla.org) {
 td[bgcolor="770088"]
  { background-color: #F4F    !important
  }
}
(In reply to comment #20)
> The L? link is almost invisible against that dark purple background. Adding the
> following to userContent.css made it much more distinct:
> 
> /*
>  * make text legible on purple boxes of tinderbox.mozilla.org
>  */
> @-moz-document domain(tinderbox.mozilla.org) {
>  td[bgcolor="770088"]
>   { background-color: #F4F    !important
>   }
> }

This bug is tracking the Buildbot integration bit, but I filed bug 593341 for this.
Depends on: 595027
There's quite a bit going on here:
- Replace ShellCommand/SetProperty/Trigger/Mercurial steps with those of our own creation
- Get rid of now-superfluous EvaluatingMercurial
- Delete unused l10n.py
- Fix some classes to use super_class
- Fix a bunch of subclasses' evaluateCommand to upcall properly.

I'm testing this along with the upstream changeset noted in bug 595027 on staging build & test masters now. So far, so good.
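The thread does not show what the super_class change looks like or exactly which failure it avoids. One plausible reading, sketched here with stand-in classes, is that `super(ClassName, self)` looks up `ClassName` by name at call time, so if a reconfig reloads the module and rebinds that name, instances of the old class no longer pass the type check. Storing the parent on the class sidesteps the lookup. The class names are borrowed from the comments in this bug; their bodies are hypothetical.

```python
class ShellCommand:
    """Stand-in for a Buildbot step base class (not the real one)."""

    def __init__(self, name):
        self.name = name


class CompareBloatLogs(ShellCommand):
    def __init__(self, name):
        # The name CompareBloatLogs is resolved when this line runs.
        super(CompareBloatLogs, self).__init__(name)


OldCompareBloatLogs = CompareBloatLogs  # class object held from before the reload


class CompareBloatLogs(ShellCommand):  # simulates a reload rebinding the name
    def __init__(self, name):
        super(CompareBloatLogs, self).__init__(name)


def old_class_still_works():
    """Try to instantiate the pre-reload class after the name was rebound."""
    try:
        OldCompareBloatLogs("bloat")
        return True
    except TypeError:
        return False


class DisconnectStep(ShellCommand):
    super_class = ShellCommand  # parent stored on the class itself

    def __init__(self, name):
        self.super_class.__init__(self, name)


OldDisconnectStep = DisconnectStep


class DisconnectStep(ShellCommand):  # simulated reload again
    super_class = ShellCommand

    def __init__(self, name):
        self.super_class.__init__(self, name)


survivor = OldDisconnectStep("disconnect")
```

Here the old `CompareBloatLogs` fails to construct after the rebinding, while the old `DisconnectStep` still works because it never refers to its own name.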
Attachment #474768 - Flags: review?(catlee)
Comment on attachment 474768 [details] [diff] [review]
catch out of disk space globally, purge errors, fix hg errors

As per IRC, the patch needs to be updated with the new Mercurial stuff.
Attachment #474768 - Flags: review?(catlee)
This is the same as the last patch modulo the Mercurial stuff and fixing up DisconnectStep to use super_class -- I had a reconfig issue with it in staging.
Attachment #474768 - Attachment is obsolete: true
Attachment #475100 - Flags: review?(catlee)
Attachment #475100 - Flags: review?(catlee) → review+
Sorry to throw up yet another version of this, Chris, but I hit some reconfig issues today and wanted to make sure we avoid them in production. Specifically, I had issues with CompareBloatLogs, but I applied the super_class workaround to pretty much everything. Nothing else changed in this patch.
Attachment #475100 - Attachment is obsolete: true
Attachment #476319 - Flags: review?(catlee)
Attachment #476319 - Flags: review?(catlee) → review+
This patch is ready to go, I've run it in staging for a long time without issue. Will land in the next RelEng downtime.
Flags: needs-treeclosure+
Comment on attachment 476319 [details] [diff] [review]
fix a bunch more potential reconfig issues

Landed in 1fd614e8c662 and 17a88ee7a7aa.

I'm going to consider this bug fixed now; we don't catch all infrastructure errors yet but the framework is there to easily add more. We'll do those in follow-up bugs.
Attachment #476319 - Flags: checked-in+
Status: REOPENED → RESOLVED
Closed: 15 years ago → 14 years ago
Resolution: --- → FIXED
Depends on: 601694
Attachment #476319 [details] [diff] tries to execute self.finished(EXCEPTION), but EXCEPTION is not imported:

2010-10-10 15:32:14-0700 [HTTPPageGetter,client] Unhandled Error
	Traceback (most recent call last):
	  File "/builds/buildbot/builder-master/sandbox/lib/python2.5/site-packages/Twisted-10.1.0-py2.5-linux-i686.egg/twisted/internet/defer.py", line 441, in _runCallbacks
	    self.result = callback(self.result, *args, **kw)
	  File "/builds/buildbot/builder-master/sandbox/lib/python2.5/site-packages/Twisted-10.1.0-py2.5-linux-i686.egg/twisted/internet/defer.py", line 664, in _cbDeferred
	    self.callback(self.resultList)
	  File "/builds/buildbot/builder-master/sandbox/lib/python2.5/site-packages/Twisted-10.1.0-py2.5-linux-i686.egg/twisted/internet/defer.py", line 318, in callback
	    self._startRunCallbacks(result)
	  File "/builds/buildbot/builder-master/sandbox/lib/python2.5/site-packages/Twisted-10.1.0-py2.5-linux-i686.egg/twisted/internet/defer.py", line 424, in _startRunCallbacks
	    self._runCallbacks()
	--- <exception caught here> ---
	  File "/builds/buildbot/builder-master/sandbox/lib/python2.5/site-packages/Twisted-10.1.0-py2.5-linux-i686.egg/twisted/internet/defer.py", line 441, in _runCallbacks
	    self.result = callback(self.result, *args, **kw)
	  File "/builds/buildbot/buildbotcustom/buildbotcustom/steps/test.py", line 489, in postFinished
	    self.finished(EXCEPTION)
	exceptions.NameError: global name 'EXCEPTION' is not defined
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
This patch fixes the missing import. I couldn't find any other files that were also missing imports.
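Because a missing import like this only surfaces when the affected line actually runs, a static scan can help when checking other files. Below is a rough sketch, not the check that was used here: `names_used_but_not_bound` is a hypothetical helper that parses source with the standard `ast` module and reports which of a given set of names are read without being imported or assigned anywhere in the file. The import path in the sample string is an assumption about where Buildbot keeps its result constants.

```python
import ast


def names_used_but_not_bound(source, candidates):
    """Return the candidate names that source reads but never imports or assigns."""
    tree = ast.parse(source)
    bound, used = set(), set()
    for node in ast.walk(tree):
        if isinstance(node, (ast.Import, ast.ImportFrom)):
            for alias in node.names:
                # "import a.b" binds "a"; "import a as b" binds "b".
                bound.add((alias.asname or alias.name).split(".")[0])
        elif isinstance(node, ast.Name):
            if isinstance(node.ctx, ast.Store):
                bound.add(node.id)
            elif isinstance(node.ctx, ast.Load):
                used.add(node.id)
    return sorted((used - bound) & set(candidates))


result_names = ["SUCCESS", "WARNINGS", "FAILURE", "EXCEPTION", "RETRY"]

broken = "def postFinished(self):\n    self.finished(EXCEPTION)\n"
# Assumed import path; the source is only parsed, never executed.
fixed = "from buildbot.status.builder import EXCEPTION\n" + broken

missing_before = names_used_but_not_bound(broken, result_names)
missing_after = names_used_but_not_bound(fixed, result_names)
```

The scan is deliberately crude: it ignores scoping and star imports, so it can miss cases or flag names that are bound some other way.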
Attachment #482531 - Flags: review?(rail)
Attachment #482531 - Flags: review?(rail) → review+
Flags: needs-treeclosure+ → needs-reconfig?
Flags: needs-reconfig? → needs-reconfig+
Masters have been updated with the last patch. Let's track further issues in follow-up bugs.
Status: REOPENED → RESOLVED
Closed: 14 years ago → 14 years ago
Resolution: --- → FIXED
Flags: needs-reconfig+
Product: mozilla.org → Release Engineering