Make infrastructure related problems turn the tree a color other than red

RESOLVED FIXED

Status

Release Engineering
General
P5
normal
RESOLVED FIXED
8 years ago
4 years ago

People

(Reporter: blassey, Assigned: bhearsum)

Tracking

Firefox Tracking Flags

(Not tracked)

Details

Attachments

(4 attachments, 4 obsolete attachments)

We started this discussion at the all-hands a while back. It seems that we've had a lot of infrastructure-related problems turning the tree red lately, and I'm worried that this is numbing developers to seeing the tree red (the same way random oranges numb us to seeing orange). Also, fundamentally, the issues need to be addressed by different people. When the build breaks, the developer needs to either fix or back out his or her patch. When the infrastructure fails, IT or RelEng need to figure out what the issue is and fix it. There is also a different lead time in getting the fix landed (10 seconds to back out versus a week or more for a maintenance window).

It was suggested at the all hands that purple would be the most appropriate color since it is used somewhere else to identify infrastructure issues.

Updated

8 years ago
Status: NEW → RESOLVED
Last Resolved: 8 years ago
Resolution: --- → DUPLICATE
Duplicate of bug: 476656
I'm going to reopen this bug, since bug 476656 is resolved fixed and we still turn the tree red for infrastructure exceptions. It sounds like it's a dependency, not a dupe.
Status: RESOLVED → REOPENED
Resolution: DUPLICATE → ---
(Reporter)

Updated

7 years ago
Depends on: 476656

Updated

7 years ago
Component: Tinderbox → Release Engineering
Product: Webtools → mozilla.org
QA Contact: tinderbox → release
Version: Trunk → other
(Assignee)

Comment 3

7 years ago
Yeah, agreed. I haven't seen anyone stepping up to grab this, so I'm bumping the priority.
Priority: -- → P5
(Assignee)

Comment 4

7 years ago
Going to try to look at this during this quarter.
Assignee: nobody → bhearsum
(Assignee)

Comment 5

7 years ago
Created attachment 466443 [details] [diff] [review]
simple checking for hg errors

This patch depends on the upstream Buildbot patch here (or something like it): http://github.com/bhearsum/buildbot/commit/ddd6cf1dc2436efcb0b3e70161c24fdafc4dcaf4

I haven't tested this patch, but it should catch most of the HG errors that we hit. I was hoping to avoid creating a bunch of custom BuildSteps, but after writing this patch I realized that we're going to have to copy/paste all of the calls to regex_log_evaluator around unless we do so. Regardless of which way we go, the upstream patch would be good to have.
Attachment #466443 - Flags: feedback?(catlee)
Comment on attachment 466443 [details] [diff] [review]
simple checking for hg errors

Looks sane.  hg_errors needs to be a list of tuples though I think.
Attachment #466443 - Flags: feedback?(catlee) → feedback+
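For readers following along, the "list of tuples" shape suggested above can be sketched roughly as follows. This is a minimal illustration and not the attached patch: the status constants are local stand-ins for Buildbot's result constants (their numeric values are assumed here), and evaluate_log, HG_ERRORS and the patterns themselves are hypothetical. The sketch also assumes that a higher number means a more severe status.

```python
import re

# Local stand-ins for Buildbot's result constants (values assumed for this sketch).
SUCCESS, WARNINGS, FAILURE, SKIPPED, EXCEPTION, RETRY = range(6)

# Hypothetical patterns; each entry is a (compiled regex, status) tuple.
HG_ERRORS = [
    (re.compile(r"abort: HTTP Error 5\d\d"), RETRY),
    (re.compile(r"abort: error: Connection refused"), RETRY),
]


def evaluate_log(log_text, default, patterns):
    """Return the most severe status whose pattern matches the log, else default."""
    worst = default
    for regex, status in patterns:
        if regex.search(log_text):
            # Assumes severity grows with the numeric value of the status.
            worst = max(worst, status)
    return worst
```

Keeping the patterns in one list of pairs means a new infrastructure error can be covered by adding an entry instead of another custom step.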
(Assignee)

Comment 7

7 years ago
Created attachment 468727 [details] [diff] [review]
turn builds with hg errors purple (and retry them)

Here's a more polished version of the previous patch. I tested this locally by changing the repo_path of a build to "mozilla-central2", to cause a 404 error. This caused the build to turn purple (http://tinderbox.mozilla.org/MozillaTest/?noignore=1, in the "OS X 10.5.2 mozilla-central build" column), and to be retried.

Builds without errors had no change in behaviour.

I can test this on all platforms in staging if desired, but I think it's safe enough to just land. It depends on this upstream commit: http://github.com/buildbot/buildbot/commit/87fbf3d84711a1d3471ddb36fa16e5eae8bc9464.
Attachment #466443 - Attachment is obsolete: true
Attachment #468727 - Flags: review?(catlee)
Comment on attachment 468727 [details] [diff] [review]
turn builds with hg errors purple (and retry them)

Everything looks ok except for this:

>-class MozillaTryServerHgClone(Mercurial):
>+class MozillaTryServerHgClone(EvaluatingMercurial):
>     haltOnFailure = True
>     flunkOnFailure = True
>     
>     def __init__(self, baseURL="http://hg.mozilla.org/", mode='clobber',
>                  defaultBranch='mozilla-central', timeout=3600, **kwargs):
>         # repourl overridden in startVC
>         Mercurial.__init__(self, baseURL=baseURL, mode=mode,
>                            defaultBranch=defaultBranch, timeout=timeout,

You need to update the call to Mercurial.__init__ here I think.
Attachment #468727 - Flags: review?(catlee) → review+
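The upcall concern raised in this review can be illustrated with a small stand-alone sketch. The class names below are hypothetical stand-ins, not the Buildbot or buildbotcustom classes, and the sketch assumes the new base class defines its own __init__: once a class's base changes, an __init__ that still calls the old base directly skips whatever the new base sets up.

```python
class Base(object):
    """Stand-in for the original parent class."""

    def __init__(self, timeout=3600):
        self.timeout = timeout


class EvaluatingBase(Base):
    """Stand-in for the new intermediate parent that adds its own state."""

    def __init__(self, log_eval_func=None, **kwargs):
        Base.__init__(self, **kwargs)
        self.log_eval_func = log_eval_func


class StaleUpcall(EvaluatingBase):
    def __init__(self, **kwargs):
        # Still calls the old base, so EvaluatingBase.__init__ never runs.
        Base.__init__(self, **kwargs)


class FixedUpcall(EvaluatingBase):
    def __init__(self, **kwargs):
        # Calls the direct parent, which in turn calls Base.__init__.
        EvaluatingBase.__init__(self, **kwargs)
```

In this sketch StaleUpcall() ends up without a log_eval_func attribute, while FixedUpcall() has one.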
(Assignee)

Comment 9

7 years ago
We're upgrading the masters to the new Buildbot today, this patch is going to land along with that.
Blocks: 590208
Depends on: 590383
(Assignee)

Comment 10

7 years ago
Comment on attachment 468727 [details] [diff] [review]
turn builds with hg errors purple (and retry them)

changeset:   917:0ba8a3c89102
Attachment #468727 - Flags: checked-in+
(Assignee)

Comment 11

7 years ago
Comment on attachment 468727 [details] [diff] [review]
turn builds with hg errors purple (and retry them)

Got backed out due to errors in the upstream patch.
Attachment #468727 - Flags: checked-in+ → checked-in-
(Assignee)

Comment 12

7 years ago
Created attachment 469523 [details] [diff] [review]
updated buildbotcustom patch

Turns out I forgot to commit the fixes to the upcall you suggested. This patch fixes that.

I'll be attaching the upstream diff that we need as well.
Attachment #468727 - Attachment is obsolete: true
Attachment #469523 - Flags: review?(catlee)
(Assignee)

Comment 13

7 years ago
Created attachment 469565 [details] [diff] [review]
round up of upstream changesets we need

Here are all the upstream changesets we need to make this work bug-free. These are:
http://github.com/buildbot/buildbot/commit/5764bd6edf7b639fb91bfe0e5732aefbf0bb6c5e
http://github.com/buildbot/buildbot/commit/87fbf3d84711a1d3471ddb36fa16e5eae8bc9464
http://github.com/buildbot/buildbot/commit/548d1ace6115c070b4659917536ea7e37e7aa31d
http://github.com/buildbot/buildbot/commit/9b5af09f7fa776ef2fad91c2e58dd2e6a4dde4d5

I've tested this + the buildbotcustom patch in staging. You can see the results on MozillaTest under "OS X 10.6.2 mozilla-central build". twistd.log was clear of relevant exceptions (there were some db ones and a bunch of HTTP 403s from trying to poll shadow-central).
Attachment #469565 - Flags: review?(catlee)
(Assignee)

Updated

7 years ago
Blocks: 591055
No longer blocks: 590208

Updated

7 years ago
Attachment #469523 - Flags: review?(catlee) → review+
Comment on attachment 469565 [details] [diff] [review]
round up of upstream changesets we need

Looks good.  Can you write some tests upstream for this?
Attachment #469565 - Flags: review?(catlee) → review+
(Assignee)

Updated

7 years ago
Priority: P5 → P3
(Assignee)

Comment 15

7 years ago
Comment on attachment 469523 [details] [diff] [review]
updated buildbotcustom patch

changeset:   950:c5881ee2525a
Attachment #469523 - Flags: checked-in+
(Assignee)

Comment 16

7 years ago
Comment on attachment 469565 [details] [diff] [review]
round up of upstream changesets we need

Landed across:
changeset:   91:da98221aa3bb
and
changeset:   92:5e4ed40eafd2
Attachment #469565 - Flags: checked-in+
(Assignee)

Comment 17

7 years ago
Updated Buildbot on the masters with:
cd ~cltbld/buildbot
hg pull
hg up
unset PYTHONHOME
cd master
/tools/buildbot-0.8.0/bin/python setup.py install
(Assignee)

Comment 18

7 years ago
This looks like it's going to stick. I successfully made a build retry after an hg error. (I faked it by causing a 404...which made me realize we shouldn't be retrying on 404)
Priority: P3 → P5
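One way to act on the 404 observation above is to check for "the resource is missing" errors before the generic retry rule, so they map to a non-retrying outcome. The sketch below is only an assumption about how that could look: the RULES list, the classify helper and the outcome labels are hypothetical, and the log strings are examples, not captured output.

```python
import re

# Stand-in outcome labels for this sketch (not Buildbot constants).
EXCEPTION, RETRY = "exception", "retry"

# Ordered rules: the first matching pattern decides the outcome.
RULES = [
    (re.compile(r"HTTP Error 404"), EXCEPTION),  # missing repo: a retry will not help
    (re.compile(r"HTTP Error 5\d\d"), RETRY),    # server-side trouble: worth retrying
]


def classify(log_text):
    """Return the outcome of the first rule that matches, or None if none do."""
    for regex, outcome in RULES:
        if regex.search(log_text):
            return outcome
    return None
```

Both outcomes would still be reported as infrastructure problems; only the automatic retry differs.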
Duplicate of this bug: 503580
The L? link is almost invisible against that dark purple background. Adding the following to userContent.css made it much more distinct:

/*
 * make text legible on purple boxes of tinderbox.mozilla.org
 */
@-moz-document domain(tinderbox.mozilla.org) {
 td[bgcolor="770088"]
  { background-color: #F4F    !important
  }
}
(Assignee)

Comment 21

7 years ago
(In reply to comment #20)
> The L? link is almost invisible against that dark purple background. Adding the
> following to userContent.css made it much more distinct:
> 
> /*
>  * make text legible on purple boxes of tinderbox.mozilla.org
>  */
> @-moz-document domain(tinderbox.mozilla.org) {
>  td[bgcolor="770088"]
>   { background-color: #F4F    !important
>   }
> }

This bug is tracking the Buildbot integration bit, but I filed bug 593341 for this.

Updated

7 years ago
Depends on: 595027
(Assignee)

Comment 22

7 years ago
Created attachment 474768 [details] [diff] [review]
catch out of disk space globally, purge errors, fix hg errors

There's quite a bit going on here:
- Replacing ShellCommand/SetProperty/Trigger/Mercurial steps with those of our own creation
- Get rid of now-superfluous EvaluatingMercurial
- Delete unused l10n.py
- Fix some classes to use super_class
- Fix a bunch of subclasses' evaluateCommand to upcall properly.

I'm testing this along with the upstream changeset noted in bug 595027 on staging build & test masters now. So far, so good.
Attachment #474768 - Flags: review?(catlee)
Comment on attachment 474768 [details] [diff] [review]
catch out of disk space globally, purge errors, fix hg errors

as per irc, patch needs to be updated with new Mercurial stuff
Attachment #474768 - Flags: review?(catlee)
(Assignee)

Comment 24

7 years ago
Created attachment 475100 [details] [diff] [review]
get rid of EvaluatingMercurial, fix bug in DisconnectStep

This is the same as the last patch modulo the Mercurial stuff and fixing up DisconnectStep to use super_class -- I had a reconfig issue with it in staging.
Attachment #474768 - Attachment is obsolete: true
Attachment #475100 - Flags: review?(catlee)

Updated

7 years ago
Attachment #475100 - Flags: review?(catlee) → review+
(Assignee)

Comment 25

7 years ago
Created attachment 476319 [details] [diff] [review]
fix a bunch more potential reconfig issues

Sorry to throw up yet another version of this, Chris, but I hit some reconfig issues today and wanted to make sure we avoid them in production. Specifically, I had issues with CompareBloatLogs, but I applied the super_class workaround to pretty much everything. Nothing else changed in this patch.
Attachment #475100 - Attachment is obsolete: true
Attachment #476319 - Flags: review?(catlee)

Updated

7 years ago
Attachment #476319 - Flags: review?(catlee) → review+
(Assignee)

Comment 26

7 years ago
This patch is ready to go; I've run it in staging for a long time without issue. Will land in the next RelEng downtime.
(Assignee)

Updated

7 years ago
Blocks: 593081
(Assignee)

Updated

7 years ago
Flags: needs-treeclosure+
(Assignee)

Comment 27

7 years ago
Comment on attachment 476319 [details] [diff] [review]
fix a bunch more potential reconfig issues

Landed in 1fd614e8c662 and 17a88ee7a7aa.

I'm going to consider this bug fixed now; we don't catch all infrastructure errors yet but the framework is there to easily add more. We'll do those in follow-up bugs.
Attachment #476319 - Flags: checked-in+
(Assignee)

Updated

7 years ago
Status: REOPENED → RESOLVED
Last Resolved: 8 years ago → 7 years ago
Resolution: --- → FIXED
Depends on: 601694
Attachment #476319 [details] [diff] tries to execute self.finished(EXCEPTION), but EXCEPTION is not imported:

2010-10-10 15:32:14-0700 [HTTPPageGetter,client] Unhandled Error
	Traceback (most recent call last):
	  File "/builds/buildbot/builder-master/sandbox/lib/python2.5/site-packages/Twisted-10.1.0-py2.5-linux-i686.egg/twisted/internet/defer.py", line 441, in _runCallbacks
	    self.result = callback(self.result, *args, **kw)
	  File "/builds/buildbot/builder-master/sandbox/lib/python2.5/site-packages/Twisted-10.1.0-py2.5-linux-i686.egg/twisted/internet/defer.py", line 664, in _cbDeferred
	    self.callback(self.resultList)
	  File "/builds/buildbot/builder-master/sandbox/lib/python2.5/site-packages/Twisted-10.1.0-py2.5-linux-i686.egg/twisted/internet/defer.py", line 318, in callback
	    self._startRunCallbacks(result)
	  File "/builds/buildbot/builder-master/sandbox/lib/python2.5/site-packages/Twisted-10.1.0-py2.5-linux-i686.egg/twisted/internet/defer.py", line 424, in _startRunCallbacks
	    self._runCallbacks()
	--- <exception caught here> ---
	  File "/builds/buildbot/builder-master/sandbox/lib/python2.5/site-packages/Twisted-10.1.0-py2.5-linux-i686.egg/twisted/internet/defer.py", line 441, in _runCallbacks
	    self.result = callback(self.result, *args, **kw)
	  File "/builds/buildbot/buildbotcustom/buildbotcustom/steps/test.py", line 489, in postFinished
	    self.finished(EXCEPTION)
	exceptions.NameError: global name 'EXCEPTION' is not defined
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
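The failure mode in the traceback above is easy to reproduce in isolation: Python resolves a global name only when the line that references it executes, so a missing import in a rarely taken branch passes module load and surfaces later. The functions below are hypothetical stand-ins, not the buildbotcustom code.

```python
def post_finished(failed):
    # EXCEPTION is deliberately never imported or defined in this sketch.
    if failed:
        return EXCEPTION  # raises NameError only when this branch is reached
    return "ok"


def raises_name_error(func, *args):
    """Report whether calling func(*args) raises NameError."""
    try:
        func(*args)
    except NameError:
        return True
    return False
```

Exercising the failure branch in staging or in a unit test is what exposes this kind of problem before production does.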
(Assignee)

Comment 29

7 years ago
Created attachment 482531 [details] [diff] [review]
fix missing import

This patch fixes the missing import. I couldn't find any other files that were also missing imports.
Attachment #482531 - Flags: review?(rail)
Attachment #482531 - Flags: review?(rail) → review+
(Assignee)

Updated

7 years ago
Flags: needs-treeclosure+ → needs-reconfig?

Updated

7 years ago
Flags: needs-reconfig? → needs-reconfig+

Comment 30

7 years ago
Comment on attachment 482531 [details] [diff] [review]
fix missing import

http://hg.mozilla.org/build/buildbotcustom/rev/31a22bd3816e
Attachment #482531 - Flags: checked-in+
(Assignee)

Comment 31

7 years ago
Masters have been updated with the last patch. Let's track further issues in follow-up bugs.
Status: REOPENED → RESOLVED
Last Resolved: 7 years ago → 7 years ago
Resolution: --- → FIXED

Updated

7 years ago
Flags: needs-reconfig+
Product: mozilla.org → Release Engineering