Closed
Bug 1293538
Opened 9 years ago
Closed 9 years ago
Masters hosed and unable to reconfig due to SETA, tests not being run
Categories
(Infrastructure & Operations Graveyard :: CIDuty, task)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: philor, Unassigned)
References
Details
Too much backlog, so I'm closing everything but try.
Nick and Aki are looking at it, and know the details.
Comment 1•9 years ago
We've restarted the affected masters (bm02, bm03, bm04, bm54, bm114, bm116, bm119, bm132, bm134) with
DISABLE_SETA=1 make start
Comment 2•9 years ago
The story here is
* the SETA server is returning 500 errors for all requests (bug 1293538)
* the test scheduler and all the test masters talk to the SETA server each time they start, or are reconfiged (bug 1176784). They retry ~5 times then give up
* test masters that failed at the initial checkconfig stage were fine: no reconfig was attempted, and there was no actual change to deploy anyway
* some masters attempted a reconfig and failed, ending up with (e.g. bm119):
> 2016-08-08 14:06:42-0700 [-] Failed to fetch 'http://alertmanager.allizom.org/data/setadetails/?date=2016-08-08&buildbot=1&branch=fx-team&inactive=1': HTTP Error 500: Internal Server Error
> 2016-08-08 14:06:42-0700 [-] error while parsing config file
> 2016-08-08 14:06:42-0700 [-] error during loadConfig
> 2016-08-08 14:06:42-0700 [-] Unhandled Error
> ...
> File "/builds/buildbot/tests1-windows/master/config_seta.py", line 64, in wfetch
> raise Exception("Could not fetch url '%s'" % url)
> exceptions.Exception: Could not fetch url 'http://alertmanager.allizom.org/data/setadetails/?date=2016-08-08&buildbot=1&branch=fx-team&inactive=1'
> 2016-08-08 14:06:42-0700 [-] The new config file is unusable, so I'll ignore it.
> 2016-08-08 14:06:42-0700 [-] I will keep using the previous config file instead.
* falling back to the previous config is often fine, but the masters listed in comment #1 then hit this error when starting some/all (??) builds:
> 2016-08-08 14:07:04-0700 [Broker,4145,10.26.41.39] Build.setupBuild failed
> 2016-08-08 14:07:04-0700 [Broker,4145,10.26.41.39] Unhandled Error
> Traceback (most recent call last):
> File "/builds/buildbot/tests1-windows/lib/python2.7/site-packages/twisted/internet/defer.py", line 318, in callback
> self._startRunCallbacks(result)
> File "/builds/buildbot/tests1-windows/lib/python2.7/site-packages/twisted/internet/defer.py", line 424, in _startRunCallbacks
> self._runCallbacks()
> File "/builds/buildbot/tests1-windows/lib/python2.7/site-packages/twisted/internet/defer.py", line 441, in _runCallbacks
> self.result = callback(self.result, *args, **kw)
> File "/builds/buildbot/tests1-windows/lib/python2.7/site-packages/buildbot-0.8.2_hg_8b87b4974e3c_production_0.8-py2.7.egg/buildbot/process/builder.py", line 904, in _startBuild_2
> d = build.startBuild(bs, self.expectations, sb)
> --- <exception caught here> ---
> File "/builds/buildbot/tests1-windows/lib/python2.7/site-packages/buildbot-0.8.2_hg_8b87b4974e3c_production_0.8-py2.7.egg/buildbot/process/base.py", line 217, in startBuild
> self.setupBuild(expectations) # create .steps
> File "/builds/buildbot/tests1-windows/lib/python2.7/site-packages/buildbot-0.8.2_hg_8b87b4974e3c_production_0.8-py2.7.egg/buildbot/process/base.py", line 272, in setupBuild
> step = factory(**args)
> File "/builds/buildbot/tests1-windows/lib/python2.7/site-packages/buildbotcustom/steps/mock.py", line 47, in __init__
> self.super_class.__init__(self, **kwargs)
> exceptions.TypeError: unbound method __init__() must be called with C instance as first argument (got MockCommand instance instead)
* the job disappears into the ether at this point, so it looks like a lot of tests are missing on Treeherder
* the fix is to restart the master, but you can't do that if SETA is down, hence using DISABLE_SETA=1 make start
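A minimal Python sketch of the behaviour described above: retry the SETA fetch about five times, then give up with the same kind of exception seen in the log, unless DISABLE_SETA=1 is set. This is not the real config_seta.py. The function names, the way DISABLE_SETA is read from the environment, and the delay between attempts are all assumptions made for illustration.

```python
import os
import time
import urllib.request


def fetch_with_retries(url, attempts=5, delay=1.0, opener=urllib.request.urlopen):
    """Fetch `url`, retrying on failure; raise once every attempt has failed.

    Hypothetical helper: the real wfetch in config_seta.py may differ.
    """
    for attempt in range(1, attempts + 1):
        try:
            with opener(url) as response:
                return response.read()
        except OSError as exc:  # urllib's URLError and HTTPError subclass OSError
            print("Failed to fetch '%s' (attempt %d/%d): %s"
                  % (url, attempt, attempts, exc))
            if attempt < attempts:
                time.sleep(delay)
    raise RuntimeError("Could not fetch url '%s'" % url)


def load_seta_data(url, environ=os.environ, **kwargs):
    """Return the raw SETA response, or None when the operator disabled SETA.

    Assumption: DISABLE_SETA=1 reaches the master process as an
    environment variable and short-circuits the fetch entirely.
    """
    if environ.get("DISABLE_SETA") == "1":
        return None  # never contact the server, so start/reconfig cannot fail here
    return fetch_with_retries(url, **kwargs)
```

Because the exception is raised while the config is being loaded, a master that cannot reach SETA can neither start nor reconfig, which is why the restart needed the override.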
Comment 3•9 years ago
Current state
* test scheduler seems fine (it doesn't actually need to start any jobs)
* some test masters are started with DISABLE_SETA=1 (bm02, bm03, bm04, bm54, bm114, bm116, bm119, bm132, bm134), the rest without. This is not shown in the output of 'ps uxf'
* SETA server is still down
--> If we try to deploy buildbot code changes we'll get out of sync, as only the DISABLE_SETA masters can reconfig.
Followups
* SETA only changes the scheduling of jobs, as far as Aki and I can tell, not the jobs that run on the test masters [1][2]
--> don't call loadSkipConfig unless it's the test scheduler master, bug 1293554
* this has the nice effect of not slamming the SETA server when a reconfig comes along
* cache the response from SETA, ideally in a repo for traceability, bug 1176784
[1] https://hg.mozilla.org/build/buildbot-configs/file/production/mozilla-tests/config_seta.py#l114
[2] https://dxr.mozilla.org/build-central/search?q=skipconfig+path%3Abuildbotcustom&redirect=false
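The two followups could look roughly like the Python sketch below: only the scheduler master asks SETA for data, and the last good response is kept on disk so a SETA outage falls back to it instead of failing the config load. Every name here (load_with_cache, skip_config_for, the cache path, the JSON format) is hypothetical; nothing in it is taken from buildbot-configs, and the bug suggests a repo rather than a local file for traceability.

```python
import json
import os
import tempfile


def load_with_cache(fetch, cache_path):
    """Return fresh data from fetch(), or the last cached copy if fetch() fails."""
    try:
        data = fetch()
    except Exception:
        # Server is down: reuse the last good response if one exists.
        if os.path.exists(cache_path):
            with open(cache_path) as fh:
                return json.load(fh)
        raise  # no cache yet, so surface the original error
    # Write to a temp file first, then rename, so readers never see a partial file.
    directory = os.path.dirname(os.path.abspath(cache_path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as fh:
        json.dump(data, fh, sort_keys=True, indent=1)
    os.replace(tmp_path, cache_path)
    return data


def skip_config_for(is_scheduler_master, fetch, cache_path):
    """Only the scheduler master needs SETA data; test masters skip the call."""
    if not is_scheduler_master:
        return None
    return load_with_cache(fetch, cache_path)
```

With this shape, a reconfig across all test masters makes no SETA requests at all, and the scheduler master keeps working from stale data during an outage.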
Severity: blocker → major
See Also: → 1293554
| Reporter | ||
Comment 4•9 years ago
Reopened once we got some results.
Comment 5•9 years ago
Let's close this out. SETA is back, and we have followups on file.
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Updated•7 years ago
Product: Release Engineering → Infrastructure & Operations
Updated•5 years ago
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard