Closed Bug 1293538 Opened 9 years ago Closed 9 years ago

Masters hosed and unable to reconfig due to SETA, tests not being run

Categories

(Infrastructure & Operations Graveyard :: CIDuty, task)

task
Not set
major

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: philor, Unassigned)

Too much backlog, I'm closing everything but try. Nick and Aki are looking at it, and know the details.
We've restarted the affected masters (bm02, bm03, bm04, bm54, bm114, bm116, bm119, bm132, bm134) with DISABLE_SETA=1 make start
The story here is
* the SETA server is returning 500 errors for all requests (bug 1293538)
* the test scheduler and all the test masters talk to the SETA server each time they start, or are reconfiged (bug 1176784). They retry ~5 times then give up
* test masters which failed in the initial checkconfig state were fine, no reconfig was attempted. There was no actual change to deploy so that's fine
* some masters attempted a reconfig and failed, ended up with (eg bm119)
> 2016-08-08 14:06:42-0700 [-] Failed to fetch 'http://alertmanager.allizom.org/data/setadetails/?date=2016-08-08&buildbot=1&branch=fx-team&inactive=1': HTTP Error 500: Internal Server Error
> 2016-08-08 14:06:42-0700 [-] error while parsing config file
> 2016-08-08 14:06:42-0700 [-] error during loadConfig
> 2016-08-08 14:06:42-0700 [-] Unhandled Error
> ...
>   File "/builds/buildbot/tests1-windows/master/config_seta.py", line 64, in wfetch
>     raise Exception("Could not fetch url '%s'" % url)
> exceptions.Exception: Could not fetch url 'http://alertmanager.allizom.org/data/setadetails/?date=2016-08-08&buildbot=1&branch=fx-team&inactive=1'
> 2016-08-08 14:06:42-0700 [-] The new config file is unusable, so I'll ignore it.
> 2016-08-08 14:06:42-0700 [-] I will keep using the previous config file instead.
* often this is fine, but the masters listed in comment #1 ended up with this error when starting some/all (??) builds
> 2016-08-08 14:07:04-0700 [Broker,4145,10.26.41.39] Build.setupBuild failed
> 2016-08-08 14:07:04-0700 [Broker,4145,10.26.41.39] Unhandled Error
> Traceback (most recent call last):
>   File "/builds/buildbot/tests1-windows/lib/python2.7/site-packages/twisted/internet/defer.py", line 318, in callback
>     self._startRunCallbacks(result)
>   File "/builds/buildbot/tests1-windows/lib/python2.7/site-packages/twisted/internet/defer.py", line 424, in _startRunCallbacks
>     self._runCallbacks()
>   File "/builds/buildbot/tests1-windows/lib/python2.7/site-packages/twisted/internet/defer.py", line 441, in _runCallbacks
>     self.result = callback(self.result, *args, **kw)
>   File "/builds/buildbot/tests1-windows/lib/python2.7/site-packages/buildbot-0.8.2_hg_8b87b4974e3c_production_0.8-py2.7.egg/buildbot/process/builder.py", line 904, in _startBuild_2
>     d = build.startBuild(bs, self.expectations, sb)
> --- <exception caught here> ---
>   File "/builds/buildbot/tests1-windows/lib/python2.7/site-packages/buildbot-0.8.2_hg_8b87b4974e3c_production_0.8-py2.7.egg/buildbot/process/base.py", line 217, in startBuild
>     self.setupBuild(expectations) # create .steps
>   File "/builds/buildbot/tests1-windows/lib/python2.7/site-packages/buildbot-0.8.2_hg_8b87b4974e3c_production_0.8-py2.7.egg/buildbot/process/base.py", line 272, in setupBuild
>     step = factory(**args)
>   File "/builds/buildbot/tests1-windows/lib/python2.7/site-packages/buildbotcustom/steps/mock.py", line 47, in __init__
>     self.super_class.__init__(self, **kwargs)
> exceptions.TypeError: unbound method __init__() must be called with C instance as first argument (got MockCommand instance instead)
* the job disappears into the ether at this point, so it looks like a lot of tests are missing on treeherder
* the fix is to restart the master, but you can't do that if SETA is down, hence using DISABLE_SETA=1 make start
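For illustration only, the retry-then-give-up behaviour and the DISABLE_SETA escape hatch described above might look roughly like the sketch below. It is not the code in config_seta.py: fetch_with_retries, load_skip_config, the injected fetch callable and the way the DISABLE_SETA variable is read are all assumptions made for this example, and it is written for Python 3 rather than the Python 2.7 the masters run.

```python
import os
import time


def fetch_with_retries(fetch, url, attempts=5, delay=0.0):
    """Call fetch(url) up to `attempts` times; raise if every attempt fails.

    `fetch` is any callable taking a URL (hypothetical; injected for testing).
    """
    last_error = None
    for _ in range(attempts):
        try:
            return fetch(url)
        except Exception as exc:
            last_error = exc
            time.sleep(delay)
    # Mirrors the message seen in the log; the config load fails at this point.
    raise Exception("Could not fetch url '%s'" % url) from last_error


def load_skip_config(fetch, url, environ=None):
    """Return SETA data, or an empty dict when DISABLE_SETA is set.

    Assumption: any non-empty DISABLE_SETA value bypasses the server entirely.
    """
    environ = os.environ if environ is None else environ
    if environ.get("DISABLE_SETA"):
        return {}
    return fetch_with_retries(fetch, url)
```

With the server returning 500s, load_skip_config raises after five attempts unless DISABLE_SETA is set, in which case the server is never contacted and the master can start.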
Current state
* test scheduler seems fine (it doesn't actually need to start any jobs)
* some test masters are started with DISABLE_SETA=1 (bm02, bm03, bm04, bm54, bm114, bm116, bm119, bm132, bm134), the rest without. This is not shown in the output of 'ps uxf'
* SETA server is still down
--> If we try to deploy buildbot code changes we'll get out of sync, as only the DISABLE_SETA masters can reconfig.

Followups
* SETA only changes the scheduling of jobs, as far as aki and I can tell, not the jobs that run on the test masters [1][2]
  --> don't call loadSkipConfig unless it's the test scheduler master, bug 1293554
  * this has the nice effect of not slamming the seta server when a reconfig comes along
* cache the response from SETA, ideally in a repo for trace-ability, bug 1176784

[1] https://hg.mozilla.org/build/buildbot-configs/file/production/mozilla-tests/config_seta.py#l114
[2] https://dxr.mozilla.org/build-central/search?q=skipconfig+path%3Abuildbotcustom&redirect=false
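A minimal sketch of the two followups, under stated assumptions: only the scheduler master consults SETA, and a failed fetch falls back to the last good response. The names fetch_or_cached and maybe_load_skip_config, the is_scheduler_master flag and the JSON cache file are hypothetical. The followup suggests keeping the cache in a repo, while this example uses a local file purely to keep it short.

```python
import json
import os


def fetch_or_cached(fetch, url, cache_path):
    """Return fresh data from fetch(url), falling back to the last cached copy.

    If the fetch fails and no cache exists yet, the original error propagates.
    """
    try:
        data = fetch(url)
    except Exception:
        if not os.path.exists(cache_path):
            raise
        with open(cache_path) as handle:
            return json.load(handle)
    # Fetch succeeded: refresh the cache for the next outage.
    with open(cache_path, "w") as handle:
        json.dump(data, handle)
    return data


def maybe_load_skip_config(is_scheduler_master, fetch, url, cache_path):
    """Only the scheduler master asks SETA; test masters skip the call."""
    if not is_scheduler_master:
        return {}
    return fetch_or_cached(fetch, url, cache_path)
```

Under this arrangement a SETA outage would no longer block a reconfig on any master that had a cached response, and test masters would never contact the server at all.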
Severity: blocker → major
See Also: → 1293554
Reopened once we got some results.
Let's close this out. SETA is back, and we have followups on file.
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Product: Release Engineering → Infrastructure & Operations
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard