Closed
Bug 925285
Opened 11 years ago
Closed 11 years ago
Set buildbot RETRY for failures of form "remoteFailed: [Failure instance: Traceback (failure with no frames): : Connection to the other side was lost in a non-clean fashion."
Categories
(Release Engineering :: General, defect)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: cbook, Unassigned)
References
Details
(Keywords: sheriffing-P1)
Attachments
(2 files)
1.14 KB,
patch
|
dustin
:
review+
catlee
:
review+
|
Details | Diff | Splinter Review |
701 bytes,
patch
|
dustin
:
review+
bhearsum
:
checked-in+
|
Details | Diff | Splinter Review |
We should restart the task like builds when running into the "remoteFailed: [Failure instance: Traceback (failure with no frames): : Connection to the other side was lost in a non-clean fashion." condition. As example this was hitting the nightlies and so caused delay or if stared to the existing bugs and forgetting to retrigger no nightly is generated etc. from irc: 02:17 < edmorley> Tomcat|sheriffduty: that's a buildbot change (TBPL only displays the results of the jobs, it doesn't schedule them) 02:17 < edmorley> Tomcat|sheriffduty: you'd need to add the string to https://hg.mozilla.org/build/buildbotcustom/file/717dccdcb472/status/errors.py
Comment 1•11 years ago
|
||
global_errors already contains "Connection to the other side was lost in a non-clean fashion", we just need to add log_eval_func to the corresponding buildbot steps.
Updated•11 years ago
|
Summary: Restart/Reboot Build when running into "remoteFailed: [Failure instance: Traceback (failure with no frames): : Connection to the other side was lost in a non-clean fashion." → Set buildbot RETRY for failures of form "remoteFailed: [Failure instance: Traceback (failure with no frames): : Connection to the other side was lost in a non-clean fashion."
Updated•11 years ago
|
Component: Release Automation → General Automation
QA Contact: bhearsum → catlee
Comment 2•11 years ago
|
||
(In reply to Rail Aliiev [:rail] from comment #1) > we just need to add log_eval_func to the corresponding > buildbot steps. Is there somewhere global that it can be added? It's just these disconnects can occur at any point... :-(
Keywords: sheriffing-P1
Comment 3•11 years ago
|
||
(In reply to Rail Aliiev [:rail] from comment #1) > global_errors already contains "Connection to the other side was lost in a > non-clean fashion", we just need to add log_eval_func to the corresponding > buildbot steps. There must be something else we have to do here. We have failures that happen during the compile step, which is a MockCommand, which is a ShellCommand (from steps/base.py) which has a log_eval_func watching global_errors. I'm guessing that the stdout that the log_eval_func receives doesn't have that message in it, but it's hard to confirm that. The log that bm66 has on disk looks like this: [cltbld@buildbot-master66.srv.releng.usw2.mozilla.com mozilla-aurora-win32-nightly]$ pwd /builds/buildbot/build1/master/mozilla-aurora-win32-nightly [cltbld@buildbot-master66.srv.releng.usw2.mozilla.com mozilla-aurora-win32-nightly]$ zcat 7-log-compile-stdio.gz | tail -n5 nsXMLPrettyPrinter.cpp nsXMLFragmentContentSink.cpp ,183:2 remoteFailed: [Failure instance: Traceback (failure with no frames): <class 'twisted.internet.error.ConnectionLost'>: Connection to the other side was lost in a non-clean fashion. ], That ",183:2" stuff appears to be some sort of control/escape characters - because it doesn't show up when looking at the log on the Buildbot WebStatus.
Comment 4•11 years ago
|
||
Indeed, looking at the raw log on Buildbot (eg http://buildbot-master66.srv.releng.usw2.mozilla.com:8001/builders/WINNT%205.2%20mozilla-aurora%20nightly/builds/7/steps/compile/logs/stdio/text) doesn't show the disconnect line.
Comment 5•11 years ago
|
||
I think these messages are in the "HEADER" channel. We might be able to get them into the log that log_eval_func sees by adjusting this method to include that channel: https://hg.mozilla.org/build/buildbot/file/d47b00ad80e1/master/buildbot/status/builder.py#l336 I thought these messages might be in the "HEADER" channel so I tried adjusting: https://hg.mozilla.org/build/buildbot/file/d47b00ad80e1/master/buildbot/status/builder.py#l336 to return the HEADER channel too. That didn't work, and I also noticed that these messages come from Twistd (https://mxr.mozilla.org/build-central/search?string=Connection%2Bto%2Bthe%2Bother%2Bside), so maybe they're deeper down than I thought. Dustin, any chance you know of an easy way to make these errors appear in the logs that log_eval_func gets, or another way of making builds that fail due to this set RETRY?
Flags: needinfo?(dustin)
Comment 6•11 years ago
|
||
The "control characters" you're seeing are netstring metadata and a stream prefix (2=HEADER, as you've surmised). I'm not sure why you wouldn't see the header output with the change you suggest to builder.py. It's possible that the message hasn't been added at that point? I assume that you want to parse this from logs because that's how a lot of other, similar things are handled. Barring that, I think it might make more sense to override some step methods to detect the exception itself and set the result accordingly.
Flags: needinfo?(dustin)
Comment 7•11 years ago
|
||
(In reply to Dustin J. Mitchell [:dustin] from comment #6) > The "control characters" you're seeing are netstring metadata and a stream > prefix (2=HEADER, as you've surmised). > > I'm not sure why you wouldn't see the header output with the change you > suggest to builder.py. It's possible that the message hasn't been added at > that point? I assume that you want to parse this from logs because that's > how a lot of other, similar things are handled. Barring that, I think it > might make more sense to override some step methods to detect the exception > itself and set the result accordingly. Yeah, log eval is simply the easiest way to go about this because it's a well trodden path. I think it might be easy to do what you suggest though...I'm going to take a stab at it.
Comment 8•11 years ago
|
||
Huh, just reading a bit more code it looks like this should be handled already, at least in some cases: https://hg.mozilla.org/build/buildbot/file/ecfab480076f/master/buildbot/process/buildstep.py#l1121 But other places don't seem to RETRY: https://hg.mozilla.org/build/buildbot/file/ecfab480076f/master/buildbot/process/buildstep.py#l127 https://hg.mozilla.org/build/buildbot/file/ecfab480076f/master/buildbot/process/buildstep.py#l842 https://hg.mozilla.org/build/buildbot/file/ecfab480076f/master/buildbot/process/buildstep.py#l79 The last one in particular seems likely to be the path we're hitting - it's the errback that's called if the remoteCommand fails.
Comment 9•11 years ago
|
||
Master output from a recent failure: 36232 2013-10-22 03:41:51-0700 [Broker,1691,10.132.58.42] BuildSlave.detached(tst-linux64-ec2-385) 36233 2013-10-22 03:41:52-0700 [Broker,1691,10.132.58.42] <Build b2g_emulator_vm mozilla-aurora opt test crashtest-3>.lostRemote 36234 2013-10-22 03:41:52-0700 [Broker,1691,10.132.58.42] stopping currentStep <buildbotcustom.steps.mock.MockCommand instance at 0x16c4c950> 36235 2013-10-22 03:41:52-0700 [Broker,1691,10.132.58.42] addCompleteLog(interrupt) 36236 2013-10-22 03:41:52-0700 [Broker,1691,10.132.58.42] RemoteCommand.interrupt <RemoteShellCommand '['/tools/buildbot/bin/python', 'scripts/scripts/b2g_emulator_unitte st.py', '--cfg', 'b2g/emulator_automation_config.py', '--test-suite', 'crashtest', '--this-chunk', '3', '--total-chunks', '3', '--no-update', '--download-symbols', 'ondemand']'> [Failure instance: Traceback (failure with no frames): <class 'twisted.internet.error.ConnectionLost'>: Connection to the other side was lost in a non -clean fashion. 36237 ] 36238 2013-10-22 03:41:52-0700 [Broker,1691,10.132.58.42] RemoteCommand.disconnect: lost slave 36239 2013-10-22 03:41:52-0700 [Broker,1691,10.132.58.42] releaseLocks(<buildbotcustom.steps.mock.MockCommand instance at 0x16c4c950>): [] 36240 2013-10-22 03:41:52-0700 [Broker,1691,10.132.58.42] step 'run_script' complete: exception 36241 2013-10-22 03:41:52-0700 [Broker,1691,10.132.58.42] <Build b2g_emulator_vm mozilla-aurora opt test crashtest-3>: build finished Looking through some Buildbot code I see that the ".lostRemote" and "stopping currentStep lines come from https://mxr.mozilla.org/build-central/source/buildbot/master/buildbot/process/base.py#422. They end up calling LoggingBuildStep.interrupt, which is https://hg.mozilla.org/build/buildbot/file/ecfab480076f/master/buildbot/process/buildstep.py#l1107, which calls BuildStep.interrupt - https://hg.mozilla.org/build/buildbot/file/ecfab480076f/master/buildbot/process/buildstep.py#l842. None of these call checkDisconnect, which is what handles RETRYing for interrupted builds. The fact that our step_status in "interrupted" supports the theory that checkDisconnect isn't called. It should be "exception slave lost" if it was (https://hg.mozilla.org/build/buildbot/file/ecfab480076f/master/buildbot/process/buildstep.py#l1123). Interestingly, the "interrupted" status seems get set from self.finished, which is a callback of the runCommand deferred - not an errback. The other errback from that deferred calls https://hg.mozilla.org/build/buildbot/file/ecfab480076f/master/buildbot/process/buildstep.py#l879, which we also don't see any log messages from. So, it appears to me that because ConnectionLost errors don't actually bubble up to the runCommand deferred. I'm having trouble following RemoteCommand's code because of all the deferred indirection, but I _think_ it eats the ConnectionLost exception in or around _finished: https://hg.mozilla.org/build/buildbot/file/ecfab480076f/master/buildbot/process/buildstep.py#l191. I certainly don't see anything that's re-raising... Dustin, this comment may be a little rambly, but do you have any more insight to offer here? It looks like this might affect the latest upstream too, as this code doesn't seem to have changed much.
Flags: needinfo?(dustin)
Comment 10•11 years ago
|
||
In general, what your suggesting sounds plausible - PB exceptions burp up at random spots, and aren't handled at all consistently. And yes, even when an exception occurs, if the step was interrupted, it is treated as a success as far as the Deferred is concerned (meaning that .finished gets called), with the result forced to INTERRUPTED. The new slave protocol will handle this better, since it isolates all of the PB code in one place where we can be consistent about handling of exceptions like this. And the async-steps work in nine is taking care of the weird distinction between self.finished and self.failed. Anyway, it sounds like the hope of just catching this error is fading :(
Flags: needinfo?(dustin)
Comment 11•11 years ago
|
||
This appears to work in staging, modulo slightly weird text in summaries (which I don't think are used by TBPL or other developer facing tools). This patch does a couple of things to make this work: 1) Actually fire the RemoteCommand.deferred errback when there's an error. Currently, the code fires the callback, which is why we never get into LoggingBuildStep.checkDisconnect. 2) Don't override the RETRY status with EXCEPTION in LoggingBuildStep.finished. Have a look at this for examples: http://dev-master01.build.scl1.mozilla.com:8018/buildslaves/bld-centos6-hp-004 I rebooted the slave during the "Linux mozilla-central leak test build", "Linux x86-64 mozilla-central build", and "Linux mozilla-central build" jobs. For the first two, this patch was applied. For the latter it was not.
Attachment #821012 -
Flags: review?(dustin)
Attachment #821012 -
Flags: review?(catlee)
Comment 12•11 years ago
|
||
In case it helps review, here's the whiteboarding I did when working out this fix: https://people.mozilla.org/~bhearsum/retry-disconnect.jpg The "Da" deferred runs throughout LoggingBuildStep and RemoteCommand, and contains the errback with checkDisconnect The "Db" deferred is internal to RemoteCommand and is the one whose errback never got called (which ended up eating the exception).
Comment 13•11 years ago
|
||
Comment on attachment 821012 [details] [diff] [review] retry on slave loss So one key point to note here is that `Deferred.callback` is equivalent to `Deferred.errback` if given a failure. So, assuming `_finished` is called with `failure` set to an actual `Failure` instance: 198 d = defer.maybeDeferred(self.remoteComplete, failure) calls remoteComplete, which simply returns failure. maybeDeferred converts this to a Deferred, fired with the failure. 201 d.addCallback(lambda r: self) is ignored since the callback is in the failure path 204 d.addBoth(self.deferred.callback) invokes `self.deferred` with the failure from `d`, which is `failure`. All of which is to say, the effect of the first hunk in the patch is almost nil. The only difference comes from + d.addErrback(lambda r: failure) which will ignore any failure returned from `remoteComplete` and substitute the failure given to `_finished`, if any. If `remoteComplete` returned a failure when none had been given to `_finished`, this would ignore the failure. None of that is really an issue here, since `remoteComplete` just returns its argument. I think I'm seeing the big picture now: the step's interrupt method gets called when the buildslave detects a detach, which is based on a `notifyOnDisconnect` call to the PB broker. That races with the exception bubbling up from the `callRemote` method used within the `RemoteCommand`. So, the exception *is* caught and bubbled from the `RemoteCommand` to the step. `checkDisconnect` handles it, and calls `finished(RETRY)`. However, by that time the `interrupt` process has set `step.stopped`, so `finished` ignores the `RETRY` and sets the status to `EXCEPTION`. So I think that the second hunk in your patch is a good fix in the moment (and upstreamable to nine!) although in an ideal world we'd get rid of the race.
Attachment #821012 -
Flags: review?(dustin) → review+
Comment 14•11 years ago
|
||
I included CANCELLED in this one because it seems like we shouldn't override it either. I sent the patch upstream too, but CANCELLED is missing from there, curiously: https://github.com/buildbot/buildbot/pull/937 https://github.com/buildbot/buildbot/pull/555#issuecomment-26933871
Updated•11 years ago
|
Attachment #821191 -
Flags: review?(dustin)
Comment 15•11 years ago
|
||
CANCELLED was only added in nine.
Comment 16•11 years ago
|
||
We've backported CANCELLED: http://hg.mozilla.org/build/buildbot/rev/797a4d4b2c01
Updated•11 years ago
|
Attachment #821012 -
Flags: review?(catlee) → review+
Comment 17•11 years ago
|
||
Comment on attachment 821191 [details] [diff] [review] don't override retry/cancelled status To the extent that I merged this upstream, LGTM.
Attachment #821191 -
Flags: review?(dustin) → review+
Comment 18•11 years ago
|
||
Comment on attachment 821191 [details] [diff] [review] don't override retry/cancelled status Starting graceful restarts for this shortly.
Attachment #821191 -
Flags: checked-in+
Comment 19•11 years ago
|
||
It took awhile, but the restarts for this are finally done. Please let me know if you see any future instances of this.
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Comment 20•11 years ago
|
||
Thank you :-)
Comment 21•11 years ago
|
||
Thanks Ben!
Assignee | ||
Updated•6 years ago
|
Component: General Automation → General
You need to log in
before you can comment on or make changes to this bug.
Description
•