Closed Bug 925285 Opened 11 years ago Closed 11 years ago

Set buildbot RETRY for failures of form "remoteFailed: [Failure instance: Traceback (failure with no frames): : Connection to the other side was lost in a non-clean fashion."

Categories

(Release Engineering :: General, defect)

x86
All
defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: cbook, Unassigned)

References

Details

(Keywords: sheriffing-P1)

Attachments

(2 files)

We should restart the task like builds when running into the "remoteFailed: [Failure instance: Traceback (failure with no frames): : Connection to the other side was lost in a non-clean fashion." condition.

As example this was hitting the nightlies and so caused delay or if stared to the existing bugs and forgetting to retrigger no nightly is generated etc.

from irc:

02:17 < edmorley> Tomcat|sheriffduty: that's a buildbot change (TBPL only displays the results of the jobs, it doesn't schedule them)
02:17 < edmorley> Tomcat|sheriffduty: you'd need to add the string to https://hg.mozilla.org/build/buildbotcustom/file/717dccdcb472/status/errors.py
global_errors already contains "Connection to the other side was lost in a non-clean fashion", we just need to add log_eval_func to the corresponding buildbot steps.
Summary: Restart/Reboot Build when running into "remoteFailed: [Failure instance: Traceback (failure with no frames): : Connection to the other side was lost in a non-clean fashion." → Set buildbot RETRY for failures of form "remoteFailed: [Failure instance: Traceback (failure with no frames): : Connection to the other side was lost in a non-clean fashion."
Blocks: 918677
Component: Release Automation → General Automation
QA Contact: bhearsum → catlee
(In reply to Rail Aliiev [:rail] from comment #1)
> we just need to add log_eval_func to the corresponding
> buildbot steps.

Is there somewhere global that it can be added? It's just these disconnects can occur at any point... :-(
Keywords: sheriffing-P1
(In reply to Rail Aliiev [:rail] from comment #1)
> global_errors already contains "Connection to the other side was lost in a
> non-clean fashion", we just need to add log_eval_func to the corresponding
> buildbot steps.

There must be something else we have to do here. We have failures that happen during the compile step, which is a MockCommand, which is a ShellCommand (from steps/base.py) which has a log_eval_func watching global_errors. I'm guessing that the stdout that the log_eval_func receives doesn't have that message in it, but it's hard to confirm that. The log that bm66 has on disk looks like this:
[cltbld@buildbot-master66.srv.releng.usw2.mozilla.com mozilla-aurora-win32-nightly]$ pwd
/builds/buildbot/build1/master/mozilla-aurora-win32-nightly
[cltbld@buildbot-master66.srv.releng.usw2.mozilla.com mozilla-aurora-win32-nightly]$ zcat 7-log-compile-stdio.gz  | tail -n5
nsXMLPrettyPrinter.cpp
nsXMLFragmentContentSink.cpp
,183:2
remoteFailed: [Failure instance: Traceback (failure with no frames): <class 'twisted.internet.error.ConnectionLost'>: Connection to the other side was lost in a non-clean fashion.
],

That ",183:2" stuff appears to be some sort of control/escape characters - because it doesn't show up when looking at the log on the Buildbot WebStatus.
I think these messages are in the "HEADER" channel. We might be able to get them into the log that log_eval_func sees by adjusting this method to include that channel:
https://hg.mozilla.org/build/buildbot/file/d47b00ad80e1/master/buildbot/status/builder.py#l336


I thought these messages might be in the "HEADER" channel so I tried adjusting:
https://hg.mozilla.org/build/buildbot/file/d47b00ad80e1/master/buildbot/status/builder.py#l336
to return the HEADER channel too. That didn't work, and I also noticed that these messages come from Twistd (https://mxr.mozilla.org/build-central/search?string=Connection%2Bto%2Bthe%2Bother%2Bside), so maybe they're deeper down than I thought.

Dustin, any chance you know of an easy way to make these errors appear in the logs that log_eval_func gets, or another way of making builds that fail due to this set RETRY?
Flags: needinfo?(dustin)
The "control characters" you're seeing are netstring metadata and a stream prefix (2=HEADER, as you've surmised).

I'm not sure why you wouldn't see the header output with the change you suggest to builder.py.  It's possible that the message hasn't been added at that point?  I assume that you want to parse this from logs because that's how a lot of other, similar things are handled.  Barring that, I think it might make more sense to override some step methods to detect the exception itself and set the result accordingly.
Flags: needinfo?(dustin)
(In reply to Dustin J. Mitchell [:dustin] from comment #6)
> The "control characters" you're seeing are netstring metadata and a stream
> prefix (2=HEADER, as you've surmised).
> 
> I'm not sure why you wouldn't see the header output with the change you
> suggest to builder.py.  It's possible that the message hasn't been added at
> that point?  I assume that you want to parse this from logs because that's
> how a lot of other, similar things are handled.  Barring that, I think it
> might make more sense to override some step methods to detect the exception
> itself and set the result accordingly.

Yeah, log eval is simply the easiest way to go about this because it's a well trodden path. I think it might be easy to do what you suggest though...I'm going to take a stab at it.
Huh, just reading a bit more code it looks like this should be handled already, at least in some cases:
https://hg.mozilla.org/build/buildbot/file/ecfab480076f/master/buildbot/process/buildstep.py#l1121

But other places don't seem to RETRY:
https://hg.mozilla.org/build/buildbot/file/ecfab480076f/master/buildbot/process/buildstep.py#l127
https://hg.mozilla.org/build/buildbot/file/ecfab480076f/master/buildbot/process/buildstep.py#l842
https://hg.mozilla.org/build/buildbot/file/ecfab480076f/master/buildbot/process/buildstep.py#l79

The last one in particular seems likely to be the path we're hitting - it's the errback that's called if the remoteCommand fails.
Master output from a recent failure:
36232 2013-10-22 03:41:51-0700 [Broker,1691,10.132.58.42] BuildSlave.detached(tst-linux64-ec2-385)
36233 2013-10-22 03:41:52-0700 [Broker,1691,10.132.58.42] <Build b2g_emulator_vm mozilla-aurora opt test crashtest-3>.lostRemote
36234 2013-10-22 03:41:52-0700 [Broker,1691,10.132.58.42]  stopping currentStep <buildbotcustom.steps.mock.MockCommand instance at 0x16c4c950>
36235 2013-10-22 03:41:52-0700 [Broker,1691,10.132.58.42] addCompleteLog(interrupt)
36236 2013-10-22 03:41:52-0700 [Broker,1691,10.132.58.42] RemoteCommand.interrupt <RemoteShellCommand '['/tools/buildbot/bin/python', 'scripts/scripts/b2g_emulator_unitte      st.py', '--cfg', 'b2g/emulator_automation_config.py', '--test-suite', 'crashtest', '--this-chunk', '3', '--total-chunks', '3', '--no-update', '--download-symbols',       'ondemand']'> [Failure instance: Traceback (failure with no frames): <class 'twisted.internet.error.ConnectionLost'>: Connection to the other side was lost in a non      -clean fashion.
36237     ]
36238 2013-10-22 03:41:52-0700 [Broker,1691,10.132.58.42] RemoteCommand.disconnect: lost slave
36239 2013-10-22 03:41:52-0700 [Broker,1691,10.132.58.42] releaseLocks(<buildbotcustom.steps.mock.MockCommand instance at 0x16c4c950>): []
36240 2013-10-22 03:41:52-0700 [Broker,1691,10.132.58.42]  step 'run_script' complete: exception
36241 2013-10-22 03:41:52-0700 [Broker,1691,10.132.58.42]  <Build b2g_emulator_vm mozilla-aurora opt test crashtest-3>: build finished


Looking through some Buildbot code I see that the ".lostRemote" and "stopping currentStep lines come from https://mxr.mozilla.org/build-central/source/buildbot/master/buildbot/process/base.py#422. They end up calling LoggingBuildStep.interrupt, which is https://hg.mozilla.org/build/buildbot/file/ecfab480076f/master/buildbot/process/buildstep.py#l1107, which calls BuildStep.interrupt - https://hg.mozilla.org/build/buildbot/file/ecfab480076f/master/buildbot/process/buildstep.py#l842. None of these call checkDisconnect, which is what handles RETRYing for interrupted builds.

The fact that our step_status in "interrupted" supports the theory that checkDisconnect isn't called. It should be "exception slave lost" if it was (https://hg.mozilla.org/build/buildbot/file/ecfab480076f/master/buildbot/process/buildstep.py#l1123).

Interestingly, the "interrupted" status seems get set from self.finished, which is a callback of the runCommand deferred - not an errback. The other errback from that deferred calls https://hg.mozilla.org/build/buildbot/file/ecfab480076f/master/buildbot/process/buildstep.py#l879, which we also don't see any log messages from.

So, it appears to me that because ConnectionLost errors don't actually bubble up to the runCommand deferred. I'm having trouble following RemoteCommand's code because of all the deferred indirection, but I _think_ it eats the ConnectionLost exception in or around _finished: https://hg.mozilla.org/build/buildbot/file/ecfab480076f/master/buildbot/process/buildstep.py#l191. I certainly don't see anything that's re-raising...

Dustin, this comment may be a little rambly, but do you have any more insight to offer here? It looks like this might affect the latest upstream too, as this code doesn't seem to have changed much.
Flags: needinfo?(dustin)
In general, what your suggesting sounds plausible - PB exceptions burp up at random spots, and aren't handled at all consistently.  And yes, even when an exception occurs, if the step was interrupted, it is treated as a success as far as the Deferred is concerned (meaning that .finished gets called), with the result forced to INTERRUPTED.

The new slave protocol will handle this better, since it isolates all of the PB code in one place where we can be consistent about handling of exceptions like this.  And the async-steps work in nine is taking care of the weird distinction between self.finished and self.failed.

Anyway, it sounds like the hope of just catching this error is fading :(
Flags: needinfo?(dustin)
This appears to work in staging, modulo slightly weird text in summaries (which I don't think are used by TBPL or other developer facing tools). This patch does a couple of things to make this work:
1) Actually fire the RemoteCommand.deferred errback when there's an error. Currently, the code fires the callback, which is why we never get into LoggingBuildStep.checkDisconnect.
2) Don't override the RETRY status with EXCEPTION in LoggingBuildStep.finished.

Have a look at this for examples:
http://dev-master01.build.scl1.mozilla.com:8018/buildslaves/bld-centos6-hp-004

I rebooted the slave during the "Linux mozilla-central leak test build", "Linux x86-64 mozilla-central build", and "Linux mozilla-central build" jobs. For the first two, this patch was applied. For the latter it was not.
Attachment #821012 - Flags: review?(dustin)
Attachment #821012 - Flags: review?(catlee)
In case it helps review, here's the whiteboarding I did when working out this fix:
https://people.mozilla.org/~bhearsum/retry-disconnect.jpg
The "Da" deferred runs throughout LoggingBuildStep and RemoteCommand, and contains the errback with checkDisconnect
The "Db" deferred is internal to RemoteCommand and is the one whose errback never got called (which ended up eating the exception).
Comment on attachment 821012 [details] [diff] [review]
retry on slave loss

So one key point to note here is that `Deferred.callback` is equivalent to `Deferred.errback` if given a failure.  So, assuming `_finished` is called with `failure` set to an actual `Failure` instance:

   198         d = defer.maybeDeferred(self.remoteComplete, failure)

calls remoteComplete, which simply returns failure.  maybeDeferred converts this to a Deferred, fired with the failure.

   201         d.addCallback(lambda r: self)

is ignored since the callback is in the failure path

   204         d.addBoth(self.deferred.callback)

invokes `self.deferred` with the failure from `d`, which is `failure`.

All of which is to say, the effect of the first hunk in the patch is almost nil.  The only difference comes from

+        d.addErrback(lambda r: failure)

which will ignore any failure returned from `remoteComplete` and substitute the failure given to `_finished`, if any.  If `remoteComplete` returned a failure when none had been given to `_finished`, this would ignore the failure.  None of that is really an issue here, since `remoteComplete` just returns its argument.

I think I'm seeing the big picture now: the step's interrupt method gets called when the buildslave detects a detach, which is based on a `notifyOnDisconnect` call to the PB broker.  That races with the exception bubbling up from the `callRemote` method used within the `RemoteCommand`.

So, the exception *is* caught and bubbled from the `RemoteCommand` to the step. `checkDisconnect` handles it, and calls `finished(RETRY)`.  However, by that time the `interrupt` process has set `step.stopped`, so `finished` ignores the `RETRY` and sets the status to `EXCEPTION`.

So I think that the second hunk in your patch is a good fix in the moment (and upstreamable to nine!) although in an ideal world we'd get rid of the race.
Attachment #821012 - Flags: review?(dustin) → review+
I included CANCELLED in this one because it seems like we shouldn't override it either. I sent the patch upstream too, but CANCELLED is missing from there, curiously:
https://github.com/buildbot/buildbot/pull/937
https://github.com/buildbot/buildbot/pull/555#issuecomment-26933871
Attachment #821191 - Flags: review?(dustin)
CANCELLED was only added in nine.
Attachment #821012 - Flags: review?(catlee) → review+
Comment on attachment 821191 [details] [diff] [review]
don't override retry/cancelled status

To the extent that I merged this upstream, LGTM.
Attachment #821191 - Flags: review?(dustin) → review+
Comment on attachment 821191 [details] [diff] [review]
don't override retry/cancelled status

Starting graceful restarts for this shortly.
Attachment #821191 - Flags: checked-in+
Depends on: 931805
It took awhile, but the restarts for this are finally done. Please let me know if you see any future instances of this.
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Thank you :-)
Thanks Ben!
Component: General Automation → General
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: