Slave disconnect exceptions should auto-retry

RESOLVED FIXED

Status

Release Engineering
General
RESOLVED FIXED
7 years ago
4 years ago

People

(Reporter: philor, Assigned: bhearsum)

Tracking

Bug Flags:
needs-treeclosure +

Firefox Tracking Flags

(Not tracked)

Details

Attachments

(1 attachment, 1 obsolete attachment)

(Reporter)

Description

7 years ago
Since we seem to be heading back into another round of "remoteFailed: [Failure instance: Traceback (failure with no frames): <class 'twisted.internet.error.ConnectionLost'>: Connection to the other side was lost in a non-clean fashion." it would be nice if we auto-retried, rather than randomly manually deciding whether or not someone might want a particular test result badly enough to poke buildduty about it.
(Assignee)

Comment 1

7 years ago
I still haven't gotten a chance to look at this, but I will =\.
Assignee: nobody → bhearsum
(Assignee)

Comment 2

7 years ago
Hmmm, been digging into the Buildbot code and it looks like the Build object, which is responsible for deciding on overall Build status, doesn't work very well when an earlier step sets a worse status than a subsequent. In these cases, we have an early-ish step setting RETRY, and then once the slave disconnects, buildbot dumbly tries to run any steps marked as alwaysRun=True. Those fail out with an exception because the slave is already gone, and the Build object chooses to set the overall status as EXCEPTION rather than retry. So, there's two upstream bugs I'm going to address here:
- Build object needs to be smarter about setting overall build status
- Shouldn't run alwaysRun steps if a build fails out due to a slave disconnecting
Flags: needs-reconfig+
(Assignee)

Comment 3

7 years ago
Created attachment 481281 [details] [diff] [review]
fix buildbot to stop overriding RETRY

This patch is a combination of my two upstream buildbot commits that should resolve this bug:
http://github.com/buildbot/buildbot/commit/8556b470586c72c5788587fec427959e9fe5a553
http://github.com/buildbot/buildbot/commit/90e38d541ce3114d3cd0faa38d1dfa0b0b713329
(Assignee)

Comment 4

7 years ago
Comment on attachment 481281 [details] [diff] [review]
fix buildbot to stop overriding RETRY

survey of IRC told me I should still get review on this, even though it's upstream. Nick or Rail, either of you feel comfortable doing this?
Attachment #481281 - Flags: review?(rail)
Attachment #481281 - Flags: review?(nrthomas)
Comment on attachment 481281 [details] [diff] [review]
fix buildbot to stop overriding RETRY

I don't have my head in the buildbot codebase so can't review this. However I would be interested in how the test results change with and without the first hunk. Just testStepDoneRetryOverridesAnythingElse ? And with 6 flavors of *On* settings, do you have a enough tests to make sure no combination regresses ?
Attachment #481281 - Flags: review?(nrthomas)
(Assignee)

Comment 6

7 years ago
I'm pretty certain I'm not changing anything subtle based on reading the code, but it's pretty trivial to add more tests -- I'll do that.
(Assignee)

Comment 7

7 years ago
Created attachment 481577 [details] [diff] [review]
fix regression, add more tests

Good idea indeed, Nick. I ended up catching a regression. Here's an all inclusive patch that tests at least the default and non-default states of each of those flags. Testing every possible combination isn't _quite_ possible (5 step results ^ 6 flags ^ 2), but this covers a much better portion than before.
Attachment #481281 - Attachment is obsolete: true
Attachment #481577 - Flags: review?(rail)
Attachment #481281 - Flags: review?(rail)
(Assignee)

Updated

7 years ago
Flags: needs-reconfig+ → needs-reconfig?
(Assignee)

Updated

7 years ago
Flags: needs-reconfig?
Comment on attachment 481577 [details] [diff] [review]
fix regression, add more tests

Looks good. The only confusing for me is the description of warnOnWarnings in the documentation:

warnOnWarnings
    when True, a WARNINGS or _FAILURE_ of this build step will mark the overall build as having WARNINGS. The remaining steps will still be executed.

I you really want to mark overall result as WARNINGS when warnOnWarnings is set and the step result is FAILURE, you need another "if" in "if result == FAILURE" section.
Attachment #481577 - Flags: review?(rail) → review+
(Assignee)

Comment 9

7 years ago
(In reply to comment #8)
> Comment on attachment 481577 [details] [diff] [review]
> fix regression, add more tests
> 
> Looks good. The only confusing for me is the description of warnOnWarnings in
> the documentation:
> 
> warnOnWarnings
>     when True, a WARNINGS or _FAILURE_ of this build step will mark the overall
> build as having WARNINGS. The remaining steps will still be executed.
> 
> I you really want to mark overall result as WARNINGS when warnOnWarnings is set
> and the step result is FAILURE, you need another "if" in "if result == FAILURE"
> section.


Seems like a documentation error to me. Here's the old code:
http://github.com/buildbot/buildbot/blob/a09caa3ce81adc89c61730b57180b7066348a465/master/buildbot/process/base.py#L399

...which doesn't match the docs afaict. I'm going to guess that somebody copied/pasted from flunkOnWarnings and didn't change it enough. I'll fix them, too.
(Assignee)

Comment 10

7 years ago
This is ready to land in the next tree closure window.
Flags: needs-treeclosure?
(Assignee)

Comment 11

7 years ago
Comment on attachment 481577 [details] [diff] [review]
fix regression, add more tests

Landed w/out the tests, because test_process_base.py doesn't exist in our repo (it was added post-0.8.1).
changeset:   95:58225ba72678
Attachment #481577 - Flags: checked-in+
(Assignee)

Updated

7 years ago
Flags: needs-treeclosure? → needs-treeclosure+
(Assignee)

Comment 12

7 years ago
I'm in the process of updating our masters for this now.
(Assignee)

Comment 13

7 years ago
I just confirmed that this is working on one of the masters w/ the new code.
(Assignee)

Comment 14

7 years ago
(for posterity, I tested by kill -9'ing the buildbot process on the slave, which caused the build to finish with RETRY and was automatically restarted)
(Assignee)

Comment 15

7 years ago
This should be active on all the 0.8.0 masters now.
Status: NEW → RESOLVED
Last Resolved: 7 years ago
Resolution: --- → FIXED
Product: mozilla.org → Release Engineering
You need to log in before you can comment on or make changes to this bug.