1572193 - Can't land on autoland when it's closed

Reporter

Description

•

6 years ago

The sheriff asked me today to push a quick fix to autoland with CLOSED TREE in the commit message, but Lando doesn't seem to have allowed it to proceed. https://lando.services.mozilla.com/D41069/

(no longer active)

Reporter

Updated

•

6 years ago

Blocks: 1514805

(no longer active)

Reporter

Comment 1

•

6 years ago

FWIW this landed when Narcis reopened the tree manually to try to sneak this commit in...

:glob ✱

Comment 2

•

6 years ago

This doesn't block requiring Lando as the current policy for busted commits is for that commit to be backed out, not for a fixup commit.

Until we add CLOSED TREE support to Lando a MANUAL PUSH override is probably a better workaround that a quick open/closing of the tree.

No longer blocks: 1514805

Keywords: conduit-triaged

Priority: -- → P3

(no longer active)

Reporter

Comment 3

•

6 years ago

(In reply to Byron Jones ‹:glob› 🎈 from comment #2)

This doesn't block requiring Lando as the current policy for busted commits is for that commit to be backed out, not for a fixup commit.

Huh. This may be our policy, but it is absolutely not what we do in practice, for whatever our actual practices are worth. ;-) I haven't dug into why there is a difference between our policy and our practices, but usually where there is a difference like this I assume our policies need to be updated to match reality?

Until we add CLOSED TREE support to Lando a MANUAL PUSH override is probably a better workaround that a quick open/closing of the tree.

Sorry but I don't think this actually captures the problem expressed in comment 0. Let me try once again (and I would like to urge you to please forget about policy issues, let's focus about what engineers will see in practice):

Developer lands patch through Lando.
Patch is busted, sheriff notices, pings dev, "hey dev, look at this bustage link, can you please fix quickly or should we backout?"
She responds, yeah it's an easy error, let me fix quickly.
She uploads a quick fix to Lando.
- If the fix is submitted without "CLOSED TREE" in the commit message, cannot land.
- If the fix is submitted with "CLOSED TREE" in the commit message, cannot land.
- The patch therefore likely gets stuck in the commit queue.
Sheriff now has to decide, whether to open the tree and let random other bustage to creep in (and hopefully the fix too), or push the fix manually ignoring what's in the queue.
Developer is feeling terrible for this all, as she's tried to fix things, but often causes even more problems (e.g. a commit stuck in the queue, who knows whether it'll apply successfully or not).
MANUAL PUSH isn't a suitable solution here since developer doesn't have those magic privileges to autoland anyway, and can't take her bustage fix to inbound in the middle of the bustage hunting.

Hope this explains the spiderweb I have gotten stuck in a bunch of times...

Sebastian Hengst [:aryx] (needinfo me if it's about an intermittent or backout)

Comment 4

•

6 years ago

The current practice for pushes with check-in related failures is to give the developer a very short timeframe to provide a follow-up patch to fix it. Reasons:

Low overhead for developer. They write the (very short) fix and submit it to Phabricator. If it gets backed out, they have to hg graft -fr it again or how they do it with their workflow. In the past, they had to reopen the revisions on Phabricator (doesn't apply anymore as far as I know).
Cleaner annotation / hg log views: Common complain is that backouts introduce noise and make those difficult to read.
More complex patches with failures on platforms not tested with the Try pushes. Subjective impression those got handled on inbound for a very long time.
Less pushes: original push + follow-up = 2; original push + backout + relanding = 3. This is also about cost.

The plan for autoland was to always do backouts and rewrite history. Because that didn't materialize, autoland's draft phase has been dropped (bug 1538695).

(no longer active)

Reporter

Comment 5

•

6 years ago

Thanks Aryx!

:glob ✱

Comment 6

•

6 years ago

(In reply to :Ehsan Akhgari from comment #3)

Huh. This may be our policy, but it is absolutely not what we do in practice, for whatever our actual practices are worth. ;-) I haven't dug into why there is a difference between our policy and our practices, but usually where there is a difference like this I assume our policies need to be updated to match reality?

When it comes to implementation of these sorts of features, we (tooling developers) have to implement the documented policy, as that's the authoritative stance on the business's expectations. If there's a disparity between practice and policy then the policy needs to be updated prior to codification in tooling.

I don't completely agree with your assumption that deviations from policy usually means the policy needs to be updated. While I agree this is indeed sometimes the case, assuming that a policy is obsolete because it isn't being following leads to fragmented workflows and decreasing clarity around expected actions and outcomes.

(In reply to Sebastian Hengst [:aryx] (needinfo on intermittent or backout) from comment #4)

The current practice for pushes with check-in related failures is to give the developer a very short timeframe to provide a follow-up patch to fix it.

This is in direct contradiction with https://wiki.mozilla.org/Tree_Rules/Integration :

Sheriffs will watch these trees (mozilla-inbound and autoland) and back out bustage/regressions as necessary to keep it green.

Bustage is backed out right away. There's no "we'll let you fix this tree while everybody stands by".

Aryx - looks like that policy needs to be updated to reflect the current practices 🙂

While doing so it would help us if you can clarify terms in documentation as defined periods rather than abstract notions.
For example ".. is to give the developer no more than 30 minutes to provide a follow-up patch.. " is much easier to code than "a very short timeframe", and sets your expectations of developers clearly up front. Note "30 minutes" is a placeholder, I honestly am not sure what constitutes "a very short time" within the context of tree breakage.

Flags: needinfo?(aryx.bugmail)

(no longer active)

Reporter

Comment 7

•

6 years ago

(In reply to Byron Jones ‹:glob› 🎈 from comment #6)

(In reply to :Ehsan Akhgari from comment #3)

Huh. This may be our policy, but it is absolutely not what we do in practice, for whatever our actual practices are worth. ;-) I haven't dug into why there is a difference between our policy and our practices, but usually where there is a difference like this I assume our policies need to be updated to match reality?

When it comes to implementation of these sorts of features, we (tooling developers) have to implement the documented policy, as that's the authoritative stance on the business's expectations. If there's a disparity between practice and policy then the policy needs to be updated prior to codification in tooling.

Yes, I agree that this is the ideal. Just to be clear, I was not criticizing anything you/tooling developers have done here, merely pointing out a problem that I had run into. I'm looking at this from the perspective of "our tools should work for folks using them" which I think both of us agree with.

I don't completely agree with your assumption that deviations from policy usually means the policy needs to be updated. While I agree this is indeed sometimes the case, assuming that a policy is obsolete because it isn't being following leads to fragmented workflows and decreasing clarity around expected actions and outcomes.

I agree with what you are saying here generally. I believe if we were during the information gathering phase of this project and were speaking within the context of a tool which doesn't exist yet, it would totally be fair game to look at how much our policies matches our practices, and if there is a discrepancy why that is and which side needs fixing.

I would however like to point out that in my experience people's workflows in practice not matching what's written in a policy document to a letter is the default and not an odd exception. But anyway, I'd very much like to go back to talk about the problem discussed in the bug more detail, we can keep the more general discussions for other fora. :-)

Sebastian Hengst [:aryx] (needinfo me if it's about an intermittent or backout)

Comment 9

•

9 months ago

Document has been updated and follow-up patches should be provided by Lando from which code sheriffs can pull it with moz-phab patch.

There is still the open question if code sheriffs should be able to land code with Lando while the tree is closed.

Flags: needinfo?(aryx.bugmail)

Bugzilla

Quick Search

Can't land on autoland when it's closed

Categories

(Conduit :: Lando, defect, P3)

Tracking

(Not tracked)

People

(Reporter: ehsan.akhgari, Unassigned)

References

Details

(Keywords: conduit-triaged)

Crash Data

Security

(public)

User Story

Description

Updated

Comment 1

Comment 2

Comment 3

Comment 4

Comment 5

Comment 6

Comment 7

Comment 9