Closed Bug 962291 Opened 10 years ago Closed 10 years ago

reset try during next TCW

Categories

(Developer Services :: General, task)

Type: task
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: hwine, Assigned: bkero)

Bug 962275 is the latest in a long line of try-reset requests generated by developer frustration.

ATM, we seem to be good for several months after each reset, so doing a reset as part of every tree closing window (TCW) would be a preventative measure. Let's do that for the next one.
Flags: cab-review?
Moving over discussion from bug 962275

(In reply to Shyam Mani [:fox2mike] from bug 962275 comment #7)
> Just wanted to confirm that this bug isn't
> clashing with the TCW reset bug and if we do the reset now (before the next
> window), we'll probably skip the Feb window and do a reset again for the one
> after.

My gut says to not skip. We don't fully understand the underlying problem, and the cost to developers (frustration, lost time) is likely higher than is reported via complaining.

IT has made several changes to the hg infrastructure to address this issue (i.e. things that were believed to have resolved the issue, but haven't). RelEng has attempted to "repair" the try repo via automation. Neither have been effective.

There remain some technical reasons to hope that the next major hg upgrade will help with the issue (see other bugs), but in the interim, resetting appears to be our best preventative. Let's do that regularly.

That makes the Feb window a good opportunity to try this out. This requires coordination between the IT vcs team and the releng CI team (each owns one or more parts of the process).
(In reply to Hal Wine [:hwine] (use needinfo) from comment #1)
> Moving over discussion from bug 962275
> 
> (In reply to Shyam Mani [:fox2mike] from bug 962275 comment #7)
> > Just wanted to confirm that this bug isn't
> > clashing with the TCW reset bug and if we do the reset now (before the next
> > window), we'll probably skip the Feb window and do a reset again for the one
> > after.
> 
> My gut says to not skip. We don't fully understand the underlying problem,
> and the cost to developers (frustration, lost time) is likely higher than is
> reported via complaining.
> 
> IT has made several changes to the hg infrastructure to address this issue
> (i.e. things that were believed to have resolved the issue, but haven't).
> RelEng has attempted to "repair" the try repo via automation. Neither have
> been effective.

This is completely incorrect. The last reset was July 2013. See Bug 894429. 

try has been humming along since then until now, without any reported issues in-between. After the upgrade, we've been able to push mercurial to now 20,000+ heads on try. Saying we haven't been effective in addressing this or fixing the underlying issue is completely incorrect :) We're misusing mercurial, in the way we use try and the upstream fixes are the only thing that's kept it going for as long as it has, this time.

This is why I suggested we skip the February window if try is going to be reset today.
I guess this was reset during the TCW.
Status: NEW → RESOLVED
Closed: 10 years ago
Flags: cab-review? → cab-review-
Resolution: --- → FIXED
This was done outside the TCW; re-opening.
Status: RESOLVED → REOPENED
Flags: cab-review- → cab-review?
Resolution: FIXED → ---
per irc, tbd in CAB

(In reply to Shyam Mani [:fox2mike] from comment #2)
> (In reply to Hal Wine [:hwine] (use needinfo) from comment #1)
> > 
> > IT has made several changes to the hg infrastructure to address this issue
> > (i.e. things that were believed to have resolved the issue, but haven't).
> > RelEng has attempted to "repair" the try repo via automation. Neither have
> > been effective.
> 
> This is completely incorrect. The last reset was July 2013. See Bug 894429. 
> 
> try has been humming along since then until now, without any reported issues
> in-between. After the upgrade, we've been able to push mercurial to now
> 20,000+ heads on try. 

> Saying we haven't been effective in addressing this or
> fixing the underlying issue is completely incorrect :) We're misusing
> mercurial, in the way we use try and the upstream fixes are the only thing
> that's kept it going for as long as it has, this time.

Agreed, and my apologies for the way comment 1 could be misinterpreted. Both teams have made a good effort at mitigating things, and it has improved. However, we still can't predict, monitor, or alert when this problem starts impacting devs. (I maintain the impact starts long before they start complaining, as it's now taken as "normal behavior".)

The tone is from my personal frustration that the problem isn't "playing nice", and has consumed far more resources and developer good will than it's worth. Personally, I'm surrendering, and believe that to reset every TCW is the lesser evil, and will let us all put our attention and energy into "better bets" for improving things for the Project.

Yes, that is more often than is required based on the last run, but it's predictable and easy to communicate to impacted groups. I'm open to a less frequent reset if we can figure out how to (a) communicate that & (b) act on it. (afaik, we have no other such "every X TCWs" actions.)
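
For illustration only, here is a minimal sketch of what a head-count check on try might look like, given that we can't currently monitor or alert on this. It assumes that the number of open heads is a usable proxy for the problem, which this bug does not establish. The threshold value and the function names are hypothetical; the only figure taken from this thread is the "20,000+ heads" mentioned in comment 2.

```python
import subprocess

# Hypothetical threshold. The thread only says try reached 20,000+ heads;
# it does not identify a head count at which pushes start to degrade.
HEAD_ALERT_THRESHOLD = 20000


def count_heads(hg_output: str) -> int:
    """Count non-empty lines (one node hash per head) in `hg heads` output."""
    return sum(1 for line in hg_output.splitlines() if line.strip())


def needs_attention(head_count: int, threshold: int = HEAD_ALERT_THRESHOLD) -> bool:
    """Return True once the head count reaches the assumed threshold."""
    return head_count >= threshold


def heads_in_repo(repo_path: str) -> int:
    """Run `hg heads` against a local clone and count the open heads reported."""
    result = subprocess.run(
        ["hg", "-R", repo_path, "heads", "--template", "{node}\\n"],
        capture_output=True,
        text=True,
        check=True,
    )
    return count_heads(result.stdout)
```

Something like `needs_attention(heads_in_repo("/path/to/try"))` could then feed whatever alerting is already in place; the path here is a placeholder, not a real location.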
Approved by CAB on Feb 12 to be carried out during the next TCW on Feb 22nd
Blocks: 971818
Flags: cab-review? → cab-review+
per irc, bkero will do the work on this.
Assignee: server-ops-webops → bkero
Try has been reset. nthomas confirmed test push was successful.
Status: REOPENED → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Component: WebOps: Source Control → General
Product: Infrastructure & Operations → Developer Services
Change Request: --- → approved
Flags: cab-review+