962291 - reset try during next TCW

Reporter

Description

10 years ago

Bug 962275 is the latest in a long line of try-reset requests generated by developer frustration.

ATM, we seem to be good for several months after a reset, so doing a reset as part of every tree closing window (TCW) would be a preventative measure. Let's do that for the next one.

Flags: cab-review?

Hal Wine [:hwine] use NI!

Reporter

Comment 1

10 years ago

Moving over discussion from bug 962275

(In reply to Shyam Mani [:fox2mike] from bug 962275 comment #7)
> Just wanted to confirm that this bug isn't
> clashing with the TCW reset bug and if we do the reset now (before the next
> window), we'll probably skip the Feb window and do a reset again for the one
> after.

My gut says to not skip. We don't fully understand the underlying problem, and the cost to developers (frustration, lost time) is likely higher than is reported via complaining.

IT has made several changes to the hg infrastructure to address this issue (i.e. things that were believed to have resolved the issue, but haven't). RelEng has attempted to "repair" the try repo via automation. Neither have been effective.

There remain some technical reasons to hope that the next major hg upgrade will improve the issue (see other bugs), but in the interim, resetting appears to be our best preventative. Let's do that regularly.

That makes the Feb window a good opportunity to try this out. This requires coordination between the IT vcs team and the releng CI team (each owns one or more parts of the process).

Shyam Mani [:fox2mike]

Comment 2

10 years ago

(In reply to Hal Wine [:hwine] (use needinfo) from comment #1)
> Moving over discussion from bug 962275
> 
> (In reply to Shyam Mani [:fox2mike] from bug 962275 comment #7)
> > Just wanted to confirm that this bug isn't
> > clashing with the TCW reset bug and if we do the reset now (before the next
> > window), we'll probably skip the Feb window and do a reset again for the one
> > after.
> 
> My gut says to not skip. We don't fully understand the underlying problem,
> and the cost to developers (frustration, lost time) is likely higher than is
> reported via complaining.
> 
> IT has made several changes to the hg infrastructure to address this issue
> (i.e. things that were believed to have resolved the issue, but haven't).
> RelEng has attempted to "repair" the try repo via automation. Neither have
> been effective.

This is completely incorrect. The last reset was July 2013. See Bug 894429. 

try has been humming along since then until now, without any reported issues in-between. After the upgrade, we've been able to push mercurial to now 20,000+ heads on try. Saying we haven't been effective in addressing this or fixing the underlying issue is completely incorrect :) We're misusing mercurial, in the way we use try and the upstream fixes are the only thing that's kept it going for as long as it has, this time.

This is why I suggested we skip the February window if try is going to be reset today.

Shyam Mani [:fox2mike]

Comment 3

10 years ago

I guess this was reset during the TCW.

Status: NEW → RESOLVED

Closed: 10 years ago

Flags: cab-review? → cab-review-

Resolution: --- → FIXED

Shyam Mani [:fox2mike]

Comment 4

10 years ago

This was done outside the TCW; re-opening.

Status: RESOLVED → REOPENED

Flags: cab-review- → cab-review?

Resolution: FIXED → ---

Hal Wine [:hwine] use NI!

Reporter

Comment 5

10 years ago

per irc, tbd in CAB

(In reply to Shyam Mani [:fox2mike] from comment #2)
> (In reply to Hal Wine [:hwine] (use needinfo) from comment #1)
> > 
> > IT has made several changes to the hg infrastructure to address this issue
> > (i.e. things that were believed to have resolved the issue, but haven't).
> > RelEng has attempted to "repair" the try repo via automation. Neither have
> > been effective.
> 
> This is completely incorrect. The last reset was July 2013. See Bug 894429. 
> 
> try has been humming along since then until now, without any reported issues
> in-between. After the upgrade, we've been able to push mercurial to now
> 20,000+ heads on try. 

> Saying we haven't been effective in addressing this or
> fixing the underlying issue is completely incorrect :) We're misusing
> mercurial, in the way we use try and the upstream fixes are the only thing
> that's kept it going for as long as it has, this time.

Agreed - my apologies for the way comment 1 could be misinterpreted. Both teams have made a good effort at mitigating things, and it has improved. However, we still can't predict, monitor, or alert when this problem starts impacting devs. (I maintain the impact starts long before they start complaining, as it's now taken as "normal behavior".)

The tone comes from my personal frustration that the problem isn't "playing nice", and has consumed far more resources and developer goodwill than it's worth. Personally, I'm surrendering, and believe that resetting every TCW is the lesser evil, and will let us all put our attention and energy into "better bets" for improving things for the Project.

Yes, that is more often than is required based on the last run, but it's predictable and easy to communicate to impacted groups. I'm open to a less frequent update if we can figure out how to (a) communicate that & (b) act on it. (afaik, we have no other such "every X TCWs" actions.)

Shyam Mani [:fox2mike]

Comment 6

10 years ago

Approved by CAB on Feb 12 to be carried out during the next TCW on Feb 22nd

Blocks: 971818

Flags: cab-review? → cab-review+

Corey Shields [:cshields]

Comment 7

10 years ago

per irc, bkero will do the work on this.

Assignee: server-ops-webops → bkero

Ben Kero [:bkero]

Assignee

Comment 8

10 years ago

Try has been reset. nthomas confirmed that a test push was successful.

Ben Kero [:bkero]

Assignee

Updated

10 years ago

Status: REOPENED → RESOLVED

Closed: 10 years ago → 10 years ago

Resolution: --- → FIXED

Nobody; OK to take it and work on it

Updated

10 years ago

Component: WebOps: Source Control → General

Product: Infrastructure & Operations → Developer Services

BMO Automation

Updated

9 years ago

Change Request: --- → approved

Flags: cab-review+

Categories

(Developer Services :: General, task)

Tracking

(Not tracked)

People

(Reporter: hwine, Assigned: bkero)

Security

(public)
