We should probably reset try to stop breaking hg.m.o

RESOLVED FIXED

Status

Developer Services
Mercurial: hg.mozilla.org
3 years ago
3 years ago

People

(Reporter: KWierso, Assigned: bkero)

(Reporter)

Description

3 years ago
All of these recent tree-wide closures seem to happen after tryserver has been open for a while. Backlog in #vcs makes it sound like resetting try would make these problems go away.

[16:37]	bkero	gps: I do have quite a few tracebacks ending in chmap.py', in 'update'
[16:37]	bkero	#19 file '/root/gunicorn/lib/python2.6/site-packages/mercurial/branchmap.py', in 'updatecache'
[16:37]	bkero	#23 file '/root/gunicorn/lib/python2.6/site-packages/mercurial/localrepo.
[16:42]	gps	bkero: that's the branch cache
[16:43]	gps	bkero: cache population time is proportional to number of heads
[16:43]	gps	so it might be time to reset try


We should try to schedule this for sometime soon so we can put this nightmare behind us.

Comment 1

3 years ago
The evidence (trace output and tracebacks from processes on pegged cores) suggests that a known Mercurial scaling problem with branch cache population on mega-headed repos is being hit on the web heads.

Culling the heads is the mitigation strategy.

The hot function is http://selenic.com/repo/hg/file/8a7bd2dccd44/mercurial/branchmap.py#l146 (from 2.5.4).

This function has been significantly rewritten in newer versions and should scale farther than before:

http://selenic.com/repo/hg/file/44d6818b9cd9/mercurial/branchmap.py#l227

Comment 2

3 years ago
(In reply to Gregory Szorc [:gps] from comment #1)
> The evidence (trace output and tracebacks from processes on pegged cores)
> supports a known Mercurial scaling problem with branch cache population on
> mega-headed repos is being hit on the web heads.
> 
> Culling the heads is the mitigation strategy.

If it's only a number of heads problem, we could merge them.

Comment 3

3 years ago
(In reply to Mike Hommey [:glandium] from comment #2)
> If it's only a number of heads problem, we could merge them.

(and merge new heads as they come in)
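
To make the head-merging idea concrete, here is a toy model of a changeset graph. It is not Mercurial code: the `Dag` type and the helper names `heads`, `merge_all_heads` and `make_try_like` are invented for illustration, and it assumes every try push is a single commit on top of one shared root. It only shows that folding N heads together takes N-1 merge changesets and leaves a single head.

```python
from typing import Dict, Set, Tuple

# Hypothetical toy representation: revision number -> parent revisions.
Dag = Dict[int, Tuple[int, ...]]


def heads(dag: Dag) -> Set[int]:
    """Revisions that no other revision names as a parent."""
    parents = {p for ps in dag.values() for p in ps}
    return set(dag) - parents


def merge_all_heads(dag: Dag) -> Dag:
    """Return a copy in which the heads are folded together pairwise."""
    merged = dict(dag)
    remaining = sorted(heads(merged))
    next_rev = max(merged) + 1
    while len(remaining) > 1:
        a = remaining.pop()
        b = remaining.pop()
        merged[next_rev] = (a, b)   # a merge changeset has two parents
        remaining.append(next_rev)  # the merge itself is now a head
        next_rev += 1
    return merged


def make_try_like(n_pushes: int) -> Dag:
    """Root revision 0 plus n_pushes one-commit heads on top of it."""
    dag: Dag = {0: ()}
    for rev in range(1, n_pushes + 1):
        dag[rev] = (0,)
    return dag


if __name__ == "__main__":
    before = make_try_like(1000)
    after = merge_all_heads(before)
    print(len(heads(before)), "heads before")
    print(len(heads(after)), "heads after")
    print(len(after) - len(before), "merge changesets added")
```

The model says nothing about how long the real branch cache takes to populate; that still has to be measured on an actual clone.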

Comment 4

3 years ago
Let's go with a try reset -- worst case, it will provide a data point.

To clarify the context of the irc log in comment 0:
 - that discussion is from a test harness running a different version of hg (3.1 vs production 2.5.4), and under a different WSGI container (gunicorn vs apache mod_wsgi), than used in production.

Also moving to new home of hg bugs
Assignee: server-ops-webops → nobody
Component: WebOps: Source Control → Repos and Hooks
Product: Infrastructure & Operations → Release Engineering
QA Contact: nmaul → hwine

Comment 5

3 years ago
(In reply to Mike Hommey [:glandium] from comment #3)
> (In reply to Mike Hommey [:glandium] from comment #2)
> > If it's only a number of heads problem, we could merge them.
> 
> (and merge new heads as they come in)

fwiw, this was tried in the past, and did not affect ssh push times. Since try pushes are no longer an issue (or masked), this may be worth a retry

Comment 6

3 years ago
I would try merging heads before doing it for real.

Make a clone of the try repo, merge the heads. Then `rm .hg/cache/*` and `hg --time branches` and see what happens. If that is less than a few minutes, we are in business.
http://mercurial.selenic.com/wiki/PruningDeadBranches#No-Op_Merges
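
The experiment described in comment 6 could be scripted roughly as follows. This is a sketch under assumptions, not a tested procedure: the helper names (`merge_head_cmds`, `clear_cache`, `hg_lines`, `run_experiment`) are made up here, conflicts are resolved by keeping the local side via `--tool internal:local` (which is not necessarily the no-op merge recipe from the wiki page above), and it assumes `hg` is on `PATH` and Python 3.7 or later. On a repo with many thousands of heads the merge loop itself may take a long time.

```python
import glob
import os
import subprocess
from typing import List


def merge_head_cmds(node: str) -> List[List[str]]:
    """Commands that merge one head into the working directory parent."""
    return [
        # -y answers prompts non-interactively; internal:local keeps our side
        ["hg", "-y", "merge", "--tool", "internal:local", "-r", node],
        ["hg", "-y", "commit", "-m", "Merge head " + node[:12]],
    ]


def clear_cache(repo: str) -> int:
    """Python equivalent of `rm .hg/cache/*`; returns the files removed."""
    removed = 0
    for path in glob.glob(os.path.join(repo, ".hg", "cache", "*")):
        if os.path.isfile(path):
            os.remove(path)
            removed += 1
    return removed


def hg_lines(repo: str, *args: str) -> List[str]:
    """Run an hg command in repo and return its whitespace-split output."""
    result = subprocess.run(["hg", *args], cwd=repo, check=True,
                            capture_output=True, text=True)
    return result.stdout.split()


def run_experiment(source: str, dest: str) -> None:
    """Clone, merge every other head, drop caches, time `hg branches`."""
    subprocess.run(["hg", "clone", source, dest], check=True)
    current = hg_lines(dest, "log", "-r", ".", "--template", "{node}\n")
    for node in hg_lines(dest, "heads", "--template", "{node}\n"):
        if node in current:
            continue  # the working directory parent is already merged in
        for cmd in merge_head_cmds(node):
            subprocess.run(cmd, cwd=dest, check=True)
    clear_cache(dest)
    # The timing that matters: a cold branch cache rebuild
    subprocess.run(["hg", "--time", "branches"], cwd=dest, check=True)
```

Calling `run_experiment` with the path of a try clone and a scratch destination would run the whole sequence; the timing printed by the final `hg --time branches` is the number comment 6 asks about.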

Comment 8

3 years ago
http://hg.stage.mozaws.net/mirrors/generaldelta/try/ is a live backup of Try. It goes back several Try resets :D

Comment 9

3 years ago
(In reply to Gregory Szorc [:gps] from comment #8)
> http://hg.stage.mozaws.net/mirrors/generaldelta/try/ is a live backup of
> Try. It goes back several Try resets :D

It'd be nicer if it used the same UI as hg.m.o, and it would be awesome if we 302'd try/ 404s to there.

Comment 10

3 years ago
try reset in progress
Assignee: nobody → bkero
Status: NEW → ASSIGNED
(Assignee)

Comment 11

3 years ago
2014-08-13-1830: [bkero@boris ~]$ parallel ssh {} rm -rf /repo/hg/mozilla/try ::: hgweb{1..10}.dmz.scl3.mozilla.com 

2014-08-13-1830: [root@hgssh1 ~]$ /repo/hg/scripts/reset_try.sh

Resetting try is a disruptive event to developer worflows and must be coordinated with RelEng buildduty, and notifications sent to the CAB and dev mailing lists.
Proceed? (y/N): y

Okay, here we go!
Moving current try repo to /repo/hg/nonlive/try-reset-2014-08-13-1826
Cloning mozilla-central into the try repo
requesting all changes
adding changesets
adding manifests
adding file changes
added 199347 changesets with 1113408 changes to 165015 files
Trying to insert into pushlog.
Please do not interrupt...
Inserted into the pushlog db successfully.

real	32m36.422s
user	11m46.610s
sys	1m33.189s
Fixing try repo permissions
Cleaning up pushlog.db
All done

2014-08-13-1901: [hg@hgssh1 ~]$ /usr/local/bin/repo-push.sh /try
2014-08-13-1924: [hg@hgssh1 ~]$

Try has been reset

Updated

3 years ago
Depends on: 1053678

Updated

3 years ago
Blocks: 1040308
(In reply to Gregory Szorc [:gps] from comment #1)
> The evidence (trace output and tracebacks from processes on pegged cores)
> supports a known Mercurial scaling problem with branch cache population on
> mega-headed repos is being hit on the web heads.
...
> This function has been significantly rewritten in newer versions and should
> scale farther than before

Sounds like we should update Mercurial on hg.m.o then :-)
Filed bug 1053705.
Status: ASSIGNED → RESOLVED
Last Resolved: 3 years ago
Resolution: --- → FIXED
Product: Release Engineering → Developer Services