1053558 - We should probably reset try to stop breaking hg.m.o

Wes Kocher (:KWierso) (Not reading bugmail; email directly if needed)

Reporter

Description

11 years ago

All of these recent tree-wide closures seem to happen after tryserver has been open for a while. Backlog in #vcs makes it sound like resetting try would make these problems go away.

[16:37] bkero gps: I do have quite a few tracebacks ending in chmap.py', in 'update'
[16:37] bkero #19 file '/root/gunicorn/lib/python2.6/site-packages/mercurial/branchmap.py', in 'updatecache'
[16:37] bkero #23 file '/root/gunicorn/lib/python2.6/site-packages/mercurial/localrepo.
[16:42] gps bkero: that's the branch cache
[16:43] gps bkero: cache population time is proportional to number of heads
[16:43] gps so it might be time to reset try

We should try to schedule this for sometime soon so we can put this nightmare behind us.

Gregory Szorc [:gps]

Comment 1

11 years ago

The evidence (trace output and tracebacks from processes on pegged cores) supports the conclusion that a known Mercurial scaling problem with branch cache population on mega-headed repos is being hit on the web heads.

Culling the heads is the mitigation strategy.

The hot function is http://selenic.com/repo/hg/file/8a7bd2dccd44/mercurial/branchmap.py#l146 (from 2.5.4). This function has been significantly rewritten in newer versions and should scale farther than before: http://selenic.com/repo/hg/file/44d6818b9cd9/mercurial/branchmap.py#l227
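
Since the cost described here grows with the number of heads, it can help to watch that number directly. The following is a minimal sketch, not something used in this bug: the function name `count_heads` and the injectable `run` parameter are assumptions made for illustration, and it relies only on `hg heads --template`.

```python
import subprocess


def count_heads(repo_path, run=subprocess.check_output):
    """Return the number of heads that `hg heads` reports for repo_path.

    `run` is a hypothetical hook (defaults to subprocess.check_output) so the
    parsing can be exercised without a Mercurial install.
    """
    # Print one node hash per line, then count the non-empty lines.
    out = run(["hg", "-R", repo_path, "heads", "--template", "{node}\n"])
    if isinstance(out, bytes):
        out = out.decode("ascii")
    return sum(1 for line in out.splitlines() if line.strip())
```

Calling `count_heads("/path/to/try-clone")` before and after culling heads would give a rough number to compare against branch cache rebuild times.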

Mike Hommey [:glandium]

Comment 2

11 years ago

(In reply to Gregory Szorc [:gps] from comment #1)
> The evidence (trace output and tracebacks from processes on pegged cores)
> supports a known Mercurial scaling problem with branch cache population on
> mega-headed repos is being hit on the web heads.
>
> Culling the heads is the mitigation strategy.

If it's only a number of heads problem, we could merge them.

Mike Hommey [:glandium]

Comment 3

11 years ago

(In reply to Mike Hommey [:glandium] from comment #2)
> If it's only a number of heads problem, we could merge them.

(and merge new heads as they come in)

hwine

Comment 4

11 years ago

Let's go with a try reset -- worst case, it will provide a data point.

To clarify the context of the IRC log in comment 0:
- that discussion is from a test harness running a different version of hg (3.1 vs production 2.5.4), and under a different WSGI container (gunicorn vs Apache mod_wsgi), than used in production.

Also moving to new home of hg bugs

Assignee: server-ops-webops → nobody

Component: WebOps: Source Control → Repos and Hooks

Product: Infrastructure & Operations → Release Engineering

QA Contact: nmaul → hwine

hwine

Comment 5

11 years ago

(In reply to Mike Hommey [:glandium] from comment #3)
> (In reply to Mike Hommey [:glandium] from comment #2)
> > If it's only a number of heads problem, we could merge them.
>
> (and merge new heads as they come in)

fwiw, this was tried in the past, and did not affect ssh push times. Since try pushes are no longer an issue (or masked), this may be worth a retry.

Gregory Szorc [:gps]

Comment 6

11 years ago

I would try merging heads before doing it for real. Make a clone of the try repo, merge the heads. Then `rm .hg/cache/*` and `hg --time branches` and see what happens. If that is less than a few minutes, we are in business.
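
The measurement step above (clear `.hg/cache/*`, then run `hg --time branches`) could be wrapped roughly as follows. This is a sketch under assumptions: the function name `time_branch_cache_rebuild` and the injectable `run` parameter are hypothetical, and it is meant for a throwaway clone, never the live repo.

```python
import glob
import os
import subprocess
import time


def time_branch_cache_rebuild(repo_path, run=subprocess.check_call):
    """Delete cached files, then time a cold `hg --time branches` run.

    Returns wall-clock seconds. `run` is a hypothetical hook so the function
    can be exercised without Mercurial installed.
    """
    # Equivalent of `rm .hg/cache/*`: plain files only, directories are skipped.
    for path in glob.glob(os.path.join(repo_path, ".hg", "cache", "*")):
        if os.path.isfile(path):
            os.remove(path)

    # With the cache gone, `hg branches` has to repopulate the branch cache.
    start = time.monotonic()
    run(["hg", "-R", repo_path, "--time", "branches"])
    return time.monotonic() - start
```

Running this once on an unmodified clone and once on a clone with merged heads gives the two numbers to compare against the "less than a few minutes" threshold.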

Mike Hommey [:glandium]

Comment 7

11 years ago

http://mercurial.selenic.com/wiki/PruningDeadBranches#No-Op_Merges
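
For readers who do not follow the link, a no-op merge removes a head by merging it and then discarding its changes, so the surviving head's contents are unchanged. The sketch below only builds the command sequence as I understand the technique; the helper name `noop_merge_commands` and the default commit message are assumptions, and the wiki page remains the reference for the exact recipe.

```python
def noop_merge_commands(keep_rev, dead_head, message=None):
    """Build the hg commands for a no-op merge of dead_head into keep_rev.

    Hypothetical helper: it returns argv lists and does not run anything.
    """
    if message is None:
        message = "No-op merge to eliminate head %s" % dead_head
    return [
        # Start from the head whose contents should survive.
        ["hg", "update", "--clean", keep_rev],
        # Merge the unwanted head without attempting file merges; hg may exit
        # non-zero here when files are left unresolved, so a runner would
        # need to tolerate that.
        ["hg", "merge", "--tool", "internal:fail", dead_head],
        # Throw away everything the merge brought in.
        ["hg", "revert", "--all", "--rev", "."],
        # Mark any files left unresolved by the merge as resolved.
        ["hg", "resolve", "--all", "--mark"],
        # Commit the merge; dead_head stops being a head.
        ["hg", "commit", "-m", message],
    ]
```

Doing this for every head on try would leave a single head, which is what the branch cache timing test in comment 6 is meant to evaluate.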

Gregory Szorc [:gps]

Comment 8

11 years ago

http://hg.stage.mozaws.net/mirrors/generaldelta/try/ is a live backup of Try. It goes back several Try resets :D

Mike Hommey [:glandium]

Comment 9

11 years ago

(In reply to Gregory Szorc [:gps] from comment #8)
> http://hg.stage.mozaws.net/mirrors/generaldelta/try/ is a live backup of
> Try. It goes back several Try resets :D

It'd be nicer if it used the same UI as hg.m.o, and it would be awesome if we 302'd try/ 404s to there.
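
One way the 302 idea could look is a small WSGI wrapper around the web application. This is purely an illustrative sketch: the class name `TryArchiveFallback` is hypothetical, it assumes the mirror uses the same URL layout below `try/` as hg.m.o does, and it buffers the whole response for simplicity, which a real deployment would likely avoid.

```python
MIRROR = "http://hg.stage.mozaws.net/mirrors/generaldelta/try/"


class TryArchiveFallback:
    """Hypothetical WSGI middleware: turn 404s under /try/ into 302s to a mirror."""

    def __init__(self, app, prefix="/try/", mirror=MIRROR):
        self.app = app
        self.prefix = prefix
        self.mirror = mirror

    def __call__(self, environ, start_response):
        path = environ.get("PATH_INFO", "")
        if not path.startswith(self.prefix):
            # Not a try URL: pass through untouched.
            return self.app(environ, start_response)

        state = {}
        written = []

        def deferred(status, headers, exc_info=None):
            # Hold the status back until we know whether it is a 404.
            state["args"] = (status, headers, exc_info)
            return written.append  # legacy WSGI write() callable

        result = self.app(environ, deferred)
        try:
            body = list(result)  # buffers the response (sketch only)
        finally:
            close = getattr(result, "close", None)
            if close is not None:
                close()

        status, headers, exc_info = state["args"]
        if status.startswith("404"):
            location = self.mirror + path[len(self.prefix):]
            if environ.get("QUERY_STRING"):
                location += "?" + environ["QUERY_STRING"]
            start_response("302 Found",
                           [("Location", location), ("Content-Length", "0")])
            return [b""]

        start_response(status, headers, exc_info)
        return written + body
```

Wrapping the existing application as `application = TryArchiveFallback(application)` would send requests for changesets lost in a reset to the backup instead of returning a 404.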

hwine

Comment 10

11 years ago

try reset in progress

Assignee: nobody → bkero

Status: NEW → ASSIGNED

Ben Kero [:bkero]

Assignee

Comment 11

11 years ago

2014-08-13-1830: [bkero@boris ~]$ parallel ssh {} rm -rf /repo/hg/mozilla/try ::: hgweb{1..10}.dmz.scl3.mozilla.com
2014-08-13-1830: [root@hgssh1 ~]$ /repo/hg/scripts/reset_try.sh
Resetting try is a disruptive event to developer worflows and must be coordinated with RelEng buildduty, and notifications sent to the CAB and dev mailing lists.
Proceed? (y/N): y
Okay, here we go!
Moving current try repo to /repo/hg/nonlive/try-reset-2014-08-13-1826
Cloning mozilla-central into the try repo
requesting all changes
adding changesets
adding manifests
adding file changes
added 199347 changesets with 1113408 changes to 165015 files
Trying to insert into pushlog. Please do not interrupt...
Inserted into the pushlog db successfully.

real 32m36.422s
user 11m46.610s
sys 1m33.189s
Fixing try repo permissions
Cleaning up pushlog.db
All done
2014-08-13-1901: [hg@hgssh1 ~]$ /usr/local/bin/repo-push.sh /try
2014-08-13-1924: [hg@hgssh1 ~]$

Try has been reset

Ed Morley [:emorley]

Updated

11 years ago

Depends on: 1053678

Ed Morley [:emorley]

Updated

11 years ago

Blocks: 1040308

Ed Morley [:emorley]

Comment 12

11 years ago

(In reply to Gregory Szorc [:gps] from comment #1)
> The evidence (trace output and tracebacks from processes on pegged cores)
> supports a known Mercurial scaling problem with branch cache population on
> mega-headed repos is being hit on the web heads.
...
> This function has been significantly rewritten in newer versions and should
> scale farther than before

Sounds like we should update Mercurial on hg.m.o then :-)

Filed bug 1053705.

Status: ASSIGNED → RESOLVED

Closed: 11 years ago

Resolution: --- → FIXED

Nobody; OK to take it and work on it

Updated

11 years ago

Product: Release Engineering → Developer Services

Categories

(Developer Services :: Mercurial: hg.mozilla.org, defect)

Tracking

(Not tracked)

People

(Reporter: KWierso, Assigned: bkero)
