Closed Bug 1001735 Opened 10 years ago Closed 10 years ago

Push to try taking 45 minutes

Categories

(Developer Services :: Mercurial: hg.mozilla.org, defect)

Type: defect
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED WORKSFORME

People

(Reporter: cabanier, Assigned: hwine)

References

(Blocks 1 open bug)

Details

(Whiteboard: [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/682] )

Please schedule a reset of the try repository ASAP with sheriffs & releng.
This is extremely disruptive; let's get it done.
Rik: can you give some more details, please? We've seen quick pushes since you filed this, and resets are highly impactful on your fellow developers.

If you've had a successful push since then, please update the bug.
Yes, pushes resumed for me later that day.
I've seen sporadic timeouts but nothing like last Friday.
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
So what is it going to take to get a reset now? It's been four hours since anyone was able to push to try, and yesterday I know of one one-hour period, and from the times I was half paying attention to people bitching, there were other extended periods.
Well, no one's been able to push for 4 hours now. So please let's get this reset done.

This is the third time this has happened in 2 weeks.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
[note: this is sort of also covered by bug 994028. Per bug 994028 comment 22, it's unbroken itself now. Reset still may be wise, though.]
We're trying to work out a new process here, as the suggestion of doing regularly scheduled resets was spoken out against.

As mentioned in bug 994028 comment 22, that question never got answered in the dev.planning thread.

What we (dev services, sheriffs, releng) need is a way to determine when there is "enough pain" to warrant the disruption of a reset.
(In reply to Hal Wine [:hwine] (use needinfo) from comment #7)
> We're trying to work out a new process here, as the suggestion of doing
> regularly scheduled resets was spoken out against.
> 
> As mentioned in bug 994028 comment 22 that never got answered in the
> dev.planning thread. 
> 
> What we (dev services, sheriffs, releng) need is a way to determine when
> there is "enough pain" to warrant the disruption of a reset.

I don't know what the reason for speaking out was, but the reality is that over the past few weeks I've experienced at least 6 hours of total downtime because of this. Assuming a reset takes 1 hour (there are some other small pains, I realize that), we could've reset 6 times for those implicit costs. And more importantly, those would've been planned outages, and we could've planned for them. With the outage today being unexpected, it has a much more profound impact on our ability to do our work, at least for me.
Please see/join the discussion on dev.platform -- when try is reset, all prior history of pushes is lost. TBPL data remains (if you saved your email), but some developers have expressed the need for access to the diffs pushed by other users.

That thread starts at http://goo.gl/o4omy1
Looks like she's down again, roughly 20 minutes to half an hour now, I think :)
A push just timed out for me, too:
{
$ time hg push -f ssh://hg.mozilla.org/try/
pushing to ssh://hg.mozilla.org/try/
searching for changes
remote: waiting for lock on repository /repo/hg/mozilla/try/ held by 'hgssh1.dmz.scl3.mozilla.com:14463'                                                                                                            
remote: abort: repository /repo/hg/mozilla/try/: timed out waiting for lock held by hgssh1.dmz.scl3.mozilla.com:14463
abort: unexpected response: empty string

real	10m5.538s
user	0m2.924s
sys	0m0.124s
}
No problem with try, just a long push from mfinkle. From pushlog:
mfinkle@mozilla.com
Fri May 02 20:35:58 2014 +0000	41b9dae6d1a8	Mark Finkle — try: -b o -p android-armv6 -u robocop-1,robocop-2,robocop-3,robocop-4,robocop-5 -t none
	71d0d0d6d593	Mark Finkle — Bug 1004890 - Add UI telemetry for the Home Page contextmenu r=rnewman
	4223c07f935b	Mark Finkle — Bug 1004889 - Add UI telemetry for the main menu r=rnewman
	fb88b8685827	Ryan VanderMeulen — Backed out changesets 34516556f823 (bug 1004890) and 0610c2dcc6f5 (bug 1004889) for Android Armv6 robocop crashes. CLOSED TREE
	36853e536d68	Jim Chen — Bug 1004073 - Refactor EventDispatcher usages; r=fennec-team
	193c049cea42	Gavin Sharp — Bug 896711: remove BrowserChromeTests.runWhenReady because it's being abused, r=dao/ttaubert/vporof
dwillcoxon@mozilla.com
Fri May 02 19:41:50 2014 +0000	13f45a0f910f	Drew Willcoxon — try: -b o -p linux64 -u none -t svgr
(In reply to Hal Wine [:hwine] (use needinfo) from comment #9)
> Please see/join the discussion on dev.platform -- when try is reset, all
> prior history of pushes is lost. TBPL data remains (if you saved your
> email), but some developers have expressed the need for access to the diffs
> pushed by other users.
> 
> That thread starts at http://goo.gl/o4omy1

I don't think any of the issues raised in that thread focused on doing a try reset when the try server is unable to accept pushes, as has been the case in the past few days. That thread was about doing regular try resets every 6 weeks.
(In reply to :Ehsan Akhgari (lagging on bugmail, needinfo? me!) from comment #13)
> > That thread starts at http://goo.gl/o4omy1
> 
> I don't think any of the issues raised in that thread focused on doing a try
> reset when the try server is unable to accept pushes as the case have been
> in the past few days.  That thread was about doing regular try resets every
> 6 weeks.

The intent of the thread is to decide when to do a reset, since a regular 6-week cycle caused issues. We need to have a practical solution that all are aware of.

Thanks for the feedback - I'll try to clarify the topic better in that thread.
It's timing out again. Someone having a long push again?

{
$ time hg try
pushing to ssh://hg.mozilla.org/try/
searching for changes
remote: waiting for lock on repository /repo/hg/mozilla/try/ held by 'hgssh1.dmz.scl3.mozilla.com:21656'
remote: abort: repository /repo/hg/mozilla/try/: timed out waiting for lock held by hgssh1.dmz.scl3.mozilla.com:21656
abort: unexpected response: empty string

real	10m3.970s
}

TBPL shows that the last successful Try push was 1 hr 45 minutes ago.
Back to normal. For the record, based on the pushlog, this time around the no-pushes window was ~2 hrs 20 min, between these push timestamps:
  Mon May 5 22:45:53 2014
  Tue May 6 01:05:44 2014
% time hg push -f try
pushing to ssh://hg.mozilla.org/try
searching for changes
remote: waiting for lock on repository /repo/hg/mozilla/try held by 'hgssh1.dmz.scl3.mozilla.com:16994'
remote: abort: repository /repo/hg/mozilla/try: timed out waiting for lock held by hgssh1.dmz.scl3.mozilla.com:16994
abort: unexpected response: empty string
hg push -f try  1.35s user 0.07s system 0% cpu 10:02.99 total
(In reply to Shu-yu Guo [:shu] from comment #17)
> % time hg push -f try
> pushing to ssh://hg.mozilla.org/try
> searching for changes
> remote: waiting for lock on repository /repo/hg/mozilla/try held by
> 'hgssh1.dmz.scl3.mozilla.com:16994'
> remote: abort: repository /repo/hg/mozilla/try: timed out waiting for lock
> held by hgssh1.dmz.scl3.mozilla.com:16994
> abort: unexpected response: empty string
> hg push -f try  1.35s user 0.07s system 0% cpu 10:02.99 total

Exact same timeout for me about 45 minutes before posting this. It happened twice, then I gave up.
For the window of comment 17 and comment 18, pushlog shows successes before & after:
    bballo@mozilla.com Thu May 08 02:32:24 2014 +0000
    bballo@mozilla.com Thu May 08 00:11:46 2014 +0000

:botond - do you recall if your 2nd push to try took an extra long time? If so, can you provide some additional information, please:
 - where did you push from? (e.g. laptop in MoTo office)
 - what type of network connection did you have? (wifi/wired dsl/office, etc)
 - anything unusual you recall about the push (messages, restarts, etc.)

Thanks!
Flags: needinfo?(botond)
(In reply to Hal Wine [:hwine] (use needinfo) from comment #19)
> For the window of comment 17 and comment 18, pushlog shows successes before
> & after:
>     bballo@mozilla.com Thu May 08 02:32:24 2014 +0000
>     bballo@mozilla.com Thu May 08 00:11:46 2014 +0000
> 
> :botond - do you recall if your 2nd push to try took an extra long time? If
> so, can you provide some additional information, please:
>  - where did you push from? (e.g. laptop in MoTo office)
>  - what type of network connection did you have? (wifi/wired dsl/office, etc)
>  - anything unusual you recall about the push (messages, restarts, etc.)

Yeah, I made several attempts to push to try yesterday evening. They all took very long (10+ minutes), and they all failed except for the last attempt, which was successful (that's the 02:32 one). (I also pushed a different patch earlier, and that was successful; that's the 00:11 one.)

I was physically located at home, connected to my MoTo office desktop via VPN, and pushed from there. My desktop uses ethernet.

Please let me know if there is any additional information I can provide.
Flags: needinfo?(botond)
Also having this problem:

pushing to ssh://hg.mozilla.org/try
searching for changes
remote: waiting for lock on repository /repo/hg/mozilla/try held by 'hgssh1.dmz.scl3.mozilla.com:1681'
remote: abort: repository /repo/hg/mozilla/try: timed out waiting for lock held by hgssh1.dmz.scl3.mozilla.com:1681
abort: unexpected response: empty string

real	10m4.179s
user	0m1.547s
sys	0m0.136s
(In reply to Botond Ballo [:botond] from comment #20)
> Yeah, I made several attempts to push to try yesterday evening. They all
> took very long (10+ minutes), and they all failed except for the last
> attempt

Oh and the error message when they failed was like in comment 21.
(In reply to Botond Ballo [:botond] from comment #20)
> Yeah, I made several attempts to push to try yesterday evening. They all
> took very long (10+ minutes), 

Any recollection of how big several was? 5? 10? 22?

> Please let me know if there is any additional information I can provide.

Thanks very much for a very complete response. The "ethernet from office" eliminates an entire line of speculation. :)
(In reply to Hal Wine [:hwine] (use needinfo) from comment #23)
> (In reply to Botond Ballo [:botond] from comment #20)
> > Yeah, I made several attempts to push to try yesterday evening. They all
> > took very long (10+ minutes), 
> 
> Any recollection of how big several was? 5? 10? 22?

Between 5 and 10.
I just had 5 of these, before the 6th attempt finally succeeded:

time hg try
pushing to ssh://hg.mozilla.org/try
remote: X11 forwarding request failed on channel 1
searching for changes
1 changesets found
remote: waiting for lock on repository /repo/hg/mozilla/try held by 'hgssh1.dmz.scl3.mozilla.com:19110'
remote: abort: repository /repo/hg/mozilla/try: timed out waiting for lock held by hgssh1.dmz.scl3.mozilla.com:19110
abort: unexpected response: empty string

real	10m3.337s
user	0m1.423s
sys	0m0.188s
There was an 85-minute window where no successful push occurred. Several reports in #developers, including comment 25.

    drno@ohlmeier.org  Fri May 09 20:07:33 2014 +0000
    spohl@mozilla.com  Fri May 09 20:06:34 2014 +0000
    dminor@mozilla.com Fri May 09 20:06:30 2014 +0000
    daniel@haxx.se     Fri May 09 18:41:44 2014 +0000
My push attempts are hanging again.
(In reply to Kartikaya Gupta (email:kats@mozilla.com) from comment #27)
> My push attempts are hanging again.

Please include information with your requests that will let us correlate data with logs. E.g. how long have you been trying? How many timeouts? How long before it succeeds?

At this moment, it has been less than 60 minutes since the last successful try push.
Flags: needinfo?(bugmail.mozilla)
I've been trying for around 45 minutes. I've tried three times so far. Each time I push, it gets stuck after this:

kats@kgupta-air mozilla$ hg push -f try -r tip
pushing to ssh://hg.mozilla.org/try/
searching for changes

and then eventually I kill it with ctrl-c and it outputs this:

^Cinterrupted!
remote: waiting for lock on repository /repo/hg/mozilla/try/ held by 'hgssh1.dmz.scl3.mozilla.com:30503'
remote: Killed by signal 2.

I've done this twice so far, and the lock holder was the same both times. I'm on my third attempt now, and it's doing the same thing. It's been 16 minutes so far on that last attempt.
Flags: needinfo?(bugmail.mozilla)
Yup, it's that time of the week again.
30+ minutes here, failed 3 times already, waiting a few minutes on my 4th attempt.
(In reply to Georg Fritzsche [:gfritzsche] from comment #31)
>waiting a few minutes on my 4th attempt.

... which magically happened to go through 30ish seconds after commenting here.
Is replication happening synchronously on push? i.e. does the master wait on the read-only slaves to pull the new changesets before telling the client the push is complete? If so, that's the first thing I'd finger for the recent lock issues.

We know the Try repo can be very slow for certain operations (especially if the repo isn't paged into memory). Since we now have N Try repos (master + N - 1 slaves), we've increased our probability of encountering slowness by N times.

I'd like to make some observations/recommendations:

1) Switching to generaldelta will make Try scale better. However, this will have CPU performance implications on build slaves unless certain steps are taken. See http://gregoryszorc.com/blog/2014/02/05/the-mercurial-revlog/

2) I would seriously consider removing slave replication for Try or at least making replication asynchronous. Expose the master as a separate hostname and have build automation read from that so there aren't issues reading from slaves. Downside: https://hg.mozilla.org/try may not work if it hits a slave. Alternative: can we intercept specific repo URLs on a load balancer or whatever does TLS termination and ensure they only hit the master server?

3) Improvements to multi-headed repo scaling have been added to Mercurial 2.9 and 3.0. Upgrade the server already.

4) We should be able to monkeypatch mercurial.util.makelock() via an extension to add logging around locking (a minimal sketch follows this list). I believe we expose the username via an environment variable. That's infinitely more useful than a PID.

5) Excessively headed repos won't scale. Long term, we are better off switching Try (and the eventual code review repo) to use bundles (or some other scalable changeset distribution and storage mechanism) rather than taking Mercurial (or Git) in a direction it was never meant to go. I /think/ this can be accomplished with only server-side changes (a custom extension).
6) The ui.timeout config option defines the max duration to wait for a repo lock. It defaults to 600 (10 minutes). We should consider lowering this on the server so people don't sit around forever.

7) The blackbox extension (added in 2.6 and improved in versions since) logs a lot of information useful for debugging these kinds of failures. Again, we need to get the server upgraded.
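Regarding item 4, here is a minimal sketch of what such a logging extension might look like, assuming the pushing user is exposed in the USER environment variable and that a writable log path exists. The file name, log location, and log format are illustrative only, not an existing Mozilla extension:

{
# lock_logger.py - hypothetical extension; enable via
#   [extensions]
#   lock_logger = /path/to/lock_logger.py
import os
import time
from mercurial import extensions, util

LOG_PATH = '/var/log/hg/lock.log'  # assumed location, adjust as needed

def _logged_makelock(orig, info, pathname):
    # Record who is taking which lock, and when, then delegate to the
    # original util.makelock().
    user = os.environ.get('USER', 'unknown')
    with open(LOG_PATH, 'a') as fh:
        fh.write('%s lock %s acquired by %s (%s)\n'
                 % (time.strftime('%Y-%m-%dT%H:%M:%S'), pathname, user, info))
    return orig(info, pathname)

def uisetup(ui):
    # Wrap util.makelock so every lock acquisition gets logged.
    extensions.wrapfunction(util, 'makelock', _logged_makelock)
}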
QA Contact: nmaul → laura
QA Contact: laura → hwine
Happening again to me right now.
We had a 5h 24m outage this evening:
    jsantell@mozilla.com Wed May 14 03:27:50 2014 +0000
    bzbarsky@mozilla.com Tue May 13 22:03:59 2014 +0000

With glandium's guidance, an old hook has been disabled (bug 1010053).

Current plan is to let this change stay for a while, and see when (if?) significant try delays recur.
Depends on: 1010053
For the record, this (-> http://bz.selenic.com/show_bug.cgi?id=4248) is what is preventing the useful "waiting for lock" message from being displayed when it's actually happening, instead of once things are unstuck or when you hit ctrl+c.
For the record, this is why we came up with bug 1010053:

- Hal said it was observed that, when this happens, the mercurial process seems to be stuck on something in ancestor.py.
- My main observation, from having held the lock (instead of waiting for it like most of the time), is that the changeset didn't even make it to the repository.
- The changeset not making it to the repository means the push is not reaching the changegroup hooks, which happen after that (which would tend to exclude push-repo.sh from being the root cause).
- At some point, the tree was closed, and my pushes were able to go through sending the changegroup bundle and receiving the tree closure message, which means it was able to go through to the treeclosure pretxnchangegroup hook.
- Between the treeclosure pretxnchangegroup hook and the changeset making it to the repository, there were two pretxnchangegroup hooks: try_gcc45, and try_mandatory.
- Coincidentally, try_gcc45 has a call to ancestor()
FWIW, I also have a clone of the try repository I did while i was pushing and holding the log, and i can reproduce pushes going through without the hook and being stuck sucking cpu with it. Although, it's worth noting a lot of things do take a lot of time on that clone, including a simple hg tip.
(In reply to Mike Hommey [:glandium] from comment #40)
> FWIW, I also have a clone of the try repository I did while i was pushing
> and holding the log

holding the lock.
(using both mercurial 2.5.4 as on the servers, and mercurial 2.9.2)
(In reply to Mike Hommey [:glandium] from comment #40)
> FWIW, I also have a clone of the try repository I did while i was pushing
> and holding the log, and i can reproduce pushes going through without the
> hook and being stuck sucking cpu with it. Although, it's worth noting a lot
> of things do take a lot of time on that clone, including a simple hg tip.

So, in fact, it seems I'm not able to reproduce locally with that clone. The behaviour I was seeing looks like it was due to the fact that it was a fresh clone that needed to create its caches, and that was taking time. Once the caches are there, things function normally.

That being said, I still believe try_gcc45 was involved in my own case of holding the lock, and while it might not be the whole story, it was contributing to it.
(In reply to Gregory Szorc [:gps] (not actively checking bugmail) from comment #34)
> Is replication happening synchronously on push? i.e. does the master wait on
> the read-only slaves to pull the new changesets before telling the client
> the push is complete? If so, that's the first thing I'd finger for the
> recent lock issues.
> 
> We know the Try repo can be very slow for certain operations (especially if
> the repo isn't paged into memory). Since we now have N Try repos (master + N
> - 1 slaves), we've increased our probability of encountering slowness by N
> times.
> 
> I'd like to make some observations/recommendations:
> 
> 1) Switching to generaldelta will make Try scale better. However, this will
> have CPU performance implications on build slaves unless certain steps are
> taken. See http://gregoryszorc.com/blog/2014/02/05/the-mercurial-revlog/

generaldelta has CPU implications on both ends: on the client side if it is set up to store generaldelta (which has to be explicit, so most people won't actually see a difference), and on the server side, because it will have to redo some deltas since AFAIK the wire protocol doesn't use generaldelta.

> 5) Excessive headed repos won't scale. Long term, we are better off
> switching Try (and the eventual code review repo) to use bundles (or some
> other scalable changeset distribution and storage mechanism) rather than
> taking Mercurial (or Git) in a direction it was never meant to go. I /think/
> this can be accomplished with only server-side changes (a custom extension).

If the number of heads is a problem more than the depth of the changeset graph, we could merge them into a fictional and useless head.
(In reply to Mike Hommey [:glandium] from comment #44)
> (In reply to Gregory Szorc [:gps] (not actively checking bugmail) from
> comment #34)
> > 5) Excessive headed repos won't scale. Long term, we are better off
> > switching Try (and the eventual code review repo) to use bundles (or some
> > other scalable changeset distribution and storage mechanism) rather than
> > taking Mercurial (or Git) in a direction it was never meant to go. I /think/
> > this can be accomplished with only server-side changes (a custom extension).

FWIW, the bundle approach won't support current try requirements as articulated by developers. Using bundles is (conceptually) the approach taken by Travis & Jenkins. We have the additional requirement that the commits be viewable in hgweb (i.e., maintain history). Yeah, there may be ways to work around this, but they add complexity.

> 
> If the number of heads is a problem more than the depth of the changeset
> graph, we could merge them into a fictional and useless head.

Per hg devs, the number of heads is large, but shouldn't be causing this issue by itself. (This matches our observations, as it is "randomly" slow with the current large head count.)

Previously, we did try an 'empty merge' to eliminate heads, and that did not change the occurrence of the issue. (Early automation attempt from several years ago.)

To reiterate and expand on comment 37:
 a) right now we're monitoring for the impact (if any) of changes outlined in comment 39
 b) if there are any significant issues, we will try a new history-preserving reset process
 c) if there are still significant issues, we will fall back to the old try reset

We have some confidence that (b) will be successful, and could be automated with minimal try closure time. I'll get a new bug open on that within a few days.
Recording a long-but-successful try push. It was long enough for other devs to have timeouts waiting for the lock, but the system was not hung:
 kgupta@mozilla.com      Thu May 15 16:09:38 2014 +0000
 chrislord.net@gmail.com Thu May 15 15:24:15 2014 +0000

The commit was based on m-c. While the push completed successfully on the server, it did not complete successfully on the client side (hg version 2.8.2):

kats@kgupta-air mozilla$ hg push -f try -r tip
pushing to ssh://hg.mozilla.org/try/
searching for changes

^Cinterrupted!
remote: adding changesets
remote: adding manifests
remote: adding file changes
remote: added 3 changesets with 2 changes to 1 files (+1 heads)
remote: Looks like you used try syntax, going ahead with the push.
remote: If you don't get what you expected, check http://trychooser.pub.build.mozilla.org/
remote: for help with building your trychooser request.
remote: Thanks for helping save resources, you're the best!
remote: Killed by signal 2.
kats@kgupta-air mozilla$
(In reply to Hal Wine [:hwine] (use needinfo) from comment #46)
> recording a long-but-successful try push. It was long enough for other devs
> to have timeouts waiting for the lock, but the system was not hung:
>  kgupta@mozilla.com      Thu May 15 16:09:38 2014 +0000
>  chrislord.net@gmail.com Thu May 15 15:24:15 2014 +0000
> 
> commit based on m-c. While it completed successfully, it did not
> successfully complete on the client side (hg version 2.8.2):
> 
> kats@kgupta-air mozilla$ hg push -f try -r tip
> pushing to ssh://hg.mozilla.org/try/
> searching for changes
> 
> ^Cinterrupted!
> remote: adding changesets
> remote: adding manifests
> remote: adding file changes
> remote: added 3 changesets with 2 changes to 1 files (+1 heads)
> remote: Looks like you used try syntax, going ahead with the push.
> remote: If you don't get what you expected, check
> http://trychooser.pub.build.mozilla.org/
> remote: for help with building your trychooser request.
> remote: Thanks for helping save resources, you're the best!
> remote: Killed by signal 2.
> kats@kgupta-air mozilla$

There is progress. The try_mandatory hook was run, whereas we weren't getting that far in my case. Did the push make it through to the repo (I would guess it did)? To the pushlog?
Blocks: 770811
(In reply to Mike Hommey [:glandium] from comment #47)
> There is progress. The try_mandatory hook was run while we weren't going
> that far in my case. Did the push make it through to the repo (i would guess
> it did)? to the pushlog?

Yeah, it got pushed to https://tbpl.mozilla.org/?tree=Try&rev=e289cd3644d9 and I got the email notification for it too.
(In reply to Mike Hommey [:glandium] from comment #43)
> (In reply to Mike Hommey [:glandium] from comment #40)
> > FWIW, I also have a clone of the try repository I did while i was pushing
> > and holding the log, and i can reproduce pushes going through without the
> > hook and being stuck sucking cpu with it. Although, it's worth noting a lot
> > of things do take a lot of time on that clone, including a simple hg tip.
> 
> So, in fact, it seems i'm no able to reproduce locally with that clone. The
> behaviour i was seeing looks like it was due to the fact that it was a fresh
> clone and that it needed to create caches, and that was taking time. Once
> the cache are there, things are functioning normally.

Yes, the caches take forever to populate on initial clone. This is something I've brought to the attention of Mercurial developers. See http://bz.selenic.com/show_bug.cgi?id=4205. I added blackbox logging of cache population in Mercurial 3.0 to help identify this in the real world.

I've found that once the caches are populated, updating latencies aren't too bad. I suspect the "random" delays could be caused by the caches getting invalidated, possibly due to a rollback of some kind.

Regarding generaldelta, we /could/ do the following until bundle2 is ready (likely 3.1 or 3.2):

1) Create Try as generaldelta
2) Have build slaves pull m-c before pulling from Try. They'll get most changesets with the regular, efficient encoding. The extra CPU work should involve <100 changesets, so the CPU hit should largely be mitigated.

Yes, the server will spend a little more CPU on push to re-encode to generaldelta. But again, each Try push only involves a handful of changesets. Unless we're re-encoding >100 changesets, I don't think the extra CPU work will be a problem.

FWIW, my mirror of Try used to encounter random performance issues all the time. When I switched to generaldelta, the random performance issues largely went away.
Theory: Rejected pushes are being rolled back and cause invalidation of the tags and branch caches, which triggers an (expensive) repopulation of those caches.

I don't believe that pushes by themselves should update the branch or tags cache. However, a hook may be accessing data that causes a tags or branch cache update.

This theory can be proved by upgrading the server to Mercurial 3.0 and activating the blackbox extension to record cache population events and times. Alternatively, you can manually patch the server to add 9ea132aee96c and 05cfcecb3aef (or a modified version of them) for logging.

If this theory is proved, Mercurial 3.1 or possibly 3.0.1 or 3.0.2 could contain code that restores the caches on rollback. (Generally speaking, caches aren't too expensive to compute, which is why they are merely invalidated currently.)
I confirmed that Mercurial 3.0 (and presumably all versions prior) will invalidate the caches when a transaction is aborted and rolled back. STR:

1) Install a pretxncommit hook that runs `hg heads && hg branches && exit 1` (an in-process Python equivalent is sketched after this list)
2) Attempt to commit something
3) Run `hg heads` and look at blackbox log to confirm cache is rebuilt
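For reference, a minimal in-process hook that does roughly the same thing as step 1. This is only an illustrative sketch (the module name and hgrc wiring in the comment are hypothetical), not a hook that exists on the server:

{
# cacheprobe.py - hypothetical hook module; wired up with something like:
#   [hooks]
#   pretxncommit.cacheprobe = python:/path/to/cacheprobe.py:hook
def hook(ui, repo, **kwargs):
    # Populate the branch and tags caches inside the transaction
    # (roughly what `hg branches` and `hg heads` touch on the CLI)...
    repo.branchmap()
    repo.tags()
    ui.warn('cacheprobe: rejecting the transaction to force a rollback\n')
    # ...then fail: for pretxn* hooks a truthy return value aborts the
    # transaction, rolling it back and invalidating those caches.
    return True
}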

Also, if someone with access to the Try server could run the following to test things:

1) rm .hg/cache/*
2) time hg branches
3) time hg heads

The `hg branches` will rebuild branch caches. `hg heads` will rebuild the tags cache. I expect these cache rebuilds to take a minute or two given the reported performance and observations I've made on my clone of Try.
Please note that deleting the caches on the production server will likely cause pushes to time out.
I've filed http://bz.selenic.com/show_bug.cgi?id=4255 upstream regarding the cache invalidation problem.

Also, profiling Mercurial revealed that resolving manifest content accounted for most of the CPU during tags cache generation. I was able to reduce CPU significantly by switching to generaldelta + lz4 revlogs. Still, with my 24,698-headed mirror of Try it took ~15 minutes to populate the tags cache. That's on an SSD.

I'm pretty confident this is the scaling problem. Everything is fixable. But it won't get into Mercurial until 3.1 or 3.2. Also, it's unclear whether non-Mozilla people will fix this. AFAIK Mozilla is the only entity encountering this scaling vector, so it might be up to us (me?) to fix it.
Assignee: server-ops-webops → hwine
Blocks: try-tracker
Component: WebOps: Source Control → Mercurial: hg.mozilla.org
Product: Infrastructure & Operations → Developer Services
I've been pushing for 10 minutes now, and fubar confirms I'm the one holding the lock. The interesting data is that I know how this one happened because I'm the last one to have pushed to try before that. So here is what happened, at least in this occurrence:
- I pushed normally to try.
- I pushed another changeset to try but forgot to qpush my try patch containing try syntax
- That second push thus got rejected by pretxnchangegroup.c_try_mandatory
- I then pushed again with that try patch with try syntax, and that's the push that's currently stuck.

So, it doesn't even take someone actively canceling a push :(
It's the act of a transaction rollback in the hg store that causes cache invalidation and slowness on the next push, assuming caches have been updated when hooks run (they likely would have). This is why I spent a lot of energy at the summit on a plan to add cache data to transactions. Ctrl+C or hook rejection are effectively the same thing from the perspective of the server/txn.
Which at least supports the possibility that there really is only one root cause. Interestingly, though, the push took around 45 minutes. Do we already have too many heads on try?
Whiteboard: [kanban:engops:https://kanbanize.com/ctrl_board/6/178]
Whiteboard: [kanban:engops:https://kanbanize.com/ctrl_board/6/178] → [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/677] [kanban:engops:https://kanbanize.com/ctrl_board/6/178]
Whiteboard: [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/677] [kanban:engops:https://kanbanize.com/ctrl_board/6/178] → [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/682] [kanban:engops:https://kanbanize.com/ctrl_board/6/178]
Whiteboard: [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/682] [kanban:engops:https://kanbanize.com/ctrl_board/6/178] → [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/682]
As of today, pushes to try are taking less time than they have since local disk HG rolled out 10 months ago. Push time is still a little long - 15s is the median. That will be fixed in bug 1055298. Calling this WORKSFORME.
Status: REOPENED → RESOLVED
Closed: 10 years ago10 years ago
Resolution: --- → WORKSFORME