Closed Bug 767657 Opened 12 years ago Closed 12 years ago

hg try repo is broken

Categories: Developer Services :: General
Type: task
Priority: Not set
Severity: normal
Tracking: (Not tracked)
Status: RESOLVED FIXED
People: (Reporter: ericz, Unassigned)
Whiteboard: [TreeClosure][Workaround in comment 21]

https://hg.mozilla.org/try/rev/b5a8c59ecf28 is unable to load. From what I'm seeing, that cset was pushed very recently, c.f. http://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/saurabhanandiit@gmail.com-b5a8c59ecf28/ and https://bugzilla.mozilla.org/show_bug.cgi?id=763468, and I see it in buildAPI.

Also, https://hg.mozilla.org/try itself is giving me a connection reset.

This prevents viewing try results for this push, while (oddly) many other try pushes are still visible if you load https://tbpl.mozilla.org/?tree=Try alone rather than a specific cset.
FYI, I closed Try until we know the scope of the problem.

It basically makes it impossible for people to easily check their results anyway.
Whiteboard: [TreeClosure]
FWIW, https://tbpl.mozilla.org/?tree=Try with &rev= has been like this since at least Jun 20.
It seems to me that try is just miserably slow. I am able to load some pages within /try, but only very slowly.
See bug 676420 (and possibly others) from when RelEng asked us to reset try.
Callek confirmed that after several attempts he was also able to load the page. 
I didn't see any errors in the web logs. Nobody could provide an hg error message pointing to what the issue is, beyond try being super slow.

If anyone else can confirm that this is indeed the problem, we can try resetting try.

CC'ing more people here.
And for clarity, "several attempts" includes 2 attempts after :dumitru was able to load the try main summary page (with ~2 minutes of slowness) where I was still getting the "connection reset" issue.

After that I get success with that page (albeit after a long wait), while https://hg.mozilla.org/try/rev/b5ab1913ee8f (and any other rev I pull out of a hat) gives *The connection to the server was reset while the page was loading.* after 15-20 seconds.

Leaving try closed for now, since we still exceed the timeout for TBPL's AJAX calls to specific revs, which many devs rely on when pushing to try. I am running |time hg push -f| against try per :dumitru for sanity; one way or the other I'll leave my findings here.
Push went through, and now pages are loading for me just fine in <20 seconds. I'm reopening the tree for now but leaving this bug open for IT/others to chime in. It can be duped around if need be at this point.


Justin@ORION /d/sources/mozilla-central
$ time hg push ssh://hg.mozilla.org/try -f
pushing to ssh://hg.mozilla.org/try
searching for changes
remote: adding changesets
remote: adding manifests
remote: adding file changes
remote: added 9 changesets with 40 changes to 27 files (+1 heads)
remote: Tree try is CLOSED! (http://tinderbox.mozilla.org/Try/status.html)
remote: But you included the magic words.  Hope you had permission!
remote: Looks like you used try syntax, going ahead with the push.
remote: If you don't get what you expected, check http://trychooser.pub.build.mozilla.org/ for help with building your trychooser request.
remote: Thanks for helping save resources, you're the best!
remote: Trying to insert into pushlog.
remote: Please do not interrupt...
remote: Inserted into the pushlog db successfully.

real    8m2.266s
user    0m0.015s
sys     0m0.015s
Was there a push to try before this one that left things in a broken state? (just a wild guess here)
Severity: critical → normal
fyi - I've started a head merge in bug 767715. While the self-recovery shows this isn't the underlying problem, it may reduce some pressure on the try server.
Just as a point of reference:

When I had stuff suddenly seeming to work, the queries/jobs completed in ~2 minutes

Right now I get (with manual counting) "The connection to the server was reset while the page was loading." after 36 seconds at https://hg.mozilla.org/try/rev/94bd4a5cef45

but I just wanted to point it out as ongoing. I think *this* is just a matter of the try repo size/heads, not related to LB load, though there could be an embedded Zeus timeout helping to cut things off faster.

Hoping Hal's run in 767715 makes a difference.
Severity: normal → blocker
Why was this made a blocker again?
Sorry, I couldn't load try tbpl, but the problem ended up being in Firefox.
Severity: blocker → normal
Running curl eventually worked, but it took 2:44 to complete.
When it fails, it says that the connection was closed without any data being received.
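For reference, a quick way to measure this from the command line (just a sketch; the rev below is one of the affected ones mentioned earlier in this bug):

$ time curl -sS -o /dev/null -w "HTTP %{http_code} after %{time_total}s\n" https://hg.mozilla.org/try/rev/b5a8c59ecf28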
This is a known problem caused by a bug in Mercurial.  The changeset in question likely modifies some old sections of code, and calculating the deltas for that takes longer than the 

The reason that the index page http://hg.mozilla.org/try takes a long time to load is that the offending revision is being parsed as the top commit of the summary.  If I bypass our Zeus load balancers, I can see that it correctly loads the page, albeit very slowly.

Back in April I attended a mercurial code sprint, and this is one of the issues that I addressed with them.  The creator of mercurial, mpm, looked into the issue and made code modifications to correct this problem.  However, the code changes are not staged to land in 2.3 since they did not pass unit tests, and he did not have time to find out why.  Please note that the delta generation code for hgweb is different from hg command-line, so clones and updates should go uninterrupted.

tl;dr: It's a mercurial problem, not our hosting.  A fix is in the works, but we're at the mercy of mercurial as to when this gets done.
Would it be possible to cache the full patches on our servers while the problem in mercurial is worked on?
First paragraph got cut off.  The changeset in question likely modifies some old sections of code, and calculating the deltas for that takes longer than the load balancers allow for a timeout.

One method for fixing this in the web interface would be to stuff dummy commits in to push this large calculation out of the latest-10, so it wouldn't be rendered on the web page anymore (rough sketch below, after the curl output).

[root@boris ~]# time curl -H "Host: hg.mozilla.org" http://hgweb4.dmz.scl3.mozilla.com/try/

<...>

<div class="page_nav">
summary |
<this is where the long wait happens>
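
For what it's worth, a rough sketch of that dummy-commit idea (purely illustrative -- the file name and commit message are made up, and a real push to try would presumably need the usual try syntax in the message):

$ echo "padding" >> dummy-padding.txt
$ hg add dummy-padding.txt
$ hg commit -m "dummy commit to rotate the expensive rev out of hgweb's latest-10"
$ hg push -f ssh://hg.mozilla.org/try

Repeating that enough times would keep the expensive changeset off the summary page, though it obviously adds noise to the pushlog.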
Only the web interface should exhibit a problem.  Actual cloning and usage of mercurial should be unaffected.  I do not think that mercurial has a mechanism for simply 'caching full patches' on a remote.  This is typically accomplished by creating a new head.

Additionally, the mercurial problem is not likely to be solved anytime soon, and if it is, it will not be backported to older (existing) mercurial versions.

What is the problem you are trying to solve?
I was thinking of putting an HTTP cache in front of the hg server so that only the first run of

curl -O  http://hg.mozilla.org/try/rev/<rev>

takes a long time for any given rev.
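
For reference, a minimal sketch of what such a cache could look like with nginx sitting in front of one of the hg web heads (entirely hypothetical -- the cache path, zone name, timeouts, and location pattern are made up; the backend hostname is the one from the curl example above):

proxy_cache_path /var/cache/nginx/hg levels=1:2 keys_zone=hgrev:10m max_size=1g inactive=24h;

server {
    listen 80;
    server_name hg.mozilla.org;

    # only cache the expensive per-revision pages
    location /try/rev/ {
        proxy_pass         http://hgweb4.dmz.scl3.mozilla.com;
        proxy_set_header   Host hg.mozilla.org;
        proxy_cache        hgrev;
        proxy_cache_valid  200 24h;
        # tolerate the slow first render so it can populate the cache
        proxy_read_timeout 600s;
    }
}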
Have added a link to this bug in the Try tree status message, since this issue came up several times on IRC.

Workaround:

(In reply to Justin Wood (:Callek) from bug 768225 comment #2)
> Just to be explicit (here) a workaround is to load the try tree without the
> &rev=* and then use the arrow at the bottom repeatedly until you find your
> push.
Whiteboard: [TreeClosure] → [TreeClosure][Workaround in comment 21]
This affects developers every day and makes their lives harder.

Can we please attempt what espindola suggests in comment 16?

Is there anything else that can be attempted?

Have we just been degrading over time, or would a reset of the try repo help?
(What I'm asking for might make no sense; feel free to disregard.)
Heads have been reduced (bug 767715) without significantly improving the problem. That trick used to work, so something else appears to be happening now.
Depends on: 767715
Depends on: 768622
Can the offending changeset be stripped from the try repo?  While it's true that there are workarounds for this bug, it doesn't make a lot of sense for us to wait until the next time that try gets reset!
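
(For reference, stripping it server-side would look roughly like the following -- a sketch only, assuming shell access to the master repo; hg strip lives in the bundled mq extension, and it writes a backup bundle under .hg/strip-backup/ in case the changesets need to be re-applied.)

$ cd /repo/hg/mozilla/try        # hypothetical server-side path
$ hg --config extensions.mq= strip b5a8c59ecf28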
(In reply to Ehsan Akhgari [:ehsan] from comment #25)
> Can the offending changeset be stripped from the try repo?  While it's true
> that there are workarounds for this bug, it doesn't make a lot of sense for
> us to wait until the next time that try gets reset!

We have decided we are resetting try tonight, around 8-9pm PT.
Depends on: 768847
After the reset we might now be seeing bug 768847.
We're golden now IIUC.
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Component: Server Operations: Developer Services → General
Product: mozilla.org → Developer Services