767657 - hg try repo is broken

Reporter

Description

•

12 years ago

https://hg.mozilla.org/try/ won't load and https://tbpl.mozilla.org/?tree=Try&rev=b5a8c59ecf28 loads only headers.

Justin Wood (:Callek)

Comment 1

•

12 years ago

https://hg.mozilla.org/try/rev/b5a8c59ecf28 is unable to load, per what I'm seeing that cset was pushed very recently c.f. http://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/saurabhanandiit@gmail.com-b5a8c59ecf28/ https://bugzilla.mozilla.org/show_bug.cgi?id=763468 and I see it in buildAPI

Also it (https://hg.mozilla.org/try itself) is giving me connection reset.

This prevents viewing try results for this push, while many other try pushes are still visible (oddly) if you load https://tbpl.mozilla.org/?tree=Try alone, rather than a specific cset.

Justin Wood (:Callek)

Comment 2

•

12 years ago

FYI, I closed Try until we know the scope of the problem.

It basically makes it impossible for people to easily check their results anyway.

Whiteboard: [TreeClosure]

:Felipe Gomes (needinfo for replies!)

Comment 3

•

12 years ago

FWIW https://tbpl.mozilla.org/?tree=Try with &rev= has been like this since Jun 20 at least

Dumitru Gherman [:dumitru]

Comment 4

•

12 years ago

It seems to me that try is just miserably slow. I am able to load some pages within /try but very slow.
See bug 676420 (and possible others) when RelEng asked us to reset try.
Callek confirmed that after several attempts he was also able to load the page. 
I didn't see any errors in the web logs. Nobody could provide a hg error message to point what the issue is, except for try being super slow.

If anyone else can confirm that this is indeed the problem, we can try resetting try.

CC'ing more ppl here.

Justin Wood (:Callek)

Comment 5

•

12 years ago

And for clarity, several attempts includes 2 attempts after :dumitru was able to load try main summary page (with an ~2 minute slowness) where I was still getting the "connection reset" issue.

After that I get success with that page (albeit long wait) while *The connection to the server was reset while the page was loading.* after 15-20 seconds at https://hg.mozilla.org/try/rev/b5ab1913ee8f (and any other rev I pull out of a hat)

Leaving try closed for now since we still exceed timeout for TBPL's AJAX calls to specific revs, which many devs rely on when pushing to try. I am running a |time hg push -f| for try per :dumitru for sanity, one way or the other I'll leave my findings of that here.

Justin Wood (:Callek)

Comment 6

•

12 years ago

Push went through, now pages are loading for me just fine in <20 seconds. I'm deciding to reopen tree for now but leaving this bug open for IT/others to chime in. Can be duped around if need be at this point.


Justin@ORION /d/sources/mozilla-central
$ time hg push ssh://hg.mozilla.org/try -f
pushing to ssh://hg.mozilla.org/try
searching for changes
remote: adding changesets
remote: adding manifests
remote: adding file changes
remote: added 9 changesets with 40 changes to 27 files (+1 heads)
remote: Tree try is CLOSED! (http://tinderbox.mozilla.org/Try/status.html)
remote: But you included the magic words.  Hope you had permission!
remote: Looks like you used try syntax, going ahead with the push.
remote: If you don't get what you expected, check http://trychooser.pub.build.mozilla.org/ for help
with building your trychooser request.
remote: Thanks for helping save resources, you're the best!
remote: Trying to insert into pushlog.
remote: Please do not interrupt...
remote: Inserted into the pushlog db successfully.

real    8m2.266s
user    0m0.015s
sys     0m0.015s

Shyam Mani [:fox2mike]

Comment 7

•

12 years ago

Was there a push to try before this one that left things in a broken state? (just a wild guess here)

Michael Burns [:mburns]

Updated

•

12 years ago

Severity: critical → normal

Hal Wine [:hwine] use NI!

Comment 8

•

12 years ago

fyi - I've started a head merge in bug 767715. While the self-recovery shows this isn't the underlying problem, it may reduce some pressure on the try server.

Justin Wood (:Callek)

Comment 9

•

12 years ago

Just as a point of reference:

When I had stuff suddenly seeming to work, the queries/jobs completed in ~2 minutes

Right now I get (with manual counting) "The connection to the server was reset while the page was loading." after 36 seconds at https://hg.mozilla.org/try/rev/94bd4a5cef45)

but just wanted to point it out as ongoing, though I think *this* is just a matter of the try repo size/heads, not related to LB load; though there could be a zues timeout embedded helping to cut things out faster

Hoping Hal's run in 767715 makes a difference.

Olli Pettay [:smaug][bugs@pettay.fi]

Updated

•

12 years ago

Severity: normal → blocker

Ashish Vijayaram [:ashish]

Comment 10

•

12 years ago

Why is this made a blocker again?

Olli Pettay [:smaug][bugs@pettay.fi]

Comment 11

•

12 years ago

Sorry, I couldn't load try tbpl, but the problem ended up being in Firefox.

Severity: blocker → normal

Rafael Ávila de Espíndola (:espindola) (not reading bugmail)

Comment 12

•

12 years ago

I am seeing exactly the same problem trying to load

https://tbpl.mozilla.org/?tree=Try&rev=20e27ef3c670

or even just the patch itself:

https://hg.mozilla.org/try/rev/20e27ef3c670

Rafael Ávila de Espíndola (:espindola) (not reading bugmail)

Comment 13

•

12 years ago

Running curl eventually worked, but took 2:44 to complete.

Rafael Ávila de Espíndola (:espindola) (not reading bugmail)

Comment 14

•

12 years ago

and when it fails it says that the connection was closed without any data being received.

Ben Kero [:bkero]

Comment 15

•

12 years ago

This is a known problem caused by a bug in Mercurial.  The changeset in question likely modifies some old sections of code, and calculating the deltas for that takes longer than the 

The reason that the index page http://hg.mozilla.org/try takes a long time to load is because that offending revision is being parsed as the top commit of the summary.  If I bypass out Zeus load balancers, I can see that it correctly loads the page, albeit in a very slow time.

Back in April I attended a mercurial code sprint, and this is one of the issues that I addressed with them.  The creator of mercurial, mpm, looked into the issue and made code modifications to correct this problem.  However, the code changes are not staged to land in 2.3 since they did not pass unit tests, and he did not have time to find out why.  Please note that the delta generation code for hgweb is different from hg command-line, so clones and updates should go uninterrupted.

tl;dr: It's a mercurial problem, not our hosting.  A fix is in the works, but we're up to the mercy of mercurial as to when this gets done.

Rafael Ávila de Espíndola (:espindola) (not reading bugmail)

Comment 16

•

12 years ago

Would it be possible to cache the full patches on our servers while the problem in mercurial is worked on?

Ben Kero [:bkero]

Comment 17

•

12 years ago

First paragraph got cut off.  The changeset in question likely modifies some old sections of code, and calculating the deltas for that takes longer than the load balancers allow for a timeout.

One method for fixing this in the web interface would be to stuff dummy commits in to get this large calculation out of the latest-10, and thus wouldn't be rendered on the web page anymore.

[root@boris ~]# time curl -H "Host: hg.mozilla.org" http://hgweb4.dmz.scl3.mozilla.com/try/

<...>

<div class="page_nav">
summary |
<this is where the long wait happens>

Ben Kero [:bkero]

Comment 18

•

12 years ago

Only the web interface should exhibit a problem.  Actual cloning and usage of mercurial should be unaffected.  I do not think that mercurial has a mechanism for simply 'caching full patches' on a remote.  This is typically accomplished by creating a new head.

Additionally, the mercurial problem is not likely to be solved anytime soon, and if it is, it will not be backported to older (existing) mercurial versions.

What is the problem you are trying to solve?

Rafael Ávila de Espíndola (:espindola) (not reading bugmail)

Comment 19

•

12 years ago

I was thinking of putting a http cache in front of the hg server so that only the first run of

curl -O  http://hg.mozilla.org/try/rev/<rev>

takes a long time for any given rev.

Ed Morley [:emorley]

Comment 21

•

12 years ago

Have added a link to this bug in the Try tree status message, since this issue came up several times on IRC.

Workaround:

(In reply to Justin Wood (:Callek) from bug 768225 comment #2)
> Just to be explicit (here) a workaround is to load the try tree without the
> &rev=* and then use the arrow at the bottom repeatedly until you find your
> push.

Whiteboard: [TreeClosure] → [TreeClosure][Workaround in comment 21]

Armen [:armenzg]

Comment 22

•

12 years ago

This affects developers every day and makes their life harder every day.

Can we please attempt what espindola suggests on comment 16?

Is there anything else that can be attempted?

Have we just been degrading over time? or would a reset of the try repo help?
(It might make no sense what I am asking for; feel free to disregard)

Hal Wine [:hwine] use NI!

Comment 23

•

12 years ago

heads have been reduced (bug 767715) without significantly improving problem. That trick used to work, so something else appears to be happening now.

Depends on: 767715

Randell Jesup [:jesup] (needinfo me)

Comment 24

•

12 years ago

Another trick: load by name, not by rev:

https://tbpl.mozilla.org/?tree=Try&pusher=rjesup@wgate.com

Ed Morley [:emorley]

Updated

•

12 years ago

Depends on: 768622

(no longer active)

Comment 25

•

12 years ago

Can the offending changeset be stripped from the try repo?  While it's true that there are workarounds for this bug, it doesn't make a lot of sense for us to wait until the next time that try gets reset!

Justin Wood (:Callek)

Comment 26

•

12 years ago

(In reply to Ehsan Akhgari [:ehsan] from comment #25)
> Can the offending changeset be stripped from the try repo?  While it's true
> that there are workarounds for this bug, it doesn't make a lot of sense for
> us to wait until the next time that try gets reset!

We have decided we are resetting try tonight, about 8/9p PT

Armen [:armenzg]

Updated

•

12 years ago

Depends on: 768847

Armen [:armenzg]

Comment 27

•

12 years ago

After the reset we might be now seeing bug 768847.

Armen [:armenzg]

Comment 28

•

12 years ago

We're golden now IIUC.

Status: NEW → RESOLVED

Closed: 12 years ago

Resolution: --- → FIXED

Olli Pettay [:smaug][bugs@pettay.fi]

Updated

•

12 years ago

Blocks: 770811

Nobody; OK to take it and work on it

Updated

•

10 years ago

Component: Server Operations: Developer Services → General

Product: mozilla.org → Developer Services