Closed Bug 1631348 (Opened 4 months ago, Closed 4 months ago)

Intermittent abort: error: Connection refused when pulling fxtrees

Categories

(Developer Services :: Mercurial: hg.mozilla.org, defect)


Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: apavel, Assigned: jgraham)

References

Details

Attachments

(1 file)

Since last week, we (the sheriffs) have been receiving abort: error: Connection refused when trying to pull fxtrees. We first noticed this on April 16th.
Pulling trees sometimes works if we pull each tree individually, but in some cases we receive the same error even when pulling just one tree.
Debug shows the following: https://mozilla.modular.im/_matrix/media/r0/download/mozilla.modular.im/8fa759a16f392cf4fd54f5f259b2945b1c86ea16

Pushing also returns:

remote: ssh: connect to host hg.mozilla.org port 22: Connection refused
abort: no suitable response from remote hg!

Connor, can you check what's causing the frequent connection issues? I haven't experienced them myself so far.

Flags: needinfo?(sheehan)

I was looking into a recent increase in 404s the other day, and it seems these 404 requests also started on the same day as these intermittent problems. I tracked down the "wpt manifest downloader" as the main culprit of the increased junk requests. The tool sends requests based on changeset hashes substituted into a format string. Switching the repo in that string from mozilla-central to integration/autoland makes the requests return a 200 (based on a few manually modified queries). The requests come in bursts, and my current theory is that these bursts are causing the intermittent connection issues, likely due to too many processes trying to read from a SQLite DB at the same time.
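For illustration, here is a minimal sketch of the request pattern described above, assuming a hypothetical URL template and helper name; it is not the downloader's actual code, just the shape of the problem: one probe per changeset, fired in a burst, with every hash that only exists on integration/autoland coming back as a 404 against mozilla-central.

# Hypothetical sketch only: the URL template, default repo, and function name
# are illustrative assumptions, not the wpt manifest downloader's real code.
import requests

URL_TEMPLATE = "https://hg.mozilla.org/{repo}/raw-file/{rev}/testing/web-platform/meta/MANIFEST.json"

def probe_for_manifest(changesets, repo="mozilla-central"):
    # Probing the server once per local changeset produces a burst of
    # requests; any hash that only exists on integration/autoland returns
    # a 404 when queried against mozilla-central.
    for rev in changesets:
        resp = requests.get(URL_TEMPLATE.format(repo=repo, rev=rev))
        if resp.status_code == 200:
            return resp.content
    return None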

James, is it possible to change that string to autoland instead of central? That would be a simple fix to this problem from my end, but I'm not sure whether that would break any assumptions of the downloader.

Flags: needinfo?(sheehan) → needinfo?(james)

I think I can change that to autoland (though I'll need to test to be sure), but I'm curious about what's going on here. The code in question hasn't changed for more than a year, so it's somewhat surprising that it's suddenly causing problems.

Instead of going through every commit, look for a "base_ref" which is
actually a base revision, i.e. one that is on a remote head. Then look
for this commit on inbound or central. This should cover all cases except
when a branch is based on something that was fetched from try in such a
way as to create a remote ref. However, that is pretty rare, and we fall
back to useful behaviour in that case.
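As a rough sketch of that strategy, assuming standard Mercurial revsets and hypothetical helper names (this is not the landed patch), the downloader could resolve a single public base revision locally and only query that one revision against the upstream repos:

# Illustrative sketch only: helper names and the repo list are assumptions,
# not the landed patch.
import subprocess

def base_public_rev():
    # Last public ancestor of the working-directory parent, i.e. a revision
    # that should exist on a remote head rather than only locally.
    return subprocess.check_output(
        ["hg", "log", "-r", "last(ancestors(.) and public())", "-T", "{node}"],
        text=True,
    ).strip()

def candidate_revs():
    # Query only the single base revision against inbound and central,
    # instead of probing the server for every local commit.
    rev = base_public_rev()
    return [("integration/mozilla-inbound", rev), ("mozilla-central", rev)]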

Assignee: nobody → james
Status: NEW → ASSIGNED
Flags: needinfo?(james)

Not sure what changed here, if anything did, but my shift did not hit this issue at all in the last 12 hours. Seems fixed.

I have a script that pulls from a few hg.m.o repositories, and I haven't been able to run it for the past 12 hours. Every time, one of the pulls fails randomly with abort: error: Connection refused.

Looking at Pontoon, I see a few abort: no suitable response from remote hg errors in the log, but not as many as I'd expect.

Could we be intermittently tripping a DOS detector that is blocking by IP? I thought I saw a comment on Matrix/Slack in the last week about that, but I can't find it again.

(In reply to Nick Thomas [:nthomas] (UTC+13) from comment #7)

Could we be intermittently tripping a DOS detector that is blocking by IP

I doubt that's the case here. It's a script that pulls from a limited number of repositories, and it's run manually once or twice a day. I tried again this morning and I'm still getting the same random errors.

(In reply to Francesco Lodolo [:flod] from comment #6)

I have a script that pulls from a few hg.m.o repositories, and I haven't been able to run it for the past 12 hours. Every time, one of the pulls fails randomly with abort: error: Connection refused.

Looking at Pontoon, I see a few abort: no suitable response from remote hg in the log, but not as many as I'd expect.

Yesterday there were intermittent network issues at our datacenters, but Fx CI had issues on Saturday as well. We're going to need people to start running traceroute or similar tools to see where the issues are; I don't believe it's typically the hgmo service.

Mine seems to get stuck around Cogent.

One succeeded, although slowly:

 4  172.17.162.97 (172.17.162.97)  13.621 ms  13.040 ms  12.694 ms
 5  172.17.92.1 (172.17.92.1)  12.545 ms *  14.958 ms
 6  ae63.edge7.frankfurt1.level3.net (195.16.162.245)  27.019 ms  30.608 ms  31.836 ms
 7  * * *
 8  cogent-level3-200g.frankfurt1.level3.net (4.68.111.178)  61.812 ms  32.502 ms  151.939 ms
 9  be2845.ccr41.fra03.atlas.cogentco.com (154.54.56.189)  29.087 ms  34.155 ms
    be2846.ccr42.fra03.atlas.cogentco.com (154.54.37.29)  152.376 ms
10  be2800.ccr42.par01.atlas.cogentco.com (154.54.58.238)  128.090 ms  190.122 ms
    be2813.ccr41.ams03.atlas.cogentco.com (130.117.0.121)  34.092 ms
11  be12194.ccr41.lon13.atlas.cogentco.com (154.54.56.93)  142.985 ms
    be3628.ccr42.jfk02.atlas.cogentco.com (154.54.27.169)  147.403 ms *
12  * be2916.ccr22.alb02.atlas.cogentco.com (154.54.41.61)  129.877 ms  171.137 ms
13  be2879.ccr22.cle04.atlas.cogentco.com (154.54.29.173)  138.405 ms *
    be3599.ccr21.alb02.atlas.cogentco.com (66.28.4.237)  128.857 ms
14  * be2718.ccr42.ord01.atlas.cogentco.com (154.54.7.129)  207.439 ms *
15  * be2832.ccr22.mci01.atlas.cogentco.com (154.54.44.169)  202.935 ms  145.103 ms
16  * be2831.ccr21.mci01.atlas.cogentco.com (154.54.42.165)  245.105 ms *
17  be3038.ccr32.slc01.atlas.cogentco.com (154.54.42.97)  187.079 ms  172.671 ms
    be3035.ccr21.den01.atlas.cogentco.com (154.54.5.89)  149.340 ms
18  154.54.89.98 (154.54.89.98)  228.417 ms  179.040 ms  183.773 ms
19  be2041.rcr21.smf01.atlas.cogentco.com (154.54.6.142)  211.130 ms
    te0-0-2-1.nr11.b015947-1.smf01.atlas.cogentco.com (154.24.6.74)  204.021 ms
    te0-0-2-0.nr11.b015947-1.smf01.atlas.cogentco.com (154.24.47.50)  264.839 ms
20  * * 38.104.143.138 (38.104.143.138)  190.267 ms
21  173.225.175.142 (173.225.175.142)  263.443 ms
    38.104.143.138 (38.104.143.138)  181.838 ms  177.093 ms
22  65.74.145.153 (65.74.145.153)  269.396 ms  204.645 ms
    173.225.175.142 (173.225.175.142)  204.554 ms
23  65.74.145.153 (65.74.145.153)  200.498 ms  202.962 ms
    63.245.208.22 (63.245.208.22)  183.508 ms
24  hg.public.mdc1.mozilla.com (63.245.208.203)  190.804 ms
    ethernet1-13_4094.fw1.untrust.mdc1.mozilla.net (63.245.208.18)  245.659 ms
    hg.public.mdc1.mozilla.com (63.245.208.203)  180.482 ms

The other got stuck (still going):

 4  172.17.162.97 (172.17.162.97)  16.214 ms  14.565 ms  15.552 ms
 5  172.17.92.1 (172.17.92.1)  12.300 ms *  14.498 ms
 6  ae63.edge7.frankfurt1.level3.net (195.16.162.245)  31.085 ms  29.926 ms  29.190 ms
 7  * * *
 8  cogent-level3-200g.frankfurt1.level3.net (4.68.111.178)  30.970 ms  29.294 ms  34.351 ms
 9  be2845.ccr41.fra03.atlas.cogentco.com (154.54.56.189)  31.381 ms
    be2846.ccr42.fra03.atlas.cogentco.com (154.54.37.29)  29.498 ms
    be2845.ccr41.fra03.atlas.cogentco.com (154.54.56.189)  31.981 ms
10  * be2813.ccr41.ams03.atlas.cogentco.com (130.117.0.121)  33.486 ms
    be2800.ccr42.par01.atlas.cogentco.com (154.54.58.238)  167.855 ms
11  be3628.ccr42.jfk02.atlas.cogentco.com (154.54.27.169)  201.401 ms
    be2182.ccr21.lpl01.atlas.cogentco.com (154.54.77.246)  203.831 ms
    be3628.ccr42.jfk02.atlas.cogentco.com (154.54.27.169)  174.005 ms
12  be2916.ccr22.alb02.atlas.cogentco.com (154.54.41.61)  201.620 ms  204.409 ms
    be2099.ccr31.bos01.atlas.cogentco.com (154.54.82.34)  191.988 ms
13  be3599.ccr21.alb02.atlas.cogentco.com (66.28.4.237)  190.240 ms
    be2879.ccr22.cle04.atlas.cogentco.com (154.54.29.173)  201.351 ms  203.783 ms
14  be2718.ccr42.ord01.atlas.cogentco.com (154.54.7.129)  204.481 ms *  217.801 ms
15  be2832.ccr22.mci01.atlas.cogentco.com (154.54.44.169)  205.750 ms *
    be2717.ccr41.ord01.atlas.cogentco.com (154.54.6.221)  184.246 ms
16  be3036.ccr22.den01.atlas.cogentco.com (154.54.31.89)  202.560 ms  172.066 ms *
17  * * *

As an update, we're getting abort: error: Connection refused again this shift, intermittently, when pulling fxtrees.

Pushed by james@hoppipolla.co.uk:
https://hg.mozilla.org/integration/autoland/rev/c8a063cf794b
Be more efficient looking for a wpt manifest to download, r=ahal
Status: ASSIGNED → RESOLVED
Closed: 4 months ago
Resolution: --- → FIXED
Regressions: 1643192