Closed Bug 568005 Opened 14 years ago Closed 14 years ago

connection to mail, mpt-vpn, build-vpn, and others very slow

Categories

(mozilla.org Graveyard :: Server Operations, task)

x86
macOS
task
Not set
critical

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: bhearsum, Assigned: dmoore)

Details

Has caused some build failures, too.

It's been happening for the last 2 hours at least.
Derek and mrz have been looking into this.
Assignee: server-ops → dmoore
Oddly mail is coming to my phone (wtf?) much quicker than to my desktop here in Toronto. Dunno if that helps with diagnosis.
Some additional diagnosis:

 - toronto office to speedtest in SJ (smugmug) is fine
 - ssh direct from office to people.mozilla.org is slow
 - ssh through off.net to people.mozilla.org is fine

Hope that helps!
It helps. We're working through problems with our upstream providers,
and it's very dependent on where you're connecting from.
Some traceroute, as requested by shaver:

direct from toronto office to people.mozilla.org:



mtr from off.net to people.mozilla.org:

 Host                                Loss%   Snt   Last   Avg  Best  Wrst StDev
 1. ca-gw1.ca.mozilla.com             0.0%    93    2.1   2.5   1.6  26.0   2.6
 2. 66.207.206.177                    0.0%    93    3.7   4.2   3.2  35.2   3.4
 3. 76-9-207-1.beanfield.net          0.0%    92    3.3   9.1   2.7 207.7  26.3
 4. 72.15.49.2                        0.0%    92   16.2  17.9  15.6  96.2   9.2
 5. 72.15.49.58                       0.0%    92   15.6  15.9  15.4  17.8   0.5
 6. nyiix.layer42.net                 0.0%    92   16.4  16.6  15.7  29.6   2.1
 7. xe3-2.core1.mpt.layer42.net       0.0%    92  100.1 104.5  99.4 148.8  11.0
 8. 216-129-125-182.cust.layer42.net 39.1%    92  112.6 131.6 101.5 359.9  56.8
 9. v9.core1.sj.mozilla.com          28.3%    92   95.1  95.7  94.8 100.1   0.9
10. ???

traceroute to people.mozilla.org (63.245.208.169), 30 hops max, 40 byte packets
 1  161.136.196.67.static.heavycomputing.ca (67.196.136.161)  1.055 ms  0.976 ms  1.033 ms
 2  gw-he.torontointernetxchange.net (198.32.245.112)  9.848 ms  9.820 ms  10.014 ms
 3  10gigabitethernet1-2.core1.nyc5.he.net (72.52.92.165)  17.317 ms  17.591 ms  17.547 ms
 4  10gigabitethernet1-4.core1.nyc1.he.net (72.52.92.153)  17.499 ms  17.476 ms  17.426 ms
 5  10gigabitethernet1-1.core1.nyc4.he.net (72.52.92.45)  28.246 ms  28.315 ms  28.394 ms
 6  10gigabitethernet5-3.core1.lax1.he.net (72.52.92.226)  79.816 ms  81.964 ms  78.760 ms
 7  10gigabitethernet1-3.core1.lax2.he.net (72.52.92.122)  78.719 ms  78.688 ms  78.645 ms
 8  mozilla.com.any2ix.coresite.com (206.223.143.109)  86.136 ms  86.167 ms  86.136 ms
 9  v8.core1.sj.mozilla.com (63.245.208.49)  86.081 ms  86.048 ms  86.086 ms
10  * * *
11  * * *
12  * * *
13  * * *
14  * * *
15  * * *
16  * * *
17  * * *
18  * * *
19  * * *
20  * * *
21  * * *
22  * * *
23  * * *
24  * * *
25  * * *
26  * * *
27  * * *
28  * * *
29  * * *
30  * * *
Ugh, mispasted above: the mtr is from the toronto office to people, the traceroute is from off.net to people (don't have mtr on that box)

Looks like the problem is at layer42.net, which isn't on the route from off.net.
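For anyone who wants to repeat this kind of check, here is a minimal sketch of the reasoning: parse the mtr rows and find the first hop from which loss persists all the way to the last responding hop. The rows are copied from the Toronto office mtr above. The helper names (`parse_mtr`, `first_sustained_loss`) and the 5% threshold are illustrative assumptions, not part of any tool used in this bug.

```python
import re

# Rows copied from the mtr report pasted in this bug
# (Toronto office to people.mozilla.org).
MTR_REPORT = """\
 1. ca-gw1.ca.mozilla.com             0.0%    93    2.1   2.5   1.6  26.0   2.6
 2. 66.207.206.177                    0.0%    93    3.7   4.2   3.2  35.2   3.4
 3. 76-9-207-1.beanfield.net          0.0%    92    3.3   9.1   2.7 207.7  26.3
 4. 72.15.49.2                        0.0%    92   16.2  17.9  15.6  96.2   9.2
 5. 72.15.49.58                       0.0%    92   15.6  15.9  15.4  17.8   0.5
 6. nyiix.layer42.net                 0.0%    92   16.4  16.6  15.7  29.6   2.1
 7. xe3-2.core1.mpt.layer42.net       0.0%    92  100.1 104.5  99.4 148.8  11.0
 8. 216-129-125-182.cust.layer42.net 39.1%    92  112.6 131.6 101.5 359.9  56.8
 9. v9.core1.sj.mozilla.com          28.3%    92   95.1  95.7  94.8 100.1   0.9
10. ???
"""

# hop number, host, Loss%, Snt, Last, Avg, Best, Wrst, StDev
ROW = re.compile(
    r"^\s*(\d+)\.\s+(\S+)\s+([\d.]+)%\s+(\d+)"
    r"\s+([\d.]+)\s+([\d.]+)\s+([\d.]+)\s+([\d.]+)\s+([\d.]+)\s*$"
)


def parse_mtr(text):
    """Parse mtr rows into dicts; rows without statistics (e.g. '???') are skipped."""
    hops = []
    for line in text.splitlines():
        m = ROW.match(line)
        if m:
            hops.append({
                "hop": int(m.group(1)),
                "host": m.group(2),
                "loss": float(m.group(3)),
                "avg": float(m.group(6)),
            })
    return hops


def first_sustained_loss(hops, threshold=5.0):
    """Return the first hop from which every later responding hop also
    reports loss >= threshold (in percent), or None if there is none.

    Loss that appears at one hop but disappears further along the path
    resets the candidate, since it does not carry through to the destination.
    """
    candidate = None
    for hop in hops:
        if hop["loss"] >= threshold:
            if candidate is None:
                candidate = hop
        else:
            candidate = None
    return candidate


if __name__ == "__main__":
    culprit = first_sustained_loss(parse_mtr(MTR_REPORT))
    if culprit:
        print(f"loss starts at hop {culprit['hop']}: {culprit['host']}")
```

On the rows above this picks out the layer42.net customer link at hop 8, which is consistent with the observation that the off.net path (via he.net) avoids the problem.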
Here's what I'm seeing from home, too. All mozilla-related infra is painfully slow, and I'm seeing timeouts to mail (since 7am-ish Eastern)

                             My traceroute  [v0.72]
cuttlefish (0.0.0.0)                                   Tue May 25 11:18:32 2010
Resolver: Received error response 2. (server failure)
                                       Packets               Pings
 Host                                Loss%   Snt   Last   Avg  Best  Wrst StDev
 1. gw-home.deadsquid.com             0.0%    29    0.5   0.6   0.4   4.2   0.7
 2. xplr-142-46-160-1.xplornet.com    0.0%    29   71.7  64.7  19.5 152.8  39.6
 3. 174.35.131.81                     0.0%    29   41.7  55.2  24.2 164.7  30.6
 4. 142.46.4.25                       0.0%    29   35.8  62.4  22.2 190.1  40.3
 5. 142.46.128.9                      0.0%    29   56.6  60.7  22.2 258.7  55.5
 6. gw-wbsconnect.torontointernetxch  0.0%    29   68.0  89.5  38.6 258.7  60.9
 7. te-1-4.bmf1.sjc1.gt-t.net         3.6%    29  236.1 159.7 113.4 316.8  51.6
 8. 98.124.130.254                    3.6%    28  185.3 158.1 123.3 292.9  41.7
 9. xe3-2.core1.mpt.layer42.net       3.6%    28  268.6 202.6 115.6 520.6 111.4
10. 216-129-125-182.cust.layer42.net 39.3%    28  337.6 171.7 104.8 379.5  86.8
11. v9.core1.sj.mozilla.com          46.4%    28  281.3 167.2 107.9 305.0  68.8
12. ???
I'm not getting packet loss, but couldn't get to bugzilla a second ago. Seeing a jump in latency between torontointernetxchange.net and te-1-4.bmf1.sjc1.gt-t.net like kev:

 Host                                           Loss%   Snt   Last   Avg  Best  Wrst StDev
 1. 192.168.1.1                                  0.0%    44    0.8   0.9   0.6   2.0   0.2
 2. 10.76.64.1                                   0.0%    44    7.8   9.8   7.1  19.5   2.9
 3. d226-8-177.home.cgocable.net                 0.0%    44   15.0  15.4  13.0  33.7   3.1
 4. 113-0-226-24.cgocable.net                    0.0%    44   21.5  22.2  19.0  33.1   3.2
 5. gw-wbsconnect.torontointernetxchange.net     0.0%    44   38.1  44.4  32.8 206.3  32.0
 6. te-1-4.bmf1.sjc1.gt-t.net                    0.0%    43  112.4 113.4 110.5 126.7   3.3
 7. 98.124.130.254                               0.0%    43  119.2 120.8 118.8 128.8   1.7
 8. ge2-24.core2.mpt.layer42.net                 0.0%    43  128.7 125.5 119.5 162.6  10.4
 9. 216-129-125-186.cust.layer42.net             0.0%    43  119.9 143.8 117.7 336.7  56.6
10. v8.core1.sj.mozilla.com                      0.0%    43  121.3 120.3 118.2 126.2   1.8
11. dyna-bugzilla.acelb.sj.mozilla.com           0.0%    43  120.5 121.2 118.5 137.2   3.1
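A rough way to confirm where the latency jump sits is to compare the Avg column between consecutive hops. The sketch below hardcodes the (host, Avg) pairs from the mtr output above; `largest_latency_jump` is a hypothetical helper name. A jump like this can simply reflect a long-haul link, so it is not proof of a fault by itself.

```python
# (host, average RTT in ms) pairs copied from the Avg column of the
# mtr output above.
HOPS = [
    ("192.168.1.1", 0.9),
    ("10.76.64.1", 9.8),
    ("d226-8-177.home.cgocable.net", 15.4),
    ("113-0-226-24.cgocable.net", 22.2),
    ("gw-wbsconnect.torontointernetxchange.net", 44.4),
    ("te-1-4.bmf1.sjc1.gt-t.net", 113.4),
    ("98.124.130.254", 120.8),
    ("ge2-24.core2.mpt.layer42.net", 125.5),
    ("216-129-125-186.cust.layer42.net", 143.8),
    ("v8.core1.sj.mozilla.com", 120.3),
    ("dyna-bugzilla.acelb.sj.mozilla.com", 121.2),
]


def largest_latency_jump(hops):
    """Return (previous_host, host, increase_ms) for the biggest rise in
    average RTT between two consecutive hops, or None for fewer than two hops."""
    best = None
    for (prev_host, prev_avg), (host, avg) in zip(hops, hops[1:]):
        delta = avg - prev_avg
        if best is None or delta > best[2]:
            best = (prev_host, host, delta)
    return best


if __name__ == "__main__":
    prev_host, host, delta = largest_latency_jump(HOPS)
    print(f"biggest jump: {prev_host} -> {host} (+{delta:.1f} ms)")
```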
Seems much better now, fwiw:

 Host                                 Loss%   Snt   Last   Avg  Best  Wrst StDev
 1. ca-gw1.ca.mozilla.com              0.0%    33   17.8   6.2   1.9  22.1   5.8
 2. 66.207.206.177                     0.0%    33   22.5  11.5   3.1  75.3  14.6
 3. 76-9-207-1.beanfield.net           0.0%    33    3.6  10.3   3.1  89.5  15.7
 4. 72.15.49.2                         0.0%    32   18.2  24.4  15.7 171.9  27.4
 5. 72.15.49.58                        0.0%    32   70.2  24.5  15.5  84.7  15.6
 6. nyiix.layer42.net                  0.0%    32   17.1  21.7  15.6  73.4  11.4
 7. ge2-24.core2.mpt.layer42.net       0.0%    32  107.3 106.5  99.6 125.6   7.9
 8. 216-129-125-186.cust.layer42.net   0.0%    32   96.5 126.0  93.0 293.1  49.6
 9. v8.core1.sj.mozilla.com            0.0%    32  101.3 101.8  93.0 142.8  11.9
10. ???
We've been taking aggressive steps to work around the problematic providers throughout the night, but we need to maintain a minimum number of connections or we run into congestion issues. That balancing act is still underway, but we're making progress.
We believe this is sorted. A combination of provider outages (at Level3 and Mzima) and equipment capacity contributed to this in many interesting, interleaving ways.

Our datacenter network was reconfigured to facilitate better debugging, and we're in the process of returning it to full production readiness. The rest of the recovery process should be transparent to end users, though.
Status: NEW → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED
Product: mozilla.org → mozilla.org Graveyard