Closed
Bug 993766
Opened 11 years ago
Closed 11 years ago
github.com unreachable from some hosts
Categories
(Infrastructure & Operations Graveyard :: NetOps: DC Carrier, task)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: adam, Unassigned)
References
Details
Attachments
(1 file)
854 bytes,
text/plain
We have noticed that certain hosts within SCL3 are unable to reach github.com. We are working with our upstream providers and github.com to resolve this.
Comment 1•11 years ago
We advertise 63.245.214.0/23 from our SCL3 data center.
Most hosts within that network block can get to github.com,
github.com has address 192.30.252.129.
github's network block is 192.30.252.0/22.
Hosts in the following two subnets of 63.245.214.0/23 cannot
reach any IP address in github's /22:
63.245.214.124/30
63.245.214.128/26
Those are the NAT pools for fw1.releng.scl3 and fw1.scl3, respectively.
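For anyone double-checking the subnet math, here is a minimal Python sketch (illustrative only; the sample addresses are examples, not a list of affected hosts) confirming which sources fall inside those NAT pools and that github's address really sits inside its /22:

import ipaddress

OUR_BLOCK = ipaddress.ip_network("63.245.214.0/23")
AFFECTED_POOLS = [
    ipaddress.ip_network("63.245.214.124/30"),  # NAT pool for fw1.releng.scl3
    ipaddress.ip_network("63.245.214.128/26"),  # NAT pool for fw1.scl3
]
GITHUB_BLOCK = ipaddress.ip_network("192.30.252.0/22")

def in_affected_pool(src: str) -> bool:
    addr = ipaddress.ip_address(src)
    return addr in OUR_BLOCK and any(addr in pool for pool in AFFECTED_POOLS)

print(ipaddress.ip_address("192.30.252.129") in GITHUB_BLOCK)  # True
print(in_affected_pool("63.245.214.130"))  # True: inside the /26 NAT pool
print(in_affected_pool("63.245.214.54"))   # False: in the /23 but not in a pool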
Comment 2•11 years ago
The announcement of our /23 network block looks fine.
We contacted github and they helped test by tracerouting from their infrastructure
back to our affected subnets. The traceroutes look correct.
They even tried to modify their routing policy so that they send packets
back to us via a different path, but that did not help.
The problem appears to be the path that the packets take going TOWARDS github.com.
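One way to confirm that is to capture the forward path from both an affected and an unaffected host and diff the hop lists. A rough sketch (assumes traceroute is installed; run it on each host and compare the output):

import subprocess

def forward_hops(dest: str) -> list:
    """Run traceroute to dest and return the hop lines (header stripped)."""
    out = subprocess.run(
        ["traceroute", "-n", dest],
        capture_output=True, text=True, check=False,
    ).stdout
    return [line.strip() for line in out.splitlines()[1:]]

# 192.30.252.128 is inside github's 192.30.252.0/22 block.
for hop in forward_hops("192.30.252.128"):
    print(hop)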
Comment 3•11 years ago
Here is a traceroute from an affected IP address:
[dcurado@admin1a.private.scl3 ~]$ traceroute 192.30.252.128
traceroute to 192.30.252.128 (192.30.252.128), 30 hops max, 60 byte packets
1 fw1.private.scl3.mozilla.net (10.22.75.1) 0.555 ms 0.533 ms 0.515 ms
2 63.245.214.53 (63.245.214.53) 2.481 ms 2.488 ms 2.411 ms
3 63.245.214.46 (63.245.214.46) 0.832 ms 0.844 ms 0.839 ms
4 xe-1-2-0.border1.sjc2.mozilla.net (63.245.219.161) 1.427 ms 1.427 ms 1.413 ms
5 xe-5-0-0.mpr4.sjc7.us.above.net (64.125.170.37) 1.396 ms 1.383 ms 1.361 ms
6 ae3.mpr3.sjc7.us.above.net (64.125.27.85) 1.353 ms 1.461 ms 1.390 ms
7 above-level3.sjc7.us.above.net (64.125.13.242) 1.459 ms 1.452 ms 1.433 ms
8 vlan80.csw3.SanJose1.Level3.net (4.69.152.190) 73.617 ms vlan70.csw2.SanJose1.Level3.net (4.69.152.126) 73.323 ms vlan80.csw3.SanJose1.Level3.net (4.69.152.190) 73.506 ms
9 ae-91-91.ebr1.SanJose1.Level3.net (4.69.153.13) 73.422 ms ae-61-61.ebr1.SanJose1.Level3.net (4.69.153.1) 75.220 ms ae-91-91.ebr1.SanJose1.Level3.net (4.69.153.13) 73.390 ms
10 ae-2-2.ebr2.NewYork1.Level3.net (4.69.135.186) 73.219 ms 73.399 ms 73.737 ms
11 ae-46-46.ebr2.NewYork2.Level3.net (4.69.201.30) 73.680 ms ae-48-48.ebr2.NewYork2.Level3.net (4.69.201.38) 73.512 ms ae-45-45.ebr2.NewYork2.Level3.net (4.69.141.22) 73.454 ms
12 ae-1-100.ebr1.NewYork2.Level3.net (4.69.135.253) 73.345 ms 73.198 ms 73.457 ms
13 ae-40-40.ebr2.Washington1.Level3.net (4.69.201.93) 72.864 ms ae-37-37.ebr2.Washington1.Level3.net (4.69.132.89) 73.159 ms ae-39-39.ebr2.Washington1.Level3.net (4.69.201.89) 73.346 ms
14 ae-62-62.csw1.Washington1.Level3.net (4.69.134.146) 73.690 ms ae-82-82.csw3.Washington1.Level3.net (4.69.134.154) 72.990 ms ae-72-72.csw2.Washington1.Level3.net (4.69.134.150) 73.053 ms
15 ae-4-90.edge3.Washington4.Level3.net (4.69.149.210) 73.976 ms ae-2-70.edge3.Washington4.Level3.net (4.69.149.82) 73.563 ms 73.614 ms
16 * * *
17 * * *
18 * * *
19 * * *
20 * * *
Notice hop 7 to hop 8. From an affected host, Above.net hands the packet
to Level3's router, 4.69.152.190.
The same traceroute (to github.com) from an unaffected host within the
same /23 netblock:
dcurado@fw1.scl3.mozilla.net> traceroute 192.30.252.128
traceroute to 192.30.252.128 (192.30.252.128), 30 hops max, 40 byte packets
1 63.245.214.53 (63.245.214.53) 6.280 ms 6.135 ms 4.398 ms
2 63.245.214.46 (63.245.214.46) 3.526 ms 3.361 ms 2.528 ms
3 63.245.219.161 (63.245.219.161) 3.657 ms 3.621 ms 2.987 ms
4 64.125.170.37 (64.125.170.37) 3.904 ms 4.718 ms 3.584 ms
5 64.125.27.85 (64.125.27.85) 3.417 ms 4.318 ms 3.569 ms
6 64.125.13.242 (64.125.13.242) 3.634 ms 3.801 ms 3.020 ms
7 4.69.152.126 (4.69.152.126) 75.401 ms 4.69.152.254 (4.69.152.254) 75.520 ms 4.69.152.62 (4.69.152.62) 75.385 ms
    MPLS Label=1592 CoS=0 TTL=1 S=1
8 4.69.153.13 (4.69.153.13) 75.201 ms 75.263 ms 4.69.153.9 (4.69.153.9) 75.973 ms
    MPLS Label=1880 CoS=0 TTL=1 S=1
9 4.69.135.186 (4.69.135.186) 76.844 ms 75.551 ms 75.712 ms
    MPLS Label=2001 CoS=0 TTL=1 S=1
10 4.69.201.30 (4.69.201.30) 83.146 ms 4.69.201.34 (4.69.201.34) 74.897 ms 4.69.141.22 (4.69.141.22) 76.787 ms
    MPLS Label=1572 CoS=0 TTL=1 S=1
11 4.69.135.253 (4.69.135.253) 76.584 ms 75.676 ms 76.005 ms
    MPLS Label=1987 CoS=0 TTL=1 S=1
12 4.69.201.85 (4.69.201.85) 75.179 ms 4.69.132.89 (4.69.132.89) 74.710 ms 4.69.201.85 (4.69.201.85) 75.950 ms
    MPLS Label=1428 CoS=0 TTL=1 S=1
13 4.69.134.158 (4.69.134.158) 75.174 ms 4.69.134.150 (4.69.134.150) 75.504 ms 4.69.134.154 (4.69.134.154) 75.336 ms
    MPLS Label=1965 CoS=0 TTL=1 S=1
14 4.69.149.18 (4.69.149.18) 75.611 ms 4.69.149.146 (4.69.149.146) 75.936 ms 4.69.149.18 (4.69.149.18) 75.337 ms
15 * * *
16 * * *
Here above.net is handing off the packet to Level3's router 4.69.152.126.
While it looks like this traceroute dies, this unaffected host can
ping github.com.
These traceroutes from affected and unaffected hosts are consistent.
This looks like a problem with above.net's router, like a really
broken route cache entry, or Level3 is announcing something
unexpected to above.net regarding these netblocks.
Notice that the two paths through Level3's network are quite different,
with the correct path taking an MPLS route.
Next step is to talk to above.net, in order to find out what their router
(the router at hop #6) is thinking.
Ticket opened with AboveNet/Zayo: 432235
Comment 4•11 years ago
This is the info we got from github when they tried to ping/traceroute back towards us:
All of the below is from github (who was very very helpful on this!)
jssjr@github-lb3a-cp1-prd:~$ ping 63.245.214.5
PING 63.245.214.5 (63.245.214.5) 56(84) bytes of data.
64 bytes from 63.245.214.5: icmp_req=1 ttl=45 time=73.9 ms
64 bytes from 63.245.214.5: icmp_req=2 ttl=45 time=74.3 ms
mtr --report --report-wide --report-cycles 50 63.245.214.125
HOST: github-lb3c-cp1-prd.iad.github.net Loss% Snt Last Avg Best Wrst StDev
<snipped internal network>
6.|-- xe-0-2-0-13.r04.asbnva02.us.bb.gin.ntt.net 0.0% 50 1.5 1.7 1.5 4.3 0.4
7.|-- xe-2-0-1.er4.iad10.us.above.net 0.0% 50 3.9 4.8 1.2 30.0 4.9
8.|-- xe-1-0-1.er2.iad10.us.above.net 0.0% 50 2.2 2.6 1.9 15.8 2.2
9.|-- ae2.cr2.dca2.us.above.net 2.0% 50 2.5 4.5 2.5 26.5 5.4
10.|-- ae6.cr2.iah1.us.above.net 0.0% 50 35.1 38.3 34.2 72.0 6.7
11.|-- ae2.cr2.lax112.us.above.net 0.0% 50 67.3 66.6 61.4 85.8 5.4
12.|-- ae1.cr2.sjc2.us.above.net 0.0% 50 73.0 75.9 72.2 93.4 6.3
13.|-- xe-5-2-0.mpr4.sjc7.us.above.net 0.0% 50 72.6 75.0 72.2 129.2 9.5
14.|-- 64.125.170.38.t00539-06.above.net 0.0% 50 72.9 75.4 72.5 121.8 8.4
15.|-- xe-0-0-1.border2.scl3.mozilla.net 0.0% 50 73.5 74.4 72.8 95.5 3.7
16.|-- v-1032.core1.releng.scl3.mozilla.net 0.0% 50 78.2 74.5 73.3 81.6 1.3
17.|-- ??? 100.0 50 0.0 0.0 0.0 0.0 0.0
jssjr@github-lb3c-cp1-prd:~$ ping -c4 63.245.214.125
PING 63.245.214.125 (63.245.214.125) 56(84) bytes of data.
From 63.245.214.82 icmp_seq=1 Destination Net Unreachable
From 63.245.214.82 icmp_seq=2 Destination Net Unreachable
From 63.245.214.82 icmp_seq=3 Destination Net Unreachable
From 63.245.214.82 icmp_seq=4 Destination Net Unreachable
--- 63.245.214.125 ping statistics ---
4 packets transmitted, 0 received, +4 errors, 100% packet loss, time 3004ms
I do have valid routes to that netblock through:
63.245.214.0/23 *[BGP/170] 9w0d 12:07:50, MED 0, localpref 100
AS path: 2914 6461 36856 I, validation-state: unverified
> to 129.250.197.73 via xe-1/0/1.0
And I'm able to ping .5 in the same /24 subnet as the affected block.
Comment 5•11 years ago
I worked with Zayo on this problem.
As noted in the traceroutes above, the router with the IP address of 64.125.13.242 is the
one routing packets from 63.245.214.x differently, depending on which subnet the packet
comes from.
64.125.13.242 is from Zayo address space, so I assumed that is a Zayo router.
It's not. It's the other side of the /30 that Zayo and Level3 use on their
peering session. They just happen to use Zayo address space for that.
Level3 uses a great deal of aggregated ethernet bundles. To get load balancing
to work correctly on AE bundles, all kinds of hashing schemes are used.
So a router choosing one path over another based on more than just dest IP is
not outside the realm of possibility.
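As a toy illustration of that point (this is not Level3's actual algorithm, just a sketch of why a hash over the full 5-tuple can pin different source subnets to different AE bundle members, and therefore different downstream paths):

import hashlib

AE_MEMBERS = ["member-0", "member-1", "member-2", "member-3"]

def pick_member(src_ip, dst_ip, proto, sport, dport):
    """Pick a bundle member from a hash of the 5-tuple, not just the destination."""
    key = "|".join(map(str, (src_ip, dst_ip, proto, sport, dport))).encode()
    digest = int.from_bytes(hashlib.sha256(key).digest()[:4], "big")
    return AE_MEMBERS[digest % len(AE_MEMBERS)]

# Same destination in github's block, two different NAT-pool sources:
print(pick_member("63.245.214.130", "192.30.252.129", 6, 51234, 443))
print(pick_member("63.245.214.54",  "192.30.252.129", 6, 51234, 443))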
Next stop, Level3.
Comment 6•11 years ago
Level3 ticket is 7740498.
I started writing up the problem description for Level3 and started to question whether this is
a Level3 problem after all.
However, I saw something unexpected in their email on the ticket.
They are only seeing an advertisement for 63.245.208.0/22 -- just that /22, which
would not include 63.245.214.0/23. I'm going to see if I can talk with them
about how they get to 63.245.214.0/23.
Comment 7•11 years ago
Talked with Level3. They are looking at our bgp session over at sjc1, which is why they
only see that /22.
Comment 8•11 years ago
I temporarily changed the NAT for dmz.scl3 from a pool NAT to an interface NAT to unblock 993632 #6. This will need to be reverted when this bug is solved.
Comment 9•11 years ago
I can ping 192.30.252.129 now that the NAT has changed, but still no connectivity to port 9148. Hal, I haven't been able to get ANY host anywhere either on or off Mozilla's network to reach port 9148, so I'm guessing this is what most of the rest of the world sees at this point.
How does this impact our ability to do checkins?
Flags: needinfo?(hwine)
Comment 10•11 years ago
Can you tell me your source IP from that specific box? If possible I'd like to know your NAT address. Can you run one of these (a Python equivalent of the Linux check is sketched below):
1. Linux: curl ifconfig.me
2. Windows: ipchicken.com
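If curl isn't handy, a rough Python equivalent of the Linux check (this assumes ifconfig.me still returns the caller's address as plain text at /ip):

import urllib.request

req = urllib.request.Request("https://ifconfig.me/ip", headers={"User-Agent": "curl/7.29.0"})
with urllib.request.urlopen(req, timeout=10) as resp:
    print(resp.read().decode().strip())  # the address your traffic is NATed to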
Comment 11•11 years ago
63.245.214.54
Comment 12•11 years ago
Amy, git is defined on the firewall using port 9418.
Application Rule:
set applications application git protocol tcp
set applications application git destination-port 9418
I can also connect to git from that firewall:
jbarnell@fw1.scl3.mozilla.net> telnet github.com port 9418 interface reth0.1024
Trying 192.30.252.130...
Connected to github.com.
Escape character is '^]'.
^CConnection closed by foreign host.
Can you take a look at the port you tested with above? It may just be a typo.
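If telnet from the firewall isn't available, a quick TCP check of the same ports can be run from any host with a short Python snippet (a sketch of the same test, not part of the firewall config):

import socket

def port_open(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

print(port_open("github.com", 9418))  # git protocol port from the firewall rule
print(port_open("github.com", 9148))  # the mistyped port from comment 9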
Comment 13•11 years ago
Sorry, bleary heartbleed morning. It was a typo and 9418 does indeed work.
Updated•11 years ago
Flags: needinfo?(hwine)
Comment 14•11 years ago
(In reply to Amy Rich [:arich] [:arr] from comment #9)
> How does this impact our ability to do checkins?
[answering for the future]
Technically, it does not prevent developer checkins (as gaia commits land on github.com). However, it prevents all CI and release builds from using that newly checked-in code. (The build farm only pulls from *.m.o to avoid service disruptions. :/)
Usually, sheriffs will close the trees (preventing checkins administratively) when CI can't be run. (Github does not support that feature.) The more commits that are included in a CI run, the harder it is to bisect which commit caused any issue. Experience shows it's overall less disruptive to close the trees than to subject many developers to the frustration of large cset bisection.
In this case (fxos), there was also the issue that all the newly committed code was invisible to partners and QA, so no forward progress could be demonstrated or verified.
Comment 15•11 years ago
After the temporary fix of comment 8, are there still servers that need to reach github but can't?
Comment 16•11 years ago
(In reply to Arzhel Younsi [:XioNoX] from comment #15)
> After the temporary fix of comment 8, is there still servers that need to
> reach github but can't?
here are a couple:
[cturra@developeradm.private.scl3 ~]$ nc -zv github.com 443
nc: connect to github.com port 443 (tcp) failed: Connection timed out
[cturra@datazillaadm.private.scl3 ~]$ nc -zv github.com 443
nc: connect to github.com port 443 (tcp) failed: Connection timed out
Comment 17•11 years ago
We are able to connect from the firewall, can you please send your source address?
FW Connection:
{primary:node1}
jbarnell@fw1.scl3.mozilla.net> telnet github.com port 443 interface reth0.1024
Trying 192.30.252.130...
Connected to github.com.
Escape character is '^]'.
Comment 18•11 years ago
(In reply to James Barnell from comment #17)
> We are able to connect from the firewall, can you please send your source
> address?
both return the same...
[cturra@datazillaadm.private.scl3 ~]$ dig +short github.com
192.30.252.130
Comment 19•11 years ago
I temporarily changed the NAT to use the fw interface address instead of an IP in the pool.
I kept a static rule though for admin1a.scl3 so we can keep troubleshooting.
Attached is what will need to be applied to roll back when the issue is solved.
Comment 20•11 years ago
Nicely done Arzhel -- Thanks for making that change!
Comment 21•11 years ago
i see successes! thnx :XioNoX, this will allow us to get code pushes out the door.
[cturra@developeradm.private.scl3 ~]$ nc -zv 192.30.252.130 443
Connection to 192.30.252.130 443 port [tcp/https] succeeded!
Comment 22•11 years ago
This just in from github.com:
Hi Dave,
I've done some digging through our logs this morning and tracked down a single banned IP in your range.
63.245.214.162 was banned at 2014-04-08 11:29:42 UTC by an operations engineer in response to a surge of requests that were generating exceptions against our API. The requests were all to mozilla/remo and originated from a script with a user agent of "python-requests/2.2.1 CPython/2.7.2 Linux/3.2.0-36-virtual" by user penthu. We monitored the issue and decided to block the IP after the issue surfaced three times.
I've unblocked the IP, but cannot make any assurances that it will remain unblocked. If the problem occurs again, it is likely to be banned again to prevent impact to other users. I definitely recommend tracking down the offending script and figuring out why it generated such a large number of requests.
Cheers,
Scott Sanders
GitHub Ops
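For reference, a hedged sketch of how a script like that could back off when the API starts rejecting requests instead of retrying hot (the endpoint uses the mozilla/remo repo named above; the retry policy here is an assumption, not GitHub's stated requirement):

import time
import requests

def get_with_backoff(url, max_tries=5):
    delay = 1.0
    resp = None
    for _ in range(max_tries):
        resp = requests.get(url, timeout=30)
        # Back off on rate limiting or server errors rather than hammering the API.
        if resp.status_code in (403, 429) or resp.status_code >= 500:
            time.sleep(delay)
            delay *= 2
            continue
        break
    return resp

resp = get_with_backoff("https://api.github.com/repos/mozilla/remo")
print(resp.status_code)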
Comment 23•11 years ago
:dcurado - can you please fwd that email to me so i can reach out to Scott? i would like to ask him if he can alert me if they see this again since webops run the automated deployment environments for most sites (including remo - reps.mozilla.org).
Comment 24•11 years ago
yay! I rolled back my workarounds.
Comment 25•11 years ago
Chris -- will forward the email to you.
We may need to approach them carefully here -- obviously they thought what we were doing was
abuse (!) so I don't know how they feel about helping us with anything at the moment.
Not a technical issue, but an inter-personal one...
Updated•11 years ago
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Comment 26•11 years ago
just a quick update here to let everyone know - i have tracked down the developer and application that had caused this block by github and had a great conversation with him. it was definitely accidental and he was extremely sorry for the impact this caused.
Updated•3 years ago
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard