Closed Bug 993766 Opened 11 years ago Closed 11 years ago

github.com unreachable from some hosts

Categories

(Infrastructure & Operations Graveyard :: NetOps: DC Carrier, task)

Hardware: x86
OS: macOS
Type: task
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: adam, Unassigned)

References

Details

Attachments

(1 file)

We have noticed that certain hosts within SCL3 are unable to reach github.com. We are working with our upstream providers and github.com to resolve the issue.
Blocks: 993632
We advertise 63.245.214.0/23 from our SCL3 data center. Most hosts within that network block can get to github.com (github.com has address 192.30.252.129; github's network block is 192.30.252.0/22). However, any hosts in the following two subnets of 63.245.214.0/23 cannot reach any IP address in github's /22:

63.245.214.124/30
63.245.214.128/26

Those are the NAT pools for fw1.releng.scl3 and fw1.scl3, respectively.
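For anyone checking whether a given source address falls into one of the affected NAT pools, here is a minimal sketch (the pool prefixes are taken from the comment above; everything else is illustrative):

#!/usr/bin/env python3
"""Check whether a source IP falls in one of the affected NAT pools."""
import ipaddress
import sys

AFFECTED_POOLS = [
    ipaddress.ip_network("63.245.214.124/30"),  # fw1.releng.scl3 NAT pool
    ipaddress.ip_network("63.245.214.128/26"),  # fw1.scl3 NAT pool
]

def is_affected(ip_str):
    ip = ipaddress.ip_address(ip_str)
    return any(ip in pool for pool in AFFECTED_POOLS)

if __name__ == "__main__":
    for arg in sys.argv[1:] or ["63.245.214.130", "63.245.214.5"]:
        print(arg, "affected" if is_affected(arg) else "ok")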
The announcement of our /23 network block looks fine. We contacted github and they helped test by tracerouting from their infrastructure back to our affected subnets. The traceroutes look correct. They even tried to modify their routing policy so that they send packets back to us via a different path, but that did not help. The problem appears to be the path that the packets take going TOWARDS github.com.
Here is a traceroute from an affected IP address:

[dcurado@admin1a.private.scl3 ~]$ traceroute 192.30.252.128
traceroute to 192.30.252.128 (192.30.252.128), 30 hops max, 60 byte packets
 1  fw1.private.scl3.mozilla.net (10.22.75.1)  0.555 ms  0.533 ms  0.515 ms
 2  63.245.214.53 (63.245.214.53)  2.481 ms  2.488 ms  2.411 ms
 3  63.245.214.46 (63.245.214.46)  0.832 ms  0.844 ms  0.839 ms
 4  xe-1-2-0.border1.sjc2.mozilla.net (63.245.219.161)  1.427 ms  1.427 ms  1.413 ms
 5  xe-5-0-0.mpr4.sjc7.us.above.net (64.125.170.37)  1.396 ms  1.383 ms  1.361 ms
 6  ae3.mpr3.sjc7.us.above.net (64.125.27.85)  1.353 ms  1.461 ms  1.390 ms
 7  above-level3.sjc7.us.above.net (64.125.13.242)  1.459 ms  1.452 ms  1.433 ms
 8  vlan80.csw3.SanJose1.Level3.net (4.69.152.190)  73.617 ms vlan70.csw2.SanJose1.Level3.net (4.69.152.126)  73.323 ms vlan80.csw3.SanJose1.Level3.net (4.69.152.190)  73.506 ms
 9  ae-91-91.ebr1.SanJose1.Level3.net (4.69.153.13)  73.422 ms ae-61-61.ebr1.SanJose1.Level3.net (4.69.153.1)  75.220 ms ae-91-91.ebr1.SanJose1.Level3.net (4.69.153.13)  73.390 ms
10  ae-2-2.ebr2.NewYork1.Level3.net (4.69.135.186)  73.219 ms  73.399 ms  73.737 ms
11  ae-46-46.ebr2.NewYork2.Level3.net (4.69.201.30)  73.680 ms ae-48-48.ebr2.NewYork2.Level3.net (4.69.201.38)  73.512 ms ae-45-45.ebr2.NewYork2.Level3.net (4.69.141.22)  73.454 ms
12  ae-1-100.ebr1.NewYork2.Level3.net (4.69.135.253)  73.345 ms  73.198 ms  73.457 ms
13  ae-40-40.ebr2.Washington1.Level3.net (4.69.201.93)  72.864 ms ae-37-37.ebr2.Washington1.Level3.net (4.69.132.89)  73.159 ms ae-39-39.ebr2.Washington1.Level3.net (4.69.201.89)  73.346 ms
14  ae-62-62.csw1.Washington1.Level3.net (4.69.134.146)  73.690 ms ae-82-82.csw3.Washington1.Level3.net (4.69.134.154)  72.990 ms ae-72-72.csw2.Washington1.Level3.net (4.69.134.150)  73.053 ms
15  ae-4-90.edge3.Washington4.Level3.net (4.69.149.210)  73.976 ms ae-2-70.edge3.Washington4.Level3.net (4.69.149.82)  73.563 ms  73.614 ms
16  * * *
17  * * *
18  * * *
19  * * *
20  * * *

Notice hop 7 to hop 8.
From an affected host, Above.net hands the packet to Level3's router 4.69.152.190.

The same traceroute (to github.com) from an unaffected host within the same /23 netblock:

dcurado@fw1.scl3.mozilla.net> traceroute 192.30.252.128
traceroute to 192.30.252.128 (192.30.252.128), 30 hops max, 40 byte packets
 1  63.245.214.53 (63.245.214.53)  6.280 ms  6.135 ms  4.398 ms
 2  63.245.214.46 (63.245.214.46)  3.526 ms  3.361 ms  2.528 ms
 3  63.245.219.161 (63.245.219.161)  3.657 ms  3.621 ms  2.987 ms
 4  64.125.170.37 (64.125.170.37)  3.904 ms  4.718 ms  3.584 ms
 5  64.125.27.85 (64.125.27.85)  3.417 ms  4.318 ms  3.569 ms
 6  64.125.13.242 (64.125.13.242)  3.634 ms  3.801 ms  3.020 ms
 7  4.69.152.126 (4.69.152.126)  75.401 ms 4.69.152.254 (4.69.152.254)  75.520 ms 4.69.152.62 (4.69.152.62)  75.385 ms
    MPLS Label=1592 CoS=0 TTL=1 S=1
 8  4.69.153.13 (4.69.153.13)  75.201 ms  75.263 ms 4.69.153.9 (4.69.153.9)  75.973 ms
    MPLS Label=1880 CoS=0 TTL=1 S=1
 9  4.69.135.186 (4.69.135.186)  76.844 ms  75.551 ms  75.712 ms
    MPLS Label=2001 CoS=0 TTL=1 S=1
10  4.69.201.30 (4.69.201.30)  83.146 ms 4.69.201.34 (4.69.201.34)  74.897 ms 4.69.141.22 (4.69.141.22)  76.787 ms
    MPLS Label=1572 CoS=0 TTL=1 S=1
11  4.69.135.253 (4.69.135.253)  76.584 ms  75.676 ms  76.005 ms
    MPLS Label=1987 CoS=0 TTL=1 S=1
12  4.69.201.85 (4.69.201.85)  75.179 ms 4.69.132.89 (4.69.132.89)  74.710 ms 4.69.201.85 (4.69.201.85)  75.950 ms
    MPLS Label=1428 CoS=0 TTL=1 S=1
13  4.69.134.158 (4.69.134.158)  75.174 ms 4.69.134.150 (4.69.134.150)  75.504 ms 4.69.134.154 (4.69.134.154)  75.336 ms
    MPLS Label=1965 CoS=0 TTL=1 S=1
14  4.69.149.18 (4.69.149.18)  75.611 ms 4.69.149.146 (4.69.149.146)  75.936 ms 4.69.149.18 (4.69.149.18)  75.337 ms
15  * * *
16  * * *

Here above.net hands the packet off to Level3's router 4.69.152.126. While it looks like this traceroute dies, this unaffected host can ping github.com.

These traceroutes from affected and unaffected hosts are consistent. This looks like a problem with above.net's router (for example a badly broken route cache entry), or Level3 is announcing something unexpected to above.net regarding these netblocks. Notice that the two paths through Level3's network are quite different, with the correct path taking an MPLS route.

Next step is to talk to above.net to find out what their router (the router at hop 6) is thinking.

Ticket opened with AboveNet/Zayo: 432235
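To make comparisons like the two above easier to repeat from more hosts, here is a rough sketch that runs traceroute and prints just the hop addresses so paths can be diffed side by side. It assumes the standard traceroute binary is on PATH; the output parsing is best-effort and illustrative only.

#!/usr/bin/env python3
"""Print the hop IPs of a traceroute so affected/unaffected paths can be diffed."""
import re
import subprocess
import sys

def hop_ips(target, max_hops=20):
    out = subprocess.run(
        ["traceroute", "-n", "-m", str(max_hops), target],
        capture_output=True, text=True, check=False,
    ).stdout
    hops = []
    for line in out.splitlines()[1:]:            # skip the header line
        m = re.match(r"\s*\d+\s+(\S+)", line)    # first field after the hop number
        if m:
            hops.append(m.group(1))              # an IP address, or '*' for no reply
    return hops

if __name__ == "__main__":
    target = sys.argv[1] if len(sys.argv) > 1 else "192.30.252.128"
    for i, hop in enumerate(hop_ips(target), start=1):
        print(f"{i:2d}  {hop}")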
This is the info we got from github when they tried to ping/traceroute back towards us. All of the below is from github (who was very, very helpful on this!):

jssjr@github-lb3a-cp1-prd:~$ ping 63.245.214.5
PING 63.245.214.5 (63.245.214.5) 56(84) bytes of data.
64 bytes from 63.245.214.5: icmp_req=1 ttl=45 time=73.9 ms
64 bytes from 63.245.214.5: icmp_req=2 ttl=45 time=74.3 ms

mtr --report --report-wide --report-cycles 50 63.245.214.125
HOST: github-lb3c-cp1-prd.iad.github.net          Loss%  Snt  Last   Avg  Best   Wrst  StDev
<snipped internal network>
  6.|-- xe-0-2-0-13.r04.asbnva02.us.bb.gin.ntt.net  0.0%   50   1.5   1.7   1.5    4.3    0.4
  7.|-- xe-2-0-1.er4.iad10.us.above.net             0.0%   50   3.9   4.8   1.2   30.0    4.9
  8.|-- xe-1-0-1.er2.iad10.us.above.net             0.0%   50   2.2   2.6   1.9   15.8    2.2
  9.|-- ae2.cr2.dca2.us.above.net                   2.0%   50   2.5   4.5   2.5   26.5    5.4
 10.|-- ae6.cr2.iah1.us.above.net                   0.0%   50  35.1  38.3  34.2   72.0    6.7
 11.|-- ae2.cr2.lax112.us.above.net                 0.0%   50  67.3  66.6  61.4   85.8    5.4
 12.|-- ae1.cr2.sjc2.us.above.net                   0.0%   50  73.0  75.9  72.2   93.4    6.3
 13.|-- xe-5-2-0.mpr4.sjc7.us.above.net             0.0%   50  72.6  75.0  72.2  129.2    9.5
 14.|-- 64.125.170.38.t00539-06.above.net           0.0%   50  72.9  75.4  72.5  121.8    8.4
 15.|-- xe-0-0-1.border2.scl3.mozilla.net           0.0%   50  73.5  74.4  72.8   95.5    3.7
 16.|-- v-1032.core1.releng.scl3.mozilla.net        0.0%   50  78.2  74.5  73.3   81.6    1.3
 17.|-- ???                                       100.0%   50   0.0   0.0   0.0    0.0    0.0

jssjr@github-lb3c-cp1-prd:~$ ping -c4 63.245.214.125
PING 63.245.214.125 (63.245.214.125) 56(84) bytes of data.
From 63.245.214.82 icmp_seq=1 Destination Net Unreachable
From 63.245.214.82 icmp_seq=2 Destination Net Unreachable
From 63.245.214.82 icmp_seq=3 Destination Net Unreachable
From 63.245.214.82 icmp_seq=4 Destination Net Unreachable

--- 63.245.214.125 ping statistics ---
4 packets transmitted, 0 received, +4 errors, 100% packet loss, time 3004ms

I do have valid routes to that netblock through:

63.245.214.0/23    *[BGP/170] 9w0d 12:07:50, MED 0, localpref 100
                      AS path: 2914 6461 36856 I, validation-state: unverified
                    > to 129.250.197.73 via xe-1/0/1.0

And I'm able to ping .5 in the same /24 subnet as the affected block.
I worked with Zayo on this problem. As noted in the traceroutes above, the router with the IP address of 64.125.13.242 is the one routing packets from 63.245.214.x differently, depending on which subnet the packet comes from. 64.125.13.242 is from Zayo address space, so I assumed that is a Zayo router. It's not. It's the other side of the /30 that Zayo and Level3 use on their peering session. They just happen to use Zayo address space for that. Level3 uses a great deal of aggregated ethernet bundles. To get load balancing to work correctly on AE bundles, all kinds of hashing schemes are used. So a router choosing one path over another based on more than just dest IP is not outside the realm of possibility. Next stop, Level3.
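To illustrate the point about hashing (this is a toy sketch, not any vendor's actual algorithm): when a load-balancing hash covers more of the header than the destination address, flows from different source subnets toward the same destination can land on different member links of an aggregated-ethernet bundle, and so take different paths.

#!/usr/bin/env python3
"""Toy illustration of flow-hash link selection over an AE bundle."""
import hashlib

LINKS = ["ae0-member-0", "ae0-member-1", "ae0-member-2", "ae0-member-3"]

def pick_link(src_ip, dst_ip, proto="tcp", sport=51000, dport=443):
    # Hash the 5-tuple, not just the destination, as many LAG/ECMP schemes do.
    key = f"{src_ip}|{dst_ip}|{proto}|{sport}|{dport}".encode()
    digest = int.from_bytes(hashlib.sha256(key).digest()[:4], "big")
    return LINKS[digest % len(LINKS)]

for src in ("63.245.214.126", "63.245.214.130", "63.245.214.53"):
    print(src, "->", pick_link(src, "192.30.252.129"))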
Level3 ticket is 7740498. I started writing up the problem description for Level3 and began to question whether this is a Level3 problem after all. However, I saw something unexpected in their email on the ticket: they are only seeing an advertisement for 63.245.208.0/22 -- just the /22, which would not include 63.245.214.0/23. I'm going to see if I can talk with them about how they get to 63.245.214.0/23.
Talked with Level3. They are looking at our bgp session over at sjc1, which is why they only see that /22.
I temporarily changed the NAT for dmz.scl3 from a pool NAT to an interface NAT to unblock 993632 #6. This will need to be reverted when this bug is solved.
I can ping 192.30.252.129 now that the NAT has changed, but still no connectivity to port 9148. Hal, I haven't been able to get ANY host anywhere either on or off Mozilla's network to reach port 9148, so I'm guessing this is what most of the rest of the world sees at this point. How does this impact our ability to do checkins?
Flags: needinfo?(hwine)
Can you tell me your source IP from that specific box? If possible I'd like to know your NAT address. Can you try:
1. linux: curl ifconfig.me
2. windows: ipchicken.com
63.245.214.54
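(For reference, here is a minimal Python equivalent of the curl ifconfig.me check above, for hosts without curl. It assumes outbound HTTPS is permitted from the host and that ifconfig.me returns a bare IP; purely illustrative.)

#!/usr/bin/env python3
"""Print the public (post-NAT) address as seen by ifconfig.me."""
import urllib.request

def public_ip():
    req = urllib.request.Request("https://ifconfig.me/ip",
                                 headers={"User-Agent": "curl/7.61"})
    with urllib.request.urlopen(req, timeout=10) as resp:
        return resp.read().decode().strip()

if __name__ == "__main__":
    print(public_ip())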
Amy, git is defined on the firewall using port 9418.

Application Rule:
set applications application git protocol tcp
set applications application git destination-port 9418

I can also connect to git from that firewall:

jbarnell@fw1.scl3.mozilla.net> telnet github.com port 9418 interface reth0.1024
Trying 192.30.252.130...
Connected to github.com.
Escape character is '^]'.
^CConnection closed by foreign host.

Can you take a look at the port you tested with above? It may just be a typo.
Sorry, bleary heartbleed morning. It was a typo and 9418 does indeed work.
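For future reference, a small sketch for sanity-checking TCP reachability of a port (and catching transposed digits like 9148 vs 9418) from any host with Python; the port list is illustrative.

#!/usr/bin/env python3
"""Quick TCP reachability check for the git protocol and HTTPS ports."""
import socket

def port_open(host, port, timeout=5.0):
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

for port in (9418, 443, 22):
    print(f"github.com:{port} ->", "open" if port_open("github.com", port) else "unreachable")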
Flags: needinfo?(hwine)
(In reply to Amy Rich [:arich] [:arr] from comment #9)
> How does this impact our ability to do checkins?

[answering for the future] Technically, it does not prevent developer checkins (since gaia commits land on github.com). However, it prevents all CI and release builds from using that newly checked-in code. (The build farm only pulls from *.m.o to avoid service disruptions. :/)

Usually, sheriffs will close the trees (preventing checkins administratively) when CI can't be run. (Github does not support that feature.) The more commits that are included in a CI run, the harder it is to bisect which commit caused any issue. Experience shows it's overall less disruptive to close the trees than to subject many developers to the frustration of large cset bisection.

In this case (fxos), there was also the issue that all the newly committed code was invisible to partners and QA, so no forward progress could be demonstrated or verified.
After the temporary fix of comment 8, are there still servers that need to reach github but can't?
Blocks: 994065
(In reply to Arzhel Younsi [:XioNoX] from comment #15)
> After the temporary fix of comment 8, are there still servers that need to
> reach github but can't?

here are a couple:

[cturra@developeradm.private.scl3 ~]$ nc -zv github.com 443
nc: connect to github.com port 443 (tcp) failed: Connection timed out

[cturra@datazillaadm.private.scl3 ~]$ nc -zv github.com 443
nc: connect to github.com port 443 (tcp) failed: Connection timed out
We are able to connect from the firewall. Can you please send your source address?

FW Connection:

{primary:node1}
jbarnell@fw1.scl3.mozilla.net> telnet github.com port 443 interface reth0.1024
Trying 192.30.252.130...
Connected to github.com.
Escape character is '^]'.
(In reply to James Barnell from comment #17)
> We are able to connect from the firewall. Can you please send your source
> address?

both return the same...

[cturra@datazillaadm.private.scl3 ~]$ dig +short github.com
192.30.252.130
I temporarily changed the NAT to use the fw interface address instead of an IP in the pool. I kept a static rule for admin1a.scl3, though, so we can keep troubleshooting. Attached is what will need to be applied to roll back once the issue is solved.
Nicely done Arzhel -- Thanks for making that change!
i see successes! thnx :XioNoX, this will allow us to get code pushes out the door.

[cturra@developeradm.private.scl3 ~]$ nc -zv 192.30.252.130 443
Connection to 192.30.252.130 443 port [tcp/https] succeeded!
This just in from github.com:

Hi Dave,

I've done some digging through our logs this morning and tracked down a single banned IP in your range. 63.245.214.162 was banned at 2014-04-08 11:29:42 UTC by an operations engineer in response to a surge of requests that were generating exceptions against our API. The requests were all to mozilla/remo and originated from a script with a user agent of "python-requests/2.2.1 CPython/2.7.2 Linux/3.2.0-36-virtual" by user penthu. We monitored the issue and decided to block the IP after the issue surfaced three times.

I've unblocked the IP, but cannot make any assurances that it will remain unblocked. If the problem occurs again, it is likely to be banned again to prevent impact to other users. I definitely recommend tracking down the offending script and figuring out why it generated such a large number of requests.

Cheers,
Scott Sanders
GitHub Ops
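As a hedged sketch of how a polling script like the one GitHub describes could avoid tripping their abuse handling: GitHub's API returns X-RateLimit-Remaining and X-RateLimit-Reset headers, and a script can back off when the remaining count hits zero. The endpoint below is the mozilla/remo repo named in their email; the backoff policy and naming are illustrative, not the actual offending script.

#!/usr/bin/env python3
"""Rate-limit-aware polling sketch using python-requests and GitHub's rate-limit headers."""
import time
import requests

API = "https://api.github.com/repos/mozilla/remo"   # repo named in GitHub's email

def polite_get(url, session):
    resp = session.get(url, timeout=30)
    remaining = int(resp.headers.get("X-RateLimit-Remaining", "1"))
    reset_at = int(resp.headers.get("X-RateLimit-Reset", "0"))
    if remaining == 0 and reset_at:
        # Sleep until the documented reset time instead of retrying immediately.
        time.sleep(max(0, reset_at - time.time()) + 1)
    return resp

if __name__ == "__main__":
    with requests.Session() as s:
        s.headers["User-Agent"] = "example-poller/0.1"   # identify the script clearly
        r = polite_get(API, s)
        print(r.status_code, r.headers.get("X-RateLimit-Remaining"))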
:dcurado - can you please fwd that email to me so i can reach out to Scott? i would like to ask him if he can alert me if they see this again, since webops runs the automated deployment environments for most sites (including remo - reps.mozilla.org).
yay! I rolled back my workarounds.
Chris -- will forward the email to you. We may need to approach them carefully here -- obviously they thought what we were doing was abuse (!) so I don't know how they feel about helping us with anything at the moment. Not a technical issue, but an inter-personal one...
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
just a quick update here to let everyone know - i have tracked down the developer and application that had caused this block by github and had a great conversation with him. it was definitely accidental and he was extremely sorry for the impact this caused.
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard