Connectivity to the use1 region of AWS was terrible today. This resulted in a spate of dropped upload connections between use1 hosts and stage.mozilla.org, which repeatedly filled up the /tmp dir there.

Smokeping results: http://netops2.private.scl3.mozilla.com/smokeping/sm.cgi?target=Datacenters.RELENG-SCL3.nagios1-releng-use1

We've had issues with BGP flaps in the past. What was the state of our tunnels today? Were we already swapping back and forth between them as a result of flaps? If not, could we swap to the other tunnel to try to improve performance?
Here's some current mtr output, but of course we're not experiencing any issues *right now*.

bld-linux64-spot-1047.build.releng.use1.mozilla.com (0.0.0.0)    Thu Oct  2 16:46:54 2014
Keys:  Help   Display mode   Restart statistics   Order of fields   quit
                                              Packets               Pings
 Host                                       Loss%   Snt   Last    Avg   Best   Wrst  StDev
 1. 169.254.255.61                           0.0%   312   39.1    1.5    0.4   39.1    4.1
 2. 22.214.171.124                           0.0%   312    1.2    1.2    1.0    3.7    0.2
 3. 169.254.255.74                           1.0%   311   95.5   88.1   83.5  103.6    4.4
 4. v-1030.core1.releng.scl3.mozilla.net     0.0%   311   96.2   89.4   84.5  103.4    4.6
 5. v-1032.border2.scl3.mozilla.net          0.0%   311   95.5   95.1   84.0  138.7   11.6
 6. v-1027.core1.scl3.mozilla.net            0.0%   311   99.8   91.8   85.4  105.9    4.9
 7. upload-zlb.vips.scl3.mozilla.com         0.0%   311   97.6   88.9   84.2  102.0    4.6
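For anyone re-running this while the problem is actually happening, a rough sketch of how the mtr rows could be checked for the hop where latency jumps and for hops reporting loss. The host, loss and average values are copied from the mtr output above; the function names are made up for illustration and are not part of mtr or any existing tool.

```python
# Rows copied from the mtr output above: (host, loss %, average latency in ms).
HOPS = [
    ("169.254.255.61", 0.0, 1.5),
    ("22.214.171.124", 0.0, 1.2),
    ("169.254.255.74", 1.0, 88.1),
    ("v-1030.core1.releng.scl3.mozilla.net", 0.0, 89.4),
    ("v-1032.border2.scl3.mozilla.net", 0.0, 95.1),
    ("v-1027.core1.scl3.mozilla.net", 0.0, 91.8),
    ("upload-zlb.vips.scl3.mozilla.com", 0.0, 88.9),
]


def largest_latency_jump(hops):
    """Return (from_host, to_host, delta_ms) for the biggest rise in
    average latency between two consecutive hops."""
    best = None
    for (prev_host, _, prev_avg), (host, _, avg) in zip(hops, hops[1:]):
        delta = avg - prev_avg
        if best is None or delta > best[2]:
            best = (prev_host, host, delta)
    return best


def lossy_hops(hops, threshold=0.0):
    """Return the hosts whose reported loss percentage exceeds threshold."""
    return [host for host, loss, _ in hops if loss > threshold]


if __name__ == "__main__":
    print(largest_latency_jump(HOPS))
    print(lossy_hops(HOPS))
```

On this sample the biggest jump is between hops 2 and 3, and hop 3 is the only one showing any loss. Loss at an intermediate hop that does not carry through to the final hop is often just ICMP rate limiting, so treat it as a hint rather than proof.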
There was a thread on NANOG about AWS rebooting all of US-West. I realize that the problem you are talking about is US-East, but I couldn't help but wonder if Amazon was doing the same security fix in USE that they did in USW? http://mailman.nanog.org/pipermail/nanog/2014-October/070084.html
All our metrics on the datacenter side look fine, and as only use1 was affected I'd tend to think that it's related to either AWS or a provider between AWS and us. When BGP flaps, the traffic starts using the 2nd link and stays on it until the next flap. As both links use the same path (only the endpoints at AWS differ), a manual failover is unlikely to help much, but it can indeed be tried next time. Don't hesitate to ping the Netops oncall while it's happening so we can troubleshoot it live. I'm going to close this bug; please reopen if you have more questions/comments.
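To illustrate the failover behaviour described above (traffic moves to the other link on a flap and stays there until the next flap), here is a toy model. It is a simplification and not how the routers are configured; the link names "tunnel-1" and "tunnel-2" and the function name are placeholders.

```python
def active_link_history(flaps, links=("tunnel-1", "tunnel-2"), start=0):
    """Return which link carries traffic at each tick.

    flaps is a sequence of booleans, True meaning the BGP session on the
    currently active link flapped at that tick. On a flap, traffic moves
    to the other link and stays there; it does not fail back on its own.
    """
    active = start
    history = []
    for flapped in flaps:
        if flapped:
            active = (active + 1) % len(links)
        history.append(links[active])
    return history


if __name__ == "__main__":
    # Two flaps: traffic ends up back on the first link.
    print(active_link_history([False, True, False, False, True, False]))
```

With two flaps in a day the traffic ends up back on the link it started on, which is why checking the flap history matters before deciding whether a manual swap would change anything.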
Assignee: network-operations → arzhel
Status: NEW → RESOLVED
Last Resolved: 4 years ago
Resolution: --- → INCOMPLETE