Investigate cause of recent connection issues to use1

RESOLVED INCOMPLETE

Status

Infrastructure & Operations
NetOps
--
major
RESOLVED INCOMPLETE
4 years ago
4 years ago

People

(Reporter: coop, Assigned: XioNoX)

Attachments

(1 attachment)

(Reporter)

Description

4 years ago
Connectivity to the use1 region of AWS was terrible today. This resulted in a spate of dropped upload connections between use1 hosts and stage.mozilla.org, causing us to fill up the /tmp dir there repeatedly.

Smokeping results:

http://netops2.private.scl3.mozilla.com/smokeping/sm.cgi?target=Datacenters.RELENG-SCL3.nagios1-releng-use1

We've had issues with BGP flaps in the past. What was the state of our tunnels today? Have we already been swapping back and forth between them with the flaps today? If not, could we swap to the other tunnel to try to improve performance?
(Reporter)

Comment 1

4 years ago
Here's some current mtr output, but of course we're not experiencing any issues *right now*.

bld-linux64-spot-1047.build.releng.use1.mozilla.com (0.0.0.0)    Thu Oct  2 16:46:54 2014
Keys:  Help   Display mode   Restart statistics   Order of fields   quit
                                             Packets               Pings
 Host                                      Loss%   Snt   Last   Avg  Best  Wrst StDev
 1. 169.254.255.61                          0.0%   312   39.1   1.5   0.4  39.1   4.1
 2. 72.21.209.193                           0.0%   312    1.2   1.2   1.0   3.7   0.2
 3. 169.254.255.74                          1.0%   311   95.5  88.1  83.5 103.6   4.4
 4. v-1030.core1.releng.scl3.mozilla.net    0.0%   311   96.2  89.4  84.5 103.4   4.6
 5. v-1032.border2.scl3.mozilla.net         0.0%   311   95.5  95.1  84.0 138.7  11.6
 6. v-1027.core1.scl3.mozilla.net           0.0%   311   99.8  91.8  85.4 105.9   4.9
 7. upload-zlb.vips.scl3.mozilla.com        0.0%   311   97.6  88.9  84.2 102.0   4.6
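
For the next time this happens, here is a rough sketch of how mtr output like the above could be scanned for hops worth a closer look. The function names and the thresholds (0.5% loss, 10 ms StDev) are assumptions made for illustration, not anything NetOps or RelEng actually runs.

```python
import re

# One hop line of mtr output: "N. host Loss% Snt Last Avg Best Wrst StDev".
HOP_RE = re.compile(
    r"^\s*(\d+)\.\s+(\S+)\s+([\d.]+)%\s+(\d+)"
    r"\s+([\d.]+)\s+([\d.]+)\s+([\d.]+)\s+([\d.]+)\s+([\d.]+)\s*$"
)


def parse_mtr(text):
    """Return one dict per hop line; header and key-help lines are skipped."""
    hops = []
    for line in text.splitlines():
        m = HOP_RE.match(line)
        if m is None:
            continue
        hops.append({
            "hop": int(m.group(1)),
            "host": m.group(2),
            "loss": float(m.group(3)),
            "sent": int(m.group(4)),
            "avg": float(m.group(6)),
            "worst": float(m.group(8)),
            "stdev": float(m.group(9)),
        })
    return hops


def suspicious_hops(hops, max_loss=0.5, max_stdev=10.0):
    """Hypothetical thresholds: flag a hop on packet loss or high jitter."""
    return [h for h in hops if h["loss"] > max_loss or h["stdev"] > max_stdev]


# Hop lines copied from the mtr run in this comment.
SAMPLE = """\
 1. 169.254.255.61                          0.0%   312   39.1   1.5   0.4  39.1   4.1
 2. 72.21.209.193                           0.0%   312    1.2   1.2   1.0   3.7   0.2
 3. 169.254.255.74                          1.0%   311   95.5  88.1  83.5 103.6   4.4
 4. v-1030.core1.releng.scl3.mozilla.net    0.0%   311   96.2  89.4  84.5 103.4   4.6
 5. v-1032.border2.scl3.mozilla.net         0.0%   311   95.5  95.1  84.0 138.7  11.6
 6. v-1027.core1.scl3.mozilla.net           0.0%   311   99.8  91.8  85.4 105.9   4.9
 7. upload-zlb.vips.scl3.mozilla.com        0.0%   311   97.6  88.9  84.2 102.0   4.6
"""

if __name__ == "__main__":
    for h in suspicious_hops(parse_mtr(SAMPLE)):
        print(h["hop"], h["host"], h["loss"], h["stdev"])
```

With these assumed thresholds, hops 3 and 5 are flagged. Loss at a single intermediate hop is often just ICMP rate limiting on that router, so a flagged hop is a starting point rather than a diagnosis.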
Created attachment 8499284
Smokeping graph

Timestamps are in PDT.

Comment 3

4 years ago
There was a thread on NANOG about AWS rebooting all of US-West.
I realize the problem you are describing is in US-East, but I wonder whether Amazon was applying
the same security fix in US-East that they applied in US-West.

http://mailman.nanog.org/pipermail/nanog/2014-October/070084.html
(Assignee)

Comment 4

4 years ago
All our metrics on the datacenter side look fine, and since only use1 was affected, I'd tend to think the problem is related to either AWS or a provider between AWS and us.

When BGP flaps, the traffic moves to the second link and stays on it until the next flap. Since both links use the same path (only the endpoints at AWS differ), a manual failover is unlikely to help, but it can certainly be tried next time.
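
To illustrate the behavior described above, here is a minimal sketch of that "sticky" failover. It is a simplified model, not our router configuration: the tunnel names `tunnel1` and `tunnel2` and the function `active_link_history` are hypothetical.

```python
def active_link_history(flap_events, links=("tunnel1", "tunnel2")):
    """Return the active link at the start and after each flap event.

    flap_events lists, in order, the name of the link whose BGP session
    flapped. Simplified model: traffic moves only when the currently
    active link flaps, and it does not move back on its own.
    """
    active = links[0]
    history = [active]
    for flapped in flap_events:
        if flapped == active:
            # The active link flapped: fail over to the other link and stay.
            active = links[1] if active == links[0] else links[0]
        # A flap on the standby link does not move traffic.
        history.append(active)
    return history


if __name__ == "__main__":
    print(active_link_history(["tunnel1", "tunnel2", "tunnel2", "tunnel1"]))
```

In this model a manual failover is just one more switch between the two entries. That only changes the endpoint at AWS, which is why it is not expected to help when the problem is on the shared path.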

Don't hesitate to ping the NetOps on-call while it's happening so we can troubleshoot it live.

I'm going to close this bug; please reopen it if you have more questions or comments.
Assignee: network-operations → arzhel
Status: NEW → RESOLVED
Last Resolved: 4 years ago
Resolution: --- → INCOMPLETE