Closed Bug 1052120 Opened 10 years ago Closed 10 years ago

packet loss seen for users accessing PHX1 resources via above.net/Zayo

Categories

(Infrastructure & Operations Graveyard :: NetOps: DC Carrier, task)

x86
macOS
task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: dcurado, Assigned: dcurado)

References

Details

(Whiteboard: Case: TTN-0000513000)

Attachments

(1 file)

oremj contacted netops via IRC to report that he is seeing TCP timeouts
when running the following command from github.com:

curl https://addons.mozilla.org/en-US/firefox/

When I try it via AT&T wireless, and he tries it from his home network provider,
we're seeing good throughput.  This suggests Zayo is having a bad hair day.
But the problem has been going on for 4 hours, so this is something we need
to get straight with them.

Also, this just in: a monitoring service we use called ThousandEyes is
showing that access to the resource in PHX1 is fine via Telia and XO, just
not good via Zayo.  Calling Zayo now.
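The per-carrier comparison above can be repeated by hand with a small timing probe. This is only a sketch, not the tooling netops or ThousandEyes actually used: the function name `probe`, the 10 second timeout, and the injectable `opener` parameter are assumptions made for illustration. Only the URL comes from the report.

```python
import time
import urllib.request

# URL taken from the curl command in the report above.
URL = "https://addons.mozilla.org/en-US/firefox/"


def probe(url, timeout=10.0, opener=urllib.request.urlopen):
    """Fetch url once and return (ok, seconds_elapsed, detail).

    `opener` is a hypothetical hook so the function can be exercised
    without touching the network; by default it is urllib's urlopen.
    """
    start = time.monotonic()
    try:
        with opener(url, timeout=timeout) as resp:
            body = resp.read()
        return True, time.monotonic() - start, f"{len(body)} bytes"
    except OSError as exc:
        # URLError, HTTPError and socket timeouts are all OSError subclasses.
        return False, time.monotonic() - start, repr(exc)
```

Running `probe(URL)` from hosts that reach PHX1 through different carriers, and comparing which ones time out, is the same isolation step described in this comment: if only the hosts routed via one carrier fail, that carrier is the likely common factor.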
Assignee: network-operations → dcurado
Status: NEW → ASSIGNED
I opened ticket number 513000 with Zayo.

I started with a general question to them:  "How are things with your network today,
any problems?"

Answer: "oh yes..."

I asked if they already had a ticket open on these issues and if so, if I could have that
ticket number.  They said they do not give out master ticket numbers.  They would open a
new ticket though, and we can get in line.
So that's what I did.
I heard back from Zayo.  They have a fiber cut in Ohio.
That is likely causing congestion in parts of their network.
There is no ETA on a repair time, but splicing crews have been dispatched.

The impact to our end customers was enough of an issue that we escalated internally and
chose to shut down our interfaces to Zayo in PHX1.
Doing that caused a bit of a performance hit to PHX1 while our border routers recovered from
having a full view of the Internet routing table withdrawn, but it was not too bad.

The MOC has been notified.  They have been brought up to speed and have filed a whistlepig notice.

We will likely opt to leave Zayo shut down in PHX1 for the night, as we still have plenty of
capacity there (40G with 2 providers) and it will probably take Zayo a while to fix their
issues.  We will bring those interfaces to Zayo back up once we know that they are repaired,
and we'll do so at some godawful off hour to minimize the impact to Mozilla people.
Whiteboard: Case: TTN-0000513000
Telia is researching issues on their side via trouble ticket 00362376.
Blocks: 1051978
I contacted the Zayo NOC this morning and confirmed that their fiber cut had been repaired.
For the record, it was repaired as of this morning at 8:30am -- it took them all night to
splice the cut.

Conferred with James Barnell, and we opted to restore (turn back on) our two connections to Zayo
in PHX1, one at a time.

Contacted the MOC to give them a heads up before beginning this process.
We re-enabled the interfaces to Zayo.

James Barnell worked with Telia and found that they were getting errors on one of our
interfaces to them.

We decided to try shutting down that interface to Telia to see if that improved the situation.

Early reports are good.  The people in BER1 are reporting slightly better performance.
see: https://bugzilla.mozilla.org/show_bug.cgi?id=1048826

And we received word via IRC that ThousandEyes has seen performance normalize.
I took a screenshot of ThousandEyes, and will attach it to this bug.

Looking at the stripchart on that screenshot, I believe we had the following:
a) one interface had errors in the Mozilla -> Telia direction for some time
b) that gave us a low amount of errored traffic, as we have two connections to
   Telia in PHX1, and about 50% of the time traffic was using the non-errored link.
c) Then Zayo had a bad day yesterday, due to a fiber cut, and that added a lot of
   packet loss to the situation.  
d) But shutting down Zayo didn't really fix the problem, because we still had an
   errored circuit with Telia.
e) Zayo fixed their fiber cut, and we restored our service with them.
f) We shut down the errored link to Telia, and all reports from ThousandEyes have normalized.

In other words, we had 2 problems on top of each other.
To me, the stripchart from ThousandEyes strongly suggests that this was the case.
We can see where things were very bad when we had two problems going on at once.
Then things improved markedly when we shut off service to Zayo.
Things did not degrade when we re-enabled service to Zayo.
Things got 100% normal after taking down the errored link to Telia.
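
The sequence above can be illustrated with some rough arithmetic. This is a sketch under loudly stated assumptions: the loss rates (4% on the errored Telia link, 10% via Zayo during the fiber cut) and the equal traffic split across Telia, Zayo and XO are made-up numbers chosen for illustration, not values read off the stripchart. The only thing taken from the analysis is that about 50% of Telia traffic used the non-errored link.

```python
def observed_loss(paths):
    """Traffic-weighted average loss over (relative_share, loss_rate) paths."""
    total_share = sum(share for share, _ in paths)
    return sum(share * loss for share, loss in paths) / total_share


# Hypothetical numbers, for illustration only.
LINK_ERR = 0.04                 # loss on the errored Telia link
ZAYO_CONGESTED = 0.10           # loss via Zayo during the fiber cut
TELIA_ONE_BAD = 0.5 * LINK_ERR  # about half of Telia traffic hits the bad link

# Each phase lists (share, loss) for Telia, Zayo, XO; Zayo is absent when shut.
phases = {
    "a/b: Telia link errored only": [(1, TELIA_ONE_BAD), (1, 0.0), (1, 0.0)],
    "c: plus Zayo fiber cut":       [(1, TELIA_ONE_BAD), (1, ZAYO_CONGESTED), (1, 0.0)],
    "d: Zayo shut down":            [(1, TELIA_ONE_BAD), (1, 0.0)],
    "e: Zayo repaired, restored":   [(1, TELIA_ONE_BAD), (1, 0.0), (1, 0.0)],
    "f: errored Telia link shut":   [(1, 0.0), (1, 0.0), (1, 0.0)],
}

for name, paths in phases.items():
    print(f"{name:32s} {observed_loss(paths):.2%}")
```

With these made-up inputs the shape matches the description: loss is worst while both problems overlap, drops sharply but not to zero once Zayo is shut down, does not get worse when Zayo comes back, and reaches zero only after the errored Telia link is taken down.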
Status: ASSIGNED → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard