Bug 1164484 (Closed) · Opened 9 years ago · Closed 9 years ago

We lost both 10G links to PHX1 from our POPs

Categories
Infrastructure & Operations Graveyard :: NetOps
Type: task; Priority: Not set; Severity: normal

Tracking
Not tracked
Status: RESOLVED FIXED

People
Reporter: Usul; Assigned: dcurado

Details
<nagios-phx1> Wed 08:14:03 PDT [1277] nagios1.private.pek1.mozilla.com (10.24.75.42) is DOWN :PING CRITICAL - Packet loss = 100%
<nagios-scl3> Wed 08:14:19 PDT [5174] webwewant.mozilla.org (63.245.217.19) is DOWN :PING CRITICAL - Packet loss = 100%
<nagios-euw1> Wed 11:14:24 EDT [8176] nagios1.private.scl3.mozilla.com (10.22.75.42) is DOWN :PING CRITICAL - Packet loss = 100%
<nagios-corp-phx1> Wed 08:14:43 PDT [3001] nagios1.private.pek1.mozilla.com (10.24.75.42) is DOWN :PING CRITICAL - Packet loss = 100%
<nagios-releng> Wed 08:14:53 PDT [4649] nagios1.private.phx1.mozilla.com (10.8.75.19) is DOWN :PING CRITICAL - Packet loss = 100%
<nagios-phx1> Wed 08:15:03 PDT [1280] nagios1.private.scl3.mozilla.com (10.22.75.42) is DOWN :PING CRITICAL - Packet loss = 100%
<nagios-corp-phx1> Wed 08:15:03 PDT [3002] nagios1.private.scl3.mozilla.com (10.22.75.42) is DOWN :PING CRITICAL - Packet loss = 100%
<nagios-scl3> Wed 08:16:09 PDT [5175] nagios1.private.phx1.mozilla.com (10.8.75.19) is DOWN :PING CRITICAL - Packet loss = 100%
<nagios-scl3> Wed 08:16:09 PDT [5177] nagios1.private.corp.phx1.mozilla.com (10.20.75.46) is DOWN :PING CRITICAL - Packet loss = 100%
<nagios-scl3> Wed 08:16:10 PDT [5178] nagios1.private.euw1.mozilla.com (10.150.75.12) is DOWN :PING CRITICAL - Packet loss = 100%
Looks like we lost both 10G links to PHX1 from our POPs.
I will contact Zayo right away.
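For context, a minimal sketch of the kind of ICMP reachability test behind the alerts above. The real checks come from nagios's check_ping plugin, so the host list, ping count, and timeout below are illustrative assumptions (the IPs are copied from the paste); it assumes a Unix ping(8), where on Linux a non-zero exit means no replies came back.

    import subprocess

    # Hosts copied from the nagios alerts above; the selection is illustrative.
    HOSTS = {
        "nagios1.private.phx1.mozilla.com": "10.8.75.19",
        "nagios1.private.corp.phx1.mozilla.com": "10.20.75.46",
    }

    def is_down(ip, count=5, timeout=2):
        """Send `count` pings; on Linux, ping exits non-zero when no
        replies are received, which we treat as 100% packet loss."""
        result = subprocess.run(
            ["ping", "-c", str(count), "-W", str(timeout), ip],
            capture_output=True,
        )
        return result.returncode != 0

    for name, ip in HOSTS.items():
        if is_down(ip):
            print(f"{name} ({ip}) is DOWN :PING CRITICAL - Packet loss = 100%")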
Assignee: network-operations → dcurado
Status: NEW → ASSIGNED
<hwine> usul_training: mana is unreachable for me from SFO
<usul_training> https://bugzilla.mozilla.org/show_bug.cgi?id=1164484
<hwine> usul_training: ah, seems like a good item for the topic here, since whistlepig also appears to be offline
<glob> hwine, fwiw i can access whistlepig
<usul_training> I'll send a whistlepig
<usul_training> probably
<arr> did we just lose phx1?
<glob> (but replication between bmo's scl3 and phx1 clusters is broken)
<usul_training> arr looks like it
<usul_training> dcurado is looking
<usul_training> the bug is https://bugzilla.mozilla.org/show_bug.cgi?id=1164484
<dcurado> yes, we lost both 10G circuits into PHX1
<dcurado> However, there is a back door tunnel thing that should be working
<arr> we're seeing nagios failures
<arr> like of the nagios server in phx itself :}
<arr> and dc2
<usul_training> glob, whistlepig is slow for me
<rhelmer> hm I am having trouble logging into any of our servers, from the VPN or the jumphost
Summary: some links look like they are down → We lost both 10G links to PHX1 from our POPs
Zayo is aware of the problem.
Ticket numbers are 703860 and 703864.
Zayo is still determining the extent of the problem, but the tech I spoke with said:
"I think this is a pretty big outage because I haven't had a breath in 20 minutes"
I found a missing route in the BGP policy on Adam Newman's IPSec Love Child Backdoor link, for corp.phx1 (10.20/16). I fixed that, and now we appear to have 100% reachability again, even while the 2 x 10GE links from Zayo are down. Which is pretty cool. (See the sketch after this comment for the kind of check that would catch this.)

As Zayo still doesn't really know what is going on, I suspect this may be a prolonged outage.
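To make the fix above concrete: a minimal sketch, assuming nothing about the actual router configuration, of the kind of audit that would have flagged corp.phx1's prefix as missing from the tunnel's BGP export policy. The site names, prefix lists, and the covered() helper are all hypothetical.

    import ipaddress

    # Illustrative only: internal networks each site should be reachable
    # through over the backup tunnel, vs. the prefixes actually present
    # in the tunnel's BGP export policy.
    EXPECTED = {
        "phx1": "10.8.0.0/16",
        "corp.phx1": "10.20.0.0/16",  # the prefix that was missing
    }
    EXPORTED = ["10.8.0.0/16"]  # before the fix: corp.phx1 not exported

    def covered(prefix, exported):
        """True if `prefix` equals or falls within an exported route."""
        net = ipaddress.ip_network(prefix)
        return any(net.subnet_of(ipaddress.ip_network(e)) for e in exported)

    for site, prefix in EXPECTED.items():
        state = "ok" if covered(prefix, EXPORTED) else "MISSING from export policy"
        print(f"{site:12} {prefix:16} {state}")

Run as-is, this prints corp.phx1 as MISSING, matching the symptom: traffic to 10.20/16 had no route over the backdoor link until the policy was fixed.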
From #moc:

09:57:15 < jbarnell> Zayo is observing a fiber cut between Roll, AZ and Winterhaven, CA which is impacting our longhaul network between Phoenix, AZ and Los Angeles, CA. Technicians are in the area locating the point of damage at this time. We will continue to provide updates immediately as they become available.
Connectivity has been restored.
Status: ASSIGNED → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
It looks like this issue may be recurring. Dave C is looking into it.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
I contacted the provider, Zayo. They said the original repair (yesterday) was not 100% complete, and they are now splicing the cable.
Hard to know what that means, but if I had to guess, I'd say they are re-doing the work with better materials and better splices, after a quick patch in the field yesterday.
Usually, though, a provider would re-route live circuits over different fiber first, or at least tell you they are going to take down the circuits again.
These circuits have been restored to service.
Status: REOPENED → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard