We lost both 10G links to PHX1 from our POPs

RESOLVED FIXED

Status

Infrastructure & Operations
NetOps
3 years ago
3 years ago

People

(Reporter: Usul, Assigned: dcurado)

Details

(Reporter)

Description

3 years ago
<nagios-phx1> Wed 08:14:03 PDT [1277] nagios1.private.pek1.mozilla.com (10.24.75.42) is DOWN :PING CRITICAL - Packet loss = 100%
<nagios-scl3> Wed 08:14:19 PDT [5174] webwewant.mozilla.org (63.245.217.19) is DOWN :PING CRITICAL - Packet loss = 100%
<nagios-euw1> Wed 11:14:24 EDT [8176] nagios1.private.scl3.mozilla.com (10.22.75.42) is DOWN :PING CRITICAL - Packet loss = 100%
<nagios-corp-phx1> Wed 08:14:43 PDT [3001] nagios1.private.pek1.mozilla.com (10.24.75.42) is DOWN :PING CRITICAL - Packet loss = 100%
<nagios-releng> Wed 08:14:53 PDT [4649] nagios1.private.phx1.mozilla.com (10.8.75.19) is DOWN :PING CRITICAL - Packet loss = 100%
<nagios-phx1> Wed 08:15:03 PDT [1280] nagios1.private.scl3.mozilla.com (10.22.75.42) is DOWN :PING CRITICAL - Packet loss = 100%
<nagios-corp-phx1> Wed 08:15:03 PDT [3002] nagios1.private.scl3.mozilla.com (10.22.75.42) is DOWN :PING CRITICAL - Packet loss = 100%
<nagios-scl3> Wed 08:16:09 PDT [5175] nagios1.private.phx1.mozilla.com (10.8.75.19) is DOWN :PING CRITICAL - Packet loss = 100%
<nagios-scl3> Wed 08:16:09 PDT [5177] nagios1.private.corp.phx1.mozilla.com (10.20.75.46) is DOWN :PING CRITICAL - Packet loss = 100%
<nagios-scl3> Wed 08:16:10 PDT [5178] nagios1.private.euw1.mozilla.com (10.150.75.12) is DOWN :PING CRITICAL - Packet loss = 100%
(Assignee)

Comment 1

3 years ago
Looks like we lost both 10G links to PHX1 from our POPs.
I will contact Zayo right away.
Assignee: network-operations → dcurado
Status: NEW → ASSIGNED
(Reporter)

Comment 2

3 years ago
<hwine> usul_training: mana is unreachable for me from SFO
* grenade|afk is now known as grenade
<usul_training> https://bugzilla.mozilla.org/show_bug.cgi?id=1164484
<hwine> usul_training: ah, seems like a good item for the topic here, since whistle pig appears also offline
<glob> hwine, fwiw i can access whistlepig
<usul_training> I'll send a whistlepig
<usul_training> probbaly
<arr> did we just lose phx1?
<glob> (but replication between bmo's scl3 and phx1 clusters is broken)
<usul_training> arr looks like it
<usul_training> dcurado, is looking
<usul_training> the bug is https://bugzilla.mozilla.org/show_bug.cgi?id=1164484
<dcurado> yes, we lost both 10G circuits into PHX1
<dcurado> However, there is a back door tunnel thing that should be working
<arr> we're seeing nagios failures
<arr> like of the nagios server in phx itself :}
* havi has quit (Quit: havi_away)
<arr> and dc2
<usul_training> glob, whistlepig is slow for me
<rhelmer> hm I am having trouble logging into any of our servers, from the VPN or the jumphost
(Reporter)

Updated

3 years ago
Summary: some links look like they are down → We lost both 10G links to PHX1 from our POPs
(Assignee)

Comment 3

3 years ago
Zayo is aware of the problem.
Ticket numbers are: 703860 and 703864
(Assignee)

Comment 4

3 years ago
Zayo is still determining the extent of the problem, but the tech I spoke with said:
"I think this is a pretty big outage because I haven't had a breath in 20 minutes"
(Assignee)

Comment 5

3 years ago
I found a missing route in the BGP policy on Adam Newman's IPSec Love Child Backdoor link, for corp.phx1 (10.20/16).  I fixed that, and now we appear to have 100% reachability again, even while the 2 x 10GE
links from Zayo are down.  Which is pretty cool.
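
For anyone checking a similar gap later, here is a rough sketch of the idea, not the actual router policy: given the prefixes a backup link's policy permits, list the monitored hosts that none of them cover. The host addresses come from the Nagios alerts in the description. The function name `uncovered_hosts` and the `10.8.0.0/16` entry are assumptions made up for illustration; only the missing 10.20/16 is from this bug.

```python
import ipaddress


def uncovered_hosts(hosts, allowed_prefixes):
    """Return the hosts whose address matches none of the allowed prefixes."""
    networks = [ipaddress.ip_network(p) for p in allowed_prefixes]
    return {
        name: addr
        for name, addr in hosts.items()
        if not any(ipaddress.ip_address(addr) in net for net in networks)
    }


# Addresses taken from the Nagios alerts in the description.
PHX1_HOSTS = {
    "nagios1.private.phx1.mozilla.com": "10.8.75.19",
    "nagios1.private.corp.phx1.mozilla.com": "10.20.75.46",
}

# Hypothetical policy contents: only the absence of 10.20.0.0/16 is from this bug.
before_fix = ["10.8.0.0/16"]
after_fix = before_fix + ["10.20.0.0/16"]

print(uncovered_hosts(PHX1_HOSTS, before_fix))  # corp.phx1 host is not covered
print(uncovered_hosts(PHX1_HOSTS, after_fix))   # empty: every host is covered
```

This only checks prefix coverage on paper. It says nothing about what the routers actually advertise or accept, so it is a sanity check to run alongside looking at the real routing table.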

As Zayo still doesn't really know what is going on, I suspect this may be a prolonged outage.
From #moc:

09:57:15 < jbarnell> Zayo is observing a fiber cut between Roll, AZ and Winterhaven, CA which is impacting our longhaul network Phoenix, AZ to Los Angeles, CA. Technicians are in the area locating the point of damage at this time. We will continue to provide updates immediately as the become available.
(Assignee)

Comment 7

3 years ago
Connectivity has been restored.
Status: ASSIGNED → RESOLVED
Last Resolved: 3 years ago
Resolution: --- → FIXED
It looks like this may be a returning issue. Dave C is looking.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
(Assignee)

Comment 9

3 years ago
I contacted the provider, Zayo.  They said the original repair (yesterday) was not 100% complete, and
they are now splicing the cable.  
Hard to know what that means, but if I had to guess, I'd say they are re-doing the work with better
materials and better splices, after a quick patch in the field yesterday.
But usually a provider would re-route live circuits over different fiber first, or at least tell
you they are going to take down the circuits again.
(Assignee)

Comment 10

3 years ago
These circuits have been restored to service.
Status: REOPENED → RESOLVED
Last Resolved: 3 years ago
Resolution: --- → FIXED

Updated

3 years ago
Blocks: 1164957