Bug 1164484 (Closed) · Opened 9 years ago · Closed 9 years ago

We lost both 10G links to PHX1 from our POPs

Categories
Infrastructure & Operations Graveyard :: NetOps
Type: task; Priority: Not set; Severity: normal

Tracking
Not tracked
Status: RESOLVED FIXED

People
Reporter: Usul; Assigned: dcurado

Details
<nagios-phx1> Wed 08:14:03 PDT [1277] nagios1.private.pek1.mozilla.com (10.24.75.42) is DOWN :PING CRITICAL - Packet loss = 100%
<nagios-scl3> Wed 08:14:19 PDT [5174] webwewant.mozilla.org (63.245.217.19) is DOWN :PING CRITICAL - Packet loss = 100%
<nagios-euw1> Wed 11:14:24 EDT [8176] nagios1.private.scl3.mozilla.com (10.22.75.42) is DOWN :PING CRITICAL - Packet loss = 100%
<nagios-corp-phx1> Wed 08:14:43 PDT [3001] nagios1.private.pek1.mozilla.com (10.24.75.42) is DOWN :PING CRITICAL - Packet loss = 100%
<nagios-releng> Wed 08:14:53 PDT [4649] nagios1.private.phx1.mozilla.com (10.8.75.19) is DOWN :PING CRITICAL - Packet loss = 100%
<nagios-phx1> Wed 08:15:03 PDT [1280] nagios1.private.scl3.mozilla.com (10.22.75.42) is DOWN :PING CRITICAL - Packet loss = 100%
<nagios-corp-phx1> Wed 08:15:03 PDT [3002] nagios1.private.scl3.mozilla.com (10.22.75.42) is DOWN :PING CRITICAL - Packet loss = 100%
<nagios-scl3> Wed 08:16:09 PDT [5175] nagios1.private.phx1.mozilla.com (10.8.75.19) is DOWN :PING CRITICAL - Packet loss = 100%
<nagios-scl3> Wed 08:16:09 PDT [5177] nagios1.private.corp.phx1.mozilla.com (10.20.75.46) is DOWN :PING CRITICAL - Packet loss = 100%
<nagios-scl3> Wed 08:16:10 PDT [5178] nagios1.private.euw1.mozilla.com (10.150.75.12) is DOWN :PING CRITICAL - Packet loss = 100%
Looks like we lost both 10G links to PHX1 from our POPs.
I will contact Zayo right away.
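For context, a minimal sketch of the kind of ICMP reachability test behind the alerts above. The real checks come from nagios's check_ping plugin, so the host list, ping count, and timeout below are illustrative assumptions (the IPs are copied from the paste); it assumes a Unix ping(8), where on Linux a non-zero exit means no replies came back.

    import subprocess

    # Hosts copied from the nagios alerts above; the selection is illustrative.
    HOSTS = {
        "nagios1.private.phx1.mozilla.com": "10.8.75.19",
        "nagios1.private.corp.phx1.mozilla.com": "10.20.75.46",
    }

    def is_down(ip, count=5, timeout=2):
        """Send `count` pings; on Linux, ping exits non-zero when no
        replies are received, which we treat as 100% packet loss."""
        result = subprocess.run(
            ["ping", "-c", str(count), "-W", str(timeout), ip],
            capture_output=True,
        )
        return result.returncode != 0

    for name, ip in HOSTS.items():
        if is_down(ip):
            print(f"{name} ({ip}) is DOWN :PING CRITICAL - Packet loss = 100%")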
Assignee: network-operations → dcurado
Status: NEW → ASSIGNED
<hwine> usul_training: mana is unreachable for me from SFO
<usul_training> https://bugzilla.mozilla.org/show_bug.cgi?id=1164484
<hwine> usul_training: ah, seems like a good item for the topic here, since whistlepig also appears to be offline
<glob> hwine, fwiw i can access whistlepig
<usul_training> I'll send a whistlepig
<usul_training> probably
<arr> did we just lose phx1?
<glob> (but replication between bmo's scl3 and phx1 clusters is broken)
<usul_training> arr looks like it
<usul_training> dcurado is looking
<usul_training> the bug is https://bugzilla.mozilla.org/show_bug.cgi?id=1164484
<dcurado> yes, we lost both 10G circuits into PHX1
<dcurado> However, there is a back door tunnel thing that should be working
<arr> we're seeing nagios failures
<arr> like of the nagios server in phx itself :}
<arr> and dc2
<usul_training> glob, whistlepig is slow for me
<rhelmer> hm I am having trouble logging into any of our servers, from the VPN or the jumphost
Summary: some links look like they are down → We lost both 10G links to PHX1 from our POPs
Zayo is aware of the problem.
Ticket numbers are 703860 and 703864.
Zayo is still determining the extent of the problem, but the tech I spoke with said:
"I think this is a pretty big outage because I haven't had a breath in 20 minutes"
I found a missing route in the BGP policy on Adam Newman's IPSec Love Child Backdoor link, for corp.phx1 (10.20/16). I fixed that, and now we appear to have 100% reachability again, even while the 2 x 10GE links from Zayo are down. Which is pretty cool. (See the sketch after this comment for the kind of check that would catch this.)

As Zayo still doesn't really know what is going on, I suspect this may be a prolonged outage.
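To make the fix above concrete: a minimal sketch, assuming nothing about the actual router configuration, of the kind of audit that would have flagged corp.phx1's prefix as missing from the tunnel's BGP export policy. The site names, prefix lists, and the covered() helper are all hypothetical.

    import ipaddress

    # Illustrative only: internal networks each site should be reachable
    # through over the backup tunnel, vs. the prefixes actually present
    # in the tunnel's BGP export policy.
    EXPECTED = {
        "phx1": "10.8.0.0/16",
        "corp.phx1": "10.20.0.0/16",  # the prefix that was missing
    }
    EXPORTED = ["10.8.0.0/16"]  # before the fix: corp.phx1 not exported

    def covered(prefix, exported):
        """True if `prefix` equals or falls within an exported route."""
        net = ipaddress.ip_network(prefix)
        return any(net.subnet_of(ipaddress.ip_network(e)) for e in exported)

    for site, prefix in EXPECTED.items():
        state = "ok" if covered(prefix, EXPORTED) else "MISSING from export policy"
        print(f"{site:12} {prefix:16} {state}")

Run as-is, this prints corp.phx1 as MISSING, matching the symptom: traffic to 10.20/16 had no route over the backdoor link until the policy was fixed.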
From #moc:

09:57:15 < jbarnell> Zayo is observing a fiber cut between Roll, AZ and Winterhaven, CA which is impacting our longhaul network between Phoenix, AZ and Los Angeles, CA. Technicians are in the area locating the point of damage at this time. We will continue to provide updates immediately as they become available.
Connectivity has been restored.
Status: ASSIGNED → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
It looks like this issue may be recurring. Dave C is looking into it.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
I contacted the provider, Zayo. They said the original repair (yesterday) was not 100% complete, and they are now splicing the cable.
Hard to know what that means, but if I had to guess, I'd say they are re-doing the work with better materials and better splices, after a quick patch in the field yesterday.
Usually, though, a provider would re-route live circuits over different fiber first, or at least tell you they are going to take down the circuits again.
These circuits have been restored to service.
Status: REOPENED → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard