build hosts in mtv reporting down

RESOLVED FIXED

Status

Product: Infrastructure & Operations
Component: NetOps
Priority: --
Severity: critical
Status: RESOLVED FIXED
Opened: 7 years ago
Updated: 5 years ago

People

(Reporter: arr, Unassigned)

Description

7 years ago
We just got a number of pages for hosts not responding. fox2mike said he thinks it might be a network issue.

At least a partial list:

buildbot-master3.build
ganglia3.build.mtv1
geriatric-master.build
test-master01.build
kvm1.build.mtv1
kvm2.build.mtv1
mv-buildproxy01.build
staging-mobile-master.build
production-mobile-master.build
08:10:04 < arr> if they can't talk to sjc, critical
08:10:26 <@fox2mike> well, they're offline. there is no way you can reach them from sjc...
08:10:35 <@fox2mike> so I'm assuming they can't talk back to sjc as well
08:11:07 < arr> so critical, then, because that means portions of the production build infra are down

Bumping up and moving queues. Trying to call folks.
Assignee: server-ops → network-operations
Severity: normal → critical
Component: Server Operations → Server Operations: Netops
mv-buildproxy01 is the DNS server for all build hosts in mtv1

This host being down will have many repercussions. We're working on bringing it back up.
Here's the detail we have on the issue:

14:33 <zandr> Loss of OSPF adjacency between mtv1 and sjc1. all internal routes between the sites were down
14:34 <zandr> when that happened, we failed over to defunct static routes...routes that pointed to devices no longer present

The issue has been completely resolved now.
Status: NEW → RESOLVED
Last Resolved: 7 years ago
Resolution: --- → FIXED
Summary:

1) OSPF adjacency with mtv1 was lost on one of our two core routers in sjc1

2) These core routers had legacy static routes designed to provide backup in the event of an OSPF failure but, due to recent network changes, those static routes were invalid (they pointed to devices no longer present)

3) Traffic is distributed equally between our two core routers in sjc1. Traffic shunted to core1 continued to flow as normal. Traffic shunted to core2 was dropped due to the loss of OSPF and presence of an invalid backup route.
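
For anyone reading this later, the three points above can be sketched as a toy model. This is only an illustration, not our router config: the next-hop names (`mtv1-border`, `retired-device`) and the backup distance of 250 are made up, and the even/odd flow split is a stand-in for however traffic is actually balanced across the two cores. The only real-world value assumed is 110, a common default administrative distance for OSPF.

```python
from dataclasses import dataclass
from typing import List, Optional, Set

OSPF_DISTANCE = 110           # common default administrative distance for OSPF
STATIC_BACKUP_DISTANCE = 250  # hypothetical distance for the backup static route


@dataclass(frozen=True)
class Route:
    source: str    # "ospf" or "static"
    distance: int  # lower distance is preferred
    next_hop: str  # hypothetical device name


def best_route(routes: List[Route]) -> Optional[Route]:
    """Pick the route with the lowest administrative distance, if any."""
    return min(routes, key=lambda r: r.distance) if routes else None


def delivered(routes: List[Route], live_next_hops: Set[str]) -> bool:
    """A packet is delivered only if the chosen route's next hop still exists."""
    best = best_route(routes)
    return best is not None and best.next_hop in live_next_hops


# Hypothetical topology: one live path to mtv1, one device that was removed.
live = {"mtv1-border"}
ospf = Route("ospf", OSPF_DISTANCE, "mtv1-border")
stale_static = Route("static", STATIC_BACKUP_DISTANCE, "retired-device")

core1 = [ospf, stale_static]  # adjacency up: the OSPF route wins
core2 = [stale_static]        # adjacency lost: only the stale static remains


def simulate(flow_count: int) -> int:
    """Return how many flows are dropped when split evenly across both cores."""
    dropped = 0
    for flow_id in range(flow_count):
        table = core1 if flow_id % 2 == 0 else core2
        if not delivered(table, live):
            dropped += 1
    return dropped


if __name__ == "__main__":
    print("core1 delivers:", delivered(core1, live))
    print("core2 delivers:", delivered(core2, live))
    print("dropped flows out of 10:", simulate(10))
```

In this model the stale static route does no harm while OSPF is up, because the lower-distance OSPF route is always preferred. It only becomes the active route once the adjacency drops, which is why an invalid backup route can go unnoticed until the failure it was meant to cover.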
Thanks!

Comment 6

7 years ago
/me hugs IT
Product: mozilla.org → Infrastructure & Operations