build hosts in mtv reporting down



Infrastructure & Operations
7 years ago
5 years ago


(Reporter: arr, Unassigned)





7 years ago
We just got a number of pages for hosts not responding. fox2mike said he thinks it might be a network issue:

At least a partial list:
08:10:04 < arr> if they can't talk to sjc, critical
08:10:26 <@fox2mike> well, they're offline. there is no way you can reach them from sjc...
08:10:35 <@fox2mike> so I'm assuming they can't talk back to sjc as well
08:11:07 < arr> so critical, then, because that means portions of the production build infra are down

Bumping up and moving queues. Trying to call folks.
Assignee: server-ops → network-operations
Severity: normal → critical
Component: Server Operations → Server Operations: Netops
mv-buildproxy01 is the DNS server for all build hosts in mtv1

This host being down will have many repercussions. We're working on bringing it back up.
Here's the detail we have on the issue:

14:33 <zandr> Loss of OSPF adjacency between mtv1 and sjc1. all internal routes between the sites were down
14:34 <zandr> when that happened, we failed over to defunct static routes...routes that pointed to devices no longer present

The issue has now been completely resolved.
Last Resolved: 7 years ago
Resolution: --- → FIXED

1) OSPF adjacency with mtv1 was lost on one of our two core routers in sjc1

2) These core routers had legacy static routes designed to provide backup in the event of an OSPF failure but, due to recent network changes, those static routes were invalid (they pointed to devices no longer present)

3) Traffic is distributed equally between our two core routers in sjc1. Traffic shunted to core1 continued to flow as normal. Traffic shunted to core2 was dropped due to the loss of OSPF and presence of an invalid backup route.
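
To make the three points above concrete, here is a minimal Python sketch of the failure mode. It is an illustration only, not the actual router configuration: the device names "mtv1-border" and "retired-device", the Router class, and the round-robin split are all assumptions made for the example. It models two core routers that each prefer an OSPF-learned next hop and fall back to a static route when the adjacency is gone, and a small audit helper that flags static routes pointing at devices that no longer exist.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Set


@dataclass
class Router:
    name: str
    ospf_next_hop: Optional[str]  # None models a lost OSPF adjacency
    static_next_hop: str          # legacy backup static route (hypothetical name)

    def forward(self, live_devices: Set[str]) -> str:
        """Return 'delivered' or 'dropped' for traffic headed to the remote site."""
        # The OSPF-learned route is used while the adjacency is up;
        # otherwise the router falls back to its static route.
        if self.ospf_next_hop is not None:
            next_hop = self.ospf_next_hop
        else:
            next_hop = self.static_next_hop
        return "delivered" if next_hop in live_devices else "dropped"


def simulate(flows: int, routers: List[Router], live_devices: Set[str]) -> Dict[str, int]:
    """Split flows equally across the routers and count the outcomes."""
    results = {"delivered": 0, "dropped": 0}
    for flow_id in range(flows):
        router = routers[flow_id % len(routers)]  # simple equal distribution
        results[router.forward(live_devices)] += 1
    return results


def stale_static_routes(routers: List[Router], live_devices: Set[str]) -> List[str]:
    """Name the routers whose backup static route points at a missing device."""
    return [r.name for r in routers if r.static_next_hop not in live_devices]


# Hypothetical topology: only "mtv1-border" still exists on the network.
live = {"mtv1-border"}
core1 = Router("core1", ospf_next_hop="mtv1-border", static_next_hop="retired-device")
core2 = Router("core2", ospf_next_hop=None, static_next_hop="retired-device")

outcome = simulate(100, [core1, core2], live)
stale = stale_static_routes([core1, core2], live)
print(outcome)  # {'delivered': 50, 'dropped': 50}
print(stale)    # ['core1', 'core2']
```

In this sketch the stale backup route is present on both routers but only causes drops on core2, because core1 still has its OSPF route. That is why an audit like stale_static_routes would have to look at the configured routes rather than at live traffic to catch the problem before an adjacency is lost.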

Comment 6

7 years ago
/me hugs IT
Product: → Infrastructure & Operations