We just got a number of pages for hosts not responding. fox2mike said he thinks it might be a network issue.

At least a partial list:
buildbot-master3.build
ganglia3.build.mtv1
geriatric-master.build
test-master01.build
kvm1.build.mtv1
kvm2.build.mtv1
mv-buildproxy01.build
staging-mobile-master.build
production-mobile-master.build
08:10:04 < arr> if they can't talk to sjc, critical
08:10:26 <@fox2mike> well, they're offline. there is no way you can reach them from sjc...
08:10:35 <@fox2mike> so I'm assuming they can't talk back to sjc as well
08:11:07 < arr> so critical, then, because that means portions of the production build infra are down

Bumping up and moving queues. Trying to call folks.
Assignee: server-ops → network-operations
Severity: normal → critical
Component: Server Operations → Server Operations: Netops
mv-buildproxy01 is the DNS server for all build hosts in mtv1. This host being down will have many repercussions. We're working on bringing it back up.
Here's the detail we have on the issue:

14:33 <zandr> Loss of OSPF adjacency between mtv1 and sjc1. all internal routes between the sites were down
14:34 <zandr> when that happened, we failed over to defunct static routes...routes that pointed to devices no longer present

The issue has been completely resolved now.
Status: NEW → RESOLVED
Last Resolved: 7 years ago
Resolution: --- → FIXED
Summary:
1) OSPF adjacency with mtv1 was lost on one of our two core routers in sjc1.
2) These core routers had a legacy static route designed to provide backup in the event of an OSPF failure but, due to recent network changes, the static routes were invalid.
3) Traffic is distributed equally between our two core routers in sjc1. Traffic shunted to core1 continued to flow as normal. Traffic shunted to core2 was dropped due to the loss of OSPF and presence of an invalid backup route.
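
For anyone trying to picture why only part of the traffic was affected, here is a rough Python sketch of the failure mode summarized above. It is a toy model, not the actual router configuration: the device names (`mtv1-edge`, `retired-device`), the distance values, and the round-robin split between the two cores are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    source: str    # "ospf" or "static"
    next_hop: str  # hypothetical device name
    distance: int  # illustrative preference value; lower wins


def best_route(routes):
    """Pick the most preferred route still in the table, or None if empty."""
    return min(routes, key=lambda r: r.distance) if routes else None


def forward(routes, live_devices):
    """Return "forwarded" if the chosen next hop still exists, else "dropped"."""
    route = best_route(routes)
    if route is None or route.next_hop not in live_devices:
        return "dropped"
    return "forwarded"


# Hypothetical inventory: the device the old static route points at is gone.
LIVE = {"mtv1-edge"}

ospf = Route("ospf", "mtv1-edge", 110)
stale_static = Route("static", "retired-device", 250)

core1 = [ospf, stale_static]  # adjacency intact: OSPF route is preferred
core2 = [stale_static]        # adjacency lost: only the stale backup remains


def simulate(flows):
    """Alternate flows between core1 and core2 and count the outcomes."""
    results = {}
    for i in range(flows):
        table = core1 if i % 2 == 0 else core2
        outcome = forward(table, LIVE)
        results[outcome] = results.get(outcome, 0) + 1
    return results


print(simulate(10))
```

In this model the backup route only becomes visible once the OSPF route disappears, which is why a stale static route can sit unnoticed until the moment it is needed. Checking each static next hop against a current device inventory, as `forward` does here, is one way such a route might be caught earlier.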
/me hugs IT
Product: mozilla.org → Infrastructure & Operations