Closed Bug 1137751 Opened 10 years ago Closed 10 years ago

phx1 filer outage 2015-02-27

Categories

(Infrastructure & Operations :: MOC: Problems, task)

task
Not set
normal

Tracking

(Not tracked)

RESOLVED WONTFIX

People

(Reporter: gcox, Unassigned)

Details

2015-02-27 07:37 PT, I start a filer upgrade in phx1. 08:00ish PT, I start the 4th-and-final head in the upgrade, where head 2 goes down and head 1 takes over. 08:06 PT, first alerts of issues come in, svn/socorro/bugzilla items are losing their mounts. 08:15 PT, I've triaged it enough to see where the issue is, data begins flowing again.
This affected svn and socorro as well.
<gcox> So, the XREs manage to kick us from beyond the grave. This was a single-link problem, filer started talking to the other core, then somewhere along the line everything went to the other core, but the link didn't drop, so one head got orphaned.
So the issue centers around the split cores since the XREs were removed from the core. The filer has active-passive (single-mode, as the filer calls them) portchannels to the core. The filer is slow to change between "using one" and "using the other", and won't drop a connection if it still thinks it's good. The reboots and upgrades of 3 of the heads worked fine. However, on the 4th head, its connection to core2 was up, its connection to core1 was down, but it had link with no data actually flowing over it, and that's when the visible issues began. This feels like one of those asymmetric networking things. Long-term, we should get this off the core and down to a more local switch that can still do active-active. Short-term, if someone from netops can take a look with me at where things went sideways, that'd be good.
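To make the failure mode above concrete, here is a minimal Python sketch of an active-passive selector that trusts carrier alone, next to one that also wants a recent keepalive from the peer (loosely in the spirit of LACP). Everything in it is hypothetical: the `Port` class, the port names, the timestamps, and the 3-second timeout are illustrative assumptions, not the filer's actual logic or config. Which leg was the stale one is also assumed here.

```python
from dataclasses import dataclass


@dataclass
class Port:
    """One leg of a hypothetical two-port group from a filer head to the cores."""
    name: str
    link_up: bool           # physical carrier present
    last_peer_frame: float  # time (seconds) a keepalive was last seen from the peer


def pick_link_only(ports, active):
    """Active-passive selection that trusts carrier alone: stay on the
    current port while it has link, otherwise take the first port with link."""
    if active is not None and active.link_up:
        return active
    return next((p for p in ports if p.link_up), None)


def pick_with_keepalive(ports, now, timeout=3.0):
    """Selection that also requires a recent keepalive from the peer.
    The timeout value is an illustrative assumption."""
    return next(
        (p for p in ports if p.link_up and now - p.last_peer_frame <= timeout),
        None,
    )


# Scenario modelled on the 4th head: one leg has link but no traffic.
now = 100.0
core1 = Port("to-core1", link_up=True, last_peer_frame=40.0)  # link, no data (assumed)
core2 = Port("to-core2", link_up=True, last_peer_frame=99.5)  # passing traffic
ports = [core1, core2]

stuck = pick_link_only(ports, active=core1)
healthy = pick_with_keepalive(ports, now)
print("link-only selector stays on:", stuck.name)
print("keepalive selector moves to:", healthy.name)
```

The link-only selector never leaves the stale leg because carrier is still present, which is the "orphaned head" behavior. The keepalive variant skips that leg once the peer has been silent longer than the timeout.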
Dave, can you chime in?
Flags: needinfo?(dcurado)
dcurado is PTO. Mailed :johnb some deep-divey rambly things, per a nudge from :jbarnell.
Flags: needinfo?(dcurado)
Met with :johnb on 2015-03-12 for an architecture review / familiarization. Trying to find out if there's a config that lends itself to everyone being happy, short of a total stack rebuild.
:gcox, Any findings?
The onus on this one is more on netops. The filer side determines link sufficiency by the simplistic method of "do I have link?". Since we lost LACP / active-active when we lost the XREs, we became vulnerable. I am guessing this will go wontfix since it's not a great design and this is phx1.
Sounds about right to me, phx1 isn't too long for this world.
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → WONTFIX
Component: MOC: Incidents → MOC: Problems