Closed
Bug 1137751
Opened 10 years ago
Closed 10 years ago
phx1 filer outage 2015-02-27
Categories
(Infrastructure & Operations :: MOC: Problems, task)
Tracking
(Not tracked)
RESOLVED
WONTFIX
People
(Reporter: gcox, Unassigned)
Details
2015-02-27
07:37 PT, I start a filer upgrade in phx1.
08:00ish PT, I start the 4th-and-final head in the upgrade, where head 2 goes down and head 1 takes over.
08:06 PT, first alerts of issues come in: svn/socorro/bugzilla items are losing their mounts.
08:15 PT, I've triaged it enough to see where the issue is, data begins flowing again.
Comment 1•10 years ago
This affected svn and socorro as well.
Comment 2•10 years ago
<gcox> So, the XREs manage to kick us from beyond the grave. This was a single-link problem, filer started talking to the other core, then somewhere along the line everything went to the other core, but the link didn't drop, so one head got orphaned.
| Reporter | ||
Comment 3•10 years ago
So the issue centers around the split cores since the XREs were removed from the core. The filer has active-passive (single-mode as they're referred to in the filer) portchannels to the core. The filer is slow to change between "using one" and "using the other", and won't drop a connection if it still thinks it's good.
For 3 of the heads, the reboots and upgrades went fine. However, on the 4th head, the connection to core2 was up, while the connection to core1 was effectively down: it had link, but no data was actually flowing over it. That's when the visible issues began. This feels like one of those asymmetric networking things.
Long-term we should get this off the core and down to a more local switch that still can do active-active.
Short-term if someone from netops can take a look with me on where things went sideways, that'd be good.
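To make the failure mode above concrete, here is a minimal sketch of an active-passive uplink choice that only looks at link state, next to one that also requires traffic to get through. This is an illustration, not the filer's actual logic: the `Uplink` class, the `passes_data` flag (standing in for some end-to-end probe), and both `pick_*` functions are hypothetical names, and it assumes core1 was the uplink the 4th head was sitting on.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Uplink:
    name: str
    has_link: bool      # carrier state, the only thing a link-only check sees
    passes_data: bool   # hypothetical: whether traffic actually gets through


def pick_by_link_only(current: Uplink, standby: Uplink) -> Optional[Uplink]:
    """Stay on the current uplink for as long as it reports link."""
    if current.has_link:
        return current
    if standby.has_link:
        return standby
    return None


def pick_by_link_and_probe(current: Uplink, standby: Uplink) -> Optional[Uplink]:
    """Prefer the current uplink, but only if it has link and passes data."""
    for uplink in (current, standby):
        if uplink.has_link and uplink.passes_data:
            return uplink
    return None


# Assumed state of the 4th head: core1 has link but moves no data, core2 is healthy.
core1 = Uplink("core1", has_link=True, passes_data=False)
core2 = Uplink("core2", has_link=True, passes_data=True)

chosen = pick_by_link_only(current=core1, standby=core2)
probed = pick_by_link_and_probe(current=core1, standby=core2)

print(chosen.name, chosen.passes_data)  # core1 False
print(probed.name, probed.passes_data)  # core2 True
```

The link-only choice never leaves core1, because the link never drops, which matches the orphaned head seen here. An active-active setup would sidestep the choice entirely rather than improve the check.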
| Reporter | ||
Comment 5•10 years ago
dcurado is on PTO. Mailed :johnb some deep-divey rambly things, per a nudge from :jbarnell.
Flags: needinfo?(dcurado)
| Reporter | ||
Comment 6•10 years ago
Met with :johnb on 2015-03-12 for an architecture review / familiarization. Trying to find out if there's a config that leaves everyone happy, short of a total stack rebuild.
Comment 7•10 years ago
:gcox,
Any findings?
| Reporter | ||
Comment 8•10 years ago
The onus on this one is more on netops. The filer side determines link sufficiency by the simplistic method of "do I have link?". Since we lost LACP / active-active when we lost the XREs, we became vulnerable.
I am guessing this will go wontfix since it's not a great design and this is phx1.
Comment 9•10 years ago
Sounds about right to me, phx1 isn't too long for this world.
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → WONTFIX
| Assignee | ||
Updated•8 years ago
Component: MOC: Incidents → MOC: Problems