Closed Bug 1023481 Opened 7 years ago Closed 7 years ago

request for a 15-30 minute maintenance window (outage!) in phx1

Categories

(Infrastructure & Operations :: NetOps, task)

x86
macOS
task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: dcurado, Assigned: dcurado)

Details

Last week we tried optimizing BGP on our border routers.
If anyone wants the gory details, email me and I can walk you through them.

However, when we made the change, it resulted in an outage for some
important services in PHX1.  That was unexpected.  We gathered as much
data as we could, and were probably close to getting it resolved, but
we felt pressure to roll back to the original configuration.

We've now had a chance to review the data we collected, and have some
ideas about how we can move forward.  But, there is a risk that we'll
cause an outage.

I'd like to ask for a 15-30 minute maintenance window in PHX1, which may
(or may not) result in downtime for services in the marketplace BU.

I'd like to do this change on Saturday morning, 9am EST.
If the change does not work, I'll back out of it.
If we can not do this change at this time due to the need
for longer lead times, that's OK.
Assignee: network-operations → dcurado
Flags: cab-review?
Status: NEW → ASSIGNED
Might affect crash-stats.
May affect cloud services; adding a few CCs from them.
Flags: cab-review? → cab-review+
:cshields :dcurado - taking this back to the team to verify that this works for us. Kinda late notice...
Totally fair comment, and my apologies about short notice.
As previous:

> If we can not do this change at this time due to the need
> for longer lead times, that's OK.

Thanks very much
Please provide more details about what this entails and the benefits.  Taking the Marketplace down affects partners and 3rd parties and we need to give them notice and reasoning.

Also, is there a reason this bug is confidential?  It's difficult to share among relevant parties like this and I don't see confidential info in here.

Thanks.
Group: mozilla-employee-confidential
Karen - We're talking about a potential outage here.  How much time do you need in advance to give our operators, OEMs, etc. notice?
Flags: needinfo?(kward)
Details of what this entails:
 unwinding a complex routing design in the PHX1 data center where NLRI to the BUs within the
 data center is conveyed to the border routers via eBGP.  The border routers are then 
 exporting that routing information into OSPF, presumably so that the core switches in
 PHX1 will have a route for the other BUs in that data center.
 
 We want to stop exporting those routes into OSPF, because we have a problem with the
 way our iBGP mesh is currently configured.  Our iBGP configuration works, but it is 
 incorrect: the way packets leave our network on the way to their destination is non-deterministic.  
 This was set up years ago by other engineers, and we are the lucky ones to get
 to clean it up.  

Benefits: we would very much like to control how packets leave our network.  Otherwise we
 end up with some unexpected routes, sometimes with significantly (10x) higher latencies.
 Finally, the goal here is not only to clean up and optimize our routing configuration, 
 but to use BGP policy to control inbound and outbound traffic loads, which will save
 the organization money.
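
To illustrate the "use BGP policy to control outbound traffic" point above, here is a minimal sketch of how a router might pick one exit among several candidate paths. It is not the PHX1 configuration: the router names, router IDs, and attribute values are hypothetical, and only three tie-breakers (local preference, AS path length, router ID) are modeled, whereas a real BGP implementation evaluates more steps.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass(frozen=True)
class Path:
    exit_router: str   # hypothetical border router name
    local_pref: int    # set by policy; higher is preferred
    as_path_len: int   # number of ASes in the path; shorter is preferred
    router_id: str     # dotted-quad ID; lower is preferred as a last resort


def _router_id_key(router_id: str) -> Tuple[int, ...]:
    # Compare router IDs numerically, octet by octet.
    return tuple(int(octet) for octet in router_id.split("."))


def best_path(paths: Iterable[Path]) -> Path:
    # Simplified subset of the BGP decision process (assumption: all other
    # attributes are equal): highest local_pref, then shortest AS path,
    # then lowest router ID.
    return min(
        paths,
        key=lambda p: (-p.local_pref, p.as_path_len, _router_id_key(p.router_id)),
    )


# Without policy, both exits look the same and the router ID decides.
default = [
    Path("border1", 100, 3, "10.0.0.1"),
    Path("border2", 100, 3, "10.0.0.2"),
]

# With policy, raising local_pref on one exit makes the choice explicit.
with_policy = [
    Path("border1", 100, 3, "10.0.0.1"),
    Path("border2", 200, 3, "10.0.0.2"),
]

print(best_path(default).exit_router)      # border1
print(best_path(with_policy).exit_router)  # border2
```

The point of the sketch is only that the exit is chosen by attributes that policy can set, so the result is predictable once the routes are carried in BGP with those attributes intact.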
Can you give us 2 business days to notify our partners of an outage of more than 15 minutes?
Flags: needinfo?(kward)
Of course!  We were thinking about next Thursday, in the early AM hours.
Would that be OK?
Thanks,
Dave
Flags: needinfo?(kward)
Another note: we have identified a process that should largely eliminate outage risk.
(In reply to Dave Curado :dcurado from comment #9)
> Of course!  We were thinking about next thursday early AM hours.
> Would that be OK?
> Thanks,
> Dave

We would prefer that this be done during our low-traffic period, between 8PM and 4AM PDT.
Yes, it is ok.  Will there be a more precise window available to be communicated to the Operators?
Flags: needinfo?(kward)
Jason has said you (pl.) would prefer 8PM PDT to 4AM PDT.
I think the 8pm side of that would be easier for us -- just in terms of when we work and/or sleep.
Can we say 8pm to 8:30pm PDT?

And, as James Barnell mentioned, we think we have a way to make these changes without having
an outage, we just can't be 100% sure of that.  
Usually, we could predict such things with more certainty, but what we're doing is unraveling some
legacy configurations to make the network more manageable; sometimes we get tripped up by something
unexpected. 
Thanks
Flags: needinfo?(kward)
8pm on June 26th - got it.  I'll send out an email to interested parties on this side.  Thanks for working with us on a date/time. :)
Flags: needinfo?(kward)
This work has been successfully completed.
Managed to do it without an outage to any services. =-)

Thanks for your help and patience on this!
Dave
Status: ASSIGNED → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Change Request: --- → approved
Flags: cab-review+