Closed Bug 1042993 Opened 10 years ago Closed 10 years ago

remove graceful restart from network device configurations

Categories

(Infrastructure & Operations Graveyard :: NetOps, task)

x86
macOS
task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: dcurado, Assigned: dcurado)

Details

- remove graceful-restart from:
  Why?: There are a few reasons to use graceful-restart.  The main
        reason is to be able to use graceful-routing-engine-switchover (GRES),
        which allows us to switch from a primary routing-engine to a backup
        without dropping packets.  However, none of our border routers and few
        of our core switches have a backup routing-engine.
        graceful-restart is still useful with only one routing-engine:
        it allows us to restart routing protocols without disrupting traffic.
        However, we rarely restart routing protocols or the RPD process.
        In addition, we are interested in deploying Bidirectional Forwarding
        Detection (BFD), which conflicts with graceful-restart;
        i.e. only one of the two should be configured when using BGP.
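
As a rough illustration of the conflict described above, the sketch below scans a device configuration in `set` format (the form printed by `show configuration | display set`) and reports whether global graceful-restart and BFD for BGP are both present. The function name and the sample lines are hypothetical, and the matching is deliberately simplistic: it only looks at the top-level `routing-options` and `protocols bgp` hierarchies, not at routing instances or logical systems.

```python
def audit_config(set_lines):
    """Report graceful-restart and BGP BFD usage in 'set'-format config lines."""
    has_gr = False
    has_bgp_bfd = False
    for line in set_lines:
        tokens = line.split()
        if not tokens or tokens[0] != "set":
            continue
        # Global graceful-restart lives under routing-options.
        if tokens[1:3] == ["routing-options", "graceful-restart"]:
            # "graceful-restart disable" explicitly turns the feature off.
            if "disable" not in tokens[3:]:
                has_gr = True
        # BFD for BGP is configured with bfd-liveness-detection under protocols bgp.
        if tokens[1:3] == ["protocols", "bgp"] and "bfd-liveness-detection" in tokens:
            has_bgp_bfd = True
    return {
        "graceful_restart": has_gr,
        "bgp_bfd": has_bgp_bfd,
        "conflict": has_gr and has_bgp_bfd,
    }


# Hypothetical sample configuration, not taken from any real device.
sample = [
    "set routing-options graceful-restart",
    "set protocols bgp group transit bfd-liveness-detection minimum-interval 300",
]
print(audit_config(sample))
```

Running this against each device's saved configuration would give a quick list of where graceful-restart still has to be removed before BFD is enabled.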

So, we'd like to remove graceful-restart from all the devices in our
network that currently have it configured:

	+ agg1.s301.ops.phx1.mozilla.net
	+ border1.console.pao1.mozilla.net
	+ border1.console.scl3.mozilla.net
	+ border1.console.sjc2.mozilla.net
	+ border1.phx1.mozilla.net
	+ border2.console.scl3.mozilla.net
	+ border2.phx1.mozilla.net
	+ core1.corp.console.scl3.mozilla.net
	+ core1.corp.phx1.mozilla.net
	+ core1.svc.phx1.mozilla.net
	+ fw1.akl1.mozilla.net
	+ fw1.corp.console.scl3.mozilla.net
	+ fw1.corp.phx1.mozilla.net
	+ fw1.lon1.mozilla.net
	+ fw1.ops.par1.mozilla.net
	+ fw1.ops.pdx1.mozilla.net
	+ fw1.ops.scl1.mozilla.net
	+ fw1.phx1.mozilla.net
	+ fw1.releng.scl3.mozilla.net
	+ fw1.scl3.mozilla.net
	+ fw1.sfo1.mozilla.net
	+ fw1.svc.phx1.mozilla.net
	+ fw1.tor1.mozilla.net
	+ switch1.r101-10.ops.scl3.mozilla.net
	+ switch1.r301-10.ops.scl3.mozilla.net

There is no documentation on the impact of removing graceful-restart from a
switch, router, or firewall configuration.  While graceful-restart is configured
on these devices, it is *not* configured as part of any protocol configuration.
Worst case: protocol adjacencies will be cleared when this configuration line
is removed.
Best case: nothing will happen when this configuration line is removed.

Either way, we'll do this change one device at a time, making sure that
the network is in a good working state before moving on to the next device.
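
The one-device-at-a-time procedure can be sketched roughly as follows. This is only an outline of the control flow: `apply_change` and `network_healthy` are hypothetical callables supplied by the caller (for example, a wrapper around whatever tooling pushes the configuration and whatever monitoring check is trusted), and nothing here talks to a real device.

```python
def rollout(devices, apply_change, network_healthy):
    """Apply a change to one device at a time.

    apply_change(device) and network_healthy() are hypothetical callables
    provided by the caller. The rollout stops before touching the next
    device as soon as the health check fails.
    Returns (changed, remaining).
    """
    changed = []
    for index, device in enumerate(devices):
        # Never start on a new device while the network is still unsettled.
        if not network_healthy():
            return changed, list(devices[index:])
        apply_change(device)
        changed.append(device)
    return changed, []


# Dry-run usage with stand-in callables; device names come from the list above.
done, left = rollout(
    ["fw1.akl1.mozilla.net", "fw1.lon1.mozilla.net"],
    apply_change=lambda device: print("would change", device),
    network_healthy=lambda: True,
)
print(done, left)
```

Returning the untouched devices makes it easy to resume later, which matches the intent of deferring the remaining devices to another window if the network does not settle.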

Total Maintenance Time: 2 hours
Expected Impact: A series of short periods of routing churn
Assignee: network-operations → dcurado
Flags: cab-review?
Status: NEW → ASSIGNED
Approved by the CAB on July 23rd. When are we doing this, Dave?
Flags: cab-review? → cab-review+
We removed graceful restart from the remote office firewalls and some switches.
As mentioned above, there is no documentation from Juniper about the impact
of removing graceful restart.
What we learned is that it probably restarts the Routing Protocol Daemon, aka RPD.
That means all protocols restart.
That means all BGP sessions restart.
Rather than wreaking temporary havoc on the data centers by clearing all the BGP
sessions there, we opted to wait until the upcoming TCW to do that.
We want to clean this stuff up, but there is no need to cause problems in order to do so.
Graceful restart has been unconfigured from all of our equipment except border1.sjc2.
We'll have to take care of that some time, but making this change causes a long
and disturbing re-convergence time for the entire network.

We made this change to border1.pao1, and it took a long time to reconverge.
Not wanting to do that twice in one day, we left border1.sjc2 configured
with graceful-restart for now.
Status: ASSIGNED → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Change Request: --- → approved
Flags: cab-review+
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard