Graceful Restart

This document describes the test plan for the GracefulRestart feature. More details about the feature itself are available in the GracefulRestart Functional Specifications.

Release: 3.2

GracefulRestart (GR) has two pieces embedded in it. The first is Graceful Restart Helper mode, wherein the controller keeps the routes of all of its BGP peers and XMPP agents even if a session goes down (for a certain period). If and when the session comes back up, the routes are cleaned up using the standard mark-and-sweep approach. In this scenario, the control-node itself does not undergo any restart per se.

The second piece is where the control-node itself restarts. In this scenario, called GracefulRestart mode, the control-node advertises the GR capability to its BGP peers (before restarting) so as to avail itself of the GR helper mode functionality provided by the peer (such as a JUNOS MX).

Test Status

Reference

### GracefulRestart Mode

  1. Configure the GracefulRestart timer interval via web-ui/api (in the vnc_cfg.xsd schema, under global-system-config); a configuration sketch follows this list
  2. Bring up the controller, peers and agents, and learn the routes
  3. Stop the control-node (`service supervisor-control stop`)
  4. The routes advertised by the control-node towards its peers (such as an MX running JUNOS) should continue to remain in the peers as 'STALE'. Once the control-node comes back up, the routes should get re-advertised and should no longer remain stale in the BGP peers
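
A minimal sketch of step 1 using the Contrail VNC Python API. The GracefulRestartParametersType field names, timer units and values here are assumptions drawn from the vnc_cfg.xsd schema and should be verified against the release in use:

```python
# Hedged sketch: configure GR/LLGR timers on global-system-config via the
# VNC API. Field names of GracefulRestartParametersType are assumptions
# based on the vnc_cfg.xsd schema; verify against the 3.2 schema before use.
from vnc_api.vnc_api import VncApi, GracefulRestartParametersType

api = VncApi(api_server_host='127.0.0.1')  # assumed API server address
gsc = api.global_system_config_read(fq_name=['default-global-system-config'])

gr_params = GracefulRestartParametersType(
    enable=True,
    restart_time=60,              # GR timer, seconds (assumed units/value)
    long_lived_restart_time=300,  # LLGR timer, seconds (assumed units/value)
    end_of_rib_timeout=30,        # EoR receive timeout, seconds (assumed)
)
gsc.set_graceful_restart_parameters(gr_params)
api.global_system_config_update(gsc)
```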

Events to test and verify

  1. Only GR time intervals configured, and only in the DUT (the restarting contrail-control node)
  2. Only LLGR time intervals configured, and only in the DUT (the restarting contrail-control node)
  3. Both GR and LLGR time intervals configured, but only in the DUT (the restarting contrail-control node)
  4. Only GR time intervals configured, and only in the MX peer (the non-restarting GR helper node)
  5. Only LLGR time intervals configured, and only in the MX peer (the non-restarting GR helper node)
  6. Both GR and LLGR time intervals configured, but only in the MX peer (the non-restarting GR helper node)
  7. Only GR time intervals configured, in both the GR helper MX node and the restarting contrail-control node
  8. Only LLGR time intervals configured, in both the GR helper MX node and the restarting contrail-control node
  9. Both GR and LLGR time intervals configured, in both the GR helper MX node and the restarting contrail-control node
  10. GR/LLGR configured selectively for certain address families (or all address families) in the MX JUNOS peer
  11. Session reset due to cold reboots, warm reboots, process restarts, config changes, etc.
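
Scenarios 1-9 above form a 3x3 matrix: which timers are configured (GR only, LLGR only, or both) crossed with where they are configured (DUT only, MX peer only, or both). A small, purely illustrative sketch to enumerate that matrix for test automation:

```python
# Illustrative only: enumerate the 3x3 GR/LLGR placement matrix behind
# scenarios 1-9 so a harness can iterate over them in order.
from itertools import product

PLACEMENTS = ('DUT only', 'MX peer only', 'DUT and MX peer')
TIMERS = ('GR only', 'LLGR only', 'GR and LLGR')

for num, (placement, timers) in enumerate(product(PLACEMENTS, TIMERS), 1):
    print(f'scenario {num}: {timers} time intervals configured on {placement}')
```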

In all the above scenarios,

  1. GR and LLGR functionality must be verified
  2. Routes should remain in the table until the GR timer (and LLGR timer) expires, as negotiated
  3. EndOfRib must be sent out by the restarting contrail-control node only after at least BgpPeer::kMinEndOfRibSendTimeUsecs and at most BgpPeer::kMaxEndOfRibSendTimeUsecs (a verification sketch follows this list)
  4. If the number of updates to be sent out by the restarting node is small, then EoR can be expected soon after BgpPeer::kMinEndOfRibSendTimeUsecs; otherwise, only after the output queue is fully drained (this can be checked in introspect)
  5. GR and/or LLGR comes into effect only if GR is negotiated for all address families carried over the session
  6. Changes to the negotiated list of families in GR should result in (non-graceful) session closure
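
A hedged sketch of the EoR timing check from item 3. The helper get_eor_send_delay_usecs() is hypothetical (e.g. scraped from the control-node introspect pages), and the bound values are placeholders; the authoritative constants live in BgpPeer:

```python
# Sketch of the EndOfRib timing verification. Both the helper and the
# constant values below are assumptions; the real bounds are
# BgpPeer::kMinEndOfRibSendTimeUsecs / BgpPeer::kMaxEndOfRibSendTimeUsecs.
K_MIN_EOR_SEND_USECS = 10 * 1000000  # placeholder, check BgpPeer sources
K_MAX_EOR_SEND_USECS = 60 * 1000000  # placeholder, check BgpPeer sources

def verify_eor_timing(get_eor_send_delay_usecs):
    # get_eor_send_delay_usecs: hypothetical callable returning the delay
    # between session (re-)establishment and EoR transmission, in usecs.
    delay = get_eor_send_delay_usecs()
    assert K_MIN_EOR_SEND_USECS <= delay <= K_MAX_EOR_SEND_USECS, (
        'EoR sent outside the expected window: %d usecs' % delay)
```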

### GracefulRestart Helper Mode

This is the more complicated mode of the two. In this mode, if GR is negotiated on a session, then the routes received from a peer are kept intact even after the session goes down. The routes are managed using the standard mark-and-sweep approach, sketched below.
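
A minimal, self-contained illustration of the mark-and-sweep idea (this is not the contrail-control implementation): on session down, the helper keeps the peer's routes but marks them stale; routes re-learned after the peer comes back are unmarked; whatever is still stale when the GR/LLGR timers expire is swept:

```python
# Toy illustration of mark-and-sweep route retention in GR helper mode.
class HelperTable:
    def __init__(self):
        self.routes = {}  # prefix -> {'stale': bool}

    def learn(self, prefix):
        # (Re-)learning a route clears any stale mark.
        self.routes[prefix] = {'stale': False}

    def session_down(self):
        # Mark phase: retain all routes but flag them stale. A real helper
        # would also start the GR (and possibly LLGR) timer here.
        for state in self.routes.values():
            state['stale'] = True

    def sweep(self):
        # Sweep phase: runs at GR/LLGR timer expiry or after EndOfRib.
        # Routes not refreshed since session_down() are deleted.
        self.routes = {p: s for p, s in self.routes.items()
                       if not s['stale']}

table = HelperTable()
table.learn('10.0.0.0/24')
table.learn('10.0.1.0/24')
table.session_down()        # both routes retained, marked stale
table.learn('10.0.0.0/24')  # peer re-advertises one route after restart
table.sweep()               # 10.0.1.0/24 is swept; 10.0.0.0/24 survives
```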

Events to test and verify

  1. All the scenarios listed above are applicable, with the restart step applied to the peer (not the DUT control-node)
  2. When GR/LLGR is in effect, routes must be verified for proper flags and path attributes (e.g. the LLGR_STALE community; see the sketch after this list)
  3. Best-path selection must be verified for LLGR_STALE paths, which are to be less preferred
  4. Nested closure, wherein sessions flap before they reach a stable state (before the GR and/or LLGR timers expire). The goal is to retain the routes as long as applicable in order to minimize the impact on traffic flow
  5. Configuration changes while GR is in effect (in the DUT and/or in peers), such as admin-down, families negotiated, the GR configuration itself, etc.
  6. control-node restart while in the midst of GR helper mode for one or more BGP and/or XMPP peers (there should be no crash, and the restart should happen quickly and gracefully)
  7. Agents subscribe to overlapping and non-overlapping subsets of networks after restart
  8. Agents send overlapping and non-overlapping subsets of routes after restart
  9. Routing instance deletion or modification in the midst of GR helper mode (when the agent is down or just coming up)
  10. Route Target configuration changes before, during and after GR helper mode is in effect
  11. GR helper mode disable/enable for BGP and/or XMPP
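
A hedged sketch of the LLGR_STALE checks from items 2 and 3. The community value 65535:6 (0xFFFF0006) for LLGR_STALE comes from the LLGR specification (draft-uttaro-idr-bgp-persistence); the path representation used here is a hypothetical stand-in for introspect output:

```python
# Sketch of LLGR_STALE verification; the 'path' dicts are hypothetical.
LLGR_STALE = 0xFFFF0006  # well-known community 65535:6 per the LLGR draft

def is_llgr_stale(path):
    return LLGR_STALE in path.get('communities', [])

def best_path(paths):
    # LLGR_STALE paths must lose to any non-stale path; among equally
    # stale paths, fall back to the usual selection (a simple rank here,
    # where lower is better; 'preference_rank' is made up).
    return min(paths, key=lambda p: (is_llgr_stale(p), p['preference_rank']))
```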

UnitTest

graceful_restart_test attempts to cover many of the following scenarios in UT, but those tests are equally applicable to the systest environment as well.

Bring up n_agents_ and n_peers_ in n_instances_, and advertise n_routes_ (v4 and v6) over each connection. Verify that (n_agents_ + n_peers_) * n_instances_ * n_routes_ routes are received by the peer in each instance (a worked example follows the list below).

  • Subset of agents/peers support GR
  • Subset of routing-instances are deleted before, during and after GR
  • Subset of agents/peers go down permanently (Triggered from agents)
  • Subset of agents/peers flip (go down and come back up) (Triggered from agents)
  • Subset of agents/peers go down permanently (Triggered from control-node)
  • Subset of agents/peers flip (go down and come back up) (Triggered from control-node)
  • Subset of subscriptions after restart
  • Subset of routes are [re]advertised after restart
  • Subset of routing-instances are deleted (during GR)
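
As a purely illustrative worked example of the route-count verification above (the numbers are made up):

```python
# Illustrative arithmetic for the route-count verification above.
n_agents_, n_peers_, n_instances_, n_routes_ = 2, 2, 4, 10

expected = (n_agents_ + n_peers_) * n_instances_ * n_routes_
print(expected)  # (2 + 2) * 4 * 10 = 160 routes expected at the peer
```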

Misc tasks

  • Profile contrail-control using gprof and get performance data
  • Profile contrail-control using Valgrind and get info about memory leaks, corruptions, etc.
  • Run code-coverage against contrail-control daemon to get coverage data