Graceful Restart

This document describes the test plan for the GracefulRestart feature. More details about the feature itself are available in the GracefulRestart Functional Specifications.

Release: 3.2

GracefulRestart (GR) has two pieces embedded in it. The first is Graceful Restart Helper mode, wherein the controller keeps the routes of all of its BGP peers and XMPP agents even if a session goes down (for a certain period). If and when the session comes back up, the routes are cleaned up using the standard mark-and-sweep approach. In this scenario, the control-node itself does not undergo any restart per se.

The second piece is where the control-node itself restarts. In this scenario, called GracefulRestart mode, the control-node advertises the GR capability to its BGP peers (before restarting) so as to avail itself of the GR helper mode functionality provided by the peer (such as a JUNOS MX).

Test Status

Reference

### GracefulRestart Mode

  1. Configure the GracefulRestart timer interval via web-ui/api (in the vnc_cfg.xsd schema, under global-system-config); a configuration sketch follows this list
  2. Bring up the controller, peers and agents, and learn the routes
  3. Stop the control-node (`service supervisor-control stop`)
  4. The routes advertised by the control-node towards its peers (such as an MX running JUNOS) should continue to remain in the peers as 'STALE'. Once the control-node comes back up, the routes should get re-advertised and should no longer remain stale in the BGP peers
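
A minimal sketch of step 1 using the Contrail VNC Python API. The GracefulRestartParametersType field names, timer units and values here are assumptions drawn from the vnc_cfg.xsd schema and should be verified against the release in use:

```python
# Hedged sketch: configure GR/LLGR timers on global-system-config via the
# VNC API. Field names of GracefulRestartParametersType are assumptions
# based on the vnc_cfg.xsd schema; verify against the 3.2 schema before use.
from vnc_api.vnc_api import VncApi, GracefulRestartParametersType

api = VncApi(api_server_host='127.0.0.1')  # assumed API server address
gsc = api.global_system_config_read(fq_name=['default-global-system-config'])

gr_params = GracefulRestartParametersType(
    enable=True,
    restart_time=60,              # GR timer, seconds (assumed units/value)
    long_lived_restart_time=300,  # LLGR timer, seconds (assumed units/value)
    end_of_rib_timeout=30,        # EoR receive timeout, seconds (assumed)
)
gsc.set_graceful_restart_parameters(gr_params)
api.global_system_config_update(gsc)
```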

Events to test and verify

  1. Only GR time intervals configured, and only in the DUT (the restarting contrail-control node)
  2. Only LLGR time intervals configured, and only in the DUT (the restarting contrail-control node)
  3. Both GR and LLGR time intervals configured, but only in the DUT (the restarting contrail-control node)
  4. Only GR time intervals configured, and only in the MX peer (the non-restarting GR helper node)
  5. Only LLGR time intervals configured, and only in the MX peer (the non-restarting GR helper node)
  6. Both GR and LLGR time intervals configured, but only in the MX peer (the non-restarting GR helper node)
  7. Only GR time intervals configured, in both the GR helper MX node and the restarting contrail-control node
  8. Only LLGR time intervals configured, in both the GR helper MX node and the restarting contrail-control node
  9. Both GR and LLGR time intervals configured, in both the GR helper MX node and the restarting contrail-control node
  10. GR/LLGR configured selectively for certain address families (or all address families) in the MX JUNOS peer
  11. Session reset due to cold reboots, warm reboots, process restarts, config changes, etc.
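
Scenarios 1-9 above form a 3x3 matrix: which timers are configured (GR only, LLGR only, or both) crossed with where they are configured (DUT only, MX peer only, or both). A small, purely illustrative sketch to enumerate that matrix for test automation:

```python
# Illustrative only: enumerate the 3x3 GR/LLGR placement matrix behind
# scenarios 1-9 so a harness can iterate over them in order.
from itertools import product

PLACEMENTS = ('DUT only', 'MX peer only', 'DUT and MX peer')
TIMERS = ('GR only', 'LLGR only', 'GR and LLGR')

for num, (placement, timers) in enumerate(product(PLACEMENTS, TIMERS), 1):
    print(f'scenario {num}: {timers} time intervals configured on {placement}')
```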

In all the above scenarios,

  1. GR and LLGR functionality must be verified
  2. Routes should remain in the table until the GR timer (and LLGR timer) expires, as negotiated
  3. EndOfRib must be sent out by the restarting contrail-control node only after at least BgpPeer::kMinEndOfRibSendTimeUsecs and at most BgpPeer::kMaxEndOfRibSendTimeUsecs (a verification sketch follows this list)
  4. If the number of updates to be sent out by the restarting node is small, then EoR can be expected soon after BgpPeer::kMinEndOfRibSendTimeUsecs; otherwise, only after the output queue is fully drained (this can be checked in introspect)
  5. GR and/or LLGR comes into effect only if GR is negotiated for all address families carried over the session
  6. Changes to the negotiated list of families in GR should result in (non-graceful) session closure
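
A hedged sketch of the EoR timing check from item 3. The helper get_eor_send_delay_usecs() is hypothetical (e.g. scraped from the control-node introspect pages), and the bound values are placeholders; the authoritative constants live in BgpPeer:

```python
# Sketch of the EndOfRib timing verification. Both the helper and the
# constant values below are assumptions; the real bounds are
# BgpPeer::kMinEndOfRibSendTimeUsecs / BgpPeer::kMaxEndOfRibSendTimeUsecs.
K_MIN_EOR_SEND_USECS = 10 * 1000000  # placeholder, check BgpPeer sources
K_MAX_EOR_SEND_USECS = 60 * 1000000  # placeholder, check BgpPeer sources

def verify_eor_timing(get_eor_send_delay_usecs):
    # get_eor_send_delay_usecs: hypothetical callable returning the delay
    # between session (re-)establishment and EoR transmission, in usecs.
    delay = get_eor_send_delay_usecs()
    assert K_MIN_EOR_SEND_USECS <= delay <= K_MAX_EOR_SEND_USECS, (
        'EoR sent outside the expected window: %d usecs' % delay)
```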

### GracefulRestart Helper Mode

This is the more complicated mode of the two. In this mode, if GR is negotiated on a session, then the routes received from a peer are kept intact even after the session goes down. The routes are managed using the standard mark-and-sweep approach, sketched below.
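
A minimal, self-contained illustration of the mark-and-sweep idea (this is not the contrail-control implementation): on session down, the helper keeps the peer's routes but marks them stale; routes re-learned after the peer comes back are unmarked; whatever is still stale when the GR/LLGR timers expire is swept:

```python
# Toy illustration of mark-and-sweep route retention in GR helper mode.
class HelperTable:
    def __init__(self):
        self.routes = {}  # prefix -> {'stale': bool}

    def learn(self, prefix):
        # (Re-)learning a route clears any stale mark.
        self.routes[prefix] = {'stale': False}

    def session_down(self):
        # Mark phase: retain all routes but flag them stale. A real helper
        # would also start the GR (and possibly LLGR) timer here.
        for state in self.routes.values():
            state['stale'] = True

    def sweep(self):
        # Sweep phase: runs at GR/LLGR timer expiry or after EndOfRib.
        # Routes not refreshed since session_down() are deleted.
        self.routes = {p: s for p, s in self.routes.items()
                       if not s['stale']}

table = HelperTable()
table.learn('10.0.0.0/24')
table.learn('10.0.1.0/24')
table.session_down()        # both routes retained, marked stale
table.learn('10.0.0.0/24')  # peer re-advertises one route after restart
table.sweep()               # 10.0.1.0/24 is swept; 10.0.0.0/24 survives
```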

Events to test and verify

  1. All the scenarios listed above are applicable, with the restart step applied to the peer (not the DUT control-node)
  2. When GR/LLGR is in effect, routes must be verified for proper flags and path attributes (e.g. the LLGR_STALE community; see the sketch after this list)
  3. Best-path selection must be verified for LLGR_STALE paths, which are to be less preferred
  4. Nested closure, wherein sessions flap before they reach a stable state (before the GR and/or LLGR timers expire). The goal is to retain the routes as long as applicable in order to minimize the impact on traffic flow
  5. Configuration changes while GR is in effect (in the DUT and/or in peers), such as admin-down, families negotiated, the GR configuration itself, etc.
  6. control-node restart while in the midst of GR helper mode for one or more BGP and/or XMPP peers (there should be no crash, and the restart should happen quickly and gracefully)
  7. Agents subscribe to overlapping and non-overlapping subsets of networks after restart
  8. Agents send overlapping and non-overlapping subsets of routes after restart
  9. Routing instance deletion or modification in the midst of GR helper mode (when the agent is down or just coming up)
  10. Route Target configuration changes before, during and after GR helper mode is in effect
  11. GR helper mode disable/enable for BGP and/or XMPP
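
A hedged sketch of the LLGR_STALE checks from items 2 and 3. The community value 65535:6 (0xFFFF0006) for LLGR_STALE comes from the LLGR specification (draft-uttaro-idr-bgp-persistence); the path representation used here is a hypothetical stand-in for introspect output:

```python
# Sketch of LLGR_STALE verification; the 'path' dicts are hypothetical.
LLGR_STALE = 0xFFFF0006  # well-known community 65535:6 per the LLGR draft

def is_llgr_stale(path):
    return LLGR_STALE in path.get('communities', [])

def best_path(paths):
    # LLGR_STALE paths must lose to any non-stale path; among equally
    # stale paths, fall back to the usual selection (a simple rank here,
    # where lower is better; 'preference_rank' is made up).
    return min(paths, key=lambda p: (is_llgr_stale(p), p['preference_rank']))
```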

UnitTest

graceful_restart_test attempts to cover many of the following scenarios in UT, but those tests are equally applicable to the systest environment as well.

Bring up n_agents_ and n_peers_ in n_instances_, and advertise n_routes_ (v4 and v6) over each connection. Verify that (n_agents_ + n_peers_) * n_instances_ * n_routes_ routes are received by the peer in each instance (a worked example follows the list below).

  • Subset of agents/peers support GR
  • Subset of routing-instances are deleted before, during and after GR
  • Subset of agents/peers go down permanently (Triggered from agents)
  • Subset of agents/peers flip (go down and come back up) (Triggered from agents)
  • Subset of agents/peers go down permanently (Triggered from control-node)
  • Subset of agents/peers flip (go down and come back up) (Triggered from control-node)
  • Subset of subscriptions after restart
  • Subset of routes are [re]advertised after restart
  • Subset of routing-instances are deleted (during GR)
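
As a purely illustrative worked example of the route-count verification above (the numbers are made up):

```python
# Illustrative arithmetic for the route-count verification above.
n_agents_, n_peers_, n_instances_, n_routes_ = 2, 2, 4, 10

expected = (n_agents_ + n_peers_) * n_instances_ * n_routes_
print(expected)  # (2 + 2) * 4 * 10 = 160 routes expected at the peer
```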

Misc tasks

  • Profile contrail-control using gprof and get performance data
  • Profile contrail-control using Valgrind and get info about memory leaks, corruptions, etc.
  • Run code-coverage against contrail-control daemon to get coverage data