
Graceful Restart

Ananth Suryanarayana edited this page Feb 21, 2017 · 55 revisions

In Release 3.2, support for Graceful Restart (GR) and Long Lived Graceful Restart (LLGR) helper modes was added to contrail-controller. The feature is not enabled by default and is still marked 'beta', because complete functionality, including GR in agents, is not yet available.

Reference

Applicability

Whenever a BGP peer (or contrail-vrouter-agent) session is detected as down, all routes learned from that peer are deleted and immediately withdrawn from the peers they were advertised to. This instantly disrupts end-to-end traffic, even when the routes held in the vrouter kernel module (the data plane) remain intact. The GracefulRestart and LongLivedGracefulRestart features help alleviate this problem.

When a session goes down, learned routes are neither deleted nor withdrawn from advertised peers for a certain period. Instead, they are kept as is and marked 'stale'. If the session then comes back up and the routes are relearned, the overall impact on the network is significantly contained.

Feature highlights

  • Support for advertising GR and LLGR capabilities in BGP (by configuring a non-zero restart time)
  • Support for GR and LLGR helper mode to retain routes even after sessions go down (by configuring helper mode)
  • With GR in effect, whenever a session down event is detected and the close process is triggered, all routes (across all address families) are marked stale and remain eligible for best-path election for the GracefulRestartTime duration (as exchanged)
  • With LLGR in effect, stale routes can be retained for much longer than GR alone allows. In this phase, route preference is lowered and best paths are recomputed. Stale paths are also tagged with the LLGR_STALE community and re-advertised. However, if the NO_LLGR community is associated with a received stale route, that route is not kept and is deleted instead
  • If the session comes back up, any stale routes that still remain after a certain time are deleted. If the session does not come back up, all retained stale routes are permanently deleted and withdrawn from advertised peers
  • The GR/LLGR feature can be enabled for both BGP-based and XMPP-based peers
  • GR/LLGR configuration parameters reside under the global-system-config configuration section
  • GR timers can be configured via the UI or the provision script, e.g.
/opt/contrail/utils/provision_control.py --api_server_ip 10.84.13.20 --api_server_port 8082 --router_asn 64512 --admin_user admin --admin_password c0ntrail123 --admin_tenant_name admin --set_graceful_restart_parameters --graceful_restart_time 300 --long_lived_graceful_restart_time 60000 --end_of_rib_timeout 30 --graceful_restart_enable --graceful_restart_bgp_helper_enable
# --graceful_restart_xmpp_helper_enable (Not supported yet)
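
The route handling described in the highlights above can be sketched as a small Python model. This is an illustration only, not contrail-control code: the names Phase, Route, phase_at and retained_routes are hypothetical, and the sketch assumes that the LLGR period starts once the GR restart time has expired.

```python
from dataclasses import dataclass
from enum import Enum


class Phase(Enum):
    GR = "gr"            # stale routes stay eligible for best-path election
    LLGR = "llgr"        # stale routes are depreferenced and tagged LLGR_STALE
    EXPIRED = "expired"  # stale routes are deleted and withdrawn


@dataclass
class Route:
    prefix: str
    communities: frozenset = frozenset()
    stale: bool = False
    llgr_stale: bool = False


def phase_at(elapsed, gr_time, llgr_time):
    """Return the helper phase `elapsed` seconds after the session went down.

    Assumption: the LLGR timer starts when the GR timer expires.
    """
    if elapsed < gr_time:
        return Phase.GR
    if elapsed < gr_time + llgr_time:
        return Phase.LLGR
    return Phase.EXPIRED


def retained_routes(routes, elapsed, gr_time, llgr_time):
    """Return the stale routes a helper would still hold at `elapsed` seconds."""
    phase = phase_at(elapsed, gr_time, llgr_time)
    if phase is Phase.EXPIRED:
        return []
    kept = []
    for route in routes:
        if phase is Phase.LLGR and "NO_LLGR" in route.communities:
            continue  # routes carrying NO_LLGR are not kept in the LLGR phase
        kept.append(Route(route.prefix, route.communities, stale=True,
                          llgr_stale=(phase is Phase.LLGR)))
    return kept
```

With the timer values from the provision command above (300 and 60000 seconds), phase_at(100, 300, 60000) falls in the GR phase and phase_at(400, 300, 60000) falls in the LLGR phase under this model.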

When BGP peering with JUNOS, JUNOS must also be explicitly configured for GR/LLGR, e.g.

set routing-options graceful-restart
set protocols bgp group a6s20 type internal
set protocols bgp group a6s20 local-address 10.87.140.181
set protocols bgp group a6s20 keep all
set protocols bgp group a6s20 family inet-vpn unicast graceful-restart long-lived restarter stale-time 20
set protocols bgp group a6s20 family route-target graceful-restart long-lived restarter stale-time 20
set protocols bgp group a6s20 graceful-restart restart-time 600
set protocols bgp group a6s20 neighbor 10.84.13.20 peer-as 64512

GR helper modes can be enabled via the schema. They can be disabled selectively in a contrail-control for BGP and/or XMPP sessions by setting gr_helper_bgp_disable and/or gr_helper_xmpp_disable in the /etc/contrail/contrail-control.conf configuration file. For BGP, the restart time is advertised in the GR capability as configured in the schema.

e.g.

/usr/bin/openstack-config --set /etc/contrail/contrail-control.conf DEFAULT gr_helper_bgp_disable 1
/usr/bin/openstack-config --set /etc/contrail/contrail-control.conf DEFAULT gr_helper_xmpp_disable 1
service contrail-control restart

Caveats

  • The GR/LLGR feature with a peer takes effect either for all negotiated address families or for none. That is, if a peer signals GR/LLGR support for only a subset of the negotiated address families (via the BGP GR/LLGR capability advertisement), GR helper mode does not take effect for any of the negotiated address families
  • GracefulRestart for contrail-vrouter-agents is not supported yet (in 3.2). Hence, graceful_restart_xmpp_helper_enable should not be set. If an agent restarts, the data plane is reset, so routes and flows are reprogrammed afresh (which typically results in traffic loss for new and existing flows for several seconds)
  • GR/LLGR is not supported for multicast routes
  • GR/LLGR helper mode may not work correctly for EVPN routes if the restarting node does not preserve forwarding state
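
The all-or-none rule in the first caveat can be expressed as a simple subset check. The function below is a hypothetical sketch, not the contrail-control implementation, and the family names used with it are only example labels.

```python
def gr_helper_applies(negotiated_families, peer_gr_families):
    """Return True only if the peer signalled GR/LLGR for every negotiated family.

    If even one negotiated family is missing from the peer's GR/LLGR
    capability, helper mode is not used for any family.
    """
    negotiated = set(negotiated_families)
    return bool(negotiated) and negotiated <= set(peer_gr_families)
```

For example, a peer that negotiates inet-vpn and route-target but signals GR only for inet-vpn yields False, so no routes from that peer would be retained as stale.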

Contrail-Vrouter Head-Less mode

/usr/bin/openstack-config --set /etc/contrail/contrail-vrouter-agent.conf DEFAULT headless_mode true

Headless mode is introduced as a resilient mode of operation for the agent. When running in headless mode, the agent retains the last "Route Path" from the Contrail-Controller. These route paths are held until a new stable connection is established to one of the Contrail-Controllers. Once the XMPP connection is up and has been stable for a pre-defined duration, the route paths from the old XMPP connection are flushed.
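
The hold-then-flush behaviour described above can be modelled in a few lines of Python. This is a sketch under assumptions, not agent code: the class HeldPaths, its methods and the stable_for parameter are hypothetical names, and the actual stability duration used by the agent is not stated here.

```python
class HeldPaths:
    """Route paths retained from the last controller connection."""

    def __init__(self, paths, stable_for):
        self.paths = list(paths)      # paths learned over the old XMPP connection
        self.stable_for = stable_for  # seconds a new connection must stay up
        self.up_since = None          # time the current connection came up

    def on_connect(self, now):
        self.up_since = now

    def on_disconnect(self):
        # A flap restarts the stability timer; the old paths stay held.
        self.up_since = None

    def current(self, now):
        """Return the held paths, flushing them once the connection is stable."""
        if self.up_since is not None and now - self.up_since >= self.stable_for:
            self.paths = []
        return self.paths
```

In this model a connection that flaps before stable_for seconds have passed never triggers the flush, which mirrors the requirement that the new connection be stable before old paths are dropped.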

When headless mode is used along with graceful-restart helper mode in contrail-control, the vrouter can forward east-west traffic between vrouters for current and new flows (for already learnt routes) even if all control-nodes in the cluster go down and remain down. If graceful restart helper mode is also used in SDN gateways (such as JUNOS-MX), north-south traffic between the MX and the vrouters can also remain uninterrupted in headless mode. This particular aspect is not available in releases earlier than 3.2.
