Graceful Restart

Ananth Suryanarayana edited this page Aug 17, 2016 · 55 revisions

In Release 3.2, support for Graceful Restart (GR) and Long Lived Graceful Restart (LLGR) helper modes was added to contrail-controller. This feature is not enabled by default.

Reference

Applicability

Whenever a BGP peer (or contrail-vrouter-agent) session goes down, all routes learned from that peer are deleted and immediately withdrawn from every peer they were advertised to. This disrupts end-to-end traffic instantly, even when the routes are still intact in the vrouter kernel module (the data plane). The GracefulRestart and LongLivedGracefulRestart features alleviate this problem. When a session goes down, learned routes are neither deleted nor withdrawn from advertised peers for a certain period. Instead, they are kept as is and only marked as 'stale'. If the session comes back up and the routes are relearned, the overall impact on the network is significantly contained.

In completely headless mode, when no contrail-control is running in a cluster, north-south traffic flows can also be preserved by using the GR helper mode of BGP peers (east-west traffic flows are already preserved by vrouters in headless mode in 3.0+). Only this particular aspect, headless mode combined with the contrail-control GR/LLGR feature, shall be productized and fully qualified in the 3.1 release. GR helper modes from contrail-control for its BGP and XMPP peers shall be qualified in future releases.

Feature highlights

  • Support for advertising GR and LLGR capabilities in BGP (by configuring a non-zero restart time)
  • Support for GR and LLGR helper mode to retain routes even after sessions go down (by configuring helper mode)
  • When GR is in effect, whenever a session-down event is detected and the close process is triggered, all routes (across all address families) are marked stale and remain eligible for best-path election for the GracefulRestartTime duration (as exchanged)
  • When LLGR is in effect, stale routes can be retained for much longer than GR alone allows. In this phase, route preference is lowered and best paths are recomputed. Stale paths are also tagged with the LLGR_STALE community and re-advertised. However, any received stale route that carries the NO_LLGR community is not kept and is deleted instead
  • If the session comes back up, any stale routes that still remain after a certain time are deleted. If the session does not come back up, all retained stale routes are permanently deleted and withdrawn from advertised peers
  • The GR/LLGR feature can be enabled for both BGP-based and XMPP-based peers
  • GR/LLGR configuration resides under the global-system-config configuration section; see Configuration parameters
  • GR timers can be configured through the UI or via the provision script, e.g. /opt/contrail/utils/provision_control.py --api_server_ip 10.84.13.20 --api_server_port 8082 --router_asn 64512 --admin_user admin --admin_password c0ntrail123 --admin_tenant_name admin --host_name a6s20 --host_ip 10.84.13.20 --graceful_restart_time 300 --long_lived_graceful_restart_time 60000
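
The stale-route handling described in the highlights above can be sketched as a small Python model. This is only an illustration of the described behaviour, not the contrail-control implementation: the `Route` and `PeerRoutes` classes, their method names, and the example prefixes are all hypothetical. The community values are the well-known LLGR_STALE and NO_LLGR values from RFC 9494.

```python
from dataclasses import dataclass, field

# Well-known BGP community values for LLGR (RFC 9494).
LLGR_STALE = 0xFFFF0006
NO_LLGR = 0xFFFF0007


@dataclass
class Route:
    # Hypothetical route record, reduced to what the sketch needs.
    prefix: str
    communities: set = field(default_factory=set)
    stale: bool = False
    depreferenced: bool = False


class PeerRoutes:
    """Hypothetical model of the routes learned from a single peer."""

    def __init__(self, routes):
        self.routes = {r.prefix: r for r in routes}

    def session_down(self):
        # GR phase: nothing is deleted or withdrawn, routes are only marked stale.
        for route in self.routes.values():
            route.stale = True

    def gr_timer_expired(self):
        # LLGR phase: stale routes carrying NO_LLGR are deleted; the rest are
        # depreferenced and tagged LLGR_STALE before being re-advertised.
        kept = {}
        for prefix, route in self.routes.items():
            if not route.stale:
                kept[prefix] = route
            elif NO_LLGR not in route.communities:
                route.depreferenced = True
                route.communities.add(LLGR_STALE)
                kept[prefix] = route
        self.routes = kept

    def relearn(self, prefix, communities=()):
        # A route re-sent after the session comes back replaces the stale copy.
        self.routes[prefix] = Route(prefix, set(communities))

    def sweep_stale(self):
        # Called once relearning is done (session up) or when the LLGR timer
        # runs out (session still down): whatever is still stale is deleted.
        self.routes = {p: r for p, r in self.routes.items() if not r.stale}


peer = PeerRoutes([
    Route("10.1.0.0/24"),
    Route("10.2.0.0/24", {NO_LLGR}),
    Route("10.3.0.0/24"),
])
peer.session_down()          # all three routes kept, marked stale
peer.gr_timer_expired()      # 10.2.0.0/24 dropped because of NO_LLGR
peer.relearn("10.1.0.0/24")  # peer came back and re-sent one route
peer.sweep_stale()           # 10.3.0.0/24 was never relearned, so it goes
print(sorted(peer.routes))   # ['10.1.0.0/24']
```

Timers are modelled as explicit method calls rather than real clocks so that each transition can be followed in isolation.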

GR helper mode can be enabled for BGP and/or XMPP sessions by following these steps on the contrail-control node. For BGP, the restart time is still advertised in the GR capability, as configured. This lets contrail-control still rely on the GR helper mode of its BGP peer (for example, a JUNOS MX) during graceful restarts of the contrail-control process. The end-of-rib receive wait timers can also be tuned by configuring DEFAULT.bgp_end_of_rib_timeout and DEFAULT.xmpp_end_of_rib_timeout (in seconds).

  1. /usr/bin/openstack-config --set /etc/contrail/contrail-control.conf DEFAULT gr_helper_bgp_enable 1
  2. /usr/bin/openstack-config --set /etc/contrail/contrail-control.conf DEFAULT gr_helper_xmpp_enable 1
  3. service contrail-control restart
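
For setups where the same change is scripted rather than typed, the two settings above could be written with Python's standard `configparser`. This is a minimal sketch that assumes the file parses as plain INI. The helper name `set_default_options` and the sample file content are hypothetical, and the demo works on a scratch file rather than the real /etc/contrail/contrail-control.conf. Note that `configparser` discards comments when it rewrites a file, so openstack-config remains the safer tool on a live node.

```python
import configparser
import os
import tempfile


def set_default_options(conf_path, options):
    """Write the given key/value pairs into the [DEFAULT] section of an INI file."""
    cfg = configparser.ConfigParser(interpolation=None)
    cfg.read(conf_path)
    for key, value in options.items():
        cfg["DEFAULT"][key] = str(value)
    with open(conf_path, "w") as conf_file:
        cfg.write(conf_file)
    return dict(cfg["DEFAULT"])


with tempfile.TemporaryDirectory() as tmp:
    # Scratch copy with placeholder content, standing in for the real file.
    path = os.path.join(tmp, "contrail-control.conf")
    with open(path, "w") as conf_file:
        conf_file.write("[DEFAULT]\nhostname = a6s20\n")

    settings = set_default_options(path, {
        "gr_helper_bgp_enable": 1,
        "gr_helper_xmpp_enable": 1,
    })
    print(settings)
```

The contrail-control service still has to be restarted afterwards for the new values to take effect, as in step 3.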

Caveats (3.1)

  • GR support is not present in contrail-vrouter-agent; it takes effect only in contrail-control. In future releases, GR/LLGR support shall be extended to contrail-vrouter-agent as well, keeping end-to-end traffic intact during agent restarts.
  • The GR/LLGR feature with a peer takes effect either for all negotiated address families or for none. That is, if a peer signals GR/LLGR support for only a subset of the negotiated address families (via the BGP GR/LLGR capability advertisement), GR helper mode does not take effect for any of the negotiated address families
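
The all-or-none rule in the second caveat amounts to a subset check. A minimal Python sketch, in which the function name and the family labels are illustrative assumptions rather than contrail-control identifiers:

```python
def gr_helper_applies(negotiated_families, gr_families):
    """Return True only if the peer signalled GR/LLGR for every negotiated family."""
    negotiated = set(negotiated_families)
    # With no negotiated families there is nothing to retain routes for.
    return bool(negotiated) and negotiated <= set(gr_families)


negotiated = {"inet-vpn", "e-vpn", "route-target"}

# Peer advertised GR for every negotiated family: helper mode applies.
print(gr_helper_applies(negotiated, {"inet-vpn", "e-vpn", "route-target"}))  # True

# Peer advertised GR for a subset only: helper mode applies to none of them.
print(gr_helper_applies(negotiated, {"inet-vpn"}))  # False
```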