Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Exabgp stuck and cause tcp zerowindow error #1207

Open
muyiwhy opened this issue May 6, 2024 · 15 comments
Open

Exabgp stuck and cause tcp zerowindow error #1207

muyiwhy opened this issue May 6, 2024 · 15 comments
Assignees
Labels

Comments

@muyiwhy
Copy link

muyiwhy commented May 6, 2024

OS: CentOS Linux release 7.3.1611
Version: 4.2.21

What I did:
I established an ibgp link-state neighbor with a Cisco ios-xr device, when the topology is small, it all works fine. But when we use the same configuration on a live topology with more node and links(about 800 nodes and 6000 links), it comes with an error.

The exabgp can only receive 3 update messages and then it cannot get any more messages from stdin. I checked the tcp queue in Centos and shows it is full. From the peer Cisco device, all the messages are stuck in output queue.

I tried to capture packets in the Centos(sorry that I cannot share the pcap file since there is sensetive infos inside). I can see the first update message comes at 6.08 seconds.

image

Then all the other update messages come from 10.95 second to 11.31 second, there are about 150 update packages contains more the 300 update messages come within 0.4 seconds, then Centos send tcp zerowindow message to Cisco device.

image image

I tried to replace exabgp with gobgp and gobgp can work properly. So maybe there are some issue with exabgp to deal with large mount of update messages.

Please let me know if you need any more info.

@thomas-mangin
Copy link
Member

Currently travelling, short answer, it is probably the pipe filling .. most reported issue such as #1168

@thomas-mangin
Copy link
Member

thomas-mangin commented May 6, 2024

I will correctly read your issue when I can but you did not provide the information I request for help so can you please do, thank you .

@muyiwhy
Copy link
Author

muyiwhy commented May 7, 2024

It still stuck when I set api.ack to off.

@muyiwhy
Copy link
Author

muyiwhy commented May 7, 2024

The -d output is like:

08:50:18 | 18047  | welcome       | Thank you for using ExaBGP
08:50:18 | 18047  | version       | 4.2.11  
08:50:18 | 18047  | interpreter   | 3.6.8 (default, Nov 16 2020, 16:55:22)  [GCC 4.8.5 20150623 (Red Hat 4.8.5-44)]
08:50:18 | 18047  | os            | Linux controller2 3.10.0-514.el7.x86_64 #1 SMP Tue Nov 22 16:42:41 UTC 2016 x86_64
08:50:18 | 18047  | installation  | /usr/local
08:50:38 | 18047  | advice        |         
08:50:38 | 18047  | advice        | Your OS is very slow when returning the host FQDN
08:50:38 | 18047  | advice        | Most likely you do not have valid forward/reverse DNS setup
08:50:38 | 18047  | advice        | Adding your hostname to the /etc/hosts file should fix the issue
08:50:38 | 18047  | advice        |         
08:50:38 | 18047  | cli           | could not find the named pipes (exabgp.in and exabgp.out) required for the cli
08:50:38 | 18047  | cli           | we scanned the following folders (the number is your PID):
08:50:38 | 18047  | cli control   |  - /run/exabgp/
08:50:38 | 18047  | cli control   |  - /run/0/
08:50:38 | 18047  | cli control   |  - /run/
08:50:38 | 18047  | cli control   |  - /var/run/exabgp/
08:50:38 | 18047  | cli control   |  - /var/run/0/
08:50:38 | 18047  | cli control   |  - /var/run/
08:50:38 | 18047  | cli control   |  - /usr/local/run/exabgp/
08:50:38 | 18047  | cli control   |  - /usr/local/run/0/
08:50:38 | 18047  | cli control   |  - /usr/local/run/
08:50:38 | 18047  | cli control   |  - /usr/local/var/run/exabgp/
08:50:38 | 18047  | cli control   |  - /usr/local/var/run/0/
08:50:38 | 18047  | cli control   |  - /usr/local/var/run/
08:50:38 | 18047  | cli control   | please make them in one of the folder with the following commands:
08:50:38 | 18047  | cli control   | > mkfifo /etc/exabgp/run/exabgp.{in,out}
08:50:38 | 18047  | cli control   | > chmod 600 /etc/exabgp/run/exabgp.{in,out}
08:50:38 | 18047  | configuration | performing reload of exabgp 4.2.11
08:50:38 | 18047  | configuration | > process          | 'parsed-route-backend'
08:50:38 | 18047  | configuration | . run              | 'python' '/home/cisco/scheduler/module/link_state.py'
08:50:38 | 18047  | configuration | . encoder          | 'json'
08:50:38 | 18047  | configuration | < process          | 
08:50:38 | 18047  | configuration | > neighbor         | '192.168.0.51'
08:50:38 | 18047  | configuration | . local-address    | '192.168.0.41'
08:50:38 | 18047  | configuration | . local-as         | '4837'
08:50:38 | 18047  | configuration | . peer-as          | '4837'
08:50:38 | 18047  | configuration | > family           | 
08:50:38 | 18047  | configuration | . bgp-ls           | 'bgp-ls'
08:50:38 | 18047  | configuration | < family           | 
08:50:38 | 18047  | configuration | > api              | 
08:50:38 | 18047  | configuration | . processes        | '[' 'parsed-route-backend' ']'
08:50:38 | 18047  | configuration | > receive          | 
08:50:38 | 18047  | configuration | . parsed           | 
08:50:38 | 18047  | configuration | . update           | 
08:50:38 | 18047  | configuration | . keepalive        | 
08:50:38 | 18047  | configuration | < receive          | 
08:50:38 | 18047  | configuration | < api              | 
08:50:38 | 18047  | configuration | < neighbor         | 
08:50:38 | 18047  | configuration | > neighbor         | '192.168.0.61'
08:50:38 | 18047  | configuration | . local-address    | '192.168.0.41'
08:50:38 | 18047  | configuration | . local-as         | '4837'
08:50:38 | 18047  | configuration | . peer-as          | '4837'
08:50:38 | 18047  | configuration | > family           | 
08:50:38 | 18047  | configuration | . bgp-ls           | 'bgp-ls'
08:50:38 | 18047  | configuration | < family           | 
08:50:38 | 18047  | configuration | > api              | 
08:50:38 | 18047  | configuration | . processes        | '[' 'parsed-route-backend' ']'
08:50:38 | 18047  | configuration | > receive          | 
08:50:38 | 18047  | configuration | . parsed           | 
08:50:38 | 18047  | configuration | . update           | 
08:50:38 | 18047  | configuration | . keepalive        | 
08:50:38 | 18047  | configuration | < receive          | 
08:50:38 | 18047  | configuration | < api              | 
08:50:38 | 18047  | configuration | < neighbor         | 
08:50:38 | 18047  | reactor       | new peer: neighbor 192.168.0.51 local-ip 192.168.0.41 local-as 4837 peer-as 4837 router-id 192.168.0.41 family-allowed in-open
08:50:38 | 18047  | reactor       | new peer: neighbor 192.168.0.61 local-ip 192.168.0.41 local-as 4837 peer-as 4837 router-id 192.168.0.41 family-allowed in-open
08:50:38 | 18047  | reactor       | loaded new configuration successfully
08:50:38 | 18047  | process       | forked process parsed-route-backend

While the log files is(I replaced some sensitive info with xxx):

Tue, 07 May 2024 08:50:38 18047  reactor       initialising connection to peer-2
Tue, 07 May 2024 08:50:38 18047  reactor       initialising connection to peer-1
Tue, 07 May 2024 08:50:38 18047  outgoing-1    attempting connection to 192.168.0.61:179
Tue, 07 May 2024 08:50:38 18047  outgoing-1    attempting connection to 192.168.0.51:179
Tue, 07 May 2024 08:50:38 18047  outgoing-1    sending TCP payload (  49) FFFF FFFF FFFF FFFF FFFF FFFF FFFF FFFF 0031 0104 12E5 00B4 C0A8 0029 1402 0601 0440 0400 4702 0641 0400 0012 E502 0206 00
Tue, 07 May 2024 08:50:38 18047  outgoing-1    sending TCP payload (  49) FFFF FFFF FFFF FFFF FFFF FFFF FFFF FFFF 0031 0104 12E5 00B4 C0A8 0029 1402 0601 0440 0400 4702 0641 0400 0012 E502 0206 00
Tue, 07 May 2024 08:50:38 18047  outgoing-1    >> OPEN version=4 asn=xxxx hold_time=180 router_id=192.168.0.41 capabilities=[Multiprotocol(bgp-ls bgp-ls), Extended Message(65535), ASN4(xxxx)]
Tue, 07 May 2024 08:50:38 18047  ka-outgoing-1 receive-timer 60 second(s) left
Tue, 07 May 2024 08:50:38 18047  outgoing-1    >> OPEN version=4 asn=xxxx hold_time=180 router_id=192.168.0.41 capabilities=[Multiprotocol(bgp-ls bgp-ls), Extended Message(65535), ASN4(xxxx)]
Tue, 07 May 2024 08:50:38 18047  ka-outgoing-1 receive-timer 60 second(s) left
Tue, 07 May 2024 08:50:38 18047  outgoing-1    received TCP payload (  19) FFFF FFFF FFFF FFFF FFFF FFFF FFFF FFFF 004B 01
Tue, 07 May 2024 08:50:38 18047  outgoing-1    received TCP payload (  56) 0412 E500 B4DB 9E00 1D2E 0206 0104 4004 0047 0202 8000 0202 0200 0206 4104 0000 12E5 0214 0512 0001 0001 0002 0001 0002 0002 0001 0080 0002
Tue, 07 May 2024 08:50:38 18047  outgoing-1    << message of type OPEN
Tue, 07 May 2024 08:50:38 18047  outgoing-1    << OPEN version=4 asn=xxxx hold_time=180 router_id=xxx.xxx.0.29 capabilities=[Multiprotocol(bgp-ls bgp-ls), Route Refresh, NextHop(ipv4 unicast ipv6,ipv4 multicast ipv6,ipv4 mpls-vpn ipv6), ASN4(xxxx), Route Refresh]
Tue, 07 May 2024 08:50:38 18047  outgoing-1    sending TCP payload (  19) FFFF FFFF FFFF FFFF FFFF FFFF FFFF FFFF 0013 04
Tue, 07 May 2024 08:50:38 18047  outgoing-1    >> KEEPALIVE (OPENCONFIRM)
Tue, 07 May 2024 08:50:38 18047  outgoing-1    received TCP payload (  19) FFFF FFFF FFFF FFFF FFFF FFFF FFFF FFFF 0013 04
Tue, 07 May 2024 08:50:38 18047  outgoing-1    << message of type KEEPALIVE
Tue, 07 May 2024 08:50:38 18047  ka-outgoing-1 receive-timer 180 second(s) left
Tue, 07 May 2024 08:50:38 18047  reactor       connected to peer-1 with outgoing-1 192.168.0.41-192.168.0.51
Tue, 07 May 2024 08:50:38 18047  outgoing-1    sending TCP payload (  30) FFFF FFFF FFFF FFFF FFFF FFFF FFFF FFFF 001E 0200 0000 0790 0F00 0340 0447
Tue, 07 May 2024 08:50:38 18047  outgoing-1    >> EOR bgp-ls bgp-ls
Tue, 07 May 2024 08:50:38 18047  peer-1        >> EOR(s)
Tue, 07 May 2024 08:50:38 18047  outgoing-1    received TCP payload (  19) FFFF FFFF FFFF FFFF FFFF FFFF FFFF FFFF 004B 01
Tue, 07 May 2024 08:50:38 18047  outgoing-1    received TCP payload (  56) 0412 E500 B4DB 9E00 1E2E 0206 0104 4004 0047 0202 8000 0202 0200 0206 4104 0000 12E5 0214 0512 0001 0001 0002 0001 0002 0002 0001 0080 0002
Tue, 07 May 2024 08:50:38 18047  outgoing-1    << message of type OPEN
Tue, 07 May 2024 08:50:38 18047  outgoing-1    << OPEN version=4 asn=xxxx hold_time=180 router_id=xxx.xxx.0.30 capabilities=[Multiprotocol(bgp-ls bgp-ls), Route Refresh, NextHop(ipv4 unicast ipv6,ipv4 multicast ipv6,ipv4 mpls-vpn ipv6), ASN4(xxxx), Route Refresh]
Tue, 07 May 2024 08:50:38 18047  outgoing-1    sending TCP payload (  19) FFFF FFFF FFFF FFFF FFFF FFFF FFFF FFFF 0013 04
Tue, 07 May 2024 08:50:38 18047  outgoing-1    >> KEEPALIVE (OPENCONFIRM)
Tue, 07 May 2024 08:50:38 18047  outgoing-1    received TCP payload (  19) FFFF FFFF FFFF FFFF FFFF FFFF FFFF FFFF 0013 04
Tue, 07 May 2024 08:50:38 18047  outgoing-1    << message of type KEEPALIVE
Tue, 07 May 2024 08:50:38 18047  ka-outgoing-1 receive-timer 180 second(s) left
Tue, 07 May 2024 08:50:38 18047  reactor       connected to peer-2 with outgoing-1 192.168.0.41-192.168.0.61
Tue, 07 May 2024 08:50:38 18047  outgoing-1    sending TCP payload (  30) FFFF FFFF FFFF FFFF FFFF FFFF FFFF FFFF 001E 0200 0000 0790 0F00 0340 0447
Tue, 07 May 2024 08:50:38 18047  outgoing-1    >> EOR bgp-ls bgp-ls
Tue, 07 May 2024 08:50:38 18047  peer-2        >> EOR(s)
Tue, 07 May 2024 08:50:39 18047  ka-outgoing-1 receive-timer 179 second(s) left
Tue, 07 May 2024 08:50:39 18047  ka-outgoing-1 send-timer 59 second(s) left
Tue, 07 May 2024 08:50:39 18047  ka-outgoing-1 receive-timer 179 second(s) left
Tue, 07 May 2024 08:50:39 18047  ka-outgoing-1 send-timer 59 second(s) left
Tue, 07 May 2024 08:50:40 18047  ka-outgoing-1 receive-timer 178 second(s) left
Tue, 07 May 2024 08:50:40 18047  ka-outgoing-1 send-timer 58 second(s) left
Tue, 07 May 2024 08:50:40 18047  ka-outgoing-1 receive-timer 178 second(s) left
Tue, 07 May 2024 08:50:40 18047  ka-outgoing-1 send-timer 58 second(s) left
Tue, 07 May 2024 08:50:41 18047  ka-outgoing-1 receive-timer 177 second(s) left
Tue, 07 May 2024 08:50:41 18047  ka-outgoing-1 send-timer 57 second(s) left
Tue, 07 May 2024 08:50:41 18047  ka-outgoing-1 receive-timer 177 second(s) left
Tue, 07 May 2024 08:50:41 18047  ka-outgoing-1 send-timer 57 second(s) left
Tue, 07 May 2024 08:50:42 18047  ka-outgoing-1 receive-timer 176 second(s) left
Tue, 07 May 2024 08:50:42 18047  ka-outgoing-1 send-timer 56 second(s) left
Tue, 07 May 2024 08:50:42 18047  ka-outgoing-1 receive-timer 176 second(s) left
Tue, 07 May 2024 08:50:42 18047  ka-outgoing-1 send-timer 56 second(s) left
Tue, 07 May 2024 08:50:43 18047  ka-outgoing-1 receive-timer 175 second(s) left
Tue, 07 May 2024 08:50:43 18047  ka-outgoing-1 send-timer 55 second(s) left
Tue, 07 May 2024 08:50:43 18047  ka-outgoing-1 receive-timer 175 second(s) left
Tue, 07 May 2024 08:50:43 18047  ka-outgoing-1 send-timer 55 second(s) left
Tue, 07 May 2024 08:50:43 18047  outgoing-1    received TCP payload (  19) FFFF FFFF FFFF FFFF FFFF FFFF FFFF FFFF 00BE 02
Tue, 07 May 2024 08:50:43 18047  outgoing-1    received TCP payload ( 171) 0000 00A7 900E 0062 4004 4704 C0A8 0033 0000 0200 5502 0000 0000 0000 0000 0100 001A 0200 0004 0000 12E5 0201 0004 0000 0000 0203 0006 2191 5800 1080 0101 001A 0200 0004 0000 12E5 0201 0004 0000 0000 0203 0006 2191 5800 2036 0103 0004 DB9E 6D2D 0104 0004 DB9E 6D2E 4001 0100 4002 0040 0504 0000 0064 800A 04DB 9E00 1D80 0904 DB9E 013E 801D 2204 0400 04DB 9E01 5004 0600 04DB 9E02 2404 4700 0300 0BB8 044B 0007 3000 0000 0012 23
Tue, 07 May 2024 08:50:43 18047  outgoing-1    << message of type UPDATE
Tue, 07 May 2024 08:50:43 18047  parser        parsing UPDATE ( 171) 0000 00A7 900E 0062 4004 4704 C0A8 0033 0000 0200 5502 0000 0000 0000 0000 0100 001A 0200 0004 0000 12E5 0201 0004 0000 0000 0203 0006 2191 5800 1080 0101 001A 0200 0004 0000 12E5 0201 0004 0000 0000 0203 0006 2191 5800 2036 0103 0004 DB9E 6D2D 0104 0004 DB9E 6D2E 4001 0100 4002 0040 0504 0000 0064 800A 04DB 9E00 1D80 0904 DB9E 013E 801D 2204 0400 04DB 9E01 5004 0600 04DB 9E02 2404 4700 0300 0BB8 044B 0007 3000 0000 0012 23
Tue, 07 May 2024 08:50:43 18047  routes        withdrawn NLRI none
Tue, 07 May 2024 08:50:43 18047  parser        attribute mp-reach-nlri      flag 0x90 type 0x0e len 0x62 payload 4004 4704 C0A8 0033 0000 0200 5502 0000 0000 0000 0000 0100 001A 0200 0004 0000 12E5 0201 0004 0000 0000 0203 0006 2191 5800 1080 0101 001A 0200 0004 0000 12E5 0201 0004 0000 0000 0203 0006 2191 5800 2036 0103 0004 DB9E 6D2D 0104 0004 DB9E 6D2E
Tue, 07 May 2024 08:50:43 18047  parser        NLRI      bgp-ls bgp-ls      without path-information     payload 0002 0055 0200 0000 0000 0000 0001 0000 1A02 0000 0400 0012 E502 0100 0400 0000 0002 0300 0621 9158 0010 8001 0100 1A02 0000 0400 0012 E502 0100 0400 0000 0002 0300 0621 9158 0020 3601 0300 04DB 9E6D 2D01 0400 04DB 9E6D 2E
Tue, 07 May 2024 08:50:43 18047  parser        attribute origin             flag 0x40 type 0x01 len 0x01 payload 00
Tue, 07 May 2024 08:50:43 18047  parser        attribute as-path            flag 0x40 type 0x02 len 0x00
Tue, 07 May 2024 08:50:43 18047  parser        attribute local-preference   flag 0x40 type 0x05 len 0x04 payload 0000 0064
Tue, 07 May 2024 08:50:43 18047  parser        attribute cluster-list       flag 0x80 type 0x0a len 0x04 payload DB9E 001D
Tue, 07 May 2024 08:50:43 18047  parser        attribute originator-id      flag 0x80 type 0x09 len 0x04 payload DB9E 013E
Tue, 07 May 2024 08:50:43 18047  parser        attribute bgp-ls             flag 0x80 type 0x1d len 0x22 payload 0404 0004 DB9E 0150 0406 0004 DB9E 0224 0447 0003 000B B804 4B00 0730 0000 0000 1223
Tue, 07 May 2024 08:50:43 18047  routes        announced NLRI none
Tue, 07 May 2024 08:50:43 18047  parser        decoded UPDATE (   0) json { "exabgp": "4.0.1", "time": 1715043043.8016608, "host" : "controller2", "pid" : 18047, "ppid" : 16912, "counter": 2, "type": "update", "neighbor": { "address": { "local": "192.168.0.41", "peer": "192.168.0.51" }, "asn": { "local": xxxx, "peer": xxxx } , "direction": "in", "message": { "update": { "attribute": { "origin": "igp", "local-preference": 100, "originator-id": "xxx.xxx.1.62", "cluster-list": [ "xxx.xxx.0.29" ], "bgp-ls": { "local-te-router-ids": ["xxx.xxx.1.80"], "remote-te-router-id": "xxx.xxx.2.36", "igp-metric": 3000, "sr-adj-flags": {"F": 0, "B": 0, "V": 1, "L": 1, "S": 0, "P": 0, "RSV": 0}, "sids": [[4643]], "undecoded-sids": [], "sr-adj-weight": 0 } }, "announce": { "bgp-ls bgp-ls": { "192.168.0.51": [ { "ls-nlri-type": "bgpls-link", "l3-routing-topology": 0, "protocol-id": 2, "local-node-descriptors": { "autonomous-system": xxxx, "bgp-ls-identifier": "0", "router-id": "xxxxxx001080" }, "remote-node-descriptors": { "autonomous-system": xxxx, "bgp-ls-identifier": "0", "router-id": "xxxxxx002036" }, "interface-address": { "interface-address": "xxx.xxx.109.45" }, "neighbor-address": { "neighbor-address": "xxx.xxx.109.46" } } ] } } } } } }
Tue, 07 May 2024 08:50:43 18047  peer-1        << UPDATE #1
Tue, 07 May 2024 08:50:43 18047  peer-1           UPDATE #1 nlri  (   2) { "ls-nlri-type": "bgpls-link", "l3-routing-topology": 0, "protocol-id": 2, "local-node-descriptors": { "autonomous-system": xxxx, "bgp-ls-identifier": "0", "router-id": "xxxxxx001080" }, "remote-node-descriptors": { "autonomous-system": xxxx, "bgp-ls-identifier": "0", "router-id": "xxxxxx002036" }, "interface-address": { "interface-address": "xxx.xxx.109.45" }, "neighbor-address": { "neighbor-address": "xxx.xxx.109.46" } }
Tue, 07 May 2024 08:50:43 18047  outgoing-1    received TCP payload (  19) FFFF FFFF FFFF FFFF FFFF FFFF FFFF FFFF 00A9 02
Tue, 07 May 2024 08:50:43 18047  outgoing-1    received TCP payload ( 150) 0000 0092 900E 0058 4004 4704 C0A8 0033 0000 0200 4B02 0000 0000 0000 0000 0100 001A 0200 0004 0000 12E5 0201 0004 0000 0000 0203 0006 2191 5800 1075 0101 001A 0200 0004 0000 12E5 0201 0004 0000 0000 0203 0006 2191 5800 1025 0107 0002 0002 4001 0100 4002 0040 0504 0000 0064 800A 04DB 9E00 1D80 0904 DB9E 013E 801D 1704 0400 04DB 9E01 4B04 0600 04DB 9E01 1904 4700 0300 0B54
Tue, 07 May 2024 08:50:43 18047  outgoing-1    << message of type UPDATE
Tue, 07 May 2024 08:50:43 18047  parser        parsing UPDATE ( 150) 0000 0092 900E 0058 4004 4704 C0A8 0033 0000 0200 4B02 0000 0000 0000 0000 0100 001A 0200 0004 0000 12E5 0201 0004 0000 0000 0203 0006 2191 5800 1075 0101 001A 0200 0004 0000 12E5 0201 0004 0000 0000 0203 0006 2191 5800 1025 0107 0002 0002 4001 0100 4002 0040 0504 0000 0064 800A 04DB 9E00 1D80 0904 DB9E 013E 801D 1704 0400 04DB 9E01 4B04 0600 04DB 9E01 1904 4700 0300 0B54
Tue, 07 May 2024 08:50:43 18047  routes        withdrawn NLRI none
Tue, 07 May 2024 08:50:43 18047  parser        attribute mp-reach-nlri      flag 0x90 type 0x0e len 0x58 payload 4004 4704 C0A8 0033 0000 0200 4B02 0000 0000 0000 0000 0100 001A 0200 0004 0000 12E5 0201 0004 0000 0000 0203 0006 2191 5800 1075 0101 001A 0200 0004 0000 12E5 0201 0004 0000 0000 0203 0006 2191 5800 1025 0107 0002 0002
Tue, 07 May 2024 08:50:43 18047  parser        NLRI      bgp-ls bgp-ls      without path-information     payload 0002 004B 0200 0000 0000 0000 0001 0000 1A02 0000 0400 0012 E502 0100 0400 0000 0002 0300 0621 9158 0010 7501 0100 1A02 0000 0400 0012 E502 0100 0400 0000 0002 0300 0621 9158 0010 2501 0700 0200 02
Tue, 07 May 2024 08:51:49 18047  reactor       ^C received
Tue, 07 May 2024 08:51:49 18047  reactor       performing shutdown
Tue, 07 May 2024 08:51:49 18047  outgoing-1    stop, message [shutting down]
Tue, 07 May 2024 08:51:49 18047  outgoing-1    outgoing-1 192.168.0.41-192.168.0.51, closing connection
Tue, 07 May 2024 08:51:49 18047  outgoing-1    stop, message [shutting down]
Tue, 07 May 2024 08:51:49 18047  outgoing-1    outgoing-1 192.168.0.41-192.168.0.61, closing connection
Tue, 07 May 2024 08:51:49 18047  process       terminating process parsed-route-backend

Please let me know if you need any more info.

@thomas-mangin
Copy link
Member

thomas-mangin commented May 7, 2024

4.2.11 is from August 2020 .. a lot has happened in nearly 4 years ...

I support bug fixes on master and perform partial backport to the 4.2 branch - depending on complexity. So, could you please re-check with 4.2.21 (as was indicated in the initial report) if you still have the issue, if still unlucky, then can you please check the branch main and see if the issue is still there?

The syntax to start exabgp on master changed slightly, I attempted to keep some backward compatibility, but please check the exabgp command line syntax if you need to leave the 4.2 branch.

I will assume you used the version provided by your OS vendor. The simpler may be to de-install the package and either re-install exabgp via pip or as a zipapp as explained on the readme, and adding exabgp to the path if required.

@thomas-mangin
Copy link
Member

For information, exabgp has no code to set the TCP window to zero, so I will attempt an educated guess that it is caused by the OS as the network inbound buffer in the kernel for the application is full. It would happen if the packets were not "consumed" by ExaBGP.

For it to happen, the application would need to be stuck (the code is async). I only saw this behaviour due to the pipe being filled to or from the external process.

Previously, some other users failed to realise that they had multiple versions of exabgp installed on the machine (having exabgp installed via pip and an os package is common).

It led them to think they had set the pipe option when they had not. You could check by changing the server command line to print the environment to be sure it is not the issue.

@muyiwhy
Copy link
Author

muyiwhy commented May 7, 2024

4.2.11 is from August 2020 .. a lot has happened in nearly 4 years ...

I support bug fixes on master and perform partial backport to the 4.2 branch - depending on complexity. So, could you please re-check with 4.2.21 (as was indicated in the initial report) if you still have the issue, if still unlucky, then can you please check the branch main and see if the issue is still there?

The syntax to start exabgp on master changed slightly, I attempted to keep some backward compatibility, but please check the exabgp command line syntax if you need to leave the 4.2 branch.

I will assume you used the version provided by your OS vendor. The simpler may be to de-install the package and either re-install exabgp via pip or as a zipapp as explained on the readme, and adding exabgp to the path if required.

I also tried the version 4.2.21 and get the same result. And I will try the main branch tomorrow.

@muyiwhy
Copy link
Author

muyiwhy commented May 7, 2024

For information, exabgp has no code to set the TCP window to zero, so I will attempt an educated guess that it is caused by the OS as the network inbound buffer in the kernel for the application is full. It would happen if the packets were not "consumed" by ExaBGP.

For it to happen, the application would need to be stuck (the code is async). I only saw this behaviour due to the pipe being filled to or from the external process.

Previously, some other users failed to realise that they had multiple versions of exabgp installed on the machine (having exabgp installed via pip and an os package is common).

It led them to think they had set the pipe option when they had not. You could check by changing the server command line to print the environment to be sure it is not the issue.

OK, I will try this, just to be clear, by checking the environment, I should focus on the api.ack option, right?

@thomas-mangin
Copy link
Member

I can not tell you what you should focus on. I can only offer guesses and I am not a very good fortune teller. I have no idea what your code is doing, you did not share it. I can only tell you what may cause issues from past users.

But yes, you should perhaps try to run the program from the command line. ExaBGP can print what it parsed from the environment and env configuration file. If correct you could run it with -d from the command line and see if the issue is occuring.

If it does happen, you may want to run the program with strace (perhaps using -f to also debug your own code) to figure out what OS call is causing the program to halt.

@muyiwhy
Copy link
Author

muyiwhy commented May 8, 2024

OK, I will give it a try. I will come back with more detail info if the issue still happened.

@muyiwhy
Copy link
Author

muyiwhy commented May 9, 2024

I tried with the version 4.2.21 and make sure the api.ack is set to false with the --fi option and it still fails. I cannot try the master branch since I cannot upgrade the python from 3.6 to 3.7 on live server.

My process script is simple, just get the messages from stdin and append them to a file.

The exabgp config file is:

process parsed-route-backend {
        run python /etc/exabgp/example.py;
        encoder json;
}

neighbor 192.168.0.51 {
        local-address 192.168.0.41;
        local-as 4837;
        peer-as 4837;
        family {
        bgp-ls bgp-ls;
    }

        api {
                processes [ parsed-route-backend ];
                receive {
                        parsed;
                        update;
                        keepalive;
                }
        }  
}

neighbor 192.168.0.61 {
        local-address 192.168.0.41;
        local-as 4837;
        peer-as 4837;
        family {
        bgp-ls bgp-ls;
    }

        api {
                processes [ parsed-route-backend ];
                receive {
                        parsed;
                        update;
                        keepalive;
                }
        }
}

And the process script is:

import sys
import time

output_filename = "/etc/exabgp/output.txt"

with open(output_filename, "a") as file:
    while True:
        try:
            line = sys.stdin.readline().strip()
            file.write(line + "\n")
            file.flush()  
            if line == "":
                time.sleep(1)
                continue

        except KeyboardInterrupt:
            break  
        except IOError:
            pass

And attach with the strace output here:
strace.tgz

@thomas-mangin
Copy link
Member

thomas-mangin commented May 9, 2024

Looking at the release note of Python 3.7, 3.6 could still work for master.

Reading the strace before receiving SIGTERM:

3009  write(5, "{ \"exabgp\": \"4.0.1\", \"time\": 171"..., 155) = 155
3009  select(0, NULL, NULL, NULL, {tv_sec=0, tv_usec=100000} <unfinished ...>
3024  <... read resumed>"{ \"exabgp\": \"4.0.1\", \"time\": 171"..., 8192) = 155

This ExaBGP writing to your program.

3024  write(3, "{ \"exabgp\": \"4.0.1\", \"time\": 171"..., 155) = 155

This is your program flushing to the file.

3024  read(0,  <unfinished ...>
3009  <... select resumed>)             = 0 (Timeout)

Your program is trying to read from stdin and timing out. I do not see any sleep as I would expect.

3009  lseek(4, 0, SEEK_END)             = 93289
3009  lseek(4, 0, SEEK_CUR)             = 93289
3009  write(4, "Thu, 09 May 2024 13:15:05 3009  "..., 88) = 88

ExaBGP writing something with time ... as strace is not setup to send the whole string, -s0 ? the message sent is not captured.

3009  mmap(NULL, 8392704, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_STACK, -1, 0) = 0x7f8868d9c000
3009  mprotect(0x7f8868d9c000, 4096, PROT_NONE) = 0
3009  clone(child_stack=0x7f886959bfb0, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0x7f886959c9d0, tls=0x7f886959c700, child_tidptr=0x7f886959c9d0) = 3213

clone ? This would mean something is started by fork, and the only thing forking in ExaBGP is the process.

3213  set_robust_list(0x7f886959c9e0, 24) = 0
3213  futex(0x7f88785f9584, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 1, {tv_sec=1715231705, tv_nsec=594786000}, 0xffffffff <unfinished ...>
3009  futex(0x7f88785f9584, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x7f88785f9580, FUTEX_OP_SET<<28|0<<12|FUTEX_OP_CMP_GT<<24|0x1) = 1
3213  <... futex resumed>)              = 0
3009  futex(0x15e44a0, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
3213  futex(0x7f88785f9540, FUTEX_WAKE_PRIVATE, 1) = 0
3213  mmap(NULL, 134217728, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0) = 0x7f8860d9c000
3213  munmap(0x7f8860d9c000, 52838400)  = 0
3213  munmap(0x7f8868000000, 14270464)  = 0
3213  mprotect(0x7f8864000000, 135168, PROT_READ|PROT_WRITE) = 0
3213  futex(0x15e44a0, FUTEX_WAKE_PRIVATE, 1) = 1
3009  <... futex resumed>)              = 0
3213  kill(3024, SIGTERM <unfinished ...>

It was killed

Try changing your code to have:

import sys
import time

output_filename = "/etc/exabgp/output.txt"

with open(output_filename, "a") as file:
    while True:
        try:
            line = sys.stdin.readline().strip()
            file.write(line + "\n")
            file.flush()  
            if line == "":
                time.sleep(1)
                continue

        except Exception as exc:
            file.write(str(exc) + '\n')

This may give you an idea of what is wrong and if something is wrong.

But at first sight nothing obviously wrong.

@muyiwhy
Copy link
Author

muyiwhy commented May 9, 2024

I run the sbin/exabgp on master branch and it says exabgp need python3.7. I will try to use release to rebuild an exabgp. And I will also try to see if there are anything wrong with the script.

@thomas-mangin
Copy link
Member

thomas-mangin commented May 9, 2024

I can not remember why 3.7 was picked over 3.6 ... You can try to edit the test in the code to change it.

@muyiwhy
Copy link
Author

muyiwhy commented May 10, 2024

Turns out the master branch is working (at least not stuck at the first stage) after I remove the validation of python version. I will do more testing and come back to close the issue if everything works fine.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants