Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

twisted.internet.test.test_tcp.AbortConnectionTests_*Tests.test_fullWriteBuffer test timeout on macOS #12151

Open
glyph opened this issue May 2, 2024 · 13 comments · May be fixed by #12162
Open

Comments

@glyph
Copy link
Member

glyph commented May 2, 2024

while debugging #12150 I saw this error locally:

===============================================================================
[ERROR]
Traceback (most recent call last):
  File "/Users/glyph/Projects/Twisted/.tox/py312-alldeps-withcov-macos/lib/python3.12/site-packages/twisted/internet/test/test_tcp.py", line 2867, in test_fullWriteBuffer
    self.runAbortTest(StreamingProducerClient, NoReadServer)
  File "/Users/glyph/Projects/Twisted/.tox/py312-alldeps-withcov-macos/lib/python3.12/site-packages/twisted/internet/test/test_tcp.py", line 2763, in runAbortTest
    reactor = runProtocolsWithReactor(self, server, client, self.endpoints)
  File "/Users/glyph/Projects/Twisted/.tox/py312-alldeps-withcov-macos/lib/python3.12/site-packages/twisted/internet/test/connectionmixins.py", line 177, in runProtocolsWithReactor
    reactorBuilder.runReactor(reactor)
  File "/Users/glyph/Projects/Twisted/.tox/py312-alldeps-withcov-macos/lib/python3.12/site-packages/twisted/internet/test/reactormixins.py", line 356, in runReactor
    raise TestTimeoutError(f"reactor still running after {timeout} seconds")
twisted.internet.test.reactormixins.TestTimeoutError: reactor still running after 120.0 seconds

twisted.internet.test.test_tcp.AbortConnectionTests_AsyncioSelectorReactorTests.test_fullWriteBuffer
-------------------------------------------------------------------------------
Ran 10823 tests in 138.537s

FAILED (skips=584, errors=1, successes=10238)
ERROR: InvocationError for command /Users/glyph/Projects/Twisted/.tox/py312-alldeps-withcov-macos/bin/coverage run -p --rcfile=/Users/glyph/Projects/Twisted/.coveragerc -m twisted.trial --temp-directory=/Users/glyph/Projects/Twisted/.tox/py312-alldeps-withcov-macos/tmp/_trial_temp --reporter=verbose -j8 twisted (exited with code 1)

which suggests that maybe CFReactor isn't the only flaky test in there…

@glyph glyph added the bug label May 2, 2024
@adiroiban
Copy link
Member

adiroiban commented May 2, 2024

I am trying to reproduce it.

I am using something like this on Ubuntu...

$ trial -u -j 8 twisted.internet.test.test_tcp

It failed with

twisted.internet.test.test_tcp.TCPTransportTestsBuilder_GIReactorTests.test_writeSequenceWithUnicodeRaisesException
===============================================================================
[ERROR]
Traceback (most recent call last):
  File "/home/adi/chevah/twisted/src/twisted/internet/posixbase.py", line 487, in _doReadOrWrite
    why = selectable.doWrite()
  File "/home/adi/chevah/twisted/src/twisted/internet/tcp.py", line 617, in doConnect
    self._connectDone()
  File "/home/adi/chevah/twisted/src/twisted/internet/tcp.py", line 630, in _connectDone
    self.protocol = self.connector.buildProtocol(self.getPeer())
  File "/home/adi/chevah/twisted/src/twisted/internet/base.py", line 1291, in buildProtocol
    return self.factory.buildProtocol(addr)
  File "/home/adi/chevah/twisted/src/twisted/internet/protocol.py", line 131, in buildProtocol
    assert self.protocol is not None
builtins.AssertionError: 

twisted.internet.test.test_tcp.TCP4ClientTestsBuilder_EPollReactorTests.test_clientConnectionFailedStopsReactor
===============================================================================
[ERROR]
Traceback (most recent call last):
  File "/home/adi/chevah/twisted/src/twisted/internet/asyncioreactor.py", line 138, in _readOrWrite
    why = method()
  File "/home/adi/chevah/twisted/src/twisted/internet/tcp.py", line 617, in doConnect
    self._connectDone()
  File "/home/adi/chevah/twisted/src/twisted/internet/tcp.py", line 630, in _connectDone
    self.protocol = self.connector.buildProtocol(self.getPeer())
  File "/home/adi/chevah/twisted/src/twisted/internet/base.py", line 1291, in buildProtocol
    return self.factory.buildProtocol(addr)
  File "/home/adi/chevah/twisted/src/twisted/internet/protocol.py", line 131, in buildProtocol
    assert self.protocol is not None
builtins.AssertionError: 

twisted.internet.test.test_tcp.TCP4ClientTestsBuilder_AsyncioSelectorReactorTests.test_clientConnectionFailedStopsReactor
-------------------------------------------------------------------------------
Ran 473 tests in 0.410s

FAILED (skips=87, errors=2, successes=384)

If I don't run in parallel... it doesn't look like it's failing.

@adiroiban
Copy link
Member

Another failure anter running in a loop for about 1 minute

[ERROR]
Traceback (most recent call last):
  File "/home/adi/chevah/twisted/src/twisted/internet/selectreactor.py", line 148, in _doReadOrWrite
    why = getattr(selectable, method)()
  File "/home/adi/chevah/twisted/src/twisted/internet/tcp.py", line 617, in doConnect
    self._connectDone()
  File "/home/adi/chevah/twisted/src/twisted/internet/tcp.py", line 630, in _connectDone
    self.protocol = self.connector.buildProtocol(self.getPeer())
  File "/home/adi/chevah/twisted/src/twisted/internet/base.py", line 1291, in buildProtocol
    return self.factory.buildProtocol(addr)
  File "/home/adi/chevah/twisted/src/twisted/internet/protocol.py", line 131, in buildProtocol
    assert self.protocol is not None
builtins.AssertionError: 

twisted.internet.test.test_tcp.TCP6ClientTestsBuilder_SelectReactorTests.test_clientConnectionFailedStopsReactor
-------------------------------------------------------------------------------
Ran 473 tests in 0.810s

FAILED (skips=87, errors=1, successes=385)

@glyph
Copy link
Member Author

glyph commented May 3, 2024

The issue here has to do with vagaries regarding the maximum outbound buffer size. It writes one big outgoing chunk, then assumes that this chunk is big enough to flood its own buffer. IF the OS were to actually consume that data somehow, and just put it into a big buffer, hanging is what we would logically see.

@glyph
Copy link
Member Author

glyph commented May 3, 2024

I can actually get a failure relatively quickly with just trial -u twisted.internet.test.test_tcp.AbortConnectionTests_AsyncioSelectorReactorTests.test_fullWriteBuffer; it craps out a bit over 500 iterations for me.

@adiroiban
Copy link
Member

trial -u twisted.internet.test.test_tcp.AbortConnectionTests_AsyncioSelectorReactorTests.test_fullWriteBuffer

On Ubuntu Desktop 23.10 (default kernel setup) I am not able to get that test to fail.
It was running for more than 30.000 iteration, without a failure

@adiroiban
Copy link
Member

adiroiban commented May 3, 2024

After the merge of #12160 i still get an error.... but it might be due to OS limitations

$ netstat -tan | grep TIME_WAIT | wc -l
21214
[ERROR]
Traceback (most recent call last):
  File "/home/adi/chevah/twisted/src/twisted/internet/asyncioreactor.py", line 138, in _readOrWrite
    why = method()
  File "/home/adi/chevah/twisted/src/twisted/internet/tcp.py", line 617, in doConnect
    self._connectDone()
  File "/home/adi/chevah/twisted/src/twisted/internet/tcp.py", line 630, in _connectDone
    self.protocol = self.connector.buildProtocol(self.getPeer())
  File "/home/adi/chevah/twisted/src/twisted/internet/base.py", line 1291, in buildProtocol
    return self.factory.buildProtocol(addr)
  File "/home/adi/chevah/twisted/src/twisted/internet/protocol.py", line 131, in buildProtocol
    assert self.protocol is not None
builtins.AssertionError: 

twisted.internet.test.test_tcp.TCP4ClientTestsBuilder_AsyncioSelectorReactorTests.test_clientConnectionFailedStopsReactor
-------------------------------------------------------------------------------
Ran 473 tests in 0.419s

FAILED (skips=87, errors=1, successes=385)

Sometimes the error is something like this ...no sure if it's related


[FAIL]
Traceback (most recent call last):
  File "/home/adi/chevah/twisted/src/twisted/internet/test/test_tcp.py", line 684, in test_addresses
    self.fail(clientFactory.failReason.getTraceback())
twisted.trial.unittest.FailTest: Traceback (most recent call last):
Failure: twisted.internet.error.ConnectBindError: Couldn't bind: 98: Address already in use.


twisted.internet.test.test_tcp.TCP4ClientTestsBuilder_AsyncioSelectorReactorTests.test_addresses
-------------------------------------------------------------------------------
Ran 473 tests in 0.381s

@glyph
Copy link
Member Author

glyph commented May 3, 2024

twisted.internet.test.test_tcp.TCP4ClientTestsBuilder_AsyncioSelectorReactorTests.test_addresses

I am pretty sure this one is an OS limitation.

@glyph
Copy link
Member Author

glyph commented May 3, 2024

twisted.internet.test.test_tcp.TCP4ClientTestsBuilder_AsyncioSelectorReactorTests.test_clientConnectionFailedStopsReactor

I suspect this one is yet another, different test flake :)

@glyph
Copy link
Member Author

glyph commented May 3, 2024

It was running for more than 30.000 iteration, without a failure

I have instrumented it and I can get it to fail locally, reliably, in the same way: the server connection doesn't get notified of its connection dropping.

And, oh, of course, this is a predictable failure mode, because the server does stopReading and never starts again, which means that various reactors will not be able to detect that it has lost its connection. I am surprised this doesn't happen more often.

@glyph
Copy link
Member Author

glyph commented May 3, 2024

This might just be a bug in asyncio.

There are definitely some potential race conditions in the Twisted code here too, but when I fixed them and instrumented doRead and looked at when it was being called, I can see that when a second call to recv would provoke EWOULDBLOCK, asyncio's main loop does not wake up again to notify that the socket is disconnected; despite the fact that the socket reader is still registered as a reader with the asyncio loop, the loop stays asleep.

@glyph
Copy link
Member Author

glyph commented May 3, 2024

Scratch that, not asyncio, I can reproduce it on the select reactor.

@glyph
Copy link
Member Author

glyph commented May 3, 2024

It happens much more frequently if I lose focus on the terminal window where the test is running which makes me wonder if the OS is implicated too

@glyph glyph changed the title twisted.internet.test.test_tcp.AbortConnectionTests_AsyncioSelectorReactorTests.test_fullWriteBuffer test timeout twisted.internet.test.test_tcp.AbortConnectionTests_AsyncioSelectorReactorTests.test_fullWriteBuffer test timeout on macOS May 3, 2024
@glyph glyph changed the title twisted.internet.test.test_tcp.AbortConnectionTests_AsyncioSelectorReactorTests.test_fullWriteBuffer test timeout on macOS twisted.internet.test.test_tcp.AbortConnectionTests_*Tests.test_fullWriteBuffer test timeout on macOS May 3, 2024
@glyph
Copy link
Member Author

glyph commented May 3, 2024

I think I'm going to stop debugging this for a while because this looks like a kernel bug where select (and kqueue, and CFRunLoop, and select and friends) will notify that a socket whose peer has closed has some work to do, recv() will give errno 35, and if you busy-loop polling the file descriptor, it will start giving errno 54 as expected after 5 seconds or so, but if you just select() on it it will not start appearing as readable at that transition. Building a test case to illustrate that the sockets layer is broken in this way would be an annoying amount of work, because we'd really want an absolutely minimal test case that doesn't involve any Twisted code, just sockets.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants