[WIP][native] Replace fixed worker port with ephemeral ports #22748

Draft: czentgr wants to merge 1 commit into master from cz_auto_worker_port

czentgr
Contributor

@czentgr czentgr commented May 14, 2024

Previously, the listener ports for the native workers in the E2E tests were hard-coded to 1234 + worker count.
This change asks the OS for an available ephemeral port and uses that value when spawning the native workers.
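
For illustration only, the general "ask the OS for a free port" technique can be sketched as below. This is not the code from this PR: the names `find_free_port` and `worker_ports` are hypothetical, and the sketch is in Python for brevity. It assumes workers are launched one after another, and the port is only reserved while the probe socket is open.

```python
import socket
from typing import List


def find_free_port(host: str = "127.0.0.1") -> int:
    """Ask the OS for a currently unused ephemeral port on `host`.

    The probe socket is closed before returning, so nothing reserves the
    port afterwards; another process could claim it before the worker binds.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        # Port 0 tells the OS to pick any free port.
        probe.bind((host, 0))
        # getsockname() reports the port the OS actually assigned.
        return probe.getsockname()[1]


def worker_ports(worker_count: int) -> List[int]:
    """Pick one port per worker, sequentially (hypothetical helper)."""
    return [find_free_port() for _ in range(worker_count)]
```

Each returned port would then be passed to the worker being spawned, replacing the fixed 1234 + worker count value.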

Description

Motivation and Context

On my Mac I encountered problems running the E2E native tests. The worker was up and running and listening on the port, yet for some reason the HTTP request timed out. The connection was established but there was no response.

Connections succeeded, but every request went unanswered. Switching to a different port resolved the issue.

$ curl -v http://127.0.0.1:1234/v1/info 
*   Trying 127.0.0.1:1234...
* Connected to 127.0.0.1 (127.0.0.1) port 1234
> GET /v1/info HTTP/1.1
> Host: 127.0.0.1:1234
> User-Agent: curl/8.4.0
> Accept: */*
>
^C
I20240514 17:32:42.203320 52237991 PrestoServer.cpp:260] [PRESTO_STARTUP] Starting server at :::1234 (127.0.0.1)
presto_se 52290 czentgr   44u    IPv6 0x630812116eb7c8e3       0t0      TCP *:search-agent (LISTEN)

and in the logs

2024-05-14T17:25:15.854-0500    WARN    node-state-poller-0     com.facebook.presto.metadata.HttpRemoteNodeState        Node state update request to http://127.0.0.1:1234/v1/info/state has not returned in 10.05s
...
2024-05-14T17:26:36.941-0500    WARN    UpdateResponseHandler-20240514_222407_00000_zb5hw.0.0.0.0-572   com.facebook.presto.server.RequestErrorTracker  Error updating task 20240514_222407_00000_zb5hw.0.0.0.0: java.util.concurrent.TimeoutException: Total timeout 10000 ms elapsed: http://127.0.0.1:1234/v1/task/20240514_222407_00000_zb5hw.0.0.0.0

continuously until the test case fails.

Impact

Test Plan

Contributor checklist

  • Please make sure your submission complies with our development, formatting, commit message, and attribution guidelines.
  • PR description addresses the issue accurately and concisely. If the change is non-trivial, a GitHub Issue is referenced.
  • Documented new properties (with its default value), SQL syntax, functions, or other functionality.
  • If release notes are required, they follow the release notes guidelines.
  • Adequate tests were added if applicable.
  • CI passed.

Release Notes

Please follow release notes guidelines and fill in the release notes below.

== NO RELEASE NOTE ==

@ZacBlanco
Contributor

ZacBlanco commented May 14, 2024

Just want to leave my $.02 here. We had similar problems at my old job and attempted this type of solution to grab free worker ports. Ultimately, for our E2E integration tests it ended up being more reliable to pick a fixed port number that the OS does not usually use. This type of port selection didn't work because, after releasing the socket back to the OS, we quite often hit a race condition: the OS would hand out the same port back to back or in close succession, before the port had actually been bound by the new server's socket.

I think in this case, since we don't launch workers in parallel, we probably won't run into this situation as often, but I do have a PR which parallelizes the launching that would probably cause issues (#22212). I think a better solution would let the worker bind to port 0, and then query the process internally for its assigned port once the socket is returned by the OS to the worker.
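
A minimal sketch of the "bind to port 0, then ask for the assigned port" idea, assuming a plain TCP listener. It is written in Python for brevity rather than in the worker's own language, and `listen_on_os_assigned_port` is a hypothetical name. Because the listening socket is never released, there is no gap in which another process could take the port.

```python
import socket


def listen_on_os_assigned_port(host: str = "127.0.0.1"):
    """Bind to port 0, start listening, and report the port the OS chose."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # Port 0 lets the OS choose; the port belongs to this socket from here on.
    server.bind((host, 0))
    server.listen()
    # Query the socket itself for the assigned port.
    port = server.getsockname()[1]
    return server, port


server, port = listen_on_os_assigned_port()
try:
    # A client can connect to the reported port while the server holds it.
    with socket.create_connection(("127.0.0.1", port), timeout=5) as client:
        connected_to = client.getpeername()[1]
finally:
    server.close()
```

The design difference from probing for a free port is that the process that will serve requests is the one that binds, so the reported port is always the one actually in use.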

@czentgr
Contributor Author

czentgr commented May 15, 2024

@ZacBlanco Thank you for your comment.
Yes, my assumption here is that the workers are sequentially launched (hence the comment in the code). If this occurs in parallel then this won't work reliably.

I also thought about the Prestissimo side. We don't need to define a fixed port in the config, because the worker tells the coordinator how to reach it during the announcement. But I'm not sure what would be needed for the HttpServer, since we pass in the http/https config. I would need to look into it a bit more.
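
As a rough sketch of that idea (not Prestissimo's actual HttpServer or announcement code), the example below starts an HTTP server on port 0 and only builds its announcement once the real port is known. It uses Python's standard library; `InfoHandler`, `start_worker`, and the announcement fields are hypothetical stand-ins.

```python
import http.client
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


class InfoHandler(BaseHTTPRequestHandler):
    """Hypothetical stand-in for a worker endpoint; echoes the request path."""

    def do_GET(self):
        body = json.dumps({"path": self.path}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        # Keep the example quiet; no per-request logging.
        pass


def start_worker():
    # Port 0: the OS assigns the port when the server binds.
    server = HTTPServer(("127.0.0.1", 0), InfoHandler)
    port = server.server_address[1]
    # Deferred configuration: the announcement is built only after the
    # port is known (hypothetical format, not Presto's real announcement).
    announcement = {"host": "127.0.0.1", "port": port}
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server, announcement


server, announcement = start_worker()
try:
    # A "coordinator" reaches the worker using only the announced address.
    conn = http.client.HTTPConnection(
        announcement["host"], announcement["port"], timeout=5
    )
    conn.request("GET", "/v1/info")
    info = json.loads(conn.getresponse().read())
    conn.close()
finally:
    server.shutdown()
    server.server_close()
```

Anything that depends on the port, such as the address the worker announces, has to be computed after the bind rather than read from static config.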

@czentgr czentgr changed the title from "[native] Replace fixed worker port with ephemeral ports" to "[WIP][native] Replace fixed worker port with ephemeral ports" on May 15, 2024
@czentgr czentgr force-pushed the cz_auto_worker_port branch 3 times, most recently from 2184790 to 92c4482, on May 16, 2024 22:25
Previously, the listener ports for the native workers in the E2E tests
were hard-coded to 1234 + worker count.
This change asks the OS for an available ephemeral port
and uses that value when spawning the native workers.

The native worker must then defer some configuration until the
port selection by the OS is known.