Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

In OTP 26, Erlang's SSH client fails to connect if it receives an {'EXIT', Pid, normal} message from an unrelated process. #8223

Open
matthiasl opened this issue Mar 4, 2024 · 13 comments
Assignees
Labels
bug Issue is reported as a bug stalled waiting for input by the Erlang/OTP team team:PS Assigned to OTP team PS

Comments

@matthiasl
Copy link
Contributor

Describe the bug

If there’s an {‘EXIT’, _, normal} message in a process’ mailbox when calling ssh:connect(), the SSH library terminates without attempting to create a connection.

To Reproduce

Complete test case here:

%% Throwaway to debug SSH problems
%%
%% The Erlang SSH library doesn't seem to like old 'EXIT' messages left
%% lying around in the message queue. The mssage gets matched in
%% ssh_connection_handler:handshake/3; it looks like this behaviour was
%% added by 1963f9501
%%
%% I can reproduce the problem in otp_src_26.0.2
%% I cannot reproduce the problem in otp_src_23.1
%%
%%
-module(test_ssh_crash).

-export([works/0, fails/0]).

works() ->
    crypto:start(),
    ssh:start(),
    {ok, _SSH} = ssh:connect("localhost", 22,
			    [{user, "bogus"}, {password, "bogus"},
			     {silently_accept_hosts, true}]).

fails() ->
    self() ! {'EXIT', self(), normal},
    works().

Expected behavior

On a machine with an SSH server, both works() and fails() should terminate with an authentication failure.

Actual behaviour is that fails() exits prematurely.

(And, if you replace the user/password with something valid, then works() will correctly connect to the SSH server, whereas fails() will not succeed with the connection.)

Affected versions

Seen in OTP 25 and OTP 26.

Bug possibly introduced in 1963f95

Additional context

A workaround is to launch SSH from a process where you're sure you can't get any 'EXIT' messages.

@matthiasl matthiasl added the bug Issue is reported as a bug label Mar 4, 2024
@u3s u3s added the team:PS Assigned to OTP team PS label Mar 4, 2024
@u3s u3s self-assigned this Mar 4, 2024
@Maria-12648430
Copy link
Contributor

Maria-12648430 commented Mar 5, 2024

I analysed the problem a bit and it is indeed a bug introduced by 1963f95 as @matthiasl guessed.


handshake(Pid, Ref, Timeout) ->
receive
{Pid, ssh_connected} ->
erlang:demonitor(Ref, [flush]),
{ok, Pid};
{Pid, {not_connected, Reason}} ->
erlang:demonitor(Ref, [flush]),
{error, Reason};
{'DOWN', Ref, process, Pid, {shutdown, Reason}} ->
{error, Reason};
{'DOWN', Ref, process, Pid, Reason} ->
{error, Reason};
{'EXIT',_,Reason} ->
stop(Pid),
{error, {exit,Reason}}
after Timeout ->
erlang:demonitor(Ref, [flush]),
ssh_connection_handler:stop(Pid),
{error, timeout}
end.

The problem is receiving an 'EXIT' message at lines 502-504.

I was wondering why this is even done, since the process here is the process calling ssh:connect.

It turns out that this is because ssh_connection_handler:handshake/3, via ssh_connection_handler:takeover/4 and ssh_system_sup:start_subsystem, may be called from different places and in different contexts:

  • coming from a call to ssh:connect or ssh:daemon, in which case the process ultimately running handshake is the process calling that respective function. This process may be trapping exits or not; in either case, handling 'EXIT' messages in handshake makes no sense in that context, and is even harmful since it will steal the message
  • coming from inside a ssh_acceptor process, which does trap exits, and for which handling 'EXIT' messages does make sense

@MarkoMin
Copy link
Contributor

MarkoMin commented Mar 5, 2024

I was wondering why this is even done, since the process here is the process calling ssh:connect.

This is done because you want to gracefully shut down gen_statem that's handling the connection. It doesn't just exit, it sends some replies beforehand.

However, this is a problem because you don't know if you received your EXIT messages before or after spawning a connection handler.

I think we should move trapping exits to gen_statem directly, rather than trapping them in the user process.
Pros:

  1. user process won't get trap_exit value modified
  2. statem can still gracefully shut-down
  3. statem is linked rather than monitored by the user process.

@Maria-12648430
Copy link
Contributor

@MarkoMin hm, I think you got some of it wrong, it's not quite like that... But hard to explain in just a few words 🤷‍♀️

In a nutshell, receiving and handling 'EXIT's makes sense on the server side, when a connection is made to it. More to the point, it makes sense if the server has the parallel_login option set to true, but if not it is of no consequence. On the client side, it doesn't make sense and is harmful even if the client process is trapping exits.

(And no, the user process doesn't get its trap_exit modified even now.)

@MarkoMin
Copy link
Contributor

MarkoMin commented Mar 5, 2024

Aham, I see where I got it wrong, I looked only from the client side...

The solution now makes complete sense, but there is one more process to think about (except Parent) when receiving EXITs, and that is Socket process (IIRC sockets are implemented as processes), because process that traps exits is the controlling process of the socket. I'm not sure if socket process sends EXIT messages to the controlling process, but that is something that one should be aware of.

@Maria-12648430
Copy link
Contributor

Off the top of my head, I have no idea 😅

Maybe @u3s knows? *poke*

@u3s
Copy link
Contributor

u3s commented Mar 5, 2024

I thought that was the purpose (or one of them) for handling EXIT messages (to handle failing socket) in ssh_connection_handler.
But I just glimpsed on the code so far (busy doing other older ssh issues).

I plan to have deeper look on code later.

@Maria-12648430
Copy link
Contributor

Maria-12648430 commented Mar 5, 2024

Hm, I don't know, now that you say that... I thought it was to act on the death of the acceptor process, since with parallel_login (which was mentioned in the commit where the 'EXIT' was introduced, the passing around of the socket is done in a separate intermediate spawn_linked process. Pity @HansN isn't around any more to ask...

@Maria-12648430
Copy link
Contributor

Maria-12648430 commented Mar 6, 2024

@u3s @MarkoMin this is what I found out. Take this with a grain of salt though, I may easily have missed something there.

Regarding sockets and links, mainly by playing around in the shell. (I set trap_exit to true in my process to see what happens.)

  • when using the inet backend, the socket is a port, which is linked to the controlling process; if the socket just gets closed by the remote end one way or other, no 'EXIT' message is generated; only if the port exits unexpectedly (crashes or gets killed or whatever), an 'EXIT' message is generated which can be received
  • when using the socket backend, the socket is handled by a process, which is not linked to the controlling process; no matter what happens to the socket, no 'EXIT' message is generated

Regarding ssh, this is what I found out.

  • if the connection is incoming at the server and handled by an acceptor process...
    • if parallel_login is false, the acceptor process starts a ssh_system_sup for it, then waits in ssh_connection_handler:handshake for the pid of the process that handles it, an error, or a timeout
    • if parallel_login is true, the acceptor process spawn_links another process to do the above. In addition to the above, this process also waits for an ' EXIT' message which can AFAICS reasonably only be caused by the death of the (parent) acceptor process, in which case receiving the 'EXIT' message lets the waiting process exit with an error early (ie without waiting for the timeout to expire); the socket can only cause an 'EXIT' if it is using the inet backend (ie, is a port) and that port exits unexpectedly

If this is all correct, it would mean that my fix in #8226 is sound.

@MarkoMin
Copy link
Contributor

MarkoMin commented Mar 6, 2024

  • ' EXIT' message which can AFAICS reasonably only be caused by the death of the (parent) acceptor process; the socket can only cause an 'EXIT' if it is using the inet backend (ie, is a port) and the port exits unexpectedly

So what happens when port exits unexpectedly and we're trapping exits? Wouldn't that just be ignored if we're just matching on parent exit?

@Maria-12648430
Copy link
Contributor

So what happens when port exits unexpectedly and we're trapping exits? Wouldn't that just be ignored if we're just matching on parent exit?

Yes. I can provide for that easily if it is wanted, ie I'm not against it 🤷‍♀️ I'll just point out that this will then work somewhat differently depending on whether the inet or the socket backend is used, and since the socket backend is supposed to become the default at some point, the behavior would depend on the OTP release.

@IngelaAndin
Copy link
Contributor

The socket backend is suppose to behave the same as the inet-backend and will not become default until it does.
On a general note application processes in this case, ssh, should not create any links to processes not part of the ssh supervision tree, if such processes needs to be tracked they should be monitored, and ssh should not make any explicit receives in other processes contexts, should not at all should use the behavior callbacks. (I have not hade time to look into this issue, just trying to give some general guidance) I think Hans might have been a little over enthusiastic to make client and server use the same code everywhere as ssh protocol goes to quite some lengths to make the client and the server "equals" or "behaving the same", and might have failed to realize some differences in socket handling.

@Maria-12648430
Copy link
Contributor

Hi Ingela 👋

While I believe that my PR solves the issue described by the OP, it may be possible that it thereby fosters some possible structural errors in the ssh application that you hinted at. Therefore, I'll put the PR in Draft status for the time being, and wait for further instructions 🙂

@matthiasl
Copy link
Contributor Author

Thanks for the fix.

Building OTP from ea38c67 (in https://github.com/Maria-12648430/otp.git) completely solves the problem for me.

The minimal testcase I posted here now behaves as expected, as does the live code which triggered the debugging.

Particularly interesting to read the discussion about the code being called from two contexts.

@IngelaAndin IngelaAndin added the stalled waiting for input by the Erlang/OTP team label Mar 12, 2024
@u3s u3s removed the stalled waiting for input by the Erlang/OTP team label Mar 26, 2024
@IngelaAndin IngelaAndin added the stalled waiting for input by the Erlang/OTP team label May 7, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Issue is reported as a bug stalled waiting for input by the Erlang/OTP team team:PS Assigned to OTP team PS
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants