Skip to content
This repository has been archived by the owner on Oct 28, 2021. It is now read-only.

fixes graceful termination when aleth --test is invoked but require… #5146

Open
wants to merge 3 commits into
base: master
Choose a base branch
from
Open

fixes graceful termination when aleth --test is invoked but require… #5146

wants to merge 3 commits into from

Conversation

christianparpart
Copy link
Member

Fixes graceful termination when aleth --test is invoked but required a SIGKILL to terminate.

This was because the UnixSocketServer requires at least one accepting
connection after SIGTERM/SIGINT has been received, which usually never
is the case.

This PR implements a trivial workaround by first waiting for up to N
msecs (N=1000 for now) for any input to arrive, and if not, it'll just
retry, hence, evaluating m_running, resulting in aleth --test to
properly terminate upon receival of SIGINT or SIGTERM.

@chfast chfast requested review from axic and gumb0 July 27, 2018 13:12
FD_ZERO(&out);
FD_ZERO(&err);
FD_SET(fd, &in);
struct timeval tv;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In a more C++ way this could be timeval const tv = { timeoutMillis / 1000, timeoutMillis % 1000 }

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was a little conservative. Thx. done! :-)

@@ -103,11 +104,27 @@ bool UnixDomainSocketServer::StopListening()
return false;
}

static bool waitForReadable(int fd, unsigned timeoutMillis)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please move it into unnamed namespace above.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Copy link
Member

@gumb0 gumb0 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also please add a comment describing why it's needed

@codecov-io
Copy link

codecov-io commented Jul 27, 2018

Codecov Report

Merging #5146 into develop will decrease coverage by 0.22%.
The diff coverage is 0%.

Impacted file tree graph

@@            Coverage Diff             @@
##           develop   #5146      +/-   ##
==========================================
- Coverage    59.92%   59.7%   -0.23%     
==========================================
  Files          337     337              
  Lines        27306   27271      -35     
  Branches      3173    3175       +2     
==========================================
- Hits         16362   16281      -81     
- Misses        9860    9906      +46     
  Partials      1084    1084

@gumb0
Copy link
Member

gumb0 commented Jul 27, 2018

Can there be any better approach than calling select every second? Like maybe is there a way to make select interrupted by SIGINT/SIGTERM ?

@christianparpart
Copy link
Member Author

hey @gumb0. of course there is, by using signalfd() (Linux-only, not OS/X), or with the help with eventfd() in order to break out of the waiting loop (such as select()). The latter can be emulated via pipe() on OS/X too.

However, I didn't wanna be too intrusive for a very first PR in this repo. I can however look into finding a somewhat minimal solution of the latter suggestion I made.

@gumb0
Copy link
Member

gumb0 commented Jul 30, 2018

Let's take a step back, can you describe how exactly do you reproduce it?
I saw this kind of deadlock sometimes, but not every time.

Now when I tried the following steps

  • start aleth -- test
  • send some RPC calls
  • send SIGINT to aleth

accept call is being interrupted and returns -1. And aleth process immediately finishes.

Christian Parpart added 2 commits August 1, 2018 12:52
…d a `SIGKILL` to terminate.

This was because the UnixSocketServer requires at least one accepting
connection *after* SIGTERM/SIGINT has been received, which usually never
is the case.

This PR implements a trivial workaround by first waiting for up to N
msecs (N=1000 for now) for any input to arrive, and if not, it'll just
retry, hence, evaluating m_running, resulting in `aleth --test` to
properly terminate upon receival of `SIGINT` or `SIGTERM`.
FD_ZERO(&err);
FD_SET(fd, &in);
timeval tv { timeoutMillis / 1000, timeoutMillis % 1000 };
return select(fd + 1, &in, &out, &err, &tv) > 0;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What fd + does?

// We do this test before calling accept() in order to gracefully exit
// this function when an external thread has set m_running to true
// (such as a signal handler in the main thread).
if (!waitForReadable(m_socket, 500))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is not way to add the timeout to accept()?

@chfast
Copy link
Collaborator

chfast commented Aug 1, 2018

start aleth -- test
send some RPC calls
send SIGINT to aleth

Try without " send some RPC calls".

@gumb0
Copy link
Member

gumb0 commented Aug 1, 2018

Still terminates fine without RPC calls.

Also tried running soltest and SIGINT after that, also aleth is terminated fine.

@chfast
Copy link
Collaborator

chfast commented Aug 1, 2018

I tried it tool and it terminates quickly. Can it be macOS @christianparpart ?

@christianparpart
Copy link
Member Author

So I was facing this issue on "Windows Subsystem for Linux" (which basically is a Linux kernel & ABI re-implementation within the Windows NT kernel).

However, I was trying to reproduce this behavior with a minimal test case, and I believe that I get the same behavior even on Ubuntu 16.04's Linux kernel 4.8.0 with the following test case (link below). So I may still be absolutely brain-dead right now, not finding the bug in my head (or we need this patch)

https://gist.github.com/christianparpart/8b89c9dbfba83478d1308ecc32b9753d

This code is a minimalistic TCP pong server that shares the same behavior this PR tries to fix within aleth's AF_UNIX listener code.

@christianparpart
Copy link
Member Author

I digged a little bit into it. it's actually not accept() being cancelled with -1 (EINTR) due to the SIGINT but a -1 (EINVAL) due to the unlink() on the socket. While that one should work on all UNIX'es, I don't think this is a clean solution, as only a side effect of closing a socket while being in a blocking I/O is made use of, which seems unreliable wrt. platform independence (?). I do like your solution (lean, but unfortunately wasn't quite obvious from the beginning), but it doesn't work on my end it seems.

@gumb0
Copy link
Member

gumb0 commented Aug 2, 2018

So we had in place a solution I hoped for, it just seems to not work on exotic systems.

I would prefer to not merge this (it complicates things quite a bit) and resolving this by you @christianparpart submitting a bug report to Microsoft (is this even a bug?)

Otherwise needs some work on better logging / better error reporting / better code formatting.

@gumb0
Copy link
Member

gumb0 commented Aug 2, 2018

@chfast this PR somehow is still to develop, not to master
nevermind, changed it

@gumb0 gumb0 changed the base branch from develop to master August 2, 2018 17:18
@christianparpart
Copy link
Member Author

I'd adapt the code to conform to coding style and "codecov" once you say I can proceed (:+1:).

Copy link
Member

@gumb0 gumb0 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also please apply clang-format to your changes as described in https://github.com/ethereum/cpp-ethereum/blob/master/CONTRIBUTING.md

@@ -54,16 +57,45 @@ fs::path getIpcPathOrDataDir()
return getDataDir();
return path;
}

/**
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We use only // for multiline comments

* @retval true One file descriptor became available.
* @retval false Most likely an error occured, or a timeout.
*/
static bool waitForReadable(std::initializer_list<int> sockets)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't need to be declared static as it's in the unnamed namespace

@@ -92,6 +124,10 @@ bool UnixDomainSocketServer::StartListening()

bool UnixDomainSocketServer::StopListening()
{
int dummy = 0x90;
if (::write(m_exitChannel[1], &dummy, sizeof(dummy)) < 0)
cerr << "Failed to notify UNIX domain server loop. " << strerror(errno) << "\n";
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Better output it to clog(VerbosityWarn, "rpc"), It will not need \n in the end.

{
if (pipe(m_exitChannel) < 0)
abort(); // I don't like. Replace me with throw.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Throw exception here.
Declare it in IpcServerBase.h as

DEV_SIMPLE_EXCEPTION(FailedToCreatePipe);

Then throw here like

    BOOST_THROW_EXCEPTION(FailedToCreatePipe());

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants