Improve error messages when a cluster fails to boot #887

Open
jroper opened this issue Apr 14, 2021 · 2 comments

jroper commented Apr 14, 2021

This feature request has quite a wide scope, and fixing it will be a game of whack-a-mole with each different error condition. But one significant problem I've found with cluster bootstrap is that when it fails, interpreting the error messages you get requires an in-depth understanding of Akka cluster bootstrap in order to map them to the actual cause.

So for example, here's a node that's failing to join a cluster:

2021-04-13 23:07:11.778 WARN akka.management.cluster.bootstrap.internal.HttpContactPointBootstrap - Probing [http://10-8-3-65.3aceb7b3-4e74-4e0e-aa23-1bb50408b733.pod.cluster.local:8558/bootstrap/seed-nodes] failed due to: Exceeded configured max-open-requests value of [32]. This means that the request queue of this pool (HostConnectionPoolSetup(10-8-3-65.3aceb7b3-4e74-4e0e-aa23-1bb50408b733.pod.cluster.local,8558,ConnectionPoolSetup(ConnectionPoolSettings(4,0,0,32,1,Duration.Inf,100 milliseconds,2 minutes,30 seconds,ClientConnectionSettings(Some(User-Agent: akka-http/10.1.12),10 seconds,60 seconds,512,None,WebSocketSettings(<function0>,ping,Duration.Inf,akka.http.impl.settings.WebSocketSettingsImpl$$$Lambda$915/0x0000000100645840@72127121),List(),ParserSettings(2048,16,64,64,8192,64,8388608,8388608,256,1048576,Strict,RFC6265,true,Set(),Full,Error,HashMap(If-Range -> 0, If-Modified-Since -> 0, If-Unmodified-Since -> 0, default -> 12, If-None-Match -> 0, User-Agent -> 32, Content-MD5 -> 0, Date -> 0, If-Match -> 0),false,true,akka.util.ConstantFun$$$Lambda$775/0x0000000100577840@b4e36a4,akka.util.ConstantFun$$$Lambda$775/0x0000000100577840@b4e36a4,akka.util.ConstantFun$$$Lambda$776/0x0000000100578040@cf1346d),100 milliseconds,None,TCPTransport),New,1 second),akka.http.scaladsl.HttpConnectionContext$@5d96a23c,akka.event.MarkerLoggingAdapter@6348ef3))) has completely filled up because the pool currently does not process requests fast enough to handle the incoming request load. Please retry the request later. See http://doc.akka.io/docs/akka-http/current/scala/http/client-side/pool-overflow.html for more information.
2021-04-13 23:07:11.848 INFO akka.management.cluster.bootstrap.internal.BootstrapCoordinator - Looking up [Lookup(samples-js-shoppingcart,None,Some(tcp))]
2021-04-13 23:07:11.848 INFO akka.discovery.kubernetes.KubernetesApiServiceDiscovery - Querying for pods with label selector: [cloudstate.io/stateful-service=samples-js-shoppingcart]. Namespace: [3aceb7b3-4e74-4e0e-aa23-1bb50408b733]. Port: [None]
2021-04-13 23:07:11.860 INFO akka.management.cluster.bootstrap.internal.BootstrapCoordinator - Located service members based on: [Lookup(samples-js-shoppingcart,None,Some(tcp))]: [ResolvedTarget(10-8-9-3.3aceb7b3-4e74-4e0e-aa23-1bb50408b733.pod.cluster.local,None,Some(/10.8.9.3)), ResolvedTarget(10-8-3-65.3aceb7b3-4e74-4e0e-aa23-1bb50408b733.pod.cluster.local,None,Some(/10.8.3.65))], filtered to [10-8-3-65.3aceb7b3-4e74-4e0e-aa23-1bb50408b733.pod.cluster.local:0, 10-8-9-3.3aceb7b3-4e74-4e0e-aa23-1bb50408b733.pod.cluster.local:0]
2021-04-13 23:07:11.970 INFO akka.management.cluster.bootstrap.contactpoint.HttpClusterBootstrapRoutes - Bootstrap request from 10.8.9.3:53574: Contact Point returning 0 seed-nodes []
2021-04-13 23:07:11.972 INFO akka.management.cluster.bootstrap.internal.BootstrapCoordinator - Contact point [akka://cloudstate-proxy@10.8.9.3:2552] returned [0] seed-nodes []
2021-04-13 23:07:12.748 INFO akka.management.cluster.bootstrap.LowestAddressJoinDecider - Exceeded stable margins but missing seed node information from some contact points [10-8-3-65.3aceb7b3-4e74-4e0e-aa23-1bb50408b733.pod.cluster.local:0] (out of [10-8-3-65.3aceb7b3-4e74-4e0e-aa23-1bb50408b733.pod.cluster.local:0, 10-8-9-3.3aceb7b3-4e74-4e0e-aa23-1bb50408b733.pod.cluster.local:0])

The above logs are output over and over again. The log message that tells me the most is this one, which I won't put in a code block because the line is so long it would be hard to read:

2021-04-13 23:07:11.778 WARN akka.management.cluster.bootstrap.internal.HttpContactPointBootstrap - Probing [http://10-8-3-65.3aceb7b3-4e74-4e0e-aa23-1bb50408b733.pod.cluster.local:8558/bootstrap/seed-nodes] failed due to: Exceeded configured max-open-requests value of [32]. This means that the request queue of this pool (HostConnectionPoolSetup(10-8-3-65.3aceb7b3-4e74-4e0e-aa23-1bb50408b733.pod.cluster.local,8558,ConnectionPoolSetup(ConnectionPoolSettings(4,0,0,32,1,Duration.Inf,100 milliseconds,2 minutes,30 seconds,ClientConnectionSettings(Some(User-Agent: akka-http/10.1.12),10 seconds,60 seconds,512,None,WebSocketSettings(,ping,Duration.Inf,akka.http.impl.settings.WebSocketSettingsImpl$$$Lambda$915/0x0000000100645840@72127121),List(),ParserSettings(2048,16,64,64,8192,64,8388608,8388608,256,1048576,Strict,RFC6265,true,Set(),Full,Error,HashMap(If-Range -> 0, If-Modified-Since -> 0, If-Unmodified-Since -> 0, default -> 12, If-None-Match -> 0, User-Agent -> 32, Content-MD5 -> 0, Date -> 0, If-Match -> 0),false,true,akka.util.ConstantFun$$$Lambda$775/0x0000000100577840@b4e36a4,akka.util.ConstantFun$$$Lambda$775/0x0000000100577840@b4e36a4,akka.util.ConstantFun$$$Lambda$776/0x0000000100578040@cf1346d),100 milliseconds,None,TCPTransport),New,1 second),akka.http.scaladsl.HttpConnectionContext$@5d96a23c,akka.event.MarkerLoggingAdapter@6348ef3))) has completely filled up because the pool currently does not process requests fast enough to handle the incoming request load. Please retry the request later. See http://doc.akka.io/docs/akka-http/current/scala/http/client-side/pool-overflow.html for more information.

This is bad in its own right: the toString of HostConnectionPoolSetup is used, which makes the log message far too long. But it's also very misleading. The message makes it sound like the thing being communicated with is too slow. In actual fact, the problem in this case was that firewall rules were blocking all communication; it wasn't too slow, the network was blocking it.

Now, of course, if I look earlier in the logs, I see a different error:

2021-04-13 23:01:01.760 WARN akka.management.cluster.bootstrap.internal.HttpContactPointBootstrap - Probing [http://10-8-3-65.3aceb7b3-4e74-4e0e-aa23-1bb50408b733.pod.cluster.local:8558/bootstrap/seed-nodes] failed due to: Probing timeout of [http://10-8-3-65.3aceb7b3-4e74-4e0e-aa23-1bb50408b733.pod.cluster.local:8558]

This is better, but I think we can still do better. The error message doesn't explain why we are probing that node, nor what the consequence of the timeout is; to understand it, you still need an in-depth understanding of cluster bootstrap. There's also this log message:

2021-04-13 23:07:12.748 INFO akka.management.cluster.bootstrap.LowestAddressJoinDecider - Exceeded stable margins but missing seed node information from some contact points [10-8-3-65.3aceb7b3-4e74-4e0e-aa23-1bb50408b733.pod.cluster.local:0] (out of [10-8-3-65.3aceb7b3-4e74-4e0e-aa23-1bb50408b733.pod.cluster.local:0, 10-8-9-3.3aceb7b3-4e74-4e0e-aa23-1bb50408b733.pod.cluster.local:0])

This once again requires an in-depth understanding of cluster bootstrap to interpret. What I think would be useful is, after some additional margin, to start outputting something much more informative, ideally multiple lines long. Imagine something like this:

This node has been unable to join or bootstrap an Akka cluster. In discovery, it has found the following nodes to form or join a cluster with (including itself):
* 10-8-3-65.3aceb7b3-4e74-4e0e-aa23-1bb50408b733.pod.cluster.local
* 10-8-9-3.3aceb7b3-4e74-4e0e-aa23-1bb50408b733.pod.cluster.local
However, none of them have reported an existing cluster to join, and at least one of them is failing to be probed, with the following errors:
* 10-8-3-65.3aceb7b3-4e74-4e0e-aa23-1bb50408b733.pod.cluster.local - Probing timeout of [http://10-8-3-65.3aceb7b3-4e74-4e0e-aa23-1bb50408b733.pod.cluster.local:8558]
Starting a new cluster is therefore not safe. To debug, ensure that that node is running and configured correctly, and ensure that network communication between these two nodes is working. Alternatively, remove that node so it is no longer returned by the discovery call.
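A hedged sketch of how such a diagnostic might be assembled from data bootstrap already has (the discovered contact points and the last probe failure per node). All names here (`BootstrapDiagnostics`, `formatBootstrapFailure`, `ProbeFailure`) are hypothetical, not part of Akka Management:

```scala
// Illustrative sketch: build the proposed multi-line diagnostic from the
// nodes returned by discovery and the most recent probe failures.
object BootstrapDiagnostics {
  final case class ProbeFailure(host: String, reason: String)

  def formatBootstrapFailure(discovered: Seq[String], failures: Seq[ProbeFailure]): String = {
    val sb = new StringBuilder
    sb ++= "This node has been unable to join or bootstrap an Akka cluster. "
    sb ++= "In discovery, it has found the following nodes to form or join a cluster with (including itself):\n"
    discovered.foreach(host => sb ++= s"* $host\n")
    sb ++= "However, none of them have reported an existing cluster to join, "
    sb ++= "and at least one of them is failing to be probed, with the following errors:\n"
    failures.foreach(f => sb ++= s"* ${f.host} - ${f.reason}\n")
    sb ++= "Starting a new cluster is therefore not safe. To debug, ensure that node is running "
    sb ++= "and configured correctly, and that network communication between these nodes is working."
    sb.result()
  }
}
```

The point of the sketch is that everything needed for the friendly message is already in memory; it is purely a formatting and scheduling concern.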

This would make it far easier for someone with little to no knowledge of Akka cluster bootstrap to debug the issue. If such an error message were output after a configurable amount of time, the same configuration item could be used to turn it off altogether for people who don't want it.

@patriknw (Member)
Sounds good to improve this.

@jrudolph (Member)
The scenario here probably is this:

  • Bootstrap sends probes to other nodes, but all connection attempts fail
  • The client connection pool backs off connection attempts, by default for up to two minutes
  • The probing rate ends up much higher than the rate at which failing requests are retired (and there's no way to cancel outgoing requests in akka-http yet)
  • Ultimately, the requests fill the buffer in front of the pool, so that new probing attempts now fail fast with this overload message
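The dynamics described above can be illustrated with a toy simulation (this is not Akka HTTP code; `PoolOverflowSketch` and all its parameters are made up for illustration). Probes are enqueued faster than the backing-off pool retires them, so a buffer bounded at `max-open-requests` eventually overflows and every further probe fails fast:

```scala
// Toy model: a bounded buffer fed faster than it drains overflows, which is
// when the "Exceeded configured max-open-requests" message starts appearing.
object PoolOverflowSketch {
  // Returns (requests still queued, requests rejected with the overflow error).
  def simulate(maxOpenRequests: Int, probesPerMinute: Int,
               retiredPerMinute: Int, minutes: Int): (Int, Int) = {
    var queued = 0
    var rejected = 0
    for (_ <- 1 to minutes) {
      queued += probesPerMinute                      // new probing requests enqueued
      queued -= math.min(queued, retiredPerMinute)   // backing-off pool retires very few
      if (queued > maxOpenRequests) {                // buffer full: fail fast
        rejected += queued - maxOpenRequests
        queued = maxOpenRequests
      }
    }
    (queued, rejected)
  }
}
```

With, say, 12 probes a minute against a pool that retires one failing request a minute, a buffer of 32 fills within a few minutes and stays full, matching the repeated overflow warnings in the logs above.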

There are several possible fixes:

  • Don't use a connection pool for what should mostly be one-off connection attempts during bootstrap. That way, every request gets its own connection timeout.
    • Also make sure that connection timeouts and probing intervals are reasonable (otherwise there's the risk of an eventual memory leak, similar to the current situation)
  • Implement request timeouts / cancelling in akka-http and make use of it here
  • Reconsider failure behavior in the pool: should we drain all buffered requests when a connection attempt for a pool slot fails? For how long?
