Bug? Same message is received over and over #1967

Open · AqlaSolutions opened this issue Apr 6, 2023 · 15 comments

@AqlaSolutions
Contributor

We have a strange issue where, under a load of over 200K RPS, one of the grain instances starts receiving the same request over and over. I'm sure we don't send this request multiple times, and of course the responses to these repeated requests never reach the original sender. We run everything on the same machine, without remoting. The issue usually appears after a few minutes of load when I use the Release configuration and run without debugging (though I attach a debugger later, after the issue reproduces).

Can you suggest where to start debugging it?

    <PackageReference Include="Proto.Actor" Version="1.1.0" />
    <PackageReference Include="Proto.Cluster" Version="1.1.0" />
    <PackageReference Include="Proto.Cluster.CodeGen" Version="1.1.0" />
    <PackageReference Include="Proto.Cluster.Consul" Version="1.1.0" />
    <PackageReference Include="Proto.Cluster.Dashboard" Version="1.1.0" />
    <PackageReference Include="Proto.Cluster.Kubernetes" Version="1.1.0" />
    <PackageReference Include="Proto.Cluster.TestProvider" Version="1.1.0" />
    <PackageReference Include="Proto.OpenTelemetry" Version="1.1.0" />
    <PackageReference Include="Proto.OpenTracing" Version="0.27.0" />
    <PackageReference Include="Proto.Persistence" Version="1.1.0" />
    <PackageReference Include="Proto.Remote" Version="1.1.0" />
@AqlaSolutions
Contributor Author

AqlaSolutions commented Apr 13, 2023

I debugged it. The workaround is to set actorSystemConfig.SharedFutures to false. The request is re-sent over and over in DefaultClusterContext.RequestAsync because its future.Task is already cancelled even before the next attempt to call context.Request. That causes a TaskCanceledException, which is handled by retrying the send operation.
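
For reference, a minimal sketch of applying this workaround when building the actor system. It assumes SharedFutures is settable on ActorSystemConfig as described above; the exact configuration API may differ between Proto.Actor versions.

    // Sketch of the workaround, assuming SharedFutures is directly settable.
    // If the property is init-only in your version, use a `with` expression or the
    // corresponding With... helper instead.
    var actorSystemConfig = ActorSystemConfig.Setup();
    actorSystemConfig.SharedFutures = false;          // fall back to one PID per future
    var system = new ActorSystem(actorSystemConfig);  // then .WithRemote(...)/.WithCluster(...) as usual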

@rogeralsing
Contributor

This sounds like a serious bug.
Have you been able to reproduce it under any conditions other than high load?

@AqlaSolutions
Contributor Author

AqlaSolutions commented Apr 14, 2023

My colleague reported something similar during debugging, but that could have had different causes, such as timeouts due to debugger pauses. I don't know of any other way to reproduce it without high load.

The setup is pretty simple: 3 virtual actor types, but only one is actually used. That actor has 3000 instances, and requests are spread among them. The actor handles two types of requests: place order and cancel order. The benchmark sends up to 256 requests simultaneously without waiting for the results (limited by a SemaphoreSlim). The bug occurs only when I run without a debugger attached, and only after some time (a minute is usually enough). Each request has a unique id. I added a Debugger.Launch call at the place where a duplicate order id gets detected. By the time I attach the debugger and step in, the counter in DefaultClusterContext.RequestAsync for this specific request has gone above 600,000.

My two colleagues were able to reproduce it using the same code.
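
For illustration, here is a rough sketch of the send loop described above. The message type, cluster kind, and identity names are hypothetical placeholders (not the actual benchmark code), and it assumes the usual Proto.Cluster usings plus an already-started ActorSystem named system.

    // Hypothetical sketch: up to 256 in-flight cluster requests gated by a SemaphoreSlim,
    // spread over 3000 virtual actor identities, each request carrying a unique order id.
    var gate = new SemaphoreSlim(256);
    var cluster = system.Cluster();
    var rnd = new Random();

    while (true)
    {
        await gate.WaitAsync();
        var identity = $"order-{rnd.Next(3000)}";      // spread requests over the 3000 instances
        var orderId = Guid.NewGuid().ToString();       // unique id per request
        _ = Task.Run(async () =>
        {
            try
            {
                await cluster.RequestAsync<object>(
                    identity, "orderActor", new PlaceOrder(orderId), CancellationToken.None);
            }
            finally
            {
                gate.Release();
            }
        });
    }

    record PlaceOrder(string OrderId);                 // placeholder request message type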

@rogeralsing
Contributor

rogeralsing commented Apr 16, 2023

Many thanks for the report.
Using actorSystemConfig.SharedFutures = false is safe (and was the original behavior), so you can keep using that for now.
It's just a bit slower due to the allocation/deallocation of PIDs per Future.

I'm still trying to figure out what causes this, and whether it is inside the shared-future implementation itself.
I'll report my findings here.

@DenizPiri
Contributor

I am not sure how much this relates to the issue here, but we get these exceptions every now and then on our production servers:

System.IndexOutOfRangeException: Index was outside the bounds of the array.
   at Proto.Future.SharedFutureProcess.Cancel(UInt32 requestId)
   at Proto.SenderContextExtensions.RequestAsync[T](ISenderContext self, PID target, Object message, CancellationToken cancellationToken)
   at Proto.SenderContextExtensions.RequestAsync[T](ISenderContext self, PID target, Object message, TimeSpan timeout)
   ....
System.IndexOutOfRangeException: Index was outside the bounds of the array.
   at Proto.Future.SharedFutureProcess.SendUserMessage(PID pid, Object message)
   ....

@rogeralsing I hope the callstacks help in some way.

We have been ignoring it as it didn't result in messages being processed again and again for us.
I guess we will also disable shared futures and deploy that way.

One thing to note is that we get these exceptions even though our per-actor throughput doesn't exceed 200 msgs/sec.

@rogeralsing
Contributor

cc @mhelleborg

@mhelleborg
Member

@AqlaSolutions Could you share a reproducing example? Trying to reproduce with just high load on shared futures has come up empty here, no issues.

@AqlaSolutions
Contributor Author

@mhelleborg I'm not sure, because of the NDA, you know... It will also require some time to prepare and minimize. I'll let you know if I get permission.

@mhelleborg
Member

mhelleborg commented Apr 19, 2023 via email

@volkanalkilic

volkanalkilic commented Apr 25, 2023

I am having a similar problem. The OnReceive method keeps getting the same message over and over again if I send the message as follows:

ActorSystemHelper.fSystem.Cluster().RequestAsync<object>(request.DeviceID, "FieldbusWorker_" + device.Gateway.Oid, "STOP", CancellationToken.None);

However, if I send it with the MethodIndex, the problem does not occur:

ActorSystemHelper.fSystem.Cluster().RequestAsync<object>(request.DeviceID, "FieldbusWorker_" + device.Gateway.Oid, new GrainRequestMessage(2,null), CancellationToken.None);

@Pushcin

Pushcin commented May 16, 2023

The source code to reproduce the problem described by AqlaSolutions is in the attachment:
shared_futures_repro_2.zip

@AqlaSolutions
Contributor Author

AqlaSolutions commented May 16, 2023

Just run the benchmarks/PrototypeBenchmark project from the BEP.sln solution, without a debugger and in the Release configuration.

After a few minutes (it may require several runs) you will see the console message:

Order 12345 for market abc_def was already processed

@rogeralsing
Contributor

I'm running the example right now, and the first thing that comes to mind is that you are probably queueing up a lot of fire-and-forget tasks on the threadpool:

.5987 RPS, 99% latency 17,61 ms, 95% latency 9,39 ms, max latency 167,61 ms
...60692 RPS, 99% latency 15,4 ms, 95% latency 6,51 ms, max latency 610,77 ms
...44698 RPS, 99% latency 20,9 ms, 95% latency 9,33 ms, max latency 745,69 ms
..35911 RPS, 99% latency 28,62 ms, 95% latency 11,54 ms, max latency 725,73 ms
.27488 RPS, 99% latency 33,26 ms, 95% latency 15,39 ms, max latency 999,47 ms
..31520 RPS, 99% latency 22,41 ms, 95% latency 11,55 ms, max latency 975,2 ms
.19651 RPS, 99% latency 39,24 ms, 95% latency 20,35 ms, max latency 1050,25 ms
.19856 RPS, 99% latency 39,76 ms, 95% latency 17,88 ms, max latency 1366,85 ms

The increasing latency might mean that the threadpool is busy with other tasks, e.g.:

omsGrain.ProccedExecutionReport(omsRequest, CancellationToken.None).AndForget(TaskOption.Safe);

Eventually, the entire threadpool queue might be filled with this kind of task.
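
One way to check that (assuming .NET Core 3.0 or later for the ThreadPool counters) would be to log threadpool pressure while the benchmark runs; this snippet is a hedged sketch, not part of the repro:

    // Log threadpool pressure once per second; a steadily growing PendingWorkItemCount
    // would support the theory that fire-and-forget work is piling up faster than it drains.
    _ = Task.Run(async () =>
    {
        while (true)
        {
            Console.WriteLine(
                $"threadpool threads={ThreadPool.ThreadCount}, pending={ThreadPool.PendingWorkItemCount}");
            await Task.Delay(TimeSpan.FromSeconds(1));
        }
    });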

I'll dig deeper later today, but the increasing latency is very suspicious.

@AqlaSolutions
Contributor Author

@rogeralsing This issue's repro doesn't have a ProccedExecutionReport method. Maybe you meant to post that in #1977?

@rogeralsing
Contributor

yes 👍🏻
