Performance doesn't scale with more cores #1977

Open · AqlaSolutions opened this issue Apr 27, 2023 · 12 comments

@AqlaSolutions
Contributor

We've run several benchmarks on 8, 16, 32, and 48 AWS cores and got these results:

8 -> 16 cores: almost +100% RPS
16 -> 32 cores: +20% RPS
32 -> 48 cores: +1% RPS

Take a look at these results.

16 cores:
[benchmark screenshots]

32 cores:
[benchmark screenshots]

48 cores:
[benchmark screenshots]

The benchmark runs up to 256 parallel requests.

The profiler shows that most of the work is done by the thread pool's WorkerThreadStart method, inside the loop where it waits for work items and calls Semaphore.Wait.
[profiler screenshot]
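
For context, a minimal way to sample the pool's state while the benchmark runs (a sketch using the standard .NET thread-pool counters, not code from our benchmark):

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

static class PoolMonitor
{
    // Print thread-pool stats once per second until cancelled, to see whether
    // workers are starved (large queue) or simply idle in Semaphore.Wait.
    public static async Task RunAsync(CancellationToken ct)
    {
        while (!ct.IsCancellationRequested)
        {
            Console.WriteLine(
                $"pool threads={ThreadPool.ThreadCount} " +
                $"queued items={ThreadPool.PendingWorkItemCount} " +
                $"completed items={ThreadPool.CompletedWorkItemCount} " +
                $"cores={Environment.ProcessorCount}");
            await Task.Delay(TimeSpan.FromSeconds(1), ct);
        }
    }
}
```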

We tried different configurations with 1-2 clients and 1-2 servers, varied the parallel request count, and changed the dispatcher throughput, but nothing gave any significant improvement.

What could be causing this?

We'll try to prepare a reproducible example if you are willing to investigate this.

@rogeralsing
Contributor

Yes, we need some code examples of what you are doing here.
I see in the benchmark that you are using Proto.Remote; it could very well be that you are maxing out your network.
If the network can only push x messages per second, you are not going to benefit from more cores.

But please do post some example of what you are actually doing here, as it is only guesswork otherwise.

@AqlaSolutions
Contributor Author

We see a similar issue even without Remote.

@rogeralsing
Copy link
Contributor

OK, then we need a code example to reproduce this.

@AqlaSolutions
Contributor Author

Sorry, we still plan to provide an example. It takes some time to get approval from the company.

@Pushcin

Pushcin commented May 16, 2023

The source code to reproduce the problem is in the attachment:
performance-repro.zip

@AqlaSolutions
Contributor Author

AqlaSolutions commented May 16, 2023

To run from the IDE:

  1. Open the solution BEP.sln
  2. Run docker-start-dev.cmd
  3. Run the project benchmarks\PrototypeBenchmark

To run remotely:

  1. Install docker on the remote machine and Docker Desktop on the local one
  2. ssh -L 2378:127.0.0.1:2375 ubuntu@example.com
  3. In another session (git bash):
     export DOCKER_HOST=tcp://127.0.0.1:2378
     ./docker-start-staging.cmd

@rogeralsing please reopen

@rogeralsing
Contributor

I'm running the example right now, and the first thing that comes to mind is that you are probably queueing up a lot of fire-and-forget tasks on the thread pool.

.5987 RPS, 99% latency 17,61 ms, 95% latency 9,39 ms, max latency 167,61 ms
...60692 RPS, 99% latency 15,4 ms, 95% latency 6,51 ms, max latency 610,77 ms
...44698 RPS, 99% latency 20,9 ms, 95% latency 9,33 ms, max latency 745,69 ms
..35911 RPS, 99% latency 28,62 ms, 95% latency 11,54 ms, max latency 725,73 ms
.27488 RPS, 99% latency 33,26 ms, 95% latency 15,39 ms, max latency 999,47 ms
..31520 RPS, 99% latency 22,41 ms, 95% latency 11,55 ms, max latency 975,2 ms
.19651 RPS, 99% latency 39,24 ms, 95% latency 20,35 ms, max latency 1050,25 ms
.19856 RPS, 99% latency 39,76 ms, 95% latency 17,88 ms, max latency 1366,85 ms

The increasing latency might be because the thread pool is busy with other tasks,
e.g.

omsGrain.ProccedExecutionReport(omsRequest, CancellationToken.None).AndForget(TaskOption.Safe);

Eventually, the entire thread pool queue might be filled with these kinds of tasks.

I'll dig deeper later today, but the increasing latency is very suspicious.
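
If that turns out to be the case, one generic way to keep such fire-and-forget work from flooding the pool is to gate it behind a SemaphoreSlim. This is only a sketch with made-up names (BoundedFireAndForget, Post), not code from the repro or from Proto.Actor:

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

static class BoundedFireAndForget
{
    // Allow at most 64 forgotten work items to run concurrently.
    private static readonly SemaphoreSlim Slots = new SemaphoreSlim(64, 64);

    public static void Post(Func<Task> work) => _ = RunAsync(work);

    private static async Task RunAsync(Func<Task> work)
    {
        await Slots.WaitAsync();   // excess work waits here instead of piling up as running tasks
        try { await work(); }
        catch (Exception e) { Console.Error.WriteLine(e); }   // don't let a forgotten task fail silently
        finally { Slots.Release(); }
    }
}
```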

@AqlaSolutions
Contributor Author

AqlaSolutions commented May 18, 2023

That's pretty high latency. Are you running in the Debug configuration or with a debugger attached? I wouldn't rely on these latency numbers. As you can see in the screenshots above, with all optimizations and 16+ cores our latency doesn't vary that much.

@AqlaSolutions
Contributor Author

In this repro no additional executions are added to the list in ObActor.ExecuteOrder, so OmsActor shouldn't make any fire-and-forget calls, because the single returned ExecutionReport belongs to that OmsActor instance. So I'm surprised that you see such calls. In our real app, though, those calls are present.

@rogeralsing
Contributor

There seems to be a lot of locking going on in this example.
I saw that there is some use of SemaphoreSlim and .Wait(), but I haven't analyzed the impact of that specifically.
Looking at the profiler results, though, something in this example is explicitly blocking threads in the thread pool.

[two profiler screenshots]
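
For the record, the distinction that matters on the pool is synchronous Wait() versus WaitAsync(). A generic illustration (GateExample is a made-up name, not code from the repro):

```csharp
using System.Threading;
using System.Threading.Tasks;

class GateExample
{
    private readonly SemaphoreSlim _gate = new SemaphoreSlim(1, 1);

    // Blocks a thread-pool thread for the whole duration of the wait.
    public void DoWorkBlocking()
    {
        _gate.Wait();
        try { /* critical section */ }
        finally { _gate.Release(); }
    }

    // Returns the thread to the pool while waiting for the gate.
    public async Task DoWorkAsync()
    {
        await _gate.WaitAsync();
        try { /* critical section */ }
        finally { _gate.Release(); }
    }
}
```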

@AqlaSolutions
Contributor Author

AqlaSolutions commented May 21, 2023

@rogeralsing, we use a Semaphore to limit the number of concurrent requests. The wait is expected to take as long as the system needs to process a request and free its "slot". That's not a problem at all: it's only one thread, and it doesn't even belong to the thread pool.

We have already profiled this, and I have already seen what's in those screenshots. For example, WorkerThreadStart is not a new thread starting up but a loop that picks up tasks from the thread pool queue. By the way, this method also uses Semaphore.
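
Roughly, the throttling pattern looks like the sketch below; the names (LoadGenerator, sendRequest) are illustrative, not the actual benchmark code:

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

class LoadGenerator
{
    // Caps in-flight requests at 256; the benchmark runs up to 256 parallel requests.
    private readonly Semaphore _inFlight = new Semaphore(256, 256);

    public void Run(Func<Task> sendRequest, CancellationToken ct)
    {
        var thread = new Thread(() =>
        {
            while (!ct.IsCancellationRequested)
            {
                _inFlight.WaitOne();   // blocks only this dedicated thread, not the pool
                _ = sendRequest().ContinueWith(_ => _inFlight.Release());
            }
        })
        { IsBackground = true };
        thread.Start();
    }
}
```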

@AqlaSolutions
Contributor Author

AqlaSolutions commented May 21, 2023

@rogeralsing my guess is that too much GC is going on in generations 0 and 1. The garbage is produced by tasks and async state machines. Unlike gen 2 collections, these are always stop-the-world. It looks like at some point of vertical scaling the GC time grows faster than the additional load that can be processed, so we see no improvement from adding cores.
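
One simple way to check this guess (a sketch using standard GC APIs, nothing from the repro) is to diff collection counts and allocated bytes across a benchmark window; dotnet-counters with the System.Runtime counter set shows similar numbers without code changes:

```csharp
using System;

static class GcSnapshot
{
    // Runs the benchmark delegate and reports how much GC activity it caused.
    public static void Measure(Action runBenchmark)
    {
        int g0 = GC.CollectionCount(0), g1 = GC.CollectionCount(1), g2 = GC.CollectionCount(2);
        long alloc = GC.GetTotalAllocatedBytes();

        runBenchmark();

        Console.WriteLine(
            $"gen0={GC.CollectionCount(0) - g0} " +
            $"gen1={GC.CollectionCount(1) - g1} " +
            $"gen2={GC.CollectionCount(2) - g2} " +
            $"allocatedMB={(GC.GetTotalAllocatedBytes() - alloc) / (1024 * 1024)}");
    }
}
```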
