Strange spikes in latency with toxy #38
Comments
It should. I noticed that in some scenarios, when replaying requests with a payload under high concurrency (>50 rps), there can be some performance issues or errors.
No. There's no poison enabled by default. Just a couple of questions that can help me:
Thanks! I'll run some benchmark tests based on your scenario and let you know the conclusions.
I've just added some benchmark suites. Those tests are not based on your scenario, so I added one specific test that is closer to yours: forwarding a 2 KB payload, without poisoning, at a concurrency of 60 rps to a remote HTTPS server.
I also ran the same suite without TLS transport. Results:
Here are some conclusions:
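For reference, a load roughly matching the scenario described above can be generated with a small script like the sketch below. This is not the benchmark suite mentioned in this comment; the target host, port, path, and payload size are assumptions.

```js
// Hypothetical load generator: ~60 POST requests per second with a ~2 KB body
// against a proxy assumed to listen on http://localhost:3000 (placeholders).
const http = require('http');

const payload = Buffer.alloc(2048, 'x'); // ~2 KB dummy payload

setInterval(() => {
  const start = Date.now();
  const req = http.request(
    { host: 'localhost', port: 3000, method: 'POST', path: '/' },
    (res) => {
      res.resume(); // drain the response body
      res.on('end', () => console.log('latency ms:', Date.now() - start));
    }
  );
  req.on('error', (err) => console.error('request error:', err.message));
  req.end(payload);
}, 1000 / 60);
```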
Thanks for digging into this @h2non! I reran our simulation at 40 RPS and we only had two 20-second spikes, as opposed to the five or six we normally get at 60 RPS over the same period. It's strange that every spike is consistently about 20 seconds; even in your benchmark there are roughly 20-second spikes. I wonder what could be causing such consistent latency? Is there any information I could provide that would help you debug this?
I definitely have to dig into this; however, note that the benchmark tests forwarding to the loopback interface show no performance degradation of that kind. When I have time I'll work on multiple testing scenarios to discover where the potential bottleneck is. I'll let you know.
@h2non wondering if you've had time to run some different scenarios. We are picking this work back up and would like to use toxy if possible, as it's the most robust solution we've seen for our use case.
Hi @shorea. I haven't forgotten about this, but it's not simple to mitigate the real problem here. I'll let you know once I have a diagnosis.
Hey @h2non, I work with @shorea and wanted to provide an update on this issue. In short, we were able to alleviate the problem by capping the maximum number of sockets to 500 (via https.globalAgent.maxSockets), giving Node more memory to work with (--max-old-space-size=8192), and turning off the idle garbage collector (--nouse-idle-notification). I don't have much experience with Node, but I have seen similar issues in Java applications that ended up being heap/GC related. FWIW, we have been able to run 30-minute slices of traffic without latency spikes using these settings, but in more extended runs of 2+ hours we ultimately do hit a large spike. I'm guessing this is somewhat expected behavior given these settings: we're effectively holding off on smaller GCs while giving Node more memory, which means that once we do exhaust the available memory we hit a larger GC sweep. Regardless, this seems to have unblocked us for the time being.
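For anyone hitting the same issue, a minimal sketch of the workaround described above might look like this; the entry-point filename is a placeholder, and the values are the ones quoted in the comment:

```js
// Start the process with a larger old-space heap and the idle GC notification
// disabled (values taken from the comment above):
//
//   node --max-old-space-size=8192 --nouse-idle-notification proxy.js
//
const https = require('https');

// Cap the number of concurrently open outbound HTTPS sockets per origin.
https.globalAgent.maxSockets = 500;
```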
Glad to hear news about this. Honestly, I haven't had time to dig into this, so I appreciate your update. Initially I didn't think it was directly related to GC/memory issues, since the RSS memory was stable during my stress testing for more than 15 minutes. My opinion is that there's a clear memory leak somewhere; if not, you shouldn't be forced to increase the V8 heap limit. I would recommend you take a look at the following utilities: This kind of issue is hard to debug, but I would like to invest time in this soon since it's a challenging problem to solve.
I created a simple pass-through toxy script to just forward traffic, and I'm seeing large ~20-second spikes fairly regularly. I'm sending 60 requests per second through toxy. I have some client-side metrics enabled, and the majority of the time is spent receiving the response from toxy. Is there some default poison applied that could be causing this? Can toxy not handle that rate of requests (I assume it can, per some benchmarks in the rocky README)? When I run the same code without toxy (directly hitting the endpoint) I don't get any spikes. I'm hitting toxy over HTTP, and toxy is forwarding to an HTTPS endpoint.
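For context, a minimal pass-through setup of this kind, based on the toxy README, looks roughly like the following sketch; the upstream URL and listen port are placeholders, not the reporter's actual configuration:

```js
// Minimal pass-through proxy: no poisons or rules, just forward everything.
const toxy = require('toxy');

const proxy = toxy();

// Forward all incoming traffic to the upstream HTTPS endpoint (placeholder URL).
proxy.forward('https://upstream.example.com');

// Match every route and method, with no poisons attached.
proxy.all('/*');

proxy.listen(3000);
console.log('toxy listening on port 3000');
```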