
Strange spikes in latency with toxy #38

Open
shorea opened this issue Nov 21, 2015 · 10 comments

shorea commented Nov 21, 2015

I created a simple pass-through toxy script to just forward traffic, and I'm seeing large ~20 second latency spikes fairly regularly. I'm sending 60 requests per second through toxy. I have some client-side metrics enabled, and the majority of the time is spent receiving the response from toxy. Is there some default poison applied that could be causing this? Can toxy not handle that request rate (I assume it can, per the benchmarks in the rocky README)? When I run the same code without toxy (hitting the endpoint directly) I don't get any spikes. I'm hitting toxy over HTTP, and toxy forwards to an HTTPS endpoint.

var toxy = require('toxy')
var fs = require('fs')

var poisons = toxy.poisons
var rules = toxy.rules
var proxy = toxy()

// Configure and enable proxy
proxy.all("*").forward("https://dynamodb.us-west-2.amazonaws.com")
proxy.listen(6241)

h2non commented Nov 22, 2015

Can toxy not handle that rate of requests?

It should. I noticed that in some scenarios, replaying requests with a payload under high concurrency (>50 rps) can cause performance issues or errors.

Is there some default poison applied that could be causing this?

No. There's no poison enabled by default.
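For reference, a poison only takes effect once it's registered explicitly; enabling one would look roughly like the sketch below (based on the README, with the latency options made up for the example):

// Illustrative only: poisons are opt-in and must be registered explicitly.
// The latency options here are made up for the example.
var toxy = require('toxy')
var poisons = toxy.poisons

var proxy = toxy()

proxy.all('*').forward('https://dynamodb.us-west-2.amazonaws.com')
proxy.poison(poisons.latency({ jitter: 500 })) // inject up to ~500 ms of random delay

proxy.listen(6241)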

Just a couple of questions that would help me:

  • Are you using any request or response interceptors?
  • What poison are you using, if any?
  • What kind of traffic is toxy handling? (e.g. HTTP with large or small payloads)


shorea commented Nov 23, 2015

It should. I noticed that in some scenarios, replaying requests with a payload under high concurrency (>50 rps) can cause performance issues or errors.
I can try dropping the request rate down to 40 per second and see if that makes a difference.

  • Are you using any request or response interceptors?
    Nope.
  • What poison are you using, if any?
    None right now. We're trying to get a baseline before applying poisons, and that's where we noticed the spikes. The code snippet above is verbatim the script I'm running.
  • What kind of traffic is toxy handling? (e.g. HTTP with large or small payloads)
    I'm connecting to toxy via HTTP, and toxy itself is establishing an HTTPS connection to DynamoDB. The requests themselves are just PutItem requests with very small payloads and very small responses (<1 KB both ways).


h2non commented Nov 23, 2015

Thanks! I'll do some benchmark tests based on your scenario and let you know the conclusions.


h2non commented Nov 23, 2015

I've just added some benchmark suites.
They cover scenarios similar to the ones in rocky. Everything works fine, with even better performance than I personally expected.

Those tests aren't based on your scenario, though, so I ran one specific test that's closer to yours: forwarding a 2 KB payload, with no poisoning, at a concurrency of 60 rps to a remote HTTPS server.
Here are the results:

Requests      [total]                    600
Duration      [total, attack, wait]      34.676810741s, 9.982575757s, 24.694234984s
Latencies     [mean, 50, 95, 99, max]    7.351279208s, 7.039588307s, 15.731271949s, 24.86061736s, 24.86061736s
Bytes In      [total, mean]              3326400, 5544.00
Bytes Out     [total, mean]              1033800, 1723.00
Success       [ratio]                    100.00%
Status Codes  [code:count]               200:600

I also ran the same suite without TLS transport. Results:

# Running benchmark suite: forward+payload
Requests      [total]                    600
Duration      [total, attack, wait]      24.555032664s, 9.985355351s, 14.569677313s
Latencies     [mean, 50, 95, 99, max]    3.467563026s, 3.184269332s, 10.463356674s, 20.235857461s, 20.235857461s
Bytes In      [total, mean]              3325800, 5543.00
Bytes Out     [total, mean]              1033800, 1723.00
Success       [ratio]                    100.00%
Status Codes  [code:count]               200:600

Here are some conclusions:

  • The TLS handshake seems to be expensive in some cases and could be the main source of the slowdown, so communicating over plain HTTP will (obviously) be faster.
  • RTT delay / jitter plays a role (I'm in Europe and the server is located in the USA).
  • I'm using a wireless connection and there's some network congestion.
  • High concurrency (>50 rps) implies some performance bottlenecks, but not critical ones (I still need to dig into it).
  • RSS memory doesn't increase much (~50 MB) and is stable (no evident memory leaks).
  • CPU usage is not high (<20%).
  • As I wrote before, I can confirm that stressing the server for a couple of minutes at high concurrency (60 rps) causes some performance issues. I need to dig into this.
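
For context, a rough way to reproduce that load shape in plain Node is sketched below. This is not the actual benchmark suite: port 6241 comes from the script at the top of the issue, and the rate, payload size, and attack window simply mirror the numbers quoted in this comment.

// Sketch only: fixed-rate load against a local toxy instance.
// Port 6241 is taken from the script above; the 60 rps rate, ~2 KB payload
// and 10 s attack window mirror the numbers quoted in this comment.
var http = require('http')

var RATE = 60              // requests per second
var DURATION = 10000       // attack window in milliseconds
var payload = new Array(2048 + 1).join('x')  // ~2 KB body
var latencies = []
var errors = 0

function fire () {
  var start = Date.now()
  var req = http.request({
    host: 'localhost',
    port: 6241,
    method: 'POST',
    path: '/',
    headers: { 'Content-Length': Buffer.byteLength(payload) }
  }, function (res) {
    res.resume()
    res.on('end', function () { latencies.push(Date.now() - start) })
  })
  req.on('error', function () { errors++ })
  req.end(payload)
}

var timer = setInterval(fire, 1000 / RATE)

setTimeout(function () {
  clearInterval(timer)
  // Give in-flight requests time to drain before reporting.
  setTimeout(function () {
    latencies.sort(function (a, b) { return a - b })
    function pct (p) { return latencies[Math.min(latencies.length - 1, Math.floor(latencies.length * p))] }
    console.log('completed: %d  errors: %d', latencies.length, errors)
    console.log('p50: %dms  p95: %dms  p99: %dms  max: %dms',
      pct(0.5), pct(0.95), pct(0.99), latencies[latencies.length - 1])
  }, 30000)
}, DURATION)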


shorea commented Nov 23, 2015

Thanks for digging into this @h2non! I reran our simulation at 40 RPS and we only saw two 20-second spikes, as opposed to the five or six we normally get at 60 RPS over the same period. It's strange that every spike is roughly 20 seconds; even your benchmark shows spikes of approximately 20 seconds. I wonder what could be causing such consistent latency? Is there any information I could provide that would help you debug this?


h2non commented Nov 25, 2015

I definitely have to dig into this. However, note that the benchmark tests forwarding to the loopback interface show no performance degradation of that kind.

When I have time I'll set up multiple testing scenarios to find where the potential bottleneck is. I'll let you know.


shorea commented Jan 13, 2016

@h2non, wondering if you've had time to run some different scenarios. We're picking this work back up and would like to use toxy if possible, as it's the most robust solution we've seen for our use case.


h2non commented Jan 13, 2016

Hi @shorea.

I didn't forget about this, but it's not simple to pin down the real problem here.
Lately I've been working to push a new product to production and unfortunately don't have much time, but my availability will increase considerably in about two weeks.

I'll let you know the diagnosis then.

h2non added the bug label Jan 13, 2016

breedloj commented Feb 3, 2016

Hey @h2non

I work with @shorea and wanted to provide an update on this issue. In short, we were able to alleviate it by capping the maximum number of sockets at 500 (via https.globalAgent.maxSockets), giving Node more memory to work with (--max-old-space-size=8192), and turning off the idle garbage collector (--nouse-idle-notification). I don't have much experience with Node, but I have seen similar issues in Java applications that ended up being heap/GC related.

FWIW, we have been able to run 30-minute slices of traffic without latency spikes using these settings, but in more extended runs of 2+ hours we ultimately do hit a large spike. I'm guessing that's somewhat expected given these settings: we're effectively holding off the smaller GCs while giving Node more memory, so once we exhaust the available memory we hit a larger GC sweep. Regardless, this seems to have unblocked us for the time being.
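
For anyone wanting to try the same mitigation, a sketch of how those settings slot into the pass-through script from the top of the issue is below; the socket cap and Node flags are just the values quoted in this comment.

// Sketch: the original pass-through script with the mitigations described
// above applied. The values (500 sockets, 8 GB old space, idle GC notification
// disabled) are simply the ones quoted in this comment.
var https = require('https')
var toxy = require('toxy')

// Cap outbound HTTPS sockets so connections to the upstream are reused
// instead of growing without bound.
https.globalAgent.maxSockets = 500

var proxy = toxy()
proxy.all('*').forward('https://dynamodb.us-west-2.amazonaws.com')
proxy.listen(6241)

// Launched with:
//   node --max-old-space-size=8192 --nouse-idle-notification proxy.js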


h2non commented Feb 3, 2016

Glad to hear news about this. Honestly, I haven't had time to dig into it, so I appreciate your update. Initially I didn't think it was directly related to GC/memory issues, since RSS memory was stable during my stress tests of more than 15 minutes.

My opinion is that there's a memory leak somewhere; otherwise you shouldn't be forced to increase the V8 heap limit.

I'd recommend taking a look at the following utilities:
https://github.com/node-inspector/node-inspector
https://github.com/node-inspector/v8-profiler
https://github.com/bnoordhuis/node-heapdump

These kinds of issues are hard to debug, but I'd like to invest time in this soon since it's a challenging problem to solve.
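
As a concrete starting point for the heapdump route, something like the sketch below, dropped into the proxy script, writes periodic snapshots that can then be compared in Chrome DevTools; the interval is arbitrary.

// Sketch: write a heap snapshot every 5 minutes while the proxy runs
// (interval chosen arbitrarily). Comparing consecutive snapshots in
// Chrome DevTools helps spot objects that are being retained over time.
var heapdump = require('heapdump')

setInterval(function () {
  var file = Date.now() + '.heapsnapshot'
  heapdump.writeSnapshot(file, function (err, filename) {
    if (err) console.error('heapdump failed:', err)
    else console.log('heap snapshot written to', filename)
  })
}, 5 * 60 * 1000)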
