
Performance issues under load #78

Open
JensenAlexander opened this issue Feb 10, 2022 · 12 comments

@JensenAlexander

The cql-proxy sidecar is having trouble handling larger loads. We brought it up in our staging environment, which has a consistent load averaging about 0.25 million reads per hour, and we noticed that our requests started timing out.

DB Retry for Get object has finally succeeded with 1 retries. No error on the query.
DB Retry last error: gocql: no response received from cassandra within timeout period
DB Retry for Objects exceeded retry count of 10.
DB Retry last error: gocql: no response received from cassandra within timeout period
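(For context, here is a simplified sketch of the retry wrapper that produces those log lines; the function name, query, and backoff below are placeholders rather than our exact code.)

package client

import (
	"errors"
	"log"
	"time"

	"github.com/gocql/gocql"
)

const maxRetries = 10 // matches "exceeded retry count of 10" above

// getObject retries a read whenever gocql reports a request timeout, which is
// roughly what produces the "DB Retry" messages quoted above.
func getObject(session *gocql.Session, id string) (string, error) {
	var value string
	var err error
	for attempt := 0; attempt <= maxRetries; attempt++ {
		err = session.Query(`SELECT value FROM objects WHERE id = ?`, id).Scan(&value)
		if err == nil {
			if attempt > 0 {
				log.Printf("DB Retry for Get object has finally succeeded with %d retries.", attempt)
			}
			return value, nil
		}
		// gocql surfaces request timeouts as this sentinel error:
		// "gocql: no response received from cassandra within timeout period"
		if !errors.Is(err, gocql.ErrTimeoutNoResponse) {
			return "", err
		}
		time.Sleep(100 * time.Millisecond) // placeholder backoff
	}
	log.Printf("DB Retry for Objects exceeded retry count of %d.", maxRetries)
	log.Printf("DB Retry last error: %v", err)
	return "", err
}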
@mpenick
Contributor

mpenick commented Feb 10, 2022

Thanks for trying it out. A couple of questions:

  • What version of the proxy? (0.0.4?)
  • Are you using it against Astra or Cassandra?
  • ".25 million reads per hour" is this a single proxy instance?
  • Approx. what size are the request and response payloads?

I think this might have been resolved by the following:

  • I've optimized parsing significantly (Optimize parser #58). I'm seeing way less CPU usage under much higher load. Give me a couple days to quantify this. I'll try the request rate from your workload (versus 0.0.4).
  • I'm adding a new --num-conns parameter here: CLI and README updates #76. You might try bumping this up to --num-conns 2. This should be merged in the next couple days.

Give main a try and see if you see better results.

@mpenick
Contributor

mpenick commented Feb 10, 2022

.25 million reads per hour

1e6*.25/3600 == ~69.4 request/s

Is that right? I've run much bigger workloads than this w/o issue. Maybe the request/response payloads are much bigger?

@JensenAlexander
Author

JensenAlexander commented Feb 11, 2022

  • Yes, we are using 0.0.4
  • We're using it against Cassandra
  • The .25 million reads is on a single proxy instance
  • The size can vary from extremely small to megabyte streams
  • That math seems correct.

@mpenick
Contributor

mpenick commented Feb 14, 2022

Thanks for all the info. I'll try to reproduce and see what I can find.

The size can vary from extremely small to megabyte streams

This is something I have not tried yet.

@mpenick
Contributor

mpenick commented Feb 14, 2022

I ran some tests using 1MB payloads (reads) @ ~70 request/s. Here are the results (duration in seconds, percentiles in microseconds):

direct (no proxy):

num_requests,   duration, final rate,       min,       mean,     median,       75th,       95th,       98th,       99th,     99.9th,        max
       10000,    142.864,    69.9966,      3528,      10942,      10751,      10855,      11751,      13399,      14695,      27023,      40863

cql-proxy (--num-conns 1, version 7a02afd)

num_requests,   duration, final rate,       min,       mean,     median,       75th,       95th,       98th,       99th,     99.9th,        max
       10000,    142.878,    69.9897,       203,      20232,      21535,      22367,      27279,      28079,      30095,      47935,      68799

cql-proxy (--num-conns 2, version 7a02afd)

num_requests,   duration, final rate,       min,       mean,     median,       75th,       95th,       98th,       99th,     99.9th,        max
       10000,    142.875,    69.9914,       159,      20364,      21647,      23023,      27519,      28575,      30783,      47327,     270591

Raw data: https://gist.github.com/mpenick/256c313c4f075315c9a07cdf2bddc7c5

@mpenick
Contributor

mpenick commented Feb 14, 2022

I'm currently running a 1hr test @ ~70 request/s using cql-proxy. I'll post the result when done.

Results:

num_requests,   duration, final rate,       min,       mean,     median,       75th,       95th,       98th,       99th,     99.9th,        max
      250000,    3571.72,    69.9943,       232,      20476,      21487,      22655,      27215,      28191,      30111,      50175,     236543

Raw data: https://gist.github.com/mpenick/75bdd66e2699e6564e87d0e6140154a1

@mpenick
Contributor

mpenick commented Feb 15, 2022

I noticed this: ea8421b. I'm seeing a huge difference in performance (from reduced system calls). Give that a try; it's now on main. I'll also be pushing a release tomorrow.
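For the curious, the gist of the reduced system calls is coalescing many small frame writes into fewer write() calls. A rough sketch of that technique (not the exact change in ea8421b):

package proxy

import (
	"bufio"
	"net"
)

// bufferedConn wraps a client connection so that many small frame writes are
// coalesced in memory and flushed with far fewer write() system calls.
type bufferedConn struct {
	net.Conn
	w *bufio.Writer
}

func newBufferedConn(c net.Conn) *bufferedConn {
	return &bufferedConn{Conn: c, w: bufio.NewWriterSize(c, 64*1024)}
}

// Write buffers the frame instead of issuing one syscall per frame.
func (b *bufferedConn) Write(p []byte) (int, error) {
	return b.w.Write(p)
}

// Flush issues a single write() for everything buffered since the last flush,
// e.g. once per batch of responses rather than once per response.
func (b *bufferedConn) Flush() error {
	return b.w.Flush()
}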

@mpenick mpenick self-assigned this Feb 15, 2022
@JensenAlexander
Author

We just tried v0.1.1 and we're seeing the same issue.

@JensenAlexander
Author

Initially the failures came in spurts almost exactly (within a minute of) every 20 minutes: 10:21, 10:51, 11:11. But then they became fairly constant, with occasional peaks of over 1000 failures within two minutes.

@JensenAlexander
Author

@mpenick Do you have any update/suggestions?

@mpenick
Contributor

mpenick commented Mar 29, 2022

@mpenick Do you have any update/suggestions?

I haven't been able to reproduce the issue yet, so it's hard for me to know how to proceed here; the tests above were my attempts to reproduce it.

What's the request timeout setting you're using for gocql?
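For reference, that's the ClusterConfig.Timeout field in gocql; a minimal sketch with placeholder values (600ms is gocql's default):

package client

import (
	"time"

	"github.com/gocql/gocql"
)

// newSession shows where gocql's request timeout is configured; the host name
// and values here are placeholders, not a recommendation.
func newSession() (*gocql.Session, error) {
	cluster := gocql.NewCluster("cql-proxy-host")
	cluster.Timeout = 600 * time.Millisecond        // per-request timeout behind the "no response received" error
	cluster.ConnectTimeout = 600 * time.Millisecond // connection setup timeout
	return cluster.CreateSession()
}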

@mpenick
Contributor

mpenick commented Mar 29, 2022

Thinking about this a bit more, maybe it makes sense to add some initial metrics to cql-proxy. We could then see where the extra latency is coming from by pointing a Grafana instance at it.
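Something along these lines, for example. This is only a rough sketch using the Prometheus client; the metric name, label, and buckets are placeholders, not an existing cql-proxy API.

package proxy

import (
	"net/http"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// requestLatency would let us break down where time is spent per request.
var requestLatency = prometheus.NewHistogramVec(
	prometheus.HistogramOpts{
		Name:    "cql_proxy_request_duration_seconds",
		Help:    "Latency of CQL requests forwarded by the proxy.",
		Buckets: prometheus.ExponentialBuckets(0.0005, 2, 12), // ~0.5ms to ~1s
	},
	[]string{"op"},
)

func init() {
	prometheus.MustRegister(requestLatency)
}

// observe records the latency of a single proxied request.
func observe(op string, start time.Time) {
	requestLatency.WithLabelValues(op).Observe(time.Since(start).Seconds())
}

// serveMetrics exposes /metrics for Prometheus (and therefore Grafana) to scrape.
func serveMetrics(addr string) error {
	http.Handle("/metrics", promhttp.Handler())
	return http.ListenAndServe(addr, nil)
}

Grafana could then graph the per-operation latency histogram alongside the client-side timings to show where the extra latency comes from.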
