slow response on metrics http endpoint #194

Open
bastelfreak opened this issue Jun 17, 2020 · 15 comments

@bastelfreak

Hi people,
I use The Foreman in version 2.0.0. That version bundles the prometheus-client gem in version 1.0.0. A simple call to the /metrics endpoint takes 60 to 75 seconds. I already raised an issue for that at the Foreman project: https://projects.theforeman.org/issues/30029

They advised me to create an issue here as well. I'm not really sure how I should debug this. The whole stack runs on a dedicated server with CentOS 7. I don't see any CPU/IO peaks.

I'm aware that this information isn't really enough to debug the issue. Can you please tell me which details you need / how I should debug this? I'm happy to debug this for you or discuss it in any chat.

Cheers, Tim

@dmagliola
Collaborator

Hello Tim,
Thank you for your report.

The only case where we've heard of those kinds of timings is with a very large number of metrics (tens of thousands) using the DirectFileStore. The output you report in the Foreman issue has over 3000 lines, so it does look like you're getting near the point where that starts being an issue.

The suggestion Lukas gave you on that ticket makes sense to me, as having fewer buckets for histograms should result in fewer metrics.

  • Could you compare the output of /metrics before and after making that change to the buckets, and let me know how many lines are in the output before and after it? I'm trying to confirm that that change is indeed having the effect it intends.

  • Could you let me know how many processes and threads per process you are using on your Puma server? If you are running multi-process, you are probably using the DirectFileStore, which is the slowest of the stores (sadly, this is necessary to achieve the cross-process communication). If you are running on a single process, however, you can configure the Prometheus client to use the Synchronized store, which is faster (see the sketch after this list).

  • Could you confirm which Prometheus store you are using?

  • Can you reproduce these slow timings in a development / staging environment? Or does it happen only in production?
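
For reference, here's a minimal sketch of how the two stores are selected with this gem (this is not Foreman's actual configuration; the /tmp/prometheus path is just an example):

require 'prometheus/client'
require 'prometheus/client/data_stores/direct_file_store'

# Multi-process (e.g. Puma with more than one worker): metric files shared across processes.
Prometheus::Client.config.data_store =
  Prometheus::Client::DataStores::DirectFileStore.new(dir: '/tmp/prometheus')

# Single-process only: the in-memory default, noticeably faster to scrape.
# Prometheus::Client.config.data_store =
#   Prometheus::Client::DataStores::Synchronized.new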

This would help us diagnose whether there are some quick wins. However, the sad news is that, as the library stands, you may not be able to have as large a number of metrics as you currently have, or you may need to make adjustments to allow for really long render times on the /metrics endpoint (such as reducing the scraping frequency from Prometheus).

It saddens me to say so, but unfortunately this is as fast as we've been able to make the library at the moment. We would like to make it faster, of course, but we don't currently have any clear plans on how we could achieve a significant difference (we're talking an order of magnitude faster here to make a real difference).

It may be necessary, if there are no other choices, to reduce the number of metrics you are reporting on. The easiest way to do this would be to reduce the cardinality of your labels. For example, if you report on the controller but not the action, that could reduce the number somewhat.

Or, alternatively, reduce the number of metrics. If there are some of those you are not currently using in your dashboards / alerts / etc, you could remove them.
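
As a purely illustrative sketch (the metric name, labels and buckets below are made up, not Foreman's actual ones), this is what trimming cardinality looks like at declaration time with this client: each label you drop and each bucket you remove eliminates a whole set of time series from /metrics.

require 'prometheus/client'

registry = Prometheus::Client.registry

# Before (hypothetical): two labels and the full default buckets.
# registry.histogram(:http_request_duration_ms,
#                    docstring: 'Request duration in ms',
#                    labels: [:controller, :action],
#                    buckets: [1, 5, 20, 50, 100, 250, 500, 2000])

# After: keep only the :controller label and three buckets (+Inf is added automatically).
http_duration = registry.histogram(:http_request_duration_ms,
                                    docstring: 'Request duration in ms',
                                    labels: [:controller],
                                    buckets: [100, 500, 2000])

http_duration.observe(42.0, labels: { controller: 'hosts' })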

I'm aware this is not a very happy suggestion, but I'm afraid at the moment I don't have a better answer.

I'm also aware that you may or may not be able to modify how these metrics are reported. (I'm not familiar with Foreman, sorry). This may require modification to Foreman to make the metrics it reports more customizable, if it's not currently possible.

With the information at hand, this is sadly the best answer I have. Maybe if we could get some more information, like a profile of what's happening when the metrics get scraped, we might notice some anomaly based on that. This again may be something that The Foreman project may need to help you with.

Let me know the answers to the questions above, and hopefully we can find a way of making this faster for you!

@dmagliola
Collaborator

dmagliola commented Jun 17, 2020

One more thing I remembered.
This may be related to this other issue #143

When you have lots of metrics, you will end up with lots of files on disk. (Lots of metric names, that is; this isn't affected by labels and their cardinality.)

I don't know for sure, but I don't think this helps export performance, and I'd like to have an implementation of a store that keeps a single file per process, rather than one per process/metric. I expect this would perform less well when observing samples, but better when exporting.

PR #161 was a first attempt at this, trying to add the functionality to the existing store, which had a lot of problems documented in there. We probably need to make a separate store, which feels a bit less clean, or refactor this one heavily. If anyone is interested in contributing to this effort, it'd be highly appreciated. It's still in our list of things to do, but we haven't had time yet.

@bastelfreak
Author

Hi @dmagliola,
thanks for the quick feedback. I started with three curl calls with the default buckets:

DEFAULT_BUCKETS = [1, 5, 20, 50, 100, 250, 500, 2000].freeze

which resulted in:

real	0m46.964s
user	0m0.005s
sys	0m0.015s
real	0m48.860s
user	0m0.002s
sys	0m0.016s
real	0m53.635s
user	0m0.004s
sys	0m0.019s

I then changed the buckets to:

DEFAULT_BUCKETS = [100, 500, 2000].freeze

and restarted Foreman. Now three curls again:

real	1m7.207s
user	0m0.002s
sys	0m0.013s
real	0m56.515s
user	0m0.003s
sys	0m0.014s
real	0m57.622s
user	0m0.004s
sys	0m0.011s

During the /metrics call I see that one to three Puma threads run at 100% CPU, but the machine has enough free cores, memory and IO. Sadly, I don't have a dedicated test environment, so there is always some production load while I test.

I use the default Prometheus backend, currently with Prometheus 2.19.0. But scraping for Foreman is not enabled yet; I just tried it with curl. Since the response takes so long, I wanted to be sure that scraping it doesn't degrade the overall application performance.

I know that Puma has some tuning options for processes/threads, but I'm no Ruby/Puma/Rack expert and can't really tell what's the best way to go here. The default Foreman installation provides these settings to Puma:

Environment=FOREMAN_PUMA_THREADS_MIN=0
Environment=FOREMAN_PUMA_THREADS_MAX=16
Environment=FOREMAN_PUMA_WORKERS=2

That didn't work for my workload, so I changed it to:

Environment=FOREMAN_PUMA_THREADS_MIN=0
Environment=FOREMAN_PUMA_THREADS_MAX=24
Environment=FOREMAN_PUMA_WORKERS=4

I'm happy to change this to whatever you suggest.

I know that some of the reported metrics aren't required for my personal use case and I can live without them. I'm not sure if Foreman has an option to disable/filter them. I will investigate that and report back as soon as I have more information. Please let me know if you have any questions or suggestions, or need more data from me.

@dmagliola
Collaborator

dmagliola commented Jun 18, 2020

Different buckets

Would you be able to look at the text output of /metrics, both with the default buckets and the reduced buckets, and count how many lines each has?
Or... can you look at the output with the reduced number of buckets and confirm that it actually has those new buckets?

My working hypothesis here is that reducing the number of buckets should significantly reduce the number of time series.
You have about (these are some rough greps here, not precise counts) 3200 time series. About 2300 of them are histogram buckets.
If you reduce the number of buckets from 9 to 4, like you did (the buckets you specify, plus +Inf), I'd expect you'd end up with about 2000 total time series.
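
Spelling out that back-of-the-envelope estimate (the ~255 label sets below is itself derived from the rough grep counts, so treat it as an assumption):

bucket_lines_before = 2300
buckets_before      = 8 + 1                                  # default buckets plus +Inf
label_sets          = bucket_lines_before / buckets_before   # ~255 histogram label sets
buckets_after       = 3 + 1                                  # reduced buckets plus +Inf
bucket_lines_after  = label_sets * buckets_after             # ~1020
other_lines         = 3200 - bucket_lines_before             # sums, counts, counters, gauges: ~900
bucket_lines_after + other_lines                             # ~1920, i.e. roughly 2000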

Given my previous expectation that rendering time is somewhat proportional to number of time series, I would've expected a measurable drop in rendering time.

The rendering time didn't drop, which makes me think of two obvious possibilities:

  1. the rendering time is not proportional to the number of time series
  2. the code change to reduce the number of buckets didn't actually work, and you may still have the original default buckets :-)

I want to make sure it's not possibility 2 that's happening.


During the /metrics call I see that one to three Puma threads run at 100% CPU

Yes, I would expect this. While we're reading from files, I would expect that most of the time is spent mutating the hashes read from the files to aggregate all the processes, and/or turning those resulting hashes into the text output.

We've put most of our performance effort into observing samples really fast, not into exporting them, so I don't have actual evidence that this is true (I haven't measured it). So this may be very wrong, but that's my expectation given what the code needs to do.
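
To illustrate (conceptually only; this is not the library's actual code), the export path has to do something like the following for every metric, which is pure Ruby hash work and therefore CPU-bound:

# Conceptual sketch of the aggregation step during a scrape. read_entries is a
# stand-in helper, not a real API; counter values are summed across per-process files.
def aggregate_metric(metric_files)
  metric_files.reduce(Hash.new(0.0)) do |acc, path|
    read_entries(path).each do |labels, value|
      acc[labels] += value
    end
    acc
  end
end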


I use the default Prometheus backend, currently with Prometheus 2.19.0.

The Prometheus client by default uses the Synchronized store, which lives in RAM and does not support multiple processes.
Looking at the Foreman source code, though, it does set the store to DirectFileStore, which is the correct setting in a multi-process environment like yours.
So if you're not changing it, that somewhat confirms you're using the file-backed store.


As it stands, I don't have any obvious recommendations.

The next step to move this forward would be to replicate a use case with those same metrics / label counts, with files from 4 processes, run it in development and profile it. And then start optimizing the hot paths.

We are unlikely to have time to work on this "soon", although I would really like to. It would be amazing if someone has time to contribute to this effort.


One final thought...
On disabling metrics you don't need.
This is a fairly horrible hack and not how the system is intended to work, but...
Foreman is declaring a bunch of metrics, and putting them in the Prometheus default registry. Then when /metrics gets hit, the Prom Client will read the registry, find all the metrics in there, and export them.

You could (emphasis on could, not necessarily should) post-facto remove the metrics you don't want from the registry. The metrics would still exist, they'd still get incremented, but they wouldn't be exported.
It'd have to be code that runs after Foreman has initialized the metrics; I don't know its code well enough to know where you'd add that.

Now, this goes against the design of the system, and it could break something (even though I'm assuming it probably wouldn't). Tread carefully if doing that, and I don't recommend it as a permanent solution... But it could be a quick test of "does this get significantly faster if I remove the metrics I don't want".
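
A hedged sketch of that experiment (the metric names below are placeholders, and whether the Registry exposes an unregister method depends on the prometheus-client version installed; hence the respond_to? guard):

# Hypothetical initializer that would have to run after Foreman registers its metrics.
require 'prometheus/client'

registry = Prometheus::Client.registry
unwanted = %i[some_metric_you_do_not_need another_unused_metric]   # placeholder names

unwanted.each do |name|
  if registry.respond_to?(:unregister) && registry.exist?(name)
    registry.unregister(name)   # the metric keeps being updated, but is no longer exported
  end
end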

If the answer is yes, maybe this is something that could be changed in Foreman, letting you pick and choose. I'm not sure whether this is at all interesting to the Foreman team, and I'm aware I'm bouncing you back to them when it's this piece of code that is slower than it should be.

But sadly I don't have any quick wins. We would need to find a way to significantly speed up the Prometheus Client. I don't currently have any ideas for how we could do that without a good amount of time investment, and I'm not even sure the kind of order-of-magnitude improvement you'd need is possible given the constraints of the multi-process environment.

@lzap

lzap commented Jun 18, 2020

Hello guys,

Looking at the Foreman source code, though, it does set the store to DirectFileStore

I can confirm that.

I am going to make two changes:

  1. We will reduce the default buckets down to 3-4.
  2. We are going to implement an allowlist of controllers; others will be ignored. Controller/action are the most-used tags, and they also contribute a significant amount of metrics.

Our telemetry approach was rather simple: I thought we'd just throw all the metrics at the user and let them dig the interesting data out. Looks like we need to cherry-pick a decent set.

There is actually a way out: in Foreman you can also use statsd, and then use the Prometheus statsd exporter, which receives the data over UDP and exports it over HTTP. It's written in Go, so it should be pretty fast, and the UDP transport also solves the multi-process issue nicely: the packets queue up in the kernel buffer until they are processed.

@bastelfreak
Author

Thanks again for the detailed response! The number of metrics indeed goes down with different bucket sizes.

With:

DEFAULT_BUCKETS = [1, 5, 20, 50, 100, 250, 500, 2000].freeze

I get 3333 lines of output.

With:

DEFAULT_BUCKETS = [100, 500, 2000].freeze

I get 2522 lines of output. I got a response at https://projects.theforeman.org/issues/30029#change-137579 about disabling metrics, which seems less ugly than the way you mentioned. I will try that as the next step and see if I can reduce the number of metrics.

@dmagliola
Collaborator

dmagliola commented Jun 18, 2020

We are going to implement an allowlist of controllers; others will be ignored.
Controller/action are the most-used tags, and they also contribute a significant amount of metrics.

This sounds great.

Our telemetry approach was rather simple: I thought we'd just throw all the metrics at the user and let them dig the interesting data out. Looks like we need to cherry-pick a decent set.

That sounds like the right approach. This is what I'd normally do initially... until some limit is hit.

The number of metrics indeed goes down with different bucket sizes.

Ok. Thanks for confirming that.


One more thought, which would almost certainly explain bad performance...

Are you running this on an ephemeral filesystem (for example, in a container), or are the files persistent between runs of Foreman in production?

The reason I'm asking is that these metrics are kept in files in /tmp. If you are keeping these files long-term, you will continue accumulating more and more files over time, and all of those will need to be read on each scrape. That will definitely make this slower.

A bit more information about this and how to fix it here:
https://github.com/prometheus/client_ruby#directfilestore-caveats-and-things-to-keep-in-mind

Check the "Resetting your metrics on each run" sub-heading.

@lzap

lzap commented Jun 25, 2020

That will definitely make this slower.

Good point. We have actually implemented a metric label allowlist, and we only allow the key labels now. Also, during startup we delete all *.bin files in the storage directory. The PR is up for review: theforeman/foreman#7765

@lzap

lzap commented Dec 2, 2020

For the record, after a chat with @dmagliola it's clear what the problem is: multi-process servers (e.g. Puma, Passenger and others) recycle worker processes quite often. I thought it should be possible to simply delete those temporary files when a process exits, since I had the impression that the implementation holds the data in memory. However, @dmagliola explained to me that this would ultimately lead to loss of information, as the data is not squashed or aggregated anywhere. Indeed, it would break counters, which are expected to be monotonic.

Unfortunately, any kind of data "squashing" will be challenging to implement. I suggest keeping this issue open until this is resolved. I am going to link this issue from our docs to warn users about this problem. We do have a second option for exporting data via statsd (UDP packets), which can be used with statsd_exporter to aggregate the data and pull it into Prometheus.

@dploeger

Is this still relevant? I'm starting to implement Prometheus-based metrics at our site and my first tries didn't show this problem. Also, the PR is already merged.

So, @bastelfreak is this still happening with the latest Foreman? (We're on 3.3.0)

@bastelfreak
Author

Hey,
I no longer have access to the environment :( I don't know if it's fixed or not.

@dploeger

Oh, okay. Then I'll just have to try. Thanks. 😄

I will send you an update if everything goes smoothly so you can close the issue.

@fbegyn

fbegyn commented Dec 14, 2023

@dploeger any follow-up on this? We're looking to spin up some Foreman metrics as well, and I'd be grateful if I didn't have to test this out myself 😅

@dploeger

@fbegyn Ha, thanks for asking me that. I just realized that Prometheus wasn't enabled anymore in our environment. I'll enable it again and let you know.

@dploeger

I re-enabled the check last week and can't see any problems. I will enable it on our prod system today.
