
One file per process instead of per metric/process for DirectFileStore [DOESNT WORK] #161

Open
wants to merge 2 commits into main from all_metrics_one_file

Conversation

dmagliola
Collaborator

This doesn't work!!

I'm just leaving it here as documentation in case someone tries to do this to solve #143
Read the commit comments for more.

NOTE: I still think this is the right thing to do (one file per process, rather than per metric/process)
It'll have more contention in multi-threaded scenarios, but it'll drastically reduce the number of files

NOTE 2: If we're going to have the shared Dict, we also need a Mutex around it! The @lock in in_process_sync needs to be shared by all MetricStores.

Anyway, I'm back to thinking we should either make a second DirectFileStore that does this (and have two separate files, instead of a param), or considerably refactor DirectFileStore. My quick stab definitely doesn't work.
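To make NOTE 2 concrete, here is a minimal sketch, with hypothetical names (SharedStoreState is not part of DirectFileStore): instead of each MetricStore holding its own lock and dict, every store in the process synchronizes on one shared lock guarding one shared dict.

```ruby
require 'monitor'

# Hypothetical sketch: one process-wide lock and one shared dict,
# instead of a per-MetricStore @lock and a per-metric FileMappedDict.
module SharedStoreState
  LOCK = Monitor.new   # shared by every MetricStore in the process
  DICT = {}            # stands in for the single shared FileMappedDict

  def self.in_process_sync
    LOCK.synchronize { yield DICT }
  end
end

# Every metric's store funnels its writes through the same lock and dict:
SharedStoreState.in_process_sync { |d| d[["requests_total", "code=200"]] = 1.0 }
SharedStoreState.in_process_sync { |d| d[["latency_seconds", "le=0.5"]] = 0.25 }
```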

Daniel Magliola added 2 commits October 14, 2019 12:33
This was a quick experiment on having all metrics for each process in
the same file, because it seemed like it would be easy (just the changes
in this commit).

However, even though tests pass, this doesn't actually work. See the next commit for why.
If the change were just this, I'd say we should do it.

However, as you'll see in the next commit, it's more involved than that,
and I'm not sure it's worth doing, at least not with this approach...

I'm just pushing this up so it doesn't get lost.
So, again, I don't propose we actually do this; it's just illustrative
of what we'd need to do, and I didn't want it to get lost.

The problem with the previous commit is that for each metric, we open a
new FileMappedDict. All of them sharing the same file is "fine", actually.
The problem comes from the fact that each instance of FileMappedDict has
its own copy of `@used`. When one metric adds a new entry to the file,
it moves its own `@used` pointer, but the other instances don't notice this.

As soon as a second instance that had already been loaded adds a new labelset,
it corrupts the file.

This only shows up when running the benchmarks, with histograms. Counters
(in our benchmark) don't reproduce it, because each metric in our benchmark
only has one labelset. They all get added initially to the file, and an
already loaded Dict doesn't add new ones. The reason Histograms reproduce it
is that new observations may report a bucket that wasn't seen before, which
is effectively a new labelset.

The solution is for all MetricStore instances to share a single FileMappedDict.
This is not particularly hard, but the current MetricStore is not written
in a way that allows that without doing this crap.

We should find a better way. I'm just leaving this here as documentation
of this problem, for the next brave soul attempting this.
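To make the failure mode concrete, here's a deliberately simplified sketch (TinyDict is hypothetical, not the real FileMappedDict): two dict instances over the same file each cache their own used-bytes offset, so the second writer's stale offset clobbers the first writer's new entry.

```ruby
# Hypothetical, simplified stand-in for FileMappedDict: each instance caches
# its own "used" offset when it opens the file and never re-reads it.
class TinyDict
  def initialize(path)
    @path = path
    @used = File.size(path)  # cached once; other instances' writes won't update it
  end

  def add_entry(data)
    File.open(@path, 'r+b') do |f|
      f.seek(@used)          # trusts the stale cached offset
      f.write(data)
    end
    @used += data.bytesize
  end
end

path = 'metrics.bin'
File.write(path, '')
a = TinyDict.new(path)
b = TinyDict.new(path)         # both start with @used == 0

a.add_entry('entry-from-A|')   # file now contains "entry-from-A|"
b.add_entry('entry-from-B|')   # b still thinks @used == 0 and overwrites A's entry
puts File.read(path)           # => "entry-from-B|"  (A's data is gone: corruption)
```
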
@coveralls

Coverage Status

Coverage decreased (-16.0%) to 84.042% when pulling 149fc3c on all_metrics_one_file into af29066 on master.


@saviogl

saviogl commented Mar 24, 2021

@dmagliola Do you know if there's a working model of this proposed approach somewhere? I'd like to give it a try if possible. We are currently running into this issue due to the use of unicorn_worker_killer and the number of files that end up getting created whenever a new worker process gets spun up. Over time the /metrics endpoint becomes very slow and memory usage grows unboundedly.

@dmagliola
Collaborator Author

Hi @saviogl, sadly there isn't one.
This is next on my "things to fix" list, and I'm hoping to get to it "in the next couple of weeks", but I don't want to under-deliver on that promise.

The way to try to solve this is to declare your own class that exposes the FileStore interface, and you'd point to that when declaring metrics.

The key thing to keep in mind here is that, while you'll have a lot of MetricStores, they all need to point to the same, shared instance of FileMappedDict, which you'd define at the FileStore level, not in the MetricStores as we do in DirectFileStore.

It's a non-trivial change, but that's basically what needs to happen.
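A rough sketch of that shape follows, under heavy assumptions: the class names are illustrative, SharedDict is a toy in-memory stand-in for the gem's internal FileMappedDict, and only the for_metric entry point follows the documented custom data store interface. The point is simply where the shared instance lives.

```ruby
require 'monitor'

# Hypothetical sketch of a "one file per process" store.
# SharedDict is a toy in-memory stand-in; the real fix would share one
# FileMappedDict instance here instead.
class SharedDict
  def initialize(path)
    @path = path      # where the single per-process file would live
    @entries = {}
  end

  def read(key)
    @entries[key]
  end

  def write(key, value)
    @entries[key] = value
  end
end

class SingleFileStore
  def initialize(dir:)
    @lock = Monitor.new
    # One dict (and so one backing file) per process, shared by all metrics.
    @dict = SharedDict.new(File.join(dir, "metrics_#{Process.pid}.bin"))
  end

  # Entry point called when a metric is declared.
  def for_metric(metric_name, metric_type:, metric_settings: {})
    # Every metric's store receives the *same* dict and the *same* lock.
    MetricStore.new(metric_name, @dict, @lock)
  end

  class MetricStore
    def initialize(metric_name, dict, lock)
      @metric_name = metric_name
      @dict = dict
      @lock = lock
    end

    def increment(labels:, by: 1)
      @lock.synchronize do
        key = [@metric_name, labels].to_s
        @dict.write(key, @dict.read(key).to_f + by)
      end
    end
  end
end
```

A real version would also need the rest of the MetricStore interface and actual file-backed storage; the sketch only shows that the dict and lock belong to the store, not to each MetricStore.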

The challenge for me is that I wanted to offer the option: a file per process, or a file per process/metric.
Unfortunately, you can't do that with a flag as this PR proposes (not without making the code a mess), so we'll probably have to either choose one or the other to offer, or offer two separate DataStores to pick from, which would have a lot of code duplication.

I'm sure there's a better way I'm missing, but anyway, that's my thinking on this.

@saviogl

saviogl commented Mar 24, 2021

@dmagliola Just trying to understand better whether or not this will actually solve our issue. While the number of files is a problem we'll run into (maybe soon), the biggest issue right now is memory consumption, due to the fact that we are "constantly" reaping unicorn worker processes for graceful memory management and therefore creating new "metrics" per process. Moving to "One file per process instead of per metric/process for DirectFileStore" would tame the number of files created, but the amount of data (metrics per process) would remain the same, correct? And therefore we would still have unbounded growth in metrics, and in the memory needed to export them?

Is my assumption correct?

If I'm understanding the current model correctly, the solution to the unbounded memory growth would be to have processes write to the same file/metric, with some inter-process concurrency control in place, right?

@Sinjo
Member

Sinjo commented Mar 25, 2021

While the number of files is a problem we'll run into (maybe soon), the biggest issue right now is memory consumption, due to the fact that we are "constantly" reaping unicorn worker processes for graceful memory management and therefore creating new "metrics" per process.

If I've understood you right, this comment in #214 describes the situation you're running into, and you're generating a tonne of metric series by churning through unicorn workers. Whatever we come up with in there won't help with scraping becoming slow because of a large number of files, but it might help you reduce the number of metrics (and ultimately timeseries in your Prometheus instance) produced.

@wizardmatas

FYI: GitLab uses a similar implementation: https://gitlab.com/gitlab-org/prometheus-client-mmap
