Improving DirectFileStore for performance and scalability #281

stanhu · 2023-03-08T19:25:12Z

At GitLab, we've been running prometheus-client-mmap to record and report Prometheus metrics. It's an early fork of this project and uses a C extension with mmap() to optimize both the read and write paths.

We'd like to switch to DirectFileStore, but we see a number of issues:

File descriptor usage: I don't believe DirectFileStore recording one metric per file is going to work at scale. With thousands of metrics, that's a lot of file descriptors that may have to be opened to read and write metrics. In prometheus-client-mmap we use one file metric per metric type (counter, histogram, gauge). I see there was an attempt to use one file per process in One file per process instead of per metric/process for DirectFileStore [DOESNT WORK] #161.
Read performance (related to CPU usage growing over time with DirectFileStore #232, DirectFileStore creates too many files for long running applications #143, slow response on metrics http endpoint #194): Aggregating thousands of metrics in Ruby is pretty CPU and memory intensive. We found that this was even difficult to do efficiently with a Go exporter that reads the prometheus-client-mmap metrics since garbage collection becomes an issue due to memory allocations. We have a prototype Rust port that handles reading of metrics that is faster than the C implementation.
Aggregating metrics in a separate process: The metrics stored by prometheus-client-mmap can be aggregated by a separate process since the metric type is in the filename. With DirectFileStore, I believe the types are registered in the registry, so there's no way an outside process can determine the metric type just by scanning the .bin files.

I would like to propose a path forward:

Switch DirectFileStore to use file per process, or one file per metric type. The latter is simpler from our experience. If we do the former, we'd probably want to encode the metric type in the file.
Add an optional Rust extension for reading metrics generated by DirectFileStore. We can easily adapt the work in our Rust port for DirectFileStore.

A side point: I believe the metrics are always written in the native endian format. I propose that we enforce little endian (which is what x86 and ARM64 use) to avoid cross-platform confusion.

@dmagliola What do you think?

The text was updated successfully, but these errors were encountered:

xlgmokha · 2023-03-08T19:38:00Z

I propose that we enforce little endian (which is what x86 and ARM64 use) to avoid cross-platform confusion.

I like this idea to make it easy to port from one machine to another for analysis etc.

If we're open to discussing adjustments to the binary format, would it be useful to consider a field in the header of the file to specify some sort of version of the binary format to allow for easier upgrade? I think it might make it easier to evolve the schema of the binary format over time and reduce the risk of upgrades.

dmagliola · 2023-03-08T23:25:02Z

Switch DirectFileStore to use file per process

Yes!!! So there are two reasons I haven't done this...

One is "fear": maybe that'll actually be slower / worse in some way for some case or someone. It would be worse if you have a lot of thread contention, basically, which was the logic behind "file per metric": if 2 threads are incrementing at the same time, if they're not the same metric, they don't block on each other because it's separate files. In retrospect, I consider this was a mistake.

But this fear is why I tried making it configurable in the PR you sent, which didn't quite work. Having interchangeable stores kinda solves that, though, so i'm not too fussed, if we can convince ourselves that, for most cases, the default (file per process) is better, and if someone has that high thread contention, they can switch to the "file per metric" store, which would be in a separate accessory gem, for example.

The second reason is honestly I haven't had much time to work on this, and this is a not insubstantial change, so I've been putting it off (and feel appropriately bad about that!). So i'm really happy to hear you're interested in helping with it!

or one file per metric type. The latter is simpler from our experience.
If we do the former, we'd probably want to encode the metric type in the file.

At the time I refactored the stores, I did see that you were doing "file per metric type", but I didn't quite understand why (or if I did, I no longer remember). Can you tell me a bit more about why this is useful / advantageous?

Add an optional Rust extension for reading metrics generated by DirectFileStore.
We can easily adapt the work in our Rust port for DirectFileStore.

Again, if this is also a separate store, particularly in a separate gem (i'll expand on why) I would love to see this!

The main challenge I had with making the exporting faster was that I was letting purity of the interfaces get in the way. It's been a while, so I may be getting details wrong here, but it's something like the exporter does a "foreach metric, get the data", and so for each of the metrics you're going to the files and reading them and processing them, and then for the next metric you do that again. Which reads very elegant in the code, but the performance is terrible. You'd want to read all the files, munch them all at once, and dump back a giant result with all the numbers for all the metrics, and then the exporter can deal with that. That'd be faster, and from what you're saying, I gather you could do that even faster in C/Rust. The "problem" is the way the code is currently structured doesn't play nice with that; we need to refactor it, and probably accept a "elegance/performance" tradeoff (which i'm happy to do)

I'd say let's have a sketch of what your idea is before sinking a lot of time and effort in that, and once we know roughly the direction, that'd be awesome!

The reason i'm saying this Rust solution should be in a separate gem, however, is we want the main gem to be pure-Ruby, because we want to support platforms like JRuby, which can't run native code gems, and which don't need these solutions (because they don't fork). So ideally the Rust solution would be a separate gem one includes and points at in configuration.

A side point: I believe the metrics are always written in the native endian format. I propose that we enforce little endian (which is what x86 and ARM64 use) to avoid cross-platform confusion.

I have no opinion on that. I don't intuitively understand why that's better (and i'd like to know just for my own learning), but i'm happy either way :)

If we're open to discussing adjustments to the binary format

We absolutely are. The binary format is what I found in a proof of concept someone (Julius?) had done and I blatantly stole it. I have no attachment to it, and as per the answer below, it doesn't matter what it is :)

would it be useful to consider a field in the header to allow for easier upgrade

This has been explicitly not considered. These files are intended to be ephemeral. You're supposed to delete all of them every time you restart your web server. There shouldn't be any situations where one version of the code needs to read files generated by another version. So we didn't worry about these types of headers / migrations / versions / etc, but also, we almost want to actively discourage that kind of thing. There shouldn't be a reason for doing that :)

port from one machine to another for analysis etc.

Ok, fine you found a reason ;-)
I'm not convinced of how useful this is, tbh.
@Sinjo has an issue to do something similar: #118

Tbh, i'm skeptical, but i'm not going to particularly stand in the way, assuming "no serious trade-offs". I would worry about introducing code complexity (having more involved code that can read different versions of files), or any sort of performance hit, I don't think pretty much any tradeoff is worth this.

But again, i'm open to changing my mind on this one!

Thank you for this, i'm really excited to see this move forward. I've had it in my "wish i had time to do" list for years now!

xlgmokha · 2023-03-09T00:27:01Z

Again, if this is also a separate store, particularly in a separate gem (i'll expand on why) I would love to see this!

IIUC I think you're suggesting creating a new gem to offer a custom data store that can store and process the data however it wants as long as it implements the Store interface. Does that sound about right?

SuperQ · 2023-03-09T10:10:50Z

@stanhu One thing I've been thinking about is how we could make a multi-process shared file store that is consistent across different client library implementations.

The main reason for this is we've been working on a new multi-process worker controller to replace things like unicorn/puma/gunicorn. Kinda like Phusion Passenger, but our controller is written in Go. I'm in the process of getting approval to open source this new worker controller.

One feature I've been thinking about is that we implement the metrics gathering read path in Go so we can take Ruby/Python/Node out of the loop.

dmagliola · 2023-03-09T10:24:33Z

I think you're suggesting creating a new gem to offer a custom data store that can store and process the data

Exactly that!
Take this for example: https://github.com/gocardless/prometheus-client-ruby-data-stores-experiments
That's not a useful gem, I left it there for documentation of the different approaches we took and their performance characteristics, but that's sort of the shape you'd have :)

One feature I've been thinking about is that we implement the metrics gathering read path in Go so we can take Ruby/Python/Node out of the loop

Would you be able to call this Go "web server" from the Ruby/Python/Node code? If so, you can do the incrementing / observing there too, and you don't need files anymore...
That said, unless and until that catches on and replaces Puma/Unicorn, I feel that that is a very niche edge case, very Gitlab specific, so I feel about it the same as for those extra headers I mentioned above... If there's no tradeoff, then sure. But I'm not sure we want to make the code much more complex if that's necessary, for example, to support that?

SuperQ · 2023-03-09T11:09:55Z

Would you be able to call this Go "web server" from the Ruby/Python/Node code? If so, you can do the incrementing / observing there too, and you don't need files anymore...

Yes, that's possible, but I think it would be less efficient all-around because each increment would require network packets. We'd be back to the inefficiencies of statsd. 😁

That said, unless and until that catches on and replaces Puma/Unicorn, I feel that that is a very niche edge case, very Gitlab specific,

I'm sure it will catch on once I show the numbers on how much performance improvement we got. Although we're actually replacing Einhorn, not Unicorn. Also this is not for GitLab/Ruby. It's actually a Python service.

My only request is that we make a "Shared filesystem metrics format" that is portable across implementations. This way we could share functionality across client_ruby, client_python, etc. Same thing goes with the idea of "works with pure ruby" vs "Ruby with a Rust plugin".

stanhu · 2023-03-10T20:58:21Z

One feature I've been thinking about is that we implement the metrics gathering read path in Go so we can take Ruby/Python/Node out of the loop.

@SuperQ As I mentioned earlier, we actually tried this approach in our staging environments (see parsing code). As @mttkay said in https://gitlab.com/groups/gitlab-org/-/epics/9433#note_1271623080:

The problem is that with the way prometheus-client-mmap writes data, you essentially need to solve a map-reduce problem in the server: given N per-process data files, you need to read, transform and aggregate samples in-memory, then serialize them all to JSON. This is very memory and compute heavy. We thought since this is a parallel problem, Go was a great solution because we can process each partition in a goroutine in parallel, but the Go GC is not well suited for these data heavy, spiky work loads and memory use was very high and unpredictable even in staging and even after months of tuning and investigation.

That's why we're looking at Rust now, and we're quite encouraged that our benchmarks show that the Rust implementation for the read path is already faster than the C implementation for a representative test case with thousands of metrics:

Warming up --------------------------------------
                   C     1.000  i/100ms
                rust     1.000  i/100ms
Calculating -------------------------------------
                   C     11.377  (± 0.0%) i/s -     57.000  in   5.012109s
                rust     11.910  (± 0.0%) i/s -     60.000  in   5.042086s

Comparison:
                rust:       11.9 i/s
                   C:       11.4 i/s - 1.05x  slower

We're doing some work to validate that Rust memory usage is also in line, but we don't really expect that many surprises since there's no garbage collection.

My only request is that we make a "Shared filesystem metrics format" that is portable across implementations. This way we could share functionality across client_ruby, client_python, etc. Same thing goes with the idea of "works with pure ruby" vs "Ruby with a Rust plugin".

I like that idea. That suggests to me that all the metric data--including docstrings, metric types--should be serialized to a file so that any client can read and aggregate the data just from the files themselves.

The reason i'm saying this Rust solution should be in a separate gem, however, is we want the main gem to be pure-Ruby, because we want to support platforms like JRuby, which can't run native code gems, and which don't need these solutions (because they don't fork). So ideally the Rust solution would be a separate gem one includes and points at in configuration.

@dmagliola That makes sense. I wonder if we want to follow the model of redis-client, which ships a C extension as a separate gem in the same repository (https://github.com/redis-rb/redis-client/tree/master/hiredis-client).

At the time I refactored the stores, I did see that you were doing "file per metric type", but I didn't quite understand why (or if I did, I no longer remember). Can you tell me a bit more about why this is useful / advantageous?

That goes to my previous point. I believe with DirectFileStore you have to rely on the registry to know whether a given metric is a counter, gauge, etc. in order to aggregate it. But if you want a separate process or tool to read this data, you can decode individual metric files, but you can't aggregate the data. If we want a single file per process, then we'd also have to encode the metric type in the data. That makes the format a little more complex, but I think that's solvable.

Going forward, I'd propose:

We agree on a binary format that makes sense to support one file per process.
Implement a new store in Ruby. Let's call this SuperFileStore for now. 😄
Implement a separate gem, such as prometheus-client-rust, that implements a reader for SuperFileStore.

If I were to propose a new binary format, I might suggest something like:

< ----- 4 bytes ----- >  | < ----- 4 bytes ----- >
|              Total used bytes                   |
| Total length of key 1  | Key 1 metric type      |
| Key 1 docstring length | Key 1 docstring        |
| Key 1 docstring (continued with padding)        |
| Key 1 name length      | Key 1 name             |
| Key 1 name (continued with padding)             |
| Key 1 value (double)                            |
| Total length of key 2  | Key 2 metric type      |
...

Some notes:

The metric type is 0 for counter, 1 for gauge, and 2 for histograms.
In prometheus-client-mmap the length of Key 1 name is the number of bytes without the padding, while in DirectFileStore spaces are added. I do think it's easier to reason about if the length covers the padding. The DirectFileStore implementation adds spaces that have to be stripped.
In prometheus-client-mmap, the metric name is serialized to JSON with labels in separate key-value arrays. For example, for a given metric:

cache_misses_total{method="GET",feature_category="build_artifacts"}

This is serialized as (ignore the redundant metric name):

["cache_misses_total", "cache_misses_total", ["method","feature_category"], ["GET", "build_artifacts"]]

In DirectFileStore, I believe labels are encoded as CGI-encoded strings:

cache_misses_total?method=GET&feature_category=build_artifacts

The second approach is 3x slower for decoding than the JSON approach in Ruby. I think we stuck with JSON in the C implementation because we had a fast JSON parser, and parsing CGI strings would be more complicated. I'd favor sticking with JSON for performance reasons.

What do you think?

stanhu · 2023-03-12T16:42:01Z

@slai11 pointed out that we may also have to store the aggregation mode for gauges (sum/max/min).

mttkay · 2023-03-13T08:58:42Z

@slai11 pointed out that we may also have to store the aggregation mode for gauges (sum/max/min).

That would be my preference too, at least from a convenience point of view. Currently, we have to parse the file name (which contains the aggregation mode) to understand which aggregation is required, and it also leads to asymmetric naming because the aggregation mode is meaningless for anything but gauges (a counter you usually want to sum up).

OTOH, it also means we should account for this to be optional in the binary format since we would not want to waste storage for this in a counter file, when we know counters are always summed up? Or do we want to support all aggregation modes on all metric types? Maybe there are use cases for this, but I believe this is currently unsupported in prometheus-client-mmap for anything except gauges so I thought it's worth pointing out.

dmagliola · 2023-03-13T09:06:22Z

Right now we support all aggregation modes in all metric types.
We default to the sensible thing (sum for counters), but I don't see a reason not to cater for whatever particular use case folks might have that'd have them choose others. Particularly, I'm thinking "all" could be useful to someone.

The one exception is we only support "most_recent" for gauges. I don't remember why but it's gotta be either in a code comment or a commit message

mttkay · 2023-03-13T11:28:26Z

Makes sense. Another reason not to make aggregation part of the binary format is that we probably wouldn't want to store it redundantly as each entry denotes a sample, not a metric, but we only need to store aggregation mode once per metric.

dmagliola · 2023-03-13T12:03:28Z

To clarify, I'm not saying not to make aggregation part of the binary format. I'm happy to do that if we find a good way. I'm just suggesting that we don't "force" sum for non-gauges

slai11 · 2023-03-13T12:06:34Z

Another reason not to make aggregation part of the binary format is that we probably wouldn't want to store it redundantly as each entry denotes a sample, not a metric, but we only need to store aggregation mode once per metric.

w.r.t Stan's proposal, I was thinking we could store both the metric type (0 for counter, 1 for gauge, and 2 for histograms would only take up 2 bits) and aggregation mode within the 32 bits allocated for metric type information?

dmagliola · 2023-03-13T12:26:42Z

I may be way off here, but I'd measure before going too hard on how many bits/bytes this takes... Compared to the space taken up by the labels, this is probably not going to make a difference to performance...

Which by the way reminds me of something... It was mentioned above in this thread we should use JSON instead of CGI encoding for the labels... I'm pretty sure I benchmarked that and CGI encoding was faster in Ruby. We should retest that... It's been a couple years, maybe JSON is much faster now... But I wouldn't switch to JSON unless benchmarks show it's not slower

SuperQ · 2023-03-13T15:08:04Z

I think this might make a good topic for the Prometheus Dev summit that's on 2023-03-23. This discussion may also interest client_python.

CC @csmarchbanks

stanhu · 2023-03-14T07:32:59Z

Out of curiosity, I see that client_python:

Uses one file per metric type: https://github.com/prometheus/client_python/blob/30f83196ac1f1a1a2626da9d724d83955aff79b1/prometheus_client/multiprocess.py#L60-L61
Stores the aggregation mode in the filename itself: https://github.com/prometheus/client_python/blob/30f83196ac1f1a1a2626da9d724d83955aff79b1/prometheus_client/multiprocess.py#L162
Serializes the labels as a JSON Hash and the docstring in the key itself: https://github.com/prometheus/client_python/blob/30f83196ac1f1a1a2626da9d724d83955aff79b1/prometheus_client/mmap_dict.py#L144-L145

Sinjo · 2023-03-23T16:27:45Z

Hey, first off: sorry for only chiming in now. I was finishing off and recording a conference talk when the bulk of this discussion happened, and I've only just got to it on my to-do list.

I'd like to address everything that's come up, but that's a lot, so I'm going to break my response into sections so it doesn't get too unwieldy.

Reducing number of open files

Right, let's start with the least contentious part of this.

I completely agree with us building a store that does file-per-process rather than file-per-metric-per-process. The original idea of minimising lock contention was a reasonable one to pursue, but has demonstrably not worked well in practice for many of our users.

Whatever else we do, I think we should definitely build a store that does this, and perhaps even make it the default to avoid people running into a sharp edge.

Changing the binary format

In the general case, I'm entirely in favour of doing this if it helps us meet a goal. The files produced and consumed by DirectFileStore are very explicitly an implementation detail, and should not be kept between runs of a server (see the point titled "Resetting your metrics on each run".

If we need to do something which implies making a change to this format, or if a change to this format means a significant performance win, then let's do it.

That said...

Switching to little endian

I'm fine with us switching to little endian, but I'm not sure it buys us anything beyond a little internal consistency. I would be in favour of this if the format weren't so internal and the files weren't so transient, but they are and I'd like to keep them that way.

Introducing a header for versioning

I'm pretty strongly opposed to this, and it was a very deliberate choice not to have something like this in the current implementation.

These files aren't meant for external consumption, and are meant to be thrown away once the server process is finished. I don't want to do anything that encourages people to treat them any other way, and I don't want the complexity of handling multiple versions of these files in the store code.

Changing the store interface

I'll echo @dmagliola here and say that I'm totally fine with us revisiting the performance vs interface cleanliness trade-off.

Having a "just get me all the metrics in one go so that I can serialise them" method feels pretty inevitable as a performance win.

Where does metric type information live?

I feel fairly neutral about this one.

There hasn't been a strong reason to externalise this information into the files themselves (either in the file name or as part of the storage format within the file), but I can believe it would be a useful optimisation to have this information alongside the metrics for that big "aggregate everything together" method.

I'm totally down to introduce this if it gets us a performance win. I don't mind too much whether we do that via making the storage file-per-process-per-metric-type or by encoding it in the metric data.

Rust extension

This has been on my list of side-projects since forever, and I didn't exactly get far before getting distracted by something else.

I'm extremely keen to have a really performant version of the data store we end up creating in pure Ruby. I do think we should have a pure Ruby version that makes all of the choices we've talked about (e.g. file-per-process), and then have an opt-in high-performance version that uses Rust to do the heavy lifting.

In `client_ruby` or separately?

If we get significant performance gains by doing this, I can be convinced to have it in the main repo, probably as a separate gem for JRuby's sake, but considered a first-class component when it comes to support. Equally, one advantage of the store interface is that stores can come from other gems, so we can experiment first and then make a decision.

The thing that will convince me it should be in client_ruby is if we can give people a notably better out-of-the-box experience.

JSON vs CGI string vs ...

I'm definitely happy to revisit the internal format if we can get a performance boost by doing so. This one's a no brainer!

Shared file format and external aggregation

This is the proposal I'm least keen on, at least in the short and medium term.

We'd be bringing in a bunch of complexity and committing to a format for what's currently an implementation detail, all for an approach to running web servers that doesn't yet have any adoption. I think the idea is cool, but that we'd be massively jumping the gun by racing to do this work in client_ruby.

The saving grace is that I don't think it matters whether we do that work up front. We can make all the other improvements we're talking about here (file-per-process, more efficient store interface, Rust extension), and then if we end up in a world where Ruby web servers have gone in that direction, we can slap a version header in the file and start caring about breaking changes to the format.

Supporting aggregation modes

I think we continue to get this for minimal effort if we're not trying to support them outside of the Ruby process that owns the storage files. It only becomes hard (and necessary to externalise into the storage) if we have other processes reading the files, which we don't for the time being.

If we do go down that route, then at the same time as slapping the version header into the file format, we'd also slap the aggregation mode in there.

Sidenote: versioning and branch maintenance

There's a lot of improvements we're talking about here, and we need to think about how we're going to release them.

Some of them are non-breaking, and it's fine for us to land those on main and release them in our current major version. Others aren't, and I think it's worth us planning how we ship them.

What I don't want to end up with is a situation where we've got a bunch of in-progress work on main and we can't ship a patch release (been there, it sucks, not doing it again). Depending on how much work needs to be bundled together in the breaking release, we should have a separate branch where that's happening, and rebase it periodically on main.

Phew, sorry for the wall of text. Hopefully what I've said makes sense and gives us a reasonable path forward.

SuperQ · 2023-03-24T12:55:02Z

I'm pretty strongly opposed to this, and it was a very deliberate choice not to have something like this in the current implementation.

+1 for the most part. The only reason I would want this is to support external reading by an external aggregator. For example, having a Go server that does this to avoid having Ruby do any work in doing the aggregations.

But if there's a rust HTTP /metrics worker/server, that would be good enough for us to pass-through the scrape request.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Improving DirectFileStore for performance and scalability #281

Improving DirectFileStore for performance and scalability #281

stanhu commented Mar 8, 2023 •

edited

xlgmokha commented Mar 8, 2023

dmagliola commented Mar 8, 2023

xlgmokha commented Mar 9, 2023

SuperQ commented Mar 9, 2023

dmagliola commented Mar 9, 2023

SuperQ commented Mar 9, 2023 •

edited

stanhu commented Mar 10, 2023 •

edited

stanhu commented Mar 12, 2023

mttkay commented Mar 13, 2023

dmagliola commented Mar 13, 2023

mttkay commented Mar 13, 2023

dmagliola commented Mar 13, 2023

slai11 commented Mar 13, 2023

dmagliola commented Mar 13, 2023

SuperQ commented Mar 13, 2023

stanhu commented Mar 14, 2023

Sinjo commented Mar 23, 2023

SuperQ commented Mar 24, 2023

Improving DirectFileStore for performance and scalability #281

Improving DirectFileStore for performance and scalability #281

Comments

stanhu commented Mar 8, 2023 • edited

xlgmokha commented Mar 8, 2023

dmagliola commented Mar 8, 2023

xlgmokha commented Mar 9, 2023

SuperQ commented Mar 9, 2023

dmagliola commented Mar 9, 2023

SuperQ commented Mar 9, 2023 • edited

stanhu commented Mar 10, 2023 • edited

stanhu commented Mar 12, 2023

mttkay commented Mar 13, 2023

dmagliola commented Mar 13, 2023

mttkay commented Mar 13, 2023

dmagliola commented Mar 13, 2023

slai11 commented Mar 13, 2023

dmagliola commented Mar 13, 2023

SuperQ commented Mar 13, 2023

stanhu commented Mar 14, 2023

Sinjo commented Mar 23, 2023

Reducing number of open files

Changing the binary format

Switching to little endian

Introducing a header for versioning

Changing the store interface

Where does metric type information live?

Rust extension

In client_ruby or separately?

JSON vs CGI string vs ...

Shared file format and external aggregation

Supporting aggregation modes

Sidenote: versioning and branch maintenance

SuperQ commented Mar 24, 2023

stanhu commented Mar 8, 2023 •

edited

SuperQ commented Mar 9, 2023 •

edited

stanhu commented Mar 10, 2023 •

edited

In `client_ruby` or separately?