
DirectFileStore creates too many files for long running applications #143

Open
maximkosov opened this issue Jul 10, 2019 · 7 comments

@maximkosov
We use Passenger to run our Rails application. Worker processes are recreated after every 100 requests. We recently tried client_ruby with DirectFileStore. Each worker process has its own pid, so depending on server load there will be tens of thousands of files in the Prometheus work dir after a couple of hours or days without an app restart.

With 50k files in the Prometheus work dir, the /metrics route becomes very slow, with processing times of about 10 seconds, which can lead to Prometheus scraper timeouts.

Is there any workaround for long-running processes with DirectFileStore? One workaround I can think of is to just restart the app every few hours. Instead of restarting the whole application, we could wipe the Prometheus work directory every few hours, but that looks a little hacky to me.
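
For reference, the setup looks roughly like the sketch below. The store directory path is made up for illustration, and the `*.bin` glob in the wipe workaround is an assumption about the store's file extension; adjust it to whatever actually appears in the directory. Note the wipe resets all counters, which scrapers will see as a counter reset.

```ruby
# config/initializers/prometheus.rb
require 'prometheus/client'
require 'prometheus/client/data_stores/direct_file_store'

# Every worker process writes pid-suffixed files under this directory,
# so short-lived Passenger workers keep leaving new files behind.
Prometheus::Client.config.data_store =
  Prometheus::Client::DataStores::DirectFileStore.new(dir: '/tmp/prometheus')

# Hacky workaround (e.g. run from cron every few hours): wipe the store.
# NOTE: this resets all counters.
Dir.glob('/tmp/prometheus/*.bin').each { |f| File.delete(f) }
```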

@dmagliola
Collaborator

dmagliola commented Jul 10, 2019

This is related to issue #109, namely: how do you know whether a file is being written to by a process that's still alive? It's not exactly the same problem, since in this case we can't simply get rid of files from old processes (or else counters would go down), but I have a feeling both problems will be solved in a related way.
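
As a sketch of the liveness question (not something the library does today): if the writer's pid is embedded in the filename, you can probe it with signal 0. The pid-extraction regex below is an assumption about the file naming scheme, purely for illustration.

```ruby
# Sketch: guess whether the process that owns a store file is still alive.
# Assumes the pid is the numeric token right before the extension, which
# is an assumption about the naming scheme, not a documented contract.
def writer_alive?(path)
  pid = path[/(\d+)\.bin\z/, 1]&.to_i
  return false unless pid

  Process.kill(0, pid) # signal 0 checks existence without sending anything
  true
rescue Errno::ESRCH    # no such process
  false
rescue Errno::EPERM    # process exists but belongs to another user
  true
end
```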

@lawrencejones
Contributor

I would think an appropriate fix for this would be to have each process hold a shared lock on the file it manages, for as long as it needs the file.

Then, every time a file store tries to create a new file, it examines all the existing files in the directory that it can exclusively lock, compacts them into a new file, and deletes the originals.

Provided you think carefully about concurrency control, this would be a decent mechanism for solving it. But at that point I reckon you should probably be using an mmap system.
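
A minimal sketch of that flock-based scheme, assuming a flat directory of `*.bin` store files; `merge_into` is a hypothetical stand-in for the actual value-merging logic:

```ruby
STORE_DIR = '/tmp/prometheus' # assumed location of the store files

# Each live process keeps a shared lock on its own file for its lifetime.
# The returned handle must stay open: closing it releases the lock.
def claim_own_file(path)
  file = File.open(path, File::RDWR | File::CREAT)
  file.flock(File::LOCK_SH)
  file
end

# When creating a new file, try to exclusively lock every existing file.
# Any file we can lock has no live writer holding the shared lock, so we
# can fold its values into a compacted file and delete it.
def compact_dead_files(compacted_target)
  Dir.glob(File.join(STORE_DIR, '*.bin')).each do |path|
    File.open(path, File::RDWR) do |f|
      # LOCK_NB makes flock return false instead of blocking.
      next unless f.flock(File::LOCK_EX | File::LOCK_NB)

      merge_into(compacted_target, f) # hypothetical: sum values into target
      File.delete(path)
    end
  end
end
```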

@brian-brazil

Is there a locking system that will work across all OSes?

@lawrencejones
Contributor

“Is there a locking system that will work across all OSes?”

That would certainly be a consideration, and you'd probably end up with compaction as an optional configuration on the file store. But honestly, I'd rather see a more performant mmap approach before doing this. I just wanted to sketch out a possible implementation.

@brian-brazil

We haven't solved this, and we're using mmap over in Python. The reads were actually changed recently to not use mmap, for performance reasons.

@dmagliola
Collaborator

Another potential option would be to have one file per process, rather than one file per process/metric.

This doesn't fundamentally solve the issue, but it makes it orders of magnitude less problematic. The downside is that every metric increment now effectively takes a mutex, so it's not the best performer in multi-threaded scenarios.

But if, like most Ruby apps, each process runs single-threaded, or if it isn't incrementing counters very often, the performance penalty will probably be negligible. Each store save should be on the order of single-digit microseconds, after all.

This would need a separate store (I'm not proposing modifying the existing file store to do this), but that's the point of having swappable store backends, and it would be significantly easier to write than the compaction we're talking about here.
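
A rough sketch of the shape such a store might take; the class name, JSON file layout, and whole-file rewrite are illustrative assumptions, not the proposed design:

```ruby
require 'json'

# Sketch: one file per process holding all metrics, with every write
# serialized through a single Mutex.
class SingleFilePerProcessStore
  def initialize(dir:)
    @path   = File.join(dir, "metrics_#{Process.pid}.json")
    @values = Hash.new(0.0)
    @mutex  = Mutex.new
  end

  def increment(metric_key, by: 1.0)
    @mutex.synchronize do
      @values[metric_key] += by
      # Rewriting the whole file per increment is fine for a sketch; a real
      # store would write values at fixed offsets instead.
      File.write(@path, JSON.generate(@values))
    end
  end
end
```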

@Sinjo
Member

Sinjo commented Aug 19, 2019

I'm taking this out of the v0.10.0 milestone. I think we should come up with an answer to this problem, but I also think there's a lot of value in getting 0.10.0 out of alpha and into more people's hands in its current state.

I'm going to try to get through the documentation issues so we can get to a release.

After that, we can go one of a couple of ways:

  • If we find we need another round of breaking changes before 1.0, we can have a 0.11.0 release with alphas as needed.
  • If not, we'll basically promote what we've got here to 1.0.

dmagliola pushed a commit that referenced this issue Oct 14, 2019
This was a quick experiment in having all metrics for each process in the same file, because it seemed like it would be easy (just the changes in this commit).

However, even though tests pass, this doesn't actually work. See the next commit for why. If the change were just this, I'd say we should do it.

However, as you'll see in the next commit, it's more involved than that, and I'm not sure it's worth doing, at least not with this approach...

I'm just pushing this up so it doesn't get lost.