Reduce local metadata storage footprint #109

Closed
tasket opened this issue Aug 17, 2022 · 2 comments
Labels: enhancement, help wanted, research

@tasket
Owner

tasket commented Aug 17, 2022

Problem

The existing metadata caching strategy uses more space in /var than it should: it leaves both compressed and uncompressed copies of the same session metadata in /var/lib/wyng.

Possible mitigations

  • Use an aging algorithm to remove manifest and/or manifest.z files from the /var cache when Wyng exits. They can be automatically retrieved and/or decoded again as needed. The user could control the aging with a --meta-reduce option.

  • Find alternatives to the Unix sort --merge tool (see Explanation below).
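
As a rough illustration of the aging idea in the first bullet, here is a minimal sketch. It is not Wyng's implementation: the function name prune_decoded_manifests, the max_age_days parameter, and the rule "only delete a decoded manifest when a manifest.z sibling is present" are all assumptions made for the example.

```python
import os
import time


def prune_decoded_manifests(cache_dir, max_age_days):
    """Hypothetical aging pass: delete uncompressed 'manifest' files older
    than max_age_days, but only where a 'manifest.z' sibling exists so the
    data can still be decoded locally later."""
    cutoff = time.time() - max_age_days * 86400
    removed = []
    for root, _dirs, files in os.walk(cache_dir):
        if "manifest" in files and "manifest.z" in files:
            path = os.path.join(root, "manifest")
            # Age is judged by modification time in this sketch.
            if os.path.getmtime(path) <= cutoff:
                os.remove(path)
                removed.append(path)
    return removed
```

Called at exit with the cache directory and an age limit, this would leave only the compressed copies of older sessions in place; a limit of 0 removes every decoded manifest that has a compressed sibling.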

Explanation

Currently Wyng stores an unencoded copy of each session manifest because Unix sort --merge is used to merge them (a merge is required to build a complete picture of the volume for any referenced session except the oldest one). The sort tool is very fast (a requirement here), but it cannot read compressed files directly without expensive shell tricks, so the manifests take up extra space in uncompressed form on local disk.

Ideally, there should be a merging tool that could use the encoded/compressed manifests directly, decoding them on-the-fly as needed.

Using heapq and itertools

A first experiment replaced sort with this test code, executed in the main section of Wyng:

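import heapq, itertools
# One manifest path per session; vol and outf come from Wyng's main section.
mfnames = (vol.sessions[s].path + "/manifest" for s in vol.sesnames)
# Each manifest becomes a lazy stream of whitespace-split lines.
maplist = [map(str.split, open(fn)) for fn in mfnames]
# Merge the pre-sorted streams on field 1, then keep the last entry of each group.
for _, grp in itertools.groupby(heapq.merge(*maplist, key=lambda x: x[1]), key=lambda y: y[1]):
    print(" ".join(tuple(grp)[-1]), file=outf)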

With 133 sessions in the volume, this test took 1.93 seconds to run on average, about 4x as long as the existing merge_manifests() routine. The manifest sources were unencoded, so this result doesn't include the decoding overhead that would eventually be added. So much for that.
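
For reference, the decoding step that would add that overhead could look roughly like the sketch below. It assumes each manifest.z is a single plain zlib stream of newline-terminated text lines already sorted on the second field; that is an assumption for illustration, and iter_lines_z and merge_compressed are made-up names, not Wyng functions.

```python
import heapq
import itertools
import zlib


def iter_lines_z(path, bufsize=65536):
    """Yield text lines from a zlib-compressed file, decoding in chunks
    so no uncompressed copy is ever written to disk."""
    decomp = zlib.decompressobj()
    tail = b""
    with open(path, "rb") as f:
        while True:
            raw = f.read(bufsize)
            if not raw:
                break
            lines = (tail + decomp.decompress(raw)).split(b"\n")
            tail = lines.pop()  # last piece may be an incomplete line
            for ln in lines:
                if ln:
                    yield ln.decode("utf-8")
    # Emit whatever remains after the stream is flushed.
    for ln in (tail + decomp.flush()).split(b"\n"):
        if ln:
            yield ln.decode("utf-8")


def merge_compressed(paths, outf):
    """Merge manifests on field 1; for duplicate keys, keep the entry
    from the path listed last (heapq.merge is stable across inputs)."""
    streams = [map(str.split, iter_lines_z(p)) for p in paths]
    merged = heapq.merge(*streams, key=lambda x: x[1])
    for _, grp in itertools.groupby(merged, key=lambda x: x[1]):
        print(" ".join(list(grp)[-1]), file=outf)
```

Only one read buffer per manifest is held in memory at a time. This sketch has not been benchmarked against sort --merge, so it says nothing about whether the speed gap above can be closed.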

I don't know if the approach I used above could be tweaked or if there are better approaches to handle this in Python.

I'm open to suggestions!


I've already reduced manifest disk usage by doubling the default archive chunk size, which halves the number of manifest entries. The compression fs attribute has also been enabled, which offers some reduction when /var is on a filesystem that supports it, such as Btrfs.

Update: The aging feature has also been implemented. Using --meta-reduce=on:0 should lower the /var footprint by about 2/3.

@tasket tasket added the help wanted and research labels Aug 17, 2022
@tasket
Owner Author

tasket commented Aug 27, 2022

Note the compression fs attribute has been added to the v0.4 alpha.

@tasket tasket changed the title from "Merge manifests directly from encoded/compressed files" to "Reduce local metadata storage footprint" May 6, 2024
tasket added a commit that referenced this issue May 6, 2024
Handle lack of cryptodome lib gracefully

Improve add_volume() lvol handling

receive: add process timeout check

Fix arch-deduplicate parameter check

New consistency checks for encryption, chunk size
@tasket tasket added this to the far future milestone May 7, 2024
@tasket tasket added the enhancement label May 7, 2024
@tasket tasket modified the milestones: far future, v0.8 May 22, 2024
@tasket
Owner Author

tasket commented May 22, 2024

Looking at the effectiveness of --meta-reduce, I'd consider the basic goal of this issue to be met. Although /var usage still balloons somewhat during runtime, much of that will eventually be moved to /var/cache and so doesn't present much of an issue. Outside runtime, disk usage is now much more compact, and other disused archive dirs no longer retain their session data for long periods. Finally, Wyng has changed so that metadata is fetched from archives as needed, regardless of whether Wyng or the user culls the /var metadata.

@tasket tasket closed this as completed May 22, 2024