Reduce local metadata storage footprint #109

tasket · 2022-08-17T21:04:57Z

Problem

The existing metadata caching strategy uses more space in /var than it should: it leaves both compressed and uncompressed copies of the same session metadata in /var/lib/wyng.

Possible mitigations

Use an aging algorithm to remove manifest and/or manifest.z files from the /var cache on Wyng exit. They can be automatically retrieved and/or decoded as needed. The user may have control over the aging with a --meta-reduce option.
Find an alternative to the Unix sort --merge tool (see Explanation below).
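As a rough illustration of the aging idea, here is a minimal sketch. It assumes each session directory under the cache holds a `manifest` file and possibly a `manifest.z` file, and it only deletes the uncompressed copy when the compressed one is present. The function name, the age parameter, and the use of file modification time are hypothetical choices for this example, not Wyng's actual --meta-reduce implementation.

```python
import os
import time

def prune_stale_manifests(cache_dir, max_age_days, now=None):
    """Remove uncompressed 'manifest' files at least max_age_days old.

    A file is removed only if a 'manifest.z' sits beside it, so the
    data can still be decoded again later. Returns the removed paths.
    (Hypothetical helper; names and layout are assumptions.)
    """
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400
    removed = []
    for dirpath, _dirs, files in os.walk(cache_dir):
        if "manifest" in files and "manifest.z" in files:
            path = os.path.join(dirpath, "manifest")
            if os.path.getmtime(path) <= cutoff:
                os.remove(path)
                removed.append(path)
    return removed
```

Calling it with `max_age_days=0` would drop every uncompressed manifest that has a compressed sibling, which is the most aggressive setting in this sketch.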

Explanation

Currently Wyng stores an unencoded copy of each session manifest because Unix sort --merge is used to merge them (a merge is required to build a complete picture of the volume for any referenced session except the oldest). The sort tool is very fast (a requirement here), but it cannot read compressed files directly without expensive shell tricks, so the manifests take up extra space in uncompressed form on local disk.

Ideally, there should be a merging tool that could use the encoded/compressed manifests directly, decoding them on-the-fly as needed.
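To make the idea concrete, here is a minimal sketch of such a merge in Python. It assumes each compressed manifest is a plain zlib stream of text lines, already sorted on the second whitespace-separated field (the same key the heapq experiment in this issue uses), and that paths are passed oldest session first. The function names and the file format are assumptions for illustration; Wyng's real encoding may differ. It also relies on `heapq.merge` emitting equal keys in input order, so the last record seen for a key comes from the newest session.

```python
import heapq
import zlib
from operator import itemgetter

def iter_zlib_lines(path, bufsize=65536):
    """Yield non-empty text lines from a zlib-compressed file,
    decompressing one buffer at a time instead of the whole file."""
    decomp = zlib.decompressobj()
    pending = b""
    with open(path, "rb") as f:
        while True:
            raw = f.read(bufsize)
            if not raw:
                break
            pending += decomp.decompress(raw)
            # Keep the trailing partial line for the next round.
            *lines, pending = pending.split(b"\n")
            for ln in lines:
                if ln:
                    yield ln.decode()
    pending += decomp.flush()
    for ln in pending.split(b"\n"):
        if ln:
            yield ln.decode()

def merge_manifests_z(paths):
    """Merge sorted manifests (oldest first) and yield, for each key,
    the split record from the newest session that contains it."""
    streams = [map(str.split, iter_zlib_lines(p)) for p in paths]
    prev = None
    for rec in heapq.merge(*streams, key=itemgetter(1)):
        if prev is not None and rec[1] != prev[1]:
            yield prev
        prev = rec
    if prev is not None:
        yield prev
```

This keeps only one decompression buffer per session in memory, but it says nothing about speed; it would need to be timed against the existing merge_manifests() routine.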

Using heapq and itertools

My first experiment replaced sort with this test code, executed in the main section of Wyng:

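import heapq, itertools
# Path of each session's uncompressed manifest, in vol.sesnames order.
mfnames = (vol.sessions[s].path + "/manifest" for s in vol.sesnames)
# Split every manifest line into fields; field 1 is the merge key.
maplist = [map(str.split, open(x)) for x in mfnames]
# Merge the sorted manifests by key, then keep only the last entry of each key group.
for _key, grp in itertools.groupby(heapq.merge(*maplist, key=lambda x: x[1]), key=lambda y: y[1]):
    print(" ".join(tuple(grp)[-1]), file=outf)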

With 133 sessions in the volume, this test took 1.93 seconds to run on average, about 4x as long as the existing merge_manifests() routine. The manifest sources were unencoded, so this result doesn't include the decoding overhead that would eventually be added. So much for that.

I don't know if the approach I used above could be tweaked or if there are better approaches to handle this in Python.
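One alternative that might be worth timing is a dict-based merge: read the sessions oldest first, let later entries overwrite earlier ones for the same key, and sort once at the end. The sketch below assumes the same line layout as the test code (key in the second whitespace-separated field). The function name is hypothetical, and this has not been measured against Wyng; it trades the streaming merge for holding one entry per key in memory.

```python
def merge_by_dict(line_iters):
    """Merge manifests given as iterables of text lines, oldest first.

    For each key (second field) the line from the newest session wins.
    Returns the surviving lines sorted by key, without newlines.
    (Hypothetical helper for illustration only.)
    """
    latest = {}
    for lines in line_iters:
        for ln in lines:
            fields = ln.split()
            if fields:
                latest[fields[1]] = ln.rstrip("\n")
    return [latest[k] for k in sorted(latest)]
```

For example, `merge_by_dict(open(p) for p in paths)` would accept the same uncompressed manifest files as the heapq test, provided `paths` is ordered oldest to newest.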

I'm open to suggestions!

I've already reduced manifest disk usage by doubling the default archive chunksize, which halves the number of manifest entries. The compression fs attribute has also been enabled, which offers some reduction when /var is on a filesystem such as Btrfs that supports it.

Update: The aging feature has also been implemented. Using --meta-reduce=on:0 should lower the /var footprint by about 2/3.


tasket · 2022-08-27T23:42:59Z

Note the compression fs attribute has been added to the v0.4 alpha.

Handle lack of cryptodome lib gracefully
Improve add_volume() lvol handling
receive: add process timeout check
Fix arch-deduplicate parameter check
New consistency checks for encryption, chunk size

tasket · 2024-05-22T17:20:29Z

Looking at the effectiveness of --meta-reduce, I'd consider the basic goal of this issue to be met. Although /var usage still balloons somewhat during runtime, much of that will eventually be moved to /var/cache and so doesn't present much of an issue. Outside runtime, disk usage is now much more compact, and other disused archive dirs no longer retain their session data for long periods. Finally, Wyng has changed such that metadata is fetched from archives as needed, regardless of whether Wyng or the user directly culls /var metadata.

tasket added the help wanted (Extra attention is needed) and research labels Aug 17, 2022

tasket changed the title ~~Merge manifests directly from encoded/compressed files~~ Reduce local metadata storage footprint May 6, 2024

tasket added this to the far future milestone May 7, 2024

tasket added the enhancement (New feature or request) label May 7, 2024

tasket modified the milestones: far future, v0.8 May 22, 2024

tasket closed this as completed May 22, 2024
