pkg/blobserver/fsbacked: add blobserver using existing local files as storage [WIP] #1282

Open · wants to merge 12 commits into master
Conversation

@bobg bobg commented Nov 27, 2019

This PR is a preliminary sketch for a new blobserver type that uses files uploaded to it as their own storage.

When you add a file to an fsbacked.Storage that's within the directory tree it controls, an entry is added to a database that maps between files and blobrefs; but the file's contents are not copied anywhere. When fetching the file's content blob later, the database directs the Storage to the right local file and the data is served from there.

Adding files outside the directory tree, or adding any other kind of blob, fails over to another blobserver nested inside the fsbacked.Storage.
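
A minimal sketch of that receive path, assuming hypothetical names (Storage's fields, the simplified BlobReceiver interface, and the string-keyed meta map are placeholders, not the PR's actual identifiers; the real blobserver interface takes a blob.Ref and returns a blob.SizedRef):

```go
package fsbacked

import (
	"io"
	"os"
	"path/filepath"
	"strings"
)

// BlobReceiver is a simplified stand-in for Perkeep's blobserver.BlobReceiver.
type BlobReceiver interface {
	ReceiveBlob(ref string, source io.Reader) error
}

// Storage maps blobrefs to local files under root and delegates
// everything else to a nested blob store.
type Storage struct {
	root   string            // directory tree whose files serve as storage
	meta   map[string]string // hypothetical db: blobref -> file path
	nested BlobReceiver      // fallback for blobs not backed by local files
}

// ReceiveBlob records a file under root in the db instead of copying
// its bytes; anything else falls through to the nested store.
func (s *Storage) ReceiveBlob(ref string, source io.Reader) error {
	if f, ok := source.(*os.File); ok {
		abs, err := filepath.Abs(f.Name())
		if err == nil && strings.HasPrefix(abs, s.root+string(filepath.Separator)) {
			s.meta[ref] = abs // map the blobref to the existing file in place
			return nil
		}
	}
	return s.nested.ReceiveBlob(ref, source)
}
```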

This solves the problem of wanting to add a tree of large files (e.g., videos of my kids growing up) to a local Perkeep instance without storing all the data twice. This should be used only on directory trees whose files do not change, lest the blobrefs in the database become mismatched to their corresponding files.

A number of other changes throughout Perkeep would be needed to make this truly useful. The io.Reader presented to a blobserver's ReceiveBlob method is usually (always?) some wrapper object (like checkHashReader) that conceals the underlying *os.File, without which fsbacked.Storage cannot detect that a file within its tree is being uploaded. And in any case, Perkeep imposes rather a low limit on blob sizes for this purpose.
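
For concreteness, here is one hypothetical way the wrapper problem could be addressed (nothing like this exists in Perkeep today): wrappers such as checkHashReader could implement an unwrapping interface, and fsbacked could walk the chain looking for the *os.File:

```go
package fsbacked

import (
	"io"
	"os"
)

// sourceUnwrapper is a hypothetical interface a wrapping reader could
// implement to reveal the reader it wraps.
type sourceUnwrapper interface {
	UnwrapReader() io.Reader
}

// underlyingFile walks a chain of wrappers looking for an *os.File.
func underlyingFile(r io.Reader) (*os.File, bool) {
	for {
		if f, ok := r.(*os.File); ok {
			return f, true
		}
		u, ok := r.(sourceUnwrapper)
		if !ok {
			return nil, false
		}
		r = u.UnwrapReader()
	}
}
```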

Presented for further discussion.

@googlebot added the cla: yes (Author has submitted the Google CLA) label Nov 27, 2019
@zenhack zenhack commented Nov 27, 2019

Re: the blob size limit, you could solve that by keeping track of offset & size info in the db as well, so that multiple blobs could be contained in the same file. Then the file gets split up as usual when added, but the store just points the different parts at different chunks of that same physical file.
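
A sketch of that idea, with hypothetical names: each db record pins a blobref to a byte range of one physical file, so many blobs can share a file, and fetching serves only that range via the standard io.NewSectionReader:

```go
package fsbacked

import (
	"io"
	"os"
)

// fileRegion is a hypothetical db record: one blob's bytes within a file.
type fileRegion struct {
	path   string // local file containing the blob
	offset int64  // where the blob's bytes begin
	size   int64  // how many bytes the blob spans
}

// fetchRegion opens the file and returns a reader confined to the
// blob's byte range, plus a closer for the underlying file.
func fetchRegion(rgn fileRegion) (io.Reader, io.Closer, error) {
	f, err := os.Open(rgn.path)
	if err != nil {
		return nil, nil, err
	}
	return io.NewSectionReader(f, rgn.offset, rgn.size), f, nil
}
```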

As far as the problem of including the data in perkeep without storing it twice -- what I do here is to just dump the data into perkeep and then access it via the fuse filesystem.

Something like this would have been nice to have when I was doing a mass import of my backups months ago; not having to actually copy 2TiB of data might have sped things up a bit. I probably still would have had to write custom tooling though...

@bobg bobg commented Nov 29, 2019

> you could solve that by keeping track of offset & size info in the db as well, so that multiple blobs could be contained in the same file

Thanks for that suggestion! It's implemented in 4d9ca00. That commit also adds a new type, FileSectionReader, which, if used by schema.WriteFileChunks et al., would supply most of what this needs to be fully useful. That's a project for a future commit on this branch.
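
One plausible shape for such a type (the actual definition is in 4d9ca00; this sketch is an assumption, not the commit's code): an io.Reader over a slice of a file that also reveals which file and byte range it covers, so that chunking code like schema.WriteFileChunks could record the region instead of copying the bytes:

```go
package fsbacked

import (
	"io"
	"os"
)

// FileSectionReader reads one byte range of an underlying file while
// keeping the file and range inspectable.
type FileSectionReader struct {
	*io.SectionReader
	file         *os.File
	offset, size int64
}

// NewFileSectionReader confines reads of f to size bytes starting at offset.
func NewFileSectionReader(f *os.File, offset, size int64) *FileSectionReader {
	return &FileSectionReader{
		SectionReader: io.NewSectionReader(f, offset, size),
		file:          f,
		offset:        offset,
		size:          size,
	}
}

// Section reports the file and byte range this reader covers, letting a
// caller record (path, offset, size) instead of copying the data.
func (r *FileSectionReader) Section() (file *os.File, offset, size int64) {
	return r.file, r.offset, r.size
}
```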

@bobg bobg commented Dec 1, 2019

Just discovered there's an existing feature request for this: #1226.
