
[feature] consolidate files into volumes to reduce number of files #427

Open
onlyjob opened this issue May 4, 2022 · 4 comments

Comments


onlyjob commented May 4, 2022

CryFS has some good ideas and an interesting design, but a terrible implementation...
I did some testing on CryFS 0.10.2 and rsync'ed 290,329 files from my home folder into CryFS:

This is how it looked in htop just before I interrupted rsync:

```
   PID USER      PRI  NI  VIRT   RES   SHR S CPU% MEM%▽  TIME+  Command
153410 user       21   1 12.7G 11.1G  4708 S  0.7 17.4 23h40:33 cryfs . /tmp/qqq.cryfs
```

Note the massive memory use. But it gets worse.
CryFS created 4096 directories in its base folder (named 000 through FFF), with around 21,500 files per directory: 4096 × 21,500 = 88,064,000.

So it created roughly 88 million files(!), about 300 times the number of files in the source.

This is incredibly inefficient. No underlying file system can handle that many files without severe performance degradation. It took a week(!) to remove those 88 million files...

Instead of multiplying the number of files by a factor of ~300, CryFS should reduce the number of files by packing blocks (which are currently stored as individual files) into 64 MiB volume files.

SeaweedFS implements this concept very well and can handle millions of files very efficiently.
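
A minimal sketch of the packing idea (purely illustrative, not CryFS or SeaweedFS code; `VolumeStore`, the 64 MiB limit, and the on-disk layout are all assumptions):

```python
# Illustrative sketch of packing many small blocks into large "volume"
# files, SeaweedFS-style, instead of one file per block. Not real CryFS code.
import os

VOLUME_LIMIT = 64 * 1024 * 1024  # 64 MiB per volume file, as proposed above

class VolumeStore:
    def __init__(self, directory):
        self.directory = directory
        self.index = {}      # block_id -> (volume_no, offset, length); in-memory here
        self.volume_no = 0   # volume currently being appended to
        self.offset = 0      # write position within the current volume

    def _volume_path(self, no):
        return os.path.join(self.directory, "volume_%06d.dat" % no)

    def put(self, block_id, data):
        # Roll over to a fresh volume once the current one would overflow.
        if self.offset + len(data) > VOLUME_LIMIT:
            self.volume_no += 1
            self.offset = 0
        with open(self._volume_path(self.volume_no), "ab") as f:
            f.write(data)
        self.index[block_id] = (self.volume_no, self.offset, len(data))
        self.offset += len(data)

    def get(self, block_id):
        volume_no, offset, length = self.index[block_id]
        with open(self._volume_path(volume_no), "rb") as f:
            f.seek(offset)
            return f.read(length)
```

Millions of blocks would then map onto only total_size / 64 MiB volume files, at the cost of maintaining an index and some form of compaction when blocks are deleted.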

smessmer (Member) commented May 4, 2022 via email


onlyjob commented May 5, 2022

Thanks, but the problem is the number of files that CryFS produces, not their block size.

Of course SeaweedFS targets a different use case. But it nicely implements the very concept I'm talking about: consolidating small files into volumes. IMHO that idea is worth borrowing, or at least considering for a somewhat similar implementation.

@codebling

> Thanks, but the problem is the number of files that CryFS produces, not their block size.

@onlyjob the number of files produced depends on the block size. Increasing the block size will reduce the number of files produced, though you are right that there is always at least one block per file.

> It does have a significant downside though - larger blocks slow down synchronization and file system latency.

Can you clarify here? There are two major factors (ignoring IOPS for now) contributing to "speed": throughput and latency.

My understanding of how these affect CryFS is:

  • Optimal performance will always be achieved when the file being written is exactly the same size as the block size (minus the block header size)
  • The larger the file relative to the block, the more latency impacts the speed, eventually becoming the only limiting factor
  • The smaller the file relative to the block, the more throughput impacts the speed, eventually becoming the only limiting factor

So when you say larger blocks slow down synchronization, I think you are talking about the third case. For files larger or equal to the block size, it would speed up synchronization and latency. Is this correct or am I completely wrong?
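
For what it's worth, here is a back-of-the-envelope model of those two regimes (the latency and throughput numbers are made up for illustration; real CryFS behavior also depends on caching and the backend):

```python
# Toy model: each block transfer costs a fixed round-trip latency plus
# block_size / throughput. All numbers below are assumptions.
import math

LATENCY_S = 0.05           # 50 ms per block round trip (assumption)
THROUGHPUT_BPS = 10e6      # 10 MB/s link (assumption)

def sync_time(file_size, block_size):
    blocks = math.ceil(file_size / block_size)
    return blocks * (LATENCY_S + block_size / THROUGHPUT_BPS)

for block_size in (32 * 1024, 1024 * 1024, 64 * 1024 * 1024):
    t = sync_time(1_000_000_000, block_size)   # 1 GB file
    print(f"{block_size:>10} B blocks: {t:8.1f} s")
```

In this toy model a 1 GB file syncs faster with larger blocks (fewer round trips, so less latency paid), while a small file always pays for one full large block, which matches the two cases above.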

Block size also affects storage:

  • The larger the block size, the more storage is wasted (a small file still occupies at least one whole block)

Unfortunately, files come in all sizes. For my use case, I don't think there is a block size that will be fast enough for dealing with large files while also not inflating small file sizes so much that my backup becomes unaffordably expensive.
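
To make that trade-off concrete, here is a rough inflation estimate under an assumed file-size mix (the sizes and counts are invented; each file is assumed to occupy whole blocks):

```python
# Rough storage-inflation estimate: each file occupies
# ceil(file_size / block_size) full blocks on disk. File mix is assumed.
import math

def stored(file_size, block_size):
    return math.ceil(file_size / block_size) * block_size

files = [4 * 1024] * 90_000 + [100 * 1024 * 1024] * 100  # many small, few large
logical = sum(files)
for block_size in (16 * 1024, 4 * 1024 * 1024, 64 * 1024 * 1024):
    physical = sum(stored(f, block_size) for f in files)
    print(f"{block_size:>10} B blocks: {physical / logical:6.2f}x inflation")
```

With 64 MiB blocks every 4 KiB file still costs 64 MiB, so the small files alone blow the store up by orders of magnitude; with 16 KiB blocks inflation is modest but the file count explodes.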

I think @onlyjob's idea could be implemented fairly easily using a filesystem on top of CryFS. It doesn't seem like a perfect solution, though.

Do you think there is a way to have different buckets of block sizes

  1. without impacting confidentiality
  2. as a layer on top of CryFS, possibly with multiple backing CryFS instances using different block sizes (or would this be a change better made in CryFS itself, or by using it as a library)? A rough sketch of the layered approach follows.
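
A very rough sketch of the layering idea from point 2 (mount points, thresholds, and block sizes are all hypothetical; this is not an existing CryFS feature):

```python
# Hypothetical size-bucket router on top of several backing CryFS mounts,
# each created with a different block size. Paths and thresholds are made up.
import os
import shutil

BUCKETS = [
    (1 * 1024 * 1024,   "/mnt/cryfs-small"),   # e.g. mounted with small blocks
    (100 * 1024 * 1024, "/mnt/cryfs-medium"),  # e.g. mounted with medium blocks
    (float("inf"),      "/mnt/cryfs-large"),   # e.g. mounted with 64 MiB blocks
]

def store(path):
    """Copy a file into the first bucket whose size threshold fits it."""
    size = os.path.getsize(path)
    for threshold, mount in BUCKETS:
        if size <= threshold:
            dest = os.path.join(mount, os.path.basename(path))
            shutil.copy2(path, dest)
            return dest
```

One open question with such a layer is exactly point 1: splitting files across differently-sized stores may leak information about file sizes that a single uniform block size would hide.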


antymat commented Mar 17, 2024

> Unfortunately, files come in all sizes. For my use case, I don't think there is a block size that will be fast enough for dealing with large files while also not inflating small file sizes so much that my backup becomes unaffordably expensive.

The current scheme means that I would either pay for a huge inflation of the FS by making the block 64 MiB (the same as the default Storj block size), or pay for a huge inflation of the number of files for my 8 TiB repo. There seems to be very little to choose from; perhaps the only solution is the one that

> could be implemented fairly easily using a filesystem on top of CryFS.

Sounds like a big PITA, though.
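
For scale, the file-count side of that trade-off for an 8 TiB repo (the block sizes are chosen purely as examples):

```python
# File-count arithmetic for an 8 TiB store, assuming one file per block
# as CryFS stores things today. Block sizes are example values.
TIB = 2**40
REPO = 8 * TIB
for label, block_size in (("32 KiB", 32 * 1024), ("64 MiB", 64 * 2**20)):
    print(f"{label}: {REPO // block_size:,} block files")
# 32 KiB -> 268,435,456 files; 64 MiB -> 131,072 files
```

So the 64 MiB setting fixes the file-count explosion but, as noted above, inflates storage for small files; packing blocks into volumes would decouple the two.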
