
[feature] consolidate files into volumes to reduce number of files #427

Open
onlyjob opened this issue May 4, 2022 · 4 comments

Comments


onlyjob commented May 4, 2022

CryFS has some good ideas and an interesting design, but a terrible implementation...
I did some testing on CryFS 0.10.2 and rsync'ed 290,329 files from my home folder into CryFS:

This is how it looked in htop just before I interrupted rsync:

```
   PID USER      PRI  NI  VIRT   RES   SHR S CPU% MEM%▽  TIME+  Command
153410 user       21   1 12.7G 11.1G  4708 S  0.7 17.4 23h40:33 cryfs . /tmp/qqq.cryfs
```

Note the massive memory use. But it gets worse.
CryFS created 4096 directories in its base folder (named 000 through FFF), with around 21,500 files per directory: 4096 × 21,500 = 88,064,000.

So it created roughly 88 million files(!), about 300 times the number of files in the source.

This is incredibly inefficient. No underlying file system can handle that many files without severe performance degradation. It took a week(!) to remove those 88 million files...

Instead of multiplying the number of files by a factor of ~300, CryFS should reduce the number of files by packing blocks (which are currently stored as individual files) into 64 MiB volume files.

SeaweedFS implements this concept very well and can handle millions of files very efficiently.
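
A minimal sketch of the packing idea (purely illustrative, not CryFS or SeaweedFS code; `VolumeStore`, the 64 MiB limit, and the on-disk layout are all assumptions):

```python
# Illustrative sketch of packing many small blocks into large "volume"
# files, SeaweedFS-style, instead of one file per block. Not real CryFS code.
import os

VOLUME_LIMIT = 64 * 1024 * 1024  # 64 MiB per volume file, as proposed above

class VolumeStore:
    def __init__(self, directory):
        self.directory = directory
        self.index = {}      # block_id -> (volume_no, offset, length); in-memory here
        self.volume_no = 0   # volume currently being appended to
        self.offset = 0      # write position within the current volume

    def _volume_path(self, no):
        return os.path.join(self.directory, "volume_%06d.dat" % no)

    def put(self, block_id, data):
        # Roll over to a fresh volume once the current one would overflow.
        if self.offset + len(data) > VOLUME_LIMIT:
            self.volume_no += 1
            self.offset = 0
        with open(self._volume_path(self.volume_no), "ab") as f:
            f.write(data)
        self.index[block_id] = (self.volume_no, self.offset, len(data))
        self.offset += len(data)

    def get(self, block_id):
        volume_no, offset, length = self.index[block_id]
        with open(self._volume_path(volume_no), "rb") as f:
            f.seek(offset)
            return f.read(length)
```

Millions of blocks would then map onto only total_size / 64 MiB volume files, at the cost of maintaining an index and some form of compaction when blocks are deleted.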

smessmer (Member) commented May 4, 2022 via email


onlyjob commented May 5, 2022

Thanks, but the problem is the number of files that CryFS produces, not their block size.

Of course SeaweedFS targets a different use case. But it nicely implements the very concept I'm talking about: consolidating small files into volumes. IMHO that idea is worth borrowing, or at least considering for a somewhat similar implementation.

@codebling

> Thanks, but the problem is the number of files that CryFS produces, not their block size.

@onlyjob the number of files produced depends on the block size. Increasing the block size will reduce the number of files produced, though you are right that there is always at least one block per file.

> It does have a significant downside though - larger blocks slow down synchronization and file system latency.

Can you clarify here? There are two major factors (ignoring IOPS for now) contributing to "speed": throughput and latency.

My understanding of how these affect CryFS is:

  • Optimal performance will always be achieved when the file being written is exactly the same size as the block size (minus the block header size)
  • The larger the file relative to the block, the more latency impacts the speed, eventually becoming the only limiting factor
  • The smaller the file relative to the block, the more throughput impacts the speed, eventually becoming the only limiting factor

So when you say larger blocks slow down synchronization, I think you are talking about the third case. For files larger or equal to the block size, it would speed up synchronization and latency. Is this correct or am I completely wrong?
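
For what it's worth, here is a back-of-the-envelope model of those two regimes (the latency and throughput numbers are made up for illustration; real CryFS behavior also depends on caching and the backend):

```python
# Toy model: each block transfer costs a fixed round-trip latency plus
# block_size / throughput. All numbers below are assumptions.
import math

LATENCY_S = 0.05           # 50 ms per block round trip (assumption)
THROUGHPUT_BPS = 10e6      # 10 MB/s link (assumption)

def sync_time(file_size, block_size):
    blocks = math.ceil(file_size / block_size)
    return blocks * (LATENCY_S + block_size / THROUGHPUT_BPS)

for block_size in (32 * 1024, 1024 * 1024, 64 * 1024 * 1024):
    t = sync_time(1_000_000_000, block_size)   # 1 GB file
    print(f"{block_size:>10} B blocks: {t:8.1f} s")
```

In this toy model a 1 GB file syncs faster with larger blocks (fewer round trips, so less latency paid), while a small file always pays for one full large block, which matches the two cases above.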

Block size also affects storage:

  • The larger the block size, the more storage is wasted (a small file still occupies at least one whole block)

Unfortunately, files come in all sizes. For my use case, I don't think there is a block size that will be fast enough for dealing with large files while also not inflating small file sizes so much that my backup becomes unaffordably expensive.
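
To make that trade-off concrete, here is a rough inflation estimate under an assumed file-size mix (the sizes and counts are invented; each file is assumed to occupy whole blocks):

```python
# Rough storage-inflation estimate: each file occupies
# ceil(file_size / block_size) full blocks on disk. File mix is assumed.
import math

def stored(file_size, block_size):
    return math.ceil(file_size / block_size) * block_size

files = [4 * 1024] * 90_000 + [100 * 1024 * 1024] * 100  # many small, few large
logical = sum(files)
for block_size in (16 * 1024, 4 * 1024 * 1024, 64 * 1024 * 1024):
    physical = sum(stored(f, block_size) for f in files)
    print(f"{block_size:>10} B blocks: {physical / logical:6.2f}x inflation")
```

With 64 MiB blocks every 4 KiB file still costs 64 MiB, so the small files alone blow the store up by orders of magnitude; with 16 KiB blocks inflation is modest but the file count explodes.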

I think @onlyjob's idea could be implemented fairly easily using a filesystem on top of CryFS. It doesn't seem like a perfect solution, though.

Do you think there is a way to have different buckets of block sizes

  1. without impacting confidentiality
  2. as a layer on top of CryFS, possibly with multiple backing CryFS instances using different block sizes (or would this be a change better made in CryFS itself, or by using it as a library)? A rough sketch of the layered approach follows.
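
A very rough sketch of the layering idea from point 2 (mount points, thresholds, and block sizes are all hypothetical; this is not an existing CryFS feature):

```python
# Hypothetical size-bucket router on top of several backing CryFS mounts,
# each created with a different block size. Paths and thresholds are made up.
import os
import shutil

BUCKETS = [
    (1 * 1024 * 1024,   "/mnt/cryfs-small"),   # e.g. mounted with small blocks
    (100 * 1024 * 1024, "/mnt/cryfs-medium"),  # e.g. mounted with medium blocks
    (float("inf"),      "/mnt/cryfs-large"),   # e.g. mounted with 64 MiB blocks
]

def store(path):
    """Copy a file into the first bucket whose size threshold fits it."""
    size = os.path.getsize(path)
    for threshold, mount in BUCKETS:
        if size <= threshold:
            dest = os.path.join(mount, os.path.basename(path))
            shutil.copy2(path, dest)
            return dest
```

One open question with such a layer is exactly point 1: splitting files across differently-sized stores may leak information about file sizes that a single uniform block size would hide.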


antymat commented Mar 17, 2024

> Unfortunately, files come in all sizes. For my use case, I don't think there is a block size that will be fast enough for dealing with large files while also not inflating small file sizes so much that my backup becomes unaffordably expensive.

The current scheme means that I would either pay for a huge inflation of the FS by making the block 64 MiB (the same as the default Storj block size), or pay for a huge inflation of the number of files for my 8 TiB repo. There seems to be very little to choose from; perhaps the only solution is the one that

> could be implemented fairly easily using a filesystem on top of CryFS.

Sounds like a big PITA, though.
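
For scale, the file-count side of that trade-off for an 8 TiB repo (the block sizes are chosen purely as examples):

```python
# File-count arithmetic for an 8 TiB store, assuming one file per block
# as CryFS stores things today. Block sizes are example values.
TIB = 2**40
REPO = 8 * TIB
for label, block_size in (("32 KiB", 32 * 1024), ("64 MiB", 64 * 2**20)):
    print(f"{label}: {REPO // block_size:,} block files")
# 32 KiB -> 268,435,456 files; 64 MiB -> 131,072 files
```

So the 64 MiB setting fixes the file-count explosion but, as noted above, inflates storage for small files; packing blocks into volumes would decouple the two.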
