Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add dictionary-in-stream format in contrib #2349

Open
wants to merge 6 commits into
base: dev
Choose a base branch
from

Conversation

yotann
Copy link

@yotann yotann commented Oct 11, 2020

Adds a format in contrib/dict_in_stream that stores a dictionary, optionally compressed, in a skippable frame at the beginning of a stream. This format is useful to reduce the size of streams that have many frames. For instance, Wget-AT is using this format to compress WARC files using one frame per record and allow random access.

cc @JustAnotherArchivist

@facebook-github-bot
Copy link

Hi @yotann!

Thank you for your pull request and welcome to our community. We require contributors to sign our Contributor License Agreement, and we don't seem to have you on file.

In order for us to review and merge your code, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (eg your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.

If you have received this in error or have any questions, please contact us at cla@fb.com. Thanks!

@facebook-github-bot
Copy link

Thank you for signing our Contributor License Agreement. We can now accept your code for this (and any) Facebook open source project. Thanks!

@Cyan4973
Copy link
Contributor

Thanks @yotann ,
this looks like a very interesting contribution.
It's also great to learn that this format is already implemented and in use.

Adding dictionary support to the seekable format is something we are definitely interested in.
This is probably the place where we will focus most of our attention,
to ensure it can become a base for future developments.

All in all, a very interesting idea worth reviewing.

@yotann yotann marked this pull request as ready for review November 4, 2020 16:23
Comment on lines +26 to +27
a normal Zstandard stream. All compressed frames in the stream must have been
compressed using the dictionary.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No dictionary should also be a valid option for frames.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure about this. Part of me wants to keep it simple, as an implementation that requires every frame to use the same dictionary may be easier and cleaner than potentially switching between dict and no-dict. On the other hand, other specifications (such as the one for the .warc.zst files produced by Wget-AT) can always impose additional restrictions. And I don't see any obvious harm from allowing dictionary-less frames other than that it might lead to the immediate follow-up 'why not allow multiple dictionaries in a stream?', which – while there may well be realistic applications of it – seems like feature creep to me at this point.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@JustAnotherArchivist Dictionaries already have an ID and Frames should use that ID if they want to. It doesn't really extend any features, it just registers the dictionaries and frames with their ID can pick it up.

That way you can add several dictionaries, for example for different file types and the encoder can use them as it sees fit. While the improvement will be marginal it gives some flexibility.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@JustAnotherArchivist

Following compressed frames in the stream may have been
compressed using the dictionary.

A bit clunky wording. But allows for more than one dictionary per stream. Maybe add:

If a dictionary with the same ID has previously been registered, this replaces it for the following frames.

@klauspost
Copy link

klauspost commented Jun 21, 2021

This is a good start, but IMO it needs a bit more to be widely useful.

For me this would be an index that contains information about the start of each frame and where it is located in the stream.

This will have the stream seekable without having to read all data, and allow reading an arbitrary frame with only a few disk seeks.

What I could see is something like this...

[seekable stream indication with zero or more dictionaries]
[zstd frames]
[frame index]
For each non-skippable frame:
	* Absolute offset of beginning of frame
	* Compressed size of frame
	* Uncompressed Size
[index pointer]
	* Absolute offset of frame index, 8 bytes
[EOF]

Instead of adding only one dictionary, we instead upfront state that this stream is seekable. This could be a skippable frame as proposed, with some additional magic in the start to indicate that this is a seekable stream.

We allow specifying 0 or even several dictionaries up front. We keep them here so they can be read without seeking.

At the end a mapping of uncompressed position -> file offset is created for each frame. The frame must start at the absolute offset and the content must match "Uncompressed Size" - this effectively acts as "Frame Content Size" of the frame.

The frame must be able to decode using "Uncompressed Size" bytes from the specified offset. This can be used for HTTP range requests, etc.

Finally a frame with the absolute offset of the index is added. With a skippable frame this should be a fixed size, so it can always be read by reading X bytes back from the EOF.

This will allow appending data, by simply adding the new index with the added frames, and the existing index will simply be skipped.

Decoding a "frame" can then be done by 1) reading from the start, parse the dictionaries 2) Seek to the end, read last 64KB for example 3) check we have read enough by checking the index pointer, decode it 4) Seek to the frame we want.

So at most 4, likely 3 seeks to decode at specific content offset, while maintaining fully stream-able writes. When it is impossible to seek, you will have to decode all frames just like the proposal, but you still get one or more dictionaries.

I am tempted to allow some custom content for each frame, though that brings us rather close to a ZIP functionality. Maybe adding the option for custom, skippable data per frame would be nice.

@yotann
Copy link
Author

yotann commented Sep 15, 2021

@klauspost There's actually a seekable format defined already: https://github.com/facebook/zstd/blob/dev/contrib/seekable_format/zstd_seekable_compression_format.md

I think this PR's dictionary-in-stream format is already compatible with that seekable format, although I'd need to check to make sure. (And it's certainly very important to make sure both formats are compatible.)

In general, I think it should be possible to use the dictionary-in-stream format independently from the seekable format, because there are a lot of variations on the seekable format that could be useful for different situations:

  • A linear lookup table in the Zstandard stream (like the existing format linked above)
  • A Btree-based lookup table in the Zstandard stream, to reduce the number of blocks read from disk
  • A custom table in a separate file that includes other information about the entries in the stream (like Wget-AT uses)
  • Lookup tables stored in some sort of separate database (small pieces of data go directly in the database; large pieces go in a Zstandard stream with only byte offsets stored in the database)

All these situations can still use the same dictionary-in-stream format. And a tool that understands the dictionary-in-stream format can read all the data in the stream even if it doesn't understand the seek table; it just won't be able to seek efficiently.

@JustAnotherArchivist
Copy link

JustAnotherArchivist commented Sep 15, 2021

I fully agree with that. They're both useful concepts, and there may be cases where both at the same time may be desired, but they're ultimately solving unrelated problems, and I feel it's best to keep them separate.

Regarding the compatibility of the two formats: I don't see any reason why they wouldn't be compatible. The two frames are essentially opaque to each other, and they're both standard skippable frames. The seek table would have to include the dict-in-stream frame as it would any other frame (or rather as any other skippable frame, with Decompressed_Size = 0), and the dictionary would not be relevant for the seek table frame since that frame isn't compressed. They're on the opposite ends of a stream, so there's no confusion over which would come first, and their magic numbers even differ as well.

The 'which comes first' bit made me think about future-proofing though. Perhaps this format should be amended to permit any number of other skippable frames before the dict-in-stream frame. Since skippable frames aren't compressed (unless their User_Data is explicitly defined as such, as with the potentially compressed dict in here), the dict wouldn't be relevant for them. And it would avoid conflict with future formats that also define skippable frames at the beginning of a stream.

@klauspost
Copy link

@yotann @JustAnotherArchivist Yeah, forgot about the already defined seekable format..

The dictionary data frames are relevant for future frames. First of all the index should show if there are embedded dictionaries. That can be done using one of the 'Reserved' bits. This will also make it clear to previous versions that it cannot decompress the output.

The main problem then is that it doesn't show if a skippable frame is or is not a dictionary. I would hate to force the reader to read all skippable frames, potentially skipping all over the file.

As I see it, we can do it in, without major breakage:

A) Add magic values for CRC, e.g. 0xffffffff. This will never be true since that is not the size 0 CRC. @Cyan4973 should be able to confirm.

B) Force dictionaries to be before any non-skippable frame.

C) Add a bit field byte to each Seek_Table_Entries in the definition.

I am in favor of A) since it will allow concatenating streams. It is on the limit of being too hacky, but the "Reserved Bit" can make it clear that it is incompatible.

C) is the "correct" solution, but requires parsing code to be rewritten significantly and adds too much complexity to the format IMO. I mainly just put it in here for completion.

@tomberek
Copy link

I'm a fan of (B) and (C)

(A): I'm looking at a use-case where we can do frame deduplication: https://github.com/facebook/zstd/pull/2737/files#diff-50e7772e96927d183da7ec7cac6a9f4e34f7e33ed00ca518611d0c52e0af4350R103
and magic CRC values would be awkward.

(B): Retain the invariant that a stream processor can pre-load everything as needed. So the dictionary would only be needed before the frames that require it. This should still allow concatenation.

(C): If the seek table index exposed the Dictionary_ID then a client that already had that dictionary wouldn't need to fetch or process it.

@yotann
Copy link
Author

yotann commented Sep 30, 2021

A) Add magic values for CRC, e.g. 0xffffffff. This will never be true since that is not the size 0 CRC.

We wouldn't need a magic value; all skippable frames will have Decompressed_Size=0, so we can repurpose the CRC field just for those frames. But then the new field would only be present when the CRC field is enabled, and there wouldn't be any room to distinguish between dictionary frames and other skippable frames, which is too awkward IMHO.

(C): If the seek table index exposed the Dictionary_ID then a client that already had that dictionary wouldn't need to fetch or process it.

That's a good idea. If a file has lots of dictionaries, searching for the right one could be expensive. 128 dictionaries of 8MiB each would mean seeking to 128 different places in the first 1GiB of the file. (Not sure if HDD seek time is relevant anymore, but HTTP request latency certainly can be.) In that case, it would be important to include Dictionary_IDs in the seek table.

This is what I think the most general design could look like:

  • Dictionary frames can be anywhere in the file
  • Each dictionary frame must precede any non-skippable frames that use it
  • If multiple dictionaries with the same Dictionary_ID are seen while reading the file sequentially, the most recent one takes precedence.
  • Dictionary frames are included in the seek table with Decompressed_Size=0, just like other skippable frames.
  • The seek table includes additional information about dictionary frames, using one of these ideas:
    1. Each seek table entry has an additional 4-byte field giving its Dictionary ID. If Decompressed_Size=0, it marks a dictionary frame with the given ID; otherwise, this is a non-skippable frame and the ID indicates which dictionary is needed to decompress it.
    2. Each seek table entry has an additional 4-byte field, which is the index (in the seek table) of the correct dictionary frame needed to decompress the current frame. -1 if no dictionary is used.

This is definitely overkill for our original use case. OTOH, it covers every use case I can think of with a reasonable amount of complexity.

@tomberek
Copy link

@yotann for (C), a slight modification would be to utilize the Seek_Table_Descriptor bitfield to indicate the presence or lack thereof of the dictionary entries. I suspect dictionary-in-stream will not be used all the time (or even most of the time), so it should be opt-in. But dictionaries are central enough to zstd to devote a reserved bit (or two, the same way the dict ID length is encoded in the Dictionary_ID_flag in the Frame_Header_Descriptor ) in that field to indicate their presence.

@klauspost
Copy link

klauspost commented Oct 5, 2021

I propose the following change to the seek table information:

Seek_Table_Descriptor

A bitfield describing the format of the seek table.

Bit number Field name
7 Checksum_Flag
6 Stream_Dictionaries
5-2 Reserved_Bits
1-0 Unused_Bits

While only Checksum_Flag and Stream_Dictionaries currently exists, there are 6 other bits in this field that can be used for future changes to the format.

[...]

Stream_Dictionaries

When the stream dictionaries is set one or more dictionaries are present in the stream.

Dictionaries can be required to decode frames following the dictionary.

Enabling this adds 4 bytes to every seek table entry.

Seek_Table_Entries

Seek_Table_Entries consists of Number_Of_Frames (one for each frame in the data, not including the seek table frame) entries of the following form, in sequence:

Compressed_Size Decompressed_Size [Checksum] [Dictionary_ID]
4 bytes 4 bytes 4 bytes 4 bytes

[...]

Dictionary_ID

Only present if Stream_Dictionaries is set in the Seek_Table_Descriptor.

This will contain the dictionary ID contained or used in the frame.

If this is a skippable frame (Decompressed_Size == 0) and Dictionary_ID is non-zero the skippable frame will contain the dictionary with the specified ID. If there are several entries with the same Dictionary_ID, the last entry before the frame that is to be decoded should be used.

If this is a non-skippable frame, this indicates the dictionary id required to decode the frame. If value is 0 no dictionary is required.

This will allow to maintain backwards compatibility and allow decompressing with the least number of seeks.

Comment on lines +9 to +10
This document defines a format for including a Zstandard dictionary inside a
compressed Zstandard stream. When combined with the

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
This document defines a format for including a Zstandard dictionary inside a
compressed Zstandard stream. When combined with the
This document defines a format for including Zstandard dictionaries inside a
compressed Zstandard stream. When combined with the


__`Compressed_or_Uncompressed_Dictionary`__

The dictionary data, which may optionally be compressed.
Copy link

@klauspost klauspost Oct 7, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to make it crystal clear:

Suggested change
The dictionary data, which may optionally be compressed.
The dictionary data, which may optionally be compressed.
This __must__ start with either little-endian 0xEC30A437 or 0xFD2FB528, otherwise is should be considered a regular skippable block.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants