-
Notifications
You must be signed in to change notification settings - Fork 2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Specify that decoders may reject non-zero probabilities for larger offset codes than implementation supports #4013
base: dev
Are you sure you want to change the base?
Conversation
…fset codes than supported by the implementation
What would be the benefit of this? I could easily see speed focused encoders having a predefined table - even if suboptimal having a non-zero probability for unused codes IMO shouldn't be a crime. For other encoders I have experimented with sub-optimal tables to increase their reusability. If this is for an optimization I don't see how you would avoid having to check offsets anyway - at least for one offset value once you have a filled window. This will just add another (albeit cheap) check for a decoded FSE table. It seems reasonable that content that is unsupported by the decoder would already have been rejected when checking the requested window size. Decoded offsets are already checked against the window size - so if a block has an offset > window size it would be invalid anyway. Finally, how do you know there isn't content already out there with this property? |
Due to the behavior of less-than-one probabilities, the symbol range determines how much temporary storage is needed for FSE table construction, so it is beneficial for implementations to allow the symbol range to be limited there. If the maximum offset code is bounded by the window size, then maybe the maximum allowed offset code should be specified as 64? That isn't currently the behavior of the reference decoder though. The behavior of the reference decoder is what's described in this change: |
offset-codes can already be limited by the decoder:
My reading is that you can freely reject above 22. There is however no requirement that offset codes base value cannot exceed the window size and IMO it is a rather unsafe assumption that zstd content out there will respect that. |
The original sentence, mentioning "out-of-range values in an FSE table", was actually more focused on Literals and Match Lengths, for which the range is bounded. In the case of offsets, this range is not bounded that clearly. And that's the difference between format's limitation, and implementation limitation. An implementation may reject a frame on the ground that it requests capabilities it cannot provide, The modified paragraph was focused on determining what's an "invalid" or "corrupted" FSE header. I think it's not helpful to mix this concept with the one of "not supported by this implementation". I would separate them more clearly. And indeed, as mentioned by @klauspost, the topic of implementation limitation is already covered by another paragraph. |
I don't think we'll ever see a real implementation go up to 64-bit offsets, but >32-bit offsets is plausible. I think there are two main things I want to address: First, since any offset code >63 will always exceed the maximum possible window size, should 63 be specified as the top of the valid range? (This affects temp usage, which may matter for constrained implementations like ASICs.) Second, right now the reference decoder rejects a frame if it has non-zero probs for offsets >31. Should the decoder be updated to permit larger offset codes and reject them as out-of-range in sequence decoding or execution instead? If not, should the spec provide guidance that someone writing an encoder could use to avoid hitting this kind of limit, and that someone writing a decoder could expect encoders to follow? |
This is a somewhat pedantic change, but basically, the current wording states that out-of-range values in an FSE table are invalid. However, technically offset codes exceeding the decoder's limit are still valid. This adds some text to allow decoders to reject prob tables with unsupported offset codes.
There is one edge case with this that needs to be figured out though: The current spec states that a decoder may limit the maximum offset code length and the minimum recommended is 22. However, the predefined table permits offset codes as large as 28. So, this wording could mean that a decoder is allowed to reject frames using predefined codes, which sounds like something that shouldn't be allowed?
If that's the case, then this text would need to be changed to specify that it only applies to encoded tables, not to the predefined tables, and that predefined tables must be supported.