Specify that decoders may reject non-zero probabilities for larger offset codes than implementation supports #4013

Open
wants to merge 1 commit into base: dev
Conversation

@elasota (Contributor) commented Apr 2, 2024

This is a somewhat pedantic change, but the current wording states that out-of-range values in an FSE table are invalid, while technically offset codes exceeding the decoder's limit are still valid per the format. This adds some text to allow decoders to reject probability tables containing unsupported offset codes.

There is one edge case with this that needs to be figured out, though: the current spec states that a decoder may limit the maximum offset code it supports, and the recommended minimum is 22. However, the predefined table permits offset codes as large as 28. So, this wording could mean that a decoder is allowed to reject frames using the predefined table, which sounds like something that shouldn't be allowed?

If that's the case, then this text would need to be changed to specify that it only applies to encoded tables, not to the predefined tables, and that predefined tables must be supported.

…fset codes than supported by the implementation
@klauspost commented:

What would be the benefit of this?

I could easily see speed-focused encoders having a predefined table; even if suboptimal, having a non-zero probability for unused codes IMO shouldn't be a crime. For other encoders, I have experimented with suboptimal tables to increase their reusability.

If this is for an optimization, I don't see how you would avoid having to check offsets anyway, at least for one offset value once you have a filled window. This would just add another (albeit cheap) check for a decoded FSE table.

It seems reasonable that content that is unsupported by the decoder would already have been rejected when checking the requested window size. Decoded offsets are already checked against the window size - so if a block has an offset > window size it would be invalid anyway.

Finally, how do you know there isn't content already out there with this property?

@elasota (Contributor, Author) commented Apr 2, 2024

Due to the behavior of less-than-one probabilities, the symbol range determines how much temporary storage is needed for FSE table construction, so it is beneficial for implementations to allow the symbol range to be limited there. If the maximum offset code is bounded by the window size, then maybe the maximum allowed offset code should be specified as 64?

That isn't currently the behavior of the reference decoder though. The behavior of the reference decoder is what's described in this change: ZSTD_decodeSeqHeaders passes MaxOff (31) as the max to ZSTD_buildSeqTable, which passes it to FSE_readNCount, so a frame containing a non-zero probability for an offset code value larger than 31 will be rejected.

@klauspost commented:

Offset codes can already be limited by the decoder:

"Offset codes are values ranging from 0 to N. A decoder is free to limit its maximum N supported. Recommendation is to support at least up to 22. For information, at the time of this writing, the reference decoder supports a maximum N value of 31."

My reading is that you can freely reject above 22. There is, however, no requirement that an offset code's base value cannot exceed the window size, and IMO it is a rather unsafe assumption that zstd content out there will respect that.

@Cyan4973 (Contributor) commented Apr 2, 2024

The original sentence, mentioning "out-of-range values in an FSE table", was actually more focused on Literals and Match Lengths, for which the range is bounded.

In the case of offsets, this range is not as clearly bounded.
One could derive that, given that the maximum possible window size is the size of the content, and the content size is limited to a 64-bit unsigned value, we may need up to 63. But then, it's also allowed to reference data in a dictionary of unspecified size, so we could imagine needing even more.
This is all theoretical; there is no system out there able or willing to reach such extreme distances.

And that's the difference between format's limitation, and implementation limitation.
Limits on offset values are necessarily a matter of implementation limitation.

An implementation may reject a frame on the ground that it requests capabilities it cannot provide,
but that's different from stating that the frame is invalid.

The modified paragraph was focused on determining what's an "invalid" or "corrupted" FSE header. I think it's not helpful to mix this concept with the one of "not supported by this implementation". I would separate them more clearly.

And indeed, as mentioned by @klauspost, the topic of implementation limitation is already covered by another paragraph.

@Cyan4973 Cyan4973 self-assigned this Apr 2, 2024
@elasota (Contributor, Author) commented Apr 3, 2024

I don't think we'll ever see a real implementation go up to 64-bit offsets, but >32-bit offsets is plausible.

I think there are two main things I want to address:

First, since any offset code >63 will always exceed the maximum possible window size, should 63 be specified as the top of the valid range? (This affects temp usage, which may matter for constrained implementations like ASICs.)

Second, right now the reference decoder rejects a frame if it has non-zero probs for offsets >31. Should the decoder be updated to permit larger offset codes and reject them as out-of-range in sequence decoding or execution instead? If not, should the spec provide guidance that someone writing an encoder could use to avoid hitting this kind of limit, and that someone writing a decoder could expect encoders to follow?

4 participants