GH-2402: RDF Patch Binary handles malformed inputs better #2408
Discovered in $dayjob work that `RDFPatchReaderBinary` would silently accept malformed inputs in some cases. I started by adding several test cases, some of which failed, to demonstrate that the RDF Patch Binary reader can silently accept invalid streams. I then debugged from there and introduced additional logic that makes a best-effort attempt to distinguish between a genuine EOF and an EOF caused by malformed input, which should produce an error.

Unfortunately the Thrift API doesn't have a clean way to detect that you've genuinely reached the EOF, so we just have to attempt to read the next row from the patch and detect the obvious cases of malformed input:

- `TProtocolUtil.skip()` getting called. This indicates that Thrift recognised a Field ID but that the Type ID was not the expected type for that field, so if we hit an EOF while skipping over such a field we're definitely reading malformed input.
- `TUnion.read()` getting called multiple times. This means that we're partway through reading an otherwise valid data structure at the point where we encounter the EOF, so again it is a clear indicator of malformed input.

I appreciate that this solution is somewhat hacky; if anyone has a more robust solution please feel free to suggest it.
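For readers unfamiliar with the problem, the underlying idea (an EOF at a row boundary is genuine, an EOF partway through a row is an error) can be illustrated without Thrift. The following is only a minimal sketch and not the code in this PR: it assumes a made-up format where each row is a one-byte length followed by that many payload bytes, and `EofHeuristicSketch`, `readRow`, `readAll`, and `MalformedInputException` are all hypothetical names introduced for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only: a toy row format, not RDF Patch Binary or Thrift.
class EofHeuristicSketch {

    /** Signals that the stream ended partway through a row. */
    static class MalformedInputException extends IOException {
        MalformedInputException(String message) {
            super(message);
        }
    }

    /**
     * Reads one row of the assumed form [1-byte length][payload].
     * Returns null only when the stream ends before any byte of the row was read.
     */
    static byte[] readRow(InputStream in) throws IOException {
        int length = in.read();
        if (length < 0) {
            // Nothing consumed for this row, so this is a genuine EOF.
            return null;
        }
        byte[] payload = new byte[length];
        int offset = 0;
        while (offset < length) {
            int n = in.read(payload, offset, length - offset);
            if (n < 0) {
                // We already started the row, so EOF here means truncated input.
                throw new MalformedInputException(
                        "EOF after " + offset + " of " + length + " payload bytes");
            }
            offset += n;
        }
        return payload;
    }

    /** Reads rows until a genuine EOF; propagates an error on truncated input. */
    static List<byte[]> readAll(InputStream in) throws IOException {
        List<byte[]> rows = new ArrayList<>();
        byte[] row;
        while ((row = readRow(in)) != null) {
            rows.add(row);
        }
        return rows;
    }

    public static void main(String[] args) throws IOException {
        // Two complete rows: lengths 2 and 1.
        byte[] valid = {2, 10, 20, 1, 30};
        System.out.println(readAll(new ByteArrayInputStream(valid)).size());

        // Second row claims 3 payload bytes but only 1 is present.
        byte[] truncated = {2, 10, 20, 3, 30};
        try {
            readAll(new ByteArrayInputStream(truncated));
        } catch (MalformedInputException e) {
            System.out.println("Rejected: " + e.getMessage());
        }
    }
}
```

The Thrift case is harder than this toy format because the protocol layer does not expose where row boundaries fall, which is why the PR relies on indirect signals such as the `TProtocolUtil.skip()` and `TUnion.read()` calls described above.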
GitHub issue resolved #2402
By submitting this pull request, I acknowledge that I am making a contribution to the Apache Software Foundation under the terms and conditions of the Contributor's Agreement.
See the Apache Jena "Contributing" guide.