Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GH-2402: RDF Patch Binary handles malformed inputs better #2408

Draft
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

rvesse
Copy link
Member

@rvesse rvesse commented Apr 9, 2024

Discovered in $dayjob work that RDFPatchReaderBinary would silently accept malformed inputs in some cases. Started by adding several test cases, some of which failed, to demonstrate that the RDF Patch Binary reader can silently accept invalid streams. Then debugged from there are introduced additional logic which makes a best effort attempt to distinguish between genuine EOF and EOF due to malformed input that should produce an error.

Unfortunately the Thrift API doesn't have a clean way to detect that you've genuinely reached the EOF so we just have to attempt to read the next row from the patch, and detect the obvious cases of malformed input:

  1. TProtocolUtil.skip() getting called, this indicates that Thrift recognised a Field ID but that the Type ID was not the expected type for that field. So if we hit a EOF while skipping over such a field we're definitely reading malformed input.
  2. TUnion.read() getting called multiple times. This means that we're partway through reading an otherwise valid data structure at the point where we encounter the EOF so again is a clear indicator of malformed input.

I appreciate that this solution is somewhat hacky, if anyone has a more robust solution please feel free to suggest it

GitHub issue resolved #2402


  • Tests are included.
  • Documentation change and updates are provided for the Apache Jena website
  • Commits have been squashed to remove intermediate development commit messages.
  • Key commit messages start with the issue number (GH-xxxx)

By submitting this pull request, I acknowledge that I am making a contribution to the Apache Software Foundation under the terms and conditions of the Contributor's Agreement.


See the Apache Jena "Contributing" guide.

@rvesse rvesse self-assigned this Apr 9, 2024
@rvesse rvesse added the bug label Apr 9, 2024
@rvesse rvesse force-pushed the rdf-patch-binary-invalid-inputs branch from 3c9d53e to b028b3e Compare April 25, 2024 09:17
@rvesse rvesse changed the title GH-2402: Failing test cases for binary patch bug GH-2402: RDF Patch Binary handles malformed inputs better Apr 25, 2024
Previously the RDFPatchReaderBinary simply exited its read loop when it
detected an EOF error as Thrift doesn't have a clean way of apriori
detecting whether we've reached the EOF.  The downside of this approach
was that it meant the reader would silently ignore some genuinely
malformed inputs if they happened to be the right bytes for Thrift to
think it had encountered a field it knew about.

To address this a new method is introduced that inspects the
StackTraceElement's associated with the Thrift EOF exception to detect a
couple of cases where the input is clearly malformed and throw an
appropriate error.

Test cases around malformed patch inputs are also expanded to further
test this capability.
@rvesse rvesse force-pushed the rdf-patch-binary-invalid-inputs branch from b028b3e to ef4dc79 Compare April 25, 2024 13:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

RDF Patch Binary Reader silently accepts some invalid patch files
1 participant