
Datafactory errors not propagated properly by StreamParser #308

Open · LaurensRietveld opened this issue Oct 8, 2022 · 8 comments

@LaurensRietveld (Contributor) commented Oct 8, 2022

If a custom data factory throws an error, the StreamParser does not gracefully emit this error.

A unit test to reproduce (I couldn't figure out how to style my code to satisfy the linter, so I'm copy/pasting it here ;)):

// assuming: const { Readable } = require('stream');
//           const { DataFactory, StreamParser } = require('n3');
(done) => {
  const factory = DataFactory;
  // Make every named node throw
  factory.namedNode = (iri) => { throw new Error('Some error'); };
  Readable.from(['', '<a:a> <b:b> <c:c> .'])
    .pipe(new StreamParser({ factory }))
    .on('end', () => {
      done(new Error('expected error'));
    })
    .on('error', (error) => {
      error.should.be.an.instanceof(Error);
      error.message.should.equal('Some error');
      done();
    });
}

Use case: custom validation that happens in a data factory.

@RubenVerborgh (Member)

Thanks, @LaurensRietveld. I think you must be the most prolific N3.js bug finder by now.

@RubenVerborgh (Member)

I worry that the performance cost of wrapping error handling around factories is too high. I think we should be able to trust factories, especially given that N3.js will have pre-validated the URL. In what cases are factories erroring?

@LaurensRietveld (Contributor, Author)

The validation we'd like to perform is validation of literals (based on their datatype) and validation of IRIs. Applying such validation in the data factory, instead of somewhere else (e.g. in a stream after parsing), allows us to use the RDF-JS ecosystem of tools by just plugging in a different data factory, instead of having to wrap all other RDF-JS tools in a custom pre/post-processing step (which kind of defeats the purpose of RDF-JS ;))
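
For example, something along these lines (just a sketch; isValidIri and isValidLiteral are placeholders for our own validation helpers):

const { DataFactory } = require('n3');

// A factory that validates terms as they are created.
// isValidIri and isValidLiteral are hypothetical helpers,
// e.g. an RFC 3987 check and an XSD datatype check.
const validatingFactory = {
  ...DataFactory,
  namedNode(iri) {
    if (!isValidIri(iri))
      throw new Error(`Invalid IRI: ${iri}`);
    return DataFactory.namedNode(iri);
  },
  literal(value, languageOrDatatype) {
    if (!isValidLiteral(value, languageOrDatatype))
      throw new Error(`Invalid literal: "${value}"`);
    return DataFactory.literal(value, languageOrDatatype);
  },
};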

@jeswr (Collaborator) commented Nov 12, 2022

To quickly jump in with my 2 cents: I think there is definitely value in having validation, but it should be opt-in, given that there are cases in which it is safe to assume the data has already come from a validated source.

I also think there is value in having the API for toggling this validation behavior propagated to the RDF/JS spec level at some stage.

@LaurensRietveld (Contributor, Author)

If we expect a performance decrease, then I'm happy putting this behind a toggle obviously 👍

@RubenVerborgh (Member) commented Nov 12, 2022

instead of having to wrap all other RDF-JS tools in a custom pre/post-processing step (which kind of defeats the purpose of RDF-JS ;)

Well well, let's not exaggerate 🙂

So N3.js offers a parser. The job of the parser is to transform a syntactically valid serialization format into an in-memory representation.

RDF/JS is very useful as a standard model for data, and for component interfaces.
And this thread only proves that usefulness.
You can perfectly implement validation within the RDF/JS model without having to bend the interfaces.

The purpose of the RDF/JS DataFactory interface is to create RDF model terms for interoperable use by other components, such as validators. So if one wants to write an RDF/JS-compatible validator, they would write it such that it accepts RDF/JS terms and then performs its task. That way, it can accept input from any RDF/JS-compatible component.

The validation we'd like to perform is validation of literals (based on their datatype)

That's possible. The N3.js parser has streaming output, in an RDF/JS compatible format.
Or more specifically, it will output whatever tokens the factory generates. But let's put that aside for a second.

The logical way to do validation is not to interfere with the parser (whose job is only the transformation of Turtle into a valid in-memory RDF model, which is exactly what all of its output is). Instead, chain an RDF/JS-compatible parser into an RDF/JS-compatible validator, which checks whether the model-valid RDF nodes (which the parser is guaranteed to output) also meet other, non-model validation constraints (such as whether literals have the right type).
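
A minimal sketch of such a validating stream (not a definitive implementation; isValidQuad is a placeholder for whatever non-model checks one needs):

const { Transform } = require('stream');

// A pass-through object stream: valid quads go through untouched,
// invalid ones become stream errors. isValidQuad is a hypothetical
// predicate, e.g. checking that literal values match their datatype.
function createValidator() {
  return new Transform({
    objectMode: true,
    transform(quad, _encoding, callback) {
      if (isValidQuad(quad))
        callback(null, quad);
      else
        callback(new Error(`Invalid quad about ${quad.subject.value}`));
    },
  });
}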

Now if you must do this in the factory, recall that the factory is in full control of whatever it outputs. So nothing is stopping you from, for instance, outputting { valid: false, reason: 'mismatched datatype' } as a token, which you can then immediately catch afterwards.
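
A sketch of that pattern (isValidLiteral, handleInvalid, and textStream are hypothetical placeholders):

const { DataFactory, StreamParser } = require('n3');

// Instead of throwing, the factory returns an error token
// that flows through the parser like any other term:
const tolerantFactory = {
  ...DataFactory,
  literal(value, languageOrDatatype) {
    return isValidLiteral(value, languageOrDatatype) // hypothetical check
      ? DataFactory.literal(value, languageOrDatatype)
      : { valid: false, reason: 'mismatched datatype' };
  },
};

// Immediately afterwards, catch quads carrying an error token:
textStream
  .pipe(new StreamParser({ factory: tolerantFactory }))
  .on('data', (quad) => {
    if (quad.object.valid === false)
      handleInvalid(quad.object.reason); // hypothetical handler
  });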

But I see no architectural or performance need to catch non-model errors in the parser. Attaching the validator as the immediate next step in the stream will not delay validation in any way, given that N3.js has fully streaming output.

validation of IRIs

N3.js will only instantiate valid IRIs (as this is an RDF model requirement). But if this validation includes more specific constraints, then indeed that goes beyond the parser's job of generating a valid model.

If we expect a performance decrease, then I'm happy putting this behind a toggle obviously 👍

The thing is, it would require different code paths, because currently we make assumptions (for performance reasons) about where errors can occur and where they cannot; and where errors cannot occur, we avoid an expensive try/catch construct that would otherwise have to run hundreds of thousands of times per second.

Simply put: you either wrap the whole thing in a try/catch block or you don't; this is not switchable except through separate code paths.
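
Purely as an illustration (hand-wavy pseudocode, not actual N3.js internals; emit and emitError are hypothetical):

// A `validate` toggle cannot make a single code path conditionally
// try/catch-free, so it would mean maintaining two parsing loops:
function parseTrusted(tokens, factory) {
  for (const token of tokens)
    emit(factory.namedNode(token)); // assumes the factory never throws
}

function parseGuarded(tokens, factory) {
  for (const token of tokens) {
    try { emit(factory.namedNode(token)); } // same work, but wrapped in a
    catch (e) { emitError(e); }             // per-token try/catch
  }
}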

Which I could always do of course, but I just don't see a convincing case yet: having streaming non-model validation after the streaming model parser is the right separation of concerns, does not sacrifice performance, and allows us to make stronger assumptions about what can fail and what cannot.

@LaurensRietveld (Contributor, Author) commented Nov 14, 2022

Now if you must do this in the factory, recall that the factory is in full control of whatever it outputs. So nothing is stopping you from, for instance, outputting { valid: false, reason: 'mismatched datatype' } as a token, which you can then immediately catch afterwards.

We'll probably settle on an option similar to this one, where we'll wrap the N3 parser interface and expose a stream with proper error handling.

N3.js will only instantiate valid IRIs (as this is an RDF model requirement). But if this validation includes more specific constraints, then indeed that goes beyond the parser's job of generating a valid model.

Nice! I did not know that validating IRIs was in scope for N3 (as we've had issues with IRIs that were invalid according to RFC 3987).

So N3.js offers a parser. The job of the parser is to transform a syntactically valid serialization format into an in-memory representation.

In general, I'd expect a parser (or any other tool that creates RDF-JS terms for me) to return errors when the input does not match the spec. I agree that's difficult in this case without increasing the scope of N3, considering we have several specifications to take into account: the Turtle family, the IRI spec, XSD datatypes, and possibly other datatypes.
That being said, it's a different matter to prohibit these additional checks completely. Especially considering the RDF-JS spec offers a nice way (the data factory) to add such optional validation steps to RDF-JS libraries like N3.

@RubenVerborgh (Member) commented Nov 14, 2022

We'll probably settle on an option similar to this one, where we'll wrap the N3 parser interface and expose a stream with proper error handling.

Great! Do let me know if there are any issues; I definitely want to support your use case (even if in a slightly different way).

I did not know that validating IRIs was in scope for N3 (as we've had issues with IRIs that were invalid according to RFC 3987).

Do let me know; the Turtle grammar is strict as to what URLs are allowed, and N3.js follows that.

In general, I'd expect a parser (or any other tool that creates RDF-JS terms for me) to return errors when the input does not match the spec.

Agreed; we might just differ in what we consider "the spec". For me, that's RDF 1.1 and Turtle.

That being said, it's a different matter to prohibit these additional checks completely.

Well, we don't—I just think that this one particular way of validating clashes with performance goals, and that there is a better way of implementing it that offers better performance and other benefits.

Especially considering the RDF-JS spec offers a nice way (the data factory) to add such optional validation steps to RDF-JS libraries like N3.

And that's where we differ as well. RDF/JS indeed has a DataFactory interface, but it does not impose an interface for parsers etc. to accept it as a constructor argument. In other words: N3.js is voluntarily offering DataFactory support as an option. Everything still works without passing a factory: the parser is an RDF/JS Source implementation that you can use at will.

where we'll wrap the N3 parser interface and expose a stream with proper error handling.

No need to wrap; just chain a validating stream after it. It's literally .pipe(new StreamParser()).pipe(validator), or, if you want, class ValidatingParser { constructor() { return new StreamParser().pipe(validator); } }. Performance will be significantly better than with intrusive error handling.
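
Spelled out as a sketch (using stream.compose, available in Node ≥ 16.9, so that the wrapper is also writable like a parser; createValidator is the hypothetical validating Transform sketched earlier, and inputTextStream stands for whatever you are parsing):

const { compose } = require('stream');
const { StreamParser } = require('n3');

// A single duplex that writes into the parser and reads from the
// validator, so it can be used exactly like a plain StreamParser.
// createValidator() is the hypothetical validating Transform from above.
function createValidatingParser(options) {
  return compose(new StreamParser(options), createValidator());
}

// Usage:
inputTextStream
  .pipe(createValidatingParser())
  .on('data', (quad) => console.log(quad))       // only validated quads arrive here
  .on('error', (error) => console.error(error)); // validation failures surface as errors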

So far, I have not heard drawbacks to implementing validation after the parser—in fact, I think there are only advantages.
