
Check input is valid JSON without full parsing #1780

Open

daverigby opened this issue Jan 10, 2022 · Discussed in #1393 · 5 comments

Comments

@daverigby

Discussed in #1393

Originally posted by daverigby January 14, 2021
Hi all,

As per the subject, I'm wondering what's the most efficient way to use simdjson to simply check that an input is valid JSON, without actually parsing out the content. Essentially the same use-case as http://www.json.org/JSON_checker/.

Ideally this would require state which isn't proportional to the size of the input document; for my use-case I need to validate JSON files that are many megabytes in size without consuming a similar amount of memory.
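
For reference, the only way I know to do this with simdjson today is to fully parse the document and discard the result. A minimal sketch using the DOM API (file path taken from the command line):

```cpp
#include <iostream>
#include "simdjson.h"

// Sketch of the approach available today (as far as I know): fully parse
// and discard the result. The parser's internal structures grow with the
// document size, which is the memory cost I'd like to avoid.
bool is_valid_json(const simdjson::padded_string &json) {
  simdjson::dom::parser parser;
  return parser.parse(json).error() == simdjson::SUCCESS;
}

int main(int argc, char **argv) {
  if (argc < 2) { return 1; }
  // Throws (when exceptions are enabled) if the file cannot be read.
  simdjson::padded_string json = simdjson::padded_string::load(argv[1]);
  std::cout << (is_valid_json(json) ? "valid" : "invalid") << std::endl;
  return 0;
}
```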

Thanks in advance.

@daverigby (Author)

I'm still interested in this use-case for simdjson, so it seemed worth converting the discussion into an issue for tracking purposes.

@lemire (Member)

lemire commented Jan 10, 2022

It is an entirely valid use case. We have not done any work toward it at this point in time.

Can you tell us more about your use case?

I am not sure why you'd ever want to check validity without also parsing. If you can provide motivation, this could help this issue along.

@daverigby (Author)

daverigby commented Jan 10, 2022

Thanks Daniel.

The primary use-case is a database system (key-value store) which lets users store documents as one of two main datatypes:

  1. "binary" - an opaque value which cannot really be manipulated server-side;
  2. JSON - which can be manipulated on the server (read individual paths, mutate it to insert new paths, etc.).

Clients can write new documents to the server in either binary or JSON; the server (which is the code I'm responsible for) wants to be able to verify the type of document at the point the client writes it.

We don't blindly trust the datatype the client specifies, primarily because of data-consistency issues. For example, we don't want another client at some point in the future to try to manipulate a supposedly "JSON" document, only to get an error that we cannot parse the field(s) it is trying to access, because a previous client sent us invalid JSON some time in the past.

Broadly speaking, the point at which we accept and store new JSON data isn't necessarily the point at which it is manipulated, so we want to check that it is valid up front rather than waiting until the fields of the JSON data are accessed.

@lemire (Member)

lemire commented Jan 10, 2022

Great answer.

@TysonAndre (Contributor)

Use cases I can think of: datastores that store JSON without parsing it, as the other commenter said, and (hypothetically) database/service clients that validate JSON before sending it to such datastores, or before making other network calls, so as to avoid sending invalid input to external APIs that expect JSON. I haven't had those real-world use cases personally, though.

That would also make it easier to avoid the need for an arbitrary depth limit: there would be no need to allocate an additional 8-byte open_container struct per nesting level; a single byte per level would suffice to track whether each stack element is an object or an array (see the sketch at the end of this comment).

  • e.g. this avoids pathological cases such as requiring ~4.5GB instead of ~0.5GB to validate 1GB of JSON: a worst-case 1GB document can open ~500,000,000 nested containers, and at 9 bytes per level (the 8-byte struct plus the 1-byte object-vs-array marker) that is ~4.5GB of stack, versus ~0.5GB at one byte per level (assuming a validation implementation that doesn't use the C call stack to track state).

This would reduce the amount of memory needed in the rare use cases where you want to effectively ignore any depth limit (by setting the depth limit to the document length / 2) and just check whether the JSON is valid.

> Ideally this would require state which isn't proportional to the size of the input document; for my use-case I need to validate JSON files that are many megabytes in size without consuming a similar amount of memory.

If this continued to have a depth limit, that would be possible: the state would be proportional to the depth limit instead of the document size.

(This is because input such as {[{[{[{[{[… requires a stack of some sort, either manual or implicit through recursion, that uses memory to track which nesting levels are objects and which are arrays.)
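
A rough sketch of the minimal state this needs; this is hypothetical illustration code, not simdjson's implementation, and it deliberately ignores strings, scalars, commas and colons to show only the one-byte-per-level stack:

```cpp
#include <string_view>
#include <vector>

// Hypothetical sketch, not simdjson code: one byte of state per nesting
// level, recording whether each open container is an object ('{') or an
// array ('['). A real validator must also skip over string contents
// (so a '{' inside a string isn't treated as structural) and check
// scalars, commas and colons; this only shows the container stack.
bool containers_balanced(std::string_view json) {
  std::vector<char> stack; // one byte per open container
  for (char c : json) {
    if (c == '{' || c == '[') {
      stack.push_back(c);
    } else if (c == '}') {
      if (stack.empty() || stack.back() != '{') return false;
      stack.pop_back();
    } else if (c == ']') {
      if (stack.empty() || stack.back() != '[') return false;
      stack.pop_back();
    }
  }
  return stack.empty(); // every container must be closed
}
```

Since only object-vs-array needs to be distinguished, the byte per level could in principle be packed down to a single bit, making even deeply nested inputs like {[{[… cheap to track.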
