
Check input is valid JSON without full parsing #1780

Open

daverigby opened this issue Jan 10, 2022 · Discussed in #1393 · 5 comments

Comments

@daverigby

Discussed in #1393

Originally posted by daverigby January 14, 2021
Hi all,

As per the subject, I'm wondering what's the most efficient way to use simdjson to simply check that an input is valid JSON, without actually parsing out the content. Essentially the same use-case as http://www.json.org/JSON_checker/.

Ideally this would require state which isn't proportional to the size of the input document; for my use-case I need to validate JSON files that are many megabytes in size without consuming a similar amount of memory.
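
For reference, the only way I know to do this with simdjson today is to fully parse the document and discard the result. A minimal sketch using the DOM API (file path taken from the command line):

```cpp
#include <iostream>
#include "simdjson.h"

// Sketch of the approach available today (as far as I know): fully parse
// and discard the result. The parser's internal structures grow with the
// document size, which is the memory cost I'd like to avoid.
bool is_valid_json(const simdjson::padded_string &json) {
  simdjson::dom::parser parser;
  return parser.parse(json).error() == simdjson::SUCCESS;
}

int main(int argc, char **argv) {
  if (argc < 2) { return 1; }
  // Throws (when exceptions are enabled) if the file cannot be read.
  simdjson::padded_string json = simdjson::padded_string::load(argv[1]);
  std::cout << (is_valid_json(json) ? "valid" : "invalid") << std::endl;
  return 0;
}
```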

Thanks in advance.

@daverigby (Author)

I'm still interested in this use-case for simdjson, so it seemed worth converting the discussion into an issue for tracking purposes.

@lemire (Member)

lemire commented Jan 10, 2022

It is an entirely valid use case. We have not done any work toward it at this point in time.

Can you tell us more about your use case?

I am not sure why you'd ever want to check validity without also parsing. If you can provide motivation, this could help this issue along.

@daverigby (Author)

daverigby commented Jan 10, 2022

Thanks Daniel.

The primary use-case is a database system (key-value store) which lets users store documents as one of two main datatypes:

  1. "binary" - an opaque value which cannot really be manipulated server-side;
  2. JSON - which can be manipulated on the server (read individual paths, mutate it to insert new paths, etc.).

Clients can write new documents to the server in either binary or JSON; the server (which is the code I'm responsible for) wants to be able to verify the type of document at the point the client writes it.

We don't blindly trust the datatype the client specifies, primarily because of data-consistency issues. For example, we don't want another client at some point in the future to try to manipulate a supposedly "JSON" document, only to get an error that we cannot parse the field(s) it is trying to access, because a previous client sent us invalid JSON some time in the past.

Broadly speaking, the point at which we accept and store new JSON data isn't necessarily the point at which it is manipulated, so we want to check that it is valid up front rather than waiting until the fields of the JSON data are accessed.

@lemire (Member)

lemire commented Jan 10, 2022

Great answer.

@TysonAndre (Contributor)

Use cases I can think of: datastores that store JSON without parsing it, as the other commenter said, and (hypothetically) database/service clients that validate JSON before sending it to such datastores, or before making other network calls, so as to avoid sending invalid input to external APIs that expect JSON. I haven't had those real-world use cases personally, though.

That would also make it easier to avoid the need for an arbitrary depth limit: there would be no need to allocate an additional 8-byte open_container struct per nesting level; a single byte per level would suffice to track whether each stack element is an object or an array (see the sketch at the end of this comment).

  • e.g. this avoids pathological cases such as requiring ~4.5GB instead of ~0.5GB to validate 1GB of JSON: a worst-case 1GB document can open ~500,000,000 nested containers, and at 9 bytes per level (the 8-byte struct plus the 1-byte object-vs-array marker) that is ~4.5GB of stack, versus ~0.5GB at one byte per level (assuming a validation implementation that doesn't use the C call stack to track state).

This would reduce the amount of memory needed in the rare use cases where you want to effectively ignore any depth limit (by setting the depth limit to the document length / 2) and just check whether the JSON is valid.

> Ideally this would require state which isn't proportional to the size of the input document; for my use-case I need to validate JSON files that are many megabytes in size without consuming a similar amount of memory.

If this continued to have a depth limit, that would be possible: the state would be proportional to the depth limit instead of the document size.

(This is because input such as {[{[{[{[{[… requires a stack of some sort, either manual or implicit through recursion, that uses memory to track which nesting levels are objects and which are arrays.)
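
A rough sketch of the minimal state this needs; this is hypothetical illustration code, not simdjson's implementation, and it deliberately ignores strings, scalars, commas and colons to show only the one-byte-per-level stack:

```cpp
#include <string_view>
#include <vector>

// Hypothetical sketch, not simdjson code: one byte of state per nesting
// level, recording whether each open container is an object ('{') or an
// array ('['). A real validator must also skip over string contents
// (so a '{' inside a string isn't treated as structural) and check
// scalars, commas and colons; this only shows the container stack.
bool containers_balanced(std::string_view json) {
  std::vector<char> stack; // one byte per open container
  for (char c : json) {
    if (c == '{' || c == '[') {
      stack.push_back(c);
    } else if (c == '}') {
      if (stack.empty() || stack.back() != '{') return false;
      stack.pop_back();
    } else if (c == ']') {
      if (stack.empty() || stack.back() != '[') return false;
      stack.pop_back();
    }
  }
  return stack.empty(); // every container must be closed
}
```

Since only object-vs-array needs to be distinguished, the byte per level could in principle be packed down to a single bit, making even deeply nested inputs like {[{[… cheap to track.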
