
do away with padding #174

Open
lemire opened this issue May 22, 2019 · 17 comments · May be fixed by #1665
Labels: design issues (Exploring a change to simdjson's internals), performance, research (Exploration of the unknown)

@lemire (Member) commented May 22, 2019

The padded string approach simplifies the code logic somewhat. However, it is not genuinely required. In time, it should be removed.

@jkeiser (Member) commented Mar 20, 2021

@lemire do you have any thoughts about how to do this? It seems like the biggest usability issue remaining in the library, sometimes genuinely affecting performance for users who get their strings from another library that (reasonably) does not provide buffer padding.

The issues are the string parser, the number parser, and the true/false/null atoms. The string parser in particular seems like it would be hard to keep fast without reading ahead. If there were a performant way to process the last 32 bytes differently from the rest, the problem could go away. Further, for numbers it only matters if we're on the last token; for strings we can check whether the next token's position is within 32 bytes of the end.
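
A minimal sketch of that end-of-buffer check, with illustrative names and constants (not simdjson internals):

```cpp
#include <cstddef>

// A token at byte `pos` of a `len`-byte unpadded buffer can use the usual
// wide reads only if a full SIMD load starting at `pos` stays inside the
// buffer; otherwise it must take a slower, bounds-checked tail path.
// Assumes pos < len.
constexpr std::size_t WIDE_READ_BYTES = 32; // width of one SIMD load

inline bool needs_tail_path(std::size_t pos, std::size_t len) {
  return len - pos < WIDE_READ_BYTES;
}
```

The hope is that a comparison like this is cheap enough to run per token without a measurable hit.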

@jkeiser (Member) commented Mar 20, 2021

(I think it has become a barrier to adoption for some--it's definitely consistently misunderstood, and our most-asked question by far at the moment.)

@jkeiser (Member) commented Mar 20, 2021

Depending on how it's implemented, it would also let us safely expose a conversion from document to value, solving the annoying template problem.

@lemire (Member, Author) commented Mar 20, 2021

@jkeiser

Here is how I would like to solve the problem. Complete stage 1; you get your index, which can be viewed as a series of memory pointers. Take the last few bytes of the document and copy them to a temporary buffer. Adjust the end of the index so that it points at the temporary buffer. You are then done.

Currently, this can't work because our indexes are 32-bit offsets into the document; to point into a separate buffer, they would need to become 64-bit values.

There are some side effects... For example, anyone relying on the pointers falling within the original document would be in trouble, but we could still support that by enabling a correction to recover the original location (since it would not be entirely lost). Also, it involves writing and reading twice as much data for the indexes. However, this would be compensated in part by the fact that translating 32-bit indexes to 64-bit currently requires a bit of work (one instruction per load, maybe?), which we would save.

This would probably entice us even more to "process by block".
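
A rough sketch of that tail-redirection idea, assuming a pointer-style (64-bit) index and illustrative constants; none of this is simdjson's actual code:

```cpp
#include <cstddef>
#include <cstring>
#include <vector>

constexpr std::size_t TAIL = 64;    // assumed: at least one SIMD load wide
constexpr std::size_t PADDING = 32; // assumed scratch padding after the copy

struct indexed_doc {
  std::vector<const char *> tokens; // stage 1 index, as memory pointers
  char tail_copy[TAIL + PADDING];   // padded copy of the document's tail
};

// After stage 1, redirect every index entry that falls in the last TAIL
// bytes of the unpadded document so it points into the padded copy. The
// original position stays recoverable as (p - tail_copy) + tail_start.
void redirect_tail(indexed_doc &doc, const char *buf, std::size_t len) {
  const std::size_t tail_start = len > TAIL ? len - TAIL : 0;
  std::memcpy(doc.tail_copy, buf + tail_start, len - tail_start);
  std::memset(doc.tail_copy + (len - tail_start), 0, PADDING);
  for (auto &p : doc.tokens) {
    if (static_cast<std::size_t>(p - buf) >= tail_start) {
      p = doc.tail_copy + (p - (buf + tail_start));
    }
  }
}
```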

lemire added this to the 1.0 milestone Mar 20, 2021
@lemire (Member, Author) commented Mar 20, 2021

I have marked this "1.0" so that we consider it.

@jkeiser (Member) commented Mar 20, 2021

Yeah, I think it's very reasonable to make it 1.0. It's clearly a barrier to adoption.

That plan could work ... I think we are already using too much memory, though. 4x document size is something I've seen people do a spit take over :)

@jkeiser (Member) commented Mar 20, 2021

I'm looking into what happens if we just special-case the last 32 bytes in the frontend, i.e., how much of a hit we take from adding this comparison.

@lemire (Member, Author) commented Mar 21, 2021

> too much memory, though

See the follow-up: "This would probably entice us even more to 'process by block'."

@lemire (Member, Author) commented Mar 21, 2021

My guess is that if you couple block-based processing with 64-bit indexes, you can probably solve all of these problems: you would reduce memory usage to a constant on large inputs and be able to do away with padding.

It is a significant engineering effort, however.

@jkeiser (Member) commented Mar 21, 2021

Yep, agreed block processing is the right place to go.

I'm not sure we can pull it off before 1.0, though, so I'd like to find out just how much of a hit we'll take, and then we can decide whether it's worth it for increased adoption / fewer issues filed against simdjson.

@jkeiser (Member) commented Mar 21, 2021

I have a branch, jkeiser/no-padding, which checks whether it's stepping off the end of the index buffer. That should take care of any accesses directly at EOF.

@lemire (Member, Author) commented Mar 21, 2021

@jkeiser

We can possibly use templates to have both approaches.

@jkeiser (Member) commented Mar 21, 2021

> We can possibly use templates to have both approaches.

Yeah; I was thinking maybe if you provide a padded_string or padded_string_view, we return a faster padded::document, and if not, we return a regular document.
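
Roughly the shape that could take (stand-in types only, not simdjson's actual API):

```cpp
#include <string_view>

// Stand-in for a padded input type such as padded_string.
struct my_padded_string { std::string_view data; };

namespace padded { struct document { /* fast path: padding guaranteed */ }; }
struct document { /* safe path: bounds-checked near the end of input */ };

struct parser {
  // Padded input: overload resolution picks the faster document type.
  padded::document parse(const my_padded_string &) { return {}; }
  // Unpadded input: fall back to the bounds-checked document type.
  document parse(std::string_view) { return {}; }
};
```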

@lemire (Member, Author) commented Mar 22, 2021

Right. That would be a good idea.

@lemire (Member, Author) commented Mar 25, 2021

Note that PR #1518 seems relevant here. Doing a copy as needed, especially if we can avoid the memory allocation thanks to buffer reuse, can go a long way toward making the padding issue transparent.
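
A minimal sketch of that copy-with-buffer-reuse pattern (not the code in PR #1518; the padding constant is an assumption):

```cpp
#include <cstddef>
#include <cstring>
#include <string_view>
#include <vector>

constexpr std::size_t PADDING = 32; // assumed padding requirement

// Keep one growable padded buffer around and copy each unpadded input
// into it before parsing, so the allocation is amortized across documents.
struct reusable_padded_buffer {
  std::vector<char> storage;

  std::string_view fill(std::string_view unpadded) {
    if (storage.size() < unpadded.size() + PADDING) {
      storage.resize(unpadded.size() + PADDING); // allocates only on growth
    }
    std::memcpy(storage.data(), unpadded.data(), unpadded.size());
    std::memset(storage.data() + unpadded.size(), 0, PADDING);
    return {storage.data(), unpadded.size()};
  }
};
```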

@bobergj (Contributor) commented Apr 20, 2021

Leaving our use case here to give you a data point.

> genuinely affecting performance for users who get their strings from another library that (reasonably) does not provide buffer padding.

In our case, we are experimenting with a Swift library with iterators built on top of the C++ simdjson DOM API. Our JSON data comes from an OS-provided HTTP client API. That API gives us the HTTP response body with the JSON data in a system-allocated buffer. We can't control the allocation at all, and therefore not the padding either.

The largest JSON we consume is a couple of megabytes (as opposed to hundreds of megabytes or gigabytes). Even with the memcpy into the reused internal buffer (from the above-mentioned PR), our simdjson approach is at least 10x faster than what we had. So in our case, this issue is not a blocker to adoption, more of a nice-to-have.

On the other hand, if our JSON data were in the range of hundreds of megabytes, we wouldn't want to memcpy it, especially not on a mobile device. We wouldn't want the whole JSON in memory at all; rather, we would want to feed the JSON data to the parser in pieces, as we receive it from the network.

@lemire (Member, Author) commented Apr 20, 2021

@bobergj Thanks for the feedback.

I would think that if the file is huge, then DOM is pretty much a bad idea to begin with, so you'd want to go with On Demand.
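
For reference, On Demand walks the document lazily instead of materializing a full DOM; a minimal example against a recent simdjson release (the file name is hypothetical):

```cpp
#include "simdjson.h"
#include <iostream>
#include <string_view>

int main() {
  simdjson::ondemand::parser parser;
  // padded_string::load reads the file and supplies the required padding.
  simdjson::padded_string json = simdjson::padded_string::load("big.json");
  simdjson::ondemand::document doc = parser.iterate(json);
  // Values are materialized only as the iterator touches them.
  for (auto field : doc.get_object()) {
    std::string_view key = field.unescaped_key();
    std::cout << key << "\n";
  }
}
```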

lemire added this to In progress in Get simdjson 1.0 out!!! May 24, 2021
lemire linked a pull request Jul 22, 2021 that will close this issue
lemire modified the milestones: 1.0, 2.0 Jul 24, 2021
lemire added the research label Jul 24, 2021
lemire removed this from In progress in Get simdjson 1.0 out!!! Jul 24, 2021
Jereq added a commit to Jereq/load-gltf that referenced this issue Aug 21, 2022
Avoids the copy of the data, saving a measurable but small amount of time.

Simdjson has an open issue for looking into alternatives (simdjson/simdjson#174), but it does not look like a priority.