
do away with padding #174

Open
lemire opened this issue May 22, 2019 · 17 comments · May be fixed by #1665
Labels: design issues (Exploring a change to simdjson's internals), performance, research (Exploration of the unknown)

@lemire (Member) commented May 22, 2019

The padded string approach simplifies the code logic somewhat. However, it is not genuinely required. In time, it should be removed.

@jkeiser (Member) commented Mar 20, 2021

@lemire do you have any thoughts about how to do this? It seems like the biggest usability issue remaining in the library, sometimes genuinely affecting performance for users who get their strings from another library that (reasonably) does not provide buffer padding.

The issues are the string parser, the number parser, and the true/false/null atoms. The string parser in particular seems like it would be hard to keep fast without reading ahead. If there were a performant way to process the last 32 bytes differently from the rest, the problem could go away. Further, for numbers it only matters if we're on the last token; for strings we can check whether the next token's position is within 32 bytes of the end.
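
A minimal sketch of that end-of-buffer check, with illustrative names and constants (not simdjson internals):

```cpp
#include <cstddef>

// A token at byte `pos` of a `len`-byte unpadded buffer can use the usual
// wide reads only if a full SIMD load starting at `pos` stays inside the
// buffer; otherwise it must take a slower, bounds-checked tail path.
// Assumes pos < len.
constexpr std::size_t WIDE_READ_BYTES = 32; // width of one SIMD load

inline bool needs_tail_path(std::size_t pos, std::size_t len) {
  return len - pos < WIDE_READ_BYTES;
}
```

The hope is that a comparison like this is cheap enough to run per token without a measurable hit.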

@jkeiser (Member) commented Mar 20, 2021

(I think it has become a barrier to adoption for some--it's definitely consistently misunderstood, and our most-asked question by far at the moment.)

@jkeiser (Member) commented Mar 20, 2021

Depending on how it's implemented, it would also let us safely expose a conversion from document to value, solving the annoying template problem.

@lemire (Member, Author) commented Mar 20, 2021

@jkeiser

Here is how I would like to solve the problem. Complete stage 1; you get your index, which can be viewed as a series of memory pointers. Take the last few bytes of the document and copy them to a temporary buffer. Adjust the end of the index so that it points at the temporary buffer. You are then done.

Currently, this can't work because our indexes are 32-bit offsets into the document; to point into a separate buffer, they would need to become 64-bit values.

There are some side effects... For example, anyone relying on the pointers falling within the original document would be in trouble, but we could still support that by enabling a correction to recover the original location (since it would not be entirely lost). Also, it involves writing and reading twice as much data for the indexes. However, this would be compensated in part by the fact that translating 32-bit indexes to 64-bit currently requires a bit of work (one instruction per load, maybe?), which we would save.

This would probably entice us even more to "process by block".
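
A rough sketch of that tail-redirection idea, assuming a pointer-style (64-bit) index and illustrative constants; none of this is simdjson's actual code:

```cpp
#include <cstddef>
#include <cstring>
#include <vector>

constexpr std::size_t TAIL = 64;    // assumed: at least one SIMD load wide
constexpr std::size_t PADDING = 32; // assumed scratch padding after the copy

struct indexed_doc {
  std::vector<const char *> tokens; // stage 1 index, as memory pointers
  char tail_copy[TAIL + PADDING];   // padded copy of the document's tail
};

// After stage 1, redirect every index entry that falls in the last TAIL
// bytes of the unpadded document so it points into the padded copy. The
// original position stays recoverable as (p - tail_copy) + tail_start.
void redirect_tail(indexed_doc &doc, const char *buf, std::size_t len) {
  const std::size_t tail_start = len > TAIL ? len - TAIL : 0;
  std::memcpy(doc.tail_copy, buf + tail_start, len - tail_start);
  std::memset(doc.tail_copy + (len - tail_start), 0, PADDING);
  for (auto &p : doc.tokens) {
    if (static_cast<std::size_t>(p - buf) >= tail_start) {
      p = doc.tail_copy + (p - (buf + tail_start));
    }
  }
}
```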

lemire added this to the 1.0 milestone Mar 20, 2021
@lemire (Member, Author) commented Mar 20, 2021

I have marked this "1.0" so that we consider it.

@jkeiser (Member) commented Mar 20, 2021

Yeah, I think it's very reasonable to make it 1.0. It's clearly a barrier to adoption.

That plan could work ... I think we are already using too much memory, though. 4x document size is something I've seen people do a spit take over :)

@jkeiser (Member) commented Mar 20, 2021

I'm looking into what happens if we just special-case the last 32 bytes in the frontend, i.e., how much of a hit we take from adding this comparison.

@lemire (Member, Author) commented Mar 21, 2021

> too much memory, though

See the follow-up: "This would probably entice us even more to 'process by block'."

@lemire (Member, Author) commented Mar 21, 2021

My guess is that if you couple block-based processing with 64-bit indexes, you can probably solve all of these problems: you would reduce memory usage to a constant on large inputs and be able to do away with padding.

It is a significant engineering effort, however.

@jkeiser (Member) commented Mar 21, 2021

Yep, agreed block processing is the right place to go.

I'm not sure we can pull it off before 1.0, though, so I'd like to find out just how much of a hit we'll take, and then we can decide whether it's worth it for increased adoption / fewer issues filed against simdjson.

@jkeiser (Member) commented Mar 21, 2021

I have a branch, jkeiser/no-padding, which checks whether it's stepping off the end of the index buffer. That should take care of any accesses directly at EOF.

@lemire (Member, Author) commented Mar 21, 2021

@jkeiser

We can possibly use templates to have both approaches.

@jkeiser (Member) commented Mar 21, 2021

> We can possibly use templates to have both approaches.

Yeah; I was thinking maybe if you provide a padded_string or padded_string_view, we return a faster padded::document, and if not, we return a regular document.
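
Roughly the shape that could take (stand-in types only, not simdjson's actual API):

```cpp
#include <string_view>

// Stand-in for a padded input type such as padded_string.
struct my_padded_string { std::string_view data; };

namespace padded { struct document { /* fast path: padding guaranteed */ }; }
struct document { /* safe path: bounds-checked near the end of input */ };

struct parser {
  // Padded input: overload resolution picks the faster document type.
  padded::document parse(const my_padded_string &) { return {}; }
  // Unpadded input: fall back to the bounds-checked document type.
  document parse(std::string_view) { return {}; }
};
```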

@lemire (Member, Author) commented Mar 22, 2021

Right. That would be a good idea.

@lemire (Member, Author) commented Mar 25, 2021

Note that PR #1518 seems relevant here. Doing a copy as needed, especially if we can avoid the memory allocation thanks to buffer reuse, can go a long way toward making the padding issue transparent.
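
A minimal sketch of that copy-with-buffer-reuse pattern (not the code in PR #1518; the padding constant is an assumption):

```cpp
#include <cstddef>
#include <cstring>
#include <string_view>
#include <vector>

constexpr std::size_t PADDING = 32; // assumed padding requirement

// Keep one growable padded buffer around and copy each unpadded input
// into it before parsing, so the allocation is amortized across documents.
struct reusable_padded_buffer {
  std::vector<char> storage;

  std::string_view fill(std::string_view unpadded) {
    if (storage.size() < unpadded.size() + PADDING) {
      storage.resize(unpadded.size() + PADDING); // allocates only on growth
    }
    std::memcpy(storage.data(), unpadded.data(), unpadded.size());
    std::memset(storage.data() + unpadded.size(), 0, PADDING);
    return {storage.data(), unpadded.size()};
  }
};
```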

@bobergj (Contributor) commented Apr 20, 2021

Leaving our use case here to give you a data point.

> genuinely affecting performance for users who get their strings from another library that (reasonably) does not provide buffer padding.

In our case, we are experimenting with a Swift library with iterators built on top of the C++ simdjson DOM API. Our JSON data comes from an OS-provided HTTP client API. That API gives us the HTTP response body with the JSON data in a system-allocated buffer. We can't control the allocation at all, and therefore not the padding either.

The largest JSON we consume is a couple of megabytes (as opposed to hundreds of megabytes or gigabytes). Even with the memcpy into the reused internal buffer (from the above-mentioned PR), our simdjson approach is at least 10x faster than what we had. So in our case, this issue is not a blocker to adoption, more of a nice-to-have.

On the other hand, if our JSON data were in the range of hundreds of megabytes, we wouldn't want to memcpy it, especially not on a mobile device. We wouldn't want the whole JSON in memory at all; rather, we would want to feed the JSON data to the parser in pieces, as we receive it from the network.

@lemire (Member, Author) commented Apr 20, 2021

@bobergj Thanks for the feedback.

I would think that if the file is huge, then DOM is pretty much a bad idea to begin with, so you'd want to go with On Demand.
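
For reference, On Demand walks the document lazily instead of materializing a full DOM; a minimal example against a recent simdjson release (the file name is hypothetical):

```cpp
#include "simdjson.h"
#include <iostream>
#include <string_view>

int main() {
  simdjson::ondemand::parser parser;
  // padded_string::load reads the file and supplies the required padding.
  simdjson::padded_string json = simdjson::padded_string::load("big.json");
  simdjson::ondemand::document doc = parser.iterate(json);
  // Values are materialized only as the iterator touches them.
  for (auto field : doc.get_object()) {
    std::string_view key = field.unescaped_key();
    std::cout << key << "\n";
  }
}
```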

lemire added this to In progress in Get simdjson 1.0 out!!! May 24, 2021
lemire linked a pull request Jul 22, 2021 that will close this issue
lemire modified the milestones: 1.0, 2.0 Jul 24, 2021
lemire added the research label Jul 24, 2021
lemire removed this from In progress in Get simdjson 1.0 out!!! Jul 24, 2021
Jereq added a commit to Jereq/load-gltf that referenced this issue Aug 21, 2022
Avoids the copy of the data, saving a measurable but small amount of time.

Simdjson has an open issue for looking into alternatives (simdjson/simdjson#174), but it does not look like a priority.