do away with padding #174
Padding should go away once #174 is resolved.
@lemire do you have any thoughts about how to do this? It seems like the biggest usability issue remaining in the library, and it sometimes genuinely affects performance for users who get their strings from another library that (reasonably) does not provide buffer padding. The affected parsers are the string parser, the number parser, and the true/false/null atoms. The string parser in particular seems hard to make fast without reading ahead. If there were a performant way to process the last 32 bytes differently from the rest, the problem could go away. Further, for numbers it only matters if we're on the last token, and for strings we can check whether the next token's position is within 32 bytes of the end.
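For concreteness, a minimal sketch of that last-32-bytes idea (invented code, not simdjson's actual scanner; the chunked inner loop stands in for a SIMD compare, and escape handling is omitted):

```cpp
// Find the closing quote of a JSON string without ever reading past `end`.
// The 32-byte chunk loop stands in for a vectorized compare.
const char *find_closing_quote(const char *p, const char *end) {
  while (end - p >= 32) {            // fast path: a full 32-byte block remains
    for (int i = 0; i < 32; i++) {   // placeholder for a SIMD quote search
      if (p[i] == '"') return p + i;
    }
    p += 32;
  }
  while (p < end) {                  // careful path: the last < 32 bytes
    if (*p == '"') return p;
    p++;
  }
  return nullptr;                    // unterminated string
}
```

The cost being debated is essentially that extra `end - p >= 32` comparison on the hot path.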
(I think it has become a barrier to adoption for some--it's definitely consistently misunderstood, and our most-asked question by far at the moment.)
Depending on how it's implemented, too, it would let us safely expose a conversion from …
Here is how I would like to solve the problem. Complete stage 1. You get your index. Your index can be viewed as a series of memory pointers. Take the last few bytes of the document and copy them to a temporary buffer. Adjust the end of the index so that it points at the temporary buffer. You are then done.

Currently, this can't work because our indexes are 32-bit values. We would need to move them to 64-bit values. There are some side effects... For example, anyone relying on the pointers to be within the original document would be in trouble, but we could still support that by enabling a correction to recover the original location (since it would not be entirely lost). Also, it involves writing and reading twice as much data related to the indexes. However, this would be compensated in part by the fact that translating 32-bit indexes to 64-bit indexes requires a bit of work (one instruction per load, maybe?). This would probably entice us even more to "process by block".
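A rough sketch of that repointing step (hypothetical code with invented names; 64-bit pointers stand in for the proposed 64-bit index entries):

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

constexpr size_t PADDING = 32;

// After stage 1, redirect index entries near the end of the document into a
// small padded copy of the tail, so stage 2 can read ahead safely.
void repoint_tail(std::vector<const uint8_t *> &index,  // 64-bit pointers
                  const uint8_t *buf, size_t len,
                  uint8_t *tail /* at least 2 * PADDING bytes */) {
  if (len < PADDING) return;  // degenerate case elided in this sketch
  const uint8_t *tail_start = buf + len - PADDING;
  std::memcpy(tail, tail_start, PADDING);   // copy the last 32 bytes...
  std::memset(tail + PADDING, 0, PADDING);  // ...and append the padding
  for (auto &p : index) {
    if (p >= tail_start) {
      p = tail + (p - tail_start);          // redirect into the padded copy
    }
  }
  // The correction mentioned above is just the inverse mapping:
  // original = tail_start + (p - tail).
}
```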
I have marked this "1.0" so that we consider it.
Yeah, I think it's very reasonable to make it 1.0. It's clearly a barrier to adoption. That plan could work ... I think we are already using too much memory, though. 4x document size is something I've seen people do a spit take over :)
I'm looking into what happens if we just special-case the last 32 bytes in the frontend--i.e. how much of a hit we take adding this comparison.
See the follow-up: "This would probably entice us even more to 'process by block'."
My guess is that if you couple block-based processing with 64-bit indexes, you can probably solve all of these problems. You would reduce memory usage to a constant amount on large inputs, and be able to do away with padding. It is a significant engineering effort, however.
Yep, agreed block processing is the right place to go. I'm not sure we can pull it off before 1.0, though, so I'd like to find out just how much of a hit we'll take, and then we can decide whether it's worth it for increased adoption / fewer issues filed against simdjson.
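For illustration, a drastically simplified sketch of what "process by block" could mean (invented code; `index_block` is a naive stand-in that does not exclude structurals inside strings, and handling tokens that span block boundaries is the hard part this glosses over):

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr size_t BLOCK = 1 << 20;  // 1 MiB window; the real size is a tuning choice

// Stand-in for stage 1: record absolute (64-bit) positions of structurals.
static void index_block(const uint8_t *buf, size_t off, size_t n,
                        std::vector<uint64_t> &out) {
  for (size_t i = 0; i < n; i++) {
    uint8_t c = buf[off + i];
    if (c == '{' || c == '}' || c == '[' || c == ']' || c == ':' || c == ',')
      out.push_back(off + i);  // 64-bit entries, as discussed above
  }
}

void process(const uint8_t *buf, size_t len) {
  std::vector<uint64_t> index;  // reused for every block: constant memory
  for (size_t off = 0; off < len; off += BLOCK) {
    size_t n = (len - off < BLOCK) ? (len - off) : BLOCK;
    index.clear();
    index_block(buf, off, n, index);
    // ... stage 2 would consume this block's index here before moving on ...
  }
}
```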
I have a branch, jkeiser/no-padding, which checks whether it's stepping off the end of the index buffer. That should take care of any accesses directly at EOF.
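The guard might take a shape like this (a hypothetical sketch, not code from that branch):

```cpp
#include <cstdint>

// Iterate over stage 1's structural indexes, refusing to advance past the
// last one instead of assuming padding makes an over-read harmless.
struct structural_iterator {
  const uint32_t *next_structural;  // positions produced by stage 1
  const uint32_t *index_end;        // one past the last structural index

  // Returns false at EOF rather than reading a padded sentinel.
  bool advance(uint32_t &out) {
    if (next_structural == index_end) return false;  // stepping off the end
    out = *next_structural++;
    return true;
  }
};
```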
We can possibly use templates to have both approaches.
Yeah; I was thinking maybe if you provide a padded_string or padded_string_view, we return a faster …
Right. That would be a good idea.
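Sketching that dispatch (a hypothetical API, not what simdjson actually exposes; `padded_string` here is a simplified stand-in for the real simdjson::padded_string): an overload for padded input parses in place, while the generic overload pays for one copy.

```cpp
#include <string>
#include <string_view>

struct padded_string { std::string data; };  // simplified stand-in
struct parse_result {};                       // stand-in for a parsed document

// Fast path: input already has 32 readable bytes past the end; parse in place.
parse_result parse(const padded_string &s) {
  // ... real parsing elided ...
  return {};
}

// Slow path: copy once into a padded buffer, then reuse the fast path.
parse_result parse(std::string_view s) {
  padded_string p;
  p.data.reserve(s.size() + 32);
  p.data.assign(s.begin(), s.end());
  p.data.append(32, '\0');  // the 32 padding bytes
  return parse(p);
}
```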
Note that PR #1518 seems relevant here. Doing a copy as needed, especially if we can avoid the memory allocation due to buffer reuse, can go a long way toward making the padding issue transparent.
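The buffer reuse matters because the allocation, not the memcpy, is usually the expensive part. Something along these lines (a hypothetical sketch, not the PR's code):

```cpp
#include <cstddef>
#include <cstring>
#include <string_view>
#include <vector>

// Copy unpadded input into a grow-only scratch buffer that is reused across
// documents, so the allocation cost is amortized and only the memcpy remains.
class reusing_parser {
  std::vector<char> buf_;  // reused across calls
public:
  const char *load_unpadded(std::string_view json) {
    size_t needed = json.size() + 32;               // document + padding
    if (buf_.size() < needed) buf_.resize(needed);  // allocate rarely
    std::memcpy(buf_.data(), json.data(), json.size());
    std::memset(buf_.data() + json.size(), 0, 32);  // zero the padding
    return buf_.data();  // now safe for a padding-requiring parser
  }
};
```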
Leaving our use case here to give you a data point.
In our case, we are experimenting with a Swift library with iterators built on top of the C++ simdjson DOM API. Our JSON data comes from an OS-provided HTTP client API. That API gives us the HTTP response body with the JSON data in a system-allocated buffer. We can't control the allocation at all, and thus neither the padding. The largest JSON we consume is a couple of megabytes (as opposed to hundreds of megabytes or gigabytes). Even with the memcpy to the reused internal buffer (in the above-mentioned PR), our simdjson approach is at least 10x faster than what we had. On the other hand, if our JSON data were in the range of hundreds of megabytes, we wouldn't want to memcpy it, especially not on a mobile device. We wouldn't want to have the whole JSON in memory at all; rather, we would want to feed the JSON data to the parser in pieces, as we receive it from the network.
@bobergj Thanks for the feedback. I would think that if the file is huge, then DOM is pretty much a bad idea to begin with, so you'd want to go with On Demand.
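For reference, On Demand is used roughly like this (following the simdjson quick-start; note that it, too, currently expects padded input, which padded_string::load provides):

```cpp
#include <iostream>
#include "simdjson.h"
using namespace simdjson;

int main() {
  ondemand::parser parser;
  // load() reads the file and adds the required padding itself.
  padded_string json = padded_string::load("twitter.json");
  ondemand::document tweets = parser.iterate(json);
  // Values are materialized lazily as they are visited, so even a large
  // document is never fully built in memory the way a DOM tree would be.
  std::cout << uint64_t(tweets["search_metadata"]["count"]) << " results." << std::endl;
}
```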
Avoids the copy of the data, saving a measurable but small amount of time. Simdjson has an open issue for looking into alternatives, simdjson/simdjson#174, but it does not look like a priority.
The padded string approach simplifies the code logic somewhat. However, it is not genuinely required. In time, it should be removed.