Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improved implementation of PrefixMapStd #1475

Draft
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

Aklakan
Copy link
Contributor

@Aklakan Aklakan commented Aug 9, 2022

GitHub issue resolved #1474

Reimplementation of PrefixMapStd to combine "fast-track" with trie backing.


  • Tests are included. (All existing tests apply - no new ones created)
  • Documentation change and updates are provided for the Apache Jena website
  • Commits have been squashed to remove intermediate development commit messages.
  • Key commit messages start with the issue number (GH-xxxx or JENA-xxxx)

By submitting this pull request, I acknowledge that I am making a contribution to the Apache Software Foundation under the terms and conditions of the Contributor's Agreement.


See the Apache Jena "Contributing" guide.

@Aklakan
Copy link
Contributor Author

Aklakan commented Aug 24, 2022

This PR is ready for review.

The small extra changes are:
Longest-prefix-lookup-cache invalidation now only happens if there is an actual change; adding duplicates or removing non-existing entries do not trigger invalidation.
Also, I wasn't sure about the thread-safety of PrefixMapStd - probably its better to have it so I added RWL locking.

Is there a place where the utility methods calcWithLock(Lock, Supplier) and runWithLock(Lock, Runnable) could be publicly added - or maybe they already exist in one of the dependencies?

@Aklakan
Copy link
Contributor Author

Aklakan commented Sep 17, 2022

I converted back to draft because I have some pending updates which I need to benchmark against the well-working parts of the original PrefixMapStd (IRIs with / or #) first in order to determine whether performance-wise it would make sense to include them.

The general idea is to have PrefixMap implementation that auto-adapts to a given workload - such as parsing lots of queries without the overhead of updating reverse-lookup structures as well as updating them on-demand upon writing out RDF.

In essence the direction I am investigating is about building/updating the reverse-lookup (iri-to-prefix) structures lazily (upon abbreviating). This would buffer prefix-iri inserts/deletions similar to your BufferingPrefixMap; upon abbreviate only the delta is materialized into the reverse-lookup structures.

@afs
Copy link
Member

afs commented Sep 17, 2022

Another choice is to restrict to the basic case of prefix at the final "/", "#" and ":" (for URNs). Only have the "fast path" abbreviate.

Do you have cases where abbreviation is not one of these?

If you are going for the complicated version,maybe the best way is to have a new PrefixMapCaching and leave PrefixMapStd.

@Aklakan
Copy link
Contributor Author

Aklakan commented Sep 19, 2022

Another choice is to restrict to the basic case of prefix at the final "/", "#" and ":" (for URNs). Only have the "fast path" abbreviate.

I think your suggestions of including ':' in the list and only using fast path (without resorting to scanning) would work efficiently and be sufficient for the vast majority of use cases. Without scanning, even relative IRIs (that do not contain any of the fast track chars) wouldn't cause problems.

Do you have cases where abbreviation is not one of these?

Right now I only have some initial experiments where I abuse the prefix map as a poor-mans dictionary encoding in order to reduce the amount of bytes that need to be parsed in order to produce triples/quads. For this I am using trie-based lookups to encode the data (so IRIs can be split anywhere), but I have yet to evaluate whether this actually gives a noticeable performance boost.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

PrefixMapStd is very slow for lookups that 'miss'
2 participants