Multiple remote query executions merged together due to timestamp clash #40

Open
miguel76 opened this issue Sep 9, 2022 · 2 comments

Comments

@miguel76
Contributor

miguel76 commented Sep 9, 2022

I noticed that in some of the published datasets, single instances of lsqv:RemoteExec have multiple values for properties like lsqv:hostHash and lsqv:uri, which (conceptually) should be functional.
After further analysing the data, and later the source code, I discovered the cause: if the timestamp is available (which I guess is most of the time), it is used, alongside the service id, to build the IRI for the remote execution.
The problem is exacerbated in the case of the dbpedia.3.5.1 log, because for some reason the timestamps are truncated to the hour, and hence blocks of several executions are merged together.
But it also happens easily in other cases (for sure in the case of the bioportal log), because multiple query executions may be logged in the same second.

My suggestion is to either always use the sequential id (the easiest hack, I guess) or add a mechanism to differentiate the IRIs when the timestamp is the same.
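
To illustrate the clash and the sequential-id workaround, here is a minimal Python sketch. The IRI layout (`BASE`), the helper names `timestamp_iri` and `sequential_iri`, and the log entries are all made up for illustration; they are not LSQ's actual code or data.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical IRI layout; the real LSQ pattern may differ.
BASE = "http://example.org/lsq/re-"


def timestamp_iri(service_id: str, ts: datetime) -> str:
    """Build an IRI from the service id and a timestamp at second precision."""
    return f"{BASE}{service_id}_{ts.strftime('%Y-%m-%dT%H:%M:%S')}"


def sequential_iri(service_id: str, seq_id: int) -> str:
    """Build an IRI from the service id and a running sequence number."""
    return f"{BASE}{service_id}_{seq_id}"


# Made-up log entries: (timestamp, host hash). The first two share a second.
entries = [
    (datetime(2022, 9, 9, 12, 0, 0), "hostA"),
    (datetime(2022, 9, 9, 12, 0, 0), "hostB"),
    (datetime(2022, 9, 9, 12, 0, 1), "hostC"),
]

# Collect the host hashes that end up attached to each IRI.
by_timestamp = defaultdict(set)
by_sequence = defaultdict(set)
for seq_id, (ts, host_hash) in enumerate(entries):
    by_timestamp[timestamp_iri("svc", ts)].add(host_hash)
    by_sequence[sequential_iri("svc", seq_id)].add(host_hash)

# IRIs that received more than one host hash, i.e. merged executions.
clashes = {iri: hosts for iri, hosts in by_timestamp.items() if len(hosts) > 1}
```

With timestamp-based IRIs, the first two entries collapse into one resource carrying two host hashes; with sequential ids, every entry keeps its own IRI.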

@Aklakan
Member

Aklakan commented Sep 23, 2022

I guess the better solution is the middle ground: prefer the timestamp, but if there are clashes, start using sequential ids within that timestamp. The idea is that even when processing only a subset of a log, one would get the same RDF data for the remote executions, because a request can usually be globally identified by its destination URL and timestamp. With a sequence ID alone, this information would be lost completely.
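
A rough Python sketch of this middle ground, under assumptions: the IRI layout (`BASE`) and the function `assign_iris` are hypothetical names for illustration, not part of LSQ. The first execution seen for a timestamp keeps the plain timestamp IRI, and later ones with the same timestamp get a per-timestamp counter appended.

```python
from collections import Counter
from datetime import datetime
from typing import Iterable, Iterator

# Hypothetical IRI layout; the real LSQ pattern may differ.
BASE = "http://example.org/lsq/re-"


def assign_iris(service_id: str, timestamps: Iterable[datetime]) -> Iterator[str]:
    """Yield one IRI per execution, adding a suffix only on timestamp clashes."""
    seen = Counter()  # how many executions were already seen per timestamp
    for ts in timestamps:
        key = ts.strftime("%Y-%m-%dT%H:%M:%S")
        n = seen[key]
        seen[key] += 1
        # First occurrence keeps the plain timestamp IRI; later ones get _1, _2, ...
        suffix = "" if n == 0 else f"_{n}"
        yield f"{BASE}{service_id}_{key}{suffix}"


t0 = datetime(2022, 9, 9, 12, 0, 0)
t1 = datetime(2022, 9, 9, 12, 0, 1)
iris = list(assign_iris("svc", [t0, t0, t1]))
```

In this sketch the suffix depends on the order in which clashing entries are read, so IRIs stay stable across log subsets only for executions whose timestamp is unique within the service.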

@miguel76
Contributor Author

I proposed a simple solution. Obviously other solutions may be found, but I think this is an important bug and should be addressed. The advantage of this solution is that it does not add any bottleneck.
If the problem is preserving the ID when processing a subset of a log, this could be addressed separately if LSQ is aware of the log line on which it starts (this could be a parameter), by starting seqId from an appropriate initial value.
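
A small Python sketch of the offset idea, assuming a hypothetical `start_seq_id` parameter and a made-up log (neither is an existing LSQ option):

```python
from typing import Iterable, Iterator, Tuple


def number_lines(lines: Iterable[str], start_seq_id: int = 0) -> Iterator[Tuple[int, str]]:
    """Pair each log line with a sequential id, starting at start_seq_id."""
    return enumerate(lines, start=start_seq_id)


# Made-up log with five entries.
full_log = ["q0", "q1", "q2", "q3", "q4"]

# Processing the whole log from the beginning.
full = list(number_lines(full_log))

# Processing only the tail, telling the numbering where the subset starts.
subset = list(number_lines(full_log[3:], start_seq_id=3))
```

Here `subset` carries the same ids as the corresponding entries in `full`, as long as the caller passes the correct starting line.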
