
Bulk unindexing for IndexerInterface #598

Open · wants to merge 12 commits into base: dev
Conversation

@Enet4 (Collaborator) commented Jul 21, 2022

This proposes an extension to the IndexerInterface so that API consumers can request multiple items to be unindexed at once. Resolves #594.

This extension to the API should be carefully reviewed and evaluated to ensure that implementers can use it properly and that it can be well integrated into existing workflows.

Breaking changes are minimal. It is only a problem if an existing plugin already has a method with the exact same signature, which is unlikely. Hence, we are in a position to deliver this in 3.2.0, or postpone it to version 4.
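To illustrate why the extension is nearly non-breaking, here is a minimal sketch under assumed signatures (the actual method names and return types in the PR may differ): a default method on the interface delegates bulk unindexing to the existing single-item method, so existing plugins keep compiling unless they already declare a method with the exact same signature.

```java
import java.net.URI;
import java.util.Collection;

// Hypothetical simplification of IndexerInterface; real Dicoogle signatures differ.
interface IndexerSketch {
    /** Existing single-item unindex (simplified). */
    boolean unindex(URI uri);

    /** Proposed bulk variant. The default implementation delegates to the
     *  single-item method, so existing implementers are unaffected unless
     *  they already declare a method with this exact signature. */
    default int unindex(Collection<URI> uris) {
        int removed = 0;
        for (URI uri : uris) {
            if (unindex(uri)) {
                removed++;
            }
        }
        return removed;
    }
}
```

A plugin that wants efficient batched removal can then override the bulk variant, while older plugins silently inherit the per-item fallback.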


To ensure some level of quality in this suggestion, we should have:

  • at least one proof of concept plugin implementation
    • We could take this as an opportunity to move forward with hosting the open plugins on GitHub.
  • at least one demonstration that the Dicoogle core can consume this method in a usable way, without downsides
    • I suggest extending the unindexing service so that it accepts multiple URIs

Problems resolved in the latest version:

  • There was no way to retrieve the task's unindexing progress from another thread. The method now takes a second parameter, a progress callback.
  • Even if the indexer periodically flushes some unindexing operations during the task, other routines could not act on URIs that had already been unindexed until the task ended completely. This posed an issue for file removal tasks, which would have to wait for the entire unindexing to finish before starting to clean up the files. The operation is now asynchronous.
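The two fixes above can be illustrated together. This is a hypothetical sketch, not the PR's actual API: bulk unindexing runs asynchronously (approximated here with a `CompletableFuture`, where the PR returns a `Task` like `index` does) and reports progress through a callback, so a file-removal routine can delete each file as soon as its URI has been unindexed.

```java
import java.net.URI;
import java.util.Collection;
import java.util.concurrent.CompletableFuture;
import java.util.function.Consumer;

// Hypothetical sketch; names and types are illustrative, not the PR's own.
class BulkUnindexSketch {
    /** Returns immediately. `onUnindexed` is invoked for each URI as soon as
     *  it has been removed from the index, so callers such as a file-removal
     *  task need not wait for the whole batch to finish. */
    static CompletableFuture<Integer> unindex(Collection<URI> uris,
                                              Consumer<URI> onUnindexed) {
        return CompletableFuture.supplyAsync(() -> {
            int removed = 0;
            for (URI uri : uris) {
                // ... actual removal from the index would happen here ...
                onUnindexed.accept(uri);
                removed++;
            }
            return removed;
        });
    }
}
```

A consumer can pass `deletedFiles::add` (or a method that deletes the file directly) as the callback and only `join()` the future when it needs the final count.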

Known caveats:

  • It is not clear whether we can properly constrain the number of parallel unindexing tasks by pushing them to our indexing task pool. The current indexing task manager expects task objects to return a particular kind of report, which UnindexReport does not implement. Perhaps UnindexReport should extend Report? As of the latest version, unindexing tasks can be dispatched by the indexer thread pool.
  • Having a second parameter to keep track of progress is quirky. One would naturally expect an UnindexTask<T> extending Task<T>, but unindexing is very different from indexing and this would make it even harder to attach to the task pool.
  • The resulting object is too complex and granular. While there is a lot of power in being able to specify an error for each URI, files are not always unindexed independently, so it may be hard to build such fine-grained results in batched unindexing.

@bastiao (Member) commented Jul 22, 2022

My opinion is to have it in 3.2.0.
Thinking aloud: my main concern about the current new API is that it cannot report which URIs were unindexed (or perhaps which ones failed). But I can take a look at our use case and see how it would look.

@Enet4 (Collaborator, Author) commented Jul 22, 2022

> my main concern about the current new API is that it cannot report which URIs were unindexed (or perhaps which ones failed).

That is true. I feared that including that information would overcomplicate the interface, but the cost of not knowing which files were unindexed and which were not is possibly too high. I will revise this.

@bastiao (Member) commented Aug 25, 2022

Do you have plans to "un-draft" this PR?

@Enet4 (Collaborator, Author) commented Aug 25, 2022

> Do you have plans to "un-draft" this PR?

I have updated the PR with a plan. It should not be marked ready for review until there is evidence that this API works well for both implementers and consumers. We also have two known problems, and further thinking is needed to either resolve them or accept them as too minor to block the extension. Feedback is more than welcome.

@Enet4 (Collaborator, Author) commented Oct 15, 2022

I updated the API to include a progress callback, and updated the root message with the known caveats.

@bastiao (Member) commented Jul 8, 2023

The bulk unindex API can be beneficial in certain scenarios, and we are currently approaching a milestone where new APIs will be incorporated. It would be appreciated if you could consider undrafting this pull request, as it has been pending for quite some time. If necessary, we can revisit and iterate on it at a later stage.

@Enet4 (Collaborator, Author) commented Jul 21, 2023

> The bulk unindex API can be beneficial in certain scenarios, and we are currently approaching a milestone where new APIs will be incorporated. It would be appreciated if you could consider undrafting this pull request, as it has been pending for quite some time. If necessary, we can revisit and iterate on it at a later stage.

I have integrated bulk unindexing into the core unindex Web service. It would be better to also have a proof-of-concept implementation other than the default one.

@Enet4 Enet4 self-assigned this Aug 31, 2023
@Enet4 Enet4 marked this pull request as ready for review March 7, 2024 11:05
@Enet4 Enet4 added this to the 3.4.0 milestone Mar 7, 2024
@Enet4 (Collaborator, Author) commented Mar 25, 2024

I added this point to the known caveats.

> The resulting object is too complex and granular. While there is a lot of power in being able to specify an error for each URI, it is not always the case that files will be unindexed independently, so it may be hard to build such results in batched unindexing.

After trying to implement this interface, I am convinced that the API needs to be redesigned to be easier to implement.

Enet4 added 12 commits April 17, 2024 15:52
- for unindexing in bulk
- clarify that both unindex methods are synchronous,
  unlike the indexing ones
- add `UnindexReport` class and nested classes
   - for containing errors which may occur in bulk unindexing
- change `IndexerInterface#unindex(Collection<URI>)`
   - returns `UnindexReport`
   - can throw `IOException`
- make it asynchronous: returns a `Task` like in `index`
- add second parameter for keeping track of progress
- clarify that it returns a task
- remove unused import
- can only handle one indexer at a time,
  but other than that it works
- remove deprecated method call #handles,
  check scheme instead
- record a collection of URIs in each unindex failure
@Enet4 (Collaborator, Author) commented Apr 17, 2024

I have readjusted the unindex report so that each failure may specify multiple URIs, which better reflects what will happen in most implementations.
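As an illustration of the readjusted report shape (hypothetical names; the real UnindexReport in the PR may differ), each failure carries a whole group of URIs that failed for the same reason, since batched implementations rarely fail one item at a time:

```java
import java.net.URI;
import java.util.Collection;
import java.util.List;

// Hypothetical sketch of a report where one failure covers multiple URIs.
class UnindexReportSketch {
    static final class Failure {
        final Collection<URI> uris;   // all items affected by this failure
        final Exception cause;        // shared cause for the whole group
        Failure(Collection<URI> uris, Exception cause) {
            this.uris = uris;
            this.cause = cause;
        }
    }

    final List<Failure> failures;
    UnindexReportSketch(List<Failure> failures) {
        this.failures = failures;
    }

    /** Total number of items that could not be unindexed. */
    int failedFileCount() {
        return failures.stream().mapToInt(f -> f.uris.size()).sum();
    }
}
```

An implementer that flushes removals in batches can then record one `Failure` per failed batch instead of having to attribute an error to every URI individually.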

Successfully merging this pull request may close these issues:

  • Indexer batched unindexing