Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement find as an iterator in SBT to provide library users more options #2665

Open
wants to merge 7 commits into
base: latest
Choose a base branch
from

Conversation

morsecodist
Copy link

Hello! I work on Chan Zuckerberg Infectious Disease and we are working on using sourmash for removing redundant sequences. You can take a look at some of our efforts here, though we are still in the exploratory stages so there is not much there in the way of documentation. We are using an approach where if any sequence we already have is closer than the threshold to a given sequence we want to omit that sequence. I am using sourmash as a library for this. I have implemented my own slightly different SBT heavily based on what you already have here for our use case. I think that all or some of this may make a nice addition to the library.

One simple piece of low hanging fruit that I feel would benefit all users of the library is implementing find as an iterator. If we find a similar sequence we want to stop the search instead of continuing to search the whole tree. I was going to implement a method that either checks for the presence of a signature or returns the first one. However, I realized an iterator solves a lot of these use cases simultaneously. Users of the library will be able to stop the search after an arbitrary number of matches as well as do things like filtering and mapping without adding filtered elements to the final vector.

This is a minimal addition. I didn't add an iterator version of find to the Index trait yet but if you are OK with going in this direction we can merge in this change and I would be happy to implement more iterators for more Indexes. We can also use the default implementation which already uses an iterator and converts it to a vector.

I didn't end up writing new tests because I modified the existing find method to depend on the iterator implementation. This way it is covered by the existing find test cases.

@morsecodist morsecodist changed the title Find iter lifetimes Implement find as an iterator in SBT to provide library users more options Jul 3, 2023
@ctb ctb added the rust label Jul 3, 2023
@ctb
Copy link
Contributor

ctb commented Jul 3, 2023

hi @morsecodist this sounds fantastic! thanks for engaging!

note to @luizirber this is touching the Rust code / implementation of SBT!

(I've approved your PR for GitHub actions, also.)

A few hot takes -

thanks again!

@codecov
Copy link

codecov bot commented Jul 3, 2023

Codecov Report

Attention: Patch coverage is 0% with 1 lines in your changes are missing coverage. Please review.

Project coverage is 86.78%. Comparing base (fe4790f) to head (9582770).

Files Patch % Lines
src/core/src/index/mod.rs 0.00% 1 Missing ⚠️
Additional details and impacted files
@@           Coverage Diff           @@
##           latest    #2665   +/-   ##
=======================================
  Coverage   86.78%   86.78%           
=======================================
  Files         136      136           
  Lines       15524    15524           
  Branches     2626     2626           
=======================================
  Hits        13472    13472           
  Misses       1751     1751           
  Partials      301      301           
Flag Coverage Δ
hypothesis-py 25.53% <ø> (ø)
python 92.86% <ø> (ø)
rust 61.21% <0.00%> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@morsecodist
Copy link
Author

@ctb Thanks for the quick response and for linking. My tree implementation makes use of the principle from this issue. I would love to discuss folding that in though if there is a better place than this PR let me know. My tree has a few key differences:

  • It uses containment to avoid searching down certain tree branches. This was a huge deal for our use case because we are only adding sequences to the tree if they are far away from the existing sequences so we are likely to be able to give up fairly early. This works for both jaccard and containment search functions because jaccard will always be less than or equal to containment. However we run into difficulties with the API because we can't use this method given an arbitrary search function. We would need to somehow constrain it or perhaps require two search functions with the properties we need.
  • It doesn't differentiate between internal and leaf nodes. This lets me add to the tree without shuffling. Each node serves a dual role as an internal node and a leaf node. This seemed to make a difference to performance.
  • It uses rayon to search levels of the tree in parallel.
  • It support arbitrary branching factors. I noted that tweaking the branching factor can make a bit of a performance difference as well, especially when combined with per-level parallelism.

I would be happy to go into more detail about it. I was planning on adding some of these elements separately either to the core SBT tree or making a new implementation of it with these properties. This is all compatible with find as an iterator so I think it may make sense to land this less disruptive change first then maybe build some more optimizations on top of it.

@ctb
Copy link
Contributor

ctb commented Jul 3, 2023

hey, and look, all your tests passed 🎉

I also forgot to mention this hot take: we have an early-stage but (in my experience) quite functional plug-in system, see #2428 and #2438. It should be easy to add your own experimental Index implementations without contributing to core, if and as you wish to go that way. Happy to chat more about this as you have questions.

I like your incrementalist approach!

@luizirber
Copy link
Member

luizirber commented Jul 3, 2023

Wholeheartedly agree with the find changes! The Index trait design still suffers from being one of the first things I tried out/designed in Rust... So Iterators make a lot of sense where you added them =]

In #2230 I... removed the SBT code. Because I figured no one was using it. Should it be kept around?
(I should also probably split #2230 into the RevIndex/mastiff implementation, and all the other changes to the API, including some concepts available in Python, like Manifests and Picklists...)

@morsecodist
Copy link
Author

@luizirber for my tree I actually am not using any of the SBT code and it is much more minimal than what you have for SBTs. From what it sounds like if you are open to having this index type there are two ways forward:

  • I could implement it as an SBT and keep the existing SBT code
  • I could add it as a new more slimmed down tree index implementation

Would you be OK with either of these and if so which would you prefer.

Also it sounds like you are happy with a full refactor of find and search to iterators rather than just this one change. I am happy to do that as well as a separate PR. So two total changes:

  1. A refactor of the Index Rust code to turn find and search into iterators and updating relevant implementations
  2. Folding in a tree based index with these optimizations

Does that make sense as a breakdown?

@ctb
Copy link
Contributor

ctb commented Aug 22, 2023

hi @morsecodist please see #2230 (comment) - @luizirber is splitting out the big PR into many small bits!

In particular, for now, we suggest going this route,

I could add it as a new more slimmed down tree index implementation

and keeping it all in Rust - no need to expose to Python. You can keep your tree index separated and not in core - use sourmash-rust as dep here. In terms of exposing it to Python, you could use plugins for that, if it's desired; see https://github.com/sourmash-bio/pyo3_branchwater for my nascent efforts there on other fronts.

In re:

Also it sounds like you are happy with a full refactor of find and search to iterators rather than just this one change. I am happy to do that as well as a separate PR.

we are definitely on board with this, but @luizirber has some thoughts about how to make your life easier here by splitting off & merging the removal of SBT and BIGSI in #2230 so you don't need to deal with that code.

In brief: please let us know if/when this PR is ready for full review!

@luizirber let me know if I got that all right 😆

luizirber added a commit that referenced this pull request Aug 27, 2023
(spun off #2230, might help with #2665)

BIGSI and SBT are prototype-level code (and some of the first Rust code
I wrote...), and mostly makes it harder to change/refactor other parts
of the codebase.

Can bring it back later in the future if needed, but `mastiff` cover
many of the same use cases.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants