
ENH: Allow linear_sum_assignment to accept a ufunc for very large graphs #20729

Open
matham opened this issue May 16, 2024 · 3 comments
Labels: enhancement (A new feature or improvement), scipy.optimize

matham commented May 16, 2024

Is your feature request related to a problem? Please describe.

We are trying to evaluate how well different cell detection algorithms find cells, so we want to find the minimum-cost pairing between two sets of cells in terms of Euclidean distance.

The problem is that even for 250k cells, the cost matrix required by linear_sum_assignment would be over 200 GB in float32 (250,000² entries × 4 bytes), which is far too large to hold in memory.
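A quick back-of-the-envelope check of that figure (my own arithmetic, not from the linked issue):

n = 250_000                        # cells on each side
bytes_per_entry = 4                # float32
print(n * n * bytes_per_entry / 1024**3)  # ≈ 232.8 GiB for the dense cost matrix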

Describe the solution you'd like.

I'd like linear_sum_assignment to accept something like a ufunc that computes the cost on the fly instead of a precomputed cost matrix. Or even better (but obviously less general), it could compute the distance between two sets of inputs itself.

Describe alternatives you've considered.

I thought about hacking up a fake array-like object that computes the distance on every access, but I can't imagine that would be fast.
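Purely for illustration, a minimal sketch of that "fake array" idea (a hypothetical class, nothing I've benchmarked); note that linear_sum_assignment most likely converts its input to a dense ndarray, so an object like this would probably get materialized anyway:

import numpy as np

class LazyCostMatrix:
    # Sketch: compute the Euclidean cost for an (i, j) pair only when accessed.
    def __init__(self, a, b):
        self.a = a                      # (N, K) points
        self.b = b                      # (M, K) points
        self.shape = (len(a), len(b))
        self.dtype = np.float32

    def __getitem__(self, idx):
        i, j = idx
        return np.linalg.norm(self.a[i] - self.b[j])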

Additional context (e.g. screenshots, GIFs)

brainglobe/brainglobe-utils#74.

dschmitz89 (Contributor) commented

I am not familiar with linear_sum_assignment but could sparse matrices be a potential alternative?
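If a sparse formulation is viable (i.e. only a small number of candidate pairings per cell need to be considered), scipy.sparse.csgraph.min_weight_full_bipartite_matching already accepts a sparse biadjacency matrix. A rough sketch, keeping in mind that missing entries are treated as forbidden assignments rather than zero cost:

import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import min_weight_full_bipartite_matching

# Keep only a handful of plausible candidate pairs so the matrix stays sparse.
rows = np.array([0, 0, 1, 2])
cols = np.array([0, 1, 1, 2])
costs = np.array([1.5, 2.0, 0.7, 3.1])
biadjacency = csr_matrix((costs, (rows, cols)), shape=(3, 3))

row_ind, col_ind = min_weight_full_bipartite_matching(biadjacency)
print(row_ind, col_ind, biadjacency[row_ind, col_ind].sum())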

matham (Author) commented May 17, 2024 via email

matham (Author) commented May 17, 2024

What about a proposed function like this:

from typing import Callable, Optional
import numpy as np

def linear_sum_assignment(
    array1: np.ndarray, array2: np.ndarray,
    cost: Optional[Callable] = None, callback: Optional[Callable] = None,
) -> tuple[np.ndarray, np.ndarray]:
    pass

Here array1 and array2 are N×K and M×K respectively, and cost can be None or a callable (or possibly a string?). If it's None (or an "l2"/"distance" string?), the cost is the L2 norm between rows, i.e. L2(array1[i, :], array2[j, :]); otherwise the callable is invoked with the inputs to produce the cost.

And callback is an optional progress callback, called once per row of array1 (N times in total).
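To make the intended semantics concrete, a rough usage sketch (everything below is hypothetical; the proposed call is commented out since it doesn't exist yet):

import numpy as np

rng = np.random.default_rng(0)
detected = rng.random((1000, 3))      # N x K detected cell coordinates
truth = rng.random((1200, 3))         # M x K ground-truth cell coordinates

def l2_cost(a_row, b):
    # Default cost: Euclidean distance from one row of array1 to every row of
    # array2, computed on the fly so the full N x M matrix is never stored.
    return np.linalg.norm(b - a_row, axis=1)

def progress(i, n):
    if i % 100 == 0:
        print(f"{i}/{n} rows processed")

# Proposed call (hypothetical API):
# rows, cols = linear_sum_assignment(detected, truth, cost=l2_cost, callback=progress)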

Or, to start off with, there's no cost argument and L2 is assumed?

I can try my hand at implementing this if it seems acceptable.
