Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

MemoryError: Allocation failed (probably too large) with ped.compute() #1213

Open
fouerghi opened this issue Apr 8, 2024 · 7 comments
Open

Comments

@fouerghi
Copy link

fouerghi commented Apr 8, 2024

Hi,

I am running a big simulation with a million people and I'd like to calculate the genealogical relatedness between them. I tried running sgkit and unfortunately, when I hit the ped.compute() step, I run into this error: MemoryError: Allocation failed (probably too large).

Any advice on how I can bypass this? I tried doing some stuff related to dask but didn't get anything useful.

P.S: I have a lot of unrelated pairs of individuals in the trees and I am only interested in ones that are above a certain threshold. Could that help? Any suggestions are welcome.

@jeromekelleher
Copy link
Collaborator

That's a tricky one - I think all our pedigree algorithms are "dense", and aren't really designed for these kinds of large pedigree. I think @timothymillar had some thoughts about how we might implement though?

@timothymillar
Copy link
Collaborator

Hi @fouerghi, yes we currently only support dense arrays. From memory, we decided to avoid sparse arrays until there is more standardized support in our dependencies (Xarray and Zarr).

An alternative that I have considered implementing is chunked computation of kinship/relatedness. This would allow you to compute the relationship matrix in chunks rather than all at once. Theoretically, this should let you identify closely related individuals (in each chunk) without holding the whole matrix in memory. Does this sound like a reasonable solution for your use-case?

@timothymillar
Copy link
Collaborator

timothymillar commented Apr 9, 2024

P.S: I have a lot of unrelated pairs of individuals in the trees and I am only interested in ones that are above a certain threshold. Could that help? Any suggestions are welcome.

Is this a single pedigree in which all individuals are connected, or is it a single dataset with multiple independent pedigrees? If it is the latter, you may be able to split it into multiple smaller pedigrees and perform your computations on those. The pedigree_sel function should be helpful here.

@fouerghi
Copy link
Author

fouerghi commented Apr 9, 2024

Hi Jerome and Tim,

Thank you for getting back to me. Could you elaborate more on the chunking part? I am interested in identifying (full sibs, first cousins, second cousins, third cousins and 4th cousins). Would it work for these cases?

I have a single pedigree where all individuals are connected.

@timothymillar
Copy link
Collaborator

Could you elaborate more on the chunking part?

It's a strategy to compute large arrays/matrices in parts which can avoid needing to hold the entire matrix in memory at once. We haven't implemented it for pedigree based kinship estimation yet because it's quite complex. I have written a short script demonstrating how you could do this manually, but it will be quite slow and may require some tuning of the chunk size:
https://gist.github.com/timothymillar/2bb2a4521daf948d76808d60eccb390a

I am interested in identifying (full sibs, first cousins, second cousins, third cousins and 4th cousins). Would it work for these cases?

In that case you may not actually need to compute genealogical kinship/relatedness. There are multiple relationship types that can give similar estimates of genealogical relatedness. So it might be better to try and pull these relationships directly from the pedigree. For example, you can identify sets of full sibs based on them having identical parents. But the more distant relations will be more complex.

@fouerghi
Copy link
Author

Hi Tim,

For the script you linked above, it uses pedigree_sel() function, but it seems that it's in the unreleased version of sgkit. Is there another function I could use?

@timothymillar
Copy link
Collaborator

Hi @fouerghi, sorry for my slow response. I'm afraid we don't have an equivalent function in the current 0.7 release. Hopefully the next release won't be too far away.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants