Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Illogical Lineage Counts in algorithms.py #2242

Open
agushin101 opened this issue Dec 12, 2023 · 7 comments
Open

Illogical Lineage Counts in algorithms.py #2242

agushin101 opened this issue Dec 12, 2023 · 7 comments

Comments

@agushin101
Copy link

sims.ipynb.zip

Above is a Notebook describing the issue in more detail. The summary is that: when running basic selective sweeps using algorithms.py's framework, there is some illogical output. The sweep_pop_sizes array, which I've interpreted to represent the lineage counts for wild type and mutant respectively, does not always converge. When the --population-sizes flag is not specified, it never converges to 1 lineage. When it is specified, it sometimes converges and sometimes does not, getting stuck at 2 mutant lineages, even when recombination is set to 0. I was wondering if someone could clear up why this is happening.

Thanks!

@jeromekelleher jeromekelleher added this to To do in Sweeps refactoring via automation Dec 13, 2023
@jeromekelleher
Copy link
Member

Thanks @agushin101, I think @GertjanBisschop is taking a look. This is possibly related to #2217 - we didn't do a very good job of keeping the Python prototype up to date with the C implementation as bugs got identified and fixed, and there are inconsistencies between the two.

@GertjanBisschop
Copy link
Member

Thanks @agushin101, this is insightful! This definitely exposes a couple of issues.
The impact of the sweep is clearly not as strong as it should be when the population size is specified. This can be rectified. I will post a draft pull-request that fixes two issues: 1. the expected duration of the sweep in algorithms.py, 2. modifications to the rate of coalescence vs recombination logic.
Do pay attention though to the recombination rate you specify, this is a per base per generation rate, not the population-scaled mutation rate. So in a way it makes sense that the number of lineages actually increases. The actual rate should probably be about 20 000 times smaller.
I am not sure yet whether this fixes all discrepancies between both implementations, but I am sure we'll get there in the end.

@YuanboSong
Copy link

We adjust the recombination rate by dividing it by the product of the population size and the segment length, i.e., 0.1 / (population size * segment length). However, there are no recombination events observed during the sweep. If everything runs as expected, what should the sweep population size be at the beginning? In some runs, I observed it became [0,0]. Shouldn't it be [0,1], considering there should be a beneficial ancestor at the start of the sweep? Thank you for your time.

@GertjanBisschop
Copy link
Member

The simulation is a structured coalescent based on a pre-simulated allele frequency trajectory. This does not guarantee that at the end of the simulated frequency trajectory everything will have coalesced. Similarly, all samples might have coalesced before the end of the simulated trajectory. Simply by chance you might have drawn some samples that coalesced right end the end of the sweep (so in the beginning when looked backwards in time).

@YuanboSong
Copy link

Thank you for the insights. I understand the simulation is based on a structured coalescent model with a pre-simulated allele frequency trajectory. However, my concern is specifically about the initial conditions of the sweep. We deliberately start the sweep with only one beneficial allele, implying that all lineages should eventually coalesce into this single ancestor. If the simulation sometimes ends with more than one lineage remaining, doesn't this indicate a potential issue in the coalescent function or the way it's handling these initial conditions? It seems like the simulation should consistently converge to a single lineage in this scenario, given the starting premise of a single beneficial allele.

@jeromekelleher
Copy link
Member

If the simulation sometimes ends with more than one lineage remaining, doesn't this indicate a potential issue in the coalescent function or the way it's handling these initial conditions? It seems like the simulation should consistently converge to a single lineage in this scenario, given the starting premise of a single beneficial allele.

This is something I've been confused about too, FWIW!

@GertjanBisschop
Copy link
Member

Yup, I get your point, this is somewhat confusing. The original Braverman paper on which this implementation is based gives some insight though: "The deterministic frequency trajectory starts at frequency 1 - E and ends at E . KAPLAN et al. (1989) chose E= 5 / (2Nes) based on their calculation of the point at which the probability of extinction of the new mutant due to stochastic effects becomes approximately zero". This means that if the start and end frequency we provide become too close to 0 (and 1 respectively), then we ignore the stochastic effects of the sweep. This is where our approximation no longer holds and so we shouldn't interpret the start frequency as a strict bound.
From a more practical point of view, we do expect the distribution of the number of neutral lineages with the favoured allele to be very centred around 1 in case of a hard sweep, but it will never be always exactly one.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Development

No branches or pull requests

4 participants