Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CpG mutation models #2221

Open
hyanwong opened this issue Nov 14, 2023 · 3 comments
Open

CpG mutation models #2221

hyanwong opened this issue Nov 14, 2023 · 3 comments

Comments

@hyanwong
Copy link
Member

hyanwong commented Nov 14, 2023

Elevated mutation rates at CpG dinucleotides are one of the major contributors to mutation rate variation in mammals. These can't easily be simulated by sim_mutations, but there are probably reasonable approximations if a reference sequence is defined.

@petrelharp says, on slack:

It'd take some thinking to figure out how to do it - since we can't do real context-dependence, the most obvious thing is a dinucleotide model, but then this misses half the CpG pairs. I think a pretty good approximation to real context-dependence could be done by just using the reference sequence when you don't know what the neighboring nucleotide is, but getting that into msprime would be a significant project.

The main time CpG is mentioned on the msprime GitHub repo is in #972, but only in passing, so I have opened this issue for people who are searching for the term in conjunction with msprime, and who might want to think about an implementation.

@petrelharp
Copy link
Contributor

Note however that you can simulate variation in mutation rate using a RateMap - so, for instance, a crude thing to do would be to calculate the density of CpG pairs in 100Kb windows, and then use this to adjust the mutation rate. This would account for coarse-scale mutation rate variation, but not the per-nucleotide heterogeneity, and obviously not the context-dependence.

@hyanwong
Copy link
Member Author

hyanwong commented Mar 23, 2024

One way to approximate this would be to designate regions as CpG and non-CpG on the basis of the reference sequence, then apply mutations twice: once at the CpG sites (masking out the non-CpG sites) and again at the non-CpG sites (masking out the CpG sites). For more accuracy you could restrict this to the methylated CpG sites (see e.g. here).

It's not quite right, because there should be some context dependence associated with adjacent sites changing state (e.g. it is possible for the G at a CpG to mutate to T, which would make the C no longer part of a CpG site). But that's probably a second-order effect as long as mutation is rare.

For reference, there's a table of transition/transversion rates in humans at CpG and nonCpG sites at https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3548427/table/T2/ - you could imagine using these values in the msprime.HKY model
Screenshot 2024-03-25 at 11 44 03

Here's some code to do it. It's not quite right yet (see comments in the code), but I think the approach is reasonable

import tarfile
import msprime
import stdpopsim
import re
import os
import warnings
import numpy as np
from urllib.request import urlretrieve

def get_refseq(
    chromosome,
    max_chars,
    ancestral_alleles_prefix="homo_sapiens_ancestor_",
    build="GRCh38",
    base_url="https://ftp.ensembl.org/pub/release-111/fasta/ancestral_alleles/"
):
    """
    Download and return max_chars of the ancestral sequence.
    The download can be large (800Mb), so ensure you have space
    """
    def get_fasta_seq(file, max_chars=None):
        """Return the first max_chars of a gzipped FASTA file as a single string"""
        seq = ""
        seq_num = 0
        for i, line in enumerate(file):
            if seq_num > 0:
                break
            if max_chars is not None and len(seq) >= max_chars:
                break
            if line.startswith(b">"):
                if i > 0:
                    seq_num += 1
            else:
                seq += line.decode().strip()
        if max_chars is None:
            return seq
        else:
            return seq[:max_chars]

    ancestral_alleles_file = ancestral_alleles_prefix + build
    ancestral_alleles_tarfile = ancestral_alleles_file + ".tar.gz"

    if not os.path.isfile(ancestral_alleles_tarfile):
        ## Large download: make sure you have space
        print(f"Downloading large tarball of sequences from {ancestral_alleles_tarfile}")
        urlretrieve(
            ancestral_alleles_tarfile,
            ancestral_alleles_tarfile,
        )
    with tarfile.open(ancestral_alleles_tarfile, "r:gz") as tar:
        filehandle=tar.extractfile(os.path.join(
            ancestral_alleles_file,
            f"{ancestral_alleles_prefix}{chromosome}.fa"
        ))
        return get_fasta_seq(filehandle, max_chars)


def impose_CpG_mutations(
    ts, CpG_transition_rate, CpG_transv_rate, nonCpG_transition_rate, nonCpG_transv_rate,
    remove_refseq=True,
):
    if ts.num_sites > 0:
        raise ValueError("Expecting a ts with no sites / mutations")
    if not ts.has_reference_sequence():
        raise ValueError(
            "Expecting a ts with a reference sequence (e.g. from "
            "https://ftp.ensembl.org/pub/release-111/fasta/ancestral_alleles/)"
        )

    tables = ts.dump_tables()
    print("Adding CpG mutations based on the reference sequence")
    # Add ancestral states for any ACGT/acgt values
    for pos, ancestral_allele in enumerate(ts.reference_sequence.data):
        if ancestral_allele in "ACGTacgt":
            tables.sites.add_row(position=pos, ancestral_state=ancestral_allele.upper())

    used_regions = np.zeros(len(ts.reference_sequence.data), dtype=bool)
    CpG_regions = np.zeros(len(ts.reference_sequence.data), dtype=bool)
    for m in re.finditer(r"[ACGTacgt]+", ts.reference_sequence.data, re.IGNORECASE):
        used_regions[m.start(): m.end()] = True
    for m in re.finditer(r"(?:CG)+", ts.reference_sequence.data, re.IGNORECASE):
        CpG_regions[m.start(): m.end()] = True
    assert np.all(used_regions[CpG_regions])  # all CpG regions are in used_regions
    nonCpG_regions = used_regions & ~CpG_regions

    # find the switchpoints between regions
    CpG_regions = np.where(np.diff(np.pad(CpG_regions, (1, 1))) != 0)[0]
    assert len(CpG_regions) % 2 == 0
    CpG_regions = CpG_regions.reshape((-1, 2))

    nonCpG_regions = np.where(np.diff(np.pad(nonCpG_regions, (1, 1))) != 0)[0]
    assert len(nonCpG_regions) % 2 == 0
    nonCpG_regions = nonCpG_regions.reshape((-1, 2))
    
    regions_and_rates = {
        "CpG": (CpG_regions, [CpG_transition_rate , CpG_transv_rate]),
        "nonCpG": (nonCpG_regions, [nonCpG_transition_rate , nonCpG_transv_rate])
    }
    models = {k: msprime.HKY(rate[0]/rate[1]) for k, (_, rate) in regions_and_rates.items()}
    ratemaps = {}
    for k, (regions, rates) in regions_and_rates.items():
        pos = regions.flatten()
        use = [np.sum(rates), 0]
        if regions[0, 0] != 0:
            pos = np.insert(pos, 0, 0)
            use = [use[1], use[0]]
        if regions[-1, 1] != ts.sequence_length:
            pos = np.append(pos, [int(ts.sequence_length)])
        ratemaps[k] = msprime.RateMap(position=pos, rate=np.tile(use, len(pos)//2)[0:(len(pos)-1)])
    print("Adding CpG and non-CpG mutations separately, in regions of known ancestral state")
    if remove_refseq:
        tables.reference_sequence.data = ""
    mutated_ts = tables.tree_sequence()
    for k in ratemaps.keys():
        mutated_ts = msprime.sim_mutations(mutated_ts, rate=ratemaps[k], model=models[k], random_seed = 123)
    mutated_ts = mutated_ts.simplify()
    print(
        f"Added {mutated_ts.num_mutations} mutations: av rate =",
        mutated_ts.trim().segregating_sites(mode="site") /
        mutated_ts.trim().segregating_sites(mode="branch"),
        "per bp per gen"
    )
    return mutated_ts, CpG_regions, nonCpG_regions


## Start simulation
def simulate_OOA_no_muts(samples, chromosome, sim_length):
    refseq = get_refseq(chromosome=20, max_chars=sim_length)
    start = len(refseq) - len(refseq.lstrip("Nn.*"))
    end = len(refseq.rstrip("Nn.*"))
    print(f"Simulating chr{chromosome} from position {start} to {end} without mutations")
    species = stdpopsim.get_species("HomSap")
    model = species.get_demographic_model("OutOfAfrica_3G09")
    contig = species.get_contig(f"chr{chromosome}", mutation_rate=0, left=start, right=end)
    engine = stdpopsim.get_engine("msprime")
    with warnings.catch_warnings():
        warnings.filterwarnings('ignore', message='.*mutation rate 0\.')
        ts = engine.simulate(model, contig, samples, seed=123)
    tables = ts.dump_tables()
    # until https://github.com/popsim-consortium/stdpopsim/issues/1404 is fixed
    # add the deleted region back in at the start, so the refseq lines up
    if ts.first().num_edges > 0:
        tables.edges.left = tables.edges.left + start
        tables.edges.right = tables.edges.right + start
        tables.sequence_length = tables.sequence_length + start
    assert len(refseq) == tables.sequence_length
    tables.reference_sequence.data = refseq
    return tables.tree_sequence()

chrom = 20
samples = {"YRI": 10, "CEU": 190}
ts = simulate_OOA_no_muts(samples, chromosome=chrom, sim_length=4_000_000)
mutated_ts, CpGregions, nonCpGregions = impose_CpG_mutations(
    ts,  # Figures below from https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3548427/table/T2/
    CpG_transition_rate=1.12e-7,
    CpG_transv_rate=9.59e-9,
    nonCpG_transition_rate=6.18e-9,
    nonCpG_transv_rate=3.76e-9,
)

To test:

print(sum(1 for s in mutated_ts.sites() if len(s.mutations) > 1),
      f"/ {mutated_ts.num_sites} sites with >1 mutation")
nCpG = [0, 0]
for s in mutated_ts.sites():
    if np.any((s.position >= CpGregions[:, 0]) & (s.position < CpGregions[:, 1])):
        if len(s.mutations) > 0:
            nCpG[0] += 1
        if len(s.mutations) > 1:
            nCpG[1] += 1
print(
    f"{nCpG[0]} variable CpG sites ({nCpG[1]} with more than one mutation)")

Giving:

73 / 12873 sites with >1 mutation
3294 variable CpG sites (54 with more than one mutation)

@hyanwong
Copy link
Member Author

hyanwong commented Mar 26, 2024

@jeromekelleher suggested that the dinucleotide model would be a good way to get a version that has the correct dynamics. As @petrelharp points out, a dinucleotide model will miss half of the CpGs. Additionally it may be tedious converting the dinucleotide output into a sensible format: the alignments method will only accept single letter allelic states, so we would need to iterate over the variants and split them into two sites before exporting.

Nevertheless, something like a dinucleotide model might be the only way to get an equialibrium distribution of CpG mutations. If we have ancestral states and a set of transition / transversion probabilities, then the recipe above should work fine, though.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants