Skip to content

jmschrei/tangermeme

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

64 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

tangermeme

Unit Tests

[docs][tutorials]

Note

tangermeme is under active development. The API has largely been decided on, but may change slightly across versions until the first major release.

The MEME Suite is a collection of biological sequence analysis tools that rely almost solely on a collection of sequences or the motifs derived from them; tangermeme is an extension of this concept to biological sequence analysis when you have a collection of sequences and a predictive model. Hence, it implements many atomic sequence operations such as adding a motif to a sequence or shuffling it out, efficient tools for applying predictive models to these sequences, methods for dissecting what these predictive models have learned, and tools for designing new sequences using these models. tangermeme aims to be assumption free: models can be multi-input or multi-output, functions do not assume a distance and instead return the raw predictions, and when loss functions are necessary they can be supplied by the user. Although we will provide best practices for how to use these functions, our hope is that being assumption-free makes adaptation of tangermeme into your settings as frictionless as possible. All functions are unit-tested and implemented with both compute- and memory-efficient in mind. Finally, although the library was built with operations on DNA sequences in mind, all functions are extensible to any alphabet.

In addition to a library of functions to help you apply predictive models to sequences, future iterations of tangermeme will include PyTorch-based/GPU accelerated command-line tools that range from reimplementations of some of the tools in the MEME suite to new tools for sequence analysis that include attribution scores.

Please see the documentation and tutorials linked at the top of this README for more extensive documentation.

Installation

pip install tangermeme

Roadmap

This first release focused on the core prediction-based functionality (e.g., marginalizations, ISM, etc..) that subsequent releases will build on. Although my focus will largely follow my research projects and the feedback I receive from the community, here is a roadmap for what I currently plan to focus on in the next few releases.

  • v0.1.0: Prediction-based functionality ✔️
  • v0.2.0: Attribution-based functionality (e.g., attribution marginalization, support for DeepLIFT, seqlet calling..) ✔️
  • v0.3.0: PyTorch ports for MEME and TOMTOM and command-line tools for the prediction- and attribution- based functionality
  • v0.4.0: Focus on interleaving tools and iterative approaches

Usage

tangermeme aims to be as low-level and simple as possible. This means that models can be any PyTorch model or any wrapper of a PyTorch model as long as the forward function is still exposed, i.e., y = model(X) still works. This also means that if you have a model that potentially takes in multiple inputs or outputs and you want to simplify it for the purpose of ISM or sequence design that you can take your model and wrap it however you would like and still use these functions. It also means that all data are PyTorch tensors and that broadcasting is supported wherever possible. Being this flexible sometimes results in bugs, however, so please report any anomalies when doing fancy broadcasting or model wrapping.

Ersatz

tangermeme implements atomic sequence operations to help you ask "what if?" questions of your data. These operations can be found in tangermeme.ersatz. For example, if you want to insert a subsequence or motif into the middle of a sequence you can use the insert function.

from tangermeme.ersatz import insert
from tangermeme.utils import one_hot_encode   # Convert a sequence into a one-hot encoding
from tangermeme.utils import characters   # Convert a one-hot encoding back into a string

seq = one_hot_encode("AAAAAA").unsqueeze(0)
merge = insert(seq, "GCGC")[0]

print(characters(merge))
# AAAGCGCAAA

Sometimes, when people say "insert" what they really mean is "substitute", where a block of characters are changed without changing the length of the sequence. Most functions in tangermeme that involve adding a motif to a sequence use substitutions instead of insertions.

from tangermeme.ersatz import substitute
from tangermeme.utils import one_hot_encode   # Convert a sequence into a one-hot encoding
from tangermeme.utils import characters   # Convert a one-hot encoding back into a string

seq = one_hot_encode("AAAAAA").unsqueeze(0)
merge = substitute(seq, "GCGC")[0]

print(characters(merge))
# AGCGCA

If you want to dinucleotide shuffle a sequence, you can use the dinucleotide_shuffle command.

from tangermeme.ersatz import dinucleotide_shuffle
from tangermeme.utils import one_hot_encode
from tangermeme.utils import characters

seq = one_hot_encode('CATCGACAGACTACGCTAC').unsqueeze(0)
shuf = dinucleotide_shuffle(seq, random_state=0)

print(characters(shuf[0, 0]))
# CAGACACGATACGCTCTAC
print(characters(shuf[0, 1]))
# CGACATACGAGCTCACTAC

Both shuffling and dinucleotide shuffling can be applied to entire sequence, but they can also be applied to portions of the sequence by supplying start and end parameters if you want to, for instance, eliminate a motif by shuffling the nucleotides.

from tangermeme.ersatz import dinucleotide_shuffle
from tangermeme.utils import one_hot_encode
from tangermeme.utils import characters

seq = one_hot_encode('CATCGACAGACGCATACTCAGACTTACGCTAC').unsqueeze(0)
shuf = dinucleotide_shuffle(seq, start=5, end=25, random_state=0)

print(characters(shuf[0, 0]))
# CATCG AGCGACTCAGATACACACTT ACGCTAC    Spacing added to emphasize that the flanks are identical
print(characters(shuf[0, 1]))
# CATCG ACGAGCATCACACTAGACTT ACGCTAC

Prediction

Once you have your sequence of interest, you usually want to apply a predictive model to it. tangermeme.predict implements predict, which will handle constructing batches from a big blob of data, moving these batches to the GPU (or whatever device you specify) one at a time and moving the results back to the CPU, and making sure the inference is done without gradients and in evaluation mode. Also, there can be a cool progress bar.

from tangermeme.predict import predict

y = predict(model, X, batch_size=2)

DeepLIFT/SHAP Attributions

A powerful form of analysis is to run your predictive model backwards to highlight the input characters driving predictions. If the model predictions are accurate one can interpret these highlights -- or attributions -- as the actual driver of experimental signal. One of these attribution methods is called DeepLIFT/SHAP (merging ideas from DeepLIFT and DeepSHAP). tangermeme has a built-in implementation that is simpler, more robust, and corrects a few issues with other implementations of DeepLIFT/SHAP.

from tangermeme.deep_lift_shap import deep_lift_shap

X_attr = deep_lift_shap(model, X, target=267, random_state=0)

Note that for multi-task models a target must be set to calculate attributions for one output at a time.

image

Marginalization

Given a predictive model and a set of known motifs, a common question is to ask what motifs affect the model's predictions. Rather than trying to scan these motifs against the genome and averaging predictions at all sites -- which is challenging and computationally costly -- you can simply substitute in the motif of interest into a background set of sequences and see what the difference in predictions is. Because tangermeme aims to be assumption-free, these functions take in a batch of examples that you specify, and return the predictions before and after adding the motif in for each example. If the model is multi-task, y_before and y_after will be a tuple of outputs. If the model is multi-input, additional inputs can be specified as a tuple passed into args.

y_before, y_after = marginalize(model, X, "CTCAGTGATG")

By default, these functions use the nucleotide alphabet, but you can pass in any alphabet if you're using these models in other settings.

Below, we can see how a BPNet model trained to predict GATA2 binding responds to marginalizing over a GATA motif.

Importantly, all methods that modify the sequence can take in an optional func parameter to change the function that gets applied before and after performing the sequence modification (in this case, substituting in a motif). By default, this function is just the predict function but it can just as easily be the deep_lift_shap method to give you attributions before and after.

attr_before, attr_after = marginalize(model, X, "CTCAGTGATG", func=deep_lift_shap)

image

Ablation

The conceptual opposite of marginalization is ablation; rather than adding information to a sequence in the form of a motif, you remove information from it by shuffling out a portion of the sequence. This function will handle shuffling (or dinucleotide shuffling) the given number of times and returns the predictions for each shuffle.

y_before, y_after = ablate(model, X, 995, 1005)

As you might expect, if we shuffle a sequence that has the GATA motif in the middle, we see the opposite as the marginalization experiment.

Spacing

Motifs do not occur in isolation in the genome and so, frequently, we want to measure how these motifs interact with each other and how this changes across different spacings. This function takes in a set of motifs and either a fixed distance or one distance for each adjacent pair of motifs, returning the predictions before and after insertion of the full set of motifs.

y_before, y_after = space(model, X, ["CTCAGTGATG", "CTCAGTGATG"], spacing=10)

By running this function across several spacings one can, for instance, measure the cooperative effect that nearby AP-1 motifs have on predictions and observe that this cooperativity fades as the motifs get further apart.

Saturation Mutagenesis

Given an observed sequence a simple question is "what positions are driving model predictions?" One simple way to answer this question is through saturation mutagenesis, i.e., compare the predictions on the original sequence with that of sequences that comprehensively each contain one mutation with respect to that original sequence. This is another form of attribution method that is conceptually similar to deep mutational scanning but using a predictive model instead of running an experiment. In a region with an AP-1 motif, we can run ISM on Beluga and look at AP-1 factor tasks to identify that the AP-1 motif is what is driving the predictions.

from tangermeme.ism import saturation_mutagenesis

X_attr = saturation_mutagenesis(model, X)

This yields a tensor of a similar shape as the original sequence X where each value contains the predicted value (or values) for making the associated substitution. There are numerous ways to combine the predictions of each variant with the predictions on the original sequence and tangermeme allows you to use whichever approach you would like. However, a common one is the Euclidean distance followed by mean-subtracting to get the influence that each observed nucleotide has on the prediction. See the tutorial for more details.

Variant Effect

A common use case of these predictive models is evaluating the effect that mutations in sequence have on predictions. In practice, there are several situations where calculating variant effect is helpful but potentially the most clinically relevant one is having a list of variants that have been implicated by some sort of association study and wanting to screen them for some likelihood of being truly causal. In this case, given a model that predicts a readout of interest, e.g. protein binding and chromatin accessibility, one can see how the predictions change before and after variants are incorporated into the sequence. Presumably, the mutations that are actual drivers of phenotype will cause changes in predicted readout, and those that are passengers will not have as strong an effect.

The simplest form of variant is the substitution, where one or more characters in a sequence are changed to another character. We can easily get predictions before and after substitutions are included in the sequence. Importantly, more than one substitution can be encoded in each sequence.

from tangermeme.variant_effect import substitution_effect

substitutions = torch.tensor([
    [0, 1058, 0]
])

y, y_var = substitution_effect(model, X, substitutions)

When using a BPNet model that predicts GATA2 binding, we can see that a single substitution encoded at this position in the example seems to knock out predicted signal in the middle of the window entirely. Quite a strong effect.

image

tangermeme can also handle deletions and insertions. These situations are a bit more complicated because the sequences before and after incorporating variants are different lengths and so trimming needs to be done. See the tutorials for more explanation for how to control this trimming.

from tangermeme.variant_effect import deletion_effect
from tangermeme.variant_effect import insertion_effect

y, y_var = deletion_effect(model, X, deletions)
y, y_var = insertion_effect(model, X, insertions)

Design

Given a trained predictive model, one can try to design a sequence that has desired attributions. Specifically, one can try to design a sequence that causes the model to give desired predictions.

Currently, the only design algorithm implemented in tangermeme is a greedy substitution algorithm that will try every given motif at every position each iteration, and take the motif + position combo that yields predictions closest to the desired output from the model.

from tangermeme.design import greedy_substitution

X_hat = greedy_substitution(model, X, motifs, y, mask=idxs, max_iter=3, verbose=True)

When the model is the Beluga model and the goal is to design a sequence that yields strong AP-1 binding but ignore the effect on all other tasks, this function inserts three AP-1 binding sites close together. The predictions from the model are much higher for the AP-1 tasks on the designed sequence than the original sequence.

Seqlet Calling

In contrast to motif scanning, which usually relies solely on nucleotide sequence to determine whether a motif matches, seqlet calling is the identification of spans of nucleotides that have high attribution score. Seqlet calling methods usually do not rely on sequence at all because they are the first step in identifying repeating patterns based on having high attributions.

from tangermeme.seqlet import tfmodisco_seqlets

pos_seqlets, neg_seqlets = tfmodisco_seqlets(X_attr)

image

About

Biological sequence analysis for the modern age.

Resources

License

Stars

Watchers

Forks

Packages

No packages published

Languages