Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

1:1 alignment for thousands of sequence pairs, not pairwise. #813

Open
Paupiera opened this issue Feb 14, 2024 · 7 comments
Open

1:1 alignment for thousands of sequence pairs, not pairwise. #813

Paupiera opened this issue Feb 14, 2024 · 7 comments

Comments

@Paupiera
Copy link

Dear mmseqs2 developvers,

I have a list of thousands of subject - query fasta pairs, and I would like to run mmseqs to align each of these pairs in a way that each sequence is only aligned to its pair. This is how my list looks:

S10CNODE_1.fasta       S1CNODE_25.fasta
S10CNODE_2.fasta       S2CNODE_8.fasta 
S10CNODE_3.fasta       S5CNODE_11.fasta   
S10CNODE_4.fasta       S3CNODE_7.fasta
S10CNODE_5.fasta       S6CNODE_10.fasta

Is there an efficient way to perform these 1:1 alignments? Could I create a database that contains all sequences and then align a database subentry?

I am trying to avoid aligning all against all.

@milot-mirdita
Copy link
Member

milot-mirdita commented Feb 14, 2024

This is possible but a bit tricky:

Please make one FASTA file containing all sequence entries. Then call createdb

mmseqs createdb YOUR_FASTA_FILE.fasta db

Then take a look at the db.lookup file it generated.
You will find entries in the format:

numeric_key accession set_id
...

The first two columns are important, you can ignore the last. You will need to make a new TSV file, of the keys of the two matching accessions.

In your example, you should see something like the following in the db.lookup.

0 S10CNODE_1.fasta
1 S10CNODE_2.fasta
2 S10CNODE_3.fasta
3 S10CNODE_4.fasta
4 S10CNODE_5.fasta
5 S1CNODE_25.fasta
6 S2CNODE_8.fasta 
7 S5CNODE_11.fasta   
8 S3CNODE_7.fasta
9 S6CNODE_10.fasta

The new tsv file you need to create should look like this:

0 5
1 6
2 7
3 8
4 9
...

Next sort this file according to the first column:

sort -k1,1n NEW_TSV_FILE.tsv  > NEW_TSV_FILE_sort.tsv 

Now you can pass this file to tsv2db and compute the alignments and generate a normal MMseqs2 m8 output:

# 6 is a clustering database, which only requires a target key and nothing else
mmseqs tsv2db NEW_TSV_FILE_sort.tsv clu_db --output-dbtype 6
mmseqs align db db clu_db aln_db -a
mmseqs convertalis db db aln_db aln.m8

@milot-mirdita
Copy link
Member

Oh, also this will only work for protein sequences. Nucleotide sequences need a diagonal seed point to compute the alignment, which would be much more difficult to hack.

@Paupiera
Copy link
Author

Thank you so much for your fast reply. Unfortunately, my sequences are nucleotide. Would be of any help if I provide the seed points?

@milot-mirdita
Copy link
Member

You can try.

The tsv you need to create has the same format of, but with two more columns score (can be 0) and match diagonal (position i-j):

query_key target_key score(0) diagonal

Then call:

# 7 is a prefiltering database
mmseqs tsv2db NEW_TSV_FILE_sort.tsv pref_db --output-dbtype 7
mmseqs align db db pref_db aln_db -a
mmseqs convertalis db db aln_db aln.m8

In either case, you can't mix nucleotides and protein-pairs in one run, needs to be either or.

@Paupiera
Copy link
Author

Paupiera commented Feb 15, 2024

What criteria should I follow to define the score?
Is the format of diagonal i-j being i: seed start position in query and j: start position in target? Also, for each row, is it possible to set more than one seed/diagonal position?

@milot-mirdita
Copy link
Member

You can ignore the score, its not used further in the align module. Just set it to 0.

You can pass only one diagonal per query-target pair. You should be able to create multiple lines per query target pair with different diagonals though.

The seed point refers to the seed start positions i and j in query and target, respectively.

Just out of curiosity: Can I ask what your use-case for this is?

@Paupiera
Copy link
Author

Of course. We are developing a method that, given a set of dna sequences from different environmental samples, generates pairs of candidates across samples, suggesting they could be generated from the same source. In this case, we would like to curate the candidates through alignment.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants