Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Banded Alignments and long sequences #400

Open
hjruscheweyh opened this issue Feb 29, 2020 · 3 comments
Open

Banded Alignments and long sequences #400

hjruscheweyh opened this issue Feb 29, 2020 · 3 comments

Comments

@hjruscheweyh
Copy link

Hi!

I have a question regarding handling of sequences with different length.

I usually align sequences with a fixed identity cutoff of 97% and, depending on the sample, different read length. vsearch is really fast when aligning short sequences (~100bp) but becomes prohibitively slow when aligning longer sequences (~250 bp). I assume this is due to the matrix sizes for the DP alignment step. Can that be? One way to address this problem is by allowing for a banded alignment. Assuming that I look for an alignment with 97% identity in a read with length 250 then the maximum band needed would be 8. Or in terms or memory consumption --> The full matrix would be 250*250 = 62500 (The reference is of course longer but I assume that there is a smart way this is being taking into account). The reduced matrix would be 250 * 17 = 4250. Considering that you need multiple matrices and not only creating but also filling them takes time I would assume that this way you could lower runtime significantly. Is there an option that does something similar?
Thanks a lot,
Hans

@frederic-mahe
Copy link
Collaborator

Hi @hjruscheweyh

vsearch implements two pairwise alignment methods. The manpage states that the second method is triggered when aligning very long sequences:

For comparisons involving sequences with a length product greater than 25 million (for example two sequences of length 5 kb), vsearch uses a slower alignment method described by Hirschberg (1975) and Myers and Miller (1988), with much smaller memory requirements.

Implementing a banded pairwise alignment method for high-similarity searches is an interesting idea, duly noted. In the meantime, a great way to speed up thing is to limit the number of pairwise alignments you actually need to perform. Increasing the wordlength value could help you there by reducing a bit the number of candidate sequences to align:

--wordlength positive integer
Length of words (i.e. k-mers) for database indexing. The range of possible values goes from 3 to 15, but values near 8 or 9 are generally recommended. Longer words may reduce the sensitivity/recall for weak similarities, but can increase precision. On the other hand, shorter words may increase sensitivity or recall, but may reduce precision. Computation time generally increases with shorter words and decreases with longer words, but it increases again for very long words. Memory requirements for a part of the index increase with a factor of 4 each time word length increases by one nucleotide, and this generally becomes significant for long words (12 or more). The default value is 8.

@hjruscheweyh
Copy link
Author

Hi @frederic-mahe
Thanks for the feedback

Hi @hjruscheweyh

I think this is the linear space algorithm which was developed for the case that the matrices built from aligning 2 sequences is too large to fit into memory. We rather deal with the case that many fields in the matrix will never ever be used but still being generated.

vsearch implements two pairwise alignment methods. The manpage states that the second method is triggered when aligning very long sequences:

For comparisons involving sequences with a length product greater than 25 million (for example two sequences of length 5 kb), vsearch uses a slower alignment method described by Hirschberg (1975) and Myers and Miller (1988), with much smaller memory requirements.

We need to produce a large amount of alignments for benchmarking our tool and for finding useful cutoffs. So limiting maxaccepts and maxrejects is no option for now. Btw, is there a way to get the order in which accepts and rejects were found out of vsearch? E.g. when setting maxaccepts 100 maxrejects 100 you generate a maximum of 200 alignments. Would be nice to know in which order accepts and rejects appeared, e.g like aaaaaaarrrrraaarraraaaarrrrrrrrrrrr. That way you could effectively try to limit maxrejects without losing precision.

The idea with wordsize is cool. I will check how that impacts runtime. Do you know how words are calculated if IUPAC characters appear in the reference sequences?

Implementing a banded pairwise alignment method for high-similarity searches is an interesting idea, duly noted. In the meantime, a great way to speed up thing is to limit the number of pairwise alignments you actually need to perform. Increasing the wordlength value could help you there by reducing a bit the number of candidate sequences to align:

--wordlength positive integer
Length of words (i.e. k-mers) for database indexing. The range of possible values goes from 3 to 15, but values near 8 or 9 are generally recommended. Longer words may reduce the sensitivity/recall for weak similarities, but can increase precision. On the other hand, shorter words may increase sensitivity or recall, but may reduce precision. Computation time generally increases with shorter words and decreases with longer words, but it increases again for very long words. Memory requirements for a part of the index increase with a factor of 4 each time word length increases by one nucleotide, and this generally becomes significant for long words (12 or more). The default value is 8.

@torognes
Copy link
Owner

torognes commented Mar 4, 2020

Thanks for suggesting the idea of a banded alignment. This is something that we have and will consider. It is clear that full alignment is time-consuming with long sequences and a banded alignment could save considerable time (and memory). However, it is complex to implement in combination with the SIMD parallelisation performed during alignment by vsearch.

Unfortunately, it is currently not possible to output the order of accepted and rejected hits. It could potentially be added as a kind of diagnostic option.

Words containing ambiguous nucleotide symbols (those other than ACGTU) are ignored in the initial step of the search algorithm.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants