Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SAM file formatting #845

Open
Naturalist1986 opened this issue May 9, 2024 · 7 comments
Open

SAM file formatting #845

Naturalist1986 opened this issue May 9, 2024 · 7 comments

Comments

@Naturalist1986
Copy link

Expected Behavior

Hi,

I'm trying to convert a SAM file created from a search result but I get this error:

(mmseqs) moshea@shannon:/mnt/scratch$ samtools view -Sb SRR5234505.sam > SRR5234505.bam
[E::sam_parse1] CIGAR and query sequence are of different length
[W::sam_read1_sam] Parse error at line 363193
samtools view: error reading file "SRR5234505.sam"

This is the part of the file where the error refers to:

(mmseqs) moshea@shannon:/mnt/scratch$ sed -n '363193p' SRR5234505.sam

SRR5234505.241048 16 NZ_SMKV01000006.1_67 26 254 35M * 0 0 GGTCTTACCTGCGCTCCACAGGCTGTCACCGCCAGCCCGGCGGCCTTCGCGACGTTGACCGCCGCGTTGATGTCCCGGTCCAGGTACGTCCCGCACGACCCGCAC * AS:i:98 NM:i:12

Maybe the sam file is not formatted properly?

I used:

mmseqs convertalis SRR5234505_DB curated_protein_catalogue SRR5234505_Result SRR5234505 --format-mode 1

Thanks for your help!

@milot-mirdita
Copy link
Member

If I understand this correctly, this is a nucleotide-vs-protein search (translated search), right?

I think we have a bug for the SAM backtraces for this mode, where either the back trace should be converted to the nucleotide match count (not codon match count), or the nucleotide sequence should be translated to a protein sequence.

@milot-mirdita
Copy link
Member

Do you expect this search to return a nucleotide sequence or a protein sequence?

@genomewalker
Copy link
Contributor

@milot-mirdita, we are also interested in the SAM formatted output for translated searches. In our use case, we would like to get a nucleotide output.
Many thanks!

@Naturalist1986
Copy link
Author

I'm searching a metagenome (nucleotides) against a protein catalogue. I need to know how many reads mapped to each protein from the catalogue for functional analysis. I guess I expect the search to return the matches for each read?

@milot-mirdita
Copy link
Member

I pushed a fix that should correct some of the weirdness with SAM. Please check if this is what you expected as we don't use much sam internally.

  • We were reporting the revcomp bit (16) for forward instead of rev-comp hits
  • The printed sequence was for a reverse hit, was just the forward sequence, not a reverse-complemented one
  • We were printing the protein backtrace for nucleotide strings (not multiplying by 3)

@Naturalist1986
Copy link
Author

Thanks! Do I need to download and recompile mmseqs? or can I just replace with the fixed file?

@milot-mirdita
Copy link
Member

You can download precompiled binaries from: https://mmseqs.com/latest/

you don’t need to compile them yourself

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants