Check InterProScan seqtype #5891
base: main
@neoformit please LMK if you have any comments on the UX or anything else.
Looks good to me. Is it possible to add "detected nucleotide when you selected protein" to the error message, just to be a bit more specific for the user?
@neoformit that would be better, but the values of the seqtype variable are "p" and "n", so it makes the command section a bit nastier. See 7a25359 - worth it?
Since you already have the |
Thanks! @TomHarrop can you look at the failing lint, please? How is InterProScan failing and how fast? We could also add a
Some thoughts for the future, because this is now coming up more and more often:
Thanks @bgruening. The lint failure is because of InterProScan's non-standard version numbering.
I believe it doesn't fail but creates huge jobs that run for days, and there is a similar problem with blastp. My understanding is that @igormakunin has to cancel jobs manually to keep the queue moving. Hopefully he can comment.
I don't think there is any technical difference in FASTA that we can rely on, e.g. "ATGC" is a valid peptide string.
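To make the overlap concrete, here is a minimal sketch in Python. The names `IUPAC_NUCLEOTIDE`, `AMINO_ACIDS` and `could_be_either` are made up for illustration and are not part of the wrapper; the amino acid set is assumed to be the 20 standard residues only.

```python
# IUPAC nucleotide codes, including ambiguity codes (assumption: no gap characters).
IUPAC_NUCLEOTIDE = set("ACGTURYSWKMBDHVN")
# The 20 standard amino acid one-letter codes (assumption: no B, Z, X, U, O).
AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")

# Letters that are valid in both alphabets, and letters that only a protein can contain.
shared = IUPAC_NUCLEOTIDE & AMINO_ACIDS
protein_only = AMINO_ACIDS - IUPAC_NUCLEOTIDE


def could_be_either(seq):
    """Return True if every character of seq is valid in both alphabets."""
    return set(seq.upper()) <= shared


print(sorted(protein_only))
print(could_be_either("ATGC"))
```

Only a sequence containing one of the protein-only letters can be told apart from nucleotide input on alphabet alone.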
This seems like a good option to me. I could look into that if you point me in the right direction.
As above, we might not detect them reliably, but I don't know how the sniffer works under the hood.
Ages ago I did start developing a Python lib. It has a validator for DNA, protein, sequence count etc. and some common formatting for FASTA files. Could also implement for FASTQ but haven't yet. It writes errors to stderr, which results in user-friendly error messages in Galaxy. Maybe there are existing libraries but I don't know of one that does this exactly. It wraps BioPython etc. under the hood.
This seems like a nice solution, but the logic for checking and raising error messages still has to be re-implemented in every wrapper, right?
FOR CONTRIBUTOR:
This tool is causing a bit of support work when users input nucleotide sequences but select protein as the sequence type.
This greps for anything that is not an IUPAC nucleotide and checks the result against the selected seqtype.
It won't always be correct because a protein consisting only of amino acid residues that are in the nucleotide alphabet is valid (but probably not common). It could also take a long time to search through valid nucleotide input. I could limit it to the first 1000 lines or something?
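The check described above, including the proposed line limit, could look roughly like this in Python. This is only a sketch and not the code in this PR, which uses grep in the command section. The function names, the `max_lines` parameter, the error wording, and the choice to report a mismatch in both directions are all assumptions.

```python
import io
import re

# Anything that is not an IUPAC nucleotide code, a gap, or whitespace.
NON_NUCLEOTIDE = re.compile(r"[^ACGTURYSWKMBDHVN\s-]", re.IGNORECASE)


def guess_seqtype(handle, max_lines=1000):
    """Return "p" if a non-nucleotide character is found, otherwise "n".

    Only the first max_lines lines are inspected (hypothetical limit),
    so a long nucleotide file does not have to be read in full.
    """
    for lineno, line in enumerate(handle):
        if lineno >= max_lines:
            break
        line = line.strip()
        # Skip blank lines and FASTA header lines.
        if not line or line.startswith(">"):
            continue
        if NON_NUCLEOTIDE.search(line):
            return "p"
    return "n"


def check_seqtype(handle, selected, max_lines=1000):
    """Raise ValueError if the guessed type differs from the selected one."""
    detected = guess_seqtype(handle, max_lines)
    if detected != selected:
        names = {"p": "protein", "n": "nucleotide"}
        raise ValueError(
            f"Detected {names[detected]} input but {names[selected]} was selected."
        )


if __name__ == "__main__":
    fasta = io.StringIO(">seq1\nATGCGTAC\n")
    try:
        check_seqtype(fasta, selected="p")
    except ValueError as err:
        print(err)
```

As noted, a peptide made up only of letters from the nucleotide alphabet would be reported as "n" by this approach, so the check is a heuristic rather than a guarantee.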
Ping @igormakunin